DATA 621 HW 5

SECTION 1 — DATA EXPLORATION

The wine training dataset contains 12,795 observations and 16 variables: an index, multiple chemical attributes of commercially available wines, a small set of marketing indicators, and the response variable TARGET, representing the number of sample cases purchased by distributors.

A structural review (str()) confirms that:

TARGET is a count variable ranging from 0 to 8, with most wines having relatively low case purchases and a pronounced spike at zero.

Most predictor variables (acidity, pH, density, sulphates, sulfur dioxide measures, etc.) are continuous.

LabelAppeal and STARS are discrete rating variables and theoretically expected to strongly influence wine demand.

Summary statistics (mean, SD, median) show:

Many chemical attributes (e.g., FixedAcidity, Alcohol, Sulphates) follow skewed distributions, and several contain negative values, suggesting the measurements were centered or scaled before delivery.

TARGET is right-skewed with a pronounced spike at zero: most wines sell few cases, while a small number sell well (up to 8 cases).

LabelAppeal ranges from -2 to 2, reflecting poor-to-excellent customer appeal.

STARS ranges from 1 to 4, where higher ratings reflect stronger expert preference; it is missing for 3,359 wines, by far the most of any variable.

A missingness check shows:

Several predictors contain missing values (common in chemical testing datasets).

As with earlier assignments, the fact that a value is missing may itself be predictive.

These will be handled in Section 2 by:

Creating missing-value flags

Imputing numeric variables using medians or other appropriate methods

Distribution of TARGET

A histogram of TARGET highlights:

A heavy right tail

Many zeros and low-count values

This suggests:

Poisson regression may be insufficient because Poisson assumes mean ≈ variance.

Negative Binomial regression may be more appropriate if overdispersion is present, since it relaxes the equal mean-variance assumption.

We will confirm this in Section 3.

Boxplots and Correlations

Boxplots reveal:

Chemical variables differ widely in scale and spread.

LabelAppeal and STARS show visible associations with TARGET:

Wines with higher STARS ratings tend to sell more cases.

Wines with more appealing labels also show higher sales.

Correlation heatmaps reveal:

Many chemical properties have low to moderate correlations with each other.

Marketing attributes (LabelAppeal, STARS) show the strongest relationships with TARGET, examined via the boxplots above, since TARGET itself is excluded from the correlation matrix.

Our exploration confirms that:

TARGET is a non-negative integer outcome suitable for Poisson or Negative Binomial modeling.

The dataset contains meaningful marketing and chemical features.

The marginal variance of TARGET (roughly 3.7) modestly exceeds its mean (about 3.0), and the spike at zero hints at possible zero-inflation; whether a Negative Binomial model is actually needed is settled by the dispersion diagnostics in Section 3.

Missing data must be handled carefully and may even be predictive.

############################################################
# SECTION 1 — DATA EXPLORATION 
############################################################

library(tidyverse)
## Warning: package 'ggplot2' was built under R version 4.5.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   4.0.1     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)   # already attached by tidyverse above; re-attaching is harmless
library(corrplot)
## corrplot 0.95 loaded
# Load data (note: the name "eval" shadows base::eval(); harmless here,
# but a name like wine_eval would be safer)
train <- read.csv("wine-training-data.csv")
eval  <- read.csv("wine-evaluation-data.csv")

# Inspect structure & dimensions
dim(train)
## [1] 12795    16
str(train)
## 'data.frame':    12795 obs. of  16 variables:
##  $ INDEX             : int  1 2 4 5 6 7 8 11 12 13 ...
##  $ TARGET            : int  3 3 5 3 4 0 0 4 3 6 ...
##  $ FixedAcidity      : num  3.2 4.5 7.1 5.7 8 11.3 7.7 6.5 14.8 5.5 ...
##  $ VolatileAcidity   : num  1.16 0.16 2.64 0.385 0.33 0.32 0.29 -1.22 0.27 -0.22 ...
##  $ CitricAcid        : num  -0.98 -0.81 -0.88 0.04 -1.26 0.59 -0.4 0.34 1.05 0.39 ...
##  $ ResidualSugar     : num  54.2 26.1 14.8 18.8 9.4 ...
##  $ Chlorides         : num  -0.567 -0.425 0.037 -0.425 NA 0.556 0.06 0.04 -0.007 -0.277 ...
##  $ FreeSulfurDioxide : num  NA 15 214 22 -167 -37 287 523 -213 62 ...
##  $ TotalSulfurDioxide: num  268 -327 142 115 108 15 156 551 NA 180 ...
##  $ Density           : num  0.993 1.028 0.995 0.996 0.995 ...
##  $ pH                : num  3.33 3.38 3.12 2.24 3.12 3.2 3.49 3.2 4.93 3.09 ...
##  $ Sulphates         : num  -0.59 0.7 0.48 1.83 1.77 1.29 1.21 NA 0.26 0.75 ...
##  $ Alcohol           : num  9.9 NA 22 6.2 13.7 15.4 10.3 11.6 15 12.6 ...
##  $ LabelAppeal       : int  0 -1 -1 -1 0 0 0 1 0 0 ...
##  $ AcidIndex         : int  8 7 8 6 9 11 8 7 6 8 ...
##  $ STARS             : int  2 3 3 1 2 NA NA 3 NA 4 ...
# Summary statistics
summary(train)
##      INDEX           TARGET       FixedAcidity     VolatileAcidity  
##  Min.   :    1   Min.   :0.000   Min.   :-18.100   Min.   :-2.7900  
##  1st Qu.: 4038   1st Qu.:2.000   1st Qu.:  5.200   1st Qu.: 0.1300  
##  Median : 8110   Median :3.000   Median :  6.900   Median : 0.2800  
##  Mean   : 8070   Mean   :3.029   Mean   :  7.076   Mean   : 0.3241  
##  3rd Qu.:12106   3rd Qu.:4.000   3rd Qu.:  9.500   3rd Qu.: 0.6400  
##  Max.   :16129   Max.   :8.000   Max.   : 34.400   Max.   : 3.6800  
##                                                                     
##    CitricAcid      ResidualSugar        Chlorides        FreeSulfurDioxide
##  Min.   :-3.2400   Min.   :-127.800   Min.   :-1.17100   Min.   :-555.00  
##  1st Qu.: 0.0300   1st Qu.:  -2.000   1st Qu.:-0.03100   1st Qu.:   0.00  
##  Median : 0.3100   Median :   3.900   Median : 0.04600   Median :  30.00  
##  Mean   : 0.3084   Mean   :   5.419   Mean   : 0.05482   Mean   :  30.85  
##  3rd Qu.: 0.5800   3rd Qu.:  15.900   3rd Qu.: 0.15300   3rd Qu.:  70.00  
##  Max.   : 3.8600   Max.   : 141.150   Max.   : 1.35100   Max.   : 623.00  
##                    NA's   :616        NA's   :638        NA's   :647      
##  TotalSulfurDioxide    Density             pH          Sulphates      
##  Min.   :-823.0     Min.   :0.8881   Min.   :0.480   Min.   :-3.1300  
##  1st Qu.:  27.0     1st Qu.:0.9877   1st Qu.:2.960   1st Qu.: 0.2800  
##  Median : 123.0     Median :0.9945   Median :3.200   Median : 0.5000  
##  Mean   : 120.7     Mean   :0.9942   Mean   :3.208   Mean   : 0.5271  
##  3rd Qu.: 208.0     3rd Qu.:1.0005   3rd Qu.:3.470   3rd Qu.: 0.8600  
##  Max.   :1057.0     Max.   :1.0992   Max.   :6.130   Max.   : 4.2400  
##  NA's   :682                         NA's   :395     NA's   :1210     
##     Alcohol       LabelAppeal          AcidIndex          STARS      
##  Min.   :-4.70   Min.   :-2.000000   Min.   : 4.000   Min.   :1.000  
##  1st Qu.: 9.00   1st Qu.:-1.000000   1st Qu.: 7.000   1st Qu.:1.000  
##  Median :10.40   Median : 0.000000   Median : 8.000   Median :2.000  
##  Mean   :10.49   Mean   :-0.009066   Mean   : 7.773   Mean   :2.042  
##  3rd Qu.:12.40   3rd Qu.: 1.000000   3rd Qu.: 8.000   3rd Qu.:3.000  
##  Max.   :26.50   Max.   : 2.000000   Max.   :17.000   Max.   :4.000  
##  NA's   :653                                          NA's   :3359
# Check missingness
missing_counts <- colSums(is.na(train))
missing_counts
##              INDEX             TARGET       FixedAcidity    VolatileAcidity 
##                  0                  0                  0                  0 
##         CitricAcid      ResidualSugar          Chlorides  FreeSulfurDioxide 
##                  0                616                638                647 
## TotalSulfurDioxide            Density                 pH          Sulphates 
##                682                  0                395               1210 
##            Alcohol        LabelAppeal          AcidIndex              STARS 
##                653                  0                  0               3359
# TARGET distribution
table(train$TARGET)
## 
##    0    1    2    3    4    5    6    7    8 
## 2734  244 1091 2611 3177 2014  765  142   17
summary(train$TARGET)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   2.000   3.000   3.029   4.000   8.000
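# Quick numeric check of the Poisson mean-variance assumption (a rough sketch:
# var() gives the marginal variance, not the variance conditional on predictors)
mean(train$TARGET)        # 3.029
var(train$TARGET)         # roughly 3.7, modestly above the mean
mean(train$TARGET == 0)   # about 21% zeros, relevant for possible zero-inflation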
ggplot(train, aes(x = TARGET)) +
  geom_histogram(binwidth = 1, fill = "purple", color = "black") +
  labs(title = "Distribution of TARGET (Cases Purchased)",
       x = "Number of Cases", y = "Count")

# Identify numeric predictors
numeric_vars <- train %>% 
  select(-INDEX, -TARGET) %>% 
  select(where(is.numeric))

# Boxplots of numeric predictors
train_long <- train %>%
  pivot_longer(cols = colnames(numeric_vars),
               names_to = "Variable",
               values_to = "Value")

ggplot(train_long, aes(x = Variable, y = Value)) +
  geom_boxplot(fill = "steelblue") +
  theme(axis.text.x = element_text(angle = 90)) +
  labs(title = "Boxplots of Numeric Predictor Variables")
## Warning: Removed 8200 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

# Correlation matrix (numeric variables only)
corr_matrix <- cor(numeric_vars, use = "pairwise.complete.obs")

corrplot(corr_matrix,
         method = "color",
         type = "upper",
         tl.cex = 0.7,
         number.cex = 0.7,
         addCoef.col = "black")

# Key relationships
ggplot(train, aes(x = as.factor(STARS), y = TARGET)) +
  geom_boxplot(fill = "gold") +
  labs(title = "Cases Purchased vs STARS Rating",
       x = "STARS", y = "TARGET")

ggplot(train, aes(x = LabelAppeal, y = TARGET)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(title = "TARGET vs LabelAppeal",
       x = "Label Appeal Score", y = "TARGET")
## `geom_smooth()` using formula = 'y ~ x'

SECTION 2 — DATA PREPARATION

Several preprocessing and feature-engineering steps were performed to prepare the wine data for count regression modeling. The dataset contains a mix of chemical measurements, expert ratings, and marketing indicators, many of which exhibit skewed distributions and missing values that require careful handling.

First, all numeric predictor variables were identified, excluding the identification variable (INDEX) and the response variable (TARGET). Because the assignment notes that missingness itself may be predictive, I created a missing-value indicator flag for every numeric predictor (e.g., Alcohol_MISSING, Density_MISSING). These flags allow the models to learn whether the absence of a measurement carries information about wine sales.

Missing numeric values were then imputed using the median of the training data. Median imputation was chosen because many chemical variables exhibit skewness and outliers, and the median is robust to extreme values. The same imputation values derived from the training data were applied to the evaluation dataset to avoid data leakage.

Next, several chemical variables were transformed using logarithmic transformations, including Alcohol, Chlorides, CitricAcid, Density, FixedAcidity, FreeSulfurDioxide, ResidualSugar, Sulphates, TotalSulfurDioxide, and VolatileAcidity. These variables display right-skewed distributions, and log transformations help stabilize variance and improve model behavior, particularly for the benchmark linear models. One caveat: several of these variables contain negative values, so log(x + 1) is undefined for values below -1; the resulting NaNs trigger the warnings shown below, and the affected rows are removed before the engineered models are fit in Section 3.

To capture nonlinear and ordinal effects, key marketing variables were bucketized. The STARS variable was treated as an ordered categorical variable, while LabelAppeal was grouped with cut() into labeled categories from VeryLow to VeryHigh. Because cut() uses right-closed intervals by default, the integer scores actually map as -2/-1 -> VeryLow, 0 -> Low, 1 -> Neutral, and 2 -> High, so only four categories appear in the data (the cross-tabulation below verifies this). These groupings reflect how consumers and distributors are likely to respond to ratings and label design rather than to small numeric changes.
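
Once the Section 2 code has run, the mapping from scores to labels can be verified directly with a cross-tabulation:

table(train2$LabelAppeal, train2$LABEL_BIN, useNA = "ifany")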

Finally, an interaction term between STARS and LabelAppeal was created. This variable captures the combined marketing effect of expert ratings and label design, reflecting the intuition that wines with both strong expert reviews and appealing labels are especially likely to be sampled and sold.

The resulting prepared datasets (train2 and eval2) include imputed values, missingness indicators, transformed chemical properties, bucketized ratings, and interaction terms. These features provide a robust foundation for building Poisson, Negative Binomial, and linear regression models in the next section.

############################################################
# SECTION 2 — DATA PREPARATION
############################################################

# Start from original data
train2 <- train
eval2  <- eval

# ---------------------------------------------------------
# 1. IDENTIFY NUMERIC VARIABLES (EXCLUDING ID & TARGET)
# ---------------------------------------------------------
num_vars <- train2 %>%
  select(-INDEX, -TARGET) %>%
  select(where(is.numeric)) %>%
  colnames()

# ---------------------------------------------------------
# 2. CREATE MISSING VALUE FLAGS
# ---------------------------------------------------------
for (v in num_vars) {
  flag_name <- paste0(v, "_MISSING")
  train2[[flag_name]] <- ifelse(is.na(train2[[v]]), 1L, 0L)
  eval2[[flag_name]]  <- ifelse(is.na(eval2[[v]]), 1L, 0L)
}

# ---------------------------------------------------------
# 3. IMPUTE MISSING NUMERIC VALUES (MEDIAN IMPUTATION)
# ---------------------------------------------------------
for (v in num_vars) {
  med <- median(train2[[v]], na.rm = TRUE)
  train2[[v]][is.na(train2[[v]])] <- med
  eval2[[v]][is.na(eval2[[v]])]   <- med
}

# ---------------------------------------------------------
# 4. TRANSFORM SKEWED VARIABLES
# (Log transforms for linear models & stability)
# ---------------------------------------------------------
skewed_vars <- c(
  "Alcohol", "Chlorides", "CitricAcid", "Density",
  "FixedAcidity", "FreeSulfurDioxide",
  "ResidualSugar", "Sulphates",
  "TotalSulfurDioxide", "VolatileAcidity"
)

skewed_vars <- intersect(skewed_vars, names(train2))

for (v in skewed_vars) {
  train2[[paste0("LOG_", v)]] <- log(train2[[v]] + 1)
  eval2[[paste0("LOG_", v)]]  <- log(eval2[[v]] + 1)
}
## Warning in log(train2[[v]] + 1): NaNs produced
## Warning in log(eval2[[v]] + 1): NaNs produced
## Warning in log(train2[[v]] + 1): NaNs produced
## Warning in log(eval2[[v]] + 1): NaNs produced
## Warning in log(train2[[v]] + 1): NaNs produced
## Warning in log(eval2[[v]] + 1): NaNs produced
## Warning in log(train2[[v]] + 1): NaNs produced
## Warning in log(eval2[[v]] + 1): NaNs produced
## Warning in log(train2[[v]] + 1): NaNs produced
## Warning in log(eval2[[v]] + 1): NaNs produced
## Warning in log(train2[[v]] + 1): NaNs produced
## Warning in log(eval2[[v]] + 1): NaNs produced
## Warning in log(train2[[v]] + 1): NaNs produced
## Warning in log(eval2[[v]] + 1): NaNs produced
## Warning in log(train2[[v]] + 1): NaNs produced
## Warning in log(eval2[[v]] + 1): NaNs produced
## Warning in log(train2[[v]] + 1): NaNs produced
## Warning in log(eval2[[v]] + 1): NaNs produced
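# The NaN warnings above occur because several of these variables contain
# negative values, so log(x + 1) is undefined where x < -1. A signed log is
# one transform that stays defined everywhere; shown as a sketch only and
# not used by the models below:
signed_log <- function(x) sign(x) * log1p(abs(x))
# e.g., train2[[paste0("SLOG_", v)]] <- signed_log(train2[[v]])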
# ---------------------------------------------------------
# 5. BUCKETIZE KEY RATING VARIABLES
# ---------------------------------------------------------
train2 <- train2 %>%
  mutate(
    STARS_BIN = factor(STARS, levels = c(1, 2, 3, 4)),
    # Right-closed breaks: -2 and -1 -> VeryLow, 0 -> Low, 1 -> Neutral,
    # 2 -> High; the VeryHigh level is never populated by these integer scores
    LABEL_BIN = cut(
      LabelAppeal,
      breaks = c(-Inf, -1, 0, 1, 2, Inf),
      labels = c("VeryLow", "Low", "Neutral", "High", "VeryHigh")
    )
  )

eval2 <- eval2 %>%
  mutate(
    STARS_BIN = factor(STARS, levels = c(1, 2, 3, 4)),
    LABEL_BIN = cut(
      LabelAppeal,
      breaks = c(-Inf, -1, 0, 1, 2, Inf),
      labels = c("VeryLow", "Low", "Neutral", "High", "VeryHigh")
    )
  )

# ---------------------------------------------------------
# 6. INTERACTION TERMS (MARKETING EFFECTS)
# ---------------------------------------------------------
train2 <- train2 %>%
  mutate(
    STAR_LABEL_INTERACT = STARS * LabelAppeal
  )

eval2 <- eval2 %>%
  mutate(
    STAR_LABEL_INTERACT = STARS * LabelAppeal
  )

SECTION 3 — BUILD MODELS

In this section, I constructed multiple models to predict the number of wine sample cases purchased (TARGET) using three different modeling frameworks: Poisson regression, Negative Binomial regression, and multiple linear regression. Because the response variable is a non-negative count, Poisson and Negative Binomial models are the primary modeling approaches, while linear regression is included as a benchmark for comparison, as required by the assignment.

Poisson Regression Models

I first fit a baseline Poisson regression model (P1) using a core set of predictors consisting of chemical properties of the wine (e.g., Alcohol, Chlorides, Density, Sulphates, pH) and the two most interpretable marketing variables: STARS and LabelAppeal. In a Poisson model, coefficients represent changes in the log of the expected count; exponentiating a coefficient yields the multiplicative change in expected sales for a one-unit increase in the predictor.
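
For example, using the STARS estimate reported in the P1 output below, the implied multiplicative effect is straightforward to compute:

exp(coef(poisson_P1)["STARS"])   # exp(0.2198) ≈ 1.25, i.e. about 25% more expected cases per additional star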

Next, I fit an engineered Poisson regression model (P2) that expanded the feature set to include missing-value indicator flags, bucketized versions of key marketing variables (STARS_BIN, LABEL_BIN), log-transformed chemical variables, and an interaction term between STARS and LabelAppeal. This specification was designed to capture nonlinear effects, potential marketing synergies, and the possibility that missingness in chemical measurements is itself predictive of demand. Note that seven engineered coefficients are reported as NA ("not defined because of singularities"): the missingness flags for the six variables that have no missing values are constant zero, and STARS_BIN4 is exactly collinear with STARS and the other STARS_BIN dummies.

Before fitting the engineered models, I explicitly removed rows containing non-finite values (NA, NaN, or Inf) in any variable used by the engineered formula and dropped unused factor levels. This step ensures numerical stability, but it is not free: because the log transforms produce NaNs for negative inputs, the cleaned dataset retains only 4,356 of the original 12,795 rows, as the row-count check below makes explicit.
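
The cost of this cleaning is easy to quantify once the cleaning code below has run (the row counts here are derived from the model outputs in this section):

nrow(train2)        # 12795
nrow(train2_model)  # 4356 rows survive the non-finite filter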

Negative Binomial Regression Models

Because count data frequently exhibit overdispersion (variance exceeding the mean), I also fit Negative Binomial regression models, which relax the Poisson assumption by introducing an additional dispersion parameter.
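
Concretely, glm.nb in MASS parameterizes the variance as Var(Y) = mu + mu^2 / theta: a small theta indicates strong overdispersion, while theta -> infinity recovers the Poisson model. This is worth keeping in mind when reading the very large theta estimates reported below.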

A baseline Negative Binomial model (NB1) was fit using the same predictors as Poisson model P1, allowing for a direct comparison between Poisson and Negative Binomial specifications. I then fit an engineered Negative Binomial model (NB2) using the same expanded feature set and cleaned modeling dataset as Poisson P2. Comparing these models helps determine whether accounting for overdispersion meaningfully improves model fit and inference.

Across both Poisson and Negative Binomial models, the coefficients for STARS and LabelAppeal were consistently positive, aligning with the theoretical expectation that higher expert ratings and more appealing labels increase distributor demand. Differences in coefficient magnitude across model types were examined to assess sensitivity to distributional assumptions.
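
A compact way to run this comparison for the baseline fits (a sketch; both models use formula_base, so the coefficient names align) is to bind the coefficient vectors side by side:

round(cbind(Poisson = coef(poisson_P1), NegBin = coef(nb_NB1)), 4)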

Multiple Linear Regression Models

Although linear regression is not designed for count outcomes and can produce negative predictions, I fit two multiple linear regression models as benchmarks, consistent with the assignment requirements. Linear model L1 used the baseline predictor set, while linear model L2 used the engineered predictor set after removing non-finite values.

These models provide a useful comparison point for understanding how coefficient direction and magnitude differ between count-based and linear frameworks. For example, some chemical variables exhibited similar directional effects across all models, while others showed differences in magnitude or sign, reflecting the different assumptions underlying linear and count regression.

Overdispersion Assessment and Interpretation

To assess the suitability of the Poisson models, I computed dispersion statistics as the sum of squared Pearson residuals divided by the residual degrees of freedom. Both values came in slightly below 1 (0.96 for the baseline specification, 0.86 for the engineered one), so, conditional on the predictors, there is no meaningful overdispersion: the Poisson assumption appears adequate, and the Negative Binomial fits should be expected to track the Poisson fits closely. This is confirmed in Section 4.

Overall, Section 3 establishes a set of competing models with varying levels of complexity and interpretability. These models provide the foundation for formal comparison and selection in the next section, where I evaluate performance using information criteria and predictive accuracy and select a final count regression model for deployment.

############################################################
# SECTION 3 — BUILD MODELS
############################################################

library(tidyverse)
library(MASS)   # caution: MASS masks dplyr::select(); call dplyr::select() explicitly after this point
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
# ---------------------------------------------------------
# DEFINE PREDICTOR SETS
# ---------------------------------------------------------

base_vars <- c(
  "STARS", "LabelAppeal",
  "AcidIndex", "Alcohol", "Chlorides", "CitricAcid", "Density",
  "FixedAcidity", "FreeSulfurDioxide", "ResidualSugar",
  "Sulphates", "TotalSulfurDioxide", "VolatileAcidity", "pH"
)

base_vars <- base_vars[base_vars %in% names(train2)]

engineered_vars <- c(
  base_vars,
  paste0(base_vars, "_MISSING"),
  "STARS_BIN", "LABEL_BIN",
  "STAR_LABEL_INTERACT",
  paste0("LOG_", c(
    "Alcohol", "Chlorides", "CitricAcid", "Density",
    "FixedAcidity", "FreeSulfurDioxide", "ResidualSugar",
    "Sulphates", "TotalSulfurDioxide", "VolatileAcidity"
  ))
)

engineered_vars <- engineered_vars[engineered_vars %in% names(train2)]

formula_base <- as.formula(
  paste("TARGET ~", paste(base_vars, collapse = " + "))
)

formula_engineered <- as.formula(
  paste("TARGET ~", paste(engineered_vars, collapse = " + "))
)

# ---------------------------------------------------------
# CLEAN DATA FOR ENGINEERED MODELS 
# ---------------------------------------------------------

model_vars <- all.vars(formula_engineered)

train2_model <- train2[, model_vars]

# Replace non-finite numeric values with NA
for (v in names(train2_model)) {
  if (is.numeric(train2_model[[v]])) {
    train2_model[[v]][!is.finite(train2_model[[v]])] <- NA
  }
}

# Remove rows with NA in any model variable
train2_model <- na.omit(train2_model)

# Drop unused factor levels
train2_model <- droplevels(train2_model)

# ---------------------------------------------------------
# 3A. POISSON REGRESSION MODELS
# ---------------------------------------------------------

poisson_P1 <- glm(
  formula_base,
  data = train2,
  family = poisson(link = "log")
)

summary(poisson_P1)
## 
## Call:
## glm(formula = formula_base, family = poisson(link = "log"), data = train2)
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         2.017e+00  1.957e-01  10.305  < 2e-16 ***
## STARS               2.198e-01  6.468e-03  33.986  < 2e-16 ***
## LabelAppeal         1.965e-01  6.021e-03  32.634  < 2e-16 ***
## AcidIndex          -1.222e-01  4.463e-03 -27.381  < 2e-16 ***
## Alcohol             5.368e-03  1.410e-03   3.807 0.000141 ***
## Chlorides          -6.005e-02  1.645e-02  -3.650 0.000262 ***
## CitricAcid          1.331e-02  5.892e-03   2.258 0.023925 *  
## Density            -4.315e-01  1.921e-01  -2.247 0.024651 *  
## FixedAcidity       -4.515e-04  8.195e-04  -0.551 0.581697    
## FreeSulfurDioxide   1.424e-04  3.513e-05   4.054 5.04e-05 ***
## ResidualSugar       1.442e-04  1.545e-04   0.934 0.350499    
## Sulphates          -1.877e-02  5.739e-03  -3.271 0.001073 ** 
## TotalSulfurDioxide  1.065e-04  2.268e-05   4.695 2.67e-06 ***
## VolatileAcidity    -5.045e-02  6.492e-03  -7.771 7.82e-15 ***
## pH                 -2.389e-02  7.639e-03  -3.128 0.001762 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 22861  on 12794  degrees of freedom
## Residual deviance: 18419  on 12780  degrees of freedom
## AIC: 50391
## 
## Number of Fisher Scoring iterations: 5
poisson_P2 <- glm(
  formula_engineered,
  data = train2_model,
  family = poisson(link = "log")
)

summary(poisson_P2)
## 
## Call:
## glm(formula = formula_engineered, family = poisson(link = "log"), 
##     data = train2_model)
## 
## Coefficients: (7 not defined because of singularities)
##                              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                -1.7665455  9.9858748  -0.177 0.859583    
## STARS                       0.1833407  0.0137115  13.371  < 2e-16 ***
## LabelAppeal                 0.2280013  0.0694794   3.282 0.001032 ** 
## AcidIndex                  -0.0582679  0.0078539  -7.419 1.18e-13 ***
## Alcohol                     0.0096049  0.0060570   1.586 0.112797    
## Chlorides                  -0.1643307  0.0799578  -2.055 0.039858 *  
## CitricAcid                  0.0054621  0.0222505   0.245 0.806083    
## Density                    -7.3838116 26.2206865  -0.282 0.778248    
## FixedAcidity               -0.0060109  0.0033657  -1.786 0.074108 .  
## FreeSulfurDioxide          -0.0004539  0.0001870  -2.427 0.015215 *  
## ResidualSugar              -0.0018482  0.0008457  -2.185 0.028854 *  
## Sulphates                  -0.0034563  0.0209725  -0.165 0.869102    
## TotalSulfurDioxide         -0.0002670  0.0001046  -2.553 0.010682 *  
## VolatileAcidity            -0.0243170  0.0244615  -0.994 0.320179    
## pH                          0.0046211  0.0136426   0.339 0.734814    
## STARS_MISSING              -1.0756637  0.0318688 -33.753  < 2e-16 ***
## LabelAppeal_MISSING                NA         NA      NA       NA    
## AcidIndex_MISSING                  NA         NA      NA       NA    
## Alcohol_MISSING             0.0120966  0.0422350   0.286 0.774563    
## Chlorides_MISSING           0.0146352  0.0394191   0.371 0.710435    
## CitricAcid_MISSING                 NA         NA      NA       NA    
## Density_MISSING                    NA         NA      NA       NA    
## FixedAcidity_MISSING               NA         NA      NA       NA    
## FreeSulfurDioxide_MISSING   0.0143471  0.0355484   0.404 0.686511    
## ResidualSugar_MISSING       0.0338066  0.0349108   0.968 0.332859    
## Sulphates_MISSING          -0.0136384  0.0297321  -0.459 0.646443    
## TotalSulfurDioxide_MISSING  0.0096797  0.0366741   0.264 0.791827    
## VolatileAcidity_MISSING            NA         NA      NA       NA    
## pH_MISSING                 -0.0161408  0.0489418  -0.330 0.741555    
## STARS_BIN2                  0.1421490  0.0237738   5.979 2.24e-09 ***
## STARS_BIN3                  0.0787195  0.0282268   2.789 0.005290 ** 
## STARS_BIN4                         NA         NA      NA       NA    
## LABEL_BINLow               -0.0447566  0.0778991  -0.575 0.565598    
## LABEL_BINNeutral           -0.1681597  0.1454520  -1.156 0.247633    
## LABEL_BINHigh              -0.3321980  0.2174893  -1.527 0.126656    
## STAR_LABEL_INTERACT         0.0133249  0.0136891   0.973 0.330357    
## LOG_Alcohol                -0.0584851  0.0539977  -1.083 0.278762    
## LOG_Chlorides               0.1088846  0.0687263   1.584 0.113121    
## LOG_CitricAcid             -0.0094450  0.0225067  -0.420 0.674739    
## LOG_Density                14.1873813 52.2295159   0.272 0.785902    
## LOG_FixedAcidity            0.0439842  0.0237534   1.852 0.064069 .  
## LOG_FreeSulfurDioxide       0.0746857  0.0191797   3.894 9.86e-05 ***
## LOG_ResidualSugar           0.0325048  0.0169642   1.916 0.055355 .  
## LOG_Sulphates              -0.0169094  0.0222265  -0.761 0.446790    
## LOG_TotalSulfurDioxide      0.0780127  0.0213528   3.654 0.000259 ***
## LOG_VolatileAcidity        -0.0146600  0.0226184  -0.648 0.516891    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 7729.2  on 4355  degrees of freedom
## Residual deviance: 4520.9  on 4317  degrees of freedom
## AIC: 15492
## 
## Number of Fisher Scoring iterations: 6
# ---------------------------------------------------------
# 3B. NEGATIVE BINOMIAL REGRESSION MODELS
# ---------------------------------------------------------

nb_NB1 <- glm.nb(
  formula_base,
  data = train2
)
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
summary(nb_NB1)
## 
## Call:
## glm.nb(formula = formula_base, data = train2, init.theta = 39421.16808, 
##     link = log)
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         2.017e+00  1.957e-01  10.305  < 2e-16 ***
## STARS               2.198e-01  6.468e-03  33.984  < 2e-16 ***
## LabelAppeal         1.965e-01  6.021e-03  32.633  < 2e-16 ***
## AcidIndex          -1.222e-01  4.464e-03 -27.380  < 2e-16 ***
## Alcohol             5.368e-03  1.410e-03   3.806 0.000141 ***
## Chlorides          -6.005e-02  1.645e-02  -3.650 0.000262 ***
## CitricAcid          1.331e-02  5.892e-03   2.258 0.023931 *  
## Density            -4.315e-01  1.921e-01  -2.247 0.024655 *  
## FixedAcidity       -4.515e-04  8.196e-04  -0.551 0.581705    
## FreeSulfurDioxide   1.424e-04  3.513e-05   4.054 5.04e-05 ***
## ResidualSugar       1.443e-04  1.545e-04   0.934 0.350498    
## Sulphates          -1.877e-02  5.739e-03  -3.271 0.001073 ** 
## TotalSulfurDioxide  1.065e-04  2.269e-05   4.695 2.67e-06 ***
## VolatileAcidity    -5.045e-02  6.493e-03  -7.770 7.83e-15 ***
## pH                 -2.389e-02  7.639e-03  -3.128 0.001763 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(39421.17) family taken to be 1)
## 
##     Null deviance: 22860  on 12794  degrees of freedom
## Residual deviance: 18419  on 12780  degrees of freedom
## AIC: 50394
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  39421 
##           Std. Err.:  59706 
## Warning while fitting theta: iteration limit reached 
## 
##  2 x log-likelihood:  -50361.62
nb_NB2 <- glm.nb(
  formula_engineered,
  data = train2_model
)
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
## Warning in theta.ml(Y, mu, sum(w), w, limit = control$maxit, trace =
## control$trace > : iteration limit reached
summary(nb_NB2)
## 
## Call:
## glm.nb(formula = formula_engineered, data = train2_model, init.theta = 41983.57677, 
##     link = log)
## 
## Coefficients: (7 not defined because of singularities)
##                              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                -1.7666722  9.9863113  -0.177 0.859580    
## STARS                       0.1833415  0.0137122  13.371  < 2e-16 ***
## LabelAppeal                 0.2280004  0.0694810   3.281 0.001033 ** 
## AcidIndex                  -0.0582696  0.0078542  -7.419 1.18e-13 ***
## Alcohol                     0.0096051  0.0060573   1.586 0.112808    
## Chlorides                  -0.1643333  0.0799607  -2.055 0.039862 *  
## CitricAcid                  0.0054619  0.0222516   0.245 0.806098    
## Density                    -7.3841469 26.2218280  -0.282 0.778248    
## FixedAcidity               -0.0060111  0.0033658  -1.786 0.074114 .  
## FreeSulfurDioxide          -0.0004539  0.0001870  -2.427 0.015217 *  
## ResidualSugar              -0.0018483  0.0008457  -2.185 0.028854 *  
## Sulphates                  -0.0034563  0.0209735  -0.165 0.869108    
## TotalSulfurDioxide         -0.0002670  0.0001046  -2.553 0.010683 *  
## VolatileAcidity            -0.0243183  0.0244626  -0.994 0.320173    
## pH                          0.0046207  0.0136432   0.339 0.734847    
## STARS_MISSING              -1.0756630  0.0318695 -33.752  < 2e-16 ***
## LabelAppeal_MISSING                NA         NA      NA       NA    
## AcidIndex_MISSING                  NA         NA      NA       NA    
## Alcohol_MISSING             0.0120967  0.0422368   0.286 0.774571    
## Chlorides_MISSING           0.0146367  0.0394208   0.371 0.710420    
## CitricAcid_MISSING                 NA         NA      NA       NA    
## Density_MISSING                    NA         NA      NA       NA    
## FixedAcidity_MISSING               NA         NA      NA       NA    
## FreeSulfurDioxide_MISSING   0.0143467  0.0355499   0.404 0.686533    
## ResidualSugar_MISSING       0.0338058  0.0349124   0.968 0.332893    
## Sulphates_MISSING          -0.0136402  0.0297334  -0.459 0.646413    
## TotalSulfurDioxide_MISSING  0.0096810  0.0366757   0.264 0.791808    
## VolatileAcidity_MISSING            NA         NA      NA       NA    
## pH_MISSING                 -0.0161418  0.0489440  -0.330 0.741550    
## STARS_BIN2                  0.1421487  0.0237748   5.979 2.25e-09 ***
## STARS_BIN3                  0.0787192  0.0282284   2.789 0.005293 ** 
## STARS_BIN4                         NA         NA      NA       NA    
## LABEL_BINLow               -0.0447563  0.0779010  -0.575 0.565611    
## LABEL_BINNeutral           -0.1681607  0.1454554  -1.156 0.247641    
## LABEL_BINHigh              -0.3322002  0.2174944  -1.527 0.126662    
## STAR_LABEL_INTERACT         0.0133250  0.0136897   0.973 0.330373    
## LOG_Alcohol                -0.0584876  0.0540002  -1.083 0.278763    
## LOG_Chlorides               0.1088858  0.0687287   1.584 0.113129    
## LOG_CitricAcid             -0.0094448  0.0225078  -0.420 0.674761    
## LOG_Density                14.1880415 52.2317923   0.272 0.785902    
## LOG_FixedAcidity            0.0439854  0.0237543   1.852 0.064072 .  
## LOG_FreeSulfurDioxide       0.0746876  0.0191805   3.894 9.86e-05 ***
## LOG_ResidualSugar           0.0325064  0.0169649   1.916 0.055353 .  
## LOG_Sulphates              -0.0169106  0.0222275  -0.761 0.446779    
## LOG_TotalSulfurDioxide      0.0780155  0.0213535   3.654 0.000259 ***
## LOG_VolatileAcidity        -0.0146596  0.0226194  -0.648 0.516920    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(41983.58) family taken to be 1)
## 
##     Null deviance: 7728.8  on 4355  degrees of freedom
## Residual deviance: 4520.7  on 4317  degrees of freedom
## AIC: 15494
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  41984 
##           Std. Err.:  60721 
## Warning while fitting theta: iteration limit reached 
## 
##  2 x log-likelihood:  -15413.81
# ---------------------------------------------------------
# 3C. MULTIPLE LINEAR REGRESSION MODELS
# ---------------------------------------------------------

lm_L1 <- lm(
  formula_base,
  data = train2
)

summary(lm_L1)
## 
## Call:
## lm(formula = formula_base, data = train2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.2211 -0.7540  0.3598  1.1254  4.3550 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         5.355e+00  5.517e-01   9.707  < 2e-16 ***
## STARS               7.478e-01  1.946e-02  38.431  < 2e-16 ***
## LabelAppeal         5.945e-01  1.686e-02  35.250  < 2e-16 ***
## AcidIndex          -3.259e-01  1.117e-02 -29.169  < 2e-16 ***
## Alcohol             1.883e-02  3.972e-03   4.739 2.17e-06 ***
## Chlorides          -1.931e-01  4.638e-02  -4.164 3.15e-05 ***
## CitricAcid          3.976e-02  1.673e-02   2.377 0.017476 *  
## Density            -1.274e+00  5.427e-01  -2.347 0.018959 *  
## FixedAcidity       -1.168e-03  2.315e-03  -0.505 0.613911    
## FreeSulfurDioxide   4.286e-04  9.941e-05   4.312 1.63e-05 ***
## ResidualSugar       4.716e-04  4.371e-04   1.079 0.280670    
## Sulphates          -5.485e-02  1.623e-02  -3.380 0.000728 ***
## TotalSulfurDioxide  3.098e-04  6.387e-05   4.851 1.25e-06 ***
## VolatileAcidity    -1.549e-01  1.838e-02  -8.429  < 2e-16 ***
## pH                 -6.387e-02  2.154e-02  -2.965 0.003028 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.626 on 12780 degrees of freedom
## Multiple R-squared:  0.2879, Adjusted R-squared:  0.2871 
## F-statistic: 369.1 on 14 and 12780 DF,  p-value: < 2.2e-16
lm_L2 <- lm(
  formula_engineered,
  data = train2_model
)

summary(lm_L2)
## 
## Call:
## lm(formula = formula_engineered, data = train2_model)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.8016 -0.8473  0.0251  0.8394  5.6720 
## 
## Coefficients: (7 not defined because of singularities)
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                -8.692e+00  2.223e+01  -0.391  0.69586    
## STARS                       7.028e-01  3.586e-02  19.598  < 2e-16 ***
## LabelAppeal                 1.328e-01  1.172e-01   1.132  0.25749    
## AcidIndex                  -1.398e-01  1.542e-02  -9.064  < 2e-16 ***
## Alcohol                     3.382e-02  1.383e-02   2.444  0.01455 *  
## Chlorides                  -3.705e-01  1.505e-01  -2.461  0.01387 *  
## CitricAcid                  3.251e-02  5.090e-02   0.639  0.52304    
## Density                    -2.954e+01  5.826e+01  -0.507  0.61221    
## FixedAcidity               -1.805e-02  7.303e-03  -2.471  0.01351 *  
## FreeSulfurDioxide          -1.058e-03  3.998e-04  -2.646  0.00816 ** 
## ResidualSugar              -4.653e-03  1.832e-03  -2.540  0.01111 *  
## Sulphates                   5.064e-04  4.753e-02   0.011  0.99150    
## TotalSulfurDioxide         -6.355e-04  2.211e-04  -2.874  0.00407 ** 
## VolatileAcidity            -7.333e-02  5.467e-02  -1.341  0.17984    
## pH                          1.103e-02  3.015e-02   0.366  0.71438    
## STARS_MISSING              -2.343e+00  5.499e-02 -42.599  < 2e-16 ***
## LabelAppeal_MISSING                NA         NA      NA       NA    
## AcidIndex_MISSING                  NA         NA      NA       NA    
## Alcohol_MISSING             2.108e-02  9.364e-02   0.225  0.82189    
## Chlorides_MISSING           5.044e-02  8.842e-02   0.570  0.56842    
## CitricAcid_MISSING                 NA         NA      NA       NA    
## Density_MISSING                    NA         NA      NA       NA    
## FixedAcidity_MISSING               NA         NA      NA       NA    
## FreeSulfurDioxide_MISSING   2.130e-02  7.857e-02   0.271  0.78630    
## ResidualSugar_MISSING       9.376e-02  8.004e-02   1.171  0.24150    
## Sulphates_MISSING          -4.650e-02  6.559e-02  -0.709  0.47844    
## TotalSulfurDioxide_MISSING  1.786e-02  8.184e-02   0.218  0.82724    
## VolatileAcidity_MISSING            NA         NA      NA       NA    
## pH_MISSING                 -2.077e-02  1.064e-01  -0.195  0.84527    
## STARS_BIN2                  3.748e-01  5.860e-02   6.396 1.76e-10 ***
## STARS_BIN3                  2.157e-01  7.953e-02   2.712  0.00672 ** 
## STARS_BIN4                         NA         NA      NA       NA    
## LABEL_BINLow                9.393e-02  1.318e-01   0.713  0.47604    
## LABEL_BINNeutral            9.823e-02  2.398e-01   0.410  0.68206    
## LABEL_BINHigh               8.091e-03  3.616e-01   0.022  0.98215    
## STAR_LABEL_INTERACT         1.597e-01  3.247e-02   4.917 9.12e-07 ***
## LOG_Alcohol                -2.031e-01  1.229e-01  -1.652  0.09864 .  
## LOG_Chlorides               2.118e-01  1.226e-01   1.727  0.08432 .  
## LOG_CitricAcid             -3.511e-02  5.156e-02  -0.681  0.49593    
## LOG_Density                 5.724e+01  1.161e+02   0.493  0.62206    
## LOG_FixedAcidity            1.358e-01  5.108e-02   2.660  0.00785 ** 
## LOG_FreeSulfurDioxide       1.827e-01  3.964e-02   4.609 4.17e-06 ***
## LOG_ResidualSugar           8.040e-02  3.722e-02   2.160  0.03083 *  
## LOG_Sulphates              -6.021e-02  5.147e-02  -1.170  0.24210    
## LOG_TotalSulfurDioxide      1.881e-01  4.266e-02   4.410 1.06e-05 ***
## LOG_VolatileAcidity        -4.629e-02  5.225e-02  -0.886  0.37575    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.288 on 4317 degrees of freedom
## Multiple R-squared:  0.5556, Adjusted R-squared:  0.5517 
## F-statistic:   142 on 38 and 4317 DF,  p-value: < 2.2e-16
# ---------------------------------------------------------
# OVERDISPERSION DIAGNOSTICS (POISSON)
# ---------------------------------------------------------

dispersion_P1 <- sum(residuals(poisson_P1, type = "pearson")^2) /
                 poisson_P1$df.residual

dispersion_P2 <- sum(residuals(poisson_P2, type = "pearson")^2) /
                 poisson_P2$df.residual

dispersion_P1
## [1] 0.9623067
dispersion_P2
## [1] 0.8631152
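# Both ratios are slightly below 1, so there is no evidence of overdispersion
# conditional on the predictors. As a cross-check, a formal score test is
# available via the AER package (assumed installed; not run here):
# library(AER)
# dispersiontest(poisson_P1)   # H0: equidispersion vs H1: overdispersion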

SECTION 4 — MODEL SELECTION & EVALUATION

In this section, I compared the competing Poisson, Negative Binomial, and multiple linear regression models in order to select a final model for deployment. Because the response variable (TARGET) represents a non-negative count, a count regression model must be selected for final use, even if alternative models show competitive performance.

Model Comparison Criteria

To compare models, I relied primarily on the Akaike Information Criterion (AIC), which balances model fit and complexity. Lower AIC values indicate better overall performance while penalizing excessive model complexity. I also examined dispersion statistics for the Poisson models to assess whether the Poisson assumption of equal mean and variance was reasonable.
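
For reference, AIC = 2k - 2*log(L), where k is the number of estimated parameters and L the maximized likelihood; lower is better. Two caveats apply to the comparison table below: AIC is only comparable between models fit to the same rows, so the engineered models (fit to the cleaned 4,356-row subset) cannot be compared directly against the baseline models (fit to all 12,795 rows), and AICs from Gaussian linear models are not on the same footing as those from discrete count models.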

Poisson vs. Negative Binomial Models

Contrary to what the marginal distribution of TARGET suggested in Section 1, the Pearson dispersion statistics for the baseline and engineered Poisson models were 0.96 and 0.86, both slightly below 1. Once the predictors are accounted for, there is no evidence of overdispersion, and the Poisson assumption appears reasonable.

The Negative Binomial models, which relax the equal mean-variance assumption through an additional dispersion parameter, tell the same story: the estimated theta values are enormous (roughly 39,000 for NB1 and 42,000 for NB2, with even larger standard errors), so the Negative Binomial fits collapse to their Poisson counterparts, and their AICs are essentially identical to (in fact marginally above) the Poisson AICs. Among the count models fit to the same data, the engineered specifications, which include missing-value indicators, transformed chemical variables, bucketized marketing variables, and the STARS-by-LabelAppeal interaction, fit far better than the baseline specifications, though their much lower AIC values partly reflect the smaller cleaned dataset and are not directly comparable to the baseline AICs.

Linear Regression Comparison

The multiple linear regression models were included as benchmarks, as required by the assignment. The engineered linear model posts the lowest AIC in the table, but Gaussian AICs are not directly comparable to count-model AICs, and linear regression is not well suited to count data because it can produce negative predictions and ignores the discrete nature of the response. As a result, linear regression was not selected for deployment despite its interpretability.

Final Model Selection

Based on the AIC comparison among the count models, the dispersion diagnostics, and the requirement to deploy a count regression model, I selected the engineered Negative Binomial regression model (NB2) as the final model. Its fit is practically indistinguishable from the engineered Poisson model P2, but the Negative Binomial form remains valid should overdispersion appear in new data, making it the more conservative choice. This model provides the best available balance of predictive performance, interpretability, and statistical validity for the wine sales data.

Evaluation Dataset Predictions

Using the selected Negative Binomial model, I generated predictions for the evaluation dataset, applying the same cleaning used for training (rows with non-finite engineered features are dropped, so predictions are produced only for wines with complete, finite features). The predicted values represent the expected number of wine sample cases purchased for each wine, conditional on its chemical properties and marketing characteristics. These predictions satisfy the assignment requirement to produce case-count forecasts for the evaluation data and can be used by the manufacturer to inform production and marketing decisions.

############################################################
# SECTION 4 — MODEL SELECTION AND EVALUATION
############################################################

# ---------------------------------------------------------
# 4A. MODEL COMPARISON USING INFORMATION CRITERIA
# ---------------------------------------------------------

model_comparison <- data.frame(
  Model = c(
    "Poisson P1 (Baseline)",
    "Poisson P2 (Engineered)",
    "NegBin NB1 (Baseline)",
    "NegBin NB2 (Engineered)",
    "Linear L1 (Baseline)",
    "Linear L2 (Engineered)"
  ),
  AIC = c(
    AIC(poisson_P1),
    AIC(poisson_P2),
    AIC(nb_NB1),
    AIC(nb_NB2),
    AIC(lm_L1),
    AIC(lm_L2)
  )
)

model_comparison
##                     Model      AIC
## 1   Poisson P1 (Baseline) 50391.49
## 2 Poisson P2 (Engineered) 15491.67
## 3   NegBin NB1 (Baseline) 50393.62
## 4 NegBin NB2 (Engineered) 15493.81
## 5    Linear L1 (Baseline) 48774.96
## 6  Linear L2 (Engineered) 14609.00
# ---------------------------------------------------------
# 4B. DISPERSION CHECK (POISSON MODELS)
# ---------------------------------------------------------

dispersion_summary <- data.frame(
  Model = c("Poisson P1", "Poisson P2"),
  Dispersion = c(dispersion_P1, dispersion_P2)
)

dispersion_summary
##        Model Dispersion
## 1 Poisson P1  0.9623067
## 2 Poisson P2  0.8631152
# ---------------------------------------------------------
# 4C. FINAL MODEL SELECTION
# ---------------------------------------------------------
# The Negative Binomial engineered model is selected for deployment
final_count_model <- nb_NB2

# ---------------------------------------------------------
# 4D. PREDICTIONS ON EVALUATION DATA
# ---------------------------------------------------------

# Ensure evaluation data uses the same predictors as the training model.
# Keep INDEX for the prediction table and drop TARGET, which is unpopulated
# in the evaluation file (leaving it in would make na.omit() remove every row)
eval_model <- eval2[, c("INDEX", setdiff(all.vars(formula_engineered), "TARGET"))]

# Replace non-finite numeric values
for (v in names(eval_model)) {
  if (is.numeric(eval_model[[v]])) {
    eval_model[[v]][!is.finite(eval_model[[v]])] <- NA
  }
}

# Remove rows with missing values
eval_model <- na.omit(eval_model)
eval_model <- droplevels(eval_model)

# Predict expected number of cases
eval_predictions <- predict(
  final_count_model,
  newdata = eval_model,
  type = "response"
)

# Final prediction table (INDEX was carried through the cleaning above;
# this assumes the evaluation file's id column is named INDEX)
final_predictions <- data.frame(
  INDEX       = eval_model$INDEX,
  TARGET_PRED = eval_predictions
)

head(final_predictions)
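# The predictions can then be written out for submission; the file name
# below is illustrative only:
# write.csv(final_predictions, "wine_eval_predictions.csv", row.names = FALSE)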