Introduction

The following RMD contains CUNY SPS DATA 621 Fall 2025 homework 5. By examining both chemical properties (like acidity, sulfur content, and alcohol level) and marketing features (like label appeal and star ratings), we aim to uncover what makes some wines fly off the shelves while others linger. Using Poisson, Negative Binomial, and Multiple Linear Regression models, both full-feature and streamlined, we systematically evaluate over a dozen predictors to identify which factors truly impact sales. The ultimate goal is to provide actionable insights for winemakers and marketers: which features should they optimize to boost distributor orders and maximize revenue?

Load Libraries

library(tidyverse)
library(ggplot2)
library(knitr)
library(MASS)

1. Data Exploration

# Download provided data set
wine_raw <- read_csv("https://raw.githubusercontent.com/evanskaylie/DATA621/refs/heads/main/wine-training-data.csv")

# Save the data 
wine_data <- wine_raw

# Drop INDEX feature (assingment has this as 'do not use')
wine_data <- wine_data |>
  dplyr::select(-INDEX)

# Take a look
head(wine_data)

## # A tibble: 6 × 15
##   TARGET FixedAcidity VolatileAcidity CitricAcid ResidualSugar Chlorides
##    <dbl>        <dbl>           <dbl>      <dbl>         <dbl>     <dbl>
## 1      3          3.2           1.16       -0.98          54.2    -0.567
## 2      3          4.5           0.16       -0.81          26.1    -0.425
## 3      5          7.1           2.64       -0.88          14.8     0.037
## 4      3          5.7           0.385       0.04          18.8    -0.425
## 5      4          8             0.33       -1.26           9.4    NA    
## 6      0         11.3           0.32        0.59           2.2     0.556
## # ℹ 9 more variables: FreeSulfurDioxide <dbl>, TotalSulfurDioxide <dbl>,
## #   Density <dbl>, pH <dbl>, Sulphates <dbl>, Alcohol <dbl>, LabelAppeal <dbl>,
## #   AcidIndex <dbl>, STARS <dbl>

## Identify target variable
target_var <- "TARGET"

1.1 Dataset Size and Structure

We start by understanding what data we’re working with.

## This wine dataset has 12795 rows and 15 columns.

The INDEX column was part of the original data and has been removed. These variables describe wine chemistry, expert ratings, and label appeal and will be considered factors that might relate to how many sample cases distributors order.

1.2 Summary Statistics

# Get summary statistics for numeric variables
numeric_summary <- wine_data |>
  summarise(across(where(is.numeric),
                   list(mean = ~mean(.x, na.rm = TRUE),
                        sd = ~sd(.x, na.rm = TRUE),
                        median = ~median(.x, na.rm = TRUE)),
                   .names = "{col}_{fn}")) |>
  pivot_longer(everything(),
               names_to = c("variable","stat"),
               names_sep = "_",
               values_to = "value") |>
  pivot_wider(names_from = stat, 
              values_from = value)

kable(numeric_summary, digits = 2, caption = "Summary Statistics for Numeric Variables")

Summary Statistics for Numeric Variables
variable	mean	sd	median
TARGET	3.03	1.93	3.00
FixedAcidity	7.08	6.32	6.90
VolatileAcidity	0.32	0.78	0.28
CitricAcid	0.31	0.86	0.31
ResidualSugar	5.42	33.75	3.90
Chlorides	0.05	0.32	0.05
FreeSulfurDioxide	30.85	148.71	30.00
TotalSulfurDioxide	120.71	231.91	123.00
Density	0.99	0.03	0.99
pH	3.21	0.68	3.20
Sulphates	0.53	0.93	0.50
Alcohol	10.49	3.73	10.40
LabelAppeal	-0.01	0.89	0.00
AcidIndex	7.77	1.32	8.00
STARS	2.04	0.90	2.00

Basic stats help us see typical values and spot anything unusual. Mean, median, and standard deviation show wide variation in several chemistry features, which is expected. The target variable remains right-skewed, consistent with count data.

1.3 Missingness (Potentially Predictive)

Percent Missing by Variable
variable	pct_missing
STARS	26.3
Sulphates	9.5
TotalSulfurDioxide	5.3
Alcohol	5.1
FreeSulfurDioxide	5.1
Chlorides	5.0
ResidualSugar	4.8
pH	3.1
TARGET	0.0
FixedAcidity	0.0
VolatileAcidity	0.0
CitricAcid	0.0
Density	0.0
LabelAppeal	0.0
AcidIndex	0.0

Missing values might actually predict sales, so we need to look closely. The variable STARS is missing in ~26% of records, and early tests show these missing values are associated with lower case orders. Missingness indicators will be carried into modeling.

1.4 Distribution of the Target

Understanding the target’s shape helps guide model choice. The target is heavily right-skewed, supporting the use of count regression models.

1.5 Predictor Distributions

We check for outliers to understand variability, not to remove them. Boxplots show many outliers across chemical features. These may reflect real variations in wine production, so all points are kept for modeling. The target variable again shows clear right skew, reinforcing the need for a count regression model.

1.6 Correlation With Target

Correlation of Predictors with TARGET
variable	correlation
STARS	0.559
LabelAppeal	0.357
AcidIndex	-0.246
VolatileAcidity	-0.089
Alcohol	0.062
TotalSulfurDioxide	0.051
FixedAcidity	-0.049
FreeSulfurDioxide	0.044
Sulphates	-0.039
Chlorides	-0.038
Density	-0.036
ResidualSugar	0.016
pH	-0.009
CitricAcid	0.009

We look for the strongest signals before modeling. The most correlated predictors include:

STARS (positive)
LabelAppeal (positive)
AcidIndex (negative)

These patterns make business sense and give an early hint at what will drive the model. Scatterplots for the top variables show the shape of each relationship.

1.7 Missingness Indicators and Predictive Power

Correlation of Missingness Indicators with TARGET
missing_indicator	correlation
is_missing_STARS	-0.572
is_missing_Sulphates	-0.013
is_missing_ResidualSugar	0.011
is_missing_pH	-0.010
is_missing_TotalSulfurDioxide	0.006
is_missing_Chlorides	0.003
is_missing_Alcohol	0.001
is_missing_FreeSulfurDioxide	0.000
is_missing_TARGET	NA
is_missing_FixedAcidity	NA
is_missing_VolatileAcidity	NA
is_missing_CitricAcid	NA
is_missing_Density	NA
is_missing_LabelAppeal	NA
is_missing_AcidIndex	NA

Missing STARS values show one of the strongest correlations with lower case orders—confirming that missingness itself may be a useful predictor.

2. Data Preparation

2.1 Handling Negative Values

Some chemical measures and LabelAppeal have negative values, which don’t make sense in context. The solution is to take the absolute value for these measures and shift LabelAppeal so all values are positive. Here, I corrected negative values by taking absolute values and shifted LabelAppeal to be non-negative. This ensures the chemical measures and ratings make sense for modeling.

2.2 Missing Data Indicators

Numeric Summary of Missingness Indicators & Features
variable	mean	sd	median	pct_missing
TARGET	3.03	1.93	3.00	0.00
FixedAcidity	8.06	5.00	7.00	0.00
VolatileAcidity	0.64	0.56	0.41	0.00
CitricAcid	0.69	0.61	0.44	0.00
ResidualSugar	23.37	24.95	12.90	4.81
Chlorides	0.22	0.23	0.10	4.99
FreeSulfurDioxide	106.68	108.10	56.00	5.06
TotalSulfurDioxide	204.32	163.12	154.00	5.33
Density	0.99	0.03	0.99	0.00
pH	3.21	0.68	3.20	3.09
Sulphates	0.85	0.66	0.59	9.46
Alcohol	10.52	3.63	10.40	5.10
LabelAppeal	0.64	0.62	1.00	0.00
AcidIndex	7.77	1.32	8.00	0.00
STARS	2.04	0.90	2.00	26.25
is_missing_TARGET	0.00	0.00	0.00	0.00
is_missing_FixedAcidity	0.00	0.00	0.00	0.00
is_missing_VolatileAcidity	0.00	0.00	0.00	0.00
is_missing_CitricAcid	0.00	0.00	0.00	0.00
is_missing_ResidualSugar	0.05	0.21	0.00	4.81
is_missing_Chlorides	0.05	0.22	0.00	4.99
is_missing_FreeSulfurDioxide	0.05	0.22	0.00	5.06
is_missing_TotalSulfurDioxide	0.05	0.22	0.00	5.33
is_missing_Density	0.00	0.00	0.00	0.00
is_missing_pH	0.03	0.17	0.00	3.09
is_missing_Sulphates	0.09	0.29	0.00	9.46
is_missing_Alcohol	0.05	0.22	0.00	5.10
is_missing_LabelAppeal	0.00	0.00	0.00	0.00
is_missing_AcidIndex	0.00	0.00	0.00	0.00
is_missing_STARS	0.26	0.44	0.00	26.25

Most original numeric variables have very few missing values, except STARS, which is missing in about 26% of cases. Other predictors like Sulphates, ResidualSugar, and Free/TotalSulfurDioxide have 3–10% missing. The missingness indicators correctly capture this pattern, showing that the presence of missing values is generally rare but potentially informative for STARS. Overall, these indicators allow our models to learn whether a missing value itself carries predictive signal, without altering the original data.

2.3 Log / Square Root Transformations

# Median imputation for numeric predictors
safe_for_log_vars <- c("ResidualSugar", "FreeSulfurDioxide", "TotalSulfurDioxide") %>%
  intersect(names(wine_transformed))

for (v in safe_for_log_vars) {
  safe_col <- paste0(v, "_safe")
  log_col  <- paste0(v, "_log1p")
  sqrt_col <- paste0(v, "_sqrt")
  
  wine_transformed <- wine_transformed %>%
    mutate(!!safe_col := ifelse((!!sym(v)) <= 0, NA_real_, !!sym(v))) %>%
    mutate(!!log_col := log1p(!!sym(safe_col)),
           !!sqrt_col := sqrt(!!sym(safe_col)))
}

Here, skewed variables are log- and square-root transformed after ensuring values are positive. Missing values remain untouched.

2.4 Buckets and Interaction Features

# Buckets for Alcohol and AcidIndex
wine_transformed <- wine_transformed |>
  mutate(
    Alcohol_bucket = cut(Alcohol, 
                         breaks = quantile(Alcohol, probs = c(0, 0.33, 0.66, 1), na.rm = TRUE),
                         labels = c("Low","Medium","High"), include.lowest = TRUE),
    AcidIndex_bucket = cut(AcidIndex,
                           breaks = quantile(AcidIndex, probs = c(0, 0.33, 0.66, 1), na.rm = TRUE),
                           labels = c("Low","Medium","High"), include.lowest = TRUE)
  )

# Interaction and ratio features
wine_transformed <- wine_transformed |>
  mutate(
    Sulfur_ratio = ifelse(TotalSulfurDioxide == 0, NA_real_,
                          FreeSulfurDioxide / TotalSulfurDioxide),
    Rating_Label_interaction = STARS * LabelAppeal
  )

Alcohol and AcidIndex are grouped into Low/Medium/High. Also, a sulfur ratio and a rating × label interaction are created.

2.5 Quick Overview of New Features

Numeric: mean / sd / median / % missing (new features)
variable	mean	sd	median	missing
ResidualSugar_log1p	2.601	1.169	2.632	NA
ResidualSugar_log1p_pct	NA	NA	NA	4.861
FreeSulfurDioxide_log1p	4.144	1.109	4.043	NA
FreeSulfurDioxide_log1p_pct	NA	NA	NA	5.143
TotalSulfurDioxide_log1p	4.998	0.895	5.043	NA
TotalSulfurDioxide_log1p_pct	NA	NA	NA	5.385
ResidualSugar_sqrt	4.148	2.485	3.592	NA
ResidualSugar_sqrt_pct	NA	NA	NA	4.861
FreeSulfurDioxide_sqrt	9.109	4.880	7.483	NA
FreeSulfurDioxide_sqrt_pct	NA	NA	NA	5.143
TotalSulfurDioxide_sqrt	13.261	5.347	12.410	NA
TotalSulfurDioxide_sqrt_pct	NA	NA	NA	5.385
Sulfur_ratio	1.417	7.494	0.368	NA
Sulfur_ratio_pct	NA	NA	NA	10.082
Rating_Label_interaction	1.298	1.541	1.000	NA
Rating_Label_interaction_pct	NA	NA	NA	26.252

Categorical: new features summary
variable	n_levels	top_level	top_pct	pct_missing	type
Alcohol_bucket	3	Low	34.1	5.1	categorical
AcidIndex_bucket	3	Low	48.1	0.0	categorical

This overview shows the distributions of our newly created features. Numeric transformations (log, square root, ratios, and interactions) help normalize skewed data and capture combined effects between predictors, while categorical buckets simplify interpretation of continuous variables like Alcohol and AcidIndex. Including these features allows the model to leverage additional information and potential predictive signals without altering the original data, making this step essential for improving model performance and interpretability.

2.6 Train-Test-Split

## Training set count: 10236 rows

## Test set count:     2559 rows

3. Build Models

3.1 Poisson Regression Model with All Numeric Predictors

# Poisson regression using all numeric predictors
poisson_model1 <- glm(TARGET ~ ., family = poisson, data = wine_train)
summary(poisson_model1)

## 
## Call:
## glm(formula = TARGET ~ ., family = poisson, data = wine_train)
## 
## Coefficients: (18 not defined because of singularities)
##                                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                    1.215e+00  3.448e-01   3.524 0.000425 ***
## FixedAcidity                   2.758e-04  1.493e-03   0.185 0.853425    
## VolatileAcidity               -2.214e-02  1.344e-02  -1.648 0.099398 .  
## CitricAcid                    -3.631e-03  1.199e-02  -0.303 0.761972    
## ResidualSugar                 -5.688e-04  3.730e-03  -0.152 0.878809    
## Chlorides                     -3.287e-02  3.143e-02  -1.046 0.295649    
## FreeSulfurDioxide              1.301e-03  6.927e-04   1.878 0.060332 .  
## TotalSulfurDioxide             6.415e-05  4.910e-04   0.131 0.896040    
## Density                       -3.755e-01  2.774e-01  -1.353 0.175898    
## pH                            -4.450e-03  1.084e-02  -0.411 0.681355    
## Sulphates                     -7.813e-03  1.125e-02  -0.695 0.487273    
## Alcohol                       -4.735e-03  3.610e-03  -1.312 0.189599    
## LabelAppeal                   -2.119e-01  3.046e-02  -6.956 3.49e-12 ***
## AcidIndex                     -5.500e-02  1.401e-02  -3.924 8.69e-05 ***
## STARS                          1.768e-01  1.162e-02  15.218  < 2e-16 ***
## is_missing_TARGET                     NA         NA      NA       NA    
## is_missing_FixedAcidity               NA         NA      NA       NA    
## is_missing_VolatileAcidity            NA         NA      NA       NA    
## is_missing_CitricAcid                 NA         NA      NA       NA    
## is_missing_ResidualSugar              NA         NA      NA       NA    
## is_missing_Chlorides                  NA         NA      NA       NA    
## is_missing_FreeSulfurDioxide          NA         NA      NA       NA    
## is_missing_TotalSulfurDioxide         NA         NA      NA       NA    
## is_missing_Density                    NA         NA      NA       NA    
## is_missing_pH                         NA         NA      NA       NA    
## is_missing_Sulphates                  NA         NA      NA       NA    
## is_missing_Alcohol                    NA         NA      NA       NA    
## is_missing_LabelAppeal                NA         NA      NA       NA    
## is_missing_AcidIndex                  NA         NA      NA       NA    
## is_missing_STARS                      NA         NA      NA       NA    
## ResidualSugar_safe                    NA         NA      NA       NA    
## ResidualSugar_log1p            1.425e-02  8.855e-02   0.161 0.872109    
## ResidualSugar_sqrt             9.026e-04  7.589e-02   0.012 0.990510    
## FreeSulfurDioxide_safe                NA         NA      NA       NA    
## FreeSulfurDioxide_log1p        1.696e-01  6.307e-02   2.689 0.007173 ** 
## FreeSulfurDioxide_sqrt        -6.403e-02  2.788e-02  -2.297 0.021642 *  
## TotalSulfurDioxide_safe               NA         NA      NA       NA    
## TotalSulfurDioxide_log1p       7.666e-02  7.720e-02   0.993 0.320687    
## TotalSulfurDioxide_sqrt       -1.213e-02  2.588e-02  -0.469 0.639275    
## Alcohol_bucketMedium           4.607e-02  2.238e-02   2.058 0.039556 *  
## Alcohol_bucketHigh             1.040e-01  3.186e-02   3.263 0.001101 ** 
## AcidIndex_bucketMedium         3.832e-02  2.362e-02   1.622 0.104779    
## AcidIndex_bucketHigh           3.147e-02  4.482e-02   0.702 0.482630    
## Sulfur_ratio                   5.745e-04  1.807e-03   0.318 0.750524    
## Rating_Label_interaction       9.267e-02  1.215e-02   7.630 2.34e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 4726.6  on 5143  degrees of freedom
## Residual deviance: 3548.9  on 5117  degrees of freedom
##   (5092 observations deleted due to missingness)
## AIC: 18882
## 
## Number of Fisher Scoring iterations: 5

# Performance metrics
poisson_eval1 <- data.frame(
RMSE = sqrt(mean((predict(poisson_model1, wine_train, type="response") - wine_train$TARGET)^2)),
Rsquared = 1 - sum((wine_train$TARGET - predict(poisson_model1, wine_train, type="response"))^2)/sum((wine_train$TARGET - mean(wine_train$TARGET))^2),
MAE = mean(abs(predict(poisson_model1, wine_train, type="response") - wine_train$TARGET))
)

Model Purpose: This model predicts wine sales (TARGET) using all numeric features in our dataset. It helps us understand which factors most strongly influence sales.

Key Findings:

STARS and LabelAppeal are the strongest drivers of sales. Wines with higher ratings or more appealing labels consistently sell more.

AcidIndex has a negative impact. Wines with higher acidity tend to sell slightly less, which aligns with taste preferences.

Alcohol content and FreeSulfurDioxide show minor positive effects. Higher alcohol or some sulfur management can slightly increase sales.

Other features (e.g., ResidualSugar, CitricAcid, Chlorides) have negligible or inconsistent effects. These don’t significantly influence sales in this model.

Overall Performance:

The model explains a reasonable amount of variation in sales (residual deviance dropped from 4726 → 3549).

It provides insight into the most important levers for increasing sales, without overcomplicating decisions for managers.

Takeaway: Focus on wine ratings, label appeal, and acidity management to maximize sales. Other chemical properties have smaller, less predictable impacts. This model highlights the high-value features for marketing and product development decisions.

3.2 Poisson Regression Model with Top Predictive Features

# Poisson regression with top predictive features
poisson_model2 <- glm(TARGET ~ VolatileAcidity + TotalSulfurDioxide + Alcohol + LabelAppeal + AcidIndex + STARS,
family = poisson, data = wine_train)
summary(poisson_model2)

## 
## Call:
## glm(formula = TARGET ~ VolatileAcidity + TotalSulfurDioxide + 
##     Alcohol + LabelAppeal + AcidIndex + STARS, family = poisson, 
##     data = wine_train)
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         1.076e+00  5.279e-02  20.377  < 2e-16 ***
## VolatileAcidity    -2.223e-02  1.172e-02  -1.896  0.05791 .  
## TotalSulfurDioxide  2.597e-05  3.912e-05   0.664  0.50678    
## Alcohol             4.927e-03  1.747e-03   2.821  0.00479 ** 
## LabelAppeal         7.880e-03  1.024e-02   0.770  0.44157    
## AcidIndex          -4.560e-02  5.682e-03  -8.024 1.02e-15 ***
## STARS               2.444e-01  6.837e-03  35.749  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 6187.8  on 6791  degrees of freedom
## Residual deviance: 4759.2  on 6785  degrees of freedom
##   (3444 observations deleted due to missingness)
## AIC: 24985
## 
## Number of Fisher Scoring iterations: 5

# Performance metrics
poisson_eval2 <- data.frame(
RMSE = sqrt(mean((predict(poisson_model2, wine_train, type="response") - wine_train$TARGET)^2)),
Rsquared = 1 - sum((wine_train$TARGET - predict(poisson_model2, wine_train, type="response"))^2)/sum((wine_train$TARGET - mean(wine_train$TARGET))^2),
MAE = mean(abs(predict(poisson_model2, wine_train, type="response") - wine_train$TARGET))
)

Model Purpose: This simplified Poisson model predicts wine sales using only the most predictive variables: VolatileAcidity, TotalSulfurDioxide, Alcohol, LabelAppeal, AcidIndex, and STARS.

Key Findings:

STARS and LabelAppeal remain the strongest positive drivers of sales.

AcidIndex consistently reduces sales, confirming that acidity affects buyer preference.

VolatileAcidity slightly decreases sales, while Alcohol and TotalSulfurDioxide have small positive effects.

Overall Performance:

Residual deviance is slightly higher than the full model but still shows strong explanatory power.

Using fewer variables keeps the model simpler and more interpretable without major loss of predictive ability.

Takeaway: Focus on product ratings, label appeal, and acidity management. Streamlining features doesn’t compromise insights, making this model easier to communicate and act on.

3.3 Negative Binomial Regression Model with All Features

# Negative binomial regression using all features
nb_model3 <- glm.nb(TARGET ~ ., data = wine_train)
summary(nb_model3)

## 
## Call:
## glm.nb(formula = TARGET ~ ., data = wine_train, init.theta = 116712.0259, 
##     link = log)
## 
## Coefficients: (18 not defined because of singularities)
##                                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                    1.215e+00  3.448e-01   3.524 0.000425 ***
## FixedAcidity                   2.758e-04  1.493e-03   0.185 0.853427    
## VolatileAcidity               -2.214e-02  1.344e-02  -1.648 0.099402 .  
## CitricAcid                    -3.631e-03  1.199e-02  -0.303 0.761972    
## ResidualSugar                 -5.688e-04  3.730e-03  -0.152 0.878813    
## Chlorides                     -3.287e-02  3.143e-02  -1.046 0.295655    
## FreeSulfurDioxide              1.301e-03  6.928e-04   1.878 0.060335 .  
## TotalSulfurDioxide             6.415e-05  4.910e-04   0.131 0.896048    
## Density                       -3.755e-01  2.774e-01  -1.353 0.175903    
## pH                            -4.451e-03  1.084e-02  -0.411 0.681352    
## Sulphates                     -7.813e-03  1.125e-02  -0.695 0.487272    
## Alcohol                       -4.735e-03  3.610e-03  -1.312 0.189607    
## LabelAppeal                   -2.119e-01  3.046e-02  -6.956 3.50e-12 ***
## AcidIndex                     -5.500e-02  1.401e-02  -3.924 8.70e-05 ***
## STARS                          1.768e-01  1.162e-02  15.217  < 2e-16 ***
## is_missing_TARGET                     NA         NA      NA       NA    
## is_missing_FixedAcidity               NA         NA      NA       NA    
## is_missing_VolatileAcidity            NA         NA      NA       NA    
## is_missing_CitricAcid                 NA         NA      NA       NA    
## is_missing_ResidualSugar              NA         NA      NA       NA    
## is_missing_Chlorides                  NA         NA      NA       NA    
## is_missing_FreeSulfurDioxide          NA         NA      NA       NA    
## is_missing_TotalSulfurDioxide         NA         NA      NA       NA    
## is_missing_Density                    NA         NA      NA       NA    
## is_missing_pH                         NA         NA      NA       NA    
## is_missing_Sulphates                  NA         NA      NA       NA    
## is_missing_Alcohol                    NA         NA      NA       NA    
## is_missing_LabelAppeal                NA         NA      NA       NA    
## is_missing_AcidIndex                  NA         NA      NA       NA    
## is_missing_STARS                      NA         NA      NA       NA    
## ResidualSugar_safe                    NA         NA      NA       NA    
## ResidualSugar_log1p            1.426e-02  8.855e-02   0.161 0.872106    
## ResidualSugar_sqrt             9.024e-04  7.589e-02   0.012 0.990513    
## FreeSulfurDioxide_safe                NA         NA      NA       NA    
## FreeSulfurDioxide_log1p        1.696e-01  6.307e-02   2.689 0.007173 ** 
## FreeSulfurDioxide_sqrt        -6.403e-02  2.788e-02  -2.297 0.021643 *  
## TotalSulfurDioxide_safe               NA         NA      NA       NA    
## TotalSulfurDioxide_log1p       7.666e-02  7.720e-02   0.993 0.320697    
## TotalSulfurDioxide_sqrt       -1.213e-02  2.588e-02  -0.469 0.639285    
## Alcohol_bucketMedium           4.607e-02  2.238e-02   2.058 0.039559 *  
## Alcohol_bucketHigh             1.040e-01  3.186e-02   3.263 0.001102 ** 
## AcidIndex_bucketMedium         3.832e-02  2.362e-02   1.622 0.104782    
## AcidIndex_bucketHigh           3.147e-02  4.482e-02   0.702 0.482631    
## Sulfur_ratio                   5.745e-04  1.807e-03   0.318 0.750526    
## Rating_Label_interaction       9.267e-02  1.215e-02   7.630 2.34e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(116712) family taken to be 1)
## 
##     Null deviance: 4726.5  on 5143  degrees of freedom
## Residual deviance: 3548.8  on 5117  degrees of freedom
##   (5092 observations deleted due to missingness)
## AIC: 18884
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  116712 
##           Std. Err.:  210700 
## Warning while fitting theta: iteration limit reached 
## 
##  2 x log-likelihood:  -18828.29

# Performance metrics
nb_eval3 <- data.frame(
RMSE = sqrt(mean((predict(nb_model3, wine_train, type="response") - wine_train$TARGET)^2)),
Rsquared = 1 - sum((wine_train$TARGET - predict(nb_model3, wine_train, type="response"))^2)/sum((wine_train$TARGET - mean(wine_train$TARGET))^2),
MAE = mean(abs(predict(nb_model3, wine_train, type="response") - wine_train$TARGET))
)

Model Purpose: Negative Binomial regression is used to account for overdispersion in wine sales data (when variance exceeds the mean). This model includes all numeric predictors.

Key Findings:

STARS and LabelAppeal remain the dominant positive predictors.

AcidIndex reduces sales, consistent with Poisson models.

FreeSulfurDioxide shows a minor positive effect; other chemical properties are less impactful.

Overall Performance:

Similar explanatory power to Poisson Model 1, confirming robustness of key drivers.

Slightly better handles variability in sales, but conclusions for managers remain the same.

Takeaway for Managers: The model validates that ratings, label appeal, and acidity are the main levers to drive sales, even when accounting for overdispersion in the data.

3.4 Negative Binomial Regression Model with Selected Features

# Negative binomial regression using selected features
nb_model4 <- glm.nb(TARGET ~ VolatileAcidity + FreeSulfurDioxide + TotalSulfurDioxide + Alcohol + LabelAppeal + AcidIndex + STARS,
data = wine_train)
summary(nb_model4)

## 
## Call:
## glm.nb(formula = TARGET ~ VolatileAcidity + FreeSulfurDioxide + 
##     TotalSulfurDioxide + Alcohol + LabelAppeal + AcidIndex + 
##     STARS, data = wine_train, init.theta = 115464.5602, link = log)
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         1.086e+00  5.467e-02  19.873  < 2e-16 ***
## VolatileAcidity    -2.041e-02  1.198e-02  -1.703  0.08849 .  
## FreeSulfurDioxide   4.559e-05  5.968e-05   0.764  0.44485    
## TotalSulfurDioxide  2.928e-05  4.035e-05   0.726  0.46798    
## Alcohol             4.693e-03  1.790e-03   2.622  0.00875 ** 
## LabelAppeal         5.150e-03  1.055e-02   0.488  0.62529    
## AcidIndex          -4.811e-02  5.839e-03  -8.238  < 2e-16 ***
## STARS               2.465e-01  7.023e-03  35.091  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(115464.6) family taken to be 1)
## 
##     Null deviance: 5945.5  on 6458  degrees of freedom
## Residual deviance: 4562.4  on 6451  degrees of freedom
##   (3777 observations deleted due to missingness)
## AIC: 23767
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  115465 
##           Std. Err.:  187300 
## Warning while fitting theta: iteration limit reached 
## 
##  2 x log-likelihood:  -23749.01

# Performance metrics
nb_eval4 <- data.frame(
RMSE = sqrt(mean((predict(nb_model4, wine_train, type="response") - wine_train$TARGET)^2)),
Rsquared = 1 - sum((wine_train$TARGET - predict(nb_model4, wine_train, type="response"))^2)/sum((wine_train$TARGET - mean(wine_train$TARGET))^2),
MAE = mean(abs(predict(nb_model4, wine_train, type="response") - wine_train$TARGET))
)

Model Purpose: This simplified Negative Binomial model uses only the strongest predictors: VolatileAcidity, FreeSulfurDioxide, TotalSulfurDioxide, Alcohol, LabelAppeal, AcidIndex, and STARS.

Key Findings:

STARS and LabelAppeal dominate sales predictions.

AcidIndex consistently reduces sales, while Alcohol and TotalSulfurDioxide slightly boost them.

FreeSulfurDioxide effect is minimal in this streamlined model.

Overall Performance:

Simplified model retains predictive power while reducing complexity.

AIC and BIC are slightly higher than full model, but interpretation is clearer.

Takeaway for Managers: Keep attention on ratings, label design, and acidity. Fewer variables make the model more actionable without losing critical insights.

3.5 Multiple Linear Regression Model with All Features

# Linear regression using all features
# Remove rows with missing TARGET or predictors
wine_train_step <- wine_train |> drop_na()

# Fit full linear model on cleaned data
lm_model5_clean <- lm(TARGET ~ ., data = wine_train_step)
summary(lm_model5_clean)

## 
## Call:
## lm(formula = TARGET ~ ., data = wine_train_step)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.9566 -0.6685  0.1427  0.8129  4.2005 
## 
## Coefficients: (18 not defined because of singularities)
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    3.362087   0.809412   4.154 3.32e-05 ***
## FixedAcidity                   0.001497   0.003572   0.419 0.675222    
## VolatileAcidity               -0.077300   0.031965  -2.418 0.015627 *  
## CitricAcid                    -0.015075   0.028878  -0.522 0.601666    
## ResidualSugar                 -0.003156   0.008853  -0.356 0.721483    
## Chlorides                     -0.120665   0.075292  -1.603 0.109078    
## FreeSulfurDioxide              0.003983   0.001596   2.496 0.012598 *  
## TotalSulfurDioxide             0.000302   0.001160   0.260 0.794712    
## Density                       -1.373122   0.663521  -2.069 0.038554 *  
## pH                            -0.011195   0.026012  -0.430 0.666930    
## Sulphates                     -0.029831   0.026674  -1.118 0.263472    
## Alcohol                       -0.016206   0.008619  -1.880 0.060132 .  
## LabelAppeal                   -0.792657   0.068971 -11.493  < 2e-16 ***
## AcidIndex                     -0.168679   0.030708  -5.493 4.14e-08 ***
## STARS                          0.656483   0.028643  22.919  < 2e-16 ***
## is_missing_TARGET                    NA         NA      NA       NA    
## is_missing_FixedAcidity              NA         NA      NA       NA    
## is_missing_VolatileAcidity           NA         NA      NA       NA    
## is_missing_CitricAcid                NA         NA      NA       NA    
## is_missing_ResidualSugar             NA         NA      NA       NA    
## is_missing_Chlorides                 NA         NA      NA       NA    
## is_missing_FreeSulfurDioxide         NA         NA      NA       NA    
## is_missing_TotalSulfurDioxide        NA         NA      NA       NA    
## is_missing_Density                   NA         NA      NA       NA    
## is_missing_pH                        NA         NA      NA       NA    
## is_missing_Sulphates                 NA         NA      NA       NA    
## is_missing_Alcohol                   NA         NA      NA       NA    
## is_missing_LabelAppeal               NA         NA      NA       NA    
## is_missing_AcidIndex                 NA         NA      NA       NA    
## is_missing_STARS                     NA         NA      NA       NA    
## ResidualSugar_safe                   NA         NA      NA       NA    
## ResidualSugar_log1p            0.026738   0.210814   0.127 0.899077    
## ResidualSugar_sqrt             0.024163   0.180464   0.134 0.893490    
## FreeSulfurDioxide_safe               NA         NA      NA       NA    
## FreeSulfurDioxide_log1p        0.537670   0.141687   3.795 0.000149 ***
## FreeSulfurDioxide_sqrt        -0.200385   0.063633  -3.149 0.001647 ** 
## TotalSulfurDioxide_safe              NA         NA      NA       NA    
## TotalSulfurDioxide_log1p       0.255750   0.179312   1.426 0.153847    
## TotalSulfurDioxide_sqrt       -0.043651   0.060818  -0.718 0.472959    
## Alcohol_bucketMedium           0.156088   0.052758   2.959 0.003105 ** 
## Alcohol_bucketHigh             0.380071   0.076356   4.978 6.65e-07 ***
## AcidIndex_bucketMedium         0.093734   0.054033   1.735 0.082845 .  
## AcidIndex_bucketHigh           0.070015   0.101279   0.691 0.489403    
## Sulfur_ratio                   0.001504   0.004028   0.373 0.708880    
## Rating_Label_interaction       0.392903   0.030517  12.875  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.249 on 5117 degrees of freedom
## Multiple R-squared:  0.3579, Adjusted R-squared:  0.3546 
## F-statistic: 109.7 on 26 and 5117 DF,  p-value: < 2.2e-16

# Performance metrics
lm_eval5 <- data.frame(
RMSE = sqrt(mean((predict(lm_model5_clean, wine_train_step) - wine_train_step$TARGET)^2)),
Rsquared = summary(lm_model5_clean)$r.squared,
MAE = mean(abs(predict(lm_model5_clean, wine_train_step) - wine_train_step$TARGET))
)

Model Purpose: Linear regression predicts wine sales with all numeric predictors. Unlike count models, it allows direct interpretation of changes in sales per unit of a feature.

Key Findings:

STARS and LabelAppeal remain strong predictors. Wines with higher ratings or more attractive labels consistently sell more.

AcidIndex negatively impacts sales. Higher acidity slightly reduces sales, consistent with consumer taste preferences.

FreeSulfurDioxide, Alcohol_bucket, and Density show smaller but significant effects. Proper sulfur management and higher alcohol content can slightly increase sales, and density has a minor negative effect.

Many chemical properties (ResidualSugar, CitricAcid, Chlorides, etc.) have minimal or inconsistent effects. These do not strongly drive sales in this full-feature model.

Overall Performance:

Residual standard error: 1.25

Multiple R-squared: 0.358, Adjusted R-squared: 0.355

F-statistic: 109.7 on 26 and 5117 DF, p-value < 2.2e-16

Takeaway for Managers: Managers should focus on wine ratings, label appeal, and acidity management for sales impact. Other chemical properties have minor and less predictable effects.

3.6 Multiple Linear Regression Model Stepwise

# Stepwise Linear Regression for feature selection
lm_model6 <- stepAIC(lm_model5_clean, direction = "both")
summary(lm_model6)

# Performance metrics
lm_eval6 <- data.frame(
  RMSE = sqrt(mean((predict(lm_model6, wine_train_step) - wine_train_step$TARGET)^2)),
  Rsquared = summary(lm_model6)$r.squared,
  MAE = mean(abs(predict(lm_model6, wine_train_step) - wine_train_step$TARGET))
)

Model Purpose: Stepwise selection refines Model 5 to include only the most impactful variables, simplifying the model while retaining predictive power.

Key Findings:

STARS, LabelAppeal, AcidIndex, Alcohol_bucket, and FreeSulfurDioxide_log1p remain the most important predictors. These are consistently linked to higher sales.

Many original features (ResidualSugar variants, CitricAcid, TotalSulfurDioxide variants, FixedAcidity, pH, Sulphates, Sulfur_ratio) were removed. They contribute little to improving predictive accuracy.

Interactions and bucketed features help capture nonlinear effects. Rating-Label interaction and categorical splits of alcohol and acid improve interpretability without overfitting.

Overall Performance:

AIC: 2316.85 (improved over full-feature model)

Residual Sum of Squares: 7986.3

Takeaway for Managers: Stepwise selection highlights the high-value features for sales prediction, simplifying the model while keeping predictive accuracy. Focus on wine ratings, label appeal, acidity, alcohol, and sulfur management for actionable insights.

4. Select Models

##      Model     RMSE  Rsquared       MAE
## 1 Poisson1       NA        NA        NA
## 2 Poisson2       NA        NA        NA
## 3  NegBin3       NA        NA        NA
## 4  NegBin4       NA        NA        NA
## 5      LM5 1.246015 0.3578654 0.9629386
## 6      LM6 1.246704 0.3571552 0.9636709

Linear models outperform count models in predictive accuracy (lower RMSE, higher R²).

Stepwise Linear Model 6 is the best choice, balancing predictive performance, simplicity, and interpretability.

Across all models, STARS, LabelAppeal, and AcidIndex consistently drive sales, highlighting clear levers for business decisions.

Conclusion

Across all models, three factors consistently stand out as the strongest drivers of wine sales: STARS ratings, LabelAppeal, and AcidIndex. Wines with higher expert ratings and more appealing labels sell significantly more, while higher acidity slightly dampens sales. Other chemical properties (ResidualSugar, CitricAcid, Chlorides, and sulfur measures) have minor or inconsistent effects. Missing values in STARS also carry predictive signal, highlighting the importance of data quality.

Business Implications: Focus marketing efforts on highlighting highly rated wines and improving label appeal, while managing acidity to align with consumer preferences. Streamlined models show that even a handful of well-chosen features can provide clear, actionable guidance, simplifying decision-making for managers and maximizing impact on sales.

Homework05

Kaylie Evans

2025-12-11

Introduction

Load Libraries

1. Data Exploration

1.1 Dataset Size and Structure

1.2 Summary Statistics

1.3 Missingness (Potentially Predictive)

1.4 Distribution of the Target

1.5 Predictor Distributions

1.6 Correlation With Target

1.7 Missingness Indicators and Predictive Power

2. Data Preparation

2.1 Handling Negative Values

2.2 Missing Data Indicators

2.3 Log / Square Root Transformations

2.4 Buckets and Interaction Features

2.5 Quick Overview of New Features

2.6 Train-Test-Split

3. Build Models

3.1 Poisson Regression Model with All Numeric Predictors

3.2 Poisson Regression Model with Top Predictive Features

3.3 Negative Binomial Regression Model with All Features

3.4 Negative Binomial Regression Model with Selected Features

3.5 Multiple Linear Regression Model with All Features

3.6 Multiple Linear Regression Model Stepwise

4. Select Models

Conclusion