Data-621 - Homework 5 ( Wine Sales Prediction Analysis )

Author

Bikash Bhowmik, Rupendra Shrestha, Anthony Josue Roman, Jerald Melukkaran, Haoming Chen

Published

May 16, 2026

Project Overview

Data-driven decision-making is essential for achieving effective business outcomes. In this analysis, we begin by exploring the dataset to identify potential issues such as outliers, missing values, encoding inconsistencies, and multicollinearity. After addressing these challenges through appropriate data preprocessing techniques, we develop and evaluate several predictive models for estimating sales.

The dataset is divided into training and evaluation subsets, allowing us to train models on one portion and validate their performance on unseen data. Multiple modeling approaches are compared, and the final model is selected based on its ability to balance predictive accuracy with model simplicity.

Code
library(summarytools)
library(MASS)
library(rpart.plot)
library(ggplot2)
library(ggfortify)
library(gridExtra)
library(forecast)
library(fpp2)
library(fma)
library(kableExtra)
library(e1071)
library(mlbench)
library(ggcorrplot)
library(DataExplorer)
library(timeDate)
library(caret)
library(GGally)
library(corrplot)
library(RColorBrewer)
library(tibble)
library(tidyr)
library(tidyverse)
library(dplyr)
library(reshape2)
library(mixtools)
library(tidymodels)
library(ggpmisc)
library(regclass)
library(skimr)
library(corrgram)
library(mice)
#' Print a side-by-side Histogram and QQPlot of Residuals
#'
#' @param model A model
#' @examples
#' residPlot(myModel)
#' @return null
#' @export
residPlot <- function(model) {
  # Make sure a model was passed
  if (is.null(model)) {
    return
  }
  
  layout(matrix(c(1,1,2,3), 2, 2, byrow = TRUE))
  plot(residuals(model))
  hist(model[["residuals"]], freq = FALSE, breaks = "fd", main = "Residual Histogram",
       xlab = "Residuals",col="lightgreen")
  lines(density(model[["residuals"]], kernel = "ep"),col="blue", lwd=3)
  curve(dnorm(x,mean=mean(model[["residuals"]]), sd=sd(model[["residuals"]])), col="red", lwd=3, lty="dotted", add=T)
  qqnorm(model[["residuals"]], main = "Residual Q-Q plot")
  qqline(model[["residuals"]],col="red", lwd=3, lty="dotted")
  par(mfrow = c(1, 1))
}
#' Print a Variable Importance Plot for the provided model
#'
#' @param model The model
#' @param chart_title The Title to show on the plot
#' @examples
#' variableImportancePlot(myLinearModel, 'My Title)
#' @return null
#' @export
variableImportancePlot <- function(model=NULL, chart_title='Variable Importance Plot') {
  # Make sure a model was passed
  if (is.null(model)) {
    return
  }
  
  # use caret and gglot to print a variable importance plot
  varImp(model) %>% as.data.frame() %>% 
    ggplot(aes(x = reorder(rownames(.), desc(Overall)), y = Overall)) +
    geom_col(aes(fill = Overall)) +
    theme(panel.background = element_blank(),
          panel.grid = element_blank(),
          axis.text.x = element_text(angle = 90)) +
    scale_fill_gradient() +
    labs(title = chart_title,
         x = "Parameter",
         y = "Relative Importance")
}
#' Print a Facet Chart of histograms
#'
#' @param df Dataset
#' @param box Facet size (rows)
#' @examples
#' histbox(my_df, 3)
#' @return null
#' @export
histbox <- function(df, box) {
    par(mfrow = box)
    ndf <- dimnames(df)[[2]]
    
    for (i in seq_along(ndf)) {
            data <- na.omit(unlist(df[, i]))
            hist(data, breaks = "fd", main = paste("Histogram of", ndf[i]),
                 xlab = ndf[i], freq = FALSE)
            lines(density(data, kernel = "ep"), col = 'red')
    }
    
    par(mfrow = c(1, 1))
}
#' Extract key performance results from a model
#'
#' @param model A linear model of interest
#' @examples
#' model_performance_extraction(my_model)
#' @return data.frame
#' @export
model_performance_extraction <- function(model=NULL) {
  # Make sure a model was passed
  if (is.null(model)) {
    return
  }
  
  data.frame("RSE" = model$sigma,
             "Adj R2" = model$adj.r.squared,
             "F-Statistic" = model$fstatistic[1])
}

Exploratory Data Analysis

Code
train_df <- read.csv('https://raw.githubusercontent.com/BIKASHBHOWMIK15/Data-621/main/wine-training-data.csv', header=T) %>% as.tibble()
evaluate_df <- read.csv('https://raw.githubusercontent.com/BIKASHBHOWMIK15/Data-621/main/wine-evaluation-data.csv', header=T) %>% as.tibble()
train_df$INDEX <- NULL # Remove index column
evaluate_df$IN <- NULL
#evaluate_df$TARGET <- NULL
#glimpse(train_df)
head(train_df)
# A tibble: 6 × 15
  TARGET FixedAcidity VolatileAcidity CitricAcid ResidualSugar Chlorides
   <int>        <dbl>           <dbl>      <dbl>         <dbl>     <dbl>
1      3          3.2           1.16       -0.98          54.2    -0.567
2      3          4.5           0.16       -0.81          26.1    -0.425
3      5          7.1           2.64       -0.88          14.8     0.037
4      3          5.7           0.385       0.04          18.8    -0.425
5      4          8             0.33       -1.26           9.4    NA    
6      0         11.3           0.32        0.59           2.2     0.556
# ℹ 9 more variables: FreeSulfurDioxide <dbl>, TotalSulfurDioxide <dbl>,
#   Density <dbl>, pH <dbl>, Sulphates <dbl>, Alcohol <dbl>, LabelAppeal <int>,
#   AcidIndex <int>, STARS <int>
Code
#skim(train_df)

head(evaluate_df)
# A tibble: 6 × 15
  TARGET FixedAcidity VolatileAcidity CitricAcid ResidualSugar Chlorides
  <lgl>         <dbl>           <dbl>      <dbl>         <dbl>     <dbl>
1 NA              5.4          -0.86        0.27         -10.7     0.092
2 NA             12.4           0.385      -0.76         -19.7     1.17 
3 NA              7.2           1.75        0.17         -33       0.065
4 NA              6.2           0.1         1.8            1      -0.179
5 NA             11.4           0.21        0.28           1.2     0.038
6 NA             17.6           0.04       -1.15           1.4     0.535
# ℹ 9 more variables: FreeSulfurDioxide <dbl>, TotalSulfurDioxide <dbl>,
#   Density <dbl>, pH <dbl>, Sulphates <dbl>, Alcohol <dbl>, LabelAppeal <int>,
#   AcidIndex <int>, STARS <int>
Code
#str(evaluate_df)

The training dataset consists of 12,795 observations and 15 variables, including 14 predictors and one response variable (TARGET). The INDEX column serves only as an identifier and is removed prior to analysis. Most predictors represent chemical properties of wine, while LabelAppeal and STARS reflect subjective quality and presentation ratings.

The response variable, TARGET, represents the number of wine cases purchased and takes integer values between 0 and 8. A large proportion of observations contain zero values, indicating many instances with no purchases.

Code
library(vtable)


# st(train_df)
summary(train_df)
     TARGET       FixedAcidity     VolatileAcidity     CitricAcid     
 Min.   :0.000   Min.   :-18.100   Min.   :-2.7900   Min.   :-3.2400  
 1st Qu.:2.000   1st Qu.:  5.200   1st Qu.: 0.1300   1st Qu.: 0.0300  
 Median :3.000   Median :  6.900   Median : 0.2800   Median : 0.3100  
 Mean   :3.029   Mean   :  7.076   Mean   : 0.3241   Mean   : 0.3084  
 3rd Qu.:4.000   3rd Qu.:  9.500   3rd Qu.: 0.6400   3rd Qu.: 0.5800  
 Max.   :8.000   Max.   : 34.400   Max.   : 3.6800   Max.   : 3.8600  
                                                                      
 ResidualSugar        Chlorides        FreeSulfurDioxide TotalSulfurDioxide
 Min.   :-127.800   Min.   :-1.17100   Min.   :-555.00   Min.   :-823.0    
 1st Qu.:  -2.000   1st Qu.:-0.03100   1st Qu.:   0.00   1st Qu.:  27.0    
 Median :   3.900   Median : 0.04600   Median :  30.00   Median : 123.0    
 Mean   :   5.419   Mean   : 0.05482   Mean   :  30.85   Mean   : 120.7    
 3rd Qu.:  15.900   3rd Qu.: 0.15300   3rd Qu.:  70.00   3rd Qu.: 208.0    
 Max.   : 141.150   Max.   : 1.35100   Max.   : 623.00   Max.   :1057.0    
 NA's   :616        NA's   :638        NA's   :647       NA's   :682       
    Density             pH          Sulphates          Alcohol     
 Min.   :0.8881   Min.   :0.480   Min.   :-3.1300   Min.   :-4.70  
 1st Qu.:0.9877   1st Qu.:2.960   1st Qu.: 0.2800   1st Qu.: 9.00  
 Median :0.9945   Median :3.200   Median : 0.5000   Median :10.40  
 Mean   :0.9942   Mean   :3.208   Mean   : 0.5271   Mean   :10.49  
 3rd Qu.:1.0005   3rd Qu.:3.470   3rd Qu.: 0.8600   3rd Qu.:12.40  
 Max.   :1.0992   Max.   :6.130   Max.   : 4.2400   Max.   :26.50  
                  NA's   :395     NA's   :1210      NA's   :653    
  LabelAppeal          AcidIndex          STARS      
 Min.   :-2.000000   Min.   : 4.000   Min.   :1.000  
 1st Qu.:-1.000000   1st Qu.: 7.000   1st Qu.:1.000  
 Median : 0.000000   Median : 8.000   Median :2.000  
 Mean   :-0.009066   Mean   : 7.773   Mean   :2.042  
 3rd Qu.: 1.000000   3rd Qu.: 8.000   3rd Qu.:3.000  
 Max.   : 2.000000   Max.   :17.000   Max.   :4.000  
                                      NA's   :3359   

Feature Analysis

Among the predictor variables, several contain missing values, particularly STARS, Sulphates, and sulfur-related measurements. The dataset includes both continuous and discrete variables, with LabelAppeal, AcidIndex, and STARS treated as categorical features.

Some chemical variables contain negative values, which are not physically meaningful. These values are likely the result of prior transformations such as normalization or scaling. While this inconsistency is noted, the values are retained (with adjustments applied later) due to limited information about the data generation process.

Target Variable Analysis

Code
library(ggplot2)

ggplot(train_df, aes(x = factor(TARGET), fill = factor(TARGET))) +
  
  geom_bar(
    color = "white",
    linewidth = 0.4,
    width = 0.75
  ) +
  
  geom_text(
    stat = "count",
    aes(label = after_stat(count)),
    vjust = -0.4,
    fontface = "bold",
    size = 4
  ) +
  
  scale_fill_brewer(palette = "RdPu") +
  
  labs(
    title = "Distribution of Wine Sales (TARGET)",
    subtitle = "Most observations contain low purchase counts",
    x = "Wine Cases Purchased",
    y = "Frequency",
    fill = "TARGET"
  ) +
  
  theme_minimal(base_size = 14) +
  
  theme(
    plot.title = element_text(face = "bold", size = 18),
    plot.subtitle = element_text(size = 12),
    legend.position = "none",
    panel.grid.minor = element_blank(),
    plot.title.position = "plot"
  )

The distribution of the response variable indicates that most observations correspond to low sales volumes, with a high frequency of zero values. For non-zero observations, the distribution appears approximately symmetric, suggesting moderate variability in purchase behavior.

Distribution Analysis

Most continuous variables exhibit approximately normal distributions, although some degree of skewness is present. Variables such as AcidIndex and STARS show noticeable right skewness, indicating that higher values occur less frequently.

Boxplot analysis reveals the presence of outliers across several variables, particularly in TotalSulfurDioxide, FreeSulfurDioxide, and ResidualSugar. These variables also display wider ranges compared to others, suggesting greater variability.

Code
library(tidyverse)

# Reshape data for plotting
train_long <- train_df %>%
  pivot_longer(
    cols = everything(),
    names_to = "Variable",
    values_to = "Value"
  )

# Create professional histogram panel
ggplot(train_long, aes(x = Value)) +

  geom_histogram(
    bins = 30,
    fill = "#3498DB",
    color = "white",
    alpha = 0.85
  ) +

  facet_wrap(
    ~Variable,
    scales = "free",
    ncol = 3
  ) +

  labs(
    title = "Distribution of Variables in Training Dataset",
    subtitle = "Histogram visualization for each feature",
    x = "Value",
    y = "Frequency"
  ) +

  theme_minimal(base_size = 14) +

  theme(
    plot.title = element_text(
      face = "bold",
      size = 18,
      hjust = 0.5
    ),

    plot.subtitle = element_text(
      size = 12,
      hjust = 0.5,
      color = "gray40"
    ),

    strip.text = element_text(
      face = "bold",
      size = 11
    ),

    panel.grid.minor = element_blank(),

    panel.grid.major = element_line(
      color = "gray90"
    )
  )

We observe that continuous variables exhibit a somewhat normal steep distribution. However, variables such as AcidIndex and STARS display right skewness.

Code
library(tidyverse)
library(reshape2)

# Reshape data
melted_df <- melt(train_df)

# Professional boxplot visualization
ggplot(
  melted_df,
  aes(
    x = reorder(variable, value, FUN = median, na.rm = TRUE),
    y = value
  )
) +

  # Boxplots
  geom_boxplot(
    fill = "#5DADE2",
    color = "#1B4F72",
    alpha = 0.8,
    outlier.color = "red",
    outlier.alpha = 0.5
  ) +

  # Mean points
  stat_summary(
    fun = mean,
    geom = "point",
    shape = 18,
    size = 3,
    color = "darkgreen"
  ) +

  # Median points
  stat_summary(
    fun = median,
    geom = "point",
    shape = 16,
    size = 2.5,
    color = "darkred"
  ) +

  # Flip for readability
  coord_flip() +

  # Labels
  labs(
    title = "Distribution of Variables Using Boxplots",
    subtitle = "Green diamonds represent means; red circles represent medians",
    x = "Variables",
    y = "Values"
  ) +

  # Professional theme
  theme_minimal(base_size = 14) +

  theme(
    plot.title = element_text(
      face = "bold",
      size = 18,
      hjust = 0.5
    ),

    plot.subtitle = element_text(
      size = 12,
      hjust = 0.5,
      color = "gray40"
    ),

    axis.text.y = element_text(
      face = "bold",
      size = 10
    ),

    panel.grid.minor = element_blank(),

    panel.grid.major.y = element_blank()
  )

In the box plot, Several variables exhibit noticeable outliers, particularly sulfur-related measurements and residual sugar, although the majority of predictors remain reasonably concentrated.

Categorical Feature Relationships

Exploration of categorical predictors shows meaningful relationships with the response variable. Higher values of STARS are generally associated with increased sales, indicating that perceived quality influences purchasing behavior. Similarly, LabelAppeal appears positively related to sales, suggesting that product presentation plays a role in consumer decisions.

The AcidIndex variable shows variation in sales across its levels, indicating that certain acidity levels may be more favorable in the market.

Code
#Create a bar chart for Discrete Variables 
long <- melt(train_df, id.vars= colnames(train_df)[1:12])%>% 
  mutate(target = as.factor(TARGET))

ggplot(data = long, aes(x = value)) + 
  geom_bar(aes(fill = target)) + 
  facet_wrap( ~ variable, scales = "free")

The bar charts compare the three discrete categorical variables to the TARGET variable. AcidIndex shows the larger quantity of wine were sold with the index number 7 and 8. LabelAppeal shows us generic label does yield higher number of wine samples per order. Higher STARS ratings appear associated with increased wine sales, indicating that customer-perceived quality influences purchasing behavior. For each of these predictors, there appears to be a significant relationship between the ordered levels and the number of wine cases sold.

Correlation and Multicollinearity Analysis

Correlation analysis reveals that STARS and LabelAppeal have the strongest positive relationships with the response variable. In contrast, AcidIndex shows a mild negative correlation with sales. Overall, while some correlations exist among predictors, no severe multicollinearity issues are observed.

Code
# Compute correlation matrix
corr_matrix <- cor(drop_na(train_df))

# Extract correlations with the target variable
target_correlation <- corr_matrix[, "TARGET"]

# Display correlation table
knitr::kable(target_correlation, "html", escape = FALSE) %>%
  kableExtra::kable_styling("striped", full_width = FALSE) %>%
  kableExtra::column_spec(1, bold = TRUE) %>%
  kableExtra::scroll_box(height = "500px")
x
TARGET 1.0000000
FixedAcidity -0.0125381
VolatileAcidity -0.0759979
CitricAcid 0.0023450
ResidualSugar 0.0035196
Chlorides -0.0304301
FreeSulfurDioxide 0.0226398
TotalSulfurDioxide 0.0216021
Density -0.0475989
pH 0.0002199
Sulphates -0.0212204
Alcohol 0.0737771
LabelAppeal 0.4979465
AcidIndex -0.1676431
STARS 0.5546857
Code
corrgram(drop_na(train_df), order=TRUE,
         upper.panel=panel.cor, main="Correlation Plot")

In the correlation table, we can see that STARS and LabelAppeal are most positively correlated variables with the response variable. Also, we some mild negative correlation between the response variable and AcidIndex variable.

Data Preprocessing

Handling Negative Values

To address negative values in chemical variables, absolute transformations are applied to ensure all values are non-negative. For LabelAppeal, values are shifted by adding the minimum value to maintain relative differences while eliminating negative entries.

Code
train_df$FixedAcidity <- abs(train_df$FixedAcidity)
evaluate_df$FixedAcidity <- abs(evaluate_df$FixedAcidity)

train_df$VolatileAcidity <- abs(train_df$VolatileAcidity)
evaluate_df$VolatileAcidity <- abs(evaluate_df$VolatileAcidity)

train_df$CitricAcid <- abs(train_df$CitricAcid)
evaluate_df$CitricAcid <- abs(evaluate_df$CitricAcid)

train_df$ResidualSugar <- abs(train_df$ResidualSugar)
evaluate_df$ResidualSugar <- abs(evaluate_df$ResidualSugar)

train_df$Chlorides <- abs(train_df$Chlorides)
evaluate_df$Chlorides <- abs(evaluate_df$Chlorides)

train_df$FreeSulfurDioxide <- abs(train_df$FreeSulfurDioxide)
evaluate_df$FreeSulfurDioxide <- abs(evaluate_df$FreeSulfurDioxide)

train_df$TotalSulfurDioxide <- abs(train_df$TotalSulfurDioxide)
evaluate_df$TotalSulfurDioxide <- abs(evaluate_df$TotalSulfurDioxide)

train_df$Sulphates <- abs(train_df$Sulphates)
evaluate_df$Sulphates <- abs(evaluate_df$Sulphates)

train_df$LabelAppeal <- train_df$LabelAppeal + abs(min(train_df$LabelAppeal))
evaluate_df$LabelAppeal <- evaluate_df$LabelAppeal + abs(min(evaluate_df$LabelAppeal))

Missing Value Treatment

Several variables contain missing values, with STARS having the highest proportion. Missing values are assumed to be Missing at Random (MAR) and are imputed using the MICE (Multiple Imputation by Chained Equations) method.

Code
#Identify missing data by Feature and display percent breakout
plot_missing(train_df)

According to the graph, the data set has multiple variables with missing variables. The STARS variable has the most NA values. The Sulphates variable records missing values in roughly 10% of observations, while the remaining six predictors have missing values ranging from 3% to 5%. These missing values will be imputed later on during the data preparation using the MICE package and random forest prediction method.

Code
train_df$STARS[is.na(train_df$STARS)] <- 0
evaluate_df$STARS[is.na(evaluate_df$STARS)] <- 0
# Perform multiple imputation
mice_imputes <- mice(train_df, m = 2, maxit = 2, print = FALSE)
# Visualize the imputed values with density plot
densityplot(mice_imputes)

The imputed distributions closely resemble the observed distributions, suggesting the missing values are reasonably consistent with a Missing at Random (MAR) assumption. We’ll also run the mice imputation again on both the train and test set. Instead of using it for our models, however, we’ll simplify our run and fill in our data. Finally, after our analysis, we can use it in in our model, we’ll update STARS to become a factor variable.

Code
mice_train <-  mice(train_df, m = 1, maxit = 1, print = FALSE)
cleaned_train <- complete(mice_train)

mice_evaluate <- mice(evaluate_df, m = 1, maxit = 1, print = FALSE)
cleaned_evaluate <- complete(mice_evaluate)

cleaned_train$STARS <- as.factor(cleaned_train$STARS)
cleaned_evaluate$STARS <- as.factor(cleaned_evaluate$STARS)

Statistical Summary and Correlation Review

Following data cleaning and imputation, updated summaries and visualizations confirm that the dataset is well-structured and suitable for modeling. The distributions remain consistent with earlier observations, and correlation patterns are preserved.

Code
plot_histogram(cleaned_train, title = "Revised Histogram of Cleaned Training Data")

Code
library(summarytools)
library(knitr)
library(kableExtra)
library(dplyr)

# Create summary table
summary_table <- descr(cleaned_train)

# Transpose table
transposed_summary_table <- t(summary_table)

# Beautiful HTML table
knitr::kable(
  transposed_summary_table,
  format = "html",
  caption = "Summary Statistics of Cleaned Training Data",
  digits = 2,
  escape = FALSE
) %>%
  kable_styling(
    bootstrap_options = c(
      "striped",
      "hover",
      "condensed",
      "responsive"
    ),
    full_width = FALSE,
    position = "center"
  ) %>%
  row_spec(0,
           bold = TRUE,
           color = "white",
           background = "#2C3E50") %>%
  column_spec(1,
              bold = TRUE,
              color = "#2C3E50") %>%
  scroll_box(height = "500px")
Summary Statistics of Cleaned Training Data
Mean Std.Dev Min Q1 Median Q3 Max MAD IQR CV Skewness SE.Skewness Kurtosis N.Valid N Pct.Valid
AcidIndex 7.77 1.32 4.00 7.00 8.00 8.00 17.00 1.48 1.00 0.17 1.65 0.02 5.19 12795 12795 100
Alcohol 10.49 3.74 -4.70 9.00 10.40 12.40 26.50 2.52 3.40 0.36 -0.05 0.02 1.54 12795 12795 100
Chlorides 0.22 0.23 0.00 0.04 0.10 0.37 1.35 0.10 0.32 1.05 1.48 0.02 2.16 12795 12795 100
CitricAcid 0.69 0.61 0.00 0.28 0.44 0.97 3.86 0.33 0.69 0.88 1.64 0.02 2.95 12795 12795 100
Density 0.99 0.03 0.89 0.99 0.99 1.00 1.10 0.01 0.01 0.03 -0.02 0.02 1.90 12795 12795 100
FixedAcidity 8.06 5.00 0.00 5.60 7.00 9.80 34.40 2.97 4.20 0.62 1.17 0.02 1.97 12795 12795 100
FreeSulfurDioxide 106.61 108.10 0.00 28.00 55.50 171.00 623.00 60.05 143.00 1.01 1.54 0.02 2.46 12795 12795 100
LabelAppeal 1.99 0.89 0.00 1.00 2.00 3.00 4.00 1.48 2.00 0.45 0.01 0.02 -0.26 12795 12795 100
pH 3.21 0.68 0.48 2.96 3.20 3.47 6.13 0.39 0.51 0.21 0.04 0.02 1.65 12795 12795 100
ResidualSugar 23.41 24.95 0.00 3.60 12.90 38.80 141.15 16.31 35.20 1.07 1.46 0.02 2.19 12795 12795 100
Sulphates 0.84 0.65 0.00 0.43 0.59 1.09 4.24 0.33 0.66 0.78 1.69 0.02 3.18 12795 12795 100
TARGET 3.03 1.93 0.00 2.00 3.00 4.00 8.00 1.48 2.00 0.64 -0.33 0.02 -0.88 12795 12795 100
TotalSulfurDioxide 204.51 162.96 0.00 100.00 154.00 263.00 1057.00 102.30 163.00 0.80 1.62 0.02 3.09 12795 12795 100
VolatileAcidity 0.64 0.56 0.00 0.25 0.41 0.91 3.68 0.33 0.66 0.87 1.65 0.02 3.08 12795 12795 100
Code
# -----------------------------
# Correlation Plot
# -----------------------------

numeric_data <- cleaned_train %>%
  select(where(is.numeric)) %>%
  na.omit()

# corr_matrix <- cor(numeric_data)

numeric_data <- cleaned_train %>%
  select(where(is.numeric)) %>%
  select(-TARGET)


corrplot(
  corr_matrix,
  method = "color",
  type = "upper",
  order = "hclust",
  addCoef.col = "black",
  tl.col = "black",
  tl.srt = 45,
  number.cex = 0.5,
  col = colorRampPalette(
    c("darkred", "white", "darkblue")
  )(200)
)

title("Correlation Matrix of Cleaned Training Data")


### Train-Test Data Split

The dataset is divided into training (80%) and testing (20%) subsets. Models are trained on the training data and evaluated on the testing data to ensure unbiased performance assessment.


::: {.cell}

```{.r .cell-code}
options(scipen = 999)

# Get training/test split
y_raw <- as.matrix(cleaned_train$TARGET)
trainingRows <- createDataPartition(y_raw, p=0.8, list=FALSE)

# Build training data sets
trainX <- cleaned_train[trainingRows,] %>% select(-TARGET)
trainY <- cleaned_train[trainingRows,] %>% select(TARGET)

# Build testing data set
testX <- cleaned_train[-trainingRows,] %>% select(-TARGET)
testY <- cleaned_train[-trainingRows,] %>% select(TARGET)

# Build a DF
trainingData <- as.data.frame(trainX)
trainingData$TARGET <- trainY$TARGET
print(paste('Number of Training Samples: ', dim(trainingData)[1]))
[1] "Number of Training Samples:  10238"
Code
testingData <- as.data.frame(testX)
testingData$TARGET <- testY$TARGET
print(paste('Number of Testing Samples: ', dim(testingData)[1]))
[1] "Number of Testing Samples:  2557"
Code
model_test_perf <- function(model, trainX, trainY, testX, testY) {
  # Evaluate Model 1 with testing data set
  predictedY <- predict(model, newdata=testX)

  model_results <- data.frame(obs = testY, pred=predictedY)
  colnames(model_results) = c('obs', 'pred')
  
  # This grabs RMSE, Rsquaredand MAE by default
  model_eval <- defaultSummary(model_results)
  
  # Add AIC score to the results
  if ('aic' %in% model) {
    model_eval[4] <- model$aic
  } else {
    model_eval[4] <- AIC(model)
  }
  
  names(model_eval)[4] <- 'aic'
 
  # Add BIC score to the results
  model_eval[5] <- BIC(model)
  names(model_eval)[5] <- 'bic'

   
  return(model_eval)
}

:::

Assessing Overdispersion

Code
mean(trainingData$TARGET)
[1] 3.02901
Code
var(trainingData$TARGET)
[1] 3.699754
Code
target_mean <- mean(trainingData$TARGET)
target_var  <- var(trainingData$TARGET)

target_mean
[1] 3.02901
Code
target_var
[1] 3.699754

The Poisson regression model assumes that the mean and variance of the response variable are approximately equal. However, the TARGET variable shows evidence of overdispersion because the variance is substantially larger than the mean.

This indicates that the Poisson assumption may not hold for the wine sales data. As a result, the Negative Binomial regression model becomes more appropriate because it includes an additional dispersion parameter that can better account for variability in count-based outcomes.

Model Development

Multiple models are developed to predict wine sales, including Poisson regression, Negative Binomial regression, and Multiple Linear Regression. Both full models (using all predictors) and reduced models (using selected predictors) are evaluated.

Full Poisson Regression Model - Model 1

In this first model, we include all available features: FixedAcidity, VolatileAcidity, CitricAcid, ResidualSugar, Chlorides, FreeSulfurDioxide, TotalSulfurDioxide, Density, pH, Sulphates, Alcohol, LabelAppeal, AcidIndex, STARS

Code
options(scipen = 999)

Poisson_Model1 <- glm(TARGET ~ ., data = trainingData, family = poisson)

summary(Poisson_Model1)

Call:
glm(formula = TARGET ~ ., family = poisson, data = trainingData)

Coefficients:
                      Estimate  Std. Error z value             Pr(>|z|)    
(Intercept)         0.87947458  0.21865504   4.022            0.0000577 ***
FixedAcidity       -0.00080780  0.00117974  -0.685             0.493514    
VolatileAcidity    -0.03979158  0.01059527  -3.756             0.000173 ***
CitricAcid          0.00896254  0.00929062   0.965             0.334702    
ResidualSugar       0.00011147  0.00022600   0.493             0.621846    
Chlorides          -0.03434444  0.02440807  -1.407             0.159400    
FreeSulfurDioxide   0.00004408  0.00005221   0.844             0.398508    
TotalSulfurDioxide  0.00008236  0.00003485   2.363             0.018119 *  
Density            -0.32104716  0.21350329  -1.504             0.132656    
pH                 -0.01110530  0.00839945  -1.322             0.186120    
Sulphates          -0.00971838  0.00867069  -1.121             0.262360    
Alcohol             0.00351777  0.00151846   2.317             0.020521 *  
LabelAppeal         0.15585313  0.00683140  22.814 < 0.0000000000000002 ***
AcidIndex          -0.08201567  0.00512208 -16.012 < 0.0000000000000002 ***
STARS1              0.76514823  0.02169905  35.262 < 0.0000000000000002 ***
STARS2              1.08588180  0.02026659  53.580 < 0.0000000000000002 ***
STARS3              1.20763733  0.02130707  56.678 < 0.0000000000000002 ***
STARS4              1.32284186  0.02717071  48.686 < 0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 18238  on 10237  degrees of freedom
Residual deviance: 10931  on 10220  degrees of freedom
AIC: 36541

Number of Fisher Scoring iterations: 6
Code
# Evaluate Model 1 with testing data set
(Poission_Evaluate1 <- model_test_perf(Poisson_Model1, trainX, trainY, testX, testY))
         RMSE      Rsquared           MAE           aic           bic 
    2.5856948     0.5241326     2.2120363 36541.2541161 36671.4636243 

This model identified several significant predictors of wine sales, particularly LabelAppeal, STARS, Alcohol, and AcidIndex. Higher STARS ratings and stronger label appeal were associated with increased wine purchases, while higher AcidIndex negatively impacted sales. The model achieved moderate predictive performance with an RMSE of 2.59 and an R-squared of 0.50. However, because the wine sales data showed overdispersion, the Poisson model may not fully capture the variability in the response variable, suggesting that a Negative Binomial model may provide a better fit.

Reduced Poisson Regression Model - Model 2

In this Model 2, we only include the most predictive features : VolatileAcidity, TotalSulfurDioxide, Alcohol, LabelAppeal, AcidIndex, STARS

Code
Poisson_Model2 <- glm(TARGET ~  VolatileAcidity + TotalSulfurDioxide + Alcohol + LabelAppeal + AcidIndex + STARS, data = trainingData, family = poisson)

summary(Poisson_Model2)

Call:
glm(formula = TARGET ~ VolatileAcidity + TotalSulfurDioxide + 
    Alcohol + LabelAppeal + AcidIndex + STARS, family = poisson, 
    data = trainingData)

Coefficients:
                      Estimate  Std. Error z value             Pr(>|z|)    
(Intercept)         0.51763386  0.04966257  10.423 < 0.0000000000000002 ***
VolatileAcidity    -0.04002963  0.01059337  -3.779             0.000158 ***
TotalSulfurDioxide  0.00008324  0.00003482   2.391             0.016817 *  
Alcohol             0.00355403  0.00151798   2.341             0.019217 *  
LabelAppeal         0.15596745  0.00682763  22.844 < 0.0000000000000002 ***
AcidIndex          -0.08247967  0.00504678 -16.343 < 0.0000000000000002 ***
STARS1              0.76689124  0.02168816  35.360 < 0.0000000000000002 ***
STARS2              1.08738351  0.02025825  53.676 < 0.0000000000000002 ***
STARS3              1.20954432  0.02129707  56.794 < 0.0000000000000002 ***
STARS4              1.32436529  0.02716599  48.751 < 0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 18238  on 10237  degrees of freedom
Residual deviance: 10941  on 10228  degrees of freedom
AIC: 36535

Number of Fisher Scoring iterations: 6
Code
# Evaluate Model 2 with testing data set
(Poission_Evaluate2 <- model_test_perf(Poisson_Model2, trainX, trainY, testX, testY))
         RMSE      Rsquared           MAE           aic           bic 
    2.5861901     0.5238788     2.2126185 36535.0243447 36607.3629604 

This Model retained the most significant predictors of wine sales, including VolatileAcidity, TotalSulfurDioxide, Alcohol, LabelAppeal, AcidIndex, and STARS. The results show that higher STARS ratings, stronger label appeal, and higher alcohol content positively influenced wine purchases, while higher volatile acidity and AcidIndex negatively affected sales. The model achieved moderate predictive performance with an RMSE of 2.59 and an R-squared of 0.50. Compared to the full Poisson model, this reduced model provided a simpler and more interpretable structure while maintaining similar predictive accuracy.

Full Negative Binomial Model - Model 3

Similar to Poisson Model 1, the predictors for the following model are: FixedAcidity, VolatileAcidity, CitricAcid, ResidualSugar, Chlorides, FreeSulfurDioxide, TotalSulfurDioxide, Density, pH, Sulphates, Alcohol, LabelAppeal, AcidIndex, STARS

Code
nb3 <- glm.nb(TARGET ~ ., data = trainingData)
summary(nb3)

Call:
glm.nb(formula = TARGET ~ ., data = trainingData, init.theta = 40907.59317, 
    link = log)

Coefficients:
                      Estimate  Std. Error z value             Pr(>|z|)    
(Intercept)         0.87951046  0.21866487   4.022            0.0000577 ***
FixedAcidity       -0.00080782  0.00117979  -0.685             0.493523    
VolatileAcidity    -0.03979320  0.01059573  -3.756             0.000173 ***
CitricAcid          0.00896277  0.00929105   0.965             0.334711    
ResidualSugar       0.00011147  0.00022601   0.493             0.621858    
Chlorides          -0.03434481  0.02440917  -1.407             0.159414    
FreeSulfurDioxide   0.00004408  0.00005221   0.844             0.398503    
TotalSulfurDioxide  0.00008236  0.00003485   2.363             0.018118 *  
Density            -0.32105647  0.21351293  -1.504             0.132662    
pH                 -0.01110634  0.00839983  -1.322             0.186098    
Sulphates          -0.00971890  0.00867108  -1.121             0.262355    
Alcohol             0.00351770  0.00151852   2.317             0.020530 *  
LabelAppeal         0.15585201  0.00683171  22.813 < 0.0000000000000002 ***
AcidIndex          -0.08201820  0.00512228 -16.012 < 0.0000000000000002 ***
STARS1              0.76514711  0.02169951  35.261 < 0.0000000000000002 ***
STARS2              1.08588083  0.02026705  53.579 < 0.0000000000000002 ***
STARS3              1.20763727  0.02130765  56.676 < 0.0000000000000002 ***
STARS4              1.32284266  0.02717190  48.684 < 0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Negative Binomial(40907.59) family taken to be 1)

    Null deviance: 18237  on 10237  degrees of freedom
Residual deviance: 10930  on 10220  degrees of freedom
AIC: 36544

Number of Fisher Scoring iterations: 1

              Theta:  40908 
          Std. Err.:  38575 
Warning while fitting theta: iteration limit reached 

 2 x log-likelihood:  -36505.59 
Code
# Evaluate Model 3 with testing data set
(Negative_Binomial_eval3 <- model_test_perf(nb3, trainX, trainY, testX, testY))
         RMSE      Rsquared           MAE           aic           bic 
    2.5856947     0.5241325     2.2120361 36543.5873515 36681.0307213 

This Model retained the most significant predictors of wine sales, including VolatileAcidity, TotalSulfurDioxide, Alcohol, LabelAppeal, AcidIndex, and STARS. The results show that higher STARS ratings, stronger label appeal, and higher alcohol content positively influenced wine purchases, while higher volatile acidity and AcidIndex negatively affected sales. The model achieved moderate predictive performance with an RMSE of 2.59 and an R-squared of 0.50. Compared to the full Poisson model, this reduced model provided a simpler and more interpretable structure while maintaining similar predictive accuracy.

Reduced Negative Binomial Model - Model 4

Similar to Poisson Model 2, the predictors for the following model are: VolatileAcidity, FreeSulfurDioxide, TotalSulfurDioxide, Alcohol, LabelAppeal, AcidIndex, STARS

Code
nb4 <- glm.nb(TARGET~ VolatileAcidity + FreeSulfurDioxide + TotalSulfurDioxide + Alcohol + LabelAppeal + AcidIndex + STARS,  data=cleaned_train)

summary (nb4)

Call:
glm.nb(formula = TARGET ~ VolatileAcidity + FreeSulfurDioxide + 
    TotalSulfurDioxide + Alcohol + LabelAppeal + AcidIndex + 
    STARS, data = cleaned_train, init.theta = 40660.03007, link = log)

Coefficients:
                      Estimate  Std. Error z value             Pr(>|z|)    
(Intercept)         0.48095237  0.04480663  10.734 < 0.0000000000000002 ***
VolatileAcidity    -0.03705484  0.00939643  -3.944            0.0000803 ***
FreeSulfurDioxide   0.00004554  0.00004696   0.970              0.33217    
TotalSulfurDioxide  0.00008470  0.00003128   2.708              0.00678 ** 
Alcohol             0.00356284  0.00136690   2.606              0.00915 ** 
LabelAppeal         0.15925190  0.00612540  25.999 < 0.0000000000000002 ***
AcidIndex          -0.08035959  0.00449916 -17.861 < 0.0000000000000002 ***
STARS1              0.77100114  0.01952976  39.478 < 0.0000000000000002 ***
STARS2              1.09244656  0.01820271  60.016 < 0.0000000000000002 ***
STARS3              1.21303582  0.01917468  63.262 < 0.0000000000000002 ***
STARS4              1.32897187  0.02428131  54.732 < 0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Negative Binomial(40660.03) family taken to be 1)

    Null deviance: 22860  on 12794  degrees of freedom
Residual deviance: 13685  on 12784  degrees of freedom
AIC: 45652

Number of Fisher Scoring iterations: 1

              Theta:  40660 
          Std. Err.:  34193 
Warning while fitting theta: iteration limit reached 

 2 x log-likelihood:  -45627.64 
Code
# Evaluate Model 4 with testing data set
(Negative_Binomial_eval4 <- model_test_perf(nb4, trainX, trainY, testX, testY))
         RMSE      Rsquared           MAE           aic           bic 
    2.5880849     0.5243596     2.2142707 45651.6396402 45741.1213571 

This Model identified STARS, LabelAppeal, Alcohol, and TotalSulfurDioxide as significant positive predictors of wine sales, while VolatileAcidity and AcidIndex showed negative relationships with TARGET. The model achieved moderate predictive performance with an RMSE of 2.59 and an R-squared of 0.50. Compared to the Poisson models, the Negative Binomial model is more appropriate because it better handles overdispersion in the count-based wine sales data. Overall, Model 4 provided a good balance between predictive accuracy, interpretability, and suitability for the dataset.

Full Multiple Linear Regression Model - Model 5

The predictors for the following model are: FixedAcidity, VolatileAcidity, CitricAcid, ResidualSugar, Chlorides, FreeSulfurDioxide, TotalSulfurDioxide, Density, pH, Sulphates, Alcohol, LabelAppeal, AcidIndex, STARS

Code
lm5 <- lm(TARGET ~ ., data = trainingData)
summary(lm5)

Call:
lm(formula = TARGET ~ ., data = trainingData)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.7926 -0.8798  0.0301  0.8432  6.1990 

Coefficients:
                      Estimate  Std. Error t value             Pr(>|t|)    
(Intercept)         3.01225435  0.49644404   6.068        0.00000000134 ***
FixedAcidity       -0.00145845  0.00264960  -0.550             0.582029    
VolatileAcidity    -0.11663240  0.02352934  -4.957        0.00000072780 ***
CitricAcid          0.02729614  0.02130213   1.281             0.200089    
ResidualSugar       0.00010163  0.00051539   0.197             0.843676    
Chlorides          -0.09747423  0.05502802  -1.771             0.076531 .  
FreeSulfurDioxide   0.00014577  0.00011905   1.225             0.220789    
TotalSulfurDioxide  0.00025652  0.00007918   3.240             0.001200 ** 
Density            -0.98515411  0.48677807  -2.024             0.043014 *  
pH                 -0.02874418  0.01907151  -1.507             0.131796    
Sulphates          -0.02218898  0.01953237  -1.136             0.255979    
Alcohol             0.01159651  0.00344222   3.369             0.000757 ***
LabelAppeal         0.45449465  0.01521162  29.878 < 0.0000000000000002 ***
AcidIndex          -0.20323954  0.01017001 -19.984 < 0.0000000000000002 ***
STARS1              1.36888638  0.03674681  37.252 < 0.0000000000000002 ***
STARS2              2.40223814  0.03582838  67.048 < 0.0000000000000002 ***
STARS3              2.97722302  0.04129503  72.096 < 0.0000000000000002 ***
STARS4              3.66945191  0.06663608  55.067 < 0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.309 on 10220 degrees of freedom
Multiple R-squared:  0.5378,    Adjusted R-squared:  0.537 
F-statistic: 699.5 on 17 and 10220 DF,  p-value: < 0.00000000000000022
Code
# Evaluate Model 5 with testing data set
(Linear_Regression_eval5 <- model_test_perf(lm5, trainX, trainY, testX, testY))
         RMSE      Rsquared           MAE           aic           bic 
    1.3065395     0.5457749     1.0230416 34583.7182437 34721.1616135 

This Model demonstrated strong predictive performance with an RMSE of 1.33 and an R-squared of 0.53, outperforming the Poisson and Negative Binomial models. Variables such as STARS, LabelAppeal, and Alcohol showed significant positive relationships with wine sales, while VolatileAcidity and AcidIndex negatively affected TARGET. The overall model was statistically significant, indicating that the predictors collectively explained a substantial portion of the variation in wine sales. However, despite its strong performance, linear regression may not be ideal for count-based response data because it does not account for overdispersion or non-negative count constraints.

Stepwise Linear Regression Model - Model 6

For the final Linear Model, we leverage stepAIC on our Linear Model 5 to choose the most important features.

Code
lm6 <- stepAIC(lm5, direction = "both",
               scope = list(upper = lm5, lower = ~ 1),
               scale = 0, trace = FALSE)

summary(lm6)

Call:
lm(formula = TARGET ~ VolatileAcidity + Chlorides + TotalSulfurDioxide + 
    Density + pH + Alcohol + LabelAppeal + AcidIndex + STARS, 
    data = trainingData)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.8487 -0.8798  0.0321  0.8451  6.1851 

Coefficients:
                      Estimate  Std. Error t value             Pr(>|t|)    
(Intercept)         3.03344749  0.49584240   6.118       0.000000000984 ***
VolatileAcidity    -0.11701321  0.02352717  -4.974       0.000000668113 ***
Chlorides          -0.09909862  0.05501229  -1.801             0.071671 .  
TotalSulfurDioxide  0.00025994  0.00007915   3.284             0.001026 ** 
Density            -0.99423962  0.48665454  -2.043             0.041077 *  
pH                 -0.02896674  0.01906636  -1.519             0.128728    
Alcohol             0.01151206  0.00344162   3.345             0.000826 ***
LabelAppeal         0.45516977  0.01520667  29.932 < 0.0000000000000002 ***
AcidIndex          -0.20417181  0.00999712 -20.423 < 0.0000000000000002 ***
STARS1              1.37082946  0.03672941  37.322 < 0.0000000000000002 ***
STARS2              2.40430556  0.03580801  67.144 < 0.0000000000000002 ***
STARS3              2.97868645  0.04128384  72.151 < 0.0000000000000002 ***
STARS4              3.67113070  0.06662296  55.103 < 0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.309 on 10225 degrees of freedom
Multiple R-squared:  0.5376,    Adjusted R-squared:  0.5371 
F-statistic: 990.6 on 12 and 10225 DF,  p-value: < 0.00000000000000022
Code
# Evaluate Model 6 with testing data set
(Linear_Regression_eval6 <- model_test_perf(lm6, trainX, trainY, testX, testY))
         RMSE      Rsquared           MAE           aic           bic 
    1.3072557     0.5452748     1.0236576 34578.4699096 34679.7439715 

This Model selected the most important predictors using the stepAIC method and achieved strong predictive performance with an RMSE of 1.33 and an R-squared of 0.53. Variables such as STARS, LabelAppeal, and Alcohol showed significant positive effects on wine sales, while VolatileAcidity and AcidIndex negatively influenced TARGET. Compared to the full linear regression model, this reduced model maintained similar predictive accuracy while using fewer variables, making it more efficient and interpretable. Based on AIC and BIC values, Model 6 provided the best overall balance between model simplicity and predictive performance.

Model Evaluation and Selection

Model performance is evaluated using metrics such as RMSE, R-squared, MAE, AIC, and BIC. Among all models, the multiple linear regression models demonstrate superior predictive performance.

In particular, the stepwise-selected model (Model 6) achieves the best balance across evaluation metrics, with lower AIC and BIC values indicating improved model efficiency.

Code
models_summary <- rbind(Poission_Evaluate1, Poission_Evaluate2, Negative_Binomial_eval3, Negative_Binomial_eval4, Linear_Regression_eval5, Linear_Regression_eval6)

kable(
  round(models_summary,3),
  caption = "Model Performance Comparison"
) %>%
  kable_styling(
    bootstrap_options = c(
      "striped",
      "hover",
      "condensed",
      "responsive"
    ),
    full_width = FALSE,
    position = "center"
  )
Model Performance Comparison
RMSE Rsquared MAE aic bic
Poission_Evaluate1 2.586 0.524 2.212 36541.25 36671.46
Poission_Evaluate2 2.586 0.524 2.213 36535.02 36607.36
Negative_Binomial_eval3 2.586 0.524 2.212 36543.59 36681.03
Negative_Binomial_eval4 2.588 0.524 2.214 45651.64 45741.12
Linear_Regression_eval5 1.307 0.546 1.023 34583.72 34721.16
Linear_Regression_eval6 1.307 0.545 1.024 34578.47 34679.74

This table summarizes the RMSE, RSQUARED, MAE, AIC and BIC for all SIX models. The Linear regressions (Linear Model 5 and Linear Model 6) had the overall best performance based on RMSE and RSQUARED; as well as Linear Model 6 had the best performance based on AIC and BIC.

Across all model types, STARS and LabelAppeal consistently showed positive coefficients, indicating that wines with higher ratings and stronger label appeal tend to generate higher sales volumes. AcidIndex generally showed a negative relationship with TARGET, suggesting that higher acidity may reduce customer demand. Alcohol content displayed a moderate positive effect in both Poisson and Negative Binomial models, which may indicate customer preference for wines with higher alcohol concentration. Some variables changed sign across different model types. For example, pH showed inconsistent effects between the Poisson and linear regression models. This may be due to multicollinearity or differing distributional assumptions among the models. Despite these inconsistencies, variables with unstable coefficients were retained if they improved overall predictive performance and model fit statistics.

Although the multiple linear regression models achieved slightly better RMSE and R-squared performance, linear regression is not ideal for count-based response variables because predictions may fall outside the valid range and assumptions of normality may be violated.

Since TARGET represents count data with substantial overdispersion and many zero values, the Negative Binomial regression model provides a more theoretically appropriate framework for deployment.

Therefore, Negative Binomial Model 4 was selected as the final deployment model because it balances predictive performance, interpretability, and suitability for count data.

Sales Prediction Results

The selected model is applied to the evaluation dataset to generate predictions. The resulting distribution of predicted values closely resembles that of the training data, suggesting that the model generalizes well.

Code
library(ggplot2)

# Prepare evaluation data
eval_data <- cleaned_evaluate %>% 
  select(-TARGET)

# Generate predictions
predictions <- predict(nb4, eval_data, type = "response")

# Add predictions back to dataframe
eval_data$TARGET <- predictions

# Beautiful histogram
ggplot(data.frame(predictions), aes(x = predictions)) +
  geom_histogram(
    bins = 30,
    fill = "steelblue",
    color = "white",
    alpha = 0.9
  ) +
  geom_density(
    aes(y = after_stat(count)),
    color = "darkred",
    linewidth = 1.2
  ) +
  labs(
    title = "Distribution of Predicted Values",
    x = "Predicted TARGET",
    y = "Frequency"
  ) +
  theme_minimal(base_size = 14) +
  theme(
    plot.title = element_text(
      face = "bold",
      hjust = 0.5
    ),
    axis.title = element_text(face = "bold")
  )

Code
# Export predictions
write.csv(
  eval_data,
  "DATA621_HW5_G2-Predictions.csv",
  row.names = FALSE
)

# Preview results
head(eval_data)
  FixedAcidity VolatileAcidity CitricAcid ResidualSugar Chlorides
1          5.4           0.860       0.27          10.7     0.092
2         12.4           0.385       0.76          19.7     1.169
3          7.2           1.750       0.17          33.0     0.065
4          6.2           0.100       1.80           1.0     0.179
5         11.4           0.210       0.28           1.2     0.038
6         17.6           0.040       1.15           1.4     0.535
  FreeSulfurDioxide TotalSulfurDioxide Density   pH Sulphates Alcohol
1                23                398 0.98527 5.02      0.64   12.30
2                37                 68 0.99048 3.37      1.09   16.00
3                 9                 76 1.04641 4.61      0.68    8.55
4               104                 89 0.98877 3.20      2.11   12.30
5                70                 53 1.02899 2.54      0.07    4.80
6               250                140 0.95028 3.06      0.02   11.40
  LabelAppeal AcidIndex STARS   TARGET
1           1         6     0 1.227233
2           2         6     2 4.305655
3           2         8     1 2.459712
4           1         8     1 2.272137
5           2        10     0 1.012940
6           3         8     4 5.513393

The histogram shows that our predictions have a similar shape to our training Target variable, the means and medians are almost identical, and the kurtosis values are close.

Final Conclusions

Several predictive modeling approaches were explored, including Poisson regression, Negative Binomial regression, and Multiple Linear Regression. Exploratory analysis revealed that the TARGET variable contains many zero observations and demonstrates characteristics of overdispersion, making count-based regression models more appropriate.

Although the multiple linear regression models achieved slightly stronger RMSE and R-squared performance, the Negative Binomial models provided a better theoretical fit for the count nature of the response variable. In particular, Negative Binomial Model 4 achieved a strong balance between predictive accuracy, interpretability, and model simplicity.

Key predictors such as STARS, Alcohol, and LabelAppeal consistently demonstrated positive relationships with wine sales, while AcidIndex showed a generally negative relationship with TARGET.

Overall, the Negative Binomial regression framework provided the most appropriate balance between predictive performance and theoretical suitability for count-based sales data. The final model successfully captured key drivers of wine purchasing behavior while addressing overdispersion present in the response variable. These findings demonstrate how statistical modeling can support data-driven decision-making in consumer sales forecasting.