Data621 - Homework 5 (Wine Sales Prediction Analysis)

Author

Bikash Bhowmik, Rupendra Shrestha, Anthony Josue Roman, Jerald Melukkaran, Haoming Chen

Published

May 17, 2026

Project Overview

Data-driven decision-making is essential for achieving effective business outcomes. In this analysis, we begin by exploring the dataset to identify potential issues such as outliers, missing values, encoding inconsistencies, and multicollinearity. After addressing these challenges through appropriate data preprocessing techniques, we develop and evaluate several predictive models for estimating sales.

The dataset is divided into training and evaluation subsets, allowing us to train models on one portion and validate their performance on unseen data. Multiple modeling approaches are compared, and the final model is selected based on its ability to balance predictive accuracy with model simplicity.

Code

library(summarytools)
library(MASS)
library(rpart.plot)
library(ggplot2)
library(ggfortify)
library(gridExtra)
library(forecast)
library(fpp2)
library(fma)
library(kableExtra)
library(e1071)
library(mlbench)
library(ggcorrplot)
library(DataExplorer)
library(timeDate)
library(caret)
library(GGally)
library(corrplot)
library(RColorBrewer)
library(tibble)
library(tidyr)
library(tidyverse)
library(dplyr)
library(reshape2)
library(mixtools)
library(tidymodels)
library(ggpmisc)
library(regclass)
library(skimr)
library(corrgram)
library(mice)
#' Print a side-by-side Histogram and QQPlot of Residuals
#'
#' @param model A model
#' @examples
#' residPlot(myModel)
#' @return null
#' @export
residPlot <- function(model) {
  # Make sure a model was passed
  if (is.null(model)) {
    return
  }
  
  layout(matrix(c(1,1,2,3), 2, 2, byrow = TRUE))
  plot(residuals(model))
  hist(model[["residuals"]], freq = FALSE, breaks = "fd", main = "Residual Histogram",
       xlab = "Residuals",col="lightgreen")
  lines(density(model[["residuals"]], kernel = "ep"),col="blue", lwd=3)
  curve(dnorm(x,mean=mean(model[["residuals"]]), sd=sd(model[["residuals"]])), col="red", lwd=3, lty="dotted", add=T)
  qqnorm(model[["residuals"]], main = "Residual Q-Q plot")
  qqline(model[["residuals"]],col="red", lwd=3, lty="dotted")
  par(mfrow = c(1, 1))
}
#' Print a Variable Importance Plot for the provided model
#'
#' @param model The model
#' @param chart_title The Title to show on the plot
#' @examples
#' variableImportancePlot(myLinearModel, 'My Title)
#' @return null
#' @export
variableImportancePlot <- function(model=NULL, chart_title='Variable Importance Plot') {
  # Make sure a model was passed
  if (is.null(model)) {
    return
  }
  
  # use caret and gglot to print a variable importance plot
  varImp(model) %>% as.data.frame() %>% 
    ggplot(aes(x = reorder(rownames(.), desc(Overall)), y = Overall)) +
    geom_col(aes(fill = Overall)) +
    theme(panel.background = element_blank(),
          panel.grid = element_blank(),
          axis.text.x = element_text(angle = 90)) +
    scale_fill_gradient() +
    labs(title = chart_title,
         x = "Parameter",
         y = "Relative Importance")
}
#' Print a Facet Chart of histograms
#'
#' @param df Dataset
#' @param box Facet size (rows)
#' @examples
#' histbox(my_df, 3)
#' @return null
#' @export
histbox <- function(df, box) {
    par(mfrow = box)
    ndf <- dimnames(df)[[2]]
    
    for (i in seq_along(ndf)) {
            data <- na.omit(unlist(df[, i]))
            hist(data, breaks = "fd", main = paste("Histogram of", ndf[i]),
                 xlab = ndf[i], freq = FALSE)
            lines(density(data, kernel = "ep"), col = 'red')
    }
    
    par(mfrow = c(1, 1))
}
#' Extract key performance results from a model
#'
#' @param model A linear model of interest
#' @examples
#' model_performance_extraction(my_model)
#' @return data.frame
#' @export
model_performance_extraction <- function(model=NULL) {
  # Make sure a model was passed
  if (is.null(model)) {
    return
  }
  
  data.frame("RSE" = model$sigma,
             "Adj R2" = model$adj.r.squared,
             "F-Statistic" = model$fstatistic[1])
}

Exploratory Data Analysis

Code

train_df <- read.csv('https://raw.githubusercontent.com/BIKASHBHOWMIK15/Data-621/main/wine-training-data.csv', header=T) %>% as.tibble()
evaluate_df <- read.csv('https://raw.githubusercontent.com/BIKASHBHOWMIK15/Data-621/main/wine-evaluation-data.csv', header=T) %>% as.tibble()
train_df$INDEX <- NULL # Remove index column
evaluate_df$IN <- NULL
#evaluate_df$TARGET <- NULL
#glimpse(train_df)
head(train_df)

# A tibble: 6 × 15
  TARGET FixedAcidity VolatileAcidity CitricAcid ResidualSugar Chlorides
   <int>        <dbl>           <dbl>      <dbl>         <dbl>     <dbl>
1      3          3.2           1.16       -0.98          54.2    -0.567
2      3          4.5           0.16       -0.81          26.1    -0.425
3      5          7.1           2.64       -0.88          14.8     0.037
4      3          5.7           0.385       0.04          18.8    -0.425
5      4          8             0.33       -1.26           9.4    NA    
6      0         11.3           0.32        0.59           2.2     0.556
# ℹ 9 more variables: FreeSulfurDioxide <dbl>, TotalSulfurDioxide <dbl>,
#   Density <dbl>, pH <dbl>, Sulphates <dbl>, Alcohol <dbl>, LabelAppeal <int>,
#   AcidIndex <int>, STARS <int>

Code

#skim(train_df)

head(evaluate_df)

# A tibble: 6 × 15
  TARGET FixedAcidity VolatileAcidity CitricAcid ResidualSugar Chlorides
  <lgl>         <dbl>           <dbl>      <dbl>         <dbl>     <dbl>
1 NA              5.4          -0.86        0.27         -10.7     0.092
2 NA             12.4           0.385      -0.76         -19.7     1.17 
3 NA              7.2           1.75        0.17         -33       0.065
4 NA              6.2           0.1         1.8            1      -0.179
5 NA             11.4           0.21        0.28           1.2     0.038
6 NA             17.6           0.04       -1.15           1.4     0.535
# ℹ 9 more variables: FreeSulfurDioxide <dbl>, TotalSulfurDioxide <dbl>,
#   Density <dbl>, pH <dbl>, Sulphates <dbl>, Alcohol <dbl>, LabelAppeal <int>,
#   AcidIndex <int>, STARS <int>

Code

#str(evaluate_df)

The training dataset consists of 12,795 observations and 15 variables, including 14 predictors and one response variable (TARGET). The INDEX column serves only as an identifier and is removed prior to analysis. Most predictors represent chemical properties of wine, while LabelAppeal and STARS reflect subjective quality and presentation ratings.

The response variable, TARGET, represents the number of wine cases purchased and takes integer values between 0 and 8. A large proportion of observations contain zero values, indicating many instances with no purchases.

Code

library(vtable)


# st(train_df)
summary(train_df)

     TARGET       FixedAcidity     VolatileAcidity     CitricAcid     
 Min.   :0.000   Min.   :-18.100   Min.   :-2.7900   Min.   :-3.2400  
 1st Qu.:2.000   1st Qu.:  5.200   1st Qu.: 0.1300   1st Qu.: 0.0300  
 Median :3.000   Median :  6.900   Median : 0.2800   Median : 0.3100  
 Mean   :3.029   Mean   :  7.076   Mean   : 0.3241   Mean   : 0.3084  
 3rd Qu.:4.000   3rd Qu.:  9.500   3rd Qu.: 0.6400   3rd Qu.: 0.5800  
 Max.   :8.000   Max.   : 34.400   Max.   : 3.6800   Max.   : 3.8600  
                                                                      
 ResidualSugar        Chlorides        FreeSulfurDioxide TotalSulfurDioxide
 Min.   :-127.800   Min.   :-1.17100   Min.   :-555.00   Min.   :-823.0    
 1st Qu.:  -2.000   1st Qu.:-0.03100   1st Qu.:   0.00   1st Qu.:  27.0    
 Median :   3.900   Median : 0.04600   Median :  30.00   Median : 123.0    
 Mean   :   5.419   Mean   : 0.05482   Mean   :  30.85   Mean   : 120.7    
 3rd Qu.:  15.900   3rd Qu.: 0.15300   3rd Qu.:  70.00   3rd Qu.: 208.0    
 Max.   : 141.150   Max.   : 1.35100   Max.   : 623.00   Max.   :1057.0    
 NA's   :616        NA's   :638        NA's   :647       NA's   :682       
    Density             pH          Sulphates          Alcohol     
 Min.   :0.8881   Min.   :0.480   Min.   :-3.1300   Min.   :-4.70  
 1st Qu.:0.9877   1st Qu.:2.960   1st Qu.: 0.2800   1st Qu.: 9.00  
 Median :0.9945   Median :3.200   Median : 0.5000   Median :10.40  
 Mean   :0.9942   Mean   :3.208   Mean   : 0.5271   Mean   :10.49  
 3rd Qu.:1.0005   3rd Qu.:3.470   3rd Qu.: 0.8600   3rd Qu.:12.40  
 Max.   :1.0992   Max.   :6.130   Max.   : 4.2400   Max.   :26.50  
                  NA's   :395     NA's   :1210      NA's   :653    
  LabelAppeal          AcidIndex          STARS      
 Min.   :-2.000000   Min.   : 4.000   Min.   :1.000  
 1st Qu.:-1.000000   1st Qu.: 7.000   1st Qu.:1.000  
 Median : 0.000000   Median : 8.000   Median :2.000  
 Mean   :-0.009066   Mean   : 7.773   Mean   :2.042  
 3rd Qu.: 1.000000   3rd Qu.: 8.000   3rd Qu.:3.000  
 Max.   : 2.000000   Max.   :17.000   Max.   :4.000  
                                      NA's   :3359

Feature Analysis

Among the predictor variables, several contain missing values, particularly STARS, Sulphates, and sulfur-related measurements. The dataset includes both continuous and discrete variables, with LabelAppeal, AcidIndex, and STARS treated as categorical features.

Some chemical variables contain negative values, which are not physically meaningful. These values are likely the result of prior transformations such as normalization or scaling. While this inconsistency is noted, the values are retained (with adjustments applied later) due to limited information about the data generation process.

Target Variable Analysis

Code

library(ggplot2)

ggplot(train_df, aes(x = factor(TARGET), fill = factor(TARGET))) +
  
  geom_bar(
    color = "white",
    linewidth = 0.4,
    width = 0.75
  ) +
  
  geom_text(
    stat = "count",
    aes(label = after_stat(count)),
    vjust = -0.4,
    fontface = "bold",
    size = 4
  ) +
  
  scale_fill_brewer(palette = "RdPu") +
  
  labs(
    title = "Distribution of Wine Sales (TARGET)",
    subtitle = "Most observations contain low purchase counts",
    x = "Wine Cases Purchased",
    y = "Frequency",
    fill = "TARGET"
  ) +
  
  theme_minimal(base_size = 14) +
  
  theme(
    plot.title = element_text(face = "bold", size = 18),
    plot.subtitle = element_text(size = 12),
    legend.position = "none",
    panel.grid.minor = element_blank(),
    plot.title.position = "plot"
  )

The distribution of the response variable indicates that most observations correspond to low sales volumes, with a high frequency of zero values. For non-zero observations, the distribution appears approximately symmetric, suggesting moderate variability in purchase behavior.

Distribution Analysis

Most continuous variables exhibit approximately normal distributions, although some degree of skewness is present. Variables such as AcidIndex and STARS show noticeable right skewness, indicating that higher values occur less frequently.

Boxplot analysis reveals the presence of outliers across several variables, particularly in TotalSulfurDioxide, FreeSulfurDioxide, and ResidualSugar. These variables also display wider ranges compared to others, suggesting greater variability.

Code

# Histogram
train_df %>% 
  gather(key = 'key', value = 'value') %>%  # Include gather() to reshape the data
  ggplot(aes(value)) +
  facet_wrap(~key, scale = "free", ncol = 3) +
  geom_histogram(binwidth = function(x) 2 * IQR(x) / (length(x)^(1/3)), fill = "#FF69B4") +
  theme_minimal() +
  theme(panel.grid = element_blank()) +
  labs(title = "Histogram of Each Variable")

We observe that continuous variables exhibit a somewhat normal steep distribution. However, variables such as AcidIndex and STARS display right skewness.

Code

# Reshape the data using 'melt'
melted_df <- melt(train_df)
# Create the box plot
ggplot(melted_df, aes(x = factor(variable), y = value)) + 
  geom_boxplot() + 
  stat_summary(fun.y = mean, color = "green", geom = "point") +  
  stat_summary(fun.y = median, color = "red", geom = "point") +
  coord_flip() +
  theme_bw() +
  labs(title = "Box Plot of Each Variable")

In the box plot, Several variables exhibit noticeable outliers, particularly sulfur-related measurements and residual sugar, although the majority of predictors remain reasonably concentrated.

Categorical Feature Relationships

Exploration of categorical predictors shows meaningful relationships with the response variable. Higher values of STARS are generally associated with increased sales, indicating that perceived quality influences purchasing behavior. Similarly, LabelAppeal appears positively related to sales, suggesting that product presentation plays a role in consumer decisions.

The AcidIndex variable shows variation in sales across its levels, indicating that certain acidity levels may be more favorable in the market.

Code

#Create a bar chart for Discrete Variables 
long <- melt(train_df, id.vars= colnames(train_df)[1:12])%>% 
  mutate(target = as.factor(TARGET))

ggplot(data = long, aes(x = value)) + 
  geom_bar(aes(fill = target)) + 
  facet_wrap( ~ variable, scales = "free")

The bar charts compare the three discrete categorical variables to the TARGET variable. AcidIndex shows the larger quantity of wine were sold with the index number 7 and 8. LabelAppeal shows us generic label does yield higher number of wine samples per order. Higher STARS ratings appear associated with increased wine sales, indicating that customer-perceived quality influences purchasing behavior. For each of these predictors, there appears to be a significant relationship between the ordered levels and the number of wine cases sold.

Correlation and Multicollinearity Analysis

Correlation analysis reveals that STARS and LabelAppeal have the strongest positive relationships with the response variable. In contrast, AcidIndex shows a mild negative correlation with sales. Overall, while some correlations exist among predictors, no severe multicollinearity issues are observed.

Code

# Compute correlation matrix
corr_matrix <- cor(drop_na(train_df))

# Extract correlations with the target variable
target_correlation <- corr_matrix[, "TARGET"]

# Display correlation table
knitr::kable(target_correlation, "html", escape = FALSE) %>%
  kableExtra::kable_styling("striped", full_width = FALSE) %>%
  kableExtra::column_spec(1, bold = TRUE) %>%
  kableExtra::scroll_box(height = "500px")

	x
TARGET	1.0000000
FixedAcidity	-0.0125381
VolatileAcidity	-0.0759979
CitricAcid	0.0023450
ResidualSugar	0.0035196
Chlorides	-0.0304301
FreeSulfurDioxide	0.0226398
TotalSulfurDioxide	0.0216021
Density	-0.0475989
pH	0.0002199
Sulphates	-0.0212204
Alcohol	0.0737771
LabelAppeal	0.4979465
AcidIndex	-0.1676431
STARS	0.5546857

Code

corrgram(drop_na(train_df), order=TRUE,
         upper.panel=panel.cor, main="Correlation Plot")

In the correlation table, we can see that STARS and LabelAppeal are most positively correlated variables with the response variable. Also, we some mild negative correlation between the response variable and AcidIndex variable.

Data Preprocessing

Handling Negative Values

To address negative values in chemical variables, absolute transformations are applied to ensure all values are non-negative. For LabelAppeal, values are shifted by adding the minimum value to maintain relative differences while eliminating negative entries.

Code

train_df$FixedAcidity <- abs(train_df$FixedAcidity)
evaluate_df$FixedAcidity <- abs(evaluate_df$FixedAcidity)

train_df$VolatileAcidity <- abs(train_df$VolatileAcidity)
evaluate_df$VolatileAcidity <- abs(evaluate_df$VolatileAcidity)

train_df$CitricAcid <- abs(train_df$CitricAcid)
evaluate_df$CitricAcid <- abs(evaluate_df$CitricAcid)

train_df$ResidualSugar <- abs(train_df$ResidualSugar)
evaluate_df$ResidualSugar <- abs(evaluate_df$ResidualSugar)

train_df$Chlorides <- abs(train_df$Chlorides)
evaluate_df$Chlorides <- abs(evaluate_df$Chlorides)

train_df$FreeSulfurDioxide <- abs(train_df$FreeSulfurDioxide)
evaluate_df$FreeSulfurDioxide <- abs(evaluate_df$FreeSulfurDioxide)

train_df$TotalSulfurDioxide <- abs(train_df$TotalSulfurDioxide)
evaluate_df$TotalSulfurDioxide <- abs(evaluate_df$TotalSulfurDioxide)

train_df$Sulphates <- abs(train_df$Sulphates)
evaluate_df$Sulphates <- abs(evaluate_df$Sulphates)

train_df$LabelAppeal <- train_df$LabelAppeal + abs(min(train_df$LabelAppeal))
evaluate_df$LabelAppeal <- evaluate_df$LabelAppeal + abs(min(evaluate_df$LabelAppeal))

Missing Value Treatment

Several variables contain missing values, with STARS having the highest proportion. Missing values are assumed to be Missing at Random (MAR) and are imputed using the MICE (Multiple Imputation by Chained Equations) method.

Code

#Identify missing data by Feature and display percent breakout
plot_missing(train_df)

According to the graph, the data set has multiple variables with missing variables. The STARS variable has the most NA values. The Sulphates variable records missing values in roughly 10% of observations, while the remaining six predictors have missing values ranging from 3% to 5%. These missing values will be imputed later on during the data preparation using the MICE package and random forest prediction method.

Code

train_df$STARS[is.na(train_df$STARS)] <- 0
evaluate_df$STARS[is.na(evaluate_df$STARS)] <- 0
# Perform multiple imputation
mice_imputes <- mice(train_df, m = 2, maxit = 2, print = FALSE)
# Visualize the imputed values with density plot
densityplot(mice_imputes)

The imputed distributions closely resemble the observed distributions, suggesting the missing values are reasonably consistent with a Missing at Random (MAR) assumption. We’ll also run the mice imputation again on both the train and test set. Instead of using it for our models, however, we’ll simplify our run and fill in our data. Finally, after our analysis, we can use it in in our model, we’ll update STARS to become a factor variable.

Code

mice_train <-  mice(train_df, m = 1, maxit = 1, print = FALSE)
cleaned_train <- complete(mice_train)

mice_evaluate <- mice(evaluate_df, m = 1, maxit = 1, print = FALSE)
cleaned_evaluate <- complete(mice_evaluate)

cleaned_train$STARS <- as.factor(cleaned_train$STARS)
cleaned_evaluate$STARS <- as.factor(cleaned_evaluate$STARS)

Statistical Summary and Correlation Review

Following data cleaning and imputation, updated summaries and visualizations confirm that the dataset is well-structured and suitable for modeling. The distributions remain consistent with earlier observations, and correlation patterns are preserved.

Code

plot_histogram(cleaned_train, title = "Revised Histogram of Cleaned Training Data")

Code

# Create a summary table
summary_table <- descr(cleaned_train)

# Transpose the summary table
transposed_summary_table <- t(summary_table)

# Display transposed summary table as an HTML table
knitr::kable(transposed_summary_table, "html", escape = FALSE) %>%
  kableExtra::kable_styling("striped", full_width = FALSE) %>%
  kableExtra::column_spec(1, bold = TRUE) %>%
  kableExtra::scroll_box(height = "500px")

	Mean	Std.Dev	Min	Q1	Median	Q3	Max	MAD	IQR	CV	Skewness	SE.Skewness	Kurtosis	N.Valid	N	Pct.Valid
AcidIndex	7.7727237	1.3239264	4.00000	7.00000	8.00000	8.00000	17.00000	1.4826000	1.000000	0.1703298	1.6484959	0.0216523	5.1900925	12795	12795	100
Alcohol	10.4910283	3.7360616	-4.70000	9.00000	10.40000	12.40000	26.50000	2.3721600	3.400000	0.3561197	-0.0213019	0.0216523	1.5599371	12795	12795	100
Chlorides	0.2230572	0.2346388	0.00000	0.04600	0.09800	0.37000	1.35100	0.0993342	0.324000	1.0519219	1.4741856	0.0216523	2.1402838	12795	12795	100
CitricAcid	0.6863150	0.6060052	0.00000	0.28000	0.44000	0.97000	3.86000	0.3261720	0.690000	0.8829841	1.6428101	0.0216523	2.9474433	12795	12795	100
Density	0.9942027	0.0265376	0.88809	0.98772	0.99449	1.00052	1.09924	0.0093552	0.012795	0.0266924	-0.0186938	0.0216523	1.8999592	12795	12795	100
FixedAcidity	8.0632513	4.9961186	0.00000	5.60000	7.00000	9.80000	34.40000	2.9652000	4.200000	0.6196159	1.1742806	0.0216523	1.9666235	12795	12795	100
FreeSulfurDioxide	106.5399766	108.0190822	0.00000	28.00000	55.50000	171.00000	623.00000	60.0453000	143.000000	1.0138831	1.5317317	0.0216523	2.4618355	12795	12795	100
LabelAppeal	1.9909340	0.8910892	0.00000	1.00000	2.00000	3.00000	4.00000	1.4826000	2.000000	0.4475735	0.0084295	0.0216523	-0.2622916	12795	12795	100
pH	3.2084416	0.6808941	0.48000	2.96000	3.20000	3.47000	6.13000	0.3854760	0.510000	0.2122196	0.0400391	0.0216523	1.6567514	12795	12795	100
ResidualSugar	23.4225010	25.0154465	0.00000	3.60000	12.90000	38.80000	141.15000	16.3086000	35.200000	1.0680092	1.4675686	0.0216523	2.2193811	12795	12795	100
Sulphates	0.8495287	0.6585222	0.00000	0.43000	0.59000	1.10000	4.24000	0.3261720	0.670000	0.7751618	1.6873422	0.0216523	3.1681651	12795	12795	100
TARGET	3.0290739	1.9263682	0.00000	2.00000	3.00000	4.00000	8.00000	1.4826000	2.000000	0.6359595	-0.3263010	0.0216523	-0.8772457	12795	12795	100
TotalSulfurDioxide	204.5812036	163.7012614	0.00000	100.00000	154.00000	263.00000	1057.00000	102.2994000	163.000000	0.8001774	1.6193496	0.0216523	3.0739192	12795	12795	100
VolatileAcidity	0.6410856	0.5556141	0.00000	0.25000	0.41000	0.91000	3.68000	0.3261720	0.660000	0.8666770	1.6529782	0.0216523	3.0833040	12795	12795	100

Code

corrgram(drop_na(cleaned_train), order=TRUE,
         upper.panel=panel.cor, main="Revised Correlation")

Train-Test Data Split

The dataset is divided into training (80%) and testing (20%) subsets. Models are trained on the training data and evaluated on the testing data to ensure unbiased performance assessment.

Code

options(scipen = 999)

# Get training/test split
y_raw <- as.matrix(cleaned_train$TARGET)
trainingRows <- createDataPartition(y_raw, p=0.8, list=FALSE)

# Build training data sets
trainX <- cleaned_train[trainingRows,] %>% select(-TARGET)
trainY <- cleaned_train[trainingRows,] %>% select(TARGET)

# Build testing data set
testX <- cleaned_train[-trainingRows,] %>% select(-TARGET)
testY <- cleaned_train[-trainingRows,] %>% select(TARGET)

# Build a DF
trainingData <- as.data.frame(trainX)
trainingData$TARGET <- trainY$TARGET
print(paste('Number of Training Samples: ', dim(trainingData)[1]))

[1] "Number of Training Samples:  10238"

Code

testingData <- as.data.frame(testX)
testingData$TARGET <- testY$TARGET
print(paste('Number of Testing Samples: ', dim(testingData)[1]))

[1] "Number of Testing Samples:  2557"

Code

model_test_perf <- function(model, trainX, trainY, testX, testY) {
  # Evaluate Model 1 with testing data set
  predictedY <- predict(model, newdata=testX)

  model_results <- data.frame(obs = testY, pred=predictedY)
  colnames(model_results) = c('obs', 'pred')
  
  # This grabs RMSE, Rsquaredand MAE by default
  model_eval <- defaultSummary(model_results)
  
  # Add AIC score to the results
  if ('aic' %in% model) {
    model_eval[4] <- model$aic
  } else {
    model_eval[4] <- AIC(model)
  }
  
  names(model_eval)[4] <- 'aic'
 
  # Add BIC score to the results
  model_eval[5] <- BIC(model)
  names(model_eval)[5] <- 'bic'

   
  return(model_eval)
}

Assessing Overdispersion

Code

mean(trainingData$TARGET)

[1] 3.029303

Code

var(trainingData$TARGET)

[1] 3.706282

Code

target_mean <- mean(trainingData$TARGET)
target_var  <- var(trainingData$TARGET)

target_mean

[1] 3.029303

Code

target_var

[1] 3.706282

The Poisson regression model assumes that the mean and variance of the response variable are approximately equal. However, the TARGET variable shows evidence of overdispersion because the variance is substantially larger than the mean.

This indicates that the Poisson assumption may not hold for the wine sales data. As a result, the Negative Binomial regression model becomes more appropriate because it includes an additional dispersion parameter that can better account for variability in count-based outcomes.

Model Development

Multiple models are developed to predict wine sales, including Poisson regression, Negative Binomial regression, and Multiple Linear Regression. Both full models (using all predictors) and reduced models (using selected predictors) are evaluated.

Full Poisson Regression Model - Model 1

In this first model, we include all available features: FixedAcidity, VolatileAcidity, CitricAcid, ResidualSugar, Chlorides, FreeSulfurDioxide, TotalSulfurDioxide, Density, pH, Sulphates, Alcohol, LabelAppeal, AcidIndex, STARS

Code

options(scipen = 999)

Poisson_Model1 <- glm(TARGET ~ ., data = trainingData, family = poisson)

summary(Poisson_Model1)


Call:
glm(formula = TARGET ~ ., family = poisson, data = trainingData)

Coefficients:
                      Estimate  Std. Error z value             Pr(>|z|)    
(Intercept)         1.04308852  0.22032337   4.734            0.0000022 ***
FixedAcidity       -0.00023007  0.00115696  -0.199             0.842374    
VolatileAcidity    -0.03847804  0.01049733  -3.666             0.000247 ***
CitricAcid          0.00896377  0.00933341   0.960             0.336856    
ResidualSugar       0.00012082  0.00022904   0.528             0.597844    
Chlorides          -0.03878396  0.02443258  -1.587             0.112425    
FreeSulfurDioxide   0.00003613  0.00005221   0.692             0.488992    
TotalSulfurDioxide  0.00008102  0.00003471   2.334             0.019584 *  
Density            -0.51650865  0.21493616  -2.403             0.016258 *  
pH                 -0.01536190  0.00838698  -1.832             0.067006 .  
Sulphates          -0.02045060  0.00890241  -2.297             0.021608 *  
Alcohol             0.00305543  0.00153556   1.990             0.046615 *  
LabelAppeal         0.16123363  0.00682342  23.629 < 0.0000000000000002 ***
AcidIndex          -0.07887187  0.00510506 -15.450 < 0.0000000000000002 ***
STARS1              0.79008891  0.02193312  36.023 < 0.0000000000000002 ***
STARS2              1.10608943  0.02060489  53.681 < 0.0000000000000002 ***
STARS3              1.22291468  0.02165312  56.478 < 0.0000000000000002 ***
STARS4              1.33980996  0.02740911  48.882 < 0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 18256  on 10237  degrees of freedom
Residual deviance: 10850  on 10220  degrees of freedom
AIC: 36460

Number of Fisher Scoring iterations: 6

Code

# Evaluate Model 1 with testing data set
(Poission_Evaluate1 <- model_test_perf(Poisson_Model1, trainX, trainY, testX, testY))

         RMSE      Rsquared           MAE           aic           bic 
    2.5944814     0.5021363     2.2226635 36459.7403979 36589.9499061

This model identified several significant predictors of wine sales, particularly LabelAppeal, STARS, Alcohol, and AcidIndex. Higher STARS ratings and stronger label appeal were associated with increased wine purchases, while higher AcidIndex negatively impacted sales. The model achieved moderate predictive performance with an RMSE of 2.58 and an R-squared of 0.54. However, because the wine sales data showed overdispersion, the Poisson model may not fully capture the variability in the response variable, suggesting that a Negative Binomial model may provide a better fit.

Reduced Poisson Regression Model - Model 2

In this Model 2, we only include the most predictive features : VolatileAcidity, TotalSulfurDioxide, Alcohol, LabelAppeal, AcidIndex, STARS

Code

Poisson_Model2 <- glm(TARGET ~  VolatileAcidity + TotalSulfurDioxide + Alcohol + LabelAppeal + AcidIndex + STARS, data = trainingData, family = poisson)

summary(Poisson_Model2)


Call:
glm(formula = TARGET ~ VolatileAcidity + TotalSulfurDioxide + 
    Alcohol + LabelAppeal + AcidIndex + STARS, family = poisson, 
    data = trainingData)

Coefficients:
                      Estimate  Std. Error z value             Pr(>|z|)    
(Intercept)         0.46626756  0.05003945   9.318 < 0.0000000000000002 ***
VolatileAcidity    -0.03913821  0.01049292  -3.730             0.000192 ***
TotalSulfurDioxide  0.00008022  0.00003469   2.313             0.020746 *  
Alcohol             0.00313860  0.00153549   2.044             0.040950 *  
LabelAppeal         0.16110929  0.00681836  23.629 < 0.0000000000000002 ***
AcidIndex          -0.07914181  0.00503497 -15.718 < 0.0000000000000002 ***
STARS1              0.79185023  0.02192795  36.111 < 0.0000000000000002 ***
STARS2              1.10813011  0.02059574  53.804 < 0.0000000000000002 ***
STARS3              1.22548288  0.02164467  56.618 < 0.0000000000000002 ***
STARS4              1.34106185  0.02740438  48.936 < 0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 18256  on 10237  degrees of freedom
Residual deviance: 10869  on 10228  degrees of freedom
AIC: 36463

Number of Fisher Scoring iterations: 6

Code

# Evaluate Model 2 with testing data set
(Poission_Evaluate2 <- model_test_perf(Poisson_Model2, trainX, trainY, testX, testY))

         RMSE      Rsquared           MAE           aic           bic 
    2.5937565     0.5037147     2.2219422 36463.0377979 36535.3764136

This Model retained the most significant predictors of wine sales, including VolatileAcidity, TotalSulfurDioxide, Alcohol, LabelAppeal, AcidIndex, and STARS. The results show that higher STARS ratings, stronger label appeal, and higher alcohol content positively influenced wine purchases, while higher volatile acidity and AcidIndex negatively affected sales. The model achieved moderate predictive performance with an RMSE of 2.60 and an R-squared of 0.53. Compared to the full Poisson model, this reduced model provided a simpler and more interpretable structure while maintaining similar predictive accuracy.

Full Negative Binomial Model - Model 3

Similar to Poisson Model 1, the predictors for the following model are: FixedAcidity, VolatileAcidity, CitricAcid, ResidualSugar, Chlorides, FreeSulfurDioxide, TotalSulfurDioxide, Density, pH, Sulphates, Alcohol, LabelAppeal, AcidIndex, STARS

Code

nb3 <- glm.nb(TARGET ~ ., data = trainingData)
summary(nb3)


Call:
glm.nb(formula = TARGET ~ ., data = trainingData, init.theta = 40595.15616, 
    link = log)

Coefficients:
                      Estimate  Std. Error z value             Pr(>|z|)    
(Intercept)         1.04312321  0.22033334   4.734            0.0000022 ***
FixedAcidity       -0.00023007  0.00115701  -0.199             0.842383    
VolatileAcidity    -0.03847944  0.01049779  -3.665             0.000247 ***
CitricAcid          0.00896387  0.00933384   0.960             0.336873    
ResidualSugar       0.00012082  0.00022905   0.527             0.597868    
Chlorides          -0.03878473  0.02443368  -1.587             0.112434    
FreeSulfurDioxide   0.00003613  0.00005221   0.692             0.488983    
TotalSulfurDioxide  0.00008103  0.00003471   2.334             0.019583 *  
Density            -0.51651798  0.21494595  -2.403             0.016261 *  
pH                 -0.01536284  0.00838736  -1.832             0.067001 .  
Sulphates          -0.02045130  0.00890281  -2.297             0.021609 *  
Alcohol             0.00305532  0.00153563   1.990             0.046633 *  
LabelAppeal         0.16123278  0.00682373  23.628 < 0.0000000000000002 ***
AcidIndex          -0.07887427  0.00510525 -15.450 < 0.0000000000000002 ***
STARS1              0.79008785  0.02193358  36.022 < 0.0000000000000002 ***
STARS2              1.10608846  0.02060535  53.680 < 0.0000000000000002 ***
STARS3              1.22291446  0.02165370  56.476 < 0.0000000000000002 ***
STARS4              1.33981071  0.02741029  48.880 < 0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Negative Binomial(40595.16) family taken to be 1)

    Null deviance: 18255  on 10237  degrees of freedom
Residual deviance: 10850  on 10220  degrees of freedom
AIC: 36462

Number of Fisher Scoring iterations: 1

              Theta:  40595 
          Std. Err.:  37906 
Warning while fitting theta: iteration limit reached 

 2 x log-likelihood:  -36424.08

Code

# Evaluate Model 1 with testing data set
(Negative_Binomial_eval3 <- model_test_perf(nb3, trainX, trainY, testX, testY))

         RMSE      Rsquared           MAE           aic           bic 
    2.5944813     0.5021364     2.2226634 36462.0802450 36599.5236148

Reduced Negative Binomial Model - Model 4

Similar to Poisson Model 2, the predictors for the following model are: VolatileAcidity, FreeSulfurDioxide, TotalSulfurDioxide, Alcohol, LabelAppeal, AcidIndex, STARS

Code

nb4 <- glm.nb(TARGET~ VolatileAcidity + FreeSulfurDioxide + TotalSulfurDioxide + Alcohol + LabelAppeal + AcidIndex + STARS,  data=cleaned_train)

summary (nb4)


Call:
glm.nb(formula = TARGET ~ VolatileAcidity + FreeSulfurDioxide + 
    TotalSulfurDioxide + Alcohol + LabelAppeal + AcidIndex + 
    STARS, data = cleaned_train, init.theta = 40642.7858, link = log)

Coefficients:
                      Estimate  Std. Error z value             Pr(>|z|)    
(Intercept)         0.48523621  0.04494017  10.797 < 0.0000000000000002 ***
VolatileAcidity    -0.03715255  0.00939627  -3.954            0.0000769 ***
FreeSulfurDioxide   0.00004915  0.00004696   1.047               0.2952    
TotalSulfurDioxide  0.00007236  0.00003114   2.324               0.0201 *  
Alcohol             0.00331910  0.00137097   2.421               0.0155 *  
LabelAppeal         0.15919830  0.00612552  25.989 < 0.0000000000000002 ***
AcidIndex          -0.08030500  0.00450020 -17.845 < 0.0000000000000002 ***
STARS1              0.77111897  0.01952989  39.484 < 0.0000000000000002 ***
STARS2              1.09272796  0.01820168  60.034 < 0.0000000000000002 ***
STARS3              1.21356938  0.01917251  63.297 < 0.0000000000000002 ***
STARS4              1.32912621  0.02428009  54.741 < 0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Negative Binomial(40642.79) family taken to be 1)

    Null deviance: 22860  on 12794  degrees of freedom
Residual deviance: 13687  on 12784  degrees of freedom
AIC: 45654

Number of Fisher Scoring iterations: 1

              Theta:  40643 
          Std. Err.:  34173 
Warning while fitting theta: iteration limit reached 

 2 x log-likelihood:  -45630.33

Code

# Evaluate Model 1 with testing data set
(Negative_Binomial_eval4 <- model_test_perf(nb4, trainX, trainY, testX, testY))

         RMSE      Rsquared           MAE           aic           bic 
    2.5938778     0.5059605     2.2228947 45654.3265603 45743.8082773

This Model identified STARS, LabelAppeal, Alcohol, and TotalSulfurDioxide as significant positive predictors of wine sales, while VolatileAcidity and AcidIndex showed negative relationships with TARGET. The model achieved moderate predictive performance with an RMSE of 2.58 and an R-squared of 0.53. Compared to the Poisson models, the Negative Binomial model is more appropriate because it better handles overdispersion in the count-based wine sales data. Overall, Model 4 provided a good balance between predictive accuracy, interpretability, and suitability for the dataset.

Full Multiple Linear Regression Model - Model 5

The predictors for the following model are: FixedAcidity, VolatileAcidity, CitricAcid, ResidualSugar, Chlorides, FreeSulfurDioxide, TotalSulfurDioxide, Density, pH, Sulphates, Alcohol, LabelAppeal, AcidIndex, STARS

Code

lm5 <- lm(TARGET ~ ., data = trainingData)
summary(lm5)


Call:
lm(formula = TARGET ~ ., data = trainingData)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.7949 -0.8547  0.0212  0.8347  6.2159 

Coefficients:
                     Estimate Std. Error t value             Pr(>|t|)    
(Intercept)         3.5361645  0.4966537   7.120     0.00000000000115 ***
FixedAcidity        0.0006578  0.0026009   0.253              0.80034    
VolatileAcidity    -0.1141741  0.0231754  -4.927     0.00000085014550 ***
CitricAcid          0.0276580  0.0213057   1.298              0.19426    
ResidualSugar       0.0001831  0.0005188   0.353              0.72416    
Chlorides          -0.1099350  0.0548896  -2.003              0.04522 *  
FreeSulfurDioxide   0.0001146  0.0001185   0.967              0.33334    
TotalSulfurDioxide  0.0002564  0.0000789   3.250              0.00116 ** 
Density            -1.5453663  0.4868494  -3.174              0.00151 ** 
pH                 -0.0378888  0.0189369  -2.001              0.04544 *  
Sulphates          -0.0546506  0.0198514  -2.753              0.00592 ** 
Alcohol             0.0110505  0.0034709   3.184              0.00146 ** 
LabelAppeal         0.4717344  0.0151427  31.153 < 0.0000000000000002 ***
AcidIndex          -0.1990509  0.0101362 -19.638 < 0.0000000000000002 ***
STARS1              1.3990126  0.0365968  38.228 < 0.0000000000000002 ***
STARS2              2.4148219  0.0358443  67.370 < 0.0000000000000002 ***
STARS3              2.9795120  0.0413404  72.073 < 0.0000000000000002 ***
STARS4              3.6760023  0.0663358  55.415 < 0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.303 on 10220 degrees of freedom
Multiple R-squared:  0.5426,    Adjusted R-squared:  0.5418 
F-statistic: 713.1 on 17 and 10220 DF,  p-value: < 0.00000000000000022

Code

# Evaluate Model 1 with testing data set
(Linear_Regression_eval5 <- model_test_perf(lm5, trainX, trainY, testX, testY))

         RMSE      Rsquared           MAE           aic           bic 
    1.3294009     0.5263324     1.0402172 34495.7176418 34633.1610116

This Model demonstrated strong predictive performance with an RMSE of 1.30 and an R-squared of 0.55, outperforming the Poisson and Negative Binomial models. Variables such as STARS, LabelAppeal, and Alcohol showed significant positive relationships with wine sales, while VolatileAcidity and AcidIndex negatively affected TARGET. The overall model was statistically significant, indicating that the predictors collectively explained a substantial portion of the variation in wine sales. However, despite its strong performance, linear regression may not be ideal for count-based response data because it does not account for overdispersion or non-negative count constraints.

Stepwise Linear Regression Model - Model 6

For the final Linear Model, we leverage stepAIC on our Linear Model 5 to choose the most important features.

Code

lm6 <- stepAIC(lm5, direction = "both",
               scope = list(upper = lm5, lower = ~ 1),
               scale = 0, trace = FALSE)

summary(lm6)


Call:
lm(formula = TARGET ~ VolatileAcidity + Chlorides + TotalSulfurDioxide + 
    Density + pH + Sulphates + Alcohol + LabelAppeal + AcidIndex + 
    STARS, data = trainingData)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.8114 -0.8540  0.0237  0.8369  6.1862 

Coefficients:
                      Estimate  Std. Error t value             Pr(>|t|)    
(Intercept)         3.58063339  0.49568208   7.224    0.000000000000542 ***
VolatileAcidity    -0.11455042  0.02317083  -4.944    0.000000778574739 ***
Chlorides          -0.11032367  0.05488157  -2.010              0.04443 *  
TotalSulfurDioxide  0.00025795  0.00007888   3.270              0.00108 ** 
Density            -1.55512852  0.48664919  -3.196              0.00140 ** 
pH                 -0.03789903  0.01893498  -2.002              0.04536 *  
Sulphates          -0.05414683  0.01984585  -2.728              0.00638 ** 
Alcohol             0.01096383  0.00347023   3.159              0.00159 ** 
LabelAppeal         0.47230010  0.01513719  31.201 < 0.0000000000000002 ***
AcidIndex          -0.19839885  0.00997002 -19.900 < 0.0000000000000002 ***
STARS1              1.39950815  0.03659160  38.247 < 0.0000000000000002 ***
STARS2              2.41588226  0.03583075  67.425 < 0.0000000000000002 ***
STARS3              2.97972198  0.04133534  72.087 < 0.0000000000000002 ***
STARS4              3.67561772  0.06632590  55.418 < 0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.303 on 10224 degrees of freedom
Multiple R-squared:  0.5424,    Adjusted R-squared:  0.5419 
F-statistic: 932.4 on 13 and 10224 DF,  p-value: < 0.00000000000000022

Code

# Evaluate Model 1 with testing data set
(Linear_Regression_eval6 <- model_test_perf(lm6, trainX, trainY, testX, testY))

         RMSE      Rsquared           MAE           aic           bic 
    1.3298984     0.5259845     1.0402884 34490.5364790 34599.0444025

This Model selected the most important predictors using the stepAIC method and achieved strong predictive performance with an RMSE of 1.30 and an R-squared of 0.55. Variables such as STARS, LabelAppeal, and Alcohol showed significant positive effects on wine sales, while VolatileAcidity and AcidIndex negatively influenced TARGET. Compared to the full linear regression model, this reduced model maintained similar predictive accuracy while using fewer variables, making it more efficient and interpretable. Based on AIC and BIC values, Model 6 provided the best overall balance between model simplicity and predictive performance.

Model Evaluation and Selection

Model performance is evaluated using metrics such as RMSE, R-squared, MAE, AIC, and BIC. Among all models, the multiple linear regression models demonstrate superior predictive performance.

In particular, the stepwise-selected model (Model 6) achieves the best balance across evaluation metrics, with lower AIC and BIC values indicating improved model efficiency.

Code

models_summary <- rbind(Poission_Evaluate1, Poission_Evaluate2, Negative_Binomial_eval3, Negative_Binomial_eval4, Linear_Regression_eval5, Linear_Regression_eval6)

kable(
  round(models_summary,3),
  caption = "Model Performance Comparison"
) %>%
  kable_styling(
    bootstrap_options = c(
      "striped",
      "hover",
      "condensed",
      "responsive"
    ),
    full_width = FALSE,
    position = "center"
  )

Model Performance Comparison
	RMSE	Rsquared	MAE	aic	bic
Poission_Evaluate1	2.594	0.502	2.223	36459.74	36589.95
Poission_Evaluate2	2.594	0.504	2.222	36463.04	36535.38
Negative_Binomial_eval3	2.594	0.502	2.223	36462.08	36599.52
Negative_Binomial_eval4	2.594	0.506	2.223	45654.33	45743.81
Linear_Regression_eval5	1.329	0.526	1.040	34495.72	34633.16
Linear_Regression_eval6	1.330	0.526	1.040	34490.54	34599.04

This table summarizes the RMSE, RSQUARED, MAE, AIC and BIC for all SIX models. The Linear regressions (Linear Model 5 and Linear Model 6) had the overall best performance based on RMSE and RSQUARED; as well as Linear Model 6 had the best performance based on AIC and BIC.

Across all model types, STARS and LabelAppeal consistently showed positive coefficients, indicating that wines with higher ratings and stronger label appeal tend to generate higher sales volumes. AcidIndex generally showed a negative relationship with TARGET, suggesting that higher acidity may reduce customer demand. Alcohol content displayed a moderate positive effect in both Poisson and Negative Binomial models, which may indicate customer preference for wines with higher alcohol concentration. Some variables changed sign across different model types. For example, pH showed inconsistent effects between the Poisson and linear regression models. This may be due to multicollinearity or differing distributional assumptions among the models. Despite these inconsistencies, variables with unstable coefficients were retained if they improved overall predictive performance and model fit statistics.

Although the multiple linear regression models achieved slightly better RMSE and R-squared performance, linear regression is not ideal for count-based response variables because predictions may fall outside the valid range and assumptions of normality may be violated.

Since TARGET represents count data with substantial overdispersion and many zero values, the Negative Binomial regression model provides a more theoretically appropriate framework for deployment.

Therefore, Negative Binomial Model 4 was selected as the final deployment model because it balances predictive performance, interpretability, and suitability for count data.

Sales Prediction Results

The selected model is applied to the evaluation dataset to generate predictions. The resulting distribution of predicted values closely resembles that of the training data, suggesting that the model generalizes well.

Code

eval_data <- cleaned_evaluate %>% select(-TARGET)
predictions <- predictions <- predict(nb4, eval_data, type="response")

eval_data$TARGET <- predictions

hist(predictions)

Code

write.csv(eval_data, 'DATA621_HW5_G2-Predictions.csv', row.names=FALSE)

head(eval_data)

  FixedAcidity VolatileAcidity CitricAcid ResidualSugar Chlorides
1          5.4           0.860       0.27          10.7     0.092
2         12.4           0.385       0.76          19.7     1.169
3          7.2           1.750       0.17          33.0     0.065
4          6.2           0.100       1.80           1.0     0.179
5         11.4           0.210       0.28           1.2     0.038
6         17.6           0.040       1.15           1.4     0.535
  FreeSulfurDioxide TotalSulfurDioxide Density   pH Sulphates Alcohol
1                23                398 0.98527 5.02      0.64   12.30
2                37                 68 0.99048 3.37      1.09   16.00
3                 9                 76 1.04641 4.61      0.68    8.55
4               104                 89 0.98877 3.20      2.11   12.30
5                70                 53 1.02899 2.54      0.07    4.80
6               250                140 0.95028 3.06      0.02   11.40
  LabelAppeal AcidIndex STARS   TARGET
1           1         6     0 1.223122
2           2         6     2 4.306267
3           2         8     1 2.463577
4           1         8     1 2.274533
5           2        10     0 1.016116
6           3         8     4 5.519498

The histogram shows that our predictions have a similar shape to our training Target variable, the means and medians are almost identical, and the kurtosis values are close.

Final Conclusions

Several predictive modeling approaches were explored, including Poisson regression, Negative Binomial regression, and Multiple Linear Regression. Exploratory analysis revealed that the TARGET variable contains many zero observations and demonstrates characteristics of overdispersion, making count-based regression models more appropriate.

Although the multiple linear regression models achieved slightly stronger RMSE and R-squared performance, the Negative Binomial models provided a better theoretical fit for the count nature of the response variable. In particular, Negative Binomial Model 4 achieved a strong balance between predictive accuracy, interpretability, and model simplicity.

Key predictors such as STARS, Alcohol, and LabelAppeal consistently demonstrated positive relationships with wine sales, while AcidIndex showed a generally negative relationship with TARGET.

Overall, the Negative Binomial regression framework provided the most appropriate balance between predictive performance and theoretical suitability for count-based sales data. The final model successfully captured key drivers of wine purchasing behavior while addressing overdispersion present in the response variable. These findings demonstrate how statistical modeling can support data-driven decision-making in consumer sales forecasting.