Data-driven decision-making is essential for achieving effective business outcomes. In this analysis, we begin by exploring the dataset to identify potential issues such as outliers, missing values, encoding inconsistencies, and multicollinearity. After addressing these challenges through appropriate data preprocessing techniques, we develop and evaluate several predictive models for estimating sales.
The dataset is divided into training and evaluation subsets, allowing us to train models on one portion and validate their performance on unseen data. Multiple modeling approaches are compared, and the final model is selected based on its ability to balance predictive accuracy with model simplicity.
Code
library(summarytools)
library(MASS)
library(rpart.plot)
library(ggplot2)
library(ggfortify)
library(gridExtra)
library(forecast)
library(fpp2)
library(fma)
library(kableExtra)
library(e1071)
library(mlbench)
library(ggcorrplot)
library(DataExplorer)
library(timeDate)
library(caret)
library(GGally)
library(corrplot)
library(RColorBrewer)
library(tibble)
library(tidyr)
library(tidyverse)
library(dplyr)
library(reshape2)
library(mixtools)
library(tidymodels)
library(ggpmisc)
library(regclass)
library(skimr)
library(corrgram)
library(mice)

#' Print a side-by-side Histogram and QQPlot of Residuals
#'
#' @param model A model
#' @examples
#' residPlot(myModel)
#' @return null
#' @export
residPlot <- function(model) {
  # Make sure a model was passed
  if (is.null(model)) {
    return(invisible(NULL))
  }
  layout(matrix(c(1, 1, 2, 3), 2, 2, byrow = TRUE))
  plot(residuals(model))
  hist(model[["residuals"]], freq = FALSE, breaks = "fd", main = "Residual Histogram",
       xlab = "Residuals", col = "lightgreen")
  lines(density(model[["residuals"]], kernel = "ep"), col = "blue", lwd = 3)
  curve(dnorm(x, mean = mean(model[["residuals"]]), sd = sd(model[["residuals"]])),
        col = "red", lwd = 3, lty = "dotted", add = TRUE)
  qqnorm(model[["residuals"]], main = "Residual Q-Q plot")
  qqline(model[["residuals"]], col = "red", lwd = 3, lty = "dotted")
  par(mfrow = c(1, 1))
}

#' Print a Variable Importance Plot for the provided model
#'
#' @param model The model
#' @param chart_title The Title to show on the plot
#' @examples
#' variableImportancePlot(myLinearModel, 'My Title')
#' @return null
#' @export
variableImportancePlot <- function(model = NULL, chart_title = 'Variable Importance Plot') {
  # Make sure a model was passed
  if (is.null(model)) {
    return(invisible(NULL))
  }
  # use caret and ggplot to print a variable importance plot
  varImp(model) %>%
    as.data.frame() %>%
    ggplot(aes(x = reorder(rownames(.), desc(Overall)), y = Overall)) +
    geom_col(aes(fill = Overall)) +
    theme(panel.background = element_blank(),
          panel.grid = element_blank(),
          axis.text.x = element_text(angle = 90)) +
    scale_fill_gradient() +
    labs(title = chart_title,
         x = "Parameter",
         y = "Relative Importance")
}

#' Print a Facet Chart of histograms
#'
#' @param df Dataset
#' @param box Facet layout as c(rows, columns)
#' @examples
#' histbox(my_df, c(3, 3))
#' @return null
#' @export
histbox <- function(df, box) {
  par(mfrow = box)
  ndf <- dimnames(df)[[2]]
  for (i in seq_along(ndf)) {
    data <- na.omit(unlist(df[, i]))
    hist(data, breaks = "fd", main = paste("Histogram of", ndf[i]),
         xlab = ndf[i], freq = FALSE)
    lines(density(data, kernel = "ep"), col = 'red')
  }
  par(mfrow = c(1, 1))
}

#' Extract key performance results from a model
#'
#' @param model A linear model of interest
#' @examples
#' model_performance_extraction(my_model)
#' @return data.frame
#' @export
model_performance_extraction <- function(model = NULL) {
  # Make sure a model was passed
  if (is.null(model)) {
    return(invisible(NULL))
  }
  # sigma, adj.r.squared and fstatistic live on the summary of an lm object
  model_summary <- summary(model)
  data.frame("RSE" = model_summary$sigma,
             "Adj R2" = model_summary$adj.r.squared,
             "F-Statistic" = model_summary$fstatistic[1])
}
# A tibble: 6 × 15
TARGET FixedAcidity VolatileAcidity CitricAcid ResidualSugar Chlorides
<lgl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 NA 5.4 -0.86 0.27 -10.7 0.092
2 NA 12.4 0.385 -0.76 -19.7 1.17
3 NA 7.2 1.75 0.17 -33 0.065
4 NA 6.2 0.1 1.8 1 -0.179
5 NA 11.4 0.21 0.28 1.2 0.038
6 NA 17.6 0.04 -1.15 1.4 0.535
# ℹ 9 more variables: FreeSulfurDioxide <dbl>, TotalSulfurDioxide <dbl>,
# Density <dbl>, pH <dbl>, Sulphates <dbl>, Alcohol <dbl>, LabelAppeal <int>,
# AcidIndex <int>, STARS <int>
Code
#str(evaluate_df)
The training dataset consists of 12,795 observations and 15 variables, including 14 predictors and one response variable (TARGET). The INDEX column serves only as an identifier and is removed prior to analysis. Most predictors represent chemical properties of wine, while LabelAppeal and STARS reflect subjective quality and presentation ratings.
The response variable, TARGET, represents the number of wine cases purchased and takes integer values between 0 and 8. A large proportion of observations contain zero values, indicating many instances with no purchases.
Code
library(vtable)
# st(train_df)
summary(train_df)
TARGET FixedAcidity VolatileAcidity CitricAcid
Min. :0.000 Min. :-18.100 Min. :-2.7900 Min. :-3.2400
1st Qu.:2.000 1st Qu.: 5.200 1st Qu.: 0.1300 1st Qu.: 0.0300
Median :3.000 Median : 6.900 Median : 0.2800 Median : 0.3100
Mean :3.029 Mean : 7.076 Mean : 0.3241 Mean : 0.3084
3rd Qu.:4.000 3rd Qu.: 9.500 3rd Qu.: 0.6400 3rd Qu.: 0.5800
Max. :8.000 Max. : 34.400 Max. : 3.6800 Max. : 3.8600
ResidualSugar Chlorides FreeSulfurDioxide TotalSulfurDioxide
Min. :-127.800 Min. :-1.17100 Min. :-555.00 Min. :-823.0
1st Qu.: -2.000 1st Qu.:-0.03100 1st Qu.: 0.00 1st Qu.: 27.0
Median : 3.900 Median : 0.04600 Median : 30.00 Median : 123.0
Mean : 5.419 Mean : 0.05482 Mean : 30.85 Mean : 120.7
3rd Qu.: 15.900 3rd Qu.: 0.15300 3rd Qu.: 70.00 3rd Qu.: 208.0
Max. : 141.150 Max. : 1.35100 Max. : 623.00 Max. :1057.0
NA's :616 NA's :638 NA's :647 NA's :682
Density pH Sulphates Alcohol
Min. :0.8881 Min. :0.480 Min. :-3.1300 Min. :-4.70
1st Qu.:0.9877 1st Qu.:2.960 1st Qu.: 0.2800 1st Qu.: 9.00
Median :0.9945 Median :3.200 Median : 0.5000 Median :10.40
Mean :0.9942 Mean :3.208 Mean : 0.5271 Mean :10.49
3rd Qu.:1.0005 3rd Qu.:3.470 3rd Qu.: 0.8600 3rd Qu.:12.40
Max. :1.0992 Max. :6.130 Max. : 4.2400 Max. :26.50
NA's :395 NA's :1210 NA's :653
LabelAppeal AcidIndex STARS
Min. :-2.000000 Min. : 4.000 Min. :1.000
1st Qu.:-1.000000 1st Qu.: 7.000 1st Qu.:1.000
Median : 0.000000 Median : 8.000 Median :2.000
Mean :-0.009066 Mean : 7.773 Mean :2.042
3rd Qu.: 1.000000 3rd Qu.: 8.000 3rd Qu.:3.000
Max. : 2.000000 Max. :17.000 Max. :4.000
NA's :3359
Feature Analysis
Among the predictor variables, several contain missing values, particularly STARS, Sulphates, and sulfur-related measurements. The dataset includes both continuous and discrete variables, with LabelAppeal, AcidIndex, and STARS treated as categorical features.
Some chemical variables contain negative values, which are not physically meaningful. These values are likely the result of prior transformations such as normalization or scaling. While this inconsistency is noted, the values are retained (with adjustments applied later) due to limited information about the data generation process.
The distribution of the response variable indicates that most observations correspond to low sales volumes, with a high frequency of zero values. For non-zero observations, the distribution appears approximately symmetric, suggesting moderate variability in purchase behavior.
Distribution Analysis
Most continuous variables exhibit approximately normal distributions, although some degree of skewness is present. Variables such as AcidIndex and STARS show noticeable right skewness, indicating that higher values occur less frequently.
Boxplot analysis reveals the presence of outliers across several variables, particularly in TotalSulfurDioxide, FreeSulfurDioxide, and ResidualSugar. These variables also display wider ranges compared to others, suggesting greater variability.
Code
# Histogram
train_df %>%
  gather(key = 'key', value = 'value') %>%  # Include gather() to reshape the data
  ggplot(aes(value)) +
  facet_wrap(~key, scales = "free", ncol = 3) +
  geom_histogram(binwidth = function(x) 2 * IQR(x) / (length(x)^(1/3)), fill = "#FF69B4") +
  theme_minimal() +
  theme(panel.grid = element_blank()) +
  labs(title = "Histogram of Each Variable")
The histograms show that the continuous variables follow roughly bell-shaped but sharply peaked distributions. In contrast, AcidIndex and STARS display right skewness.
Code
# Reshape the data using 'melt'
melted_df <- melt(train_df)

# Create the box plot (stat_summary uses `fun`; `fun.y` is deprecated in current ggplot2)
ggplot(melted_df, aes(x = factor(variable), y = value)) +
  geom_boxplot() +
  stat_summary(fun = mean, color = "green", geom = "point") +
  stat_summary(fun = median, color = "red", geom = "point") +
  coord_flip() +
  theme_bw() +
  labs(title = "Box Plot of Each Variable")
In the box plot, several variables exhibit noticeable outliers, particularly the sulfur-related measurements and residual sugar, although the majority of predictors remain reasonably concentrated.
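The outliers drawn by a box plot can also be counted directly. The following is a minimal sketch on simulated data; the vector `x` and its values are hypothetical stand-ins for a wide-ranging column such as TotalSulfurDioxide, not the wine data itself.

```r
set.seed(1)
# Hypothetical column: mostly well-behaved values plus two extreme ones
x <- c(rnorm(200, mean = 120, sd = 30), 900, -700)

# boxplot.stats() applies the same whisker rule as boxplot(): points lying
# more than 1.5 * IQR beyond the hinges are returned in $out
out <- boxplot.stats(x)$out

length(out)  # number of flagged points
range(out)   # how extreme the flagged points are
```

Applied column by column, a count like this gives a quick numeric companion to the plot.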
Categorical Feature Relationships
Exploration of categorical predictors shows meaningful relationships with the response variable. Higher values of STARS are generally associated with increased sales, indicating that perceived quality influences purchasing behavior. Similarly, LabelAppeal appears positively related to sales, suggesting that product presentation plays a role in consumer decisions.
The AcidIndex variable shows variation in sales across its levels, indicating that certain acidity levels may be more favorable in the market.
Code
# Create a bar chart for Discrete Variables
long <- melt(train_df, id.vars = colnames(train_df)[1:12]) %>%
  mutate(target = as.factor(TARGET))

ggplot(data = long, aes(x = value)) +
  geom_bar(aes(fill = target)) +
  facet_wrap(~ variable, scales = "free")
The bar charts compare the three discrete variables with the TARGET variable. AcidIndex shows that the largest quantities of wine were sold at index values 7 and 8. LabelAppeal shows that wines with a generic label do yield a higher number of wine samples per order. Higher STARS ratings appear to be associated with increased wine sales, indicating that customer-perceived quality influences purchasing behavior. For each of these predictors, there appears to be a clear relationship between the ordered levels and the number of wine cases sold.
Correlation and Multicollinearity Analysis
Correlation analysis reveals that STARS and LabelAppeal have the strongest positive relationships with the response variable. In contrast, AcidIndex shows a mild negative correlation with sales. Overall, while some correlations exist among predictors, no severe multicollinearity issues are observed.
In the correlation table, STARS and LabelAppeal are the variables most positively correlated with the response variable. We also see a mild negative correlation between the response variable and AcidIndex.
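A correlation table and a basic multicollinearity check can be produced with base R alone. The sketch below uses simulated data (`sim_df`, `x1` to `x3`, and `y` are hypothetical names) and relies on the fact that the variance inflation factor of each predictor is the matching diagonal entry of the inverse predictor correlation matrix.

```r
set.seed(42)
n  <- 500
x1 <- rnorm(n)
x2 <- 0.3 * x1 + rnorm(n)   # mildly correlated with x1
x3 <- rnorm(n)
y  <- 1 + 0.8 * x1 - 0.4 * x3 + rnorm(n)
sim_df <- data.frame(y, x1, x2, x3)

# Pairwise correlations; "pairwise.complete.obs" tolerates missing values
cor_mat <- cor(sim_df, use = "pairwise.complete.obs")
round(cor_mat["y", ], 2)    # correlation of each column with the response

# VIF per predictor: diagonal of the inverse predictor correlation matrix
vif <- diag(solve(cor_mat[-1, -1]))
round(vif, 2)
```

VIF values close to 1 indicate little shared variance among predictors; commonly quoted warning thresholds are around 5 or 10.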
Data Preprocessing
Handling Negative Values
To address negative values in the chemical variables, an absolute-value transformation is applied so that all values are non-negative. For LabelAppeal, values are shifted by subtracting the minimum value (equivalent to adding 2, since the minimum is -2), which preserves relative differences while eliminating negative entries.
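A minimal sketch of these two adjustments, assuming a small made-up data frame (`toy_df` and its values are illustrative and not taken from the wine data):

```r
# Hypothetical toy data standing in for a few wine columns
toy_df <- data.frame(
  FixedAcidity = c(5.4, -12.4, 7.2),
  Chlorides    = c(0.092, -0.179, 0.038),
  LabelAppeal  = c(-2L, 0L, 2L)
)

# Chemical measurements: replace each value with its absolute value
chem_cols <- c("FixedAcidity", "Chlorides")
toy_df[chem_cols] <- lapply(toy_df[chem_cols], abs)

# LabelAppeal: shift so the smallest value becomes 0, keeping the spacing
toy_df$LabelAppeal <- toy_df$LabelAppeal - min(toy_df$LabelAppeal)

print(toy_df)
```

Whatever adjustment is chosen should be applied identically to the training and evaluation sets, so that both are on the same scale.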
Several variables contain missing values, with STARS having the highest proportion. Missing STARS ratings are recoded as 0, which treats an absent rating as its own category. The remaining missing values are assumed to be Missing at Random (MAR) and are imputed using the MICE (Multiple Imputation by Chained Equations) method.
Code
# Identify missing data by Feature and display percent breakout
plot_missing(train_df)
According to the graph, multiple variables in the data set have missing values. The STARS variable has the most NA values. The Sulphates variable has missing values in roughly 10% of observations, while the remaining six affected predictors have missing rates of roughly 3% to 5%. These missing values are imputed later, during data preparation, using the MICE package.
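The percentages can be cross-checked against the NA counts in the summary output. The sketch below assumes that the three counts in the Density / pH / Sulphates / Alcohol row belong to pH, Sulphates, and Alcohol in that order (the alignment is lost in the printed output); `toy_df` is a hypothetical example showing the same calculation on a data frame.

```r
n_obs <- 12795  # number of training observations

# NA counts read from the summary(train_df) output (column mapping assumed)
na_counts <- c(STARS = 3359, Sulphates = 1210, TotalSulfurDioxide = 682,
               Alcohol = 653, FreeSulfurDioxide = 647, Chlorides = 638,
               ResidualSugar = 616, pH = 395)

pct_missing <- 100 * na_counts / n_obs
round(pct_missing, 1)

# The same calculation directly on a data frame (hypothetical toy data)
toy_df <- data.frame(a = c(1, NA, 3, NA), b = c(1, 2, 3, 4))
round(100 * colMeans(is.na(toy_df)), 1)
```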
Code
train_df$STARS[is.na(train_df$STARS)] <- 0
evaluate_df$STARS[is.na(evaluate_df$STARS)] <- 0

# Perform multiple imputation
mice_imputes <- mice(train_df, m = 2, maxit = 2, print = FALSE)

# Visualize the imputed values with density plot
densityplot(mice_imputes)
The imputed distributions closely resemble the observed distributions, which is consistent with, although it does not prove, a Missing at Random (MAR) assumption. We then run the mice imputation again on both the training and evaluation sets. Rather than pooling several imputed data sets in our models, we simplify the workflow and generate a single completed data set for each. Finally, we convert STARS to a factor variable so that the models treat it as categorical.
Code
mice_train <- mice(train_df, m = 1, maxit = 1, print = FALSE)
cleaned_train <- complete(mice_train)

mice_evaluate <- mice(evaluate_df, m = 1, maxit = 1, print = FALSE)
cleaned_evaluate <- complete(mice_evaluate)

cleaned_train$STARS <- as.factor(cleaned_train$STARS)
cleaned_evaluate$STARS <- as.factor(cleaned_evaluate$STARS)
Statistical Summary and Correlation Review
Following data cleaning and imputation, updated summaries and visualizations confirm that the dataset is well-structured and suitable for modeling. The distributions remain consistent with earlier observations, and correlation patterns are preserved.
Code
plot_histogram(cleaned_train, title ="Revised Histogram of Cleaned Training Data")
Code
# Create a summary table
summary_table <- descr(cleaned_train)

# Transpose the summary table
transposed_summary_table <- t(summary_table)

# Display transposed summary table as an HTML table
knitr::kable(transposed_summary_table, "html", escape = FALSE) %>%
  kableExtra::kable_styling("striped", full_width = FALSE) %>%
  kableExtra::column_spec(1, bold = TRUE) %>%
  kableExtra::scroll_box(height = "500px")
The dataset is divided into training (80%) and testing (20%) subsets. Models are trained on the training data and evaluated on the testing data to ensure unbiased performance assessment.
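The idea behind an outcome-stratified 80/20 split can be sketched in base R. This is similar in spirit to, but not identical to, what caret's createDataPartition does; `sim_df` is simulated, and the fixed seed is an addition made here for reproducibility.

```r
set.seed(123)  # fixed seed so the split can be reproduced
sim_df <- data.frame(
  TARGET = rpois(1000, lambda = 3),  # hypothetical count outcome
  x      = rnorm(1000)
)

# Sample 80% of the row indices within each TARGET value
idx_by_level <- split(seq_len(nrow(sim_df)), sim_df$TARGET)
train_idx <- unlist(lapply(idx_by_level, function(idx) {
  idx[sample.int(length(idx), size = round(0.8 * length(idx)))]
}), use.names = FALSE)

train_part <- sim_df[train_idx, ]
test_part  <- sim_df[-train_idx, ]

c(train = nrow(train_part), test = nrow(test_part))
```

Sampling within each outcome level keeps the distribution of TARGET similar in both subsets, which matters when some values (such as zero) are much more common than others.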
Code
options(scipen = 999)

# Get training/test split
y_raw <- as.matrix(cleaned_train$TARGET)
trainingRows <- createDataPartition(y_raw, p = 0.8, list = FALSE)

# Build training data sets
trainX <- cleaned_train[trainingRows, ] %>% select(-TARGET)
trainY <- cleaned_train[trainingRows, ] %>% select(TARGET)

# Build testing data set
testX <- cleaned_train[-trainingRows, ] %>% select(-TARGET)
testY <- cleaned_train[-trainingRows, ] %>% select(TARGET)

# Build a DF
trainingData <- as.data.frame(trainX)
trainingData$TARGET <- trainY$TARGET
print(paste('Number of Training Samples: ', dim(trainingData)[1]))
[1] "Number of Training Samples: 10238"
Code
testingData <- as.data.frame(testX)
testingData$TARGET <- testY$TARGET
print(paste('Number of Testing Samples: ', dim(testingData)[1]))
[1] "Number of Testing Samples: 2557"
Code
model_test_perf <- function(model, trainX, trainY, testX, testY) {
  # Evaluate the model with the testing data set.
  # Note: for glm and glm.nb objects, predict() returns values on the link (log)
  # scale unless type = "response" is supplied.
  predictedY <- predict(model, newdata = testX)
  model_results <- data.frame(obs = testY, pred = predictedY)
  colnames(model_results) <- c('obs', 'pred')

  # This grabs RMSE, Rsquared and MAE by default
  model_eval <- defaultSummary(model_results)

  # Add AIC score to the results (check the component names, not the values)
  if ('aic' %in% names(model)) {
    model_eval[4] <- model$aic
  } else {
    model_eval[4] <- AIC(model)
  }
  names(model_eval)[4] <- 'aic'

  # Add BIC score to the results
  model_eval[5] <- BIC(model)
  names(model_eval)[5] <- 'bic'

  return(model_eval)
}
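When error metrics are computed for count models, the scale of the predictions matters. The sketch below, on simulated data with hypothetical names (`sim_df`, `fit`, `rmse`), shows the difference between the default link-scale output of predict() for a Poisson glm and the count-scale output obtained with type = "response".

```r
set.seed(7)
n <- 400
sim_df <- data.frame(x = rnorm(n))
sim_df$y <- rpois(n, lambda = exp(1 + 0.3 * sim_df$x))  # hypothetical counts

fit <- glm(y ~ x, data = sim_df, family = poisson)

pred_link <- predict(fit, newdata = sim_df)                     # log scale (default)
pred_resp <- predict(fit, newdata = sim_df, type = "response")  # count scale

rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))
c(link = rmse(sim_df$y, pred_link), response = rmse(sim_df$y, pred_resp))
```

The two prediction vectors are related by exp(), so an RMSE or MAE computed from link-scale predictions measures the distance between observed counts and log counts, and is not comparable with the error of a linear model fitted on the original scale.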
The Poisson regression model assumes that the mean and variance of the response variable are approximately equal. However, the TARGET variable shows evidence of overdispersion because the variance is substantially larger than the mean.
This indicates that the Poisson assumption may not hold for the wine sales data. As a result, the Negative Binomial regression model becomes more appropriate because it includes an additional dispersion parameter that can better account for variability in count-based outcomes.
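Overdispersion can be checked both informally and from a fitted model. The following sketch uses simulated negative binomial counts (`sim_df`, `pois_fit`, and `nb_fit` are hypothetical names). Note that a raw variance above the mean is only suggestive, because the Poisson assumption concerns the variance conditional on the predictors; the Pearson dispersion statistic addresses that more directly.

```r
library(MASS)  # glm.nb()

set.seed(11)
n <- 500
sim_df <- data.frame(x = rnorm(n))
# Hypothetical overdispersed counts: negative binomial with size (theta) = 1.5
sim_df$y <- rnbinom(n, mu = exp(1 + 0.4 * sim_df$x), size = 1.5)

# Raw check: compare the sample mean and variance of the response
c(mean = mean(sim_df$y), variance = var(sim_df$y))

# Model-based check: Pearson chi-square divided by residual degrees of freedom
pois_fit <- glm(y ~ x, data = sim_df, family = poisson)
dispersion <- sum(residuals(pois_fit, type = "pearson")^2) / df.residual(pois_fit)
dispersion  # values well above 1 point to overdispersion

# The Negative Binomial fit estimates the extra dispersion parameter theta
nb_fit <- glm.nb(y ~ x, data = sim_df)
nb_fit$theta
AIC(pois_fit, nb_fit)
```

If the dispersion statistic is close to 1 and the two AIC values are nearly equal, the Negative Binomial model adds little over Poisson for that data set.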
Model Development
Multiple models are developed to predict wine sales, including Poisson regression, Negative Binomial regression, and Multiple Linear Regression. Both full models (using all predictors) and reduced models (using selected predictors) are evaluated.
Full Poisson Regression Model - Model 1
In this first model, we include all available features: FixedAcidity, VolatileAcidity, CitricAcid, ResidualSugar, Chlorides, FreeSulfurDioxide, TotalSulfurDioxide, Density, pH, Sulphates, Alcohol, LabelAppeal, AcidIndex, STARS
Code
options(scipen = 999)
Poisson_Model1 <- glm(TARGET ~ ., data = trainingData, family = poisson)
summary(Poisson_Model1)

# Evaluate Model 1 with testing data set
(Poission_Evaluate1 <- model_test_perf(Poisson_Model1, trainX, trainY, testX, testY))
RMSE Rsquared MAE aic bic
2.5944814 0.5021363 2.2226635 36459.7403979 36589.9499061
This model identified several significant predictors of wine sales, particularly LabelAppeal, STARS, Alcohol, and AcidIndex. Higher STARS ratings and stronger label appeal were associated with increased wine purchases, while higher AcidIndex negatively impacted sales. The model achieved moderate predictive performance with an RMSE of 2.59 and an R-squared of 0.50. However, because the wine sales data showed overdispersion, the Poisson model may not fully capture the variability in the response variable, suggesting that a Negative Binomial model may provide a better fit.
Reduced Poisson Regression Model - Model 2
In Model 2, we include only the most predictive features: VolatileAcidity, TotalSulfurDioxide, Alcohol, LabelAppeal, AcidIndex, STARS
Code
Poisson_Model2 <- glm(TARGET ~ VolatileAcidity + TotalSulfurDioxide + Alcohol +
                        LabelAppeal + AcidIndex + STARS,
                      data = trainingData, family = poisson)
summary(Poisson_Model2)
Call:
glm(formula = TARGET ~ VolatileAcidity + TotalSulfurDioxide +
Alcohol + LabelAppeal + AcidIndex + STARS, family = poisson,
data = trainingData)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.46626756 0.05003945 9.318 < 0.0000000000000002 ***
VolatileAcidity -0.03913821 0.01049292 -3.730 0.000192 ***
TotalSulfurDioxide 0.00008022 0.00003469 2.313 0.020746 *
Alcohol 0.00313860 0.00153549 2.044 0.040950 *
LabelAppeal 0.16110929 0.00681836 23.629 < 0.0000000000000002 ***
AcidIndex -0.07914181 0.00503497 -15.718 < 0.0000000000000002 ***
STARS1 0.79185023 0.02192795 36.111 < 0.0000000000000002 ***
STARS2 1.10813011 0.02059574 53.804 < 0.0000000000000002 ***
STARS3 1.22548288 0.02164467 56.618 < 0.0000000000000002 ***
STARS4 1.34106185 0.02740438 48.936 < 0.0000000000000002 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 18256 on 10237 degrees of freedom
Residual deviance: 10869 on 10228 degrees of freedom
AIC: 36463
Number of Fisher Scoring iterations: 6
Code
# Evaluate Model 2 with testing data set
(Poission_Evaluate2 <- model_test_perf(Poisson_Model2, trainX, trainY, testX, testY))
RMSE Rsquared MAE aic bic
2.5937565 0.5037147 2.2219422 36463.0377979 36535.3764136
This model retained the most significant predictors of wine sales, including VolatileAcidity, TotalSulfurDioxide, Alcohol, LabelAppeal, AcidIndex, and STARS. The results show that higher STARS ratings, stronger label appeal, and higher alcohol content positively influenced wine purchases, while higher volatile acidity and AcidIndex negatively affected sales. The model achieved moderate predictive performance with an RMSE of 2.59 and an R-squared of 0.50. Compared to the full Poisson model, this reduced model provided a simpler and more interpretable structure while maintaining similar predictive accuracy.
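Because Poisson regression uses a log link, each coefficient is easier to read after exponentiation. The sketch below takes a few estimates from the Model 2 summary output above and converts them to multiplicative effects; the object names `coefs` and `irr` are illustrative.

```r
# Coefficients copied from the Model 2 summary output
coefs <- c(
  VolatileAcidity = -0.03913821,
  Alcohol         =  0.00313860,
  LabelAppeal     =  0.16110929,
  AcidIndex       = -0.07914181,
  STARS4          =  1.34106185
)

# exp(beta) is the multiplicative change in expected cases for a one-unit
# increase in the predictor (or, for STARS4, relative to the STARS = 0 level)
irr <- exp(coefs)
round(irr, 3)

# The same information expressed as a percentage change
round(100 * (irr - 1), 1)
```

Read this way, a one-unit increase in LabelAppeal multiplies the expected number of cases by roughly 1.17, while a one-unit increase in AcidIndex multiplies it by roughly 0.92, holding the other predictors fixed.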
Full Negative Binomial Model - Model 3
Similar to Poisson Model 1, the predictors for the following model are: FixedAcidity, VolatileAcidity, CitricAcid, ResidualSugar, Chlorides, FreeSulfurDioxide, TotalSulfurDioxide, Density, pH, Sulphates, Alcohol, LabelAppeal, AcidIndex, STARS
Code
nb3 <- glm.nb(TARGET ~ ., data = trainingData)
summary(nb3)

# Evaluate Model 3 with testing data set
(Negative_Binomial_eval3 <- model_test_perf(nb3, trainX, trainY, testX, testY))
RMSE Rsquared MAE aic bic
2.5944813 0.5021364 2.2226634 36462.0802450 36599.5236148
This model includes every predictor, so unlike Model 2 it is not a reduced specification. Its test-set metrics are essentially identical to those of the full Poisson model: an RMSE of 2.59, an R-squared of 0.50, and an MAE of 2.22. Its AIC (36,462.1) and BIC (36,599.5) are slightly higher than those of Poisson Model 1 (36,459.7 and 36,589.9), so on these data the additional dispersion parameter did not improve the fit statistics or the predictive accuracy.
Reduced Negative Binomial Model - Model 4
Similar to Poisson Model 2, the predictors for the following model are: VolatileAcidity, FreeSulfurDioxide, TotalSulfurDioxide, Alcohol, LabelAppeal, AcidIndex, STARS
# Evaluate Model 4 with testing data set
(Negative_Binomial_eval4 <- model_test_perf(nb4, trainX, trainY, testX, testY))
RMSE Rsquared MAE aic bic
2.5938778 0.5059605 2.2228947 45654.3265603 45743.8082773
This model identified STARS, LabelAppeal, Alcohol, and TotalSulfurDioxide as significant positive predictors of wine sales, while VolatileAcidity and AcidIndex showed negative relationships with TARGET. The model achieved moderate predictive performance with an RMSE of 2.59 and an R-squared of 0.51. Compared to the Poisson models, the Negative Binomial model is more appropriate because it better handles overdispersion in the count-based wine sales data. Overall, Model 4 provided a good balance between predictive accuracy, interpretability, and suitability for the dataset.
Full Multiple Linear Regression Model - Model 5
The predictors for the following model are: FixedAcidity, VolatileAcidity, CitricAcid, ResidualSugar, Chlorides, FreeSulfurDioxide, TotalSulfurDioxide, Density, pH, Sulphates, Alcohol, LabelAppeal, AcidIndex, STARS
Code
lm5 <- lm(TARGET ~ ., data = trainingData)
summary(lm5)

# Evaluate Model 5 with testing data set
(Linear_Regression_eval5 <- model_test_perf(lm5, trainX, trainY, testX, testY))
RMSE Rsquared MAE aic bic
1.3294009 0.5263324 1.0402172 34495.7176418 34633.1610116
This model demonstrated strong predictive performance with an RMSE of 1.33 and an R-squared of 0.53, outperforming the Poisson and Negative Binomial models on these metrics. Variables such as STARS, LabelAppeal, and Alcohol showed significant positive relationships with wine sales, while VolatileAcidity and AcidIndex negatively affected TARGET. The overall model was statistically significant, indicating that the predictors collectively explained a substantial portion of the variation in wine sales. However, despite its strong performance, linear regression may not be ideal for count-based response data because it does not account for overdispersion or non-negative count constraints.
Stepwise Linear Regression Model - Model 6
For the final Linear Model, we leverage stepAIC on our Linear Model 5 to choose the most important features.
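How stepAIC prunes a full linear model can be illustrated on simulated data. In this sketch the data frame `sim_df`, its columns, and the choice of direction = "both" are assumptions for illustration; the settings used for Model 6 are not shown in this report.

```r
library(MASS)  # stepAIC()

set.seed(2024)
n <- 300
sim_df <- data.frame(
  x1 = rnorm(n), x2 = rnorm(n),         # predictors that drive y
  noise1 = rnorm(n), noise2 = rnorm(n)  # predictors unrelated to y
)
sim_df$y <- 2 + 1.5 * sim_df$x1 - 1.0 * sim_df$x2 + rnorm(n)

full_fit <- lm(y ~ ., data = sim_df)

# Drop or re-add terms one at a time for as long as the AIC improves
step_fit <- stepAIC(full_fit, direction = "both", trace = FALSE)

formula(step_fit)
c(full = AIC(full_fit), stepwise = AIC(step_fit))
```

The search can only lower (or keep) the AIC of the starting model, so the stepwise fit is never worse than the full fit on that criterion, although it may still retain a weak predictor by chance.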
# Evaluate Model 6 with testing data set
(Linear_Regression_eval6 <- model_test_perf(lm6, trainX, trainY, testX, testY))
RMSE Rsquared MAE aic bic
1.3298984 0.5259845 1.0402884 34490.5364790 34599.0444025
This model selected the most important predictors using the stepAIC method and achieved strong predictive performance with an RMSE of 1.33 and an R-squared of 0.53. Variables such as STARS, LabelAppeal, and Alcohol showed significant positive effects on wine sales, while VolatileAcidity and AcidIndex negatively influenced TARGET. Compared to the full linear regression model, this reduced model maintained similar predictive accuracy while using fewer variables, making it more efficient and interpretable. Based on AIC and BIC values, Model 6 provided the best overall balance between model simplicity and predictive performance.
Model Evaluation and Selection
Model performance is evaluated using metrics such as RMSE, R-squared, MAE, AIC, and BIC. Among all models, the multiple linear regression models demonstrate superior predictive performance.
In particular, the stepwise-selected model (Model 6) achieves the best balance across evaluation metrics, with lower AIC and BIC values indicating improved model efficiency.
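The following table summarizes the RMSE, R-squared, MAE, AIC, and BIC for all six models, using the evaluation results reported above. The linear regressions (Linear Model 5 and Linear Model 6) had the best overall performance based on RMSE and R-squared, and Linear Model 6 also had the best performance based on AIC and BIC.
Model                                  RMSE    Rsquared  MAE     AIC       BIC
Model 1 - Poisson (full)               2.5945  0.5021    2.2227  36459.74  36589.95
Model 2 - Poisson (reduced)            2.5938  0.5037    2.2219  36463.04  36535.38
Model 3 - Negative Binomial (full)     2.5945  0.5021    2.2227  36462.08  36599.52
Model 4 - Negative Binomial (reduced)  2.5939  0.5060    2.2229  45654.33  45743.81
Model 5 - Linear (full)                1.3294  0.5263    1.0402  34495.72  34633.16
Model 6 - Linear (stepwise)            1.3299  0.5260    1.0403  34490.54  34599.04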
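AIC and BIC are only comparable between models fitted to the same observations, and the reported values allow a quick consistency check. For a model with k estimated parameters and n observations, AIC = 2k - 2 logLik and BIC = k log(n) - 2 logLik, so BIC - AIC = k (log(n) - 2). The sketch below applies this to the evaluation outputs reported for Models 2 and 4. The parameter count of 12 for Model 4 is an assumption based on its listed predictors (intercept, six numeric terms, four STARS levels, plus theta).

```r
# Gap between BIC and AIC implied by k parameters and n observations
bic_aic_gap <- function(k, n) k * (log(n) - 2)

# Model 2: 10 coefficients in the summary output, 10,238 training rows
gap_model2_reported <- 36535.3764136 - 36463.0377979
gap_model2_formula  <- bic_aic_gap(k = 10, n = 10238)
c(reported = gap_model2_reported, formula = gap_model2_formula)

# Model 4: reported gap versus the gap implied by two candidate sample sizes
gap_model4_reported <- 45743.8082773 - 45654.3265603
c(reported = gap_model4_reported,
  n_10238  = bic_aic_gap(k = 12, n = 10238),
  n_12795  = bic_aic_gap(k = 12, n = 12795))
```

Under that assumption, the Model 4 gap matches a sample size of 12,795 rather than 10,238, which would be consistent with nb4 having been fitted to the full cleaned data set rather than the training split. This is only an inference from the reported numbers, but if it holds, Model 4's AIC and BIC are not directly comparable with those of the other models.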
Across all model types, STARS and LabelAppeal consistently showed positive coefficients, indicating that wines with higher ratings and stronger label appeal tend to generate higher sales volumes. AcidIndex generally showed a negative relationship with TARGET, suggesting that higher acidity may reduce customer demand. Alcohol content displayed a moderate positive effect in both Poisson and Negative Binomial models, which may indicate customer preference for wines with higher alcohol concentration. Some variables changed sign across different model types. For example, pH showed inconsistent effects between the Poisson and linear regression models. This may be due to multicollinearity or differing distributional assumptions among the models. Despite these inconsistencies, variables with unstable coefficients were retained if they improved overall predictive performance and model fit statistics.
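Although the multiple linear regression models achieved better RMSE and R-squared performance, linear regression is not ideal for count-based response variables because predictions may fall outside the valid range and assumptions of normality may be violated. Part of the RMSE and MAE gap may also be an artifact of the evaluation helper: predict() on a glm or glm.nb object returns values on the log (link) scale unless type = "response" is specified, so the count-model errors reported above are not directly comparable with those of the linear models.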
Since TARGET represents count data with substantial overdispersion and many zero values, the Negative Binomial regression model provides a more theoretically appropriate framework for deployment.
Therefore, Negative Binomial Model 4 was selected as the final deployment model because it balances predictive performance, interpretability, and suitability for count data.
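The "many zero values" point can be quantified by comparing the observed share of zeros with the share a Poisson distribution with the same mean would imply. The sketch below uses the TARGET mean from the summary output (3.029) for the Poisson side; the vector `sim_target` is simulated and only stands in for the real TARGET column.

```r
target_mean <- 3.029  # TARGET mean from the summary(train_df) output

# Share of zeros expected under a Poisson distribution with that mean
expected_zero_share <- dpois(0, lambda = target_mean)
expected_zero_share

# Hypothetical counts with extra zeros, standing in for TARGET
set.seed(5)
sim_target <- ifelse(runif(2000) < 0.2, 0L, rpois(2000, lambda = 3.8))
observed_zero_share <- mean(sim_target == 0)

c(observed = observed_zero_share,
  poisson_expected = dpois(0, lambda = mean(sim_target)))
```

An observed zero share far above the Poisson expectation suggests excess zeros, which is a separate issue from overdispersion and one that a plain Negative Binomial model may only partly absorb.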
Sales Prediction Results
The selected model is applied to the evaluation dataset to generate predictions. The resulting distribution of predicted values closely resembles that of the training data, suggesting that the model generalizes well.
The histogram shows that our predictions have a similar shape to the training TARGET variable: the means and medians are almost identical, and the kurtosis values are close.
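A side-by-side comparison of this kind can be sketched with a small helper. The function `dist_summary` and both vectors are hypothetical; kurtosis is computed here as the plain fourth standardized moment, and package implementations may use a different convention such as excess kurtosis.

```r
# Mean, median, and kurtosis (fourth standardized moment) of a numeric vector
dist_summary <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  c(mean = m,
    median = median(x),
    kurtosis = mean((x - m)^4) / mean((x - m)^2)^2)
}

set.seed(9)
train_target <- rpois(1000, lambda = 3)  # hypothetical stand-in for training TARGET
predictions  <- rpois(300, lambda = 3)   # hypothetical stand-in for predicted cases

rbind(train = dist_summary(train_target),
      predicted = dist_summary(predictions))
```

Similar summary statistics show that the predictions occupy the same range as the training response, but they do not by themselves demonstrate predictive accuracy, which still depends on held-out error metrics.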
Final Conclusions
Several predictive modeling approaches were explored, including Poisson regression, Negative Binomial regression, and Multiple Linear Regression. Exploratory analysis revealed that the TARGET variable contains many zero observations and demonstrates characteristics of overdispersion, making count-based regression models more appropriate.
Although the multiple linear regression models achieved slightly stronger RMSE and R-squared performance, the Negative Binomial models provided a better theoretical fit for the count nature of the response variable. In particular, Negative Binomial Model 4 achieved a strong balance between predictive accuracy, interpretability, and model simplicity.
Key predictors such as STARS, Alcohol, and LabelAppeal consistently demonstrated positive relationships with wine sales, while AcidIndex showed a generally negative relationship with TARGET.
Overall, the Negative Binomial regression framework provided the most appropriate balance between predictive performance and theoretical suitability for count-based sales data. The final model successfully captured key drivers of wine purchasing behavior while addressing overdispersion present in the response variable. These findings demonstrate how statistical modeling can support data-driven decision-making in consumer sales forecasting.