Data-driven decision-making is essential for achieving effective business outcomes. In this analysis, we begin by exploring the dataset to identify potential issues such as outliers, missing values, encoding inconsistencies, and multicollinearity. After addressing these challenges through appropriate data preprocessing techniques, we develop and evaluate several predictive models for estimating sales.
The dataset is divided into training and evaluation subsets, allowing us to train models on one portion and validate their performance on unseen data. Multiple modeling approaches are compared, and the final model is selected based on its ability to balance predictive accuracy with model simplicity.
Code
library(summarytools)library(MASS)library(rpart.plot)library(ggplot2)library(ggfortify)library(gridExtra)library(forecast)library(fpp2)library(fma)library(kableExtra)library(e1071)library(mlbench)library(ggcorrplot)library(DataExplorer)library(timeDate)library(caret)library(GGally)library(corrplot)library(RColorBrewer)library(tibble)library(tidyr)library(tidyverse)library(dplyr)library(reshape2)library(mixtools)library(tidymodels)library(ggpmisc)library(regclass)library(skimr)library(corrgram)library(mice)#' Print a side-by-side Histogram and QQPlot of Residuals#'#' @param model A model#' @examples#' residPlot(myModel)#' @return null#' @exportresidPlot <-function(model) {# Make sure a model was passedif (is.null(model)) { return }layout(matrix(c(1,1,2,3), 2, 2, byrow =TRUE))plot(residuals(model))hist(model[["residuals"]], freq =FALSE, breaks ="fd", main ="Residual Histogram",xlab ="Residuals",col="lightgreen")lines(density(model[["residuals"]], kernel ="ep"),col="blue", lwd=3)curve(dnorm(x,mean=mean(model[["residuals"]]), sd=sd(model[["residuals"]])), col="red", lwd=3, lty="dotted", add=T)qqnorm(model[["residuals"]], main ="Residual Q-Q plot")qqline(model[["residuals"]],col="red", lwd=3, lty="dotted")par(mfrow =c(1, 1))}#' Print a Variable Importance Plot for the provided model#'#' @param model The model#' @param chart_title The Title to show on the plot#' @examples#' variableImportancePlot(myLinearModel, 'My Title)#' @return null#' @exportvariableImportancePlot <-function(model=NULL, chart_title='Variable Importance Plot') {# Make sure a model was passedif (is.null(model)) { return }# use caret and gglot to print a variable importance plotvarImp(model) %>%as.data.frame() %>%ggplot(aes(x =reorder(rownames(.), desc(Overall)), y = Overall)) +geom_col(aes(fill = Overall)) +theme(panel.background =element_blank(),panel.grid =element_blank(),axis.text.x =element_text(angle =90)) +scale_fill_gradient() +labs(title = chart_title,x ="Parameter",y ="Relative Importance")}#' Print a Facet Chart of histograms#'#' @param df Dataset#' @param box Facet size (rows)#' @examples#' histbox(my_df, 3)#' @return null#' @exporthistbox <-function(df, box) {par(mfrow = box) ndf <-dimnames(df)[[2]]for (i inseq_along(ndf)) { data <-na.omit(unlist(df[, i]))hist(data, breaks ="fd", main =paste("Histogram of", ndf[i]),xlab = ndf[i], freq =FALSE)lines(density(data, kernel ="ep"), col ='red') }par(mfrow =c(1, 1))}#' Extract key performance results from a model#'#' @param model A linear model of interest#' @examples#' model_performance_extraction(my_model)#' @return data.frame#' @exportmodel_performance_extraction <-function(model=NULL) {# Make sure a model was passedif (is.null(model)) { return }data.frame("RSE"= model$sigma,"Adj R2"= model$adj.r.squared,"F-Statistic"= model$fstatistic[1])}
tibble [12,795 × 15] (S3: tbl_df/tbl/data.frame)
$ TARGET : int [1:12795] 3 3 5 3 4 0 0 4 3 6 ...
$ FixedAcidity : num [1:12795] 3.2 4.5 7.1 5.7 8 11.3 7.7 6.5 14.8 5.5 ...
$ VolatileAcidity : num [1:12795] 1.16 0.16 2.64 0.385 0.33 0.32 0.29 -1.22 0.27 -0.22 ...
$ CitricAcid : num [1:12795] -0.98 -0.81 -0.88 0.04 -1.26 0.59 -0.4 0.34 1.05 0.39 ...
$ ResidualSugar : num [1:12795] 54.2 26.1 14.8 18.8 9.4 ...
$ Chlorides : num [1:12795] -0.567 -0.425 0.037 -0.425 NA 0.556 0.06 0.04 -0.007 -0.277 ...
$ FreeSulfurDioxide : num [1:12795] NA 15 214 22 -167 -37 287 523 -213 62 ...
$ TotalSulfurDioxide: num [1:12795] 268 -327 142 115 108 15 156 551 NA 180 ...
$ Density : num [1:12795] 0.993 1.028 0.995 0.996 0.995 ...
$ pH : num [1:12795] 3.33 3.38 3.12 2.24 3.12 3.2 3.49 3.2 4.93 3.09 ...
$ Sulphates : num [1:12795] -0.59 0.7 0.48 1.83 1.77 1.29 1.21 NA 0.26 0.75 ...
$ Alcohol : num [1:12795] 9.9 NA 22 6.2 13.7 15.4 10.3 11.6 15 12.6 ...
$ LabelAppeal : int [1:12795] 0 -1 -1 -1 0 0 0 1 0 0 ...
$ AcidIndex : int [1:12795] 8 7 8 6 9 11 8 7 6 8 ...
$ STARS : int [1:12795] 2 3 3 1 2 NA NA 3 NA 4 ...
Code
str(evaluate_df)
tibble [3,335 × 15] (S3: tbl_df/tbl/data.frame)
$ TARGET : logi [1:3335] NA NA NA NA NA NA ...
$ FixedAcidity : num [1:3335] 5.4 12.4 7.2 6.2 11.4 17.6 15.5 15.9 11.6 3.8 ...
$ VolatileAcidity : num [1:3335] -0.86 0.385 1.75 0.1 0.21 0.04 0.53 1.19 0.32 0.22 ...
$ CitricAcid : num [1:3335] 0.27 -0.76 0.17 1.8 0.28 -1.15 -0.53 1.14 0.55 0.31 ...
$ ResidualSugar : num [1:3335] -10.7 -19.7 -33 1 1.2 1.4 4.6 31.9 -50.9 -7.7 ...
$ Chlorides : num [1:3335] 0.092 1.169 0.065 -0.179 0.038 ...
$ FreeSulfurDioxide : num [1:3335] 23 -37 9 104 70 -250 10 115 35 40 ...
$ TotalSulfurDioxide: num [1:3335] 398 68 76 89 53 140 17 381 83 129 ...
$ Density : num [1:3335] 0.985 0.99 1.046 0.989 1.029 ...
$ pH : num [1:3335] 5.02 3.37 4.61 3.2 2.54 3.06 3.07 2.99 3.32 4.72 ...
$ Sulphates : num [1:3335] 0.64 1.09 0.68 2.11 -0.07 -0.02 0.75 0.31 2.18 -0.64 ...
$ Alcohol : num [1:3335] 12.3 16 8.55 12.3 4.8 11.4 8.5 11.4 -0.5 10.9 ...
$ LabelAppeal : int [1:3335] -1 0 0 -1 0 1 0 1 0 0 ...
$ AcidIndex : int [1:3335] 6 6 8 8 10 8 12 7 12 7 ...
$ STARS : int [1:3335] NA 2 1 1 NA 4 3 NA NA NA ...
The training dataset consists of 12,795 observations and 15 variables, including 14 predictors and one response variable (TARGET). The INDEX column serves only as an identifier and is removed prior to analysis. Most predictors represent chemical properties of wine, while LabelAppeal and STARS reflect subjective quality and presentation ratings.
The response variable, TARGET, represents the number of wine cases purchased and takes integer values between 0 and 8. A large proportion of observations contain zero values, indicating many instances with no purchases.
Code
library(vtable)st(train_df)
Summary Statistics
Variable
N
Mean
Std. Dev.
Min
Pctl. 25
Pctl. 75
Max
TARGET
12795
3
1.9
0
2
4
8
FixedAcidity
12795
7.1
6.3
-18
5.2
9.5
34
VolatileAcidity
12795
0.32
0.78
-2.8
0.13
0.64
3.7
CitricAcid
12795
0.31
0.86
-3.2
0.03
0.58
3.9
ResidualSugar
12179
5.4
34
-128
-2
16
141
Chlorides
12157
0.055
0.32
-1.2
-0.031
0.15
1.4
FreeSulfurDioxide
12148
31
149
-555
0
70
623
TotalSulfurDioxide
12113
121
232
-823
27
208
1057
Density
12795
0.99
0.027
0.89
0.99
1
1.1
pH
12400
3.2
0.68
0.48
3
3.5
6.1
Sulphates
11585
0.53
0.93
-3.1
0.28
0.86
4.2
Alcohol
12142
10
3.7
-4.7
9
12
26
LabelAppeal
12795
-0.0091
0.89
-2
-1
1
2
AcidIndex
12795
7.8
1.3
4
7
8
17
STARS
9436
2
0.9
1
1
3
4
Predictor Variables
Among the predictor variables, several contain missing values, particularly STARS, Sulphates, and sulfur-related measurements. The dataset includes both continuous and discrete variables, with LabelAppeal, AcidIndex, and STARS treated as categorical features.
Some chemical variables contain negative values, which are not physically meaningful. These values are likely the result of prior transformations such as normalization or scaling. While this inconsistency is noted, the values are retained (with adjustments applied later) due to limited information about the data generation process.
Response Variables
The distribution of the response variable indicates that most observations correspond to low sales volumes, with a high frequency of zero values. For non-zero observations, the distribution appears approximately symmetric, suggesting moderate variability in purchase behavior.
Variables Distributions
Most continuous variables exhibit approximately normal distributions, although some degree of skewness is present. Variables such as AcidIndex and STARS show noticeable right skewness, indicating that higher values occur less frequently.
Boxplot analysis reveals the presence of outliers across several variables, particularly in TotalSulfurDioxide, FreeSulfurDioxide, and ResidualSugar. These variables also display wider ranges compared to others, suggesting greater variability.
We observe that continuous variables exhibit a somewhat normal steep distribution. However, variables such as AcidIndex and STARS display right skewness.
In the box plot, there are not many outliers in the variables. However, TotalSulfurDioxide, FreeSulfurDioxide, and ResidualSugar variables have large ranges compared to other variables.We can tell a high number of variables have numerous outliers.
Relationship Between Categorical and Response Variables
Exploration of categorical predictors shows meaningful relationships with the response variable. Higher values of STARS are generally associated with increased sales, indicating that perceived quality influences purchasing behavior. Similarly, LabelAppeal appears positively related to sales, suggesting that product presentation plays a role in consumer decisions.
The AcidIndex variable shows variation in sales across its levels, indicating that certain acidity levels may be more favorable in the market.
The bar charts compare the three discrete categorical variables to the TARGET variable. AcidIndex shows large quantity of wine were sold with the index number 7 and 8. LabelAppeal shows us generic label does yield higher number of wine samples per order. Lastly, STARS shows high star wine bottles have high price tags. For each of these predictors, there appears to be a significant relationship between the ordered levels and the number of wine cases sold.
Multicolinearity
Correlation analysis reveals that STARS and LabelAppeal have the strongest positive relationships with the response variable. In contrast, AcidIndex shows a mild negative correlation with sales. Overall, while some correlations exist among predictors, no severe multicollinearity issues are observed.
x
TARGET
1.0000000
FixedAcidity
-0.0125381
VolatileAcidity
-0.0759979
CitricAcid
0.0023450
ResidualSugar
0.0035196
Chlorides
-0.0304301
FreeSulfurDioxide
0.0226398
TotalSulfurDioxide
0.0216021
Density
-0.0475989
pH
0.0002199
Sulphates
-0.0212204
Alcohol
0.0737771
LabelAppeal
0.4979465
AcidIndex
-0.1676431
STARS
0.5546857
In the correlation table, we can see that STARS and LabelAppeal are most positively correlated variables with the response variable. Also, we some mild negative correlation between the response variable and AcidIndex variable.
Data Preparation
Negative Values
To address negative values in chemical variables, absolute transformations are applied to ensure all values are non-negative. For LabelAppeal, values are shifted by adding the minimum value to maintain relative differences while eliminating negative entries.
Several variables contain missing values, with STARS having the highest proportion. Missing values are assumed to be Missing at Random (MAR) and are imputed using the MICE (Multiple Imputation by Chained Equations) method.
According to the graph, the data set has multiple variables with missing variables. The STARS variable has the most NA values. The Sulphates variable records missing values in roughly 10% of observations, while the remaining six predictors have missing values ranging from 3% to 5%. These missing values will be imputed later on during the data preparation using the MICE package and random forest prediction method.
Code
train_df$STARS[is.na(train_df$STARS)] <-0evaluate_df$STARS[is.na(evaluate_df$STARS)] <-0# Perform multiple imputationmice_imputes <-mice(train_df, m =2, maxit =2, print =FALSE)# Visualize the imputed values with density plotdensityplot(mice_imputes)
We can see that each of the remaining variables with missing values seem to be MAR, as the mice imputation distributions roughly match the existing. We’ll also run the mice imputation again on both the train and test set. Instead of using it for our models, however, we’ll simplify our run and fill in our data. Finally, after our analysis, we can use it in in our model, we’ll update STARS to become a factor variable.
Code
mice_train <-mice(train_df, m =1, maxit =1, print =FALSE)cleaned_train <-complete(mice_train)mice_evaluate <-mice(evaluate_df, m =1, maxit =1, print =FALSE)cleaned_evaluate <-complete(mice_evaluate)cleaned_train$STARS <-as.factor(cleaned_train$STARS)cleaned_evaluate$STARS <-as.factor(cleaned_evaluate$STARS)
Descriptive Summaries and Correlation Review
Following data cleaning and imputation, updated summaries and visualizations confirm that the dataset is well-structured and suitable for modeling. The distributions remain consistent with earlier observations, and correlation patterns are preserved.
Code
# Create a summary tablesummary_table <-descr(cleaned_train)# Transpose the summary tabletransposed_summary_table <-t(summary_table)# Display transposed summary table as an HTML tableknitr::kable(transposed_summary_table, "html", escape =FALSE) %>% kableExtra::kable_styling("striped", full_width =FALSE) %>% kableExtra::column_spec(1, bold =TRUE) %>% kableExtra::scroll_box(height ="500px")
The dataset is divided into training (80%) and testing (20%) subsets. Models are trained on the training data and evaluated on the testing data to ensure unbiased performance assessment.
[1] "Number of Training Samples: 10238"
[1] "Number of Testing Samples: 2557"
Build Models
Multiple models are developed to predict wine sales, including Poisson regression, Negative Binomial regression, and Multiple Linear Regression. Both full models (using all predictors) and reduced models (using selected predictors) are evaluated.
Poisson Regression Model 1
In this first model, we include all available features: FixedAcidity, VolatileAcidity, CitricAcid, ResidualSugar, Chlorides, FreeSulfurDioxide, TotalSulfurDioxide, Density, pH, Sulphates, Alcohol, LabelAppeal, AcidIndex, STARS
RMSE Rsquared MAE aic bic
2.5894545 0.5202773 2.2187243 36519.6914823 36649.9009905
Poisson Regression Model 2
In this Model 2, we only include the most predictive features : VolatileAcidity, TotalSulfurDioxide, Alcohol, LabelAppeal, AcidIndex, STARS
Call:
glm(formula = TARGET ~ VolatileAcidity + TotalSulfurDioxide +
Alcohol + LabelAppeal + AcidIndex + STARS, family = poisson,
data = trainingData)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.46404014 0.05004844 9.272 < 0.0000000000000002 ***
VolatileAcidity -0.03779725 0.01057921 -3.573 0.000353 ***
TotalSulfurDioxide 0.00007139 0.00003497 2.042 0.041176 *
Alcohol 0.00346452 0.00154517 2.242 0.024951 *
LabelAppeal 0.15704568 0.00689219 22.786 < 0.0000000000000002 ***
AcidIndex -0.07878071 0.00503813 -15.637 < 0.0000000000000002 ***
STARS1 0.79022846 0.02197602 35.959 < 0.0000000000000002 ***
STARS2 1.10830042 0.02050642 54.046 < 0.0000000000000002 ***
STARS3 1.23542281 0.02159526 57.208 < 0.0000000000000002 ***
STARS4 1.34776225 0.02722184 49.510 < 0.0000000000000002 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 18365 on 10237 degrees of freedom
Residual deviance: 10971 on 10228 degrees of freedom
AIC: 36512
Number of Fisher Scoring iterations: 6
RMSE Rsquared MAE aic bic
2.5896019 0.5196981 2.2189296 36512.1625275 36584.5011431
Negative Binomial Regression Model 3
Similar to Poisson Model 1, the predictors for the following model are: FixedAcidity, VolatileAcidity, CitricAcid, ResidualSugar, Chlorides, FreeSulfurDioxide, TotalSulfurDioxide, Density, pH, Sulphates, Alcohol, LabelAppeal, AcidIndex, STARS
RMSE Rsquared MAE aic bic
2.5894544 0.5202774 2.2187242 36522.0294485 36659.4728183
Negative Binomial Regression Model 4
Similar to Poisson Model 2, the predictors for the following model are: VolatileAcidity, FreeSulfurDioxide, TotalSulfurDioxide, Alcohol, LabelAppeal, AcidIndex, STARS
RMSE Rsquared MAE aic bic
2.587590 0.521018 2.219023 45652.300273 45741.781990
Multiple Linear Regression Model 5
The predictors for the following model are: FixedAcidity, VolatileAcidity, CitricAcid, ResidualSugar, Chlorides, FreeSulfurDioxide, TotalSulfurDioxide, Density, pH, Sulphates, Alcohol, LabelAppeal, AcidIndex, STARS
RMSE Rsquared MAE aic bic
1.3077413 0.5400503 1.0268622 34573.9257117 34667.9659121
Select Models
Model performance is evaluated using metrics such as RMSE, R-squared, MAE, AIC, and BIC. Among all models, the multiple linear regression models demonstrate superior predictive performance.
In particular, the stepwise-selected model (Model 6) achieves the best balance across evaluation metrics, with lower AIC and BIC values indicating improved model efficiency.
RMSE
Rsquared
MAE
aic
bic
Poission_Evaluate1
2.589454
0.5202773
2.218724
36519.69
36649.90
Poission_Evaluate2
2.589602
0.5196981
2.218930
36512.16
36584.50
Negative_Binomial_eval3
2.589454
0.5202774
2.218724
36522.03
36659.47
Negative_Binomial_eval4
2.587590
0.5210180
2.219023
45652.30
45741.78
Linear_Regression_eval5
1.307328
0.5403412
1.026727
34579.45
34716.89
Linear_Regression_eval6
1.307741
0.5400503
1.026862
34573.93
34667.97
This table summarizes the RMSE, RSQUARED, MAE, AIC and BIC for all SIX models. The Linear regressions (Linear Model 5 and Linear Model 6) had the overall best performance based on RMSE and RSQUARED; as well as Linear Model 6 had the best performance based on AIC and BIC.
Finally, we chose Multiple Linear Regression Model 6 as our final model since it had a far lower AIC and BIC.
Prediction
The selected model is applied to the evaluation dataset to generate predictions. The resulting distribution of predicted values closely resembles that of the training data, suggesting that the model generalizes well.
The histogram shows that our predictions have a similar shape to our training Target variable, the means and medians are almost identical, and the kurtosis values are close.
Conclusions
The final selected model, a multiple linear regression with stepwise feature selection, achieves an adjusted R-squared of approximately 0.53. Key predictors such as STARS, Alcohol, and LabelAppeal show positive relationships with sales, while AcidIndex is negatively associated.
Overall, the model aligns well with insights obtained during exploratory analysis and provides a reasonable balance between interpretability and predictive performance.