Quantitative Analysis of Wine Type Prediction via Logistic Regression

Author

Benedict Bautista

Wine classification is an important aspect of the wine business, where knowing the chemical make-up of a product can contribute to quality control, marketing, and customer satisfaction. Red wines and white wines, although both made from grapes, have different physicochemical properties that can be used to tell them apart. Classical classification methods rely on sensory analysis or specialist knowledge, which can be subjective and inconsistent.

In this research, logistic regression is used as a statistical classifier to model wine type, red or white, from physicochemical and sensory properties, namely density, sulphates, alcohol, and quality. Logistic regression is well suited to binary classification and yields interpretable results in the form of odds ratios, making it a useful tool for quantifying the effect of each property on wine type.

The aim of this analysis is to construct a predictive model employing logistic regression, evaluate its performance, and interpret the salient predictors distinguishing red wines from white wines. This method illustrates the benefits of statistical modeling in enological data analysis and aids in data-driven decision-making in viticulture and wine classification.

Data Importing and Sampling

The dataset comes from kaggle.com; alternatively, you can click the link wine.csv to download it. The dataset contains 13 variables: type (the type of the wine, red or white), fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, density, pH, sulphates, alcohol, and quality. For this study, only type, density, sulphates, alcohol, and quality will be used.

library(tidyverse)
library(here) # for project-relative file paths
library(lmtest) # for regression diagnostic tests
library(psych) # for histogram matrices
library(car) # for VIF and the Durbin-Watson test
library(olsrr) # for OLS model diagnostics
library(nortest) # for normality tests
library(pscl) # for McFadden's R^2 (goodness of fit)
library(pander) # for readable outputs

set.seed(100) # for reproducibility
data <- read.csv(here("winequalityN.csv"))
any(is.na(data)) # checking for missing values
[1] TRUE
colSums(is.na(data)) # counting how many missing values
                type        fixed.acidity     volatile.acidity 
                   0                   10                    8 
         citric.acid       residual.sugar            chlorides 
                   3                    2                    2 
 free.sulfur.dioxide total.sulfur.dioxide              density 
                   0                    0                    0 
                  pH            sulphates              alcohol 
                   9                    4                    0 
             quality 
                   0 
data <- data %>% drop_na() # dropping rows with missing values
data <- data %>% sample_n(500) %>% select(type,
                                          density,
                                          sulphates,
                                          alcohol,
                                          quality) # random sample of 500 rows
head(data, n = 10)
    type density sulphates alcohol quality
1  white 0.99120      0.52    11.1       5
2  white 0.99340      0.45     9.1       5
3  white 0.98926      0.38    12.2       6
4  white 0.99382      0.62     9.8       5
5    red 0.99632      1.15     9.3       5
6  white 0.98978      0.45    12.0       6
7  white 0.99572      0.46     9.1       5
8  white 0.99496      0.44    10.5       7
9  white 0.99640      0.60     9.2       5
10 white 0.99790      0.67    10.4       6

To proceed to the logistic regression later, we need to recode the response variable to binary values. We assign red as 0 and white as 1.

data <- data %>% mutate(
  type = factor(type, 
                labels = c(0, 1), 
                levels = c("red", "white")))
head(data, n = 10)
   type density sulphates alcohol quality
1     1 0.99120      0.52    11.1       5
2     1 0.99340      0.45     9.1       5
3     1 0.98926      0.38    12.2       6
4     1 0.99382      0.62     9.8       5
5     0 0.99632      1.15     9.3       5
6     1 0.98978      0.45    12.0       6
7     1 0.99572      0.46     9.1       5
8     1 0.99496      0.44    10.5       7
9     1 0.99640      0.60     9.2       5
10    1 0.99790      0.67    10.4       6

Exploratory Data Analysis

Now that we’ve cleaned and sampled our data, we can explore it by inspecting its dimensions, structure, summary statistics, and distributions.

pander(dim(data))

500 and 5

str(data)
'data.frame':   500 obs. of  5 variables:
 $ type     : Factor w/ 2 levels "0","1": 2 2 2 2 1 2 2 2 2 2 ...
 $ density  : num  0.991 0.993 0.989 0.994 0.996 ...
 $ sulphates: num  0.52 0.45 0.38 0.62 1.15 0.45 0.46 0.44 0.6 0.67 ...
 $ alcohol  : num  11.1 9.1 12.2 9.8 9.3 12 9.1 10.5 9.2 10.4 ...
 $ quality  : int  5 5 6 5 5 6 5 7 5 6 ...
pander(summary(data))
  type    density          sulphates        alcohol         quality
  0:137   Min.   :0.9872   Min.   :0.2700   Min.   : 8.50   Min.   :3.00
  1:363   1st Qu.:0.9924   1st Qu.:0.4400   1st Qu.: 9.50   1st Qu.:5.00
          Median :0.9951   Median :0.5100   Median :10.20   Median :6.00
          Mean   :0.9949   Mean   :0.5343   Mean   :10.51   Mean   :5.78
          3rd Qu.:0.9972   3rd Qu.:0.6025   3rd Qu.:11.40   3rd Qu.:6.00
          Max.   :1.0026   Max.   :1.6200   Max.   :14.00   Max.   :9.00

From the output, the type variable has been correctly transformed into a factor, where 0 signifies red wine and 1 signifies white wine. Of the predictors chosen for this analysis (density, sulphates, alcohol, and quality), the first three are numeric, while quality is stored as an integer reflecting the wine's rating.

The density variable ranges from 0.9872 to 1.0026, with a first quartile (Q1) of 0.9924, a third quartile (Q3) of 0.9972, a mean of 0.9949, and a median of 0.9951. For sulphates, the range is 0.27 to 1.62, with Q1 of 0.44, Q3 of 0.6025, a mean of 0.5343, and a median of 0.51.

The alcohol content ranges from 8.5 to 14.0, with Q1 of 9.5, Q3 of 11.4, a mean of 10.51, and a median of 10.2. Finally, the quality variable, which measures the wine rating, varies between 3 and 9, with Q1 at 5, Q3 at 6, a mean of 5.78, and a median of 6.

In relation to wine type distribution, among the 500 sampled observations, 137 are red wines (27.4%) and 363 are white wines (72.6%). This represents a moderate class imbalance, with white wines forming the majority of the sample.

To further understand the underlying structure of the data, we assess the distribution of the numeric variables by performing the Shapiro-Wilk test for normality and by visualizing the data through histograms.

columns <- c("density", "sulphates", "alcohol", "quality")
shapiro_results <- lapply(data[columns], shapiro.test)
shapiro_summary <- data.frame(
  Variable = names(shapiro_results),
  Statistic = sapply(shapiro_results, function(x) round(x$statistic, 4)),
  P_Value = sapply(shapiro_results, function(x) x$p.value),
  Normality = sapply(shapiro_results, function(x) ifelse(x$p.value > 0.05, "Assumed Normal", 
        "Not Normal")))

print(shapiro_summary)
             Variable Statistic      P_Value  Normality
density.W     density    0.9861 1.042758e-04 Not Normal
sulphates.W sulphates    0.9096 1.129872e-16 Not Normal
alcohol.W     alcohol    0.9408 3.183606e-13 Not Normal
quality.W     quality    0.8756 1.407227e-19 Not Normal
# histograms
# pivoting to reshape data
long_data <- data %>%
  pivot_longer(cols = c(density, sulphates, alcohol, quality),
               names_to = "Variable",
               values_to = "Value")

ggplot(long_data, aes(x = Value)) +
  geom_histogram(fill = "steelblue", bins = 30, color = "black") +
  facet_wrap(~ Variable, scales = "free") +
  theme_minimal() +
  labs(title = "Distribution of Wine Characteristics",
       x = "Value",
       y = "Frequency")

The density variable produced a Shapiro-Wilk statistic of 0.9861 with a p-value of 1.04 × 10⁻⁴, which signifies a statistically significant departure from normality. The sulphates variable showed even stronger evidence of non-normality, with a statistic of 0.9096 and a p-value of 1.13 × 10⁻¹⁶. Its histogram clearly depicts a right-skewed distribution, consistent with the descriptive statistics above, where the mean is larger than the median, a characteristic feature of positive skewness.

For the alcohol variable, the Shapiro-Wilk test gave a statistic of 0.9408 and a p-value of 3.18 × 10⁻¹³, once more indicating non-normality. Its distribution also appears positively skewed, as the mean is greater than the median, implying a concentration of lower values with a few high extreme values pulling the distribution to the right.

Finally, the quality variable is also non-normally distributed, with a Shapiro-Wilk statistic of 0.8756 and a p-value of 1.41 × 10⁻¹⁹. This is likely because quality is recorded as discrete integer values (ratings between 3 and 9), which restricts its ability to conform to a smooth bell-shaped curve.

Since the dependent variable is binary, it is not appropriate to assess its distribution using formal normality tests. Instead, we examine its distribution by visualizing it with a bar plot and summarizing its frequency and proportion.

freq <- table(data$type) 
names(freq) <- c("Red", "White")
pander(freq)
Red White
137 363
proportion <- prop.table(freq)*100
pander(proportion)
Red White
27.4 72.6
ggplot(data, aes(x = type)) + 
  geom_bar(fill = "steelblue", color = "black") + 
  theme_minimal() +
  labs(
    title = "Distribution of wine type",
    x = "Type of wine",
    y = "Frequency"
  )

From the bar graph above, it can be observed that white wine dominates the sampled dataset, comprising 363 out of 500 observations (72.6%), while red wine accounts for 137 observations (27.4%). This indicates a moderate class imbalance, with white wine being nearly three times more prevalent than red wine in the sample.

Binary Logistic Regression

Here, we look at binary logistic regression, a statistical technique for modeling how one or more independent variables relate to a binary outcome. In contrast to linear regression, which predicts a continuous outcome, binary logistic regression applies when the dependent variable can take only two values, typically representing success/failure, yes/no, or presence/absence. The method is used widely across disciplines such as health, marketing, and the social sciences to estimate the likelihood of a specific event occurring given the predictor variables. Below we cover the underlying principles, model interpretation, and assumptions of binary logistic regression as they apply to this dataset.
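
As a brief illustration of the mechanics (a sketch, not part of the original analysis), logistic regression maps a linear predictor to a probability between 0 and 1 through the inverse-logit (sigmoid) link, which base R exposes as plogis():

# Sketch: the inverse-logit link p = 1 / (1 + exp(-eta)) used by
# binary logistic regression to turn log-odds into probabilities.
eta <- seq(-6, 6, by = 0.1) # hypothetical linear-predictor (log-odds) values
p <- plogis(eta)            # base-R inverse logit

plot(eta, p, type = "l", col = "steelblue", lwd = 2,
     xlab = "Linear predictor (log-odds)",
     ylab = "Probability",
     main = "Inverse-logit link")
abline(h = 0.5, lty = 2) # classification cutoff at p = 0.5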

In this scenario, the dependent variable is the type of wine, categorized as either red or white, making it a binary outcome suitable for logistic regression. Because white was coded as 1, the goal is to model and predict the likelihood that a given wine sample is white (versus red) based on a set of continuous independent variables. These predictors include density, which refers to the mass per unit volume of the wine; sulphates, which are added to help preserve the wine and prevent bacterial growth; alcohol content, which can influence the taste and perception of the wine; and quality, which is a score typically based on sensory evaluations. By applying binary logistic regression, we aim to understand how each of these variables contributes to the probability of a wine being red or white and to identify which characteristics are most strongly associated with wine type.

model <- glm(type ~ density + sulphates + alcohol + quality, 
             data = data,
             family = "binomial")
model_summary <- summary(model)
pander(model_summary)
              Estimate  Std. Error  z value   Pr(>|z|)
(Intercept)      582.2       76.26    7.635  2.262e-14
density         -572.9       75.30   -7.609  2.771e-14
sulphates       -8.535       1.204   -7.086  1.376e-12
alcohol        -0.9643      0.1835   -5.256  1.474e-07
quality          0.698      0.1959    3.562  0.0003677

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 587.2 on 499 degrees of freedom
Residual deviance: 360.1 on 495 degrees of freedom
pander(pR2(model)) # for pseudo R^2 (goodness of fit)

fitting null model for pseudo-r2

    llh   llhNull      G2  McFadden    r2ML    r2CU
   -180    -293.6   227.1    0.3868  0.3651  0.5284
# glm() reports log-odds coefficients, so we exponentiate them to obtain
# odds ratios and bind the profile-likelihood confidence intervals
exp(cbind(Odds_Ratio = coef(model), confint(model)))
Waiting for profiling to be done...
               Odds_Ratio         2.5 %        97.5 %
(Intercept) 6.979940e+252 5.116672e+190           Inf
density     1.533163e-249 1.088195e-316 3.568833e-188
sulphates    1.965614e-04  1.668233e-05  1.890170e-03
alcohol      3.812350e-01  2.627625e-01  5.404032e-01
quality      2.009687e+00  1.383055e+00  2.986118e+00
or <- exp(coef(model))

res <- tibble(
  Term = rownames(model_summary$coefficients),
  Coefficient = model_summary$coefficients[, "Estimate"],
  SE = model_summary$coefficients[, "Std. Error"],
  Odds = exp(model_summary$coefficients[, "Estimate"]),
  p_value = model_summary$coefficients[, "Pr(>|z|)"],
  Lower = exp(confint(model)[, 1]),
  Upper = exp(confint(model)[, 2])
) 
Waiting for profiling to be done...
Waiting for profiling to be done...
pander(res)
Term         Coefficient      SE        Odds     p_value        Lower       Upper
(Intercept)        582.2   76.26   6.98e+252   2.262e-14   5.117e+190         Inf
density           -572.9   75.30  1.533e-249   2.771e-14   1.088e-316  3.569e-188
sulphates         -8.535   1.204   0.0001966   1.376e-12    1.668e-05     0.00189
alcohol          -0.9643  0.1835      0.3812   1.474e-07       0.2628      0.5404
quality            0.698  0.1959        2.01   0.0003677        1.383       2.986

The intercept is highly significant (p < .05) with a coefficient of 582.2. It represents the log-odds of the wine being white when all predictors are zero, a largely theoretical scenario since, for example, a density of zero cannot occur.
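
A more interpretable quantity than the intercept is the fitted probability at realistic predictor values. A minimal sketch, reusing the model and data objects defined above, evaluates the model at the sample means:

# Predicted probability that a wine with average characteristics is white
# (type = 1). Illustrative sketch only.
avg_wine <- data.frame(density   = mean(data$density),
                       sulphates = mean(data$sulphates),
                       alcohol   = mean(data$alcohol),
                       quality   = mean(data$quality))
predict(model, newdata = avg_wine, type = "response")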

The density variable produced a coefficient of -572.9 and a highly significant p-value (p = 2.77 × 10⁻¹⁴). Its odds ratio is essentially zero (1.53 × 10⁻²⁴⁹), indicating that as density increases, the odds of the wine being white drop sharply; in other words, denser wines are far more likely to be red.

For sulphates, the coefficient was -8.535 with a p-value of 1.38 × 10⁻¹² (p < .05), which is statistically significant. Its odds ratio (1.97 × 10⁻⁴) is far below 1, meaning each one-unit increase in sulphates multiplies the odds of the wine being white by roughly 0.0002, so higher sulphate levels strongly favor red wine.

The alcohol variable was also significant, with a p-value of 1.47 × 10⁻⁷ (p < .05) and a coefficient of -0.9643. Its odds ratio (0.3812) means each one-unit increase in alcohol multiplies the odds of the wine being white by about 0.38, a negative relationship.

Finally, quality was strongly related to wine type, with a p-value of 0.00037. Its odds ratio (2.01) is greater than 1, indicating that each one-point increase in quality roughly doubles the odds of the wine being white.
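
To make this odds-ratio reading concrete, the sketch below (reusing model and data from above; the quality values 5 and 6 are arbitrary) confirms that raising quality by one point multiplies the odds of being white by exp(coef), i.e., about 2.01:

# Odds of "white" before and after a one-point increase in quality,
# holding the other predictors at their sample means (illustrative only).
base  <- data.frame(density   = mean(data$density),
                    sulphates = mean(data$sulphates),
                    alcohol   = mean(data$alcohol),
                    quality   = 5)
plus1 <- transform(base, quality = 6)

p0 <- predict(model, base,  type = "response")
p1 <- predict(model, plus1, type = "response")
(p1 / (1 - p1)) / (p0 / (1 - p0)) # equals exp(coef(model)["quality"]) ≈ 2.01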

For the model fit, McFadden’s R² yielded a value of 0.3868, which indicates a good model fit. According to commonly accepted benchmarks, McFadden’s R² values between 0.3 and 0.4 are considered indicative of models with a strong explanatory power in the context of logistic regression. Therefore, the model provides a satisfactory fit to the data and explains a substantial portion of the variation in the outcome variable.
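
McFadden's R² can also be reproduced by hand from the log-likelihoods of the fitted and null models; a minimal sketch:

# McFadden's R^2 = 1 - logLik(full model) / logLik(null model);
# should match the 0.3868 reported by pscl::pR2() above.
null_model <- glm(type ~ 1, data = data, family = "binomial")
1 - as.numeric(logLik(model)) / as.numeric(logLik(null_model))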

Summary

In general, all independent variables (density, sulphates, alcohol, and quality) were significant predictors of wine type. Higher levels of density, sulphates, and alcohol were linked with an increased likelihood of the wine being red, while higher quality ratings were linked with an increased likelihood of the wine being white. These observations point to distinct physicochemical features that can distinguish red from white wines.
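
Although not part of the original write-up, a quick in-sample classification check at a 0.5 probability cutoff (a sketch; a held-out test set would give a less optimistic estimate) offers a complementary view of model performance:

# In-sample confusion matrix and accuracy at a 0.5 cutoff (illustrative).
pred_prob  <- predict(model, type = "response")
pred_class <- factor(ifelse(pred_prob > 0.5, 1, 0), levels = c(0, 1))
table(Predicted = pred_class, Actual = data$type)
mean(pred_class == data$type) # overall in-sample accuracy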

Assumptions

Statistical assumptions are essential in assessing the validity of a model, particularly when logistic regression is applied. Logistic regression relies on several key assumptions to ensure reliable and interpretable results. These include: (1) a binary dependent variable, meaning the outcome must consist of two categories; (2) the absence of significant outliers, which can disproportionately influence the model; (3) no multicollinearity among independent variables, ensuring each predictor contributes uniquely to the model; (4) independence of observations, meaning that the observations are not related or clustered; and (5) linearity of the logit, where the log odds of the outcome should have a linear relationship with continuous predictors. Verifying these assumptions helps confirm the robustness and validity of the logistic regression analysis.

Binary Dependent Variable

As discussed earlier, the dependent variable in this analysis consists of two distinct categories: red wine and white wine. This satisfies one of the key assumptions of logistic regression, which requires the outcome variable to be binary. A binary outcome ensures that the model can estimate the probability of one category occurring relative to the other. In this case, the model predicts the likelihood of a wine being white (or red), given the values of the independent variables. This setup confirms that the data structure is appropriate for applying binary logistic regression.
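
Because the interpretation of every coefficient hinges on which level glm() treats as the event, it is worth confirming the factor coding explicitly; a quick sketch:

# glm() models the probability of the second factor level, so with
# levels c("0", "1") the fitted model predicts P(type = white).
levels(data$type)    # "0" (red) is the reference; "1" (white) is the event
contrasts(data$type) # dummy coding: red = 0, white = 1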

Absence of Significant Outliers

cook_values <- cooks.distance(model) # Cook's distance for every observation
outliers <- which(cook_values > 1)

if (length(outliers) > 0) {
  cat("Influential outliers detected at the following observations:\n")
  print(outliers)
} else {
  cat("No influential outliers detected based on Cook's Distance > 1.\n")
}
No influential outliers detected based on Cook's Distance > 1.

Upon conducting a formal test for outliers using Cook’s Distance, the results indicate that no observation in the dataset has a Cook’s Distance value greater than 1. This suggests that there are no influential outliers present in the model, as values exceeding 1 are typically considered indicative of observations that exert a disproportionately large influence on the regression coefficients. The absence of such high Cook’s Distance values supports the stability and robustness of the model estimates, indicating that no single observation unduly influences the overall model fit.
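
Influence can also be inspected visually; the sketch below plots Cook's distance for every observation against the conventional threshold of 1:

# Visual check: Cook's distance per observation; spikes crossing the
# dashed line at 1 would flag influential cases.
cd <- cooks.distance(model)
plot(cd, type = "h", col = "steelblue",
     xlab = "Observation index", ylab = "Cook's distance")
abline(h = 1, lty = 2, col = "red")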

No Multicollinearity

# VIF can be computed directly on the fitted glm object; no refit is needed
car::vif(model) %>% pander()
density sulphates alcohol quality
2.243 1.062 2.595 1.3

A formal assessment of multicollinearity was performed using the Variance Inflation Factor (VIF). The analysis showed that none of the predictors had a VIF value exceeding 10, which is the commonly accepted threshold for indicating problematic multicollinearity. This finding implies that the independent variables in the model are not highly correlated with one another. The absence of elevated VIF values supports the interpretability and reliability of the model, indicating that each predictor contributes unique information without redundancy.

Independence of Observations

durbinWatsonTest(model) %>% print()
 lag Autocorrelation D-W Statistic p-value
   1     -0.02210985      2.043748   0.652
 Alternative hypothesis: rho != 0

A formal test for independence of observations was carried out with the Durbin-Watson (DW) test. The DW statistic was roughly 2 (2.04, p = 0.652), indicating no notable autocorrelation in the residuals. Because values close to 2 indicate independence, while values well below 2 suggest positive autocorrelation and values well above 2 suggest negative autocorrelation, this supports treating the observations in the dataset as independent.

Linearity of Logit Function

log_resid <- model$residuals # working residuals from the fitted model
plot(model$linear.predictors, log_resid,
     xlab = "Linear Predictors", 
     ylab = "Residuals",
     col = "black",
     bg = "steelblue",
     pch = 21)

To verify this assumption, a plot of the residuals against the linear predictor was inspected. The plot showed no visible pattern or systematic structure, indicating that the relationship between the predictors and the logit of the dependent variable is approximately linear. The absence of curvature in the residual plot supports the linearity assumption and suggests that the model captures the relationship between the predictors and the outcome well.
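
As a supplementary, more formal check of this assumption, a Box-Tidwell-style test can be run by adding predictor-by-log(predictor) terms for the strictly positive continuous predictors; non-significant interaction terms are consistent with linearity of the logit. A sketch under those assumptions:

# Box-Tidwell-style check (sketch): a significant x*log(x) term would
# signal a non-linear relationship between that predictor and the logit.
# density is omitted because its values sit in a narrow band around 1,
# making x*log(x) nearly collinear with x and the check uninformative.
bt_model <- glm(type ~ density + sulphates + alcohol + quality +
                  I(sulphates * log(sulphates)) +
                  I(alcohol * log(alcohol)),
                data = data, family = "binomial")
summary(bt_model)$coefficients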