2026-03-15

Dataset Overview & Source

Wine Quality Dataset
We analyze physicochemical properties of 1,599 red wine samples to see which attributes are associated with wine quality, alcohol level, density, and so on.

Variables
- 1. fixed acidity (tartaric acid - g/dm^3) Acids that do not evaporate readily in wine.
- 2. volatile acidity (acetic acid - g/dm^3) Acids that evaporate readily in wine.
- 3. citric acid (g/dm^3) Acid present in grapes.
- 4. residual sugar (g/dm^3) Remaining sugar not fermented.
- 5. chlorides (NaCl - g/dm^3) Amount of salt in wine.
- 6. free sulfur dioxide (mg/dm^3) Sulfur that protects the wine from going bad.

Dataset Overview & Source (cont.)

Variables (cont.)
- 7. total sulfur dioxide (mg/dm^3) Sum total of free and bound sulfur dioxide.
- 8. density (g/cm^3) Density of wine.
- 9. pH Measure of acidity in wine.
- 10. sulphates (g/dm^3) Potassium sulfate added for better wine quality.
- 11. alcohol (% by vol) How strong the wine is in alcohol.
- 12. quality (0-10) Quality of wine.

Dataset Source & Citation
Website: Kaggle
Author: UCI ML Repository
Link: https://www.kaggle.com/datasets/uciml/red-wine-quality-cortez-et-al-2009
Full Citation: P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009.

Data Preparation

# Libraries
library(ggplot2)
library(dplyr)
library(plotly)
library(tidyr)

# Dataset + any cleaning necessary
wine_df = read.csv("winequality-red.csv", sep = ",", header = TRUE)
amt_of_na = sum(is.na(wine_df)) # Was 0
wine_df = wine_df %>% drop_na() # Had no effect.

# I sample a subset of the wine data for plots to reference later. 
# This is to make visualization less overwhelming for scatterplots.
set.seed(123)
sample_wine_df = wine_df |>
  slice_sample(n=300, replace=FALSE)

Data Head

Example data to get a general idea of the data.

##   fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1           7.4             0.70        0.00            1.9     0.076
## 2           7.8             0.88        0.00            2.6     0.098
## 3           7.8             0.76        0.04            2.3     0.092
## 4          11.2             0.28        0.56            1.9     0.075
## 5           7.4             0.70        0.00            1.9     0.076
## 6           7.4             0.66        0.00            1.8     0.075
##   free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates alcohol
## 1                  11                   34  0.9978 3.51      0.56     9.4
## 2                  25                   67  0.9968 3.20      0.68     9.8
## 3                  15                   54  0.9970 3.26      0.65     9.8
## 4                  17                   60  0.9980 3.16      0.58     9.8
## 5                  11                   34  0.9978 3.51      0.56     9.4
## 6                  13                   40  0.9978 3.51      0.56     9.4
##   quality
## 1       5
## 2       5
## 3       5
## 4       6
## 5       5
## 6       5

3D Plotly Scatterplot: Relating residual sugar and alcohol content to wine density

3D Plotly Analysis

Observations

- Alcohol: Alcohol level seems to be negatively correlated with wine density; as alcohol goes up, density goes down.

- Residual Sugar: Residual sugar seems to be positively correlated with wine density; as sugar goes up, density goes up.

- Issues: Few samples have high residual sugar. Most sit around 2-3 g/dm^3, which makes it hard to conclude that sugar really does increase wine density.

Furthermore, I do not know much about wine, so I cannot say how meaningful, if at all, the difference is between a wine with density 0.99 g/cm^3 and one with density 1 g/cm^3.
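One way to put numbers on these two trends is to regress density on both variables at once, so that each slope is read with the other variable held fixed. The sketch below is not part of the original slides: `fit_density` is a hypothetical helper name, and it assumes the column names shown in the data head and that `winequality-red.csv` sits in the working directory.

```r
# Hypothetical helper (not from the original analysis): regress density on
# alcohol and residual sugar so each slope is read with the other held fixed.
fit_density <- function(df) {
  lm(density ~ alcohol + residual.sugar, data = df)
}

# Only runs when the CSV used in Data Preparation is available.
if (file.exists("winequality-red.csv")) {
  wine_df <- read.csv("winequality-red.csv")
  fit <- fit_density(wine_df)
  print(coef(fit))                                     # sign and size of each slope
  print(cor(wine_df$alcohol, wine_df$density))         # pairwise correlation
  print(cor(wine_df$residual.sugar, wine_df$density))  # pairwise correlation
}
```

A negative alcohol slope and a positive residual sugar slope would agree with what the 3D plot suggests. The slopes are in g/cm^3 per unit of each predictor, which also gives a sense of how large a 0.01 g/cm^3 difference in density is.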

ggplot Boxplot: Volatile acid in wine separated by quality

ggplot Box Plot Analysis

Observations

- Volatile Acid: Lower quality levels of wine seem to have more volatile acid as suggested by higher medians. This makes sense since “too high of levels [of volatile acid] can lead to an unpleasant, vinegar taste” (Cortez et al. 2009).

- Quality: Higher quality wine (7-8) looks like it has a median volatile acidity of around 0.4 g/dm^3 or slightly less. Higher quality wines also have shorter whiskers than lower quality wines; in other words, their volatile acidity is less spread out. This could suggest something about overall quality control in higher quality wine.

Again, though, there are not many data points at the higher quality levels, so it is hard to draw a firm conclusion without more samples.
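The medians and spreads read off the box plot can be tabulated directly, together with the number of samples behind each box. This is a minimal sketch in base R, not the code behind the original plot; `summarise_volatile` is a hypothetical helper name, and the column names are assumed to match the data head.

```r
# Hypothetical helper: per-quality sample size, median, and IQR of volatile acidity.
summarise_volatile <- function(df) {
  groups <- split(df$volatile.acidity, df$quality)
  data.frame(
    quality = as.integer(names(groups)),
    n       = vapply(groups, length, integer(1)),   # samples behind each box
    median  = vapply(groups, median, numeric(1)),   # centre line of each box
    iqr     = vapply(groups, IQR, numeric(1)),      # height of each box
    row.names = NULL
  )
}

# Only runs when the CSV used in Data Preparation is available.
if (file.exists("winequality-red.csv")) {
  wine_df <- read.csv("winequality-red.csv")
  print(summarise_volatile(wine_df))
}
```

The `n` column makes the caveat about small groups concrete: a narrow box built on very few samples is weaker evidence than the same box built on several hundred.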

ggplot Bar Graph: Averages of different factors by quality in wine

ggplot Bar Plot Analysis

Observations

- Irrelevant Variables: Properties like density, pH, residual sugar, and fixed acidity do not seem to matter much for wine quality. Regardless of quality, the averages of these variables stay relatively consistent.

- Relevant Variables: Volatile acidity, free and total sulfur dioxide, sulphates, alcohol, chlorides, and citric acid seem to be more relevant to wine quality. Some of these appear to have a positive relationship with quality, some a negative one, and for some the direction is harder to tell.
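Because the variables are on very different scales, it can help to express each per-quality average relative to that variable's overall average, so that 1.0 means "typical" for every variable. The following is a sketch under that assumption, not the code behind the original bar graph; `relative_means` is a hypothetical helper name.

```r
# Hypothetical helper: mean of every variable by quality, divided by the
# variable's overall mean (1.0 = same as the overall average).
relative_means <- function(df) {
  means <- aggregate(. ~ quality, data = df, FUN = mean)
  vars <- setdiff(names(means), "quality")
  for (v in vars) {
    means[[v]] <- means[[v]] / mean(df[[v]])
  }
  means
}

# Only runs when the CSV used in Data Preparation is available.
if (file.exists("winequality-red.csv")) {
  wine_df <- read.csv("winequality-red.csv")
  print(round(relative_means(wine_df), 2))
}
```

Columns that stay close to 1.0 across all quality levels correspond to the "irrelevant" group above, while columns that drift steadily up or down correspond to the "relevant" group.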

ggplot Scatterplot: Free Sulfur Dioxide vs Total Sulfur Dioxide by Quality

plotly Scatter Plot Analysis

Observations

- Relationship: It would seem that free SO2 and total SO2 are positively related. Generally, as free SO2 increases, total SO2 increases too.

- Relationship (cont.): Although free SO2 and total SO2 appear positively related, the vertical spread (at a given x value, y can take a wide range of values) suggests that total SO2 depends on free SO2 plus one or more other variables. Cursory Google searches say that bound SO2 is the other component of total SO2; this graph would suggest that the ratio of bound to free SO2 is not constant.
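The variable list defines total SO2 as the sum of free and bound SO2, so bound SO2 can be recovered by subtraction. As a small worked example, the sketch below does this for the six rows printed on the Data Head slide; the column names `bound` and `free_share` are introduced here for illustration only.

```r
# First six rows of the dataset, copied from the Data Head slide.
so2 <- data.frame(
  free  = c(11, 25, 15, 17, 11, 13),  # free.sulfur.dioxide (mg/dm^3)
  total = c(34, 67, 54, 60, 34, 40)   # total.sulfur.dioxide (mg/dm^3)
)

# Assumption: total SO2 = free SO2 + bound SO2, so bound is the difference.
so2$bound <- so2$total - so2$free

# Share of total SO2 that is free, per sample.
so2$free_share <- so2$free / so2$total

print(so2)
```

Even in these six rows the free share ranges from about 28% to about 37% of total SO2, which is consistent with the vertical spread seen in the scatterplot.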

Statistical Analysis (cont.): Multiple Linear Regression on Relevant Factors

# Multiple linear regression of quality on volatile acidity, alcohol, total sulfur dioxide, and sulphates
model = lm(quality ~ volatile.acidity + alcohol 
           + total.sulfur.dioxide + sulphates, data = wine_df)
# Summarize coefficients and relevant p-values
summary(model)$coefficients
##                          Estimate   Std. Error    t value     Pr(>|t|)
## (Intercept)           2.825812794 0.2006891701  14.080545 1.580503e-42
## volatile.acidity     -1.198563212 0.0966011054 -12.407345 8.213997e-34
## alcohol               0.295310476 0.0160331035  18.418797 7.735842e-69
## total.sulfur.dioxide -0.002235398 0.0005107762  -4.376472 1.284518e-05
## sulphates             0.712139597 0.1005146293   7.084935 2.080348e-12
summary(model)$adj.r.squared
## [1] 0.3421357

Multiple Linear Regression Model Interpretation

Interpretation:

- p-val for slope(s): The p-vals for the slopes of volatile acidity, alcohol, total sulfur dioxide, and sulphates are all small enough to reject the null hypothesis of a zero slope at any reasonable value of alpha. This suggests that the relationship between each of these variables and wine quality is statistically significant.

- r^2 value: The adjusted R^2 value of 0.342 means about 34.2% of the variation in wine quality is accounted for by these four wine attributes, after adjusting for the number of predictors.

- Key Insights & Conclusions: Important factors in wine quality include volatile acidity, alcohol content, total sulfur dioxide, and sulphates. There might be more, but a more complex multiple regression model would need to be fit in order to know.
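To see what the fitted coefficients mean in practice, they can be applied by hand to a single wine. The sketch below uses the estimates from the regression output above and the first row of the Data Head slide; the variable names `coefs`, `wine`, and `predicted` are introduced here for illustration.

```r
# Coefficients copied from the summary(model)$coefficients output above.
coefs <- c(
  intercept            =  2.825812794,
  volatile.acidity     = -1.198563212,
  alcohol              =  0.295310476,
  total.sulfur.dioxide = -0.002235398,
  sulphates            =  0.712139597
)

# First row of the Data Head slide (observed quality: 5).
wine <- c(volatile.acidity = 0.70, alcohol = 9.4,
          total.sulfur.dioxide = 34, sulphates = 0.56)

# Prediction = intercept + sum of (slope * value) over the four predictors.
predicted <- coefs[["intercept"]] + sum(coefs[names(wine)] * wine)
print(round(predicted, 2))
```

The prediction comes out at about 5.09 against an observed quality of 5. Read the same way, each additional percentage point of alcohol adds about 0.30 to predicted quality, and each additional g/dm^3 of volatile acidity subtracts about 1.2, with the other predictors held fixed.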

Limitations

  • One issue noticed during analysis is that there are no samples with a quality of 0, 1, 2, 9, or 10. Because of this, we might not see the full picture of how the properties of wine affect quality.

  • Furthermore, most of the samples are clustered in the 5-6 quality levels, with few samples at 4 and 7 and next to none at 3 and 8. Thus it might be hard to generalize which variables truly affect wine quality to larger data sets.

  • Without further investigation into how the dataset scores quality, quality is a somewhat nebulous measure of how good a wine is. There are some objective factors of good wine, yes, but taste is probably very subjective from person to person.
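The imbalance across quality levels can be checked with a simple count. This is a minimal sketch, not part of the original analysis: `count_quality` is a hypothetical helper name, and the full 0-10 scale is passed explicitly so that levels with no samples show up as zeros instead of being dropped.

```r
# Hypothetical helper: number of samples at every level of the 0-10 scale.
count_quality <- function(df, levels = 0:10) {
  table(factor(df$quality, levels = levels))
}

# Only runs when the CSV used in Data Preparation is available.
if (file.exists("winequality-red.csv")) {
  wine_df <- read.csv("winequality-red.csv")
  print(count_quality(wine_df))
}
```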