White Wine Quality Modelling
Group L01G08 – Data Science Project
Diana, Gary Liu, Jason, Esra, Henry Dai
1. Predicting white wine quality using chemical properties
Our goal is to model white wine quality using chemical properties. We aim to find which factors most influence expert ratings and ensure the model is both statistically valid and practically meaningful for winemakers.
2. Data overview
'data.frame': 4898 obs. of 12 variables:
$ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
$ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
$ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
$ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
$ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
$ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
$ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
$ density : num 1.001 0.994 0.995 0.996 0.996 ...
$ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
$ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
$ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
$ quality : int 6 6 6 6 6 6 6 6 6 6 ...
Data set: White Wine Quality Data Set (UCI ML Repository); Observations: 4,898 samples; Variables: 12 (11 predictors + 1 response variable).
3. Variable Structure
| Group | Variables | Notes |
| --- | --- | --- |
| Acidity | fixed.acidity, volatile.acidity, citric.acid | sourness |
| Sweetness | residual.sugar | g/L |
| Sulphur | free.sulfur.dioxide, total.sulfur.dioxide, sulphates | preservatives |
| Physical | density, pH | structure |
| Others | chlorides, alcohol | saltiness, ethanol % |
4. Data Cleaning
[1] "fixed_acidity" "volatile_acidity" "citric_acid"
[4] "residual_sugar" "chlorides" "free_sulfur_dioxide"
[7] "total_sulfur_dioxide" "density" "pH"
[10] "sulphates" "alcohol" "quality"
No missing values found. Duplicates (~937) were removed, leaving around 3960 unique wine samples. Variable names have been simplified for easier use in later modelling.
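The cleaning steps described above can be sketched in base R roughly as follows. This is a minimal illustration, not the group's actual script: the data frame `wine` is a tiny synthetic stand-in with made-up rows, and the variable names `n_missing`, `n_duplicates`, and `wine_clean` are assumptions introduced here.

```r
# Tiny synthetic stand-in for the wine data (illustrative values only).
wine <- data.frame(
  fixed.acidity    = c(7.0, 6.3, 8.1, 7.0, 6.3),
  volatile.acidity = c(0.27, 0.30, 0.28, 0.27, 0.30),
  quality          = c(6L, 6L, 6L, 6L, 6L)
)

# Simplify variable names: replace dots with underscores.
names(wine) <- gsub(".", "_", names(wine), fixed = TRUE)

# Count missing values across the whole data frame.
n_missing <- sum(is.na(wine))

# Count exact duplicate rows, then keep only the first copy of each row.
n_duplicates <- sum(duplicated(wine))
wine_clean   <- wine[!duplicated(wine), ]

cat("missing:", n_missing,
    " duplicates:", n_duplicates,
    " rows kept:", nrow(wine_clean), "\n")
```

Note that `duplicated()` flags only the later copies of a repeated row, so the first occurrence of each wine is retained.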
5. Summary Statistics
fixed_acidity volatile_acidity citric_acid residual_sugar
Min. : 3.800 Min. :0.0800 Min. :0.0000 Min. : 0.600
1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700 1st Qu.: 1.600
Median : 6.800 Median :0.2600 Median :0.3200 Median : 4.700
Mean : 6.839 Mean :0.2805 Mean :0.3343 Mean : 5.915
3rd Qu.: 7.300 3rd Qu.:0.3300 3rd Qu.:0.3900 3rd Qu.: 8.900
Max. :14.200 Max. :1.1000 Max. :1.6600 Max. :65.800
chlorides free_sulfur_dioxide total_sulfur_dioxide density
Min. :0.00900 Min. : 2.00 Min. : 9.0 Min. :0.9871
1st Qu.:0.03500 1st Qu.: 23.00 1st Qu.:106.0 1st Qu.:0.9916
Median :0.04200 Median : 33.00 Median :133.0 Median :0.9935
Mean :0.04591 Mean : 34.89 Mean :137.2 Mean :0.9938
3rd Qu.:0.05000 3rd Qu.: 45.00 3rd Qu.:166.0 3rd Qu.:0.9957
Max. :0.34600 Max. :289.00 Max. :440.0 Max. :1.0390
pH sulphates alcohol quality
Min. :2.720 Min. :0.2200 Min. : 8.00 Min. :3.000
1st Qu.:3.090 1st Qu.:0.4100 1st Qu.: 9.50 1st Qu.:5.000
Median :3.180 Median :0.4800 Median :10.40 Median :6.000
Mean :3.195 Mean :0.4904 Mean :10.59 Mean :5.855
3rd Qu.:3.290 3rd Qu.:0.5500 3rd Qu.:11.40 3rd Qu.:6.000
Max. :3.820 Max. :1.0800 Max. :14.20 Max. :9.000
Alcohol: 8–14%; Quality: mostly 5–7; Density ≈ 0.99
7. Quality Distribution
![]()
The distribution is narrow: the median is 6 and the interquartile range is 5 to 6, so most wines are of average quality.
8. Pairwise Relationships
![]()
Higher alcohol content tends to go with higher wine quality, while higher density and volatile acidity tend to go with lower quality.
9. Correlation Heatmap
![]()
Correlation with quality: alcohol (+0.44), density (−0.31), volatile acidity (−0.19).
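Correlations like these can be read off the `quality` column of the correlation matrix. The sketch below assumes a numeric data frame and uses simulated data, so the numbers it prints will not match the slide; the data-generating values and the names `cor_with_quality` and `sorted` are illustrative assumptions.

```r
# Simulated stand-in data (NOT the wine data): density is built to fall
# as alcohol rises, and quality to rise with alcohol.
set.seed(1)
n <- 200
alcohol <- rnorm(n, mean = 10.5, sd = 1.2)
density <- 1.0 - 0.001 * alcohol + rnorm(n, sd = 0.001)
quality <- 3 + 0.3 * alcohol + rnorm(n, sd = 0.8)
wine <- data.frame(alcohol, density, quality)

# Pearson correlation of every predictor with quality.
cor_with_quality <- cor(wine)[, "quality"]
cor_with_quality <- cor_with_quality[names(cor_with_quality) != "quality"]

# Sort by absolute strength so the most related predictors come first.
sorted <- cor_with_quality[order(abs(cor_with_quality), decreasing = TRUE)]
print(round(sorted, 2))
```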
10. Relationship Between Key Predictors and Quality
![]()
Higher alcohol is associated with higher quality, while higher volatile acidity and density are associated with lower quality; moderate sulphates slightly enhance freshness and ratings.
11. Model Summary — Backward Selection
Call:
lm(formula = quality ~ fixed_acidity + volatile_acidity + residual_sugar +
free_sulfur_dioxide + density + pH + sulphates + alcohol,
data = train_data)
Residuals:
Min 1Q Median 3Q Max
-3.8957 -0.4693 -0.0312 0.4780 2.8512
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.149e+02 2.622e+01 8.197 3.54e-16 ***
fixed_acidity 1.114e-01 2.696e-02 4.131 3.71e-05 ***
volatile_acidity -1.677e+00 1.342e-01 -12.496 < 2e-16 ***
residual_sugar 9.917e-02 1.004e-02 9.879 < 2e-16 ***
free_sulfur_dioxide 3.578e-03 8.228e-04 4.349 1.41e-05 ***
density -2.162e+02 2.654e+01 -8.145 5.40e-16 ***
pH 1.001e+00 1.344e-01 7.449 1.21e-13 ***
sulphates 6.876e-01 1.237e-01 5.559 2.94e-08 ***
alcohol 1.187e-01 3.483e-02 3.408 0.000662 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.7484 on 3159 degrees of freedom
Multiple R-squared: 0.3, Adjusted R-squared: 0.2983
F-statistic: 169.3 on 8 and 3159 DF, p-value: < 2.2e-16
Backward stepwise selection kept 8 predictors: alcohol (+) is associated with higher quality, while density (−) and volatile acidity (−) are associated with lower quality.
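A backward selection of this kind can be sketched with base R's `step()`. The example below is only an illustration on simulated data: the 80/20 split, the seed, and the column `noise_var` are assumptions made here, not details confirmed by the slides.

```r
# Simulated stand-in data (NOT the wine data).
set.seed(42)
n <- 300
wine <- data.frame(
  alcohol          = rnorm(n, mean = 10.5, sd = 1.2),
  volatile_acidity = rnorm(n, mean = 0.28, sd = 0.10),
  noise_var        = rnorm(n)   # unrelated to quality by construction
)
wine$quality <- 3 + 0.35 * wine$alcohol - 1.8 * wine$volatile_acidity +
  rnorm(n, sd = 0.7)

# Assumed 80/20 train/test split.
train_idx  <- sample(seq_len(n), size = floor(0.8 * n))
train_data <- wine[train_idx, ]
test_data  <- wine[-train_idx, ]

# Start from the full model and let step() drop terms while AIC improves.
full_model     <- lm(quality ~ ., data = train_data)
backward_model <- step(full_model, direction = "backward", trace = 0)

print(formula(backward_model))
```

With `trace = 0` the intermediate steps are not printed; leaving `trace` at its default shows which term is dropped at each step.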
12. Coefficient Plot
![]()
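Alcohol has a clear positive effect, while density and volatile acidity show clear negative effects. Raw coefficient sizes depend on each predictor's units (density varies only between about 0.99 and 1.04), so magnitudes are comparable only on a standardised scale. The direction of the coefficients aligns with physical chemistry expectations.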
R-squared (Test set): 0.2779
RMSE ≈ 0.74, R² ≈ 0.28 on test data. This indicates moderate predictive power: chemical variables explain a little under 30% of the variation in wine quality. It also suggests that factors not captured by these measurements (e.g., aroma, taste) influence expert ratings.
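For reference, test-set RMSE and R² can be computed from predictions as in the sketch below. The helper names `rmse` and `r_squared` and the five example values are assumptions for illustration; R² is defined here as one minus the ratio of residual to total sum of squares, which may differ slightly from the definition used for the figure above.

```r
# Root mean squared error of predictions.
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# R-squared as 1 - SS_residual / SS_total.
r_squared <- function(actual, predicted) {
  1 - sum((actual - predicted)^2) / sum((actual - mean(actual))^2)
}

# Made-up quality scores and predictions, for illustration only.
actual    <- c(5, 6, 6, 7, 5)
predicted <- c(5.5, 5.8, 6.1, 6.4, 5.2)

cat("RMSE:", round(rmse(actual, predicted), 3), "\n")
cat("R-squared:", round(r_squared(actual, predicted), 3), "\n")
```

In practice `predicted` would come from something like `predict(model, newdata = test_data)`.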
14. Q–Q Plot (Normality Check)
![]()
Residuals closely follow the red reference line → the normality assumption is reasonably satisfied. No major outliers are present, suggesting the model fits well.
15. Residual Plot
![]()
Residuals are randomly scattered around 0 → indicates homoscedasticity (constant variance) and no obvious pattern. Model assumptions hold, and linearity is acceptable.
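Both diagnostic checks (the Q–Q plot and the residual plot) can be produced with base graphics. The sketch below fits a toy model on simulated data, so `fit`, `x`, and `y` are placeholders rather than the wine model.

```r
# Toy linear model on simulated data (NOT the wine model).
set.seed(7)
x <- rnorm(200)
y <- 1 + 2 * x + rnorm(200)
fit <- lm(y ~ x)
res <- resid(fit)

# Normality check: sample quantiles of residuals vs normal quantiles.
qqnorm(res, main = "Normal Q-Q plot of residuals")
qqline(res, col = "red")   # reference line through the quartiles

# Constant-variance check: residuals vs fitted values.
plot(fitted(fit), res,
     xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs fitted")
abline(h = 0, lty = 2)     # residuals should scatter evenly around 0
```

A funnel shape in the second plot would point to non-constant variance, and curvature would point to a missed non-linear relationship.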
16. AIC and BIC Model Comparison
![]()
AIC suggests the backward model is efficient. BIC favors a slightly simpler model with ~6 predictors. Key predictors across all criteria: alcohol, volatile acidity, sulphates, and residual sugar.
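One way to run such a comparison is to call `step()` twice, once with the default AIC penalty and once with the BIC penalty `k = log(n)`. The sketch uses simulated data in which `x3` is unrelated to the response; all names and values are illustrative assumptions.

```r
# Simulated stand-in data (NOT the wine data).
set.seed(3)
n <- 250
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
d$y <- 1 + 0.8 * d$x1 + 0.5 * d$x2 + rnorm(n)

full <- lm(y ~ x1 + x2 + x3, data = d)

# Backward selection under AIC (penalty k = 2, the default) ...
aic_model <- step(full, direction = "backward", trace = 0)
# ... and under BIC (penalty k = log(n)), which penalises size more heavily.
bic_model <- step(full, direction = "backward", trace = 0, k = log(n))

models <- list(full = full, aic_backward = aic_model, bic_backward = bic_model)
comparison <- data.frame(
  n_predictors = sapply(models, function(m) length(coef(m)) - 1),
  AIC          = sapply(models, AIC),
  BIC          = sapply(models, BIC)
)
print(round(comparison, 1))
```

Lower values are better for both criteria; because BIC's penalty grows with the sample size, it tends to settle on the smaller model.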
17. Robustness and Coefficients
Robust coefficient tests confirm alcohol has the strongest positive effect, while volatile acidity has a significant negative impact.
t test of coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.3810628 0.2224940 10.7017 < 2.2e-16 ***
alcohol 0.3848760 0.0127889 30.0945 < 2.2e-16 ***
volatile_acidity -1.8577825 0.1397446 -13.2941 < 2.2e-16 ***
residual_sugar 0.0205266 0.0036140 5.6797 1.472e-08 ***
free_sulfur_dioxide 0.0036422 0.0015832 2.3006 0.0214782 *
fixed_acidity -0.0775894 0.0166583 -4.6577 3.330e-06 ***
sulphates 0.4154264 0.1208911 3.4364 0.0005972 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
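To show what a robust coefficient test does, the sketch below computes heteroscedasticity-consistent (HC0, "sandwich") standard errors by hand in base R. The slides do not say which robust estimator was used, so HC0 is an assumption here, and the data are simulated with noise that grows with `x`.

```r
# Simulated data with non-constant error variance (NOT the wine data).
set.seed(11)
n <- 300
x <- runif(n, min = 8, max = 14)
y <- 2 + 0.4 * x + rnorm(n, sd = 0.2 * (x - 7))
fit <- lm(y ~ x)

X <- model.matrix(fit)   # design matrix, including the intercept column
e <- resid(fit)

# Sandwich estimator: (X'X)^-1  X' diag(e^2) X  (X'X)^-1
bread    <- solve(crossprod(X))
meat     <- crossprod(X * e)          # each row of X scaled by its residual
vcov_hc0 <- bread %*% meat %*% bread

robust_se <- sqrt(diag(vcov_hc0))
t_value   <- coef(fit) / robust_se
p_value   <- 2 * pt(abs(t_value), df = df.residual(fit), lower.tail = FALSE)

print(cbind(Estimate = coef(fit), `Robust SE` = robust_se,
            `t value` = t_value, `Pr(>|t|)` = p_value))
```

The coefficient estimates are unchanged from ordinary least squares; only the standard errors, and therefore the t and p values, differ.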
18. Limitations
Quality scores are subjective, based on human taste panels
Dataset includes only chemical properties, not sensory data
Linear model assumes additive effects between predictors
Possible non-linear relationships are not captured
While the model is statistically sound, its predictive power is moderate: real-world wine quality also depends on taste, aroma, and balance, not just chemistry.
19. Future Improvements
Try non-linear models such as Random Forest or SVM to capture complex relationships
Include sensory and aroma data to better represent human perception of taste
Use cross-validation for stronger model validation and generalization
Compare with the red wine dataset to explore broader wine quality patterns
Interpretation:
Future studies should combine chemical and sensory data to build a model that better reflects real-world wine quality.
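As an example of the cross-validation suggestion above, a k-fold procedure for a linear model can be sketched in base R. The function `cv_rmse`, its arguments, and the simulated data are hypothetical names and values introduced for illustration.

```r
# k-fold cross-validated RMSE for a linear model (hypothetical helper).
cv_rmse <- function(formula, data, k = 5, seed = 1) {
  set.seed(seed)
  # Randomly assign each row to one of k roughly equal folds.
  folds <- sample(rep(seq_len(k), length.out = nrow(data)))
  response <- all.vars(formula)[1]
  fold_rmse <- numeric(k)
  for (i in seq_len(k)) {
    train <- data[folds != i, , drop = FALSE]
    test  <- data[folds == i, , drop = FALSE]
    fit   <- lm(formula, data = train)          # fit on k - 1 folds
    pred  <- predict(fit, newdata = test)       # predict the held-out fold
    fold_rmse[i] <- sqrt(mean((test[[response]] - pred)^2))
  }
  fold_rmse
}

# Simulated stand-in data (NOT the wine data).
set.seed(5)
d <- data.frame(alcohol = rnorm(200, mean = 10.5, sd = 1.2))
d$quality <- 3 + 0.3 * d$alcohol + rnorm(200, sd = 0.7)

scores <- cv_rmse(quality ~ alcohol, d, k = 5)
cat("mean CV RMSE:", round(mean(scores), 3), "\n")
```

Averaging the error over several held-out folds gives a steadier estimate of out-of-sample performance than a single train/test split.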
20. Thank You 🍷
Group L01G08
Key Conclusion:
- Alcohol improves quality.
- Volatile acidity and density reduce it.
- Balanced sulphates help preserve freshness.
Ready for Q&A