White Wine Quality Modelling

Group L01G08 – Data Science Project

Diana, Gary Liu, Jason, Esra, Henry Dai

1. Predicting white wine quality using chemical properties

Our goal is to model white wine quality using chemical properties. We aim to find which factors most influence expert ratings and ensure the model is both statistically valid and practically meaningful for winemakers.

2. Data overview

'data.frame':   4898 obs. of  12 variables:
 $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
 $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
 $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
 $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
 $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
 $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
 $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
 $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
 $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
 $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
 $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
 $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...
[1] 1021906

Data set: White Wine Quality Data Set (UCI ML Repository); Observations: 4,898 samples; Variables: 12 (11 predictors + 1 response variable).

3. Variable Structure

Category   Variables                                              Description
Acidity    fixed.acidity, volatile.acidity, citric.acid           sourness
Sweetness  residual.sugar                                         g/L
Sulphur    free.sulfur.dioxide, total.sulfur.dioxide, sulphates   preservatives
Physical   density, pH                                            structure
Others     chlorides, alcohol                                     saltiness, ethanol %

4. Data Cleaning

[1] 0
[1] 937
[1] 3961
 [1] "fixed_acidity"        "volatile_acidity"     "citric_acid"         
 [4] "residual_sugar"       "chlorides"            "free_sulfur_dioxide" 
 [7] "total_sulfur_dioxide" "density"              "pH"                  
[10] "sulphates"            "alcohol"              "quality"             

No missing values were found. The 937 duplicate rows were removed, leaving 3,961 unique wine samples. Variable names were simplified (dots replaced with underscores) for easier use in later modelling.
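The cleaning steps can be sketched in base R as follows. This is a minimal illustration on a tiny made-up data frame (the `wine` object and its values are hypothetical stand-ins, not the real file), and the original analysis may have used different calls.

```r
# Toy stand-in for the wine data (hypothetical values, not the real data set)
wine <- data.frame(
  fixed.acidity    = c(7.0, 6.3, 8.1, 7.0, 6.3),
  volatile.acidity = c(0.27, 0.30, 0.28, 0.27, 0.30),
  quality          = c(6L, 6L, 6L, 6L, 6L)
)

n_missing    <- sum(is.na(wine))       # total number of missing cells
n_duplicates <- sum(duplicated(wine))  # rows identical to an earlier row

# Keep only the first occurrence of each row
wine_clean <- wine[!duplicated(wine), , drop = FALSE]

# Replace dots with underscores in the column names
names(wine_clean) <- gsub(".", "_", names(wine_clean), fixed = TRUE)

print(n_missing)
print(n_duplicates)
print(nrow(wine_clean))
print(names(wine_clean))
```

Printing the three counts and the new names mirrors the order of the console output shown on this slide.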

5. Summary Statistics

 fixed_acidity    volatile_acidity  citric_acid     residual_sugar  
 Min.   : 3.800   Min.   :0.0800   Min.   :0.0000   Min.   : 0.600  
 1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700   1st Qu.: 1.600  
 Median : 6.800   Median :0.2600   Median :0.3200   Median : 4.700  
 Mean   : 6.839   Mean   :0.2805   Mean   :0.3343   Mean   : 5.915  
 3rd Qu.: 7.300   3rd Qu.:0.3300   3rd Qu.:0.3900   3rd Qu.: 8.900  
 Max.   :14.200   Max.   :1.1000   Max.   :1.6600   Max.   :65.800  
   chlorides       free_sulfur_dioxide total_sulfur_dioxide    density      
 Min.   :0.00900   Min.   :  2.00      Min.   :  9.0        Min.   :0.9871  
 1st Qu.:0.03500   1st Qu.: 23.00      1st Qu.:106.0        1st Qu.:0.9916  
 Median :0.04200   Median : 33.00      Median :133.0        Median :0.9935  
 Mean   :0.04591   Mean   : 34.89      Mean   :137.2        Mean   :0.9938  
 3rd Qu.:0.05000   3rd Qu.: 45.00      3rd Qu.:166.0        3rd Qu.:0.9957  
 Max.   :0.34600   Max.   :289.00      Max.   :440.0        Max.   :1.0390  
       pH          sulphates         alcohol         quality     
 Min.   :2.720   Min.   :0.2200   Min.   : 8.00   Min.   :3.000  
 1st Qu.:3.090   1st Qu.:0.4100   1st Qu.: 9.50   1st Qu.:5.000  
 Median :3.180   Median :0.4800   Median :10.40   Median :6.000  
 Mean   :3.195   Mean   :0.4904   Mean   :10.59   Mean   :5.855  
 3rd Qu.:3.290   3rd Qu.:0.5500   3rd Qu.:11.40   3rd Qu.:6.000  
 Max.   :3.820   Max.   :1.0800   Max.   :14.20   Max.   :9.000  

Alcohol: 8–14.2%; Quality: mostly 5–7 (median 6); Density ≈ 0.99

7. Quality Distribution

Distribution is narrow: scores range from 3 to 9, but most wines are of average quality (5 to 7).

8. Pairwise Relationships

Higher alcohol content is associated with higher wine quality, while higher density and volatile acidity are associated with lower quality.

9. Correlation Heatmap

Correlation with quality: Alcohol (+0.44), Density (−0.31), Volatile acidity (−0.19)
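Correlations like these could be computed roughly as below. The sketch uses simulated data (the `toy` data frame and its effect sizes are assumptions made for illustration), so the numbers it prints will not match the slide.

```r
set.seed(1)
n <- 200

# Simulated stand-in for three predictors (hypothetical values)
toy <- data.frame(
  alcohol          = rnorm(n, mean = 10.5,  sd = 1.2),
  density          = rnorm(n, mean = 0.994, sd = 0.003),
  volatile_acidity = rnorm(n, mean = 0.28,  sd = 0.1)
)
toy$quality <- 6 + 0.3 * (toy$alcohol - 10.5) -
  1.5 * (toy$volatile_acidity - 0.28) + rnorm(n, sd = 0.7)

# Pearson correlation of every predictor with quality
cors <- cor(toy)[, "quality"]
cors <- cors[names(cors) != "quality"]

print(round(sort(cors, decreasing = TRUE), 2))
```

Sorting the vector puts the strongest positive association first and the strongest negative one last, which is the reading order used on the heatmap slide.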

10. Relationship Between Key Predictors and Quality

Higher alcohol improves quality, while higher acidity and density reduce it; moderate sulphates slightly enhance freshness and ratings.

11. Model Summary — Backward Selection


Call:
lm(formula = quality ~ fixed_acidity + volatile_acidity + residual_sugar + 
    free_sulfur_dioxide + density + pH + sulphates + alcohol, 
    data = train_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.8957 -0.4693 -0.0312  0.4780  2.8512 

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)          2.149e+02  2.622e+01   8.197 3.54e-16 ***
fixed_acidity        1.114e-01  2.696e-02   4.131 3.71e-05 ***
volatile_acidity    -1.677e+00  1.342e-01 -12.496  < 2e-16 ***
residual_sugar       9.917e-02  1.004e-02   9.879  < 2e-16 ***
free_sulfur_dioxide  3.578e-03  8.228e-04   4.349 1.41e-05 ***
density             -2.162e+02  2.654e+01  -8.145 5.40e-16 ***
pH                   1.001e+00  1.344e-01   7.449 1.21e-13 ***
sulphates            6.876e-01  1.237e-01   5.559 2.94e-08 ***
alcohol              1.187e-01  3.483e-02   3.408 0.000662 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.7484 on 3159 degrees of freedom
Multiple R-squared:    0.3, Adjusted R-squared:  0.2983 
F-statistic: 169.3 on 8 and 3159 DF,  p-value: < 2.2e-16

Backward stepwise selection kept 8 predictors. Alcohol (+) is associated with higher quality, while density (–) and volatile acidity (–) are associated with lower quality.
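A backward selection of this kind can be sketched with `step()` from base R. The 3,168 training rows implied by the summary (3,159 residual degrees of freedom plus 9 coefficients) are consistent with an 80/20 split of the 3,961 cleaned rows, so the sketch assumes that split. The data, the seed and the `noise_var` column are hypothetical.

```r
set.seed(42)
n <- 300

# Simulated stand-in data; noise_var has no real effect on quality
toy <- data.frame(
  alcohol          = rnorm(n),
  volatile_acidity = rnorm(n),
  noise_var        = rnorm(n)
)
toy$quality <- 6 + 0.4 * toy$alcohol - 0.3 * toy$volatile_acidity +
  rnorm(n, sd = 0.7)

# 80/20 train/test split (assumed, see lead-in)
train_idx  <- sample(seq_len(n), size = floor(0.8 * n))
train_data <- toy[train_idx, ]
test_data  <- toy[-train_idx, ]

# Start from the full model and drop terms while AIC improves
full_model     <- lm(quality ~ ., data = train_data)
backward_model <- step(full_model, direction = "backward", trace = 0)

print(formula(backward_model))
print(summary(backward_model)$coefficients)
```

With `trace = 0` the intermediate steps are hidden; setting `trace = 1` would show which term is dropped at each step.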

12. Coefficient Plot

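Alcohol has a clear positive effect, while density and volatile acidity show clear negative effects. Raw coefficient sizes depend on each variable's units (density only spans about 0.99 to 1.04, hence its coefficient of about −216), so magnitudes should be compared on a standardized scale. The directions agree with the pairwise relationships and correlations shown earlier.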

13. Model Performance (Evaluation)

RMSE (Test set): 0.7477 
R-squared (Test set): 0.2779 

RMSE ≈ 0.75, R² ≈ 0.28 on test data. This indicates moderate predictive power: chemical variables explain a little under 30% of the variation in wine quality. It suggests that factors not captured by these measurements (e.g., aroma, taste) influence expert ratings.
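The two test-set metrics can be computed as in the sketch below. The slides do not say how the test R² was calculated; this version assumes the usual 1 − SSE/SST definition evaluated on the test rows. Data and split are simulated and hypothetical.

```r
set.seed(7)
n <- 300

# Simulated stand-in data (hypothetical)
toy <- data.frame(alcohol = rnorm(n), volatile_acidity = rnorm(n))
toy$quality <- 6 + 0.4 * toy$alcohol - 0.3 * toy$volatile_acidity +
  rnorm(n, sd = 0.7)

train_idx  <- sample(seq_len(n), size = floor(0.8 * n))
train_data <- toy[train_idx, ]
test_data  <- toy[-train_idx, ]

model <- lm(quality ~ alcohol + volatile_acidity, data = train_data)
pred  <- predict(model, newdata = test_data)

# Root mean squared error on the held-out rows
rmse_test <- sqrt(mean((test_data$quality - pred)^2))

# R-squared on the held-out rows: 1 - SSE / SST
sse     <- sum((test_data$quality - pred)^2)
sst     <- sum((test_data$quality - mean(test_data$quality))^2)
r2_test <- 1 - sse / sst

cat("RMSE (Test set):", round(rmse_test, 4), "\n")
cat("R-squared (Test set):", round(r2_test, 4), "\n")
```

Because quality is scored in whole points, an RMSE can be read directly as the typical prediction error in quality points.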

14. Q–Q Plot (Normality Check)

Residuals closely follow the red reference line → the normality assumption is reasonably satisfied. No major outliers are present, suggesting the model fits well.

15. Residual Plot

Residuals are randomly scattered around 0 → indicates homoscedasticity (constant variance) and no obvious pattern. Model assumptions hold, and linearity is acceptable.
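Both diagnostic plots can be reproduced with base graphics. The sketch below fits a model to simulated data (the variables `x` and `y` are hypothetical) and draws a residuals-versus-fitted plot followed by a normal Q–Q plot with a red reference line.

```r
set.seed(3)
n <- 200

# Simulated stand-in data (hypothetical)
x <- rnorm(n)
y <- 6 + 0.4 * x + rnorm(n, sd = 0.7)

fit <- lm(y ~ x)
res <- residuals(fit)

# Residuals vs fitted values: look for an even band around zero
plot(fitted(fit), res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Normal Q-Q plot: points near the red line support normality
qqnorm(res)
qqline(res, col = "red")
```

A funnel shape in the first plot would point to non-constant variance, and points bending away from the line at the ends of the second plot would point to heavy tails.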

16. AIC and BIC Model Comparison

[1] 7164.747

AIC suggests the backward model is efficient. BIC favors a slightly simpler model with ~6 predictors. Key predictors across all criteria: alcohol, volatile acidity, sulphates, and residual sugar.
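One way to compare the two criteria is to run backward selection twice, once with the default AIC penalty and once with the BIC penalty `k = log(n)`. The sketch below does this on simulated data. The predictors `noise_1` and `noise_2` are hypothetical variables with no real effect, and the original comparison may have been set up differently.

```r
set.seed(5)
n <- 400

# Simulated stand-in data; noise_1 and noise_2 have no real effect
toy <- data.frame(
  alcohol          = rnorm(n),
  volatile_acidity = rnorm(n),
  sulphates        = rnorm(n),
  noise_1          = rnorm(n),
  noise_2          = rnorm(n)
)
toy$quality <- 6 + 0.4 * toy$alcohol - 0.3 * toy$volatile_acidity +
  0.1 * toy$sulphates + rnorm(n, sd = 0.7)

full_model <- lm(quality ~ ., data = toy)

# Penalty k = 2 per parameter gives AIC; k = log(n) gives BIC
aic_model <- step(full_model, direction = "backward", trace = 0)
bic_model <- step(full_model, direction = "backward", k = log(n), trace = 0)

comparison <- data.frame(
  model        = c("AIC-selected", "BIC-selected"),
  n_predictors = c(length(coef(aic_model)) - 1, length(coef(bic_model)) - 1),
  AIC          = c(AIC(aic_model), AIC(bic_model)),
  BIC          = c(BIC(aic_model), BIC(bic_model))
)
print(comparison)
```

Since BIC penalizes each extra parameter more heavily than AIC once n exceeds about 8, the BIC-selected model never has more predictors than the AIC-selected one in this setup.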

17. Robustness and Coefficients

Robust coefficient tests confirm alcohol has the strongest positive effect, while volatile acidity has a significant negative impact.


t test of coefficients:

                      Estimate Std. Error  t value  Pr(>|t|)    
(Intercept)          2.3810628  0.2224940  10.7017 < 2.2e-16 ***
alcohol              0.3848760  0.0127889  30.0945 < 2.2e-16 ***
volatile_acidity    -1.8577825  0.1397446 -13.2941 < 2.2e-16 ***
residual_sugar       0.0205266  0.0036140   5.6797 1.472e-08 ***
free_sulfur_dioxide  0.0036422  0.0015832   2.3006 0.0214782 *  
fixed_acidity       -0.0775894  0.0166583  -4.6577 3.330e-06 ***
sulphates            0.4154264  0.1208911   3.4364 0.0005972 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
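The slides do not state which robust estimator produced this table. As an illustration only, the sketch below computes heteroskedasticity-consistent (HC0, "sandwich") standard errors by hand in base R on simulated data. Both the choice of HC0 and the data are assumptions.

```r
set.seed(11)
n <- 300

# Simulated stand-in data; the noise grows with alcohol (non-constant variance)
toy <- data.frame(
  alcohol          = rnorm(n, mean = 10.5, sd = 1.2),
  volatile_acidity = rnorm(n, mean = 0.28, sd = 0.1)
)
toy$quality <- 2 + 0.38 * toy$alcohol - 1.8 * toy$volatile_acidity +
  rnorm(n, sd = 0.05 * toy$alcohol)

fit <- lm(quality ~ alcohol + volatile_acidity, data = toy)

X <- model.matrix(fit)  # design matrix, including the intercept column
e <- residuals(fit)

# HC0 sandwich: (X'X)^-1  X' diag(e^2) X  (X'X)^-1
bread    <- solve(crossprod(X))
meat     <- crossprod(X * e)  # each row of X is scaled by its residual
vcov_hc0 <- bread %*% meat %*% bread

robust_se <- sqrt(diag(vcov_hc0))
t_value   <- coef(fit) / robust_se
p_value   <- 2 * pt(abs(t_value), df = df.residual(fit), lower.tail = FALSE)

print(round(cbind(Estimate = coef(fit), "Std. Error" = robust_se,
                  "t value" = t_value, "Pr(>|t|)" = p_value), 4))
```

The coefficient estimates are the same as in the ordinary fit. Only the standard errors, and therefore the t and p values, change.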

18. Limitations

Quality scores are subjective: they are based on human taste panels
Dataset includes only chemical properties, not sensory data
Linear model assumes additive effects between predictors
Possible non-linear relationships are not captured
While the model is statistically significant, it explains only about 30% of the variation in quality: real-world wine quality also depends on taste, aroma, and balance, not just chemistry.

19. Future Improvements

Try non-linear models such as Random Forest or SVM to capture complex relationships
Include sensory and aroma data to better represent human perception of taste
Use cross-validation for stronger model validation and generalization
Compare with the red wine dataset to explore broader wine quality patterns
Interpretation:
Future studies should combine chemical and sensory data to build a model that better reflects real-world wine quality.
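The cross-validation suggestion could look like the following 5-fold sketch in base R. The data are simulated and the choice of k = 5 is an assumption. With the real data, the cleaned wine data frame would take the place of `toy`.

```r
set.seed(9)
n <- 250
k <- 5

# Simulated stand-in data (hypothetical)
toy <- data.frame(alcohol = rnorm(n), volatile_acidity = rnorm(n))
toy$quality <- 6 + 0.4 * toy$alcohol - 0.3 * toy$volatile_acidity +
  rnorm(n, sd = 0.7)

# Randomly assign each row to one of k folds of (nearly) equal size
folds <- sample(rep(seq_len(k), length.out = n))

# For each fold: fit on the other folds, score on the held-out fold
cv_rmse <- sapply(seq_len(k), function(i) {
  fit  <- lm(quality ~ ., data = toy[folds != i, ])
  pred <- predict(fit, newdata = toy[folds == i, ])
  sqrt(mean((toy$quality[folds == i] - pred)^2))
})

print(round(cv_rmse, 4))
cat("Mean CV RMSE:", round(mean(cv_rmse), 4), "\n")
```

Averaging over folds gives an error estimate that depends less on one particular train/test split than a single hold-out set does.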

20. Thank You 🍷

Group L01G08

Key Conclusion:
- Alcohol improves quality.
- Volatile acidity and density reduce it.
- Balanced sulphates help preserve freshness.

Ready for Q&A