Cortez, P., Cerdeira, A.L., Almeida, F., Matos, T., & Reis, J. (2009). Modeling wine preferences by data mining from physicochemical properties. Decis. Support Syst., 47, 547-553. DOI:10.1016/j.dss.2009.05.016. The dataset contains analytical (physicochemical) and sensory data collected for vinho verde, a wine originating from the Minho region of Portugal. The red variant comprises 1,599 samples collected between May 2004 and February 2007.

This project analyzes sulphate concentration variability in the vinho verde wine dataset using classical and hierarchical statistical modeling. Grouped summary statistics, Levene’s test, and ANOVA were used to evaluate whether sulphate means and variances differ across sensory quality groups. For comparison, a mixed-effects model decomposed the variance into random effects for quality and for alcohol-content and pH bins nested within quality. The analysis compares the red and white wine variants to quantify how much sulphate variation is explained by quality group structure versus residual within-group variability, which aids the interpretation of wine chemistry and sensory classification.

Summary Statistics By Quality Grade for Both Red Wine and White Wine, respectively.

Red Wine
quality     n  mean_SO4  StdDev_SO4  median_SO4  Variance_SO4
      3    10      0.57        0.12        0.54          0.01
      4    53      0.60        0.24        0.56          0.06
      5   681      0.62        0.17        0.58          0.03
      6   638      0.68        0.16        0.64          0.03
      7   199      0.74        0.14        0.74          0.02
      8    18      0.77        0.12        0.74          0.01

White Wine
quality     n  mean_SO4  StdDev_SO4  median_SO4  Variance_SO4
      3    20      0.47        0.12        0.44          0.01
      4   163      0.48        0.12        0.47          0.01
      5  1457      0.48        0.10        0.47          0.01
      6  2198      0.49        0.11        0.48          0.01
      7   880      0.50        0.13        0.48          0.02
      8   175      0.49        0.15        0.46          0.02
      9     5      0.47        0.09        0.46          0.01
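
As a quick consistency check on the summary tables, the per-grade counts and means can be pooled into a total sample size and a count-weighted overall mean for each variant. The sketch below is illustrative only: it uses the rounded values printed in the tables, so the pooled means are approximate, and the helper name `pooled` is an assumption rather than part of the original analysis.

```python
# Counts and rounded mean sulphates per quality grade, copied from the
# summary tables: {grade: (n, mean_SO4)}.
red = {3: (10, 0.57), 4: (53, 0.60), 5: (681, 0.62),
       6: (638, 0.68), 7: (199, 0.74), 8: (18, 0.77)}
white = {3: (20, 0.47), 4: (163, 0.48), 5: (1457, 0.48), 6: (2198, 0.49),
         7: (880, 0.50), 8: (175, 0.49), 9: (5, 0.47)}


def pooled(groups):
    """Return (total n, count-weighted mean) for a {grade: (n, mean)} table."""
    total_n = sum(n for n, _ in groups.values())
    weighted_mean = sum(n * m for n, m in groups.values()) / total_n
    return total_n, weighted_mean


red_n, red_mean = pooled(red)
white_n, white_mean = pooled(white)

print(f"red:   n = {red_n}, pooled mean ~ {red_mean:.3f} g/L")
print(f"white: n = {white_n}, pooled mean ~ {white_mean:.3f} g/L")
```

The red total matches the 1,599 samples cited for the dataset. The white total is two more than the 4,896 observations used in the white wine models, which is consistent with the two observations reported as deleted due to missingness.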

Red Wine: The histogram-density plot shows a strongly right-skewed sulphates distribution (mean > median) with a long upper tail. Values cluster mainly around 0.5-0.8 g/L and are not normally distributed.

White Wine: The histogram-density plot shows a right-skewed sulphates distribution (mean > median) with a long upper tail. Values cluster mainly around 0.4-0.5 g/L and are not normally distributed.

Red Wine: The box plot shows that median sulphates content increases with quality grade, from 0.54 g/L at grade 3 to 0.74 g/L at grades 7 and 8. Higher-grade wines tend to have slightly higher sulphates content, and the mid-grade qualities show more outliers.

White Wine: In contrast, the box plot shows that median sulphates content is broadly similar across quality grades, remaining relatively stable between ~0.4 and 0.5 g/L. Quality grades 5-7 show dense bands of outliers above the boxes, meaning many wines have unusually high sulphate values relative to their quality group. This suggests sulphates alone are not a strong discriminator of white wine quality.

A strong curvature away from the Q-Q line confirms the non-normality observed in both histogram-density plots, so the normality assumption is violated for both red wine and white wine.

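For Levene’s test, a high p-value means we fail to reject the null hypothesis of equal variances, so the variances can be treated as similar across groups. Red wine passes the test (F = 0.2301, p = 0.9495). White wine does not: its p-value is below 2.2e-16 (F = 16.265), so the equal-variance hypothesis is rejected and sulphate variance differs across white wine quality grades. This matters for the ANOVA that follows, which assumes homogeneous variances.

Levene’s Test for Homogeneity of Variance for Red Wine and White Wine, respectively.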

## ## Levene Test for Homogeneity of Variance - Red Wine
## Levene's Test for Homogeneity of Variance (center = median)
##         Df F value Pr(>F)
## group    5  0.2301 0.9495
##       1593
## ## Levene Test for Homogeneity of Variance - White Wine
## Levene's Test for Homogeneity of Variance (center = median)
##         Df F value    Pr(>F)    
## group    6  16.265 < 2.2e-16 ***
##       4891                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
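
The `center = median` line in the output indicates the median-centred (Brown-Forsythe) form of Levene's test. As a minimal sketch of how that F statistic is formed, the function below runs a one-way ANOVA on the absolute deviations of each observation from its group median. It is written from the textbook definition, not taken from the original analysis, and the toy groups are hypothetical values chosen only to show the behaviour.

```python
from statistics import mean, median


def levene_median(groups):
    """Median-centred Levene (Brown-Forsythe) F statistic.

    `groups` is a list of samples, one list of numbers per group.
    """
    # Absolute deviation of every observation from its own group median.
    z = [[abs(y - median(g)) for y in g] for g in groups]
    k = len(z)                                  # number of groups
    n_total = sum(len(zi) for zi in z)          # total observations
    z_bar_i = [mean(zi) for zi in z]            # mean deviation per group
    z_bar = sum(sum(zi) for zi in z) / n_total  # grand mean deviation
    # Between-group and within-group sums of squares of the deviations.
    between = sum(len(zi) * (zb - z_bar) ** 2 for zi, zb in zip(z, z_bar_i))
    within = sum((v - zb) ** 2 for zi, zb in zip(z, z_bar_i) for v in zi)
    return (n_total - k) / (k - 1) * between / within


# Hypothetical toy data (not wine measurements).
same_spread = [[1.0, 2.0, 3.0, 4.0, 5.0], [11.0, 12.0, 13.0, 14.0, 15.0]]
diff_spread = [[1.0, 2.0, 3.0, 4.0, 5.0], [0.0, 10.0, 20.0, 30.0, 40.0]]

w_same = levene_median(same_spread)
w_diff = levene_median(diff_spread)
print(w_same, w_diff)
```

Groups with identical spread give a statistic of zero regardless of their location, while groups with very different spread give a large one. The statistic is then compared against an F distribution with (k - 1, N - k) degrees of freedom, which correspond to the `Df` column in the output above.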

Red Wine: The ANOVA output shows that quality (entered as a single term with 1 degree of freedom) and pH grouping are both statistically significant predictors of sulphates content, with p-values far below the significance level α = 0.05. Alcohol grouping is borderline (p = 0.0589) and does not reach significance at that level.

White Wine: The ANOVA output shows that quality, alcohol grouping, and pH grouping are all statistically significant predictors of sulphates content, each with p < 0.001.

## ## Analysis of Variance (ANOVA) - Red Wine
##               Df Sum Sq Mean Sq F value   Pr(>F)    
## quality        1   2.90  2.9018 111.509  < 2e-16 ***
## alcohol_bin    2   0.15  0.0738   2.836   0.0589 .  
## pH_bin         2   1.41  0.7050  27.092 2.69e-12 ***
## Residuals   1593  41.46  0.0260                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## ## Analysis of Variance (ANOVA) - White Wine
##               Df Sum Sq Mean Sq F value   Pr(>F)    
## quality        1   0.18  0.1814  14.305 0.000157 ***
## alcohol_bin    2   0.23  0.1151   9.076 0.000116 ***
## pH_bin         2   1.36  0.6796  53.600  < 2e-16 ***
## Residuals   4890  62.00  0.0127                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 2 observations deleted due to missingness
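
Statistical significance does not say how much of the sulphate variation each term accounts for. One rough way to gauge that is to express each term's sum of squares as a share of the total (an eta-squared style measure). The sketch below uses the rounded `Sum Sq` values printed in the tables above, so the shares are approximate. If the tables are sequential, as R's `aov` summaries are, the shares also depend on the order in which the terms were entered.

```python
# Sum Sq values copied (as rounded in the output) from the ANOVA tables.
anova = {
    "red": {"quality": 2.90, "alcohol_bin": 0.15,
            "pH_bin": 1.41, "Residuals": 41.46},
    "white": {"quality": 0.18, "alcohol_bin": 0.23,
              "pH_bin": 1.36, "Residuals": 62.00},
}


def sum_sq_shares(table):
    """Return each term's sum of squares as a fraction of the total."""
    total = sum(table.values())
    return {term: ss / total for term, ss in table.items()}


shares = {wine: sum_sq_shares(table) for wine, table in anova.items()}

for wine, terms in shares.items():
    for term, share in terms.items():
        print(f"{wine:5s} {term:12s} {share:6.1%}")
```

On these figures, quality accounts for roughly 6% of the red wine sum of squares but well under 1% for white wine, and the residual term is about 90% or more in both cases. The significant F-tests therefore reflect large samples more than large effects.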

Red Wine: Among the random effects, the quality-pH grouping has the largest variance (~0.0081) and therefore the biggest grouping effect. The quality-alcohol grouping has a small variance (~0.00013) and therefore a weak grouping effect. Quality alone has almost negligible variance (~2.2e-10), which is consistent with the singular-fit warning in the output. The residual, i.e. the unexplained observation-level variance (~0.0256), is the largest component in the mixed-effects model.

White Wine: As for red wine, the quality-pH grouping has the largest random-effect variance (~0.0013) and therefore the biggest grouping effect. The quality-alcohol grouping has a very small variance (~0.000093) and therefore a weak grouping effect. The variance for quality alone is estimated at zero, and the fit is again flagged as singular. The residual is once more the largest component (~0.0126), though it is about half the residual variance observed for red wine.

## ## Random Mixed Effects Model - Red Wine
## Linear mixed model fit by REML ['lmerMod']
## Formula: sulphates ~ 1 + (1 | quality) + (1 | quality:alcohol_bin) + (1 |  
##     quality:pH_bin)
##    Data: redwine
## 
## REML criterion at convergence: -1271.8
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.1441 -0.5900 -0.1701  0.3572  8.1699 
## 
## Random effects:
##  Groups              Name        Variance  Std.Dev. 
##  quality:pH_bin      (Intercept) 8.089e-03 0.0899376
##  quality:alcohol_bin (Intercept) 1.254e-04 0.0111997
##  quality             (Intercept) 2.161e-10 0.0000147
##  Residual                        2.562e-02 0.1600555
## Number of obs: 1599, groups:  
## quality:pH_bin, 18; quality:alcohol_bin, 17; quality, 6
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  0.68318    0.02415   28.29
## optimizer (nloptwrap) convergence code: 0 (OK)
## boundary (singular) fit: see help('isSingular')
## ## Random Mixed Effects Model - White Wine
## Linear mixed model fit by REML ['lmerMod']
## Formula: sulphates ~ 1 + (1 | quality) + (1 | quality:alcohol_bin) + (1 |  
##     quality:pH_bin)
##    Data: whitewine
## 
## REML criterion at convergence: -7477.7
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.3846 -0.7330 -0.1231  0.5014  5.1785 
## 
## Random effects:
##  Groups              Name        Variance  Std.Dev.
##  quality:pH_bin      (Intercept) 1.278e-03 0.035750
##  quality:alcohol_bin (Intercept) 9.258e-05 0.009622
##  quality             (Intercept) 0.000e+00 0.000000
##  Residual                        1.256e-02 0.112088
## Number of obs: 4896, groups:  
## quality:pH_bin, 20; quality:alcohol_bin, 20; quality, 7
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept) 0.497625   0.009825   50.65
## optimizer (nloptwrap) convergence code: 0 (OK)
## boundary (singular) fit: see help('isSingular')
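
To put the variance components on a common scale, each one can be divided by the sum of all components, giving the share of total modelled variance attributable to each grouping (an intraclass-correlation style partition). The sketch below copies the `Variance` column from the two model summaries above. It is a worked calculation on the reported estimates, not a refit of the models.

```python
# Variance components copied from the "Random effects" tables.
variance_components = {
    "red": {"quality:pH_bin": 8.089e-03, "quality:alcohol_bin": 1.254e-04,
            "quality": 2.161e-10, "Residual": 2.562e-02},
    "white": {"quality:pH_bin": 1.278e-03, "quality:alcohol_bin": 9.258e-05,
              "quality": 0.0, "Residual": 1.256e-02},
}


def variance_shares(components):
    """Return each variance component as a fraction of the total variance."""
    total = sum(components.values())
    return {name: var / total for name, var in components.items()}


vc_shares = {wine: variance_shares(c)
             for wine, c in variance_components.items()}

for wine, parts in vc_shares.items():
    for name, share in parts.items():
        print(f"{wine:5s} {name:20s} {share:7.2%}")

# Ratio of white to red residual variance.
residual_ratio = (variance_components["white"]["Residual"]
                  / variance_components["red"]["Residual"])
print(f"white/red residual variance ratio: {residual_ratio:.2f}")
```

On these estimates the quality-pH grouping carries about 24% of the variance for red wine and about 9% for white wine, while the residual carries roughly 76% and 90% respectively. The white wine residual variance is about 0.49 times the red wine value.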