1. Abstract

The purpose of this project is to determine which physicochemical properties affect white wine quality, through exploratory data analysis of a data set containing attributes for approximately 5,000 white variants of the Portuguese “Vinho Verde” wine.

For more details about the wine, consult: Cortez et al., 2009. Due to privacy and logistic issues, only physicochemical (inputs) and sensory (the output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.).

2. Dataset

This data set consists of 12 variables, with almost 5,000 observations.

Input variables (based on physicochemical tests):
1 - fixed acidity (tartaric acid - g / dm^3)
2 - volatile acidity (acetic acid - g / dm^3)
3 - citric acid (g / dm^3)
4 - residual sugar (g / dm^3)
5 - chlorides (sodium chloride - g / dm^3)
6 - free sulfur dioxide (mg / dm^3)
7 - total sulfur dioxide (mg / dm^3)
8 - density (g / cm^3)
9 - pH
10 - sulphates (potassium sulphate - g / dm^3)
11 - alcohol (% by volume)

Output variable (based on sensory data): 12 - quality (score between 0 and 10)

3. Exploratory Analysis

3.1 Dataset Preparation and Transformation

## [1] 4898   13
## 'data.frame':    4898 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...
##        X        fixed.acidity    volatile.acidity  citric.acid    
##  Min.   :   1   Min.   : 3.800   Min.   :0.0800   Min.   :0.0000  
##  1st Qu.:1225   1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700  
##  Median :2450   Median : 6.800   Median :0.2600   Median :0.3200  
##  Mean   :2450   Mean   : 6.855   Mean   :0.2782   Mean   :0.3342  
##  3rd Qu.:3674   3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900  
##  Max.   :4898   Max.   :14.200   Max.   :1.1000   Max.   :1.6600  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.600   Min.   :0.00900   Min.   :  2.00     
##  1st Qu.: 1.700   1st Qu.:0.03600   1st Qu.: 23.00     
##  Median : 5.200   Median :0.04300   Median : 34.00     
##  Mean   : 6.391   Mean   :0.04577   Mean   : 35.31     
##  3rd Qu.: 9.900   3rd Qu.:0.05000   3rd Qu.: 46.00     
##  Max.   :65.800   Max.   :0.34600   Max.   :289.00     
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  9.0        Min.   :0.9871   Min.   :2.720   Min.   :0.2200  
##  1st Qu.:108.0        1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100  
##  Median :134.0        Median :0.9937   Median :3.180   Median :0.4700  
##  Mean   :138.4        Mean   :0.9940   Mean   :3.188   Mean   :0.4898  
##  3rd Qu.:167.0        3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500  
##  Max.   :440.0        Max.   :1.0390   Max.   :3.820   Max.   :1.0800  
##     alcohol         quality     
##  Min.   : 8.00   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.40   Median :6.000  
##  Mean   :10.51   Mean   :5.878  
##  3rd Qu.:11.40   3rd Qu.:6.000  
##  Max.   :14.20   Max.   :9.000
## 
##    bad normal   good 
##   1640   2198   1060

A new variable, quality.f2, groups quality into fewer levels (bad, normal, good) to see whether a coarser scale can provide new insights.
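The grouping can be sketched with base R's `cut()`. The break points below are an assumption, since the report does not show them: scores up to 5 are treated as bad, 6 as normal, and 7 or above as good. The short `quality` vector is made up for illustration and stands in for the real column.

```r
# Toy quality scores standing in for white$quality (illustrative values only)
quality <- c(3L, 5L, 6L, 6L, 7L, 9L, 4L, 6L, 8L, 5L)

# Assumed break points: (-Inf, 5] = bad, (5, 6] = normal, (6, Inf] = good
quality.f2 <- cut(quality,
                  breaks = c(-Inf, 5, 6, Inf),
                  labels = c("bad", "normal", "good"),
                  ordered_result = TRUE)

# Count wines per simplified level
print(table(quality.f2))
```

Using an ordered factor keeps the bad < normal < good ordering available for plots and comparisons.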

3.2 Univariate Plots Section

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00   23.00   34.00   35.31   46.00  289.00

Density of Independent Variables

Most wines are graded 5, 6, or 7.

Alcohol levels are concentrated at the lower end of the distribution range.

The density of wine depends on the percent alcohol and sugar content.

A large number of wine variants contain less than 2 g per dm^3 of residual sugar. I wonder what quality those wines have.

Chlorides is the amount of salt in the wine. The median amount is 0.043 g per dm^3.

Total sulfur dioxide (SO2) in low concentrations is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine. Most wines contain less than 260 mg per dm^3.

Free sulfur dioxide prevents microbial growth and the oxidation of wine. Its amount ranges from 2 to 289 mg per dm^3, but most values fall under 62 mg per dm^3.

Fixed acidity: most acids involved with wine are fixed or nonvolatile (they do not evaporate readily). Most values fall between 6 and 7.3 g per dm^3.

Volatile acidity is concentrated at the lower end of its range, because levels that are too high can lead to an unpleasant, vinegar taste. The median level is 0.26 g per dm^3. I wonder if the amount of volatile acidity is related to wine quality.

Citric acid, found in small quantities, can add ‘freshness’ and flavor to wines. Most white wines contain 0.2 to 0.4 g per dm^3.

pH describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3 and 3.4 on the pH scale.

Sulphates are a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant. Sulphate levels are concentrated at the lower end of the distribution range, with a tail to the right.

3.3 Univariate Analysis

Structure of data set:

There are 4898 white wine observations in the data set with 12 features. The variable quality is an ordered factor with the following levels.

(worst) 0 -> (best) 10, with observed scores ranging from 3 to 9

Other Observations:

The main features in the data set are density, alcohol and quality. I’d like to determine which features are best for predicting the quality of wine. I suspect acidity and some combination of the other variables can be used to build a predictive model of wine quality.

Fixed acidity, volatile acidity, residual sugar, chlorides, and total sulfur dioxide could also contribute to the quality of wine.

There is a high count of wines with a low residual sugar level of around 2 g per dm^3.

3.4 Bivariate Plots Section

##                      fxd.c vltl. rsdl. chlrd ttl.. dnsty pH    alchl qulty
## fixed.acidity         1.00                                                
## volatile.acidity     -0.02  1.00                                          
## residual.sugar        0.09  0.06  1.00                                    
## chlorides             0.02  0.07  0.09  1.00                              
## total.sulfur.dioxide  0.09  0.09  0.40  0.20  1.00                        
## density               0.27  0.03  0.84  0.26  0.53  1.00                  
## pH                   -0.43 -0.03 -0.19 -0.09  0.00 -0.09  1.00            
## alcohol              -0.12  0.07 -0.45 -0.36 -0.45 -0.78  0.12  1.00      
## quality              -0.11 -0.19 -0.10 -0.21 -0.17 -0.31  0.10  0.44  1.00

Based on the correlation study, citric acid, free SO2, and sulphates are not correlated with quality, while fixed acidity, residual sugar, and pH are only weakly correlated with quality. My study will concentrate on how volatile acidity, chlorides, total SO2, alcohol, and density affect wine quality.
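A screening step like this can be sketched with base R's `cor()`. The data below are simulated stand-ins for three columns, not the real wine measurements, and the report does not say which function produced its table, so this is only one possible approach.

```r
set.seed(42)
n <- 200
# Simulated stand-ins for three columns of the wine data (not the real values)
alcohol <- rnorm(n, mean = 10.5, sd = 1.2)
density <- 1.01 - 0.0015 * alcohol + rnorm(n, sd = 0.001)
quality <- round(6 + 0.4 * (alcohol - 10.5) + rnorm(n, sd = 0.8))
wine <- data.frame(alcohol, density, quality)

# Pearson correlation matrix, rounded to two decimals
cors <- round(cor(wine), 2)

# Correlation of each predictor with quality, strongest first
with.quality <- cors[rownames(cors) != "quality", "quality"]
ranked <- with.quality[order(abs(with.quality), decreasing = TRUE)]
print(ranked)
```

Sorting by absolute value keeps strongly negative predictors (such as density in the real table) from being overlooked.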

Most of the best quality wines tend to have a higher alcohol level, ranging from 10.5 to 13.5 % by volume, while the alcohol level of most of the worst quality wines ranges from 8 to 12 % by volume, with a bi-modal distribution.

For most of the best quality wines, density falls within a narrow range.

There is no significant difference in the distribution of residual sugar between the best quality wines and the worst ones, although more of the worst quality wines tend to have a low level of residual sugar.

Total sulfur dioxide in most of the best quality wines falls in a narrow range of 70 to 190 mg per dm^3, while in most of the worst quality wines it falls in the range of 5 to 250 mg per dm^3.

The amount of free sulfur dioxide in the worst quality wines skews toward lower levels. After log transformation, its distribution for the worst quality wines is bi-modal, while that for the best quality wines stays in a higher range.
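To illustrate why a log transformation helps with a skewed variable such as free sulfur dioxide, here is a minimal sketch on simulated log-normal values. The parameters are arbitrary and not fitted to the wine data, the choice of `log10` is an assumption, and `skewness` is a small helper defined here rather than a library function.

```r
set.seed(7)
# Simulated right-skewed values standing in for white$free.sulfur.dioxide
free.so2 <- rlnorm(500, meanlog = log(34), sdlog = 0.5)

# Simple moment-based skewness (close to 0 for a symmetric distribution)
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

raw.skew <- skewness(free.so2)
log.skew <- skewness(log10(free.so2))
print(c(raw = raw.skew, log10 = log.skew))
```

With the long right tail compressed, features in the bulk of the distribution, such as a second mode, become easier to see in a histogram or density plot.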

There is no significant difference in the distribution of chlorides between the best quality wines and the worst quality wines. The same is true for pH.

Quality

Best quality wines have the highest median alcohol level and a wider range.

Best quality wines have the lowest median density and the smallest range.

The median chlorides amount for best quality wines is slightly lower than for other grades of wine.
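The per-grade medians and ranges read off the box plots can also be computed directly. This is a minimal sketch with `tapply()` on made-up alcohol values; the three-level grouping and all numbers are illustrative, not taken from the data set.

```r
# Toy data: alcohol (% by volume) for wines in three simplified quality groups
wine <- data.frame(
  quality.f2 = factor(rep(c("bad", "normal", "good"), each = 4),
                      levels = c("bad", "normal", "good")),
  alcohol = c(9.0, 9.4, 9.8, 10.2,
              9.6, 10.2, 10.6, 11.0,
              10.8, 11.4, 12.0, 12.6)
)

# Median and interquartile range of alcohol within each quality group
med <- tapply(wine$alcohol, wine$quality.f2, median)
iqr <- tapply(wine$alcohol, wine$quality.f2, IQR)
print(rbind(median = med, IQR = iqr))
```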

Alcohol

Alcohol is strongly negatively correlated with density (-0.78), and more weakly negatively correlated with residual sugar (-0.45), total sulfur dioxide (-0.45), and chlorides (-0.36).

Density

Density is strongly correlated to residual sugar, and weakly correlated to chlorides, total sulfur dioxide, free sulfur dioxide.

Residual Sugar

Residual sugar level is strongly positively correlated to density, weakly negatively correlated to alcohol and positively correlated to total sulfur dioxide. Residual sugar is not correlated to pH.

Total Sulfur Dioxide

Total sulfur dioxide is negatively correlated with alcohol (-0.45), although not strongly; its strongest correlation in the table is with density (0.53).

Chlorides

Chlorides is positively correlated with density, although the relationship is not strong. Chlorides is slightly correlated with total sulfur dioxide.

Fixed acidity

Fixed acidity is negatively correlated with pH and weakly positively correlated with density.

3.5 Bivariate Analysis

The two independent variables most strongly correlated with quality are alcohol (0.44) and density (-0.31).

In terms of relationships between independent variables, some strong correlations are observed.

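The pairs are listed by absolute correlation; the sign is shown where it appears in the correlation table above.
0.84 residual.sugar - density
0.78 alcohol - density (negative)
0.62 free.sulfur.dioxide - total.sulfur.dioxide
0.53 total.sulfur.dioxide - density
0.45 residual.sugar - alcohol (negative)
0.36 chlorides - alcohol (negative)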
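A ranking of variable pairs like this one can be produced from the lower triangle of a correlation matrix. The sketch below uses simulated predictors (not the real measurements) and keeps the sign next to each pair.

```r
set.seed(3)
n <- 300
# Simulated stand-ins for three predictors (not the real wine measurements)
residual.sugar <- rexp(n, rate = 1 / 6)
alcohol <- rnorm(n, mean = 10.5, sd = 1.2)
density <- 0.99 + 0.0004 * residual.sugar - 0.001 * (alcohol - 10.5) +
  rnorm(n, sd = 0.0005)
predictors <- data.frame(residual.sugar, alcohol, density)

cm <- cor(predictors)

# Keep each unordered pair once by taking the lower triangle
idx <- which(lower.tri(cm), arr.ind = TRUE)
pairs <- data.frame(
  var1 = rownames(cm)[idx[, "row"]],
  var2 = colnames(cm)[idx[, "col"]],
  r = round(cm[idx], 2)
)

# Sort by absolute correlation so negative relationships rank alongside positive ones
pairs <- pairs[order(abs(pairs$r), decreasing = TRUE), ]
print(pairs, row.names = FALSE)
```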

3.6 Multivariate Plots and Analysis

Within the same range of density, the best quality wines have the highest level of alcohol.

The plots above show residual sugar, alcohol, and total sulfur dioxide, each plotted against density. The left column uses the simplified quality level and the right column uses the original quality level. In both kinds of plots, at a given density (mostly at the lower end of the density range), wines with higher residual sugar, alcohol, or total sulfur dioxide seem to have better quality.

The plots above show residual sugar and chlorides, each plotted against alcohol. The left column uses the simplified quality level and the right column uses the original quality level. In both kinds of plots, at a given level of residual sugar or chlorides, wines with a higher alcohol level seem to have better quality.

Looking at the plots above, there seem to be more better-quality wines below 150 mg / dm^3 of total sulfur dioxide.
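A visual impression like this can be checked with a simple cross-tabulation. The sketch below uses made-up values; the 150 threshold comes from the text, while the column names and the three-level quality grouping are assumptions.

```r
# Toy data: total sulfur dioxide (mg / dm^3) and simplified quality (illustrative only)
wine <- data.frame(
  total.sulfur.dioxide = c(95, 110, 120, 135, 140, 160, 175, 190, 210, 230),
  quality.f2 = factor(c("good", "good", "normal", "good", "bad",
                        "normal", "bad", "normal", "bad", "bad"),
                      levels = c("bad", "normal", "good"))
)

# Split wines at the 150 threshold mentioned in the text
wine$so2.group <- ifelse(wine$total.sulfur.dioxide < 150, "under 150", "150 or more")

# Row-wise proportions: share of each quality level within each SO2 group
counts <- table(wine$so2.group, wine$quality.f2)
shares <- prop.table(counts, margin = 1)
print(round(shares, 2))
```

Comparing shares rather than raw counts avoids being misled when one group simply contains more wines.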

4. Linear Regression

Based on the exploratory analysis in the previous section, there does not seem to be any simple linear relationship between quality and the physicochemical properties. If this observation is correct, a linear regression model would not perform well at predicting quality from physicochemical properties.

## 
## Calls:
## m1: lm(formula = I(alcohol) ~ I(quality), data = white)
## m2: lm(formula = I(alcohol) ~ I(quality) + density, data = white)
## 
## ==========================================
##                      m1          m2       
## ------------------------------------------
##   (Intercept)      6.957***   300.640***  
##                   (0.106)      (3.652)    
##   I(quality)       0.605***     0.301***  
##                   (0.018)      (0.012)    
##   density                    -293.647***  
##                                (3.651)    
## ------------------------------------------
##   R-squared            0.2         0.7    
##   adj. R-squared       0.2         0.7    
##   sigma                1.1         0.7    
##   F                 1146.4      4565.8    
##   p                    0.0         0.0    
##   Log-likelihood   -7450.7     -5387.7    
##   Deviance          6009.1      2588.1    
##   AIC              14907.3     10783.4    
##   BIC              14926.8     10809.4    
##   N                 4898        4898      
## ==========================================
## 
## Calls:
## m1: lm(formula = I(density) ~ I(quality), data = white)
## m2: lm(formula = I(density) ~ I(quality) + alcohol, data = white)
## 
## ========================================
##                      m1         m2      
## ----------------------------------------
##   (Intercept)      1.000***   1.014***  
##                   (0.000)    (0.000)    
##   I(quality)      -0.001***   0.000***  
##                   (0.000)    (0.000)    
##   alcohol                    -0.002***  
##                              (0.000)    
## ----------------------------------------
##   R-squared            0.1        0.6   
##   adj. R-squared       0.1        0.6   
##   sigma                0.0        0.0   
##   F                  509.9     3827.1   
##   p                    0.0        0.0   
##   Log-likelihood   21761.2    23824.2   
##   Deviance             0.0        0.0   
##   AIC             -43516.4   -47640.3   
##   BIC             -43497.0   -47614.4   
##   N                 4898       4898     
## ========================================

Looking at the summary statistics of the two sets of linear models, it seems that only about 20% of the variance of quality is explained by alcohol, and about 10% by density (the m1 R-squared values). The rest of the results do not make sense. A linear model is not a suitable approach for predicting quality.
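Note that the models above use alcohol and density as the response, with quality as a predictor. For a one-predictor regression this does not change R-squared, because it equals the squared Pearson correlation whichever variable is the response; the values of roughly 0.2 and 0.1 agree with the correlations of 0.44 and -0.31 reported earlier. A minimal sketch on simulated data (not the wine data) illustrates the symmetry:

```r
set.seed(123)
n <- 500
# Simulated stand-ins for quality and alcohol (not the real wine data)
alcohol <- rnorm(n, mean = 10.5, sd = 1.2)
quality <- 6 + 0.35 * (alcohol - 10.5) + rnorm(n, sd = 0.8)

# Regress each variable on the other
r2.alcohol.on.quality <- summary(lm(alcohol ~ quality))$r.squared
r2.quality.on.alcohol <- summary(lm(quality ~ alcohol))$r.squared

# Both equal the squared Pearson correlation
r2.from.cor <- cor(alcohol, quality)^2
print(c(r2.alcohol.on.quality, r2.quality.on.alcohol, r2.from.cor))
```

The symmetry does not carry over to models with two or more predictors, so the m2 columns describe how well alcohol and density predict each other, not how well they predict quality. A model aimed at quality would use a formula such as `quality ~ alcohol + density`.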

5. Final Plots and Summary

Best quality wines have the highest median alcohol level.

This is a density plot of density and alcohol by quality level. As quality goes up, the center of the density distribution moves lower and that of the alcohol distribution moves higher.

The plots above show residual sugar, alcohol, and total sulfur dioxide, each plotted against density. The left column uses the simplified quality level and the right column uses the original quality level. In both kinds of plots, at a given density (mostly at the lower end of the density range), wines with higher residual sugar, alcohol, or total sulfur dioxide seem to have better quality.

6. Reflection

The white wine data set contains information on the physicochemical properties that may affect white wine quality for approximately 5,000 white variants of the Portuguese “Vinho Verde” wine, from a 2009 source (Cortez et al., 2009). I started by understanding the individual variables in the data set, and then I explored interesting questions and leads as I continued to make observations on plots. Eventually, I explored quality across many variables and created a linear model to predict the quality of wine.

The linear model didn’t turn out right. Wine is nothing like vinegar or soy sauce; it is full of delicacy and subtlety. Linear regression is not suitable for predicting wine quality, as alcohol accounts for only about 20% of the variance in quality and density for about 10%. Other attributes do not correlate significantly with quality.

7. Reference

[Creating Effective Plots](https://docs.google.com/document/d/1-f3wM3mJSkoWxDmPjsyRnWvNgM57YUPloucOIl07l4c/pub)

[Colors (ggplot2)](http://www.cookbook-r.com/Graphs/Colors_(ggplot2)/#palettes-color-brewer)

[My Commonly Done ggplot2 graphs: Part 2](https://www.r-bloggers.com/my-commonly-done-ggplot2-graphs-part-2/)

[Teru Watanabe](https://rpubs.com/watanabe8760/white-wine)