The purpose of this project is to determine what physicochemical properties affect white wine quality through exploratory data analysis of a data set containing attributes for approximately 5,000 white variants of the Portuguese “Vinho Verde” wine.
For more details about the wine, consult: Cortez et al., 2009. Due to privacy and logistic issues, only physicochemical (inputs) and sensory (the output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.).
This data set consists of 12 variables (plus a row index, X, in the loaded data frame), with almost 5,000 observations.
Input variables (based on physicochemical tests):
1 - fixed acidity (tartaric acid - g / dm^3)
2 - volatile acidity (acetic acid - g / dm^3)
3 - citric acid (g / dm^3)
4 - residual sugar (g / dm^3)
5 - chlorides (sodium chloride - g / dm^3)
6 - free sulfur dioxide (mg / dm^3)
7 - total sulfur dioxide (mg / dm^3)
8 - density (g / cm^3)
9 - pH
10 - sulphates (potassium sulphate - g / dm^3)
11 - alcohol (% by volume)
Output variable (based on sensory data): 12 - quality (score between 0 and 10)
## [1] 4898 13
## 'data.frame': 4898 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1 Min. : 3.800 Min. :0.0800 Min. :0.0000
## 1st Qu.:1225 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700
## Median :2450 Median : 6.800 Median :0.2600 Median :0.3200
## Mean :2450 Mean : 6.855 Mean :0.2782 Mean :0.3342
## 3rd Qu.:3674 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900
## Max. :4898 Max. :14.200 Max. :1.1000 Max. :1.6600
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.600 Min. :0.00900 Min. : 2.00
## 1st Qu.: 1.700 1st Qu.:0.03600 1st Qu.: 23.00
## Median : 5.200 Median :0.04300 Median : 34.00
## Mean : 6.391 Mean :0.04577 Mean : 35.31
## 3rd Qu.: 9.900 3rd Qu.:0.05000 3rd Qu.: 46.00
## Max. :65.800 Max. :0.34600 Max. :289.00
## total.sulfur.dioxide density pH sulphates
## Min. : 9.0 Min. :0.9871 Min. :2.720 Min. :0.2200
## 1st Qu.:108.0 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100
## Median :134.0 Median :0.9937 Median :3.180 Median :0.4700
## Mean :138.4 Mean :0.9940 Mean :3.188 Mean :0.4898
## 3rd Qu.:167.0 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500
## Max. :440.0 Max. :1.0390 Max. :3.820 Max. :1.0800
## alcohol quality
## Min. : 8.00 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.40 Median :6.000
## Mean :10.51 Mean :5.878
## 3rd Qu.:11.40 3rd Qu.:6.000
## Max. :14.20 Max. :9.000
##
## bad normal good
## 1640 2198 1060
Create a new variable quality.f2 which has fewer quality levels and see if it can provide new insights.
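A minimal sketch of how such a variable could be derived with `cut()`. The cut points (scores up to 5 as "bad", 6 as "normal", 7 and above as "good") are an assumption, and the toy `quality` vector below is hypothetical sample data, not the wine data set.

```r
# Hypothetical toy scores standing in for white$quality
quality <- c(3, 4, 5, 5, 6, 6, 6, 7, 8, 9)

# Assumed cut points: (0,5] -> bad, (5,6] -> normal, (6,10] -> good
quality.f2 <- cut(quality,
                  breaks = c(0, 5, 6, 10),
                  labels = c("bad", "normal", "good"),
                  ordered_result = TRUE)

# Count how many toy wines fall into each simplified level
table(quality.f2)
```

Using an ordered factor keeps the levels in worst-to-best order when they are used for grouping or colouring in plots.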
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 23.00 34.00 35.31 46.00 289.00
Density of Independent Variables
Most wines are graded 5, 6, or 7.
Alcohol level skews to the lower end of the distribution range.
The density of wine is close to that of water and depends on the percent alcohol and sugar content.
There is a large count of wine variants that contain less than 2 g per dm^3 of residual sugar. I wonder what quality those wines have.
Chlorides is the amount of salt in the wine. The median amount is 0.043 g per dm^3.
Total sulfur dioxide (SO2) in low concentrations is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine. Most wines contain less than 260 mg per dm^3.
Free Sulfur Dioxide prevents microbial growth and the oxidation of wine. Its amount ranges from 2 to 289 but most fall under 62 mg per dm^3.
Fixed acidity: most acids involved with wine are fixed, or nonvolatile (they do not evaporate readily). Most values fall between 6 and 7.3 g per dm^3.
Volatile acidity skews toward the lower end of its range; at too high a level it can lead to an unpleasant, vinegar taste. The median level is 0.26 g per dm^3. I wonder whether the amount of volatile acidity is related to wine quality.
Citric acid, found in small quantities, can add ‘freshness’ and flavor to wines. Most white wines contain 0.2 to 0.4 g per dm^3.
pH describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3 and 3.4 on the pH scale.
Sulphates are a wine additive that can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant. The sulphates distribution is right-skewed, with most values at the lower end of the range.
Structure of data set:
There are 4898 white wine observations in the data set with 12 features (11 physicochemical inputs and the quality score). The variable quality is treated as an ordered factor. Scores can run from 0 to 10, and the observed scores range from 3 to 9.
(worst) 0 ----> (best) 10
Other Observations:
The main features in the data set are density, alcohol, and quality. I’d like to determine which features are best for predicting the quality of wine. I suspect acidity and some combination of the other variables can be used to build a predictive model of wine quality.
Fixed acidity, volatile acidity, residual sugar, chlorides, and total sulfur dioxide could also contribute to the quality of wine.
There is a high count of wines with a low level of residual sugar, around 2 g per dm^3.
## fxd.c vltl. rsdl. chlrd ttl.. dnsty pH alchl qulty
## fixed.acidity 1.00
## volatile.acidity -0.02 1.00
## residual.sugar 0.09 0.06 1.00
## chlorides 0.02 0.07 0.09 1.00
## total.sulfur.dioxide 0.09 0.09 0.40 0.20 1.00
## density 0.27 0.03 0.84 0.26 0.53 1.00
## pH -0.43 -0.03 -0.19 -0.09 0.00 -0.09 1.00
## alcohol -0.12 0.07 -0.45 -0.36 -0.45 -0.78 0.12 1.00
## quality -0.11 -0.19 -0.10 -0.21 -0.17 -0.31 0.10 0.44 1.00
Based on the correlation study, citric acid, free SO2, and sulphates are not correlated with quality, while fixed acidity, residual sugar, and pH are only weakly correlated with it. My study will concentrate on how volatile acidity, chlorides, total SO2, alcohol, and density affect wine quality.
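For reference, a lower-triangular correlation table like this one can be sketched in base R with `cor()`. The `toy` data frame and its values are hypothetical stand-ins, and the method is assumed to be the default Pearson correlation; the report's own code is not shown here.

```r
# Hypothetical toy data standing in for a few columns of the wine data frame
toy <- data.frame(
  alcohol = c(8.8, 9.5, 10.1, 11.0, 12.4, 12.8),
  density = c(1.0010, 0.9980, 0.9951, 0.9938, 0.9912, 0.9900),
  quality = c(5, 5, 6, 6, 7, 8)
)

# Pairwise Pearson correlations, rounded to two decimals
cors <- round(cor(toy), 2)

# Blank out the redundant upper triangle so only the lower one is printed
cors[upper.tri(cors)] <- NA
print(cors, na.print = "")
```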
Most of the best quality wines have a higher alcohol level, ranging from 10.5 to 13.5 % by volume, while the alcohol level of most of the worst quality wines ranges from 8 to 12 % by volume, with a bimodal distribution.
The density distribution for most of the best quality wines has a narrow range.
There is no significant difference in the distribution of residual sugar between the best quality wines and the worst ones, although more of the worst quality wines tend to have a low level of residual sugar.
Total sulfur dioxide in most of the best quality wines falls in a narrow range of 70 to 190 mg per dm^3, while in most of the worst quality wines it falls in the range of 5 to 250 mg per dm^3.
The amount of free sulfur dioxide in the worst quality wines skews toward lower levels. After log transformation, its distribution for the worst quality wines is bimodal, while that for the best quality wines stays in a higher range.
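As an illustration of why a log transformation helps with a right-skewed variable, here is a small sketch. The `free_so2` values are hypothetical, and the base-10 logarithm is an assumption, since the base used in the analysis is not stated.

```r
# Hypothetical right-skewed values standing in for free sulfur dioxide (mg per dm^3)
free_so2 <- c(2, 5, 8, 14, 28, 30, 45, 47, 120, 289)

# Assumed base-10 log transformation
log_so2 <- log10(free_so2)

# On the raw scale the mean sits far above the median (long right tail);
# on the log scale the two are much closer together
raw_gap <- abs(mean(free_so2) - median(free_so2))
log_gap <- abs(mean(log_so2) - median(log_so2))
print(c(raw_gap = raw_gap, log_gap = log_gap))

# Histogram bin counts on the log scale, computed without drawing a plot
h <- hist(log_so2, plot = FALSE)
print(h$counts)
```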
There is no significant difference in the distribution of chlorides between the best quality wines and the worst quality wines. The same is true for pH.
The best quality wines have the highest median alcohol level and the widest range.
The best quality wines have the lowest median density and the smallest range.
The median chlorides amount for the best quality wines is slightly lower than for the other grades of wine.
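The per-grade medians and spreads read off the box plots can also be computed directly. The sketch below uses `tapply()` on a hypothetical `toy` data frame; the values are illustrative only, and the interquartile range is used as one possible measure of "range".

```r
# Hypothetical toy data: alcohol level grouped by simplified quality level
toy <- data.frame(
  quality.f2 = factor(rep(c("bad", "normal", "good"), each = 3),
                      levels = c("bad", "normal", "good")),
  alcohol = c(9.0, 9.4, 10.0, 9.8, 10.4, 11.0, 11.2, 12.0, 12.8)
)

# Median alcohol per quality level
med <- tapply(toy$alcohol, toy$quality.f2, median)

# Interquartile range per quality level as a measure of spread
spread <- tapply(toy$alcohol, toy$quality.f2, IQR)

print(med)
print(spread)
```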
Alcohol is strongly negatively correlated with density (-0.78), and more weakly negatively correlated with residual sugar (-0.45), total sulfur dioxide (-0.45), and chlorides (-0.36).
Density is strongly correlated with residual sugar (0.84), and more weakly correlated with chlorides, total sulfur dioxide, and free sulfur dioxide.
Residual sugar is strongly positively correlated with density, weakly negatively correlated with alcohol, and positively correlated with total sulfur dioxide. Residual sugar is only very weakly correlated with pH (-0.19).
Total sulfur dioxide is moderately negatively correlated with alcohol (-0.45).
Chlorides is positively correlated with density, although the relationship is not strong (0.26). Chlorides is slightly correlated with total sulfur dioxide (0.20).
Fixed acidity is negatively correlated with pH (-0.43) and weakly positively correlated with density (0.27).
The two independent variables most strongly correlated with quality are alcohol (0.44) and density (-0.31).
In terms of relationships between independent variables, some strong correlations are observed.
0.84 residual.sugar - density
-0.78 alcohol - density
0.62 free.sulfur.dioxide - total.sulfur.dioxide
0.53 total.sulfur.dioxide - density
-0.45 residual.sugar - alcohol
-0.36 chlorides - alcohol
Within the same range of density, the best quality wines have the highest level of alcohol.
The plots above show residual sugar, alcohol, and total sulfur dioxide against density separately. The left column uses the simplified quality level and the right one uses the original quality level. In both kinds of plots, holding density fixed (mostly at the lower end of the density range), wines with higher residual sugar, alcohol, or total sulfur dioxide seem to have better quality.
The plots above show residual sugar and chlorides against alcohol separately. The left column uses the simplified quality level and the right one uses the original quality level. In both kinds of plots, holding residual sugar or chlorides fixed, wines with a higher alcohol level seem to have better quality.
Looking at the above plots, it seems that there are more better quality wines under 150 mg per dm^3 of total sulfur dioxide.
Based on the exploratory analysis in the previous section, there does not seem to be any simple linear relationship between quality and the physicochemical properties. If this observation is correct, a linear regression model would not perform well at predicting quality from physicochemical properties.
##
## Calls:
## m1: lm(formula = I(alcohol) ~ I(quality), data = white)
## m2: lm(formula = I(alcohol) ~ I(quality) + density, data = white)
##
## ==========================================
## m1 m2
## ------------------------------------------
## (Intercept) 6.957*** 300.640***
## (0.106) (3.652)
## I(quality) 0.605*** 0.301***
## (0.018) (0.012)
## density -293.647***
## (3.651)
## ------------------------------------------
## R-squared 0.2 0.7
## adj. R-squared 0.2 0.7
## sigma 1.1 0.7
## F 1146.4 4565.8
## p 0.0 0.0
## Log-likelihood -7450.7 -5387.7
## Deviance 6009.1 2588.1
## AIC 14907.3 10783.4
## BIC 14926.8 10809.4
## N 4898 4898
## ==========================================
##
## Calls:
## m1: lm(formula = I(density) ~ I(quality), data = white)
## m2: lm(formula = I(density) ~ I(quality) + alcohol, data = white)
##
## ========================================
## m1 m2
## ----------------------------------------
## (Intercept) 1.000*** 1.014***
## (0.000) (0.000)
## I(quality) -0.001*** 0.000***
## (0.000) (0.000)
## alcohol -0.002***
## (0.000)
## ----------------------------------------
## R-squared 0.1 0.6
## adj. R-squared 0.1 0.6
## sigma 0.0 0.0
## F 509.9 3827.1
## p 0.0 0.0
## Log-likelihood 21761.2 23824.2
## Deviance 0.0 0.0
## AIC -43516.4 -47640.3
## BIC -43497.0 -47614.4
## N 4898 4898
## ========================================
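Looking at the summary statistics of the two sets of linear models, the single-predictor fits (m1) show that quality and alcohol share only about 20% of their variance, and quality and density about 10%. Both sets of models take quality as a predictor rather than as the response, so the remaining results (m2) say little about predicting quality. A linear model does not appear to be a suitable approach for predicting quality.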
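The m1 fits regress alcohol (or density) on quality, yet their R-squared can still be read as the variance shared with quality. In a single-predictor linear regression, R-squared equals the squared Pearson correlation, so it is the same in either direction. The sketch below checks this on a hypothetical `toy` data frame whose values are illustrative only.

```r
# Hypothetical toy data standing in for white$alcohol and white$quality
toy <- data.frame(
  alcohol = c(8.8, 9.5, 10.1, 9.9, 11.0, 12.4, 10.5, 12.8),
  quality = c(5, 5, 6, 6, 6, 7, 5, 8)
)

# Fit the regression in both directions
fit_alcohol <- lm(alcohol ~ quality, data = toy)
fit_quality <- lm(quality ~ alcohol, data = toy)

r2_alcohol <- summary(fit_alcohol)$r.squared
r2_quality <- summary(fit_quality)$r.squared
r <- cor(toy$alcohol, toy$quality)

# All three numbers agree: R-squared in either direction equals r^2
print(c(r2_alcohol = r2_alcohol, r2_quality = r2_quality, r_squared = r^2))
```

This is consistent with the figures reported here: a correlation of 0.44 between alcohol and quality squares to roughly 0.19, and -0.31 between density and quality squares to roughly 0.10.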
The best quality wines have the highest median alcohol level.
This is a density plot of density and alcohol by quality level. As quality goes up, the center of the density distribution shifts lower and that of the alcohol distribution shifts higher.
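The shift in distribution centers can be checked numerically with a kernel density estimate per quality level. This is a minimal base R sketch on a hypothetical `toy` data frame; the original plots were presumably drawn with ggplot2, which is not used here.

```r
# Hypothetical toy data: alcohol level for two simplified quality levels
toy <- data.frame(
  quality.f2 = factor(rep(c("bad", "good"), each = 5), levels = c("bad", "good")),
  alcohol = c(9.0, 9.2, 9.4, 9.6, 9.8, 12.0, 12.2, 12.4, 12.6, 12.8)
)

# One kernel density estimate per quality level
dens <- lapply(split(toy$alcohol, toy$quality.f2), density)

# Location of the highest point of each estimated density curve
peak <- sapply(dens, function(d) d$x[which.max(d$y)])
print(round(peak, 1))
```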
The plots above show residual sugar, alcohol, and total sulfur dioxide against density separately. The left column uses the simplified quality level and the right one uses the original quality level. In both kinds of plots, holding density fixed (mostly at the lower end of the density range), wines with higher residual sugar, alcohol, or total sulfur dioxide seem to have better quality.
The white wine data set contains physicochemical properties and quality scores for approximately 5,000 white variants of the Portuguese “Vinho Verde” wine, from a 2009 source. I started by understanding the individual variables in the data set, and then I explored interesting questions and leads as I continued to make observations on plots. Eventually, I explored quality across many variables and created a linear model to predict the quality of wine.
The linear model didn’t turn out well. Wine is nothing like vinegar or soy sauce; it is full of delicacy and subtlety. Linear regression is not suitable for predicting wine quality: alcohol accounts for only about 20% of the variance in quality and density for about 10%. Other attributes do not correlate with quality significantly.