Understanding Wine Quality through the Lens of Data Analysis Using R
by George Liu
Introduction
We have always relied on wine experts, who use their own esoteric jargon, to rate wine quality for us. But what exactly is wine quality based on? What are the criteria? In this project, we look at the Wine Quality dataset and use data analysis methods in R to explore the relationship between wine quality and attributes such as acidity, sugar and alcohol.
We start with summary statistics of the dataset and some exploratory plots.
Univariate Plots Section
## [1] 1599 13
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000

The output above gives summary statistics for the dataset, and the plot shows the distribution of wine quality. Only six quality levels occur (3 to 8), and they follow a roughly normal distribution. The following are the summary statistics for the quality variable (treated as numeric):
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.636 6.000 8.000

The above plots show that while both fixed and volatile acidity exhibit a somewhat normal distribution, citric acid is more uniform, with a peak at the lower end. The following are summary statistics for these three variables.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.60 7.10 7.90 8.32 9.20 15.90
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3900 0.5200 0.5278 0.6400 1.5800
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.090 0.260 0.271 0.420 1.000

## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.900 2.200 2.539 2.600 15.500
The above are the histogram and summary statistics for residual.sugar. The distribution is unimodal and right skewed, with a roughly normal bulk. There appear to be outliers at the higher end, i.e. wines with high residual sugar levels. Whether these are wines of higher or lower quality is not yet clear.

## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.20 10.42 11.10 14.90
The above plot shows the alcohol distribution and its summary statistics. Although the distribution is not strictly unimodal, it does exhibit a trend: as the alcohol level goes up, the count decreases.

## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100

## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 7.00 14.00 15.87 21.00 72.00

## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 22.00 38.00 46.47 62.00 289.00

## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.5500 0.6200 0.6581 0.7300 2.0000

## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.740 3.210 3.310 3.311 3.400 4.010

## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9901 0.9956 0.9968 0.9967 0.9978 1.0040
The above plots and summary statistics correspond to the rest of the variables in the dataset, i.e. chlorides, free sulfur dioxide, total sulfur dioxide, sulphates, pH and density. Chlorides are heavily concentrated at lower levels, with some outliers at the higher end. This may also be a differentiating factor between quality levels. Free and total sulfur dioxide show similar distribution patterns, peaking at lower levels and falling in count at higher levels. Sulphate levels are right skewed, with some outliers at the higher end. The pH and density distributions look increasingly normal.
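The histograms and summaries in this section can be reproduced along the following lines. This is a minimal base-R sketch, not the code used for the report: the data frame `wine` is simulated placeholder data (an assumption), standing in for the dataset that would normally be loaded from its CSV file.

```r
# Placeholder data (assumption): a right-skewed variable standing in for
# residual.sugar. The real analysis would use the loaded dataset instead.
set.seed(1)
wine <- data.frame(
  residual.sugar = rlnorm(1000, meanlog = 0.8, sdlog = 0.6)
)

# Six-number summary, in the same layout as the output shown above.
sugar_summary <- summary(wine$residual.sugar)
print(sugar_summary)

# Histogram bin counts; plot = FALSE returns the bins without drawing.
# Drop plot = FALSE to draw the histogram itself.
sugar_hist <- hist(wine$residual.sugar, breaks = 30, plot = FALSE)

# A quick numeric check for right skew: the mean exceeds the median.
is_right_skewed <- mean(wine$residual.sugar) > median(wine$residual.sugar)
print(is_right_skewed)
```

The same three steps apply to any of the other numeric columns by swapping the column name.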
Univariate Analysis
Structure of the Dataset
The dataset contains objective and subjective quality data for 1599 red wines. Apart from the row index X, there are a total of 12 variables, of which 11 are objective quality factors obtained from tests such as a pH test, and 1 is a subjective factor that contains the median expert evaluation score.
After an initial assessment, it seems that the objective test variables can be further broken down into six major categories:
- Acid: fixed acidity, volatile acidity, citric acid
- Sugar: residual sugar
- Salt: chlorides
- Alcohol: alcohol
- Chemicals: sulfur dioxide (free and total), sulphates, pH
- Physical: density
This might be useful later, as the variables within a group may be correlated and hence may not all belong in the same model.
Main Features of Interest
After some research about wine quality assessment, it appears to me that acid, sugar and alcohol levels are the most important features when tasting and deciding wine quality. In particular, the balance among these factors to give a harmonized overall taste seems to be the main concern. This will become more clear as we progress through this study.
Other Features
Aside from a balanced taste containing sugar, acid and alcohol, the other chemical ingredients may also be important since they all contribute to the taste of wine. However, at this point, it is not immediately clear whether pH level and density have a direct link with subjective wine quality.
New Variable
I created a new variable, “grade”. Quality is really a categorical variable but is given as numeric, so it has to be converted. Furthermore, six levels may be hard to handle and interpret. The quality variable is therefore collapsed into a new variable “grade” with fewer levels, referred to below as “bad”, “ok” and “good”.
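A conversion of this kind could be sketched with `cut()`. The cut points below (3-4 as “bad”, 5-6 as “ok”, 7-8 as “good”) are an assumption chosen for illustration and are not necessarily the mapping used in this report.

```r
# A few example quality scores covering the observed range 3..8.
quality <- c(3, 4, 5, 5, 6, 6, 7, 8)

# Hypothetical cut points (assumption): (2,4] -> bad, (4,6] -> ok, (6,8] -> good.
grade <- cut(
  quality,
  breaks = c(2, 4, 6, 8),
  labels = c("bad", "ok", "good"),
  ordered_result = TRUE   # keep the natural ordering bad < ok < good
)

# Number of wines per grade.
print(table(grade))
```

Making the factor ordered keeps the grades in a sensible sequence on plot axes and in tables.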
Unusual Distributions
Citric acid is a bit unusual in that it displays an overall uniform distribution while having a huge peak at the lower level. This suggests that the citric acid level may be a useful feature in the following analysis.
Aside from that, since the distributions vary from variable to variable, some combination of the variables may be necessary to further explore the relationship between wine quality and the different criteria.
Bivariate Plots Section

The above scatterplot matrix visualizes the relationship between each pair of variables in the dataset, with the correlation for each pair marked at the intersection of the two variables plotted.
Note: the variables have been renamed to be properly displayed on the graph. The following is the mapping: “acid.f”: fixed.acidity, “acid.v”: volatile.acidity, “acid.c”: citric.acid, “sgr”: residual.sugar, “chld”: chlorides, “sd.fr”: free.sulfur.dioxide, “sd.to”: total.sulfur.dioxide, “den”: density, “ph”: pH, “sul”: sulphates, “alc”: alcohol, “qa”: quality, “gr”: grade
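A matrix of this kind can be approximated in base R with `cor()` and `pairs()`. The sketch below is not the code behind the figure: it uses simulated placeholder columns (an assumption) and only three of the short names from the mapping above.

```r
# Placeholder data (assumption): total sulfur dioxide is built as
# free + bound, so the two columns are related by construction.
set.seed(7)
n     <- 500
free  <- rgamma(n, shape = 2, scale = 8)
bound <- rgamma(n, shape = 2, scale = 15)
wine <- data.frame(
  free.sulfur.dioxide  = free,
  total.sulfur.dioxide = free + bound,
  pH                   = rnorm(n, mean = 3.3, sd = 0.15)
)

# Shorten the column names so they fit on the plot axes.
names(wine) <- c("sd.fr", "sd.to", "ph")

# Pairwise Pearson correlations, rounded for display.
cor_mat <- round(cor(wine), 3)
print(cor_mat)

# pairs(wine) would draw the corresponding scatterplot matrix.
```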

This is the relationship between different objective criteria and quality. The distribution varies greatly.

The above plot compares the fixed.acidity distributions of the different grades. Oddly, the “ok” grade shows the most variability across acidity levels; I would have expected the frequency polygons to be ordered according to grade level. An alternative is to visualize the same data with boxplots and scatterplots, as follows, which is definitely more intuitive.






These plots visualize the relationship between residual.sugar, chlorides, alcohol and grade. Again, alcohol levels are more uniformly distributed for the “good” and “bad” grades, while the “ok” grade peaks at lower levels of alcohol. For comparison, the corresponding box and scatter plots are below:




Similar to the previous plots, these are frequency polygons for the following variables across the different grades: free.sulfur.dioxide, total.sulfur.dioxide, sulphates, pH. They present similar patterns as before.

This is the frequency polygon plot for density. We see the same theme: mediocre wines centre around a certain level.

The above boxplot of fixed acidity by quality level shows that some relationship is present, but it is not very consistent.

This graph of quality against residual.sugar is similar to the previous one, in that the relationship between residual sugar and quality is not consistent.
Bivariate Analysis
Some of the Relationships Observed in This Part of the Investigation
It seems that for almost all the test results, “ok” wines have more “peaked” distributions, whereas both “good” and “bad” wines tend to have more uniform distributions. This is quite surprising to me, since I was expecting the distribution patterns to follow the order of the grades. The form of the data was not changed, since none of the operations required any adjustment.
We are most concerned about the factors affecting wine quality, and judging by the relationship between quality/grade and the other variables, there appears to be a relationship between wine quality and these variables:
- fixed.acidity (+)
- volatile.acidity (-)
- citric.acid (+)
- density (-)
- pH (-)
- sulphates (+)
- alcohol (+)
“+” indicates a positive correlation, “-” a negative correlation.
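The signs in a list like this can be read off the correlation of each variable with quality. The following is a minimal sketch on simulated placeholder data (an assumption): the two predictors are generated so that one rises and one falls with quality, purely to show the mechanics.

```r
# Placeholder data (assumption): alcohol is simulated to rise with quality
# and volatile.acidity to fall with it.
set.seed(3)
n <- 400
quality <- sample(3:8, n, replace = TRUE)
wine <- data.frame(
  alcohol          = 9 + 0.4 * quality + rnorm(n),
  volatile.acidity = 0.9 - 0.07 * quality + rnorm(n, sd = 0.15),
  quality          = quality
)

# Correlation of every predictor with quality.
predictors <- setdiff(names(wine), "quality")
r <- sapply(wine[predictors], function(x) cor(x, wine$quality))

# Translate each correlation into the "+" / "-" notation used above.
signs <- ifelse(r > 0, "+", "-")
print(data.frame(r = round(r, 3), sign = signs))
```

Since quality is an ordinal score, passing `method = "spearman"` to `cor()` is a reasonable alternative to the default Pearson correlation.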
Interesting Relationships between the other Features
I selected acid, sugar and alcohol levels as the main features. However, from the above observations, it appears that sugar (residual sugar) does not have a clear relationship with quality. On the other hand, other features such as density, pH and sulphates show a correlation with wine quality.
Furthermore, free sulfur dioxide and total sulfur dioxide appear to be strongly correlated (0.668). This is expected, as the two factors are related: free is part of total. The implication is that we may need to choose only one of them for model building.
The strongest Relationship
The relationship between volatile acidity and quality is the strongest observed (that between citric acid and quality is close, if not on par). The plot clearly shows that as quality goes up, the volatile acidity level decreases significantly.
In terms of correlation, fixed acidity and pH have a correlation of -0.683, which makes sense, as pH is a measure of acidity. Again, this indicates that we may only need one of the two variables for model building.
Multivariate Plots Section



The above plots are scatterplots of fixed.acidity against volatile.acidity, residual.sugar and pH respectively. Volatile acidity levels seem to vary among the different quality grades. Other than that, there does not seem to be any pattern.


The above plots are boxplots of two ratios against quality: apr (alcohol/pH/residual.sugar) and ap (alcohol/pH). The second is an improved version that drops the residual.sugar variable since, unlike alcohol and pH, this variable is unit based. The plots are insightful, as they clearly show a difference in the apr ratio across quality levels, which supports the theory that good wines are defined by the balance of their different tastes.
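The two ratios can be computed as in the sketch below. The data frame holds only the first ten rows of the dataset, copied from the `str()` output at the top of this report, so the numbers are illustrative rather than representative of the full sample.

```r
# First ten rows of the dataset, as listed in the str() output.
wine <- data.frame(
  alcohol        = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  pH             = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35),
  residual.sugar = c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2, 6.1),
  quality        = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# The two ratios described in the text.
wine$apr <- wine$alcohol / wine$pH / wine$residual.sugar
wine$ap  <- wine$alcohol / wine$pH

# Median of each ratio within each quality level.
ap_median  <- tapply(wine$ap,  wine$quality, median)
apr_median <- tapply(wine$apr, wine$quality, median)
print(round(ap_median, 3))
print(round(apr_median, 3))

# Box statistics per quality level; plot = FALSE skips the drawing step.
# Drop plot = FALSE to draw the boxplot itself.
ap_box <- boxplot(ap ~ quality, data = wine, plot = FALSE)
```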

These histograms for the apr ratios are interesting, but still not enough to explain quality difference.

The previous plot is an attempt at identifying “wine content profiles”: it is a scatterplot of pH against residual.sugar, colored by alcohol and faceted by grade. There is no recognizable pattern.




The above are scatterplots between apr ratio (alcohol/pH/residual.sugar) and chlorides, total.sulfur.dioxide, density and sulphates. Again, these are interesting, but not informative.

This is a boxplot of the arvc ratio (alcohol/residual.sugar/volatile.acidity/chlorides) against grade. This is very interesting: it clearly differentiates the quality groups. In particular, the “good” grade has far fewer outliers than the “ok” grade. When chlorides is factored into the ratio, the difference among groups is much larger. Clearly a pattern is emerging: as wine quality increases, there are fewer and fewer outliers and the points cluster towards the x-axis.



The above plots tried different combinations of ratios: alcohol/residual.sugar/volatile.acidity/sulphates and alcohol/residual.sugar/volatile.acidity/chlorides/sulphates and created boxplots between them and wine grade. These relationships, as shown in the graph, are clearly not as indicative as the arvc ratio. The final plot is a scatterplot between alcohol and residual.sugar/citric.acid, faceted by quality.
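The compound ratios can be built in the same way as apr and ap. The sketch below uses the first ten rows from the `str()` output; the column names `arvs` and `arvcs` and the helper `count_outliers` are hypothetical names introduced here, and the grouping is by quality rather than grade so that no grade mapping has to be assumed.

```r
# First ten rows of the dataset, as listed in the str() output.
wine <- data.frame(
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  residual.sugar   = c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2, 6.1),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  chlorides        = c(0.076, 0.098, 0.092, 0.075, 0.076, 0.075, 0.069,
                       0.065, 0.073, 0.071),
  sulphates        = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# The three ratio combinations mentioned in the text.
wine$arvc  <- with(wine, alcohol / residual.sugar / volatile.acidity / chlorides)
wine$arvs  <- with(wine, alcohol / residual.sugar / volatile.acidity / sulphates)
wine$arvcs <- with(wine, arvc / sulphates)

# Count the points a boxplot would draw as outliers (1.5 * IQR rule).
count_outliers <- function(x) length(boxplot.stats(x)$out)
outliers_by_quality <- tapply(wine$arvc, wine$quality, count_outliers)
print(outliers_by_quality)
```

Counting outliers per group gives a numeric check on the visual impression that higher-quality groups have fewer of them.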
Multivariate Analysis
Relationships Observed in this Part of the Investigation
With more information, it becomes clear to me that the main features of interest should be “pH + residual sugar + alcohol” instead of “acidity + residual sugar + alcohol”. This is because pH measures the acidity level and corresponds directly to how acidic the wine tastes. As these three factors represent different tastes, there is no clear interaction among them.
Interesting interactions between Features
After exploring relationships among various groups of variables, no clear correlation emerged. I then did further research online to understand the factors affecting wine taste and quality. Everything pointed back to the balance of tastes, i.e. acidity (pH), sweetness (residual sugar), alcohol and tannin. All of these factors except tannin are available in the dataset. Therefore, I resorted to the “apr ratio” (alcohol:pH:residual sugar). In the boxplot showing the apr ratio’s relationship with quality, apr tends to increase as quality increases, which supports the “balanced taste” theory. Furthermore, when I realized that both pH and alcohol are unit-less variables, I switched to a boxplot of the alcohol/pH ratio by quality. This plot immediately shows differences between the quality groups, indicating that pH and alcohol might be the two most important factors in determining wine quality.
Final Plots and Summary
Plot One

Description One
This boxplot has the wine quality levels on its x-axis and the APR ratio (alcohol:pH:residual sugar) on its y-axis. It shows a generally increasing trend in the apr ratio as wine quality moves from lower to higher levels (the median rises from 1.331 at quality 3 to 1.693 at quality 8). The plot supports the “balanced taste” concept as an attribute of good quality wine. Here are the supporting statistics:
## quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.5263 1.0120 1.3310 1.3520 1.5390 2.4600
## --------------------------------------------------------
## quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2701 1.0210 1.4050 1.3510 1.5710 2.4200
## --------------------------------------------------------
## quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1902 1.1490 1.3640 1.3470 1.5690 2.5520
## --------------------------------------------------------
## quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1838 1.2410 1.4540 1.4660 1.6860 4.8830
## --------------------------------------------------------
## quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3723 1.2570 1.5320 1.4850 1.7620 3.0350
## --------------------------------------------------------
## quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.6095 1.3430 1.6930 1.6480 1.9510 2.5880
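Per-level summaries in this layout are what `by()` prints when `summary` is applied within each group. A minimal sketch, again using only the first ten rows from the `str()` output (so only quality levels 5, 6 and 7 appear):

```r
# First ten rows of the dataset, as listed in the str() output.
wine <- data.frame(
  alcohol        = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  pH             = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35),
  residual.sugar = c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2, 6.1),
  quality        = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)
wine$apr <- wine$alcohol / wine$pH / wine$residual.sugar

# One six-number summary of apr per quality level.
apr_by_quality <- with(wine, by(apr, quality, summary))
print(apr_by_quality)
```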
Plot Two

Description Two
Similar to plot one, this plot uses boxplots to compare the quality levels. However, a minor improvement is made by changing the y-axis to the AP ratio (alcohol:pH). Although small, the change leads to a much clearer comparison among the levels. This change is driven by the idea that both alcohol and pH are unit-less while being the main quality determinants. The following are the supporting summary statistics:
## quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.658 2.834 2.939 2.926 2.988 3.161
## --------------------------------------------------------
## quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.663 2.853 2.994 3.037 3.206 3.659
## --------------------------------------------------------
## quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.586 2.844 2.969 2.999 3.098 5.000
## --------------------------------------------------------
## quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.547 2.978 3.170 3.207 3.396 4.394
## --------------------------------------------------------
## quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.811 3.257 3.494 3.487 3.735 4.086
## --------------------------------------------------------
## quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.096 3.494 3.692 3.700 3.899 4.161
Plot Three

Description Three
Using a faceted scatterplot, this graph aims to identify the unique “content profile” of quality wines by combining the three most important factors (sugar, acidity, alcohol). By using citric acid instead of pH, the plot ensures unit consistency. The graph shows that as wine quality improves, the RC ratio (residual.sugar/citric.acid) decreases and the alcohol level increases.
Reflection
I started by looking at the data and trying to find patterns. By examining different variables and their relationships using plots, I was able to gain a clearer understanding of the factors affecting wine quality. This shows the importance of Exploratory Data Analysis (EDA) in data science, and I would count it as one of my successes.
On the other hand, although initial online research pointed me in the direction of “finding balanced taste”, I overlooked the units of the variables. After all, when a calculation combines variables with different units, its meaningfulness is questionable. A better understanding of the data, and particularly of the variables (what they represent, what their units are, how they are generated, how they relate to other variables), is therefore worth seeking early, as it can greatly speed up feature selection and analysis. In addition, I tried throughout to maintain a consistent flow of thought, which at times proved difficult. For example, when I was deep into analyzing the relationship between an objective criterion and quality, I found it hard to link back to the main theme. Thus, in terms of struggles, building a consistent overall work plan and linking it back to the big picture is one.
The next step in the project should involve actual model building using a general linear model or other machine learning algorithms. By doing this, we can determine which features are useful and compare them with the conclusions in this report, to assess the effectiveness of feature selection using the EDA approach.
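As a rough indication of what that next step might look like, the sketch below fits an ordinary linear model with `lm()`. It is only an illustration: the choice of alcohol and volatile acidity as predictors is an assumption based on the relationships noted earlier, and the ten rows used here (the first ten from the `str()` output) are far too few to support any conclusion.

```r
# First ten rows of the dataset, as listed in the str() output.
wine <- data.frame(
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Illustrative model (assumption): quality as a linear function of two
# of the features flagged during the exploratory analysis.
fit <- lm(quality ~ alcohol + volatile.acidity, data = wine)

# Estimated intercept and slopes.
print(coef(fit))
```

On the full dataset, `summary(fit)` would additionally report standard errors and the share of variance explained, which is what allows the comparison with the EDA conclusions.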