Exploratory Data Analysis on Red Wine Data Set

Introduction
Exploring Data
Univariate Section
Bivariate Section
Multivariate Section
Final PLots and Summary
Reflection

Introduction

This project aims to use exploratory data analysis (EDA) techniques to explore relationships in one variable to multiple variables and to explore selected red wine data set for visualizations, distributions, outliers, and anomalies.

The main question is “Which chemical properties influence the quality of red wines?” During my exploratory analysis, I will try to answer this question and implement EDA tehniques using R programming language.

A brief summary of the data set

##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000

Red wine data set can be downloaded here

More information about the data set (data collection method and variable explanations) can be found here

1. Univariate Section

In this section, I will investigate attributes individually.

1.1 Univariate Plots

Wine Quality

Let’s start exploring by investigating wine quality first, which is measured with a score range between 0, 10.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.636   6.000   8.000

In the given data set, wine scores are in range [3,8] and most of them have a score of 5.

Alcohol rate

Next, let’s investigate the alcohol rate in each wine

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.20   10.42   11.10   14.90

Median score of alcohol is 10.2%, the mean value is 10.42% and the third quartile is 11.1%. As seen on above graph, alcohol rate is left skewed, meaning most of the wines in the given data set have an alcohol rate below 11.1% and only 25% of the given wines have an alcohol rate over 11.1%

Residual Sugar

It is the amount of sugar remaining after fermentation stops. Let’s investigate residual sugar

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.900   2.200   2.539   2.600  15.500

This plot has a very long tail on the right side. Third quartile, 2.6, showing that 75% of the wines have a residual sugar value below 2.6 g/dm3. However, the remaining 25% of the wines have a residual sugar value in range (2.6, 15.5]

This attribute describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic), where 7 is neutral.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.740   3.210   3.310   3.311   3.400   4.010

Again we see a great bell shaped plot, with a mean and and a median value nearly the same (3.31). So we expect to see a pH level of 3.31 at most wines.

A few of them are more acidic and the most acidic wine has a pH value of 2.74, which is very close to pH level of cola and lemon juice.

Citric Acid

Found in small quantities, citric acid can add ‘freshness’ and flavor to wines. Next, we will investigate this attribute.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.090   0.260   0.271   0.420   1.000

There are two main peaks in this plot. First one is between [0, 0.02] and the second one is in range [0.48, 0.5]. It is hard to say its distribution by looking at the plot.

Volatile Acidity

This attribute gives the amount of acetic acid in wine, which at too high of levels can lead to an unpleasant, vinegar taste

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3900  0.5200  0.5278  0.6400  1.5800

This attribute has a mean and median values that are nearly equal (0.52 g/dm3) and it seems to be a bell shaped plot with a normal distribution. However, there is a small tail on the right side of the plot.

Fixed Acidity

This is a total of most acids involved with wine or fixed or nonvolatile. Let’s investigate this attribute now

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.60    7.10    7.90    8.32    9.20   15.90

Again we see a bell shaped plot with attribute having range [4.6 g/dm3, 15.9 g/dm3]. Median value is 7.9 and the mean value is 8.32

Density g/cm3

Next, let’s plot density attribute

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9901  0.9956  0.9968  0.9967  0.9978  1.0040

Density plot looks normally distributed, with mean eqals to 0.9967 and median equals to 0.9968.

Sulphates

This is a wine additive which can contribute to sulfur dioxide gas (S02) levels and acts as an antimicrobial and antioxidant.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.5500  0.6200  0.6581  0.7300  2.0000

The graph above, looks like a bell shaped plot with a long tail on the right. Sulphate level ranges between 0.33 g/dm3 to 2 g/dm3, with a mean value of 0.6581 and a median value of 0.6581 which are very close to each other. We can conclude, in most wines (in the given data set), sulphate amount is 0.62 g/dm3

Total Sulfur Dioxide

Represents the amount of free and bound forms of S02; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 mg/L, SO2 becomes evident in the nose and taste of wine

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   22.00   38.00   46.47   62.00  289.00

As from the above description, it is not a surprise for us to see such low sulfur dioxide levels. 75% of the wines in this dataset has a sulfur dioxide value below 62 mg/dm3

Free Sulfur Dioxide

After investigating the total sulfur dioxide levels, it will be a good practice to investigate free sulfur dioxide attribute, which prevents microbial growth and the oxidation of wine

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    7.00   14.00   15.87   21.00   72.00

Again we see a left skewed plot, in which most of the values are below 21 mg/dm3

Chlorides

In the final plot of univariate plots, I will investigate chlorides attribute, which gives the amount of salt in the wine.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100

This plot also looks like normally distributed however, there is a long tail in the right side, which corresponds to less then 25% of the data (since our third quartile is 0.09 g/dm3 and tail is starting from 0.15)

1.2 Univariate Analysis

What is the structure of your dataset?

This tidy data set contains 1599 red wine observations and a total of 12 attributes in the data set. 11 of the attributes are numeric physicochemical test results of wines and 1 attribute (quality) consists of sensory data ranging from 0 to 10, which is a categorical variable and is the median of at least 3 evaluations made by wine experts. There aren’t any missing values in the data set.

What is/are the main feature(s) of interest in your dataset?

As this project aims to find which chemical properties influence the quality of red wines, the main feature is quality.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Altough I’m not an expert at wines, I expect these 4 variables to affect the quality of the wine and have an insight that these variables will support my further investigation.

Alcohol: the percent alcohol content of the wine
Volatile acidity: the amount of acetic acid in wine, which at too high of levels can lead to an unpleasant, vinegar taste
Citric acid: found in small quantities, citric acid can add ‘freshness’ and flavor to wines
Total sulfur dioxide: amount of free and bound forms of S02; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine

Did you create any new variables from existing variables in the dataset?

No, I did not create any new variables.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

Some plots are positively skewed and might be log-normally distributed:

Free sulfur dioxide plot,
Total sulfur dioxide plot,
Alcohol plot and
Citric acid plot, which looks very weird

Since the given data set is very clean, I have only changed some variable names for ease of use but haven’t done any additional process to tidy the data set.

2. Bivariate Section

In this section, I will create plots using 2 features.

2.1 Bivariate Plots

First, let’s investigate the correlations between variables.

ggpair

Alcohol, Volatile Acidity, Citric Acid And Sulphates vs. Quality

Alcohol, volatile acidity, citric acid and sulphates are the most correlated attributes with quality. Next, I will dig in these variables to see their relationship with quality.

After looking at ggpair plot and boxplots, we can say that quality is positively correlated with alcohol, citric acid and sulphates and negatively correlated with volatile acidity.

Let’s go one step beyond and calculate a linear model and summarize its results

## 
## Call:
## lm(formula = quality ~ alcohol + vol.acidity + citric.acid + 
##     sulphates, data = wine)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.71408 -0.38590 -0.06402  0.46657  2.20393 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.64592    0.20106  13.160  < 2e-16 ***
## alcohol      0.30908    0.01581  19.553  < 2e-16 ***
## vol.acidity -1.26506    0.11266 -11.229  < 2e-16 ***
## citric.acid -0.07913    0.10381  -0.762    0.446    
## sulphates    0.69552    0.10311   6.746 2.12e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6588 on 1594 degrees of freedom
## Multiple R-squared:  0.3361, Adjusted R-squared:  0.3345 
## F-statistic: 201.8 on 4 and 1594 DF,  p-value: < 2.2e-16

These 4 attributes explains 33.45% of the variability in quality and citric acid is statistically unsignificant, in other words, there is likely to be no relationship between citric acid and quality.

Top Corralated Attributes

Also, let’s look at other variable pairs to see top correlated attributes.

Free SO2 vs. Total SO2

Free SO2 and total So2 attributes are positively correlated with each other, having value of 0.668

Fixed Acidity vs. Density & Alcohol vs. Density

Plot on the left side shows a strong and positive correlation between fixed acidity and density so the more acids involved with wine, the more dense it is.
Plot on the right side shows a strong negative correlation between alcohol and density. Since alcohol has a lower density than water, any increase in alcohol percentage decreases density.

Fixed Acidity vs. Citric Acid & Volatile Acidity vs. Citric Acid

It is an expected thing to see citrid acid and fixed acidity to have positive and strong correlation but it seems there is an interesting relationship between volatile acidity and citrid acid. These two variables have a negative correlation and we can expect an inverse proportion between these two attributes. So if acetic acid amount (volatile acidity) increases, we expect citric acid to decrease and vice versa.

Volatile Acidity vs. pH

Another interesting relationship: Volatile acidity vs. pH.

Although these 2 variables have a correlation coefficient of 0.235, it looks like they are so weakly correlated. One misleading point is the positive sign of the coefficient. So one can argue as volatile acidity increases, pH is expected to increase but how can an acid would have a positive impact on pH? After my investigations, I found out excess amount of volatile acid is removed from wine, because of its vinegar smell, using reverse osmosis or steam distillation methods, which may increase the pH level.1

2.2 Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

After investigating the ggpair plot, I found 4 attributes were related with quality:

The percent alcohol directly effects quality. By a majority of the time, more alcohol rate means better quality wines.
Volatile acidity is inversely proportional with quality. The lower the acetic acid (volatile acidity), the better the quality of wine.
Sulphates and quality has a positive and weak relationship, therefore increase in sulphates may make a wine better
Although citric acid is correlated with quality, the relation between these two is statistically not significant

After removing citric acid attribute, remaining 3 variables explained 33.46% of the variability in quality.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

It was very interesting to observe negative correlation between volatile acidity and citric acid.

Since volatile acidity and pH had a positive correlation, that plot was also very interesting.

What was the strongest relationship you found?

The strongest correlation was -0.683 between pH and fixed acidity

3. Multivariate Section

This section includes plots and analysis of multiple variables

3.1 Multivariate Plots

Alcohol vs. Other Variables over Quality

In the following 3 plots, darker points indicates better quality wines

According to above 3 plots, better quality wines mostly have:

High alcohol rate,
High sulphate level
Low volatile acidity

It is not possible to say anything about pH level.

In the final plot it is much clear to identify darker points.

Other Multivariate Plots

In the first plot, darker points indicates better quality wines again and High citric acid and low acetic acid (volatile acidity)seems like a good combination for a quality wine
In the second plot, there seems to be clusters in the plot. High quality wines are having low free SO2 and total SO2 values, where low amd mid quality wines are having more higher free and total SO2 values.

3.2 Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

After investigating many multivariate plots, it seems there are some combinations that makes a high quality wine:

High alcohol rate and high sulphate level,
High alcohol rate and low volatile acidity
Low free SO2 and low total SO2

Were there any interesting or surprising interactions between features?

While looking for interesting multivariate plots, I created a dozen plots. In free SO2 vs. total SO2 plot, I added quality as color and as a result, very interesting plot occured. There were clusters in the plot; high quality wines had low free and total SO2 values, where mid and low quality wines had higher free and total SO2 values. It was a big surprise for me since I didn’t expect such a plot.

Final PLots and Summary

Plot 1

Description 1

This chart shows how alcohol percent highly effects the quality level.

Next time, try buying a red wine with the highest alcohol percent. You may receive the highest quality wine and it may let you experience an unforgettable pleasent flavor.

To support our description, we can also plot a density graph

Plot 2

Description 2

Combinations with high alcohol percent and low acetic acid (volatie acidity) seems to produce better wines.

So instead of just looking for the wine with the highest alcohol percentage, we should also look for low acetic acid concentrations to increase our chance of buying a high quality red wine.

Plot 3

Description 3

I found this plot very interesting and informative because it seems like there is an imaginary straight line splitting the top 3 quality wines from the rest of the data set.

Most of the wines having a quality of 6, 7 and 8 are:

Below the 75 mg/dm3 total sulfur dioxide concentration and,
Below the 30 mg/dm3 free sulfur dioxide concentration

To prove this finding, let’s look at 90% quantiles:

## 90% of the red wines in this data set with quality scores 6, 7 & 8,

## Have a total sulfur dioxide concentration below 74  mg/dm3

## Have a free sulfur dioxide concentration below 29  mg/dm3

Reflection

Overall, it was a great experience investigating and exploring red wine data. Before this project, I knew a little about wine quality but now I learned a lot from data.

Let’s briefly sum up what we discussed.

First, I investigated each attribute individually, and tried to analyze distributions,
Secondly, I examined the correlations and general plots in ggpair plot, which helped me a lot during this project and provided guidance
Then, in bivariate section, I explored the relationships between variable combinations. At this point project became more exploratory.
In the last part, I did some multivariate analysis, which gave me more insight about the data.

In conclusion, after analyzing the data I came up with the following results:

When alcohol percentage increases, quality mostly increases.
There are some variable combinations that increases wine quality such as high percentage of alcohol and low concentration of volatile acidity.
Free sulfur dioxide and total sulfur dioxide are positively correlated and high quality wines are having low concentration of free and total sulfur dioxide.
In this data set, none of the attributes are strongly correlated with quality; the top four attributes, that have the highest correlation coefficient with quality, explains only 34% of the variability in quality, which is not enough to predict high quality wines with a good precision.
Most frequent quality levels of red wines are 5 and 6.
There are more wines with low alcohol percentage in this data set.
Wine with the highest alcohol percentage has a quality level of 7, where wine with the lowest alcohol percentage has a quality level of 5.

Where did I run into difficulties in the analysis?

My main difficulty was, I knew almost nothing about wine making and its procedures. So I started my project by reading many articles and blog posts about wine making. Learned a lot about technical terms and attributes used in this data set, however this learning process took longer than I expected

Where did I find successes?

In this project, we asked a question to dataset, “Which chemical properties influence the quality of red wines?”, and investigated it in many ways. The best part of this project and for me the main success was exploring and somehow predicting a wine quality with a few technical variables without actually tasting it. Just by exploring data, anyone can figure out basic trends and answer this question “What really effects red wine quality?”

How could the analysis be enriched in future work?

Although it looks not possible to add new features to the data, since all records are anonymous, expert reviews could be added to enrich this data set. I think it is important to get feedback from reviewers because there isn’t any explanation of how these reviewers rate a wine and also no information about evaluation criterias. Not only having scoring numbers but also some more information about the score and reviewers notes about the wine would increase the productivity of this investigation. If we had that information, some text learning algorithms could be applied on reviewer comments to gather more information about what makes a wine great or bad from reviewers eyes. Thus making sensory data more clear and understandable.