RED WINE QUALITY EDA by Hiraku Shibuya

In this paper, I’m going to analyse which factors make red wine taste good. I don’t drink much wine, so I can’t judge a wine the way an expert would, but the data may still reveal some tendencies.

Structure of the data

There are 1,599 wines in the dataset with 13 features (the first one, X, is just a row index). All the variables are stored as numbers, but quality is effectively an ordered categorical variable with the following levels.

quality: (worst) 1 -> 10 (best)

## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...

Individual data overview

Quality

First of all, let me see how the quality scores are distributed.

Most wines are rated 5 or 6 (about 82% of them). No wine is rated below 3 or above 8.

So the wines in this dataset seem to be average to fairly good, but not excellent.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.636   6.000   8.000
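The share of mid-range scores can be checked with a few lines of base R. This is only a minimal sketch: `quality_sample` holds the first ten quality values from the `str()` listing above (the full data file is not reproduced here), and the full-data shares use the counts of 1,319 and 18 wines reported later in this report.

```r
# First ten quality scores, copied from the str() listing above.
quality_sample <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# Count wines per score and the share rated 5 or 6 in the sample.
score_counts <- table(quality_sample)
share_mid_sample <- mean(quality_sample %in% c(5, 6))

# Shares for the full dataset, from the counts reported in this report:
# 1,319 of 1,599 wines are rated 5 or 6, and 18 are rated 8.
share_mid_full <- round(100 * 1319 / 1599, 1)
share_top_full <- round(100 * 18 / 1599, 1)

print(score_counts)
cat("Share rated 5 or 6 (sample):", share_mid_sample, "\n")
cat("Share rated 5 or 6 (full data, %):", share_mid_full, "\n")
cat("Share rated 8 (full data, %):", share_top_full, "\n")
```

On the full data frame, the same `table()` and `mean(... %in% ...)` calls would apply to the quality column directly.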

Acidity

Fixed acidity has a positively skewed distribution. The most frequent value is 7.2. Data points lie between 4.6 and 15.9, but the points above 15 look like outliers.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.60    7.10    7.90    8.32    9.20   15.90

The histogram of volatile acidity is also positively skewed. Data points lie between 0.12 and 1.58, and the most frequent value is 0.6. There also seem to be some outliers; I regard the points above 1.2 as outliers.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3900  0.5200  0.5278  0.6400  1.5800

Citric acid appears to have three peaks, at 0, 0.24 and 0.45. It has an outlier at 1.0.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.090   0.260   0.271   0.420   1.000

Sugar

Residual sugar has a highly skewed distribution with some outliers, so I decided to exclude them.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.900   2.200   2.539   2.600  15.500

I made a histogram that cuts off the upper 10% of data points. The result looks roughly normally distributed, with a peak at 2.0.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.900   2.100   2.171   2.400   3.500
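Cutting off the upper tail can be sketched as a quantile filter. The helper `trim_upper()` is a hypothetical name introduced for illustration, and filtering at the 90th percentile is an assumption about how the cut was made; the sample values are the first ten residual.sugar entries from the `str()` listing above.

```r
# Keep only the values at or below the p-th quantile of x.
# (Hypothetical helper; the report does not show how the cut was done.)
trim_upper <- function(x, p = 0.90) {
  cutoff <- quantile(x, probs = p, na.rm = TRUE)
  x[!is.na(x) & x <= cutoff]
}

# First ten residual.sugar values, copied from the str() listing above.
sugar_sample <- c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2.0, 6.1)

sugar_trimmed <- trim_upper(sugar_sample, p = 0.90)

print(summary(sugar_sample))   # includes the extreme value 6.1
print(summary(sugar_trimmed))  # extreme value removed
```

The same helper with `p = 0.95` or `p = 0.98` would correspond to the cuts described for chlorides and sulphates.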

Chlorides

Chlorides also has a highly skewed distribution. Data points lie between 0.012 and 0.611. Restricted to the lower 95% of data points, it looks roughly normally distributed.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.07000 0.07800 0.07914 0.08800 0.12600

Sulfur dioxide

Free sulfur dioxide seems to have some outliers. Data points lie between 1 and 72, and the most frequent value is 6.0.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    7.00   14.00   15.87   21.00   72.00

Total sulfur dioxide ranges from 6 to 289. It has two clear outliers at 278 and 289.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   22.00   38.00   46.47   62.00  289.00

Density of Wines

The histogram of wine density looks normally distributed, with a peak at 0.997. While the density of water is 1.000 g/cm^3, wine density in this dataset ranges from 0.990 to 1.004. Almost all wines have a lower density than water.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9901  0.9956  0.9968  0.9967  0.9978  1.0040

pH of Wines

pH looks normally distributed. Data points range from 2.74 to 4.01.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.740   3.210   3.310   3.311   3.400   4.010

Sulphates of Wines

Sulphates also has some outliers. All data points lie between 0.33 and 2.0. As above, I cut off the upper 2% of data points.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.5500  0.6200  0.6581  0.7300  2.0000

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.5500  0.6200  0.6434  0.7200  1.1200

Alcohol of Wines

The histogram is positively skewed. Alcohol percentage ranges from 8.4 to 14.9, and the most frequent value is 9.50.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.20   10.42   11.10   14.90

Consideration

The main feature of interest in this dataset is wine quality. In this EDA, I’d like to find out which parameters are strongly related to it. As I’m not very familiar with the components of wine, it is hard to predict which factors affect quality. But I think the balance of acidity is a key factor, because too much acid makes wine taste bad, so there should be some tendency between acidity and quality.

As the histogram of quality shows, 1,319 wines are rated 5 or 6, while only 18 wines are rated 8. What is the difference between them? Below, I look at the relationships between the variables.

Relationship between quality and other variables

Correlation between each variable

The graph above shows the correlation between each pair of variables. I found that volatile acidity, sulphates and alcohol have relatively strong correlations with quality, so I’m going to tackle them first.

Also, some pairs of variables, such as fixed acidity and pH, fixed acidity and density, and free sulfur dioxide and total sulfur dioxide, are strongly correlated.
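A correlation overview like this can be sketched with base R's `cor()`. The data frame `wines_sample` is a hypothetical stand-in built from the first ten rows of the `str()` listing, so its coefficients will not match the full-data values quoted in this report; the point is only the ranking step.

```r
# First ten rows of four columns, copied from the str() listing above.
wines_sample <- data.frame(
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66, 0.60, 0.65, 0.58, 0.50),
  sulphates        = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.80),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, 9.5, 10.5),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Pairwise Pearson correlations between all columns.
cor_matrix <- cor(wines_sample)

# Correlations with quality, ordered by absolute strength.
quality_cor <- cor_matrix[, "quality"]
quality_cor <- quality_cor[names(quality_cor) != "quality"]
quality_cor <- quality_cor[order(abs(quality_cor), decreasing = TRUE)]

print(round(cor_matrix, 3))
print(round(quality_cor, 3))
```

Sorting by absolute value keeps strong negative correlations, such as the one with volatile acidity, near the top of the list.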

Volatile acidity

## 
##  Pearson's product-moment correlation
## 
## data:  wines$quality and wines$volatile.acidity
## t = -16.954, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.4313210 -0.3482032
## sample estimates:
##        cor 
## -0.3905578

It seems that high-quality wines have lower volatile acidity. I googled how volatile acidity affects wine quality and found that volatile acids make the aroma of a wine stronger. So it could be said that too strong an aroma does not suit a high-quality wine.

Let’s see how the points are scattered, along with a regression line that shows the trend of this relationship.
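The correlation test and the trend line can be sketched as follows. The `cor.test()` call mirrors the `data:` line of the output above; fitting the line with `lm()` is an assumption about how the regression line was produced, and `wines_sample` again holds only the first ten rows from the `str()` listing, so the numbers differ from the full-data results.

```r
# First ten rows of two columns, copied from the str() listing above.
wines_sample <- data.frame(
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66, 0.60, 0.65, 0.58, 0.50),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Pearson correlation test between quality and volatile acidity.
ct <- cor.test(wines_sample$quality, wines_sample$volatile.acidity)

# Least-squares line: quality as a linear function of volatile acidity.
fit <- lm(quality ~ volatile.acidity, data = wines_sample)

print(ct$estimate)  # sample correlation coefficient
print(coef(fit))    # intercept and slope of the trend line
```

In an interactive session, a base-graphics scatter plot followed by `abline(fit)` would overlay this line on the points. The slope always has the same sign as the correlation coefficient.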

Sulphates

I conducted the same analysis for quality and sulphates. There also seems to be a tendency, although the correlation (about 0.25) is weaker than for volatile acidity.

## 
##  Pearson's product-moment correlation
## 
## data:  wines$quality and wines$sulphates
## t = 10.38, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2049011 0.2967610
## sample estimates:
##       cor 
## 0.2513971

Alcohol

Alcohol also seems to have a strong relationship with quality. High-quality wines have noticeably more alcohol than low-quality ones: both the mean and the median alcohol content differ by more than 2 percentage points between quality 3 and quality 8.

## 
##  Pearson's product-moment correlation
## 
## data:  wines$quality and wines$alcohol
## t = 21.639, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4373540 0.5132081
## sample estimates:
##       cor 
## 0.4761663

## wines$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   8.400   9.725   9.925   9.955  10.580  11.000 
## -------------------------------------------------------- 
## wines$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.00    9.60   10.00   10.27   11.00   13.10 
## -------------------------------------------------------- 
## wines$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     8.5     9.4     9.7     9.9    10.2    14.9 
## -------------------------------------------------------- 
## wines$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.80   10.50   10.63   11.30   14.00 
## -------------------------------------------------------- 
## wines$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.20   10.80   11.50   11.47   12.10   14.00 
## -------------------------------------------------------- 
## wines$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.80   11.32   12.15   12.09   12.88   14.00
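A per-quality summary of this kind can be sketched with base R's `tapply()`. The listing above has the layout that `by()` prints, but the exact call is not shown in this report, so the version below is only an illustration. It uses the first ten alcohol and quality values from the `str()` listing, and the gap calculation uses the full-data means and medians printed above.

```r
# First ten alcohol and quality values, copied from the str() listing above.
alcohol_sample <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, 9.5, 10.5)
quality_sample <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# Six-number summary of alcohol for each quality level in the sample.
alcohol_summary <- tapply(alcohol_sample, quality_sample, summary)
print(alcohol_summary)

# Compact alternative: one median per quality level.
alcohol_median <- tapply(alcohol_sample, quality_sample, median)
print(alcohol_median)

# Gap between quality 8 and quality 3, from the full-data values printed above.
mean_gap   <- 12.09 - 9.955
median_gap <- 12.15 - 9.925
cat("Mean gap:", mean_gap, " Median gap:", median_gap, "\n")
```

Both gaps come out above 2 percentage points, which is the difference described in the text.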

Correlation between fixed acidity and pH, density

I also found strong correlations between fixed acidity and both pH and density. The correlation coefficient between pH and fixed acidity is about -0.683, and between density and fixed acidity about 0.668.

Consideration

The strongest relationship I found was between quality and alcohol. I didn’t expect alcohol to correlate with quality, because I thought the strength of the alcohol would not affect how good a wine tastes.

Sulphates and volatile acidity also correlate with quality. Higher-quality wines tend to appear in the low volatile acidity, high sulphates zone. On the other hand, I don’t know why fixed acidity shows no clear relationship with quality. The correlation coefficient between fixed acidity and volatile acidity is only -0.256, so the two kinds of acidity are only weakly correlated with each other.

I also found that pH and fixed acidity correlate with each other: the higher the fixed acidity, the lower the pH. This is intuitively understandable from the definition of pH.
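For reference, the usual textbook definition of pH (stated here for dilute solutions, with the hydrogen-ion concentration in mol/L) makes the direction of this relationship explicit. This is a general chemistry fact and not something derived from the dataset:

```latex
\mathrm{pH} = -\log_{10}\,[\mathrm{H}^{+}]
\qquad\Longrightarrow\qquad
[\mathrm{H}^{+}] \to 10\,[\mathrm{H}^{+}]
\;\text{ gives }\;
\mathrm{pH} \to \mathrm{pH} - 1
```

So a wine with more acid, and therefore typically more free hydrogen ions, would be expected to have a lower pH, which matches the negative correlation of about -0.683.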

Multivariate Analysis

In the analysis of the relationship between quality and the other variables above, I identified some interesting correlations. Let’s see whether multivariate plots reveal more. In the multivariate plots below, I colored the points by wine quality so that any tendency is easier to see.

Multivariate plots with alcohol and some variables

As I found that alcohol correlates strongly with quality, I made two multivariate plots with it.

These give a much better visualization. The upper left side of both graphs is the area of high-quality wines.
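Coloring points by quality can be sketched in base R as below. The axis choice (volatile acidity on x, alcohol on y) is an assumption inferred from the "upper left" description, the plotting library used for the original figures is not shown in this report, and `wines_sample` holds only the first ten rows from the `str()` listing.

```r
# First ten rows of three columns, copied from the str() listing above.
wines_sample <- data.frame(
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66, 0.60, 0.65, 0.58, 0.50),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, 9.5, 10.5),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Treat quality as a factor so that each score gets its own colour.
quality_factor   <- factor(wines_sample$quality)
palette_by_score <- rainbow(nlevels(quality_factor))
point_colours    <- palette_by_score[as.integer(quality_factor)]

# Draw only in an interactive session; the colour mapping above is the key step.
if (interactive()) {
  plot(wines_sample$volatile.acidity, wines_sample$alcohol,
       col = point_colours, pch = 19,
       xlab = "volatile acidity", ylab = "alcohol (%)")
  legend("topright", legend = levels(quality_factor),
         col = palette_by_score, pch = 19, title = "quality")
}
```

With the full dataset, high-quality colors concentrating in the upper left corner would correspond to the pattern described above: high alcohol and low volatile acidity.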

Multivariate plots with fixed acidity and pH, density

Let’s see if there is any tendency among fixed acidity, pH and density. As noted above, fixed acidity is strongly correlated with both pH and density.

Consideration

It became clearer that alcohol is strongly related to the quality of wines. I thought it would be very difficult to explain how people judge a good wine by their own senses, but the visualizations above revealed some factors that are associated with wine quality.

Final Plots and Summary

Plot One

The histogram above shows the distribution of wine quality. Since I don’t drink wine and the dataset contains so many observations, I expected the quality scores to be widely spread and normally distributed. But it surprised me that most wines (about 82%) are rated 5 or 6, and that no wines are rated 1, 2, 9 or 10. Also, only 18 of the 1,599 wines scored 8. So I was curious about which parameters make a wine high-quality.

Plot Two

When I explored each variable, I found a strong relationship between quality and alcohol. The scatter plot above shows how quality and alcohol correlate. I chose this graph because it was not what I expected before the analysis: I thought alcohol percentage would not affect the quality of wines, but there was actually a clear tendency. It was surprising.

Plot Three

I chose the graph above as the final plot. It shows the quality tendency more clearly. Wines seem to have higher quality when alcohol is high and volatile acidity is low. It is interesting that quality 5 wines occur most often where volatile acidity is between 0.4 and 0.8 and alcohol is between 8 and 10. I’m not familiar with what volatile acidity is, so I’m not sure what this graph says scientifically, but there seems to be a clear trend.

Reflection

This wine dataset contains 1,599 red wines with 13 features. With this dataset, I tried to find out which features are strongly related to the quality of wine.

At first, I didn’t know how to start the analysis because I don’t drink wine and am not familiar with the chemical terms, so I started by plotting individual variables and looking at their distributions. Then, when I explored the correlations between the variables, I found that quality and volatile acidity have a relatively strong correlation. I thought this could help explain quality and started to explore it more deeply.

However, the individual correlations were not strong enough to explain what makes a wine good, so I made another plot that shows alcohol and volatile acidity together with the quality of wines. It describes well how alcohol and volatile acidity relate to quality. So it seems I achieved my goal.

Also, I’m interested in whether the quality of white wine follows a similar tendency. I’m going to analyse that dataset at a later date.

Throughout this analysis, I ran into some technical problems with the R programming language, but I think I did my best with my current skill set.