Red Wine Quality Exploration by Min Qu

This report explores a dataset containing quality and eleven attributes for 1599 entries of red wine.

Univariate Plots Section

## [1] 1599   13
## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000
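
The three outputs above have the shape that `dim()`, `str()`, and `summary()` print for a data frame. A minimal sketch, assuming a data frame named `wine` (a hypothetical name, since the report does not show its code) and using only the first ten values of three columns from the `str()` listing, so the numbers will not match the full-data summaries:

```r
# First ten rows of three columns, copied from the str() output above.
# The name `wine` is an assumption made for illustration.
wine <- data.frame(
  sulphates = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.80),
  alcohol   = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality   = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

dim(wine)      # number of rows and columns
str(wine)      # type and leading values of each column
summary(wine)  # min, quartiles, median, mean and max per column
```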

The data frame has 1599 observations and 13 columns: an index column X plus twelve variables (eleven attributes and quality).

Quality scores in this dataset range from 3 to 8, so no entry has an extremely high or low score. I created a new variable that groups the quality scores into three levels: low (4 or below), medium (5 to 6), and high (7 or above). Most of the entries fall in the medium level.
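
One way such a grouping could be built is with `cut()`. This is a sketch only: it assumes non-overlapping bins of 4 or below, 5 to 6, and 7 or above, and it uses the first ten quality scores from the `str()` output rather than the full dataset.

```r
# First ten quality scores, copied from the str() output above.
quality <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# Bin the scores into an ordered factor. With the default right = TRUE the
# intervals are (-Inf, 4], (4, 6] and (6, Inf], so the levels cannot overlap.
quality_level <- cut(quality,
                     breaks = c(-Inf, 4, 6, Inf),
                     labels = c("low", "medium", "high"),
                     ordered_result = TRUE)

# Count the entries in each level.
table(quality_level)
```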

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.60    7.10    7.90    8.32    9.20   15.90

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3900  0.5200  0.5278  0.6400  1.5800

Most entries have fixed acidity between 6 g/L and 10 g/L, with a median of 7.90 g/L and a mean of 8.32 g/L. About 25% of entries have volatile acidity below 0.4 g/L (the first quartile is 0.39 g/L).

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.090   0.260   0.271   0.420   1.000

Most of the entries have less than 0.50 g/L of citric acid, and some values occur more often than others, for example 0.00 g/L, 0.24 g/L, and 0.49 g/L. I wonder whether there is a specific reason for these peaks or whether they occur by chance.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.900   2.200   2.539   2.600  15.500

Most of the entries have residual sugar below 4 g/L, but a few exceed 12 g/L. I wonder whether a very high level of residual sugar has a positive or negative impact, and how this variable relates to the other variables.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100

Most of the entries have chlorides between 0.05 g/L and 0.12 g/L, with a median of 0.079 g/L and a mean of 0.08747 g/L.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    7.00   14.00   15.87   21.00   72.00

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   22.00   38.00   46.47   62.00  289.00

The histograms of free sulfur dioxide and total sulfur dioxide are both right-skewed, so I transformed the data using a log scale.
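
A log transformation of this kind can be sketched in base R as below. The base of the logarithm and the plotting approach used in the report are not shown, so `log10()` and `hist()` are assumptions here, and the ten values come from the `str()` output rather than the full dataset.

```r
# First ten total sulfur dioxide values, copied from the str() output above.
total_so2 <- c(34, 67, 54, 60, 34, 40, 59, 21, 18, 102)

# Log-transform the values (base 10 is an assumption).
log_so2 <- log10(total_so2)

# Compare the summaries before and after the transformation.
summary(total_so2)
summary(log_so2)

# Bin counts of the transformed values; drop plot = FALSE to draw the histogram.
hist(log_so2, plot = FALSE)$counts
```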

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9901  0.9956  0.9968  0.9967  0.9978  1.0040

It is no surprise that most entries have a density slightly below 1 g/ml. However, the maximum exceeds 1 g/ml, probably because of residual sugar.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.740   3.210   3.310   3.311   3.400   4.010

The pH values are approximately normally distributed, with a median of 3.310 and a mean of 3.311.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.5400  0.6000  0.6448  0.7000  2.0000

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3900  0.6500  0.7400  0.7435  0.8200  1.3600

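Above, we subset the red wine entries with a ‘high’ quality level and compare their sulphates with those of the entries in the low and medium levels. The first summary covers the low and medium levels, and the second covers the high level. In general, high quality wine has a higher concentration of sulphates (median 0.74 g/L versus 0.60 g/L).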
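
The comparison above can be sketched with `subset()`. The data frame name `wine` and the cut-off of 7 for the ‘high’ level are assumptions, and the ten rows are taken from the `str()` output for illustration only, so they will not reproduce the two summaries.

```r
# First ten rows of sulphates and quality, copied from the str() output above.
wine <- data.frame(
  sulphates = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.80),
  quality   = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Split the entries into high quality (assumed: score of 7 or above) and the rest.
high <- subset(wine, quality >= 7)
rest <- subset(wine, quality < 7)

# Summarise sulphates in each group.
summary(rest$sulphates)
summary(high$sulphates)
```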

Univariate Analysis

What is the structure of your dataset?

There are 1599 entries of red wine in this dataset with 12 attributes (fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol, quality). All the variables are quantitative variables.

Other observations:

  • Most entries have quality between 5 and 7.
  • The median of fixed acidity is 7.90 g/L and the max is 15.90 g/L
  • About 25% of entries have citric acid less than 0.090 g/L
  • About 75% of entries have a pH less than 3.400
  • Most entries have sulphates between 0.4 g/L and 1.0 g/L

What is/are the main feature(s) of interest in your dataset?

The main features in the data set are sulphates and quality. I’d like to explore if there is a relationship between sulphates and quality and I suspect sulphates have a positive impact on quality.

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

I think pH, alcohol, volatile acidity, fixed acidity, citric acid, and residual sugar likely contribute to the quality of red wine. I guess the pH of high quality red wine falls within a small range. After some background reading, I think alcohol is positively related to quality, while volatile acidity probably has a negative impact on quality.

Did you create any new variables from existing variables in the dataset?

I created a new variable called ‘quality_level’, a factor with three levels: low, medium, and high. It is derived from the numerical variable ‘quality’: low (4 or below), medium (5 to 6), and high (7 or above).

Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?

Free sulfur dioxide and total sulfur dioxide are right-skewed, so I log-transformed their distributions. After the transformation, the distribution of total sulfur dioxide looks approximately normal.

Apart from that, I haven’t performed any tidying operations on the data, since the dataset is already fairly clean.

Bivariate Plots Section

From a subset of the data, sulphates, alcohol, and volatile acidity are moderately correlated with quality, while the other variables seem to have weak or no correlation with quality. Also, fixed acidity and citric acid are strongly correlated with pH; alcohol and fixed acidity are strongly correlated with density; and volatile acidity is strongly correlated with citric acid.
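
Pairwise correlations like these can be computed with `cor()`. The sketch below is not the report's own code: it assumes a data frame named `wine` and uses ten rows of four columns from the `str()` output, far too few to reproduce the strengths described above.

```r
# First ten rows of four columns, copied from the str() output above.
wine <- data.frame(
  sulphates        = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.80),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66, 0.60, 0.65, 0.58, 0.50),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Pearson correlation between every pair of columns.
cor_matrix <- cor(wine)

# Round for readability; the "quality" column shows each variable's
# correlation with quality.
round(cor_matrix, 2)
```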

## $`3`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.4000  0.5125  0.5450  0.5700  0.6150  0.8600 
## 
## $`4`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.4900  0.5600  0.5964  0.6000  2.0000 
## 
## $`5`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.370   0.530   0.580   0.621   0.660   1.980 
## 
## $`6`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.4000  0.5800  0.6400  0.6753  0.7500  1.9500 
## 
## $`7`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3900  0.6500  0.7400  0.7413  0.8300  1.3600 
## 
## $`8`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.6300  0.6900  0.7400  0.7678  0.8200  1.1000

It is not hard to see that high quality red wine tends to contain more sulphates: the median rises from 0.545 at quality 3 to 0.74 at qualities 7 and 8. Based on some research, an increase in sulphates might be related to fermentation nutrition, which is crucial for improving the wine’s aroma.
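
A per-score listing with this `$`3``, `$`4`` shape is what `tapply()` prints when it applies `summary()` to each group. Whether the report used `tapply()` is an assumption, as is the data frame name `wine`. The ten rows below come from the `str()` output and only contain quality scores 5, 6, and 7.

```r
# First ten rows of sulphates and quality, copied from the str() output above.
wine <- data.frame(
  sulphates = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.80),
  quality   = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Apply summary() to the sulphates of each quality score.
by_quality <- tapply(wine$sulphates, wine$quality, summary)

# Prints one six-number summary per quality score.
by_quality
```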

The trend between alcohol and quality shows that the two are positively correlated. Red wine with a higher quality score usually has a higher median alcohol content, except for a dip at quality 5.

Volatile acidity has a negative impact on red wine, since acetic acid is the key ingredient in vinegar. As the quality of the wine improves, the median volatile acidity decreases.
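
The medians discussed in the last two paragraphs can be tabulated per quality score with `aggregate()`. This is a sketch under assumptions: the data frame name `wine` is hypothetical, and the ten rows from the `str()` output are too few to show the trends seen in the full data.

```r
# First ten rows of three columns, copied from the str() output above.
wine <- data.frame(
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66, 0.60, 0.65, 0.58, 0.50),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Median alcohol and median volatile acidity for each quality score.
medians <- aggregate(cbind(alcohol, volatile.acidity) ~ quality,
                     data = wine, FUN = median)
medians
```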

Since alcohol is less dense than water, it is no surprise that the alcohol content and density of the wine are negatively correlated. Some entries have a density above 1 g/cm^3, which is a little surprising. I guess those wines contain more residual sugar.

Since citric acid is a naturally occurring non-volatile organic acid, there is no surprise that citric acid and volatile acidity are negatively correlated.

pH correlates strongly and negatively with fixed acidity, which is not a surprise: a lower pH means the wine is more acidic.
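
The sign and strength of a single relationship such as this one can be checked with `cor.test()`. The sketch below uses the first ten rows from the `str()` output and a hypothetical data frame name `wine`, so the estimate is illustrative rather than the value for the full dataset.

```r
# First ten rows of fixed acidity and pH, copied from the str() output above.
wine <- data.frame(
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  pH            = c(3.51, 3.20, 3.26, 3.16, 3.51, 3.51, 3.30, 3.39, 3.36, 3.35)
)

# Pearson correlation test between fixed acidity and pH.
result <- cor.test(wine$fixed.acidity, wine$pH, method = "pearson")

# Correlation estimate and its p-value.
result$estimate
result$p.value
```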

Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

Quality correlates moderately with sulphates, alcohol, and volatile acidity.

As sulphates in red wine increase, the quality of the wine tends to improve. Based on some research, an increase in sulphates might be related to fermentation nutrition, which is crucial for improving the wine’s aroma.

High quality red wine tends to contain more alcohol than low quality red wine. However, the median alcohol content is lowest when the quality is 5.

Volatile acidity has a negative impact on red wine: as the quality of the wine improves, volatile acidity tends to decrease.

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

I found that the alcohol content and density of the wine are negatively correlated. Since alcohol is less dense than water, it’s no surprise that the density tends to fall as a wine contains more alcohol. Interestingly, some entries have a density above 1 g/cm^3, which is the density of water. I guess those wines contain a high amount of residual sugar.

What was the strongest relationship you found?

The strongest relationship I found was between pH and fixed acidity. Since pH is a direct measure of a liquid’s acidity, the result met my expectation.

Multivariate Plots Section

As we have explored before, alcohol correlates positively with quality while volatile acidity correlates negatively with quality. The plot above seems reasonable: low quality wines are clustered in the upper-left corner, where alcohol is low and volatile acidity is high, and high quality wines are clustered in the bottom-right corner. However, we cannot find a strong correlation between alcohol and volatile acidity.

As we have explored before, citric acid and volatile acidity are negatively correlated. From the plot, we can also see that when volatile acidity is below 0.4 g/L and citric acid is between 0.25 g/L and 0.50 g/L, the red wine is very likely to have high quality.

Alcohol and density are negatively correlated. Alcohol has a positive correlation with quality. However, density does not seem to correlate with quality, since there is no clear pattern of quality across wines of different density. Most high quality wines seem to cluster in the area where alcohol is more than 12%.

Citric acid seems to be a very important factor in red wine’s quality. When the wine’s citric acid is low (less than 0.25 g/L), it is more likely to be low-quality; when the citric acid is at a medium level (between 0.25 g/L and 0.50 g/L), the red wine tends to be high-quality.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

Holding density constant, red wine of higher quality tends to have more alcohol in it.

Holding alcohol constant, red wine with a higher volatile acidity tends to have a worse quality than wine with a lower volatile acidity.

Were there any interesting or surprising interactions between features?

I found the interaction between citric acid and quality very interesting. When the wine’s citric acid is at a low level (less than 0.25 g/L), it is very likely to be low-quality; when the citric acid is at a medium level (between 0.25 g/L and 0.50 g/L), the red wine tends to be high-quality. However, when the amount of citric acid grows to a high level (more than 0.7 g/L), it doesn’t seem to make a big difference to the red wine’s quality.
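
One way to check a pattern like this numerically is to bin citric acid and cross-tabulate the bins against the quality levels. The sketch below is an illustration, not the report's method: the bin edges of 0.25 g/L and 0.50 g/L follow the paragraph above, the quality cut-offs (4 or below, 5 to 6, 7 or above) are assumed, and the ten rows come from the `str()` output.

```r
# First ten rows of citric acid and quality, copied from the str() output above.
citric  <- c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36)
quality <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# Bin citric acid at 0.25 g/L and 0.50 g/L.
citric_level <- cut(citric,
                    breaks = c(-Inf, 0.25, 0.50, Inf),
                    labels = c("low", "medium", "high"))

# Bin quality into three levels (the cut-offs are an assumption).
quality_level <- cut(quality,
                     breaks = c(-Inf, 4, 6, Inf),
                     labels = c("low", "medium", "high"))

# Counts of entries for each combination of citric acid level and quality level.
counts <- table(citric_level, quality_level)
counts

# Row-wise proportions: the share of each quality level within a citric acid level.
prop.table(counts, margin = 1)
```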

OPTIONAL: Did you create any models with your dataset? Discuss the
strengths and limitations of your model.


Final Plots and Summary

Plot One

Description One

Volatile acidity has a negative impact on red wine, since acetic acid is the key ingredient in vinegar. The median volatile acidity of high-quality red wine is much lower than that of low-quality red wine. Therefore, volatile acidity could be an important indicator of red wine’s quality.

Plot Two

Description Two

A greater proportion of high-quality red wines have a high citric acid content than is the case in the citric acid distributions for the lower quality levels.

Plot Three

Description Three

Because alcohol is less dense than water, it is understandable that alcohol and density are negatively correlated. Red wine of higher quality tends to contain more alcohol and have a lower density overall.


Reflection

The dataset contains quality and eleven attributes for 1599 entries of red wine. Most entries have quality between 5 and 7. I first plotted several histograms to understand the individual variables in the dataset, and then I plotted a correlation matrix to find relationships between variables, especially between quality and the other variables.

Quality correlates positively with sulphates and alcohol and negatively with volatile acidity. There are also interesting relationships between other variables. For example, alcohol and density are negatively correlated, which matches common sense; volatile acidity and citric acid, both of which play an important role in red wine’s quality, are negatively correlated. I struggled to understand why some red wines have a density above 1 g/cm^3, which is the density of water. I guess this might be due to a high amount of residual sugar in those wines.

Due to the limits of my statistical knowledge, I haven’t built any statistical models with the dataset; this could be future work. I will also try to learn more graphical techniques in R and polish the plots in the future.