Red Wine Quality Analysis

Introduction

This project applies exploratory data analysis techniques in R to a dataset on the chemical properties of red wines in order to answer the following question: “Which chemical properties influence the quality of red wines?” The dataset contains 1,599 red wines, each described by 11 chemical properties plus a quality rating, for 12 variables in total. The quality of each wine is rated between 0 (very bad) and 10 (very excellent) by at least 3 wine experts.

Summary of the Data Set

## [1] 1599   12
##  [1] "fixed.acidity"        "volatile.acidity"     "citric.acid"         
##  [4] "residual.sugar"       "chlorides"            "free.sulfur.dioxide" 
##  [7] "total.sulfur.dioxide" "density"              "pH"                  
## [10] "sulphates"            "alcohol"              "quality"
## 'data.frame':    1599 obs. of  12 variables:
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
##  fixed.acidity   volatile.acidity  citric.acid    residual.sugar  
##  Min.   : 4.60   Min.   :0.1200   Min.   :0.000   Min.   : 0.900  
##  1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090   1st Qu.: 1.900  
##  Median : 7.90   Median :0.5200   Median :0.260   Median : 2.200  
##  Mean   : 8.32   Mean   :0.5278   Mean   :0.271   Mean   : 2.539  
##  3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420   3rd Qu.: 2.600  
##  Max.   :15.90   Max.   :1.5800   Max.   :1.000   Max.   :15.500  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.01200   Min.   : 1.00       Min.   :  6.00      
##  1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 22.00      
##  Median :0.07900   Median :14.00       Median : 38.00      
##  Mean   :0.08747   Mean   :15.87       Mean   : 46.47      
##  3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 62.00      
##  Max.   :0.61100   Max.   :72.00       Max.   :289.00      
##     density             pH          sulphates         alcohol     
##  Min.   :0.9901   Min.   :2.740   Min.   :0.3300   Min.   : 8.40  
##  1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50  
##  Median :0.9968   Median :3.310   Median :0.6200   Median :10.20  
##  Mean   :0.9967   Mean   :3.311   Mean   :0.6581   Mean   :10.42  
##  3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10  
##  Max.   :1.0037   Max.   :4.010   Max.   :2.0000   Max.   :14.90  
##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.636  
##  3rd Qu.:6.000  
##  Max.   :8.000
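
The output above is the kind of result produced by `dim()`, `names()`, `str()`, and `summary()`. The following is a minimal sketch of those calls, not the original code. The `load_wines()` helper and its `path` argument are hypothetical, since the write-up does not name the source file. The small `wines` data frame is a stand-in built from the first ten rows of four columns shown in the `str()` output, so its results will differ from the full summary.

```r
# Hypothetical loader: the file path and the presence of a header row are
# assumptions; the write-up does not name the source file.
load_wines <- function(path) {
  read.csv(path, header = TRUE)
}

# Stand-in data so the sketch runs on its own: the first ten rows of four
# columns, copied from the str() output above.
wines <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

dim(wines)      # number of rows and columns
names(wines)    # column names
str(wines)      # type and first values of each column
summary(wines)  # minimum, quartiles, mean, and maximum of each column
```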

Observations from the Summary

Wine quality has a median of 6, with a minimum of 3 and a maximum of 8. Some wines contain no citric acid, which is often added to wines for ‘freshness’ and flavor. Free sulfur dioxide and total sulfur dioxide both vary greatly: free sulfur dioxide ranges from 1 to 72, and total sulfur dioxide ranges from 6 to 289.

Distribution of Single Variables

I will now analyze single variables to look for unusual data and patterns that need further analysis.

Quality

Most wines are either 5 or 6 in quality.
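
A quick way to check a statement like this is to tabulate the scores before plotting them. The sketch below uses base R and only the ten quality values visible in the `str()` output, so the counts are illustrative rather than the real totals. The `plot_quality()` helper is a hypothetical name, and the original plot may have been drawn with a different library.

```r
# First ten quality scores, copied from the str() output (illustrative only).
quality <- c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)

# Count every score on the 0-10 rating scale, including scores that never occur.
quality_counts <- table(factor(quality, levels = 0:10))

# Share of wines at each score.
quality_share <- prop.table(quality_counts)

# Hypothetical helper: bar chart of the counts, one bar per score.
plot_quality <- function(counts) {
  barplot(counts, xlab = "Quality", ylab = "Number of wines")
}
```

Calling `plot_quality(quality_counts)` in an interactive session draws the chart. Fixing the factor levels to 0 through 10 keeps empty scores visible on the axis.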

Fixed Acidity

The plot peaks near the center, around 7.5. A small number of wines have extremely high fixed acidity, up to a maximum of 15.9.

Volatile Acidity

After removing the outliers on the right, the distribution appears roughly normal. Most volatile acidity values fall between 0.25 and 0.75.
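
The write-up does not say how the outliers were removed. One common approach, sketched here as an assumption, is to drop values above an upper quantile before drawing the histogram. The `trim_upper()` and `plot_trimmed_hist()` helpers and the 0.99 cutoff are hypothetical choices, and the data are the ten volatile acidity values from the `str()` output.

```r
# Keep values at or below an upper quantile; prob = 0.99 is an assumed cutoff.
trim_upper <- function(x, prob = 0.99) {
  cutoff <- quantile(x, probs = prob, na.rm = TRUE)
  x[!is.na(x) & x <= cutoff]
}

# First ten volatile acidity values, copied from the str() output.
volatile_acidity <- c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5)
trimmed <- trim_upper(volatile_acidity)

# Hypothetical helper: histogram of the trimmed values.
plot_trimmed_hist <- function(x, label = "Volatile acidity") {
  hist(x, main = "", xlab = label)
}
```

The same helper could be applied to other right-skewed variables such as residual sugar and chlorides. Restricting the axis limits instead of filtering the data is an alternative that keeps all observations in the data set.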

Citric Acid

Citric Acid has a fairly even distribution except for peaks at 0 and 0.49.

Residual Sugar

Residual sugar peaks around 2. Removing the outliers again results in a roughly normal distribution.

Chlorides

The peak is at approximately 0.075. Removing the outliers once again produces a roughly normal distribution.

Free Sulfur Dioxide

I transformed the x-axis to a log scale to better understand the distribution. There is a slight peak around 6.

Total Sulfur Dioxide

Once again, I transformed the x-axis to a log scale to better understand the distribution. There is a slight peak around 50. This variable shows a pattern similar to that of free sulfur dioxide, probably because total sulfur dioxide includes free sulfur dioxide.
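
The log-scale view used for the two sulfur dioxide variables can be approximated in base R by binning the log10 of the values. This is a sketch under that assumption, not the original plotting code. `log_hist()` and `plot_log_hist()` are hypothetical helpers, and the data are the ten free sulfur dioxide values from the `str()` output.

```r
# First ten free sulfur dioxide values, copied from the str() output.
free_so2 <- c(11, 25, 15, 17, 11, 13, 15, 15, 9, 17)

# Bin the data on a log10 scale. This requires strictly positive values,
# which holds for this variable because the summary reports a minimum of 1.
log_hist <- function(x) {
  stopifnot(all(x > 0))
  hist(log10(x), plot = FALSE)
}

h <- log_hist(free_so2)

# Hypothetical helper: draw the precomputed histogram with a labelled axis.
plot_log_hist <- function(h, label) {
  plot(h, main = "", xlab = paste0("log10(", label, ")"))
}
```

Note that the axis then shows log10 units, so a peak near 6 on the original scale appears near 0.78.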

Density

Density has a very small range, from 0.9901 to 1.0037. The distribution appears approximately normal.

pH

There is a peak around 3.3. The pH level is probably affected by acidity. Once again, the distribution appears approximately normal.

Sulphates

The peak occurs around 0.6. Since sulphates can contribute to sulfur dioxide levels, the plot has a shape similar to that of total sulfur dioxide.

Alcohol Percentage

There is a peak around 9.5. Alcohol percentage probably affects the pH level and the overall taste.

Just looking at the distribution is not enough to identify variables that affect wine quality.

Bivariate Analysis

I will now explore correlations between variables and find which variables to explore further.

Correlation Values

## 
## CORRELATIONS
## ============
## - correlation type:  pearson 
## - correlations shown only when both variables are numeric
## 
##                      fixed.acidity volatile.acidity citric.acid
## fixed.acidity                    .           -0.256       0.672
## volatile.acidity            -0.256                .      -0.552
## citric.acid                  0.672           -0.552           .
## residual.sugar               0.115            0.002       0.144
## chlorides                    0.094            0.061       0.204
## free.sulfur.dioxide         -0.154           -0.011      -0.061
## total.sulfur.dioxide        -0.113            0.076       0.036
## density                      0.668            0.022       0.365
## pH                          -0.683            0.235      -0.542
## sulphates                    0.183           -0.261       0.313
## alcohol                     -0.062           -0.202       0.110
## quality                      0.124           -0.391       0.226
##                      residual.sugar chlorides free.sulfur.dioxide
## fixed.acidity                 0.115     0.094              -0.154
## volatile.acidity              0.002     0.061              -0.011
## citric.acid                   0.144     0.204              -0.061
## residual.sugar                    .     0.056               0.187
## chlorides                     0.056         .               0.006
## free.sulfur.dioxide           0.187     0.006                   .
## total.sulfur.dioxide          0.203     0.047               0.668
## density                       0.355     0.201              -0.022
## pH                           -0.086    -0.265               0.070
## sulphates                     0.006     0.371               0.052
## alcohol                       0.042    -0.221              -0.069
## quality                       0.014    -0.129              -0.051
##                      total.sulfur.dioxide density     pH sulphates alcohol
## fixed.acidity                      -0.113   0.668 -0.683     0.183  -0.062
## volatile.acidity                    0.076   0.022  0.235    -0.261  -0.202
## citric.acid                         0.036   0.365 -0.542     0.313   0.110
## residual.sugar                      0.203   0.355 -0.086     0.006   0.042
## chlorides                           0.047   0.201 -0.265     0.371  -0.221
## free.sulfur.dioxide                 0.668  -0.022  0.070     0.052  -0.069
## total.sulfur.dioxide                    .   0.071 -0.066     0.043  -0.206
## density                             0.071       . -0.342     0.149  -0.496
## pH                                 -0.066  -0.342      .    -0.197   0.206
## sulphates                           0.043   0.149 -0.197         .   0.094
## alcohol                            -0.206  -0.496  0.206     0.094       .
## quality                            -0.185  -0.175 -0.058     0.251   0.476
##                      quality
## fixed.acidity          0.124
## volatile.acidity      -0.391
## citric.acid            0.226
## residual.sugar         0.014
## chlorides             -0.129
## free.sulfur.dioxide   -0.051
## total.sulfur.dioxide  -0.185
## density               -0.175
## pH                    -0.058
## sulphates              0.251
## alcohol                0.476
## quality                    .

The Pearson correlations above show that quality correlates most strongly with alcohol percentage (0.476) and volatile acidity (-0.391). Let’s look at the visual representation of the correlations.
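
Reading one column of a large correlation table is easier when it is sorted. The sketch below uses base R's `cor()` as a stand-in for whatever function printed the table above, and ranks the variables by the absolute value of their Pearson correlation with quality. `quality_correlations()` is a hypothetical helper, and because the example data are only the ten rows from the `str()` output, the coefficients it returns will not match the full table.

```r
# Pearson correlation of every other numeric column with `quality`,
# ordered from strongest to weakest by absolute value.
quality_correlations <- function(df) {
  numeric_cols <- df[sapply(df, is.numeric)]
  r <- cor(numeric_cols, method = "pearson")[, "quality"]
  r <- r[names(r) != "quality"]
  r[order(-abs(r))]
}

# Stand-in data: first ten rows of three columns from the str() output.
wines <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

quality_correlations(wines)
```

Sorting by absolute value keeps strong negative correlations, such as the one with volatile acidity, near the top of the list.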

Correlation Plot

We can clearly see from the size and color of the circles that quality has the strongest correlation with volatile acidity and alcohol percentage, as stated above. Thus, the next step is to make a bivariate plot for each of the two variables.

Quality vs Volatile Acidity

There seems to be a pattern where higher quality wines tend to have lower volatile acidity.

The boxplot shows a clearer pattern with the median volatile acidity decreasing as quality increases.
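
The trend in the medians can also be checked numerically by grouping a variable by quality score. This is a minimal sketch using base R. `median_by_quality()` and `plot_by_quality()` are hypothetical helpers, and the ten rows come from the `str()` output, so the medians are illustrative and cover only the scores 5, 6, and 7.

```r
# Median of one variable within each quality score.
median_by_quality <- function(df, var) {
  sapply(split(df[[var]], df$quality), median)
}

# Hypothetical helper: one box per quality score for the chosen variable.
plot_by_quality <- function(df, var) {
  boxplot(df[[var]] ~ factor(df$quality), xlab = "Quality", ylab = var)
}

# Stand-in data: first ten rows of three columns from the str() output.
wines <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

median_by_quality(wines, "volatile.acidity")
```

The same two helpers work for the alcohol comparison by passing `"alcohol"` as `var`.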

Quality vs Alcohol

In this case, higher quality wines tend to have higher alcohol percentage.

The boxplot shows that wines with quality 6, 7, and 8 have higher alcohol percentages.

Multivariate Analysis

I will now plot alcohol and volatile acidity on the same plot with quality represented in different colors.

Alcohol vs Volatile Acidity with Quality as Color

Here we have a much better plot that shows wines with higher quality being in the lower right of the plot. We can infer that higher quality wines tend to have high alcohol percentage and low volatile acidity.
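
A plot of this kind can be sketched in base R by mapping each quality score to a color and drawing a scatter plot of volatile acidity against alcohol. The helper names and the red-to-green ramp are assumptions made for illustration; the only color the text confirms is red for the lower-quality wines. The data are again the ten rows from the `str()` output.

```r
# Map each distinct quality score to a color, from red (lowest score
# present) to dark green (highest score present). The ramp is an assumption.
quality_colours <- function(quality) {
  lv <- sort(unique(quality))
  pal <- colorRampPalette(c("red", "gold", "darkgreen"))(length(lv))
  setNames(pal, lv)
}

# Hypothetical helper: alcohol on the x-axis, volatile acidity on the
# y-axis, points colored by quality score.
plot_alcohol_vs_volatile <- function(df) {
  pal <- quality_colours(df$quality)
  plot(df$alcohol, df$volatile.acidity,
       col = pal[as.character(df$quality)], pch = 19,
       xlab = "Alcohol (%)", ylab = "Volatile acidity")
  legend("topright", legend = names(pal), col = pal, pch = 19,
         title = "Quality")
}

# Stand-in data: first ten rows of three columns from the str() output.
wines <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)
```

Calling `plot_alcohol_vs_volatile(wines)` in an interactive session draws the plot. With alcohol on the x-axis, high-alcohol, low-volatile-acidity wines land in the lower right.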

Final Plots and Summary

Final Plot 1

The majority of wines are either quality 5 or 6, while roughly 1/8 are quality 7. There are no wines with quality less than 3 or greater than 8 in this dataset.

Final Plot 2

The correlation matrix is a convenient way to see the correlation between every pair of variables at once. By looking at the very last row or very last column, we can see that quality is most strongly correlated with volatile acidity and alcohol percentage.

Final Plot 3

Quality increases toward the lower right of the plot. Wines tend to be rated higher when volatile acidity is around 0.3 and alcohol is between 11 and 13. Interestingly, quality 5 wines are most common when volatile acidity is between 0.4 and 0.8 and alcohol is between 8 and 10. Wines tend to have lower quality when volatile acidity is 0.8 or higher, as shown by the red points. Similarly, all but one of the quality 8 wines have volatile acidity below 0.8.

Reflection

This data set contains information on 1,599 different red wines from a 2009 study. My goal was to find which chemical properties affect wine quality. I started by exploring the distribution of individual variables and looking for unusual behavior in the histograms. I then calculated and plotted the correlations between quality and the other variables.

None of the correlations exceeded 0.7 in absolute value, however. The two variables with relatively strong correlations were alcohol percentage and volatile acidity, but the individual correlations were not strong enough to support definitive conclusions from bivariate analysis alone. The multivariate plot shown as Final Plot 3, however, showed quality increasing for certain combinations of volatile acidity and alcohol percentage.

One suggestion for this data set is to include storage time and storage method, since these factors can also influence the quality of wine. Further studies might examine the relationship between price and quality to investigate whether more expensive wines are rated higher. It would also be interesting to see whether white wines follow a similar pattern.