This project applies exploratory data analysis techniques in R to a dataset on the chemical properties of red wines in order to answer the following question: “Which chemical properties influence the quality of red wines?” The dataset contains 1,599 red wines, each described by 11 variables on the chemical properties of the wine. The quality of each wine is rated between 0 (very bad) and 10 (very excellent) by at least three wine experts.
Summary of the Data Set
## [1] 1599 12
## [1] "fixed.acidity" "volatile.acidity" "citric.acid"
## [4] "residual.sugar" "chlorides" "free.sulfur.dioxide"
## [7] "total.sulfur.dioxide" "density" "pH"
## [10] "sulphates" "alcohol" "quality"
## 'data.frame': 1599 obs. of 12 variables:
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 4.60 Min. :0.1200 Min. :0.000 Min. : 0.900
## 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090 1st Qu.: 1.900
## Median : 7.90 Median :0.5200 Median :0.260 Median : 2.200
## Mean : 8.32 Mean :0.5278 Mean :0.271 Mean : 2.539
## 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420 3rd Qu.: 2.600
## Max. :15.90 Max. :1.5800 Max. :1.000 Max. :15.500
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.01200 Min. : 1.00 Min. : 6.00
## 1st Qu.:0.07000 1st Qu.: 7.00 1st Qu.: 22.00
## Median :0.07900 Median :14.00 Median : 38.00
## Mean :0.08747 Mean :15.87 Mean : 46.47
## 3rd Qu.:0.09000 3rd Qu.:21.00 3rd Qu.: 62.00
## Max. :0.61100 Max. :72.00 Max. :289.00
## density pH sulphates alcohol
## Min. :0.9901 Min. :2.740 Min. :0.3300 Min. : 8.40
## 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500 1st Qu.: 9.50
## Median :0.9968 Median :3.310 Median :0.6200 Median :10.20
## Mean :0.9967 Mean :3.311 Mean :0.6581 Mean :10.42
## 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300 3rd Qu.:11.10
## Max. :1.0037 Max. :4.010 Max. :2.0000 Max. :14.90
## quality
## Min. :3.000
## 1st Qu.:5.000
## Median :6.000
## Mean :5.636
## 3rd Qu.:6.000
## Max. :8.000
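The listing above is the kind of output produced by base R's `dim()`, `names()`, `str()`, and `summary()`. As a minimal sketch, the snippet below rebuilds a ten-row sample from the first values printed in the `str()` listing (a subset of the columns only) so that it runs without the data file. The name `wine_sample` is an assumption made here; in the full analysis the same calls would be applied to the complete data frame.

```r
# Ten-row sample copied from the first values shown in the str() output above.
# `wine_sample` is a hypothetical name; the full data set has 1599 rows.
wine_sample <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

sample_dim <- dim(wine_sample)   # number of rows, number of columns
print(sample_dim)
print(names(wine_sample))        # column names
str(wine_sample)                 # type and first values of each column
print(summary(wine_sample))      # min, quartiles, mean, max per column
```

Because the sample covers only ten wines, its summary statistics will differ from the full-data values listed above.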
Observations from the Summary
The quality of wine has a median of 6, with a minimum of 3 and a maximum of 8. Some wines contain no citric acid, which is often added to wines for ‘freshness’ and flavor. The amounts of free sulfur dioxide (1 to 72) and total sulfur dioxide (6 to 289) vary greatly.
I will now analyze single variables to look for unusual data and patterns that need further analysis.
Quality
Most wines are either 5 or 6 in quality.
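A frequency table is one simple way to back up a statement like this. The sketch below counts quality ratings with `table()` and draws a bar chart; it uses only the ten quality values visible in the `str()` output, so the counts are illustrative and are not the full-data distribution.

```r
# Quality ratings of the first ten wines, as printed in the str() output.
quality <- c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)

quality_counts <- table(quality)              # wines per rating
quality_share  <- prop.table(quality_counts)  # same counts as proportions

print(quality_counts)
print(round(quality_share, 2))

# Draw the bar chart on a null device so that no file is written.
pdf(NULL)
barplot(quality_counts, xlab = "Quality", ylab = "Number of wines")
invisible(dev.off())
```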
Fixed Acidity
The plot peaks near the center, at around 7.5. A small number of wines have extremely high fixed acidity, up to 15.9.
Volatile Acidity
After removing the outliers on the right, the distribution appears roughly normal. Most volatile acidity values fall between 0.25 and 0.75.
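The text does not say how the outliers were removed. One common approach, assumed here purely for illustration, is to drop values above the 99th percentile before plotting. The helper `trim_upper()` is a hypothetical name, and the input is the ten volatile acidity values shown in the `str()` output.

```r
# Hypothetical helper: keep only values at or below the p-th quantile.
# The 99th-percentile cutoff is an assumption, not the report's stated method.
trim_upper <- function(x, p = 0.99) {
  x[x <= quantile(x, probs = p, names = FALSE)]
}

# First ten volatile acidity values from the str() output.
volatile_acidity <- c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5)

trimmed <- trim_upper(volatile_acidity)
print(length(volatile_acidity) - length(trimmed))  # number of values dropped

# Histogram of the trimmed values, drawn on a null device.
pdf(NULL)
hist(trimmed, xlab = "Volatile acidity", main = "Volatile acidity (trimmed)")
invisible(dev.off())
```

The same helper could be reused for residual sugar and chlorides, where outliers are also removed before inspecting the shape of the distribution.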
Citric Acid
Citric Acid has a fairly even distribution except for peaks at 0 and 0.49.
Residual Sugar
The peak occurs around 2 for residual sugar. Removing the outliers resulted in a roughly normal distribution once again.
Chlorides
The peak is at approximately 0.075. Removing the outliers once again produced a roughly normal distribution.
Free Sulfur Dioxide
I transformed the x-axis to a log scale to better understand the distribution. There is a slight peak around 6.
Total Sulfur Dioxide
Once again, I transformed the x-axis to a log scale to better understand the distribution. There is a slight peak around 50. This variable shows a pattern similar to that of free sulfur dioxide, probably because total sulfur dioxide includes free sulfur dioxide.
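As a rough sketch of the log-scale step, the snippet below applies `log10()` to both sulfur dioxide variables before drawing histograms, which is similar in spirit to log-scaling the x-axis. It also checks that free sulfur dioxide never exceeds the total, as expected if the total includes the free portion. The values are the first ten from the `str()` output; the variable names are chosen here for illustration.

```r
# First ten values of each variable from the str() output.
free_so2  <- c(11, 25, 15, 17, 11, 13, 15, 15, 9, 17)
total_so2 <- c(34, 67, 54, 60, 34, 40, 59, 21, 18, 102)

# Log-transform the right-skewed values.
log_free  <- log10(free_so2)
log_total <- log10(total_so2)

# If total includes free, free should never be larger than total.
free_within_total <- all(free_so2 <= total_so2)
print(free_within_total)

# Histograms on the log10 scale, drawn on a null device.
pdf(NULL)
hist(log_free,  xlab = "log10(free sulfur dioxide)",  main = "Free SO2")
hist(log_total, xlab = "log10(total sulfur dioxide)", main = "Total SO2")
invisible(dev.off())
```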
Density
Density has a very small range, from 0.9901 to 1.0037. The distribution is approximately normal.
pH
There is a peak around 3.3. The pH level is probably affected by acidity. Once again, the distribution is approximately normal.
Sulphates
The peak occurs around 0.6. Since sulphates can contribute to sulfur dioxide levels, the plot resembles the one for total sulfur dioxide.
Alcohol Percentage
There is a peak around 9.5. Alcohol percentage probably affects the pH level and the overall taste.
Just looking at the distribution is not enough to identify variables that affect wine quality.
I will now explore correlations between variables and find which variables to explore further.
Correlation Values
##
## CORRELATIONS
## ============
## - correlation type: pearson
## - correlations shown only when both variables are numeric
##
## fixed.acidity volatile.acidity citric.acid
## fixed.acidity . -0.256 0.672
## volatile.acidity -0.256 . -0.552
## citric.acid 0.672 -0.552 .
## residual.sugar 0.115 0.002 0.144
## chlorides 0.094 0.061 0.204
## free.sulfur.dioxide -0.154 -0.011 -0.061
## total.sulfur.dioxide -0.113 0.076 0.036
## density 0.668 0.022 0.365
## pH -0.683 0.235 -0.542
## sulphates 0.183 -0.261 0.313
## alcohol -0.062 -0.202 0.110
## quality 0.124 -0.391 0.226
## residual.sugar chlorides free.sulfur.dioxide
## fixed.acidity 0.115 0.094 -0.154
## volatile.acidity 0.002 0.061 -0.011
## citric.acid 0.144 0.204 -0.061
## residual.sugar . 0.056 0.187
## chlorides 0.056 . 0.006
## free.sulfur.dioxide 0.187 0.006 .
## total.sulfur.dioxide 0.203 0.047 0.668
## density 0.355 0.201 -0.022
## pH -0.086 -0.265 0.070
## sulphates 0.006 0.371 0.052
## alcohol 0.042 -0.221 -0.069
## quality 0.014 -0.129 -0.051
## total.sulfur.dioxide density pH sulphates alcohol
## fixed.acidity -0.113 0.668 -0.683 0.183 -0.062
## volatile.acidity 0.076 0.022 0.235 -0.261 -0.202
## citric.acid 0.036 0.365 -0.542 0.313 0.110
## residual.sugar 0.203 0.355 -0.086 0.006 0.042
## chlorides 0.047 0.201 -0.265 0.371 -0.221
## free.sulfur.dioxide 0.668 -0.022 0.070 0.052 -0.069
## total.sulfur.dioxide . 0.071 -0.066 0.043 -0.206
## density 0.071 . -0.342 0.149 -0.496
## pH -0.066 -0.342 . -0.197 0.206
## sulphates 0.043 0.149 -0.197 . 0.094
## alcohol -0.206 -0.496 0.206 0.094 .
## quality -0.185 -0.175 -0.058 0.251 0.476
## quality
## fixed.acidity 0.124
## volatile.acidity -0.391
## citric.acid 0.226
## residual.sugar 0.014
## chlorides -0.129
## free.sulfur.dioxide -0.051
## total.sulfur.dioxide -0.185
## density -0.175
## pH -0.058
## sulphates 0.251
## alcohol 0.476
## quality .
The Pearson correlations above show that quality is most strongly correlated with volatile acidity and alcohol percentage, with coefficients of -0.391 and 0.476, respectively. Let’s look at a visual representation of the correlations.
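The ranking of variables by their correlation with quality can be sketched with base R's `cor()`. The example below is a minimal illustration on the ten rows visible in the `str()` output, with a subset of the columns, so its coefficients will not match the full-data values in the table above. The names `wine_sample` and `ranked` are assumptions made for the sketch.

```r
# Ten-row sample from the str() output; the full analysis uses all 1599 wines.
wine_sample <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  sulphates        = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Pairwise Pearson correlation matrix.
cor_mat <- cor(wine_sample, method = "pearson")

# Correlations of each chemical property with quality, strongest first.
quality_cor <- cor_mat[, "quality"]
quality_cor <- quality_cor[names(quality_cor) != "quality"]
ranked <- quality_cor[order(abs(quality_cor), decreasing = TRUE)]
print(round(ranked, 3))
```

Sorting by absolute value keeps strong negative correlations, such as the one with volatile acidity, near the top of the list.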
Correlation Plot
We can clearly see from the size and color of the circles that quality has the strongest correlation with volatile acidity and alcohol percentage, as stated above. Thus, the next step is to make a bivariate plot for each of these two variables.
Quality vs Volatile Acidity
There seems to be a pattern where higher quality wines tend to have lower volatile acidity.
The boxplot shows a clearer pattern with the median volatile acidity decreasing as quality increases.
Quality vs Alcohol
In this case, higher quality wines tend to have higher alcohol percentage.
The boxplot shows that wines with quality 6, 7, and 8 have higher alcohol percentages.
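The boxplot comparisons for volatile acidity and alcohol can be summarized numerically by computing the median of each variable within each quality level. The sketch below does this with `tapply()` on the ten rows visible in the `str()` output; `median_by_quality()` is a hypothetical helper, and the small sample contains only qualities 5, 6, and 7.

```r
# Ten-row sample from the str() output.
wine_sample <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Hypothetical helper: median of `values` within each quality level.
median_by_quality <- function(values, quality) {
  tapply(values, quality, median)
}

va_medians  <- median_by_quality(wine_sample$volatile.acidity, wine_sample$quality)
alc_medians <- median_by_quality(wine_sample$alcohol, wine_sample$quality)
print(va_medians)
print(alc_medians)

# Matching boxplots, drawn on a null device so that no file is written.
pdf(NULL)
boxplot(volatile.acidity ~ factor(quality), data = wine_sample,
        xlab = "Quality", ylab = "Volatile acidity")
boxplot(alcohol ~ factor(quality), data = wine_sample,
        xlab = "Quality", ylab = "Alcohol (%)")
invisible(dev.off())
```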
I will now plot alcohol and volatile acidity on the same plot with quality represented in different colors.
Alcohol vs Volatile Acidity with Quality as Color
Here we have a much better plot that shows wines with higher quality being in the lower right of the plot. We can infer that higher quality wines tend to have high alcohol percentage and low volatile acidity.
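A plot of this kind can be approximated in base R by mapping each quality level to a color. The snippet below is a sketch under stated assumptions: it uses the ten rows visible in the `str()` output, and the red-to-blue palette is chosen here for illustration and is not taken from the original figure.

```r
# Ten-row sample from the str() output.
wine_sample <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# One color per quality level, from red (lowest) to blue (highest).
quality_levels <- sort(unique(wine_sample$quality))
level_colors   <- colorRampPalette(c("red", "blue"))(length(quality_levels))
point_colors   <- level_colors[match(wine_sample$quality, quality_levels)]

# Scatter plot with quality as color, drawn on a null device.
pdf(NULL)
plot(wine_sample$alcohol, wine_sample$volatile.acidity,
     col = point_colors, pch = 19,
     xlab = "Alcohol (%)", ylab = "Volatile acidity")
legend("topright", legend = quality_levels, col = level_colors,
       pch = 19, title = "Quality")
invisible(dev.off())
```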
Final Plot 1
The majority of wines are either quality 5 or 6 while roughly 1/8 is quality 7. There are no wines with quality less than 3 or greater than 8 in this dataset.
Final Plot 2
The correlation matrix is a convenient way to inspect the correlations between pairs of variables. The last row or last column shows that quality is most strongly correlated with volatile acidity and alcohol percentage.
Final Plot 3
The quality of wine increases as we move towards the lower right of the plot. Wine seems to have better quality when volatile acidity is around 0.3 and alcohol is between 11 and 13. Interestingly, wines of quality 5 occur most often when volatile acidity is between 0.4 and 0.8 and alcohol is between 8 and 10. Wines tend to have lower quality when volatile acidity is 0.8 or higher, as seen from the red points. Similarly, all but one of the wines with quality 8 have volatile acidity lower than 0.8.
This data set contains information on 1,599 different red wines from a 2009 study. My goal was to find which chemical properties affect wine quality. I started by exploring the distributions of individual variables and looking for unusual behavior in the histograms. I then calculated and plotted the correlations between quality and the other variables.
None of the correlations were above 0.7, however. The two variables with relatively strong correlations were alcohol percentage and volatile acidity, but the individual correlations were not strong enough to draw definitive conclusions from bivariate analysis alone. The multivariate plot shown as Final Plot 3, however, showed quality increasing for certain combinations of volatile acidity and alcohol percentage.
One suggestion for this data set is to include storage time and storage method, since these factors can also influence the quality of wine. Further studies might examine the relationship between the price and quality of wine to investigate whether more expensive wines tend to be of higher quality. It would also be interesting to see whether white wines follow a similar pattern.