This project uses exploratory data analysis (EDA) techniques to explore a red wine data set, moving from single variables to relationships among multiple variables, and to examine its distributions, outliers, and anomalies through visualizations.
The main question is “Which chemical properties influence the quality of red wines?” Throughout the analysis, I will try to answer this question using EDA techniques in the R programming language.
A brief summary of the data set
##        X          fixed.acidity   volatile.acidity   citric.acid
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200     Min.   :0.000
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900     1st Qu.:0.090
##  Median : 800.0   Median : 7.90   Median :0.5200     Median :0.260
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278     Mean   :0.271
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400     3rd Qu.:0.420
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800     Max.   :1.000
##  residual.sugar    chlorides         free.sulfur.dioxide
##  Min.   : 0.900    Min.   :0.01200   Min.   : 1.00
##  1st Qu.: 1.900    1st Qu.:0.07000   1st Qu.: 7.00
##  Median : 2.200    Median :0.07900   Median :14.00
##  Mean   : 2.539    Mean   :0.08747   Mean   :15.87
##  3rd Qu.: 2.600    3rd Qu.:0.09000   3rd Qu.:21.00
##  Max.   :15.500    Max.   :0.61100   Max.   :72.00
##  total.sulfur.dioxide   density          pH              sulphates
##  Min.   :  6.00         Min.   :0.9901   Min.   :2.740   Min.   :0.3300
##  1st Qu.: 22.00         1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500
##  Median : 38.00         Median :0.9968   Median :3.310   Median :0.6200
##  Mean   : 46.47         Mean   :0.9967   Mean   :3.311   Mean   :0.6581
##  3rd Qu.: 62.00         3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300
##  Max.   :289.00         Max.   :1.0037   Max.   :4.010   Max.   :2.0000
##     alcohol          quality
##  Min.   : 8.40   Min.   :3.000
##  1st Qu.: 9.50   1st Qu.:5.000
##  Median :10.20   Median :6.000
##  Mean   :10.42   Mean   :5.636
##  3rd Qu.:11.10   3rd Qu.:6.000
##  Max.   :14.90   Max.   :8.000
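A summary like the one above can be produced with `read.csv()` and `summary()`. The sketch below is an assumption about how the data might be loaded, not the code actually used for this report: the file content is a hypothetical three-row stand-in written to a temporary file so that the snippet runs on its own.

```r
# Hypothetical stand-in for the real CSV: three rows and a few of the columns.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  '"","fixed.acidity","volatile.acidity","alcohol","quality"',
  '"1",7.4,0.70,9.4,5',
  '"2",7.8,0.88,9.8,5',
  '"3",11.2,0.28,9.8,6'
), csv_path)

# read.csv() turns the empty first column name into "X", which would explain
# the row-index column X in the summary output.
wine <- read.csv(csv_path)

str(wine)      # column types and dimensions
summary(wine)  # minimum, quartiles, mean and maximum of every column
```

With the real file, only `csv_path` would change; the `X` column is a row index rather than a measured attribute.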
Red wine data set can be downloaded here
More information about the data set (data collection method and variable explanations) can be found here
In this section, I will investigate attributes individually.
Wine Quality
Let’s start by investigating wine quality, which is scored on a scale from 0 to 10.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.636 6.000 8.000
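In the given data set, wine scores are in the range [3, 8]. The first quartile is 5 and both the median and the third quartile are 6, so the scores cluster around 5 and 6, with 5 being the most common score.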
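The most frequent score can be read off a frequency table. This is a minimal sketch using `table()`; the `quality` vector holds made-up values for illustration, not the real data.

```r
# Hypothetical quality scores; replace with the quality column of the data set.
quality <- c(5, 6, 5, 7, 5, 6, 3, 8, 6, 5)

counts <- table(quality)  # how often each score occurs
most_common <- as.integer(names(counts)[which.max(counts)])

counts
most_common
range(quality)            # lowest and highest score
```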
Alcohol rate
Next, let’s investigate the alcohol rate in each wine
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.20 10.42 11.10 14.90
The median alcohol rate is 10.2%, the mean is 10.42%, and the third quartile is 11.1%. As seen in the graph above, the distribution is right skewed (the mean exceeds the median and the tail extends up to 14.9%): 75% of the wines in the data set have an alcohol rate at or below 11.1%, and only 25% are above it.
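The direction of the skew can be checked from the reported quartiles alone. The sketch below uses the quartile (Bowley) skewness coefficient, which is my choice for illustration and not something computed in the original analysis; the four numbers are taken from the summary output above.

```r
# Quartiles and mean of alcohol as reported in the summary output.
q1 <- 9.5
med <- 10.2
q3 <- 11.1
avg <- 10.42

# Bowley (quartile) skewness: positive when the upper half of the middle 50%
# is wider than the lower half, i.e. when the distribution leans to the right.
bowley <- ((q3 - med) - (med - q1)) / (q3 - q1)

bowley     # the sign gives the direction of the skew
avg > med  # a mean above the median points the same way
```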
Residual Sugar
Residual sugar is the amount of sugar remaining after fermentation stops. Let’s investigate it.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.900 2.200 2.539 2.600 15.500
This plot has a very long tail on the right side. The third quartile is 2.6, which shows that 75% of the wines have a residual sugar value at or below 2.6 g/dm3. The remaining 25% of the wines fall in the range (2.6, 15.5].
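A long right tail like this is often easier to read on a log scale. The following sketch compares the sample skewness before and after a `log10()` transform; the `skewness()` helper and the residual sugar values are hypothetical and only serve to illustrate the idea.

```r
# Sample skewness: third standardized moment (positive = right skewed).
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Hypothetical residual sugar values (g/dm3) with a long right tail.
residual.sugar <- c(1.2, 1.6, 1.9, 2.0, 2.2, 2.3, 2.6, 3.4, 6.1, 15.5)

skew_raw <- skewness(residual.sugar)
skew_log <- skewness(log10(residual.sugar))

skew_raw  # strongly positive on the original scale
skew_log  # smaller after the log transform
```

If the transformed values look roughly symmetric, a log-normal distribution is a plausible description of the variable.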
pH
This attribute describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic), where 7 is neutral.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.740 3.210 3.310 3.311 3.400 4.010
Here we see a clear bell-shaped plot, with a mean and a median that are nearly the same (3.31). So we expect most wines to have a pH close to 3.31.
A few wines are more acidic; the most acidic one has a pH of 2.74, which is close to the pH of cola and lemon juice.
Citric Acid
Found in small quantities, citric acid can add ‘freshness’ and flavor to wines. Next, we will investigate this attribute.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.090 0.260 0.271 0.420 1.000
There are two main peaks in this plot: the first in the range [0, 0.02] and the second in the range [0.48, 0.5]. It is hard to characterize the distribution just by looking at the plot.
Volatile Acidity
This attribute gives the amount of acetic acid in wine, which at too high a level can lead to an unpleasant, vinegar taste.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3900 0.5200 0.5278 0.6400 1.5800
The mean and median of this attribute are nearly equal (about 0.52 g/dm3), and the plot looks bell shaped and roughly normal. However, there is a small tail on the right side of the plot.
Fixed Acidity
Most acids involved with wine are fixed, or nonvolatile, meaning they do not evaporate readily. Let’s investigate this attribute now.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.60 7.10 7.90 8.32 9.20 15.90
Again we see a bell-shaped plot, with values in the range [4.6 g/dm3, 15.9 g/dm3]. The median is 7.9 and the mean is 8.32.
Density g/cm3
Next, let’s plot density attribute
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9901 0.9956 0.9968 0.9967 0.9978 1.0040
The density plot looks normally distributed, with a mean of 0.9967 and a median of 0.9968.
Sulphates
This is a wine additive which can contribute to sulfur dioxide gas (SO2) levels and acts as an antimicrobial and antioxidant.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.5500 0.6200 0.6581 0.7300 2.0000
The graph above looks like a bell-shaped plot with a long tail on the right. Sulphate levels range from 0.33 g/dm3 to 2 g/dm3, with a mean of 0.6581 and a median of 0.62, which are close to each other. We can conclude that, for most wines in the given data set, the sulphate amount is around 0.62 g/dm3.
Total Sulfur Dioxide
This attribute represents the amount of free and bound forms of SO2. In low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 mg/L, SO2 becomes evident in the nose and taste of wine.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 22.00 38.00 46.47 62.00 289.00
Given the description above, it is no surprise to see such low sulfur dioxide levels: 75% of the wines in this data set have a total sulfur dioxide value at or below 62 mg/dm3.
Free Sulfur Dioxide
After investigating the total sulfur dioxide levels, it is worth investigating the free sulfur dioxide attribute, which prevents microbial growth and the oxidation of wine.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 7.00 14.00 15.87 21.00 72.00
Again we see a right-skewed plot: the mean (15.87) exceeds the median (14), the tail runs up to 72, and 75% of the values are at or below 21 mg/dm3.
Chlorides
In the last of the univariate plots, I will investigate the chlorides attribute, which gives the amount of salt in the wine.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100
This plot also looks roughly normally distributed; however, there is a long tail on the right side, which corresponds to less than 25% of the data (since the third quartile is 0.09 g/dm3 and the tail starts at about 0.15).
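The size of such a tail can be measured directly rather than read off the plot. The sketch below is only an illustration: the chloride values are hypothetical, the 0.15 cut-off is the one mentioned above, and `boxplot.stats()` applies the usual 1.5 × IQR rule to flag outliers.

```r
# Hypothetical chloride values (g/dm3) with a few large values in the tail.
chlorides <- c(0.045, 0.062, 0.070, 0.074, 0.078, 0.079, 0.081,
               0.086, 0.090, 0.095, 0.17, 0.41)

q3 <- quantile(chlorides, 0.75, names = FALSE)  # third quartile
tail_share <- mean(chlorides > 0.15)            # fraction of values in the tail

# boxplot.stats() flags points more than 1.5 * IQR beyond the hinges.
outliers <- boxplot.stats(chlorides)$out

q3
tail_share
outliers
```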
What is the structure of your dataset?
This tidy data set contains 1599 red wine observations and 12 attributes (plus the row index X shown in the summary). Eleven of the attributes are numeric physicochemical test results, and one attribute (quality) is sensory data on a scale from 0 to 10. Quality is treated as a categorical variable and is the median of at least 3 evaluations made by wine experts. There are no missing values in the data set.
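These structural claims are easy to verify in code. The sketch below is a hedged example on a hypothetical four-row data frame: it counts missing values per column and stores quality as an ordered factor, which is one common way to treat a score as categorical (the column name `quality.factor` is my own).

```r
# Hypothetical miniature version of the data set.
wine <- data.frame(
  alcohol = c(9.4, 9.8, 11.2, 12.5),
  pH      = c(3.51, 3.20, 3.16, 3.30),
  quality = c(5L, 5L, 6L, 7L)
)

dim(wine)                                 # observations and attributes
missing_per_column <- colSums(is.na(wine))
missing_per_column                        # all zeros means no missing values

# Keep the numeric score and add an ordered factor for categorical plots.
wine$quality.factor <- factor(wine$quality, ordered = TRUE)
levels(wine$quality.factor)
```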
What is/are the main feature(s) of interest in your dataset?
As this project aims to find which chemical properties influence the quality of red wines, the main feature is quality.
What other features in the dataset do you think will help support your investigation into your feature(s) of interest?
Although I’m not an expert on wines, I expect these 4 variables to affect the quality of the wine and to support my further investigation.
Did you create any new variables from existing variables in the dataset?
No, I did not create any new variables.
Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?
Some plots are positively skewed and might be log-normally distributed:
Since the given data set is very clean, I have only changed some variable names for ease of use and have not done any additional processing to tidy the data set.
In this section, I will create plots using 2 features.
First, let’s investigate the correlations between variables.
Alcohol, Volatile Acidity, Citric Acid And Sulphates vs. Quality
Alcohol, volatile acidity, citric acid and sulphates are the attributes most correlated with quality. Next, I will dig into these variables to see their relationship with quality.
After looking at the ggpairs plot and the boxplots, we can say that quality is positively correlated with alcohol, citric acid and sulphates, and negatively correlated with volatile acidity.
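The same ranking can be obtained numerically by correlating every attribute with quality and sorting by absolute value. This is a minimal sketch on simulated data; the coefficients used to generate `quality` are assumptions chosen only to make the example behave like the description above.

```r
set.seed(1)
n <- 200

# Simulated stand-in for the wine data (not the real observations).
wine <- data.frame(
  alcohol     = rnorm(n, mean = 10.4, sd = 1.0),
  vol.acidity = rnorm(n, mean = 0.53, sd = 0.18),
  pH          = rnorm(n, mean = 3.31, sd = 0.15)
)
wine$quality <- 2.6 + 0.3 * wine$alcohol - 1.3 * wine$vol.acidity +
  rnorm(n, sd = 0.6)

# Pearson correlation of each predictor with quality.
predictors <- setdiff(names(wine), "quality")
r <- sapply(wine[predictors], function(column) cor(column, wine$quality))

# Strongest relationships first, regardless of sign.
ranked <- r[order(abs(r), decreasing = TRUE)]
ranked
```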
Let’s go one step further, fit a linear model and summarize its results.
##
## Call:
## lm(formula = quality ~ alcohol + vol.acidity + citric.acid +
## sulphates, data = wine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.71408 -0.38590 -0.06402 0.46657 2.20393
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.64592 0.20106 13.160 < 2e-16 ***
## alcohol 0.30908 0.01581 19.553 < 2e-16 ***
## vol.acidity -1.26506 0.11266 -11.229 < 2e-16 ***
## citric.acid -0.07913 0.10381 -0.762 0.446
## sulphates 0.69552 0.10311 6.746 2.12e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6588 on 1594 degrees of freedom
## Multiple R-squared: 0.3361, Adjusted R-squared: 0.3345
## F-statistic: 201.8 on 4 and 1594 DF, p-value: < 2.2e-16
These 4 attributes explain 33.45% of the variability in quality (adjusted R-squared). Citric acid is not statistically significant (p = 0.446): once alcohol, volatile acidity and sulphates are in the model, there is no evidence that citric acid has an additional linear relationship with quality.
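Whether a predictor such as citric acid can be dropped is usually checked by comparing the full model with a reduced one. The sketch below does this on simulated data (the generating coefficients are assumptions loosely based on the estimates above, with no citric acid effect), so its numbers will not match the output shown in this report.

```r
set.seed(42)
n <- 500

# Simulated stand-in for the wine data; citric.acid has no effect by design.
wine <- data.frame(
  alcohol     = rnorm(n, mean = 10.4, sd = 1.1),
  vol.acidity = rnorm(n, mean = 0.53, sd = 0.18),
  citric.acid = runif(n, min = 0, max = 1),
  sulphates   = rnorm(n, mean = 0.66, sd = 0.17)
)
wine$quality <- 2.6 + 0.31 * wine$alcohol - 1.27 * wine$vol.acidity +
  0.70 * wine$sulphates + rnorm(n, sd = 0.66)

full    <- lm(quality ~ alcohol + vol.acidity + citric.acid + sulphates,
              data = wine)
reduced <- update(full, . ~ . - citric.acid)  # same model without citric.acid

# p-value of the citric.acid coefficient in the full model.
p_citric <- coef(summary(full))["citric.acid", "Pr(>|t|)"]

summary(full)$adj.r.squared
summary(reduced)$adj.r.squared
anova(reduced, full)  # F-test: does citric.acid improve the fit?
```

If the adjusted R-squared barely changes and the F-test is not significant, the simpler model is the reasonable choice.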
Top Correlated Attributes
Also, let’s look at other variable pairs to see top correlated attributes.
Free SO2 vs. Total SO2
Free SO2 and total SO2 are positively correlated with each other, with a correlation coefficient of 0.668.
Fixed Acidity vs. Density & Alcohol vs. Density
Fixed Acidity vs. Citric Acid & Volatile Acidity vs. Citric Acid
A strong positive correlation between citric acid and fixed acidity is expected, but there is also an interesting relationship between volatile acidity and citric acid. These two variables are negatively correlated, so they tend to move in opposite directions: if the amount of acetic acid (volatile acidity) increases, we expect citric acid to decrease, and vice versa.
Volatile Acidity vs. pH
Another interesting relationship: Volatile acidity vs. pH.
Although these 2 variables have a correlation coefficient of 0.235, the relationship is weak. One potentially misleading point is the positive sign of the coefficient: it suggests that pH tends to increase as volatile acidity increases, but how could more acid raise the pH? After some investigation, I found that excess volatile acid is removed from wine because of its vinegar smell, using reverse osmosis or steam distillation, which may increase the pH level.1
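To judge how weak a correlation of 0.235 is, it helps to square it: the result is the share of variance in one variable that a straight line through the other accounts for. The sketch below also runs `cor.test()` on simulated data to show where a confidence interval and p-value would come from; the simulated values are assumptions, not the wine data.

```r
# Share of variance accounted for by the reported correlation.
reported_r  <- 0.235
reported_r2 <- reported_r^2
reported_r2  # roughly 5.5% of the variance

# Simulated stand-in with a similarly weak positive relationship.
set.seed(3)
n <- 400
vol.acidity <- rnorm(n, mean = 0.53, sd = 0.18)
pH <- 3.31 + 0.2 * (vol.acidity - 0.53) + rnorm(n, sd = 0.15)

ct <- cor.test(vol.acidity, pH)  # Pearson correlation with a t-test
ct$estimate                      # sample correlation
ct$conf.int                      # 95% confidence interval
ct$p.value
```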
Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?
After investigating the ggpairs plot, I found 4 attributes that were related to quality:
After removing the citric acid attribute, the remaining 3 variables explained 33.46% of the variability in quality.
Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?
It was very interesting to observe a negative correlation between volatile acidity and citric acid.
Since volatile acidity and pH had a positive correlation, that plot was also very interesting.
What was the strongest relationship you found?
The strongest correlation was -0.683, between pH and fixed acidity.
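The strongest pair can be located programmatically by scanning the correlation matrix. The sketch below is a minimal example on simulated data with three hypothetical columns; it masks the diagonal and the upper triangle so that every pair is considered exactly once.

```r
set.seed(7)
n <- 300

# Simulated stand-in: pH is built to fall as fixed acidity rises.
fixed.acidity <- rnorm(n, mean = 8.3, sd = 1.7)
pH            <- 3.9 - 0.07 * fixed.acidity + rnorm(n, sd = 0.1)
alcohol       <- rnorm(n, mean = 10.4, sd = 1.0)
wine <- data.frame(fixed.acidity, pH, alcohol)

cm <- cor(wine)
cm[upper.tri(cm, diag = TRUE)] <- NA  # keep each pair once, drop the diagonal

# Position of the largest absolute correlation (NA entries are ignored).
idx <- arrayInd(which.max(abs(cm)), dim(cm))
strongest_pair <- c(rownames(cm)[idx[1, 1]], colnames(cm)[idx[1, 2]])
strongest_r    <- cm[idx[1, 1], idx[1, 2]]

strongest_pair
strongest_r
```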
This section includes plots and analysis of multiple variables
Alcohol vs. Other Variables over Quality
In the following 3 plots, darker points indicate better quality wines.
According to above 3 plots, better quality wines mostly have:
It is not possible to say anything about pH level.
In the final plot it is much easier to identify the darker points.
Other Multivariate Plots
Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?
After investigating many multivariate plots, it seems there are some combinations that make a high quality wine:
Were there any interesting or surprising interactions between features?
While looking for interesting multivariate plots, I created a dozen plots. In the free SO2 vs. total SO2 plot, I added quality as color, and a very interesting pattern emerged. There were clusters in the plot: high quality wines had low free and total SO2 values, whereas mid and low quality wines had higher free and total SO2 values. It was a big surprise for me, since I didn’t expect such a plot.
Plot 1
Description 1
This chart shows how strongly alcohol percent is related to the quality level.
Next time, try buying a red wine with the highest alcohol percent. You may get the highest quality wine, and it may give you an unforgettable, pleasant flavor.
To support this description, we can also plot a density graph.
Plot 2
Description 2
Combinations of high alcohol percent and low acetic acid (volatile acidity) seem to produce better wines.
So instead of just looking for the wine with the highest alcohol percentage, we should also look for low acetic acid concentrations to increase our chance of buying a high quality red wine.
Plot 3
Description 3
I found this plot very interesting and informative because it seems like there is an imaginary straight line splitting the top 3 quality wines from the rest of the data set.
Most of the wines having a quality of 6, 7 and 8 are:
To support this finding, let’s look at the 90% quantiles:
## 90% of the red wines in this data set with quality scores 6, 7 & 8,
## Have a total sulfur dioxide concentration below 74 mg/dm3
## Have a free sulfur dioxide concentration below 29 mg/dm3
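Figures like these come from taking the 0.90 quantile of each sulfur dioxide column within the subset of wines scoring 6 or higher. The sketch below shows one way to do that with `subset()` and `quantile()`; the ten rows are hypothetical and the column names follow the original summary output, so the results will differ from the 74 and 29 mg/dm3 reported above.

```r
# Hypothetical rows; the real data set has 1599 observations.
wine <- data.frame(
  quality              = c(5, 6, 7, 5, 6, 8, 4, 7, 6, 5),
  total.sulfur.dioxide = c(110, 34, 21, 88, 60, 16, 140, 45, 72, 95),
  free.sulfur.dioxide  = c(35, 11, 6, 30, 17, 5, 40, 15, 28, 33)
)

# Keep only wines with a quality score of 6, 7 or 8.
good <- subset(wine, quality >= 6)

# 90% of these wines lie at or below the returned values.
so2_columns <- c("total.sulfur.dioxide", "free.sulfur.dioxide")
q90 <- sapply(good[so2_columns], quantile, probs = 0.90, names = FALSE)
q90
```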
Overall, it was a great experience investigating and exploring the red wine data. Before this project, I knew only a little about wine quality, but I have now learned a lot from the data.
Let’s briefly sum up what we discussed.
In conclusion, after analyzing the data I came up with the following results:
Where did I run into difficulties in the analysis?
My main difficulty was that I knew almost nothing about wine making and its procedures. So I started my project by reading many articles and blog posts about wine making. I learned a lot about the technical terms and attributes used in this data set; however, this learning process took longer than I expected.
Where did I find successes?
In this project, we asked the data set a question, “Which chemical properties influence the quality of red wines?”, and investigated it in many ways. The best part of this project, and for me the main success, was exploring and to some extent predicting wine quality from a few technical variables without actually tasting the wine. Just by exploring data, anyone can figure out basic trends and answer the question “What really affects red wine quality?”
How could the analysis be enriched in future work?
Although it does not look possible to add new features to the data, since all records are anonymous, expert reviews could be added to enrich this data set. I think it is important to get feedback from reviewers, because there is no explanation of how these reviewers rate a wine and no information about the evaluation criteria. Having not only the scores but also more information about each score, along with the reviewers’ notes about the wine, would make this investigation more productive. With that information, text learning algorithms could be applied to reviewer comments to learn more about what makes a wine great or bad in the reviewers’ eyes, making the sensory data clearer and more understandable.