This report explores a dataset containing quality and eleven attributes for 1599 entries of red wine.
## [1] 1599 13
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
##        X          fixed.acidity   volatile.acidity  citric.acid
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00
##  Median : 2.200   Median :0.07900   Median :14.00
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00
##  total.sulfur.dioxide    density            pH          sulphates
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000
##     alcohol         quality
##  Min.   : 8.40   Min.   :3.000
##  1st Qu.: 9.50   1st Qu.:5.000
##  Median :10.20   Median :6.000
##  Mean   :10.42   Mean   :5.636
##  3rd Qu.:11.10   3rd Qu.:6.000
##  Max.   :14.90   Max.   :8.000
The data frame has 1599 observations and 13 columns: a row index (X) plus twelve variables (eleven attributes and quality).
The quality scores in this dataset range from 3 to 8, so no entry has an extremely high or extremely low score. I created a new variable that groups the quality scores into three levels: low (3 to 4), medium (5 to 6), and high (7 to 8). Most of the entries in the dataset fall in the medium level.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.60 7.10 7.90 8.32 9.20 15.90
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3900 0.5200 0.5278 0.6400 1.5800
Most red wine entries have fixed acidity between 6 g/L and 10 g/L, with a median of 7.90 g/L and a mean of 8.32 g/L. About a quarter of the entries have volatile acidity below 0.4 g/L (the first quartile is 0.39 g/L).
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.090 0.260 0.271 0.420 1.000
Most of the entries have less than 0.50 g/L of citric acid, and some values occur more often than others, for example 0.00 g/L, 0.24 g/L, and 0.49 g/L. I wonder whether there is a specific reason for these peaks or whether they occur by chance.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.900 2.200 2.539 2.600 15.500
Most of the entries have residual sugar below 4 g/L; however, a very small number exceed 12 g/L. I wonder whether a very high level of residual sugar has a positive or negative impact, and how this variable relates to the other variables.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100
Most of the entries have chlorides between 0.05 g/L and 0.12 g/L, with a median of 0.079 g/L and a mean of 0.08747 g/L.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 7.00 14.00 15.87 21.00 72.00
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 22.00 38.00 46.47 62.00 289.00
The histograms of free sulfur dioxide and total sulfur dioxide are both right skewed. Therefore I transformed the data using a log scale.
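A minimal sketch of the transformation, using only the first ten total sulfur dioxide values printed by `str()` above. The base of the logarithm is an assumption (base 10 here); the report does not state which one was used, and ten rows are far too few to show the skew of the full data.

```r
# First ten total.sulfur.dioxide values, copied from the str() output above.
total_so2 <- c(34, 67, 54, 60, 34, 40, 59, 21, 18, 102)

# Assumption: a base-10 log; the report only says "log scale".
log_so2 <- log10(total_so2)

# Compare the five-number summaries before and after the transformation.
summary(total_so2)
summary(log_so2)
```

When plotting with ggplot2, adding `scale_x_log10()` to a histogram is an alternative that leaves the data untouched and only rescales the axis.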
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9901 0.9956 0.9968 0.9967 0.9978 1.0040
It is no surprise that most entries have a density slightly below 1 g/ml. However, the maximum is above 1 g/ml, probably because of residual sugar.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.740 3.210 3.310 3.311 3.400 4.010
The pH of the entries is approximately normally distributed, with a median of 3.310 and a mean of 3.311.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.5400 0.6000 0.6448 0.7000 2.0000
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3900 0.6500 0.7400 0.7435 0.8200 1.3600
Above, we subset the red wine entries whose quality level is ‘high’ and compare their sulphates with those of the low- and medium-quality entries (first summary: low and medium; second summary: high). In general, high-quality wine contains more sulphates (median 0.74 versus 0.60).
There are 1599 entries of red wine in this dataset, each with 12 variables (fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol, quality). All of the variables are quantitative.
Other observations:
The main features in the data set are sulphates and quality. I’d like to explore if there is a relationship between sulphates and quality and I suspect sulphates have a positive impact on quality.
I think pH, alcohol, volatile acidity, fixed acidity, citric acid, and residual sugar are likely to contribute to the quality of red wine. I guess that the pH of high-quality red wine falls within a narrow range. After reading up on some domain knowledge, I think alcohol is positively related to quality, while volatile acidity probably has a negative impact on quality.
I created a new variable called ‘quality_level’, a factor with three levels: low, medium, and high. It is derived from the numerical variable ‘quality’ by grouping the scores: low (3 to 4), medium (5 to 6), and high (7 to 8).
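One way such a factor could be built is sketched below with base R's `cut()`. The report does not show the code actually used, and the break points are an assumption: the bands 3 to 4, 5 to 6, and 7 to 8 are inferred from the sulphates summaries, where the ‘high’ subset matches the scores 7 and 8. The input is just the first ten quality scores printed by `str()`.

```r
# First ten quality scores, copied from the str() output at the top of the report.
quality <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# Assumed bands: 3-4 = low, 5-6 = medium, 7-8 = high.
# cut() builds right-closed intervals by default: (2,4], (4,6], (6,8].
quality_level <- cut(quality,
                     breaks = c(2, 4, 6, 8),
                     labels = c("low", "medium", "high"),
                     ordered_result = TRUE)

# Count how many of the ten entries fall in each level.
table(quality_level)
```

Making the factor ordered keeps low < medium < high, which is convenient when the levels are later used on a plot axis or as a colour scale.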
Free sulfur dioxide and total sulfur dioxide are right-skewed, so I log-transformed them; after the transformation, the distribution of total sulfur dioxide looks approximately normal.
I haven’t performed any cleaning operations on the data, since the dataset is already clean.
From a subset of the data, sulphates, alcohol, and volatile acidity are moderately correlated with quality, while the other variables seem to have weak or no correlation with quality. In addition, fixed acidity and citric acid are strongly correlated with pH; alcohol and fixed acidity are strongly correlated with density; and volatile acidity is strongly correlated with citric acid.
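Pairwise correlations of this kind can be computed with base R's `cor()`. The sketch below is only an illustration of the call: it uses the first ten rows printed by `str()` for a handful of columns, so its coefficients will not match those of the full dataset, and the data frame name `wine_head` is hypothetical.

```r
# First ten rows of selected columns, copied from the str() output above.
# `wine_head` is a hypothetical name used only for this illustration.
wine_head <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  pH               = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Pearson correlation matrix (the default method of cor()).
cor_matrix <- cor(wine_head)
round(cor_matrix, 2)
```

Even in these ten rows the coefficient between fixed acidity and pH is negative, in line with the direction reported for the full data.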
## $`3`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.4000 0.5125 0.5450 0.5700 0.6150 0.8600
##
## $`4`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.4900 0.5600 0.5964 0.6000 2.0000
##
## $`5`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.370 0.530 0.580 0.621 0.660 1.980
##
## $`6`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.4000 0.5800 0.6400 0.6753 0.7500 1.9500
##
## $`7`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3900 0.6500 0.7400 0.7413 0.8300 1.3600
##
## $`8`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.6300 0.6900 0.7400 0.7678 0.8200 1.1000
The summaries show that high-quality red wine tends to contain more sulphates. Based on some research, an increase in sulphates might be related to the fermenting nutrition, which is crucial to improving the wine aroma.
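The per-score summaries above have the shape of a list produced by applying `summary()` within each quality group. A minimal sketch of one way to obtain such a list, assuming base R's `tapply()` (the report does not show the call it used) and only the first ten rows printed by `str()`:

```r
# First ten sulphates and quality values, copied from the str() output above.
sulphates <- c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8)
quality   <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# One summary() per quality score; the result is a list named by score.
sulphates_by_quality <- tapply(sulphates, quality, summary)
sulphates_by_quality

# Median sulphates per quality score, as a compact comparison.
median_by_quality <- tapply(sulphates, quality, median)
median_by_quality
```

Ten rows cover only the scores 5, 6, and 7 and are far too few to reproduce the upward trend seen in the full data; the sketch only illustrates the grouping step.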
The trend between alcohol and quality shows that the two are positively correlated. The median alcohol content generally rises with the quality score, apart from a dip at quality 5.
Volatile acidity has a negative impact on red wine, since acetic acid is the key ingredient in vinegar. As the quality of the wine improves, the median volatile acidity decreases.
Since the density of alcohol is lower than that of water, it is no surprise that the alcohol content and the density of the wine are negatively correlated. Some entries have a density above 1 g/cm^3, which is a little surprising. I guess those wines contain more residual sugar.
Since citric acid is a naturally occurring non-volatile organic acid, it is no surprise that citric acid and volatile acidity are negatively correlated.
pH is strongly negatively correlated with fixed acidity, which is not a surprise: a lower pH means the wine is more acidic.
Quality correlates moderately with sulphates, alcohol, and volatile acidity.
As sulphates in red wine increase, the quality of the wine tends to improve. Based on some research, an increase in sulphates might be related to the fermenting nutrition, which is crucial to improve the wine aroma.
High-quality red wine tends to contain more alcohol than low-quality red wine. However, the median alcohol content is lowest when the quality score is 5.
Volatile acidity has a negative impact on red wine: as the quality of the wine improves, volatile acidity tends to decrease.
I found that the alcohol content and the density of the wine are negatively correlated. Since the density of alcohol is lower than that of water, it’s no surprise that wine containing more alcohol tends to be less dense. Interestingly, some entries have a density above 1 g/cm^3, which is the density of water. I guess those wines contain a large amount of residual sugar.
The strongest relationship I found was between pH and fixed acidity. Since pH is a direct measure of a liquid’s acidity, this result met my expectations.
As we have explored before, alcohol correlates positively with quality, while volatile acidity correlates negatively with quality. The plot above is consistent with this: low-quality wines cluster in the upper-left corner, where alcohol is low and volatile acidity is high, and high-quality wines cluster in the bottom-right corner. However, we cannot find a strong correlation between alcohol and volatile acidity themselves.
As we have explored before, citric acid and volatile acidity are negatively correlated. From the plot, we can also see that when volatile acidity is below 0.4 g/L and citric acid is between 0.25 g/L and 0.50 g/L, the red wine is very likely to have high quality.
Alcohol and density are negatively correlated. Alcohol has a positive correlation with quality; density, however, does not seem to correlate with quality, since there is no clear pattern of quality across wines of different density. Most high-quality wines seem to cluster in the area where alcohol is above 12%.
Citric acid seems to be a very important factor in red wine’s quality. When the citric acid is low (less than 0.25 g/L), the wine is more likely to be low-quality; when it is at a medium level (between 0.25 g/L and 0.50 g/L), the red wine tends to be high-quality.
Holding density constant, red wine of higher quality tends to have more alcohol in it.
Holding alcohol constant, red wine with more volatile acidity tends to have lower quality than red wine with less volatile acidity.
I found the interaction between citric acid and quality very interesting. When the citric acid is low (less than 0.25 g/L), the wine is very likely to be low-quality; when it is at a medium level (between 0.25 g/L and 0.50 g/L), the red wine tends to be high-quality. However, when the amount of citric acid reaches a high level (more than 0.7 g/L), it doesn’t seem to make a big difference to the wine’s quality.
Volatile acidity has a negative impact on red wine, since acetic acid is the key ingredient in vinegar. The median of volatile acidity with high-quality red wine is much lower than that with low-quality red wine. Therefore, volatile acidity could be an important indicator of red wine’s quality.
A greater proportion of high-quality red wine has a high citric acid content than is the case for wines at the lower quality levels.
Due to the density of alcohol, it is understandable that alcohol and density are negatively correlated. Red wine of higher quality tends to contain more alcohol and have a lower density overall.
The dataset contains quality and eleven attributes for 1599 entries of red wine. Most entries have a quality between 5 and 7. I first plotted several histograms to understand the individual variables in the dataset, and then plotted a correlation matrix to find relationships between variables, especially between quality and the other variables.
Quality correlates positively with sulphates and alcohol and negatively with volatile acidity. There are also interesting relationships between other variables. For example, alcohol and density are negatively correlated, which matches common sense; volatile acidity and citric acid, both of which play an important role in red wine’s quality, are negatively correlated. I struggled to understand why some red wines have a density above 1 g/cm^3, which is the density of water. I guess this might be due to a large amount of residual sugar in those wines.
Because my statistical knowledge is limited, I haven’t built any statistical models with the dataset; this could be future work. I will also try to learn more graphical techniques in R and polish the plots in the future.