title: "Red Wine Quality Data Analysis"
author: Arata Kagan
date: January 27th, 2018
output:
  html_document:
    toc: TRUE
    toc_depth: 3
    toc_float: TRUE
## [1] 1599 12
## 'data.frame': 1599 obs. of 12 variables:
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 4.60 Min. :0.1200 Min. :0.000 Min. : 0.900
## 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090 1st Qu.: 1.900
## Median : 7.90 Median :0.5200 Median :0.260 Median : 2.200
## Mean : 8.32 Mean :0.5278 Mean :0.271 Mean : 2.539
## 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420 3rd Qu.: 2.600
## Max. :15.90 Max. :1.5800 Max. :1.000 Max. :15.500
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.01200 Min. : 1.00 Min. : 6.00
## 1st Qu.:0.07000 1st Qu.: 7.00 1st Qu.: 22.00
## Median :0.07900 Median :14.00 Median : 38.00
## Mean :0.08747 Mean :15.87 Mean : 46.47
## 3rd Qu.:0.09000 3rd Qu.:21.00 3rd Qu.: 62.00
## Max. :0.61100 Max. :72.00 Max. :289.00
## density pH sulphates alcohol
## Min. :0.9901 Min. :2.740 Min. :0.3300 Min. : 8.40
## 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500 1st Qu.: 9.50
## Median :0.9968 Median :3.310 Median :0.6200 Median :10.20
## Mean :0.9967 Mean :3.311 Mean :0.6581 Mean :10.42
## 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300 3rd Qu.:11.10
## Max. :1.0037 Max. :4.010 Max. :2.0000 Max. :14.90
## quality
## Min. :3.000
## 1st Qu.:5.000
## Median :6.000
## Mean :5.636
## 3rd Qu.:6.000
## Max. :8.000
This report is about red wine quality with a dataset of 1599 observations. The following are descriptions of each variable:
1 - Fixed acidity: most acids involved with wine are fixed or nonvolatile (do not evaporate readily.)
2 - Volatile acidity: the amount of acetic acid in wine, which at too high a level can lead to an unpleasant, vinegar taste.
3 - Citric acid: found in small quantities, citric acid can add ‘freshness’ and flavor to wine.
4 - Residual sugar: the amount of sugar remaining after fermentation stops. It is rare to find wines with less than 1 gram/liter, and wines with more than 45 grams/liter are considered sweet.
5 - Chlorides: the amount of salt in the wine.
6 - Free sulfur dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine.
7 - Total sulfur dioxide: the amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine.
8 - Density: the density of wine is close to that of water depending on the percent of alcohol and sugar content.
9 - pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale.
10 - Sulphates: a wine additive that can contribute to sulfur dioxide gas (SO2) levels; SO2 acts as an antimicrobial and antioxidant.
11 - Alcohol: the percent alcohol content of the wine.
12 - Quality (score between 0 and 10.)
##
## 3 4 5 6 7 8
## 10 53 681 638 199 18
Before seeing the dataset, my assumption from everyday wine shopping was that cheap, low quality wine is prevalent and expensive, high quality wine is rarer on supermarket shelves. I assumed that low quality wine, like clothing, is cheap to make and therefore abundant.
This plot shows a fascinating result. In fact, the lowest quality wine is the rarest, and the highest quality wine is only slightly less rare.
8 is the highest quality and 3 is the lowest quality of the wine surveyed. 5 has the highest frequency with 681 wines in the dataset and 3 has the lowest frequency with 10 wines. The highest wine quality at score 8 has 18 wines in the dataset.
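The counts in the frequency table can be turned into shares to make the imbalance easier to read. The following is a minimal sketch that re-enters the printed counts by hand; the variable names are my own and are not taken from the original analysis code.

```r
# Quality counts copied from the frequency table above.
quality_counts <- c("3" = 10, "4" = 53, "5" = 681, "6" = 638, "7" = 199, "8" = 18)

# Share of each quality score in percent, rounded to one decimal place.
quality_share <- round(100 * quality_counts / sum(quality_counts), 1)
print(quality_share)

# Combined share of the two middle scores (5 and 6).
middle_share <- sum(quality_counts[c("5", "6")]) / sum(quality_counts)
print(round(100 * middle_share, 1))

# Most frequent score.
most_common <- names(which.max(quality_counts))
print(most_common)
```

With the full data frame loaded, the same shares could be obtained from `prop.table(table(wine$quality))`, assuming the data frame is called `wine` as in the printed output.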
In this project, I am going to determine which chemical properties of red wine affect quality.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.60 7.10 7.90 8.32 9.20 15.90
The minimum fixed acidity is 4.6 and the maximum is 15.9 with a median value of 7.9 and mean of 8.32.
##
## 0.12 0.16 0.18 0.19 0.2 0.21 0.22 0.23 0.24 0.25 0.26 0.27
## 3 2 10 2 3 6 6 5 13 7 16 14
## 0.28 0.29 0.295 0.3 0.305 0.31 0.315 0.32 0.33 0.34 0.35 0.36
## 23 16 1 16 2 30 2 23 20 30 22 38
## 0.365 0.37 0.38 0.39 0.395 0.4 0.41 0.415 0.42 0.43 0.44 0.45
## 2 24 35 35 2 37 33 3 31 43 23 22
## 0.46 0.47 0.475 0.48 0.49 0.5 0.51 0.52 0.53 0.54 0.545 0.55
## 31 21 2 24 35 46 24 33 29 31 5 20
## 0.56 0.565 0.57 0.575 0.58 0.585 0.59 0.595 0.6 0.605 0.61 0.615
## 34 1 28 3 38 3 39 1 47 3 27 6
## 0.62 0.625 0.63 0.635 0.64 0.645 0.65 0.655 0.66 0.665 0.67 0.675
## 24 3 29 9 27 12 16 7 26 3 23 3
## 0.68 0.685 0.69 0.695 0.7 0.705 0.71 0.715 0.72 0.725 0.73 0.735
## 12 11 23 7 10 6 3 12 5 9 6 8
## 0.74 0.745 0.75 0.755 0.76 0.765 0.77 0.775 0.78 0.785 0.79 0.795
## 11 5 6 3 5 5 6 4 10 8 2 2
## 0.8 0.805 0.81 0.815 0.82 0.825 0.83 0.835 0.84 0.845 0.85 0.855
## 3 1 2 3 5 1 4 4 8 1 2 3
## 0.86 0.865 0.87 0.875 0.88 0.885 0.89 0.895 0.9 0.91 0.915 0.92
## 2 1 4 2 5 5 1 1 3 3 4 1
## 0.935 0.95 0.955 0.96 0.965 0.975 0.98 1 1.005 1.01 1.02 1.025
## 2 1 1 3 3 1 3 3 1 1 4 1
## 1.035 1.04 1.07 1.09 1.115 1.13 1.18 1.185 1.24 1.33 1.58
## 1 3 1 1 1 1 1 1 1 2 1
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3900 0.5200 0.5278 0.6400 1.5800
Most of the volatile acidity is between 0.39 and 0.64. Median volatile acidity is 0.52 and the mean is 0.528. Since a higher amount of volatile acidity leaves an unpleasant taste, volatile acidity may be inversely correlated with wine quality. This will be explored later.
##
## 0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.1 0.11 0.12 0.13 0.14
## 132 33 50 30 29 20 24 22 33 30 35 15 27 18 21
## 0.15 0.16 0.17 0.18 0.19 0.2 0.21 0.22 0.23 0.24 0.25 0.26 0.27 0.28 0.29
## 19 9 16 22 21 25 33 27 25 51 27 38 20 19 21
## 0.3 0.31 0.32 0.33 0.34 0.35 0.36 0.37 0.38 0.39 0.4 0.41 0.42 0.43 0.44
## 30 30 32 25 24 13 20 19 14 28 29 16 29 15 23
## 0.45 0.46 0.47 0.48 0.49 0.5 0.51 0.52 0.53 0.54 0.55 0.56 0.57 0.58 0.59
## 22 19 18 23 68 20 13 17 14 13 12 8 9 9 8
## 0.6 0.61 0.62 0.63 0.64 0.65 0.66 0.67 0.68 0.69 0.7 0.71 0.72 0.73 0.74
## 9 2 1 10 9 7 14 2 11 4 2 1 1 3 4
## 0.75 0.76 0.78 0.79 1
## 1 3 1 1 1
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.090 0.260 0.271 0.420 1.000
For citric acid, 0 grams is the most frequent value, with 132 wines. The median is 0.26 grams and the mean is 0.271 grams. The plot shows three interesting spikes in frequency at 0, 0.24 and 0.49 grams of citric acid. It is also worth noting the drop in frequency above 0.49 grams. Citric acid imparts “freshness” to the wine. Whether citric acid above 0.49 grams makes wine too “fresh” and less desirable, or pleasantly fresh and simply rarer, is hard to determine from this plot alone.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.900 2.200 2.539 2.600 15.500
These two histograms reflect the amount of residual sugar in the surveyed wine with the bottom histogram log transformed to better depict the distribution of residual sugar.
Based on these two histograms, you can see an overall trend indicating that around 2 grams is the most frequent sugar amount. For the bottom plot, residual sugar is log transformed and there are tall spikes around 1.8 and 2.2. The minimum value is 0.9 and the maximum is 15.5. The significant drop in frequency of wine in the 3 grams and above range is noteworthy. Perhaps wines that are too sweet are less desirable. Another explanation could be that the fermentation process may not usually produce very sweet wines with more than 3 grams of sugar.
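To illustrate why a log scale helps here, the sketch below compares how lopsided the residual sugar summary is before and after a base-10 log transform. It uses only the five summary values printed above. The base of the logarithm and the helper name `tail_ratio` are assumptions of mine; the original plotting code is not shown in this report.

```r
# Five-number summary of residual sugar, copied from the output above.
sugar <- c(min = 0.9, q1 = 1.9, median = 2.2, q3 = 2.6, max = 15.5)

# Hypothetical helper: length of the upper tail relative to the lower tail,
# both measured from the median. A value near 1 means a balanced spread.
tail_ratio <- function(s) {
  unname((s["max"] - s["median"]) / (s["median"] - s["min"]))
}

raw_ratio <- tail_ratio(sugar)
log_ratio <- tail_ratio(log10(sugar))

print(c(raw = raw_ratio, log10 = log_ratio))
```

On the raw scale the upper tail is roughly ten times longer than the lower tail; on the log scale it is roughly twice as long, which is why the transformed histogram shows the bulk of the distribution more clearly.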
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100
The log transformed graph better represents the distribution of chlorides. Chloride values around 0.08 are the most frequent. The median value is 0.079 and the mean is 0.087.
There are puzzling outliers at 0.012 and about 0.61, which lie particularly far from the rest of the distribution.
## fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 837 6.7 0.28 0.28 2.4 0.012
## 838 6.7 0.28 0.28 2.4 0.012
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates
## 837 36 100 0.99064 3.26 0.39
## 838 36 100 0.99064 3.26 0.39
## alcohol quality
## 837 11.7 7
## 838 11.7 7
## fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 152 9.2 0.52 1.00 3.4 0.610
## 259 7.7 0.41 0.76 1.8 0.611
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates
## 152 32 69 0.9996 2.74 2.00
## 259 8 45 0.9968 3.06 1.26
## alcohol quality
## 152 9.4 4
## 259 9.4 5
The two wines with 0.012 grams of chlorides (Wine #837 and Wine #838) are identical in every listed measurement and share the same high quality score of 7. In contrast, the wines with about 0.61 grams of chlorides (Wine #152 and Wine #259) have lower quality scores of 4 and 5. Perhaps a lower amount of chlorides correlates with better quality wine, although four wines are too few to conclude this.
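The comparison of the extreme chloride values can be sketched as a small grouped summary. The rows below are copied from the printed output; the column names, the 0.1 cutoff used to separate the two tails, and the `tail` labels are illustrative choices of mine rather than part of the original analysis.

```r
# Rows copied from the printed output above (row name, chlorides, quality).
chloride_extremes <- data.frame(
  wine_id   = c(837, 838, 152, 259),
  chlorides = c(0.012, 0.012, 0.610, 0.611),
  quality   = c(7, 7, 4, 5)
)

# Label each row by the tail of the chlorides distribution it sits in.
# The 0.1 cutoff is arbitrary; it only separates these four rows.
chloride_extremes$tail <- ifelse(chloride_extremes$chlorides < 0.1, "lowest", "highest")

# Mean quality per tail. With two wines per group this is descriptive only.
mean_quality <- tapply(chloride_extremes$quality, chloride_extremes$tail, mean)
print(mean_quality)
```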
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 7.00 14.00 15.87 21.00 72.00
##
## 1 2 3 4 5 5.5 6 7 8 9 10 11 12 13 14
## 3 1 49 41 104 1 138 71 56 62 79 59 75 57 50
## 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
## 78 61 60 46 39 30 41 22 32 34 24 32 29 23 23
## 30 31 32 33 34 35 36 37 37.5 38 39 40 40.5 41 42
## 16 20 22 11 18 15 11 3 2 9 5 6 1 7 3
## 43 45 46 47 48 50 51 52 53 54 55 57 66 68 72
## 3 3 1 1 4 2 4 3 1 1 2 1 1 2 1
The free sulfur dioxide plot also has outliers on both ends. Notably, the number of wines rises dramatically at the 3 gram mark. I removed values below 3 grams and values above 60 grams (another set of outliers). The result is a plot with a clearly right skewed distribution: above the peak at 6 grams, wines become less common as free sulfur dioxide increases. The median is 14 and the mean is 15.87 grams of free sulfur dioxide.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 22.00 38.00 46.47 62.00 289.00
The total sulfur dioxide plot has a noticeable right skew, with median and mean values of 38 and 46.47 respectively. There are some values isolated on the far right side of the positively skewed plot. Let’s look closely at those data points.
## fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1080 7.9 0.3 0.68 8.3 0.05
## 1082 7.9 0.3 0.68 8.3 0.05
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates
## 1080 37.5 278 0.99316 3.01 0.51
## 1082 37.5 289 0.99316 3.01 0.51
## alcohol quality
## 1080 12.3 7
## 1082 12.3 7
It turns out that both outlier data points have a quality score of 7, the second highest score in the dataset. They also have the two highest amounts of total sulfur dioxide. This is puzzling. My assumption was that sulfur dioxide, which is used as a preservative, would result in a poorer taste than letting wine age naturally. Perhaps wines aged with preservatives can taste better after all.
## [1] 0.07122077
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9901 0.9956 0.9968 0.9967 0.9978 1.0037
This density plot (ratio of wine to water) seems to present a bell-shaped curve. However, using the skewness function from the moments library, the distribution is in fact very slightly skewed to the right (0.071). For density, the minimum is 0.9901 and the maximum is 1.0037, with a median of 0.9968 and a mean of 0.9967. Most wines lie between 0.9956 and 0.9978.
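For readers without the moments library, a moment-based skewness can be written in a few lines of base R. This is a sketch under the assumption that the library's skewness function uses the same population-moment definition (third central moment divided by the second central moment to the power 3/2); the function name `moment_skewness` and the two toy vectors are hypothetical.

```r
# Moment-based skewness: third central moment divided by the
# second central moment raised to the power 3/2.
moment_skewness <- function(x) {
  x <- x[!is.na(x)]          # drop missing values
  centered <- x - mean(x)
  m2 <- mean(centered^2)     # second central moment (population variance)
  m3 <- mean(centered^3)     # third central moment
  m3 / m2^(3 / 2)
}

# Toy vectors, not taken from the wine data.
symmetric    <- c(1, 2, 3, 4, 5)
right_tailed <- c(1, 1, 2, 2, 3, 10)

print(moment_skewness(symmetric))     # zero for a symmetric sample
print(moment_skewness(right_tailed))  # positive for a long right tail
```

A value close to zero, such as the 0.071 reported for density, indicates a nearly symmetric distribution; positive values indicate a longer right tail.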
## [1] 0.1935018
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.740 3.210 3.310 3.311 3.400 4.010
The pH plot also seems normally distributed with a skewness score of 0.19, which indicates that the plot is a little skewed to the right. Most wine has between 3.2 and 3.4 pH.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.5500 0.6200 0.6581 0.7300 2.0000
The sulphates plot is positively skewed with the median of 0.62 and the mean of 0.66.
##
## 8.4 8.5 8.7 8.8
## 2 1 2 2
## 9 9.05 9.1 9.2
## 30 1 23 72
## 9.23333333333333 9.25 9.3 9.4
## 1 1 59 103
## 9.5 9.55 9.56666666666667 9.6
## 139 2 1 59
## 9.7 9.8 9.9 9.95
## 54 78 49 1
## 10 10.0333333333333 10.1 10.2
## 67 2 47 46
## 10.3 10.4 10.5 10.55
## 33 41 67 2
## 10.6 10.7 10.75 10.8
## 28 27 1 42
## 10.9 11 11.0666666666667 11.1
## 49 59 1 27
## 11.2 11.3 11.4 11.5
## 36 32 32 30
## 11.6 11.7 11.8 11.9
## 15 23 29 20
## 11.95 12 12.1 12.2
## 1 21 13 12
## 12.3 12.4 12.5 12.6
## 12 13 21 6
## 12.7 12.8 12.9 13
## 9 17 9 6
## 13.1 13.2 13.3 13.4
## 2 1 3 3
## 13.5 13.5666666666667 13.6 14
## 1 1 4 7
## 14.9
## 1
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.20 10.42 11.10 14.90
Most red wines have alcohol content between 9.5% and 11.1% with median 10.2% and mean 10.4%. According to vincarta.com, the higher the amount of sugar during the fermentation process, the higher the amount of alcohol. As wines increase in alcohol content, they become rarer.
There are 1599 red wine observations with 12 variables: 11 continuous chemical measurements and an integer quality score. I treat “quality” as the output variable in this project.
Other observations:
The main feature of interest is the quality variable for red wine and how each variable affects the quality of the wine. At this stage, it may be difficult to tell which variables directly affect the quality of the wine. Is it a single variable or a combination of variables that determine the quality of the wine?
For the citric acid histogram, as I changed the binwidth from 0.1 to 0.01, three spikes appeared on the plot at 0, 0.24 and 0.49 grams forming a multimodal distribution.
As the correlation matrix shows, alcohol (0.48) and volatile acidity (-0.39) are the variables most strongly correlated with quality; the former is positively correlated and the latter negatively correlated.
Sulphates (0.25) and citric acid (0.23) are moderately correlated with quality. Residual sugar is the least correlated with quality. Below, I will visualize the relationship between quality and alcohol, residual sugar, volatile acidity, density, and citric acid.
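A correlation matrix of this kind can be computed with base R's `cor()`. The sketch below uses only the first ten rows shown in the `str()` output near the top of this report, so its values will differ from the full-data correlations quoted above; it is meant to show the mechanics, not to reproduce the results. The data frame name `first_rows` is my own.

```r
# First ten observations of three variables, copied from the str() output.
first_rows <- data.frame(
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Pearson correlation matrix (the default method of cor()).
cor_matrix <- cor(first_rows)

# Correlation of every variable with quality, rounded for display.
print(round(cor_matrix["quality", ], 2))
```

On the full dataset, the same call on the whole data frame would give the matrix that a plotting library such as corrplot then visualizes.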
## wine$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.400 9.725 9.925 9.955 10.575 11.000
## --------------------------------------------------------
## wine$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.00 9.60 10.00 10.27 11.00 13.10
## --------------------------------------------------------
## wine$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.5 9.4 9.7 9.9 10.2 14.9
## --------------------------------------------------------
## wine$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.80 10.50 10.63 11.30 14.00
## --------------------------------------------------------
## wine$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.20 10.80 11.50 11.47 12.10 14.00
## --------------------------------------------------------
## wine$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.80 11.32 12.15 12.09 12.88 14.00
As the quality of wine increases, the median and the lower and upper quartiles of alcohol content generally increase as well. The exception is at the low end: qualities 3 and 4 have slightly higher medians (9.93 and 10.0) than quality 5 (9.7). Quality 5 also stands out on this plot because it has several upper outliers. While there is a clear trend of alcohol content increasing with quality, alcohol is not the only factor that determines quality, since wines of differing quality can have the same or similar alcohol content.
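The trend in the medians can be checked directly from the grouped summaries printed above. This is a small sketch using the six median values as printed; the variable names are illustrative.

```r
# Median alcohol content per quality score, copied from the output above.
quality_levels <- 3:8
alcohol_median <- c(9.925, 10.00, 9.70, 10.50, 11.50, 12.15)

# Change in the median from each quality score to the next.
step <- diff(alcohol_median)
print(round(step, 3))

# Quality scores whose median is lower than that of the score below them.
dips <- quality_levels[-1][step < 0]
print(dips)
```

The only dip is at quality 5; from there upward every step is positive.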
I speculated initially that residual sugar of wine would correlate with quality of wine. Based on personal preference, I assumed sweeter wines would be of higher quality. It seems however that there is no correlation with quality.
## wine$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.4400 0.6475 0.8450 0.8845 1.0100 1.5800
## --------------------------------------------------------
## wine$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.230 0.530 0.670 0.694 0.870 1.130
## --------------------------------------------------------
## wine$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.180 0.460 0.580 0.577 0.670 1.330
## --------------------------------------------------------
## wine$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1600 0.3800 0.4900 0.4975 0.6000 1.0400
## --------------------------------------------------------
## wine$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3000 0.3700 0.4039 0.4850 0.9150
## --------------------------------------------------------
## wine$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2600 0.3350 0.3700 0.4233 0.4725 0.8500
Quality and volatile acidity are clearly correlated in the box plot above. As the quality increases, the volatile acidity decreases.
This correlation indicates that alcohol and density are related. To explore this further, I am going to use a scatter plot to examine the relationship between alcohol and density.
Alcohol and density appear to be negatively correlated: as alcohol increases, density decreases. Now, let us look at how density and quality are related, as depicted in the box plot below.
## wine$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9947 0.9961 0.9976 0.9975 0.9988 1.0008
## --------------------------------------------------------
## wine$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9934 0.9957 0.9965 0.9965 0.9974 1.0010
## --------------------------------------------------------
## wine$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9926 0.9962 0.9970 0.9971 0.9979 1.0031
## --------------------------------------------------------
## wine$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9901 0.9954 0.9966 0.9966 0.9979 1.0037
## --------------------------------------------------------
## wine$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9906 0.9948 0.9958 0.9961 0.9974 1.0032
## --------------------------------------------------------
## wine$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9908 0.9942 0.9949 0.9952 0.9972 0.9988
As quality increases, the median and mean density generally decrease. There is an unexpected exception at quality 4, whose median and mean are lower than those of quality 5, though it is not a significant outlier and does not deviate strongly from the trend.
As shown in the above scatterplot, volatile acidity seems strongly correlated with citric acid. As volatile acidity increases, citric acid decreases.
Let us now see the relationship between citric acid and quality of wine as depicted by a boxplot below.
## wine$quality: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0050 0.0350 0.1710 0.3275 0.6600
## --------------------------------------------------------
## wine$quality: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0300 0.0900 0.1742 0.2700 1.0000
## --------------------------------------------------------
## wine$quality: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0900 0.2300 0.2437 0.3600 0.7900
## --------------------------------------------------------
## wine$quality: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0900 0.2600 0.2738 0.4300 0.7800
## --------------------------------------------------------
## wine$quality: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.3050 0.4000 0.3752 0.4900 0.7600
## --------------------------------------------------------
## wine$quality: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0300 0.3025 0.4200 0.3911 0.5300 0.7200
There is a consistent increase in the median and mean values of citric acid as the quality increases. Though visually this may suggest a strong correlation between citric acid and quality, the correlation matrix shows that the correlation for citric acid is in fact relatively low at 0.23, compared with, for example, 0.48 for alcohol.
In this section, I mainly investigated how quality is correlated with each variable using a correlation matrix, boxplots and scatter plots. From the correlation matrix, I found three variables that seem to correlate highly with the quality of wine: alcohol, volatile acidity and citric acid. The boxplots show that as the quality of wine increases, the amounts of alcohol and citric acid increase while the volatile acidity decreases.
In the scatter plot of volatile acidity and citric acid, the citric acid decreases as the volatile acidity increases. This is interesting because the two variables are inversely correlated with each other while both correlate with wine quality, in opposite directions.
## wine$quality: 3
## low mid low mid high high
## 2 5 3 0
## --------------------------------------------------------
## wine$quality: 4
## low mid low mid high high
## 12 17 13 11
## --------------------------------------------------------
## wine$quality: 5
## low mid low mid high high
## 298 221 119 43
## --------------------------------------------------------
## wine$quality: 6
## low mid low mid high high
## 120 140 189 189
## --------------------------------------------------------
## wine$quality: 7
## low mid low mid high high
## 4 21 51 123
## --------------------------------------------------------
## wine$quality: 8
## low mid low mid high high
## 0 2 2 14
In order to plot the distribution of alcohol, I split the alcohol variable into four levels based on its quartiles: low is below the 25th percentile, mid low is between the 25th and 50th percentiles, mid high is between the 50th and 75th percentiles, and high is above the 75th percentile. As the quality increases, the share of “high” alcohol wines increases. About 62% of the quality 7 wines (123 of 199) and about 78% of the quality 8 wines (14 of 18) are in the “high” category. Thus, among higher quality wines, higher alcohol content becomes more prevalent.
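The shares of each alcohol level within a quality score can be computed from the counts printed above with `prop.table()`. The sketch below re-enters those counts as a matrix; the object names are my own.

```r
# Counts of alcohol levels per quality score, copied from the output above.
alcohol_by_quality <- matrix(
  c(  2,   5,   3,   0,
     12,  17,  13,  11,
    298, 221, 119,  43,
    120, 140, 189, 189,
      4,  21,  51, 123,
      0,   2,   2,  14),
  nrow = 6, byrow = TRUE,
  dimnames = list(
    quality = as.character(3:8),
    level   = c("low", "mid low", "mid high", "high")
  )
)

# Row-wise percentages: each row sums to 100.
row_share <- round(100 * prop.table(alcohol_by_quality, margin = 1), 1)
print(row_share)
```

Reading across the rows shows the "high" column growing and the "low" column shrinking as quality rises from 5 to 8.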
## wine$quality: 3
## low mid low mid high high
## 0 1 2 7
## --------------------------------------------------------
## wine$quality: 4
## low mid low mid high high
## 5 8 12 28
## --------------------------------------------------------
## wine$quality: 5
## low mid low mid high high
## 92 165 210 214
## --------------------------------------------------------
## wine$quality: 6
## low mid low mid high high
## 185 186 156 111
## --------------------------------------------------------
## wine$quality: 7
## low mid low mid high high
## 114 46 24 15
## --------------------------------------------------------
## wine$quality: 8
## low mid low mid high high
## 10 4 3 1
Just as with alcohol content, I split volatile acidity into four levels based on its quartiles. As depicted above, as the quality increases, the share of low volatile acidity wines increases and the share of high volatile acidity wines decreases.
## wine$quality: 3
## low mid low mid high high
## 7 0 1 2
## --------------------------------------------------------
## wine$quality: 4
## low mid low mid high high
## 29 10 7 7
## --------------------------------------------------------
## wine$quality: 5
## low mid low mid high high
## 173 246 133 129
## --------------------------------------------------------
## wine$quality: 6
## low mid low mid high high
## 163 158 152 165
## --------------------------------------------------------
## wine$quality: 7
## low mid low mid high high
## 28 14 71 86
## --------------------------------------------------------
## wine$quality: 8
## low mid low mid high high
## 3 1 5 9
Although the pattern is not as obvious as for alcohol and volatile acidity, the proportion of wine with high citric acid increases and the proportion with low citric acid decreases as quality rises. One thing to note is that at quality 6 the four citric acid levels are roughly equally represented.
In the following scatterplot, I would like to explore how quality is distributed among volatile acidity and citric acid variables.
Above, I categorize the quality variable into three levels (Low, Medium, High) in order to transform the plot into a discrete scale.
After a square root transformation of volatile acidity, the relationship between volatile acidity and citric acid appears stronger.
I observe that high quality wine is clustered around the top left with high citric acid and low volatile acidity. Although there are not many data points for low quality wine, they tend to cluster around the bottom right with higher volatile acidity and lower citric acid.
For the above scatterplot between volatile acidity and alcohol, high quality wines tend to cluster around the bottom right with higher alcohol and lower volatile acidity. In contrast, low quality wines tend to cluster around the upper left with higher volatile acidity and lower alcohol.
Higher quality wines tend to have more alcohol and citric acid and lower volatile acidity. Looking at the distribution of quality on the scatterplot of volatile acidity against citric acid, the higher quality wines tend to sit in the region of higher citric acid and lower volatile acidity.
Also by looking at the scatterplot between alcohol and volatile acidity, higher quality wines are clustered around the higher alcohol/lower volatile acidity area.
For alcohol, volatile acidity, and citric acid, I transformed those variables into categorical variables with four levels (low, mid low, mid high, high) in order to observe the proportion of each variable as the quality of wine increases.
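One way to build such a four-level variable is `cut()` with the quartiles as break points. The sketch below is an assumption about the approach, not the original code: it uses the alcohol quartiles from the summary output as breaks and the first ten alcohol values from the `str()` output as sample data. How the original analysis assigned values that fall exactly on a quartile is not shown, so the boundary handling here may differ.

```r
# Alcohol minimum, quartiles and maximum, copied from the summary output.
alcohol_breaks <- c(8.4, 9.5, 10.2, 11.1, 14.9)
level_names <- c("low", "mid low", "mid high", "high")

# First ten alcohol values from the str() output, used as sample data.
alcohol_sample <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)

# Intervals are closed on the right, so a value equal to a quartile
# falls into the lower level; include.lowest keeps the minimum in "low".
alcohol_level <- cut(
  alcohol_sample,
  breaks = alcohol_breaks,
  labels = level_names,
  include.lowest = TRUE
)

print(table(alcohol_level))
```

On the full data frame the breaks could be taken from `quantile()` instead of being typed in, provided the resulting break points are all distinct.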
Wine quality is unevenly distributed. While quality 5 has 681 observations, quality 3 has only 10. Both low quality and high quality wines are rare.
As the wine quality increases, the percentage of alcohol also increases. One exception is quality 4, which has a higher mean and median than quality 5. From quality 5 upward, alcohol content follows a consistent upward trend.
There are mainly two things to interpret from this scatter plot. 1) As citric acid increases, the volatile acidity decreases. 2) High quality red wines tend to cluster around the top left while low quality wines are mainly found on the bottom right. Thus, high quality wines tend to be high in citric acid and low in volatile acidity. Low quality wines tend to be low in citric acid and high in volatile acidity.
For this project, I explored a red wine dataset with 1599 observations to determine which properties contribute to wine quality. In preparation for this data analysis, I did some research around how red wine is created by watching YouTube videos and I read articles online about the chemical properties of wine. Since my domain knowledge of wine was shallow before conducting this analysis, this initial phase of research helped equip me with a better understanding of wine industry terminology.
My initial assumption was that low quality wine would be abundant and high quality wine rare. After analyzing the data, I learned that in fact both ends of the spectrum are relatively rare. In addition, I learned that high quality wines tend to be low in volatile acidity and high in citric acid. Low quality wines tend to be high in volatile acidity and low in citric acid. Among all properties, alcohol was most strongly correlated with quality with the highest ranked wines having more alcohol in them.
One difficulty in this analysis was that the number of observations for wine quality 3 and 8 is quite limited. Thus, it was hard to draw a rock solid conclusion from each plot. However, as I investigated further, there was a clear pattern of high quality wine with higher citric acid and lower volatile acidity, and low quality wine with lower citric acid and higher volatile acidity.
In terms of tools, R provided me with numerous useful libraries. The corrplot library in particular enabled me to see correlations among all variables. With this tool, I was able to pick three chemical components which could influence wine quality. For future work, additional wine datasets with more observations of higher and lower quality wine could strengthen my findings. I would also like to conduct a linear regression analysis based on alcohol, volatile acidity and citric acid to see how robust those variables’ correlation is with the quality of wine.
In addition, since sulphates was moderately correlated with the quality of wine, I would like to analyze the relationship between quality and sulphates to observe how much that property influences the result of the linear regression.
Lastly, it would be interesting to see if the properties of white wine are similar to those of red wine, though I happen to prefer red wine. Cheers.
Aroma Dictionary:
- About sulfur dioxide http://www.aromadictionary.com/articles/sulfurdioxide_article.html
Cookbook-r.com:
- Colors (ggplot2) http://www.cookbook-r.com/Graphs/Colors_(ggplot2)/
R-bloggers:
- Measures of skewness and kurtosis https://www.r-bloggers.com/measures-of-skewness-and-kurtosis/
- How to set plot title https://www.r-bloggers.com/how-to-format-your-chart-and-axis-titles-in-ggplot2/
RStudio Pubs:
- Combination of mutate and ifelse statement https://rstudio-pubs-static.s3.amazonaws.com/116317_e6922e81e72e4e3f83995485ce686c14.html#/5
Red Wine Dataset by P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis:
- https://s3.amazonaws.com/udacity-hosted-downloads/ud651/wineQualityInfo.txt
Stack Overflow:
- Changing the width of geom_line https://stackoverflow.com/questions/14794599/how-the-change-line-width-in-ggplot
- Centering the main plot title https://stackoverflow.com/questions/40675778/center-plot-title-in-ggplot2
- How to rotate axis labels https://stackoverflow.com/questions/1828742/rotating-axis-labels-in-r
STHDA:
- Correlation matrix http://www.sthda.com/english/wiki/correlation-matrix-a-quick-start-guide-to-analyze-format-and-visualize-a-correlation-matrix-using-r-software
- Setting a background theme http://www.sthda.com/english/wiki/ggplot2-themes-and-background-colors-the-3-elements
UC Davis:
- Fixed acidity http://waterhouse.ucdavis.edu/whats-in-wine/fixed-acidity
- Volatile acidity http://waterhouse.ucdavis.edu/whats-in-wine/volatile-acidity
Wine Makers Academy:
- About acidity http://winemakersacademy.com/understanding-wine-acidity/
Wikipedia:
- Wine acidity https://en.wikipedia.org/wiki/Acids_in_wine
YouTube:
- Wine making https://www.youtube.com/watch?v=a0sb3dS5120