Red Wine Quality Data Analysis by Arata Kagan

Date: January 27th, 2018

## [1] 1599   12
## 'data.frame':    1599 obs. of  12 variables:
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
##  fixed.acidity   volatile.acidity  citric.acid    residual.sugar  
##  Min.   : 4.60   Min.   :0.1200   Min.   :0.000   Min.   : 0.900  
##  1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090   1st Qu.: 1.900  
##  Median : 7.90   Median :0.5200   Median :0.260   Median : 2.200  
##  Mean   : 8.32   Mean   :0.5278   Mean   :0.271   Mean   : 2.539  
##  3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420   3rd Qu.: 2.600  
##  Max.   :15.90   Max.   :1.5800   Max.   :1.000   Max.   :15.500  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.01200   Min.   : 1.00       Min.   :  6.00      
##  1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 22.00      
##  Median :0.07900   Median :14.00       Median : 38.00      
##  Mean   :0.08747   Mean   :15.87       Mean   : 46.47      
##  3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 62.00      
##  Max.   :0.61100   Max.   :72.00       Max.   :289.00      
##     density             pH          sulphates         alcohol     
##  Min.   :0.9901   Min.   :2.740   Min.   :0.3300   Min.   : 8.40  
##  1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50  
##  Median :0.9968   Median :3.310   Median :0.6200   Median :10.20  
##  Mean   :0.9967   Mean   :3.311   Mean   :0.6581   Mean   :10.42  
##  3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10  
##  Max.   :1.0037   Max.   :4.010   Max.   :2.0000   Max.   :14.90  
##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.636  
##  3rd Qu.:6.000  
##  Max.   :8.000

This report explores red wine quality using a dataset of 1599 observations. Each variable is described below:

1 - Fixed acidity: most acids involved with wine are fixed or nonvolatile (they do not evaporate readily).

2 - Volatile acidity: the amount of acetic acid in wine, which at too high a level can lead to an unpleasant, vinegar taste.

3 - Citric acid: found in small quantities, citric acid can add ‘freshness’ and flavor to wine.

4 - Residual sugar: the amount of sugar remaining after fermentation stops. It is rare to find wines with less than 1 gram/liter, and wines with more than 45 grams/liter are considered sweet.

5 - Chlorides: the amount of salt in the wine.

6 - Free sulfur dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine.

7 - Total sulfur dioxide: the amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine.

8 - Density: the density of wine is close to that of water depending on the percent of alcohol and sugar content.

9 - pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale.

10 - Sulphates: a wine additive which can contribute to sulfur dioxide gas (SO2) levels and which acts as an antimicrobial and antioxidant.

11 - Alcohol: the percent alcohol content of the wine.

12 - Quality: a score between 0 and 10.

Univariate Plots Section

## 
##   3   4   5   6   7   8 
##  10  53 681 638 199  18

Before seeing the dataset, my assumption from shopping for wine in daily life was that cheap, low quality wine is prevalent and expensive, high quality wine is rarer on supermarket shelves. I assumed that low quality wine is cheap to make and therefore abundant, much like cheap clothing.

This plot shows a fascinating result. In fact, the lowest quality wine is rarest and high quality wine is just slightly less rare.

Among the wines surveyed, 8 is the highest quality score and 3 is the lowest. Score 5 is the most frequent with 681 wines, and score 3 is the least frequent with 10 wines. The highest score, 8, has 18 wines.
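To put these frequencies on a common scale, the counts can be turned into shares of the whole dataset. The following is a minimal sketch that uses only the counts from the table above; the variable names are illustrative and not taken from the original analysis.

```r
# Counts per quality score, copied from the frequency table above.
quality_counts <- c("3" = 10, "4" = 53, "5" = 681, "6" = 638, "7" = 199, "8" = 18)

# Total number of wines; this should match the 1599 rows reported earlier.
total <- sum(quality_counts)
print(total)

# Share of each score as a percentage, rounded to one decimal place.
quality_share <- round(100 * quality_counts / total, 1)
print(quality_share)

# Combined share of the two middle scores (5 and 6).
middle_share <- sum(quality_counts[c("5", "6")]) / total
print(middle_share)
```

The combined share of scores 5 and 6 comes to roughly 82% of all wines, which is why both tails of the quality distribution look so thin.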

In this project, I am going to determine which chemical properties of red wine affect quality.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.60    7.10    7.90    8.32    9.20   15.90

The minimum fixed acidity is 4.6 and the maximum is 15.9 with a median value of 7.9 and mean of 8.32.

## 
##  0.12  0.16  0.18  0.19   0.2  0.21  0.22  0.23  0.24  0.25  0.26  0.27 
##     3     2    10     2     3     6     6     5    13     7    16    14 
##  0.28  0.29 0.295   0.3 0.305  0.31 0.315  0.32  0.33  0.34  0.35  0.36 
##    23    16     1    16     2    30     2    23    20    30    22    38 
## 0.365  0.37  0.38  0.39 0.395   0.4  0.41 0.415  0.42  0.43  0.44  0.45 
##     2    24    35    35     2    37    33     3    31    43    23    22 
##  0.46  0.47 0.475  0.48  0.49   0.5  0.51  0.52  0.53  0.54 0.545  0.55 
##    31    21     2    24    35    46    24    33    29    31     5    20 
##  0.56 0.565  0.57 0.575  0.58 0.585  0.59 0.595   0.6 0.605  0.61 0.615 
##    34     1    28     3    38     3    39     1    47     3    27     6 
##  0.62 0.625  0.63 0.635  0.64 0.645  0.65 0.655  0.66 0.665  0.67 0.675 
##    24     3    29     9    27    12    16     7    26     3    23     3 
##  0.68 0.685  0.69 0.695   0.7 0.705  0.71 0.715  0.72 0.725  0.73 0.735 
##    12    11    23     7    10     6     3    12     5     9     6     8 
##  0.74 0.745  0.75 0.755  0.76 0.765  0.77 0.775  0.78 0.785  0.79 0.795 
##    11     5     6     3     5     5     6     4    10     8     2     2 
##   0.8 0.805  0.81 0.815  0.82 0.825  0.83 0.835  0.84 0.845  0.85 0.855 
##     3     1     2     3     5     1     4     4     8     1     2     3 
##  0.86 0.865  0.87 0.875  0.88 0.885  0.89 0.895   0.9  0.91 0.915  0.92 
##     2     1     4     2     5     5     1     1     3     3     4     1 
## 0.935  0.95 0.955  0.96 0.965 0.975  0.98     1 1.005  1.01  1.02 1.025 
##     2     1     1     3     3     1     3     3     1     1     4     1 
## 1.035  1.04  1.07  1.09 1.115  1.13  1.18 1.185  1.24  1.33  1.58 
##     1     3     1     1     1     1     1     1     1     2     1
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3900  0.5200  0.5278  0.6400  1.5800

The middle half of the wines have a volatile acidity between 0.39 and 0.64 (the first and third quartiles). The median volatile acidity is 0.52 and the mean is 0.528. Since a higher amount of volatile acidity leaves an unpleasant taste, volatile acidity may be negatively correlated with wine quality. This will be explored later.

## 
##    0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09  0.1 0.11 0.12 0.13 0.14 
##  132   33   50   30   29   20   24   22   33   30   35   15   27   18   21 
## 0.15 0.16 0.17 0.18 0.19  0.2 0.21 0.22 0.23 0.24 0.25 0.26 0.27 0.28 0.29 
##   19    9   16   22   21   25   33   27   25   51   27   38   20   19   21 
##  0.3 0.31 0.32 0.33 0.34 0.35 0.36 0.37 0.38 0.39  0.4 0.41 0.42 0.43 0.44 
##   30   30   32   25   24   13   20   19   14   28   29   16   29   15   23 
## 0.45 0.46 0.47 0.48 0.49  0.5 0.51 0.52 0.53 0.54 0.55 0.56 0.57 0.58 0.59 
##   22   19   18   23   68   20   13   17   14   13   12    8    9    9    8 
##  0.6 0.61 0.62 0.63 0.64 0.65 0.66 0.67 0.68 0.69  0.7 0.71 0.72 0.73 0.74 
##    9    2    1   10    9    7   14    2   11    4    2    1    1    3    4 
## 0.75 0.76 0.78 0.79    1 
##    1    3    1    1    1
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.090   0.260   0.271   0.420   1.000

For citric acid, a value of 0 grams has the highest frequency, with 132 wines. The median is 0.26 grams and the mean is 0.271 grams. The plot shows three interesting spikes in frequency at 0, 0.24 and 0.49 grams. It is also worth noting the drop in frequency after 0.49 grams. Citric acid imparts “freshness” to the wine. Whether citric acid above 0.49 grams makes wine too “fresh” and less desirable, or pleasantly fresh and simply rarer, is hard to determine from this plot alone.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.900   2.200   2.539   2.600  15.500

These two histograms reflect the amount of residual sugar in the surveyed wine with the bottom histogram log transformed to better depict the distribution of residual sugar.

Both histograms show that the most common residual sugar amount is around 2 grams. In the bottom plot, where residual sugar is log transformed, there are tall spikes around 1.8 and 2.2. The minimum value is 0.9 and the maximum is 15.5. The significant drop in the frequency of wines at 3 grams and above is noteworthy. Perhaps wines that are too sweet are less desirable. Another explanation could be that the fermentation process does not usually leave more than 3 grams of residual sugar.
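The effect of the log transformation can be sketched on a small sample. The values below are the first ten residual sugar readings from the str() output at the top of the report, so this is only an illustration and not the full distribution; base R's hist() is used here as an assumption, and the original plots may have been drawn with a different plotting library.

```r
# First ten residual sugar values from the str() output (illustrative sample only).
residual_sugar <- c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2, 6.1)

# log10 compresses the long right tail: 6.1 is more than three times the
# sample median on the raw scale, but only about 0.5 above it in log10 units.
log_sugar <- log10(residual_sugar)
print(round(log_sugar, 3))
print(log10(6.1) - log10(median(residual_sugar)))

# Bin the transformed values without drawing; plot(log_hist) would draw the bars.
log_hist <- hist(log_sugar, plot = FALSE)
print(log_hist$breaks)
print(log_hist$counts)
```

On the raw scale a single large value stretches the x-axis and squeezes most wines into a few bins, while on the log scale the bulk of the values spreads across more of the axis.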

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100

The log transformed graph better represents the distribution of chlorides. The distribution peaks at around 0.08. The median value is 0.079 and the mean is 0.087.

There are puzzling outliers at both extremes, 0.012 and about 0.61, which lie particularly far from the rest of the distribution.

##     fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 837           6.7             0.28        0.28            2.4     0.012
## 838           6.7             0.28        0.28            2.4     0.012
##     free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates
## 837                  36                  100 0.99064 3.26      0.39
## 838                  36                  100 0.99064 3.26      0.39
##     alcohol quality
## 837    11.7       7
## 838    11.7       7
##     fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 152           9.2             0.52        1.00            3.4     0.610
## 259           7.7             0.41        0.76            1.8     0.611
##     free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates
## 152                  32                   69  0.9996 2.74      2.00
## 259                   8                   45  0.9968 3.06      1.26
##     alcohol quality
## 152     9.4       4
## 259     9.4       5

The two wines with 0.012 grams of chlorides (Wine #837 and Wine #838) both have the same high quality score of 7. In contrast, the wines with about 0.61 grams of chlorides (Wine #152 and Wine #259) have lower quality scores of 4 and 5. Perhaps a lower amount of chlorides correlates with better quality wine, although four wines are far too few to draw a conclusion.
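Rows like these can be pulled out with a simple condition on the chlorides column. The sketch below rebuilds only the four rows printed above (two columns each), so it runs on its own; the thresholds 0.012 and 0.61 are taken from the output, while the small data frame name is made up for illustration.

```r
# The four rows printed above, reduced to the two columns discussed here.
# Row names are the original row numbers from the output.
chloride_outliers <- data.frame(
  chlorides = c(0.012, 0.012, 0.610, 0.611),
  quality   = c(7L, 7L, 4L, 5L),
  row.names = c("837", "838", "152", "259")
)

# Select the extreme rows at each end of the chlorides range.
low  <- subset(chloride_outliers, chlorides <= 0.012)
high <- subset(chloride_outliers, chlorides >= 0.61)

# Row numbers and mean quality at each extreme.
print(rownames(low))
print(mean(low$quality))
print(rownames(high))
print(mean(high$quality))
```

Applied to the full data frame, the same idea would be a single call along the lines of subset(wine, chlorides <= 0.012 | chlorides >= 0.61).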

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    7.00   14.00   15.87   21.00   72.00
## 
##    1    2    3    4    5  5.5    6    7    8    9   10   11   12   13   14 
##    3    1   49   41  104    1  138   71   56   62   79   59   75   57   50 
##   15   16   17   18   19   20   21   22   23   24   25   26   27   28   29 
##   78   61   60   46   39   30   41   22   32   34   24   32   29   23   23 
##   30   31   32   33   34   35   36   37 37.5   38   39   40 40.5   41   42 
##   16   20   22   11   18   15   11    3    2    9    5    6    1    7    3 
##   43   45   46   47   48   50   51   52   53   54   55   57   66   68   72 
##    3    3    1    1    4    2    4    3    1    1    2    1    1    2    1

The free sulfur dioxide plot also has outliers at both ends. Noticeably, there is a dramatic rise in the number of wines at the 3 gram mark. I removed values of less than 3 grams and values of more than 60 grams (the outliers at the other end). The result is a plot with an easy to see right skewed distribution. Based on this dataset, beyond the peak at around 6 grams, wines become less common as free sulfur dioxide increases. The median is 14 and the mean is 15.87 grams of free sulfur dioxide.
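The trimming step can be sketched as a plain range filter. In the example below, the tail values and their counts come from the frequency table above (1, 2, 66, 68 and 72), and the in-range values are the first ten readings from the str() output; the combined vector is a stand-in for the real column, not the full data.

```r
# Values outside the kept range, expanded from the frequency table above:
# 1 (3 wines), 2 (1 wine), 66 (1 wine), 68 (2 wines), 72 (1 wine).
tails <- rep(c(1, 2, 66, 68, 72), times = c(3, 1, 1, 2, 1))

# First ten free sulfur dioxide values from the str() output (all within range).
in_range <- c(11, 25, 15, 17, 11, 13, 15, 15, 9, 17)

# Stand-in for the full column.
free_so2 <- c(tails, in_range)

# Keep values from 3 to 60 inclusive, as described in the text.
trimmed <- free_so2[free_so2 >= 3 & free_so2 <= 60]

# Number of observations dropped by the filter.
print(length(free_so2) - length(trimmed))
```

According to the frequency table, this filter drops 8 of the 1599 wines (4 below 3 and 4 above 60), so the trimmed plot still represents almost the entire dataset.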

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   22.00   38.00   46.47   62.00  289.00

The plot of total sulfur dioxide has a noticeable right skew, with median and mean values of 38 and 46.47 respectively. There are some isolated values on the far right side of this positively skewed plot. Let’s look closely at those data points.

##      fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1080           7.9              0.3        0.68            8.3      0.05
## 1082           7.9              0.3        0.68            8.3      0.05
##      free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates
## 1080                37.5                  278 0.99316 3.01      0.51
## 1082                37.5                  289 0.99316 3.01      0.51
##      alcohol quality
## 1080    12.3       7
## 1082    12.3       7

It turns out that both outlier wines have a quality score of 7, the second highest score. They also have the two highest amounts of total sulfur dioxide in the dataset (278 and 289). This is puzzling. My assumption was that sulfur dioxide, which is used as a preservative, would result in a poorer taste compared with letting wine age naturally. Perhaps wines aged with preservatives do taste better after all.

## [1] 0.07122077
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9901  0.9956  0.9968  0.9967  0.9978  1.0037

The plot of density (which for wine is close to the density of water) seems to present a bell-shaped curve. However, the skewness function from the moments library shows that the distribution is in fact very slightly skewed to the right (0.071). For density, the minimum is 0.9901 and the maximum is 1.0037, with a median of 0.9968 and a mean of 0.9967. The middle half of the wines lie between 0.9956 and 0.9978.

## [1] 0.1935018
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.740   3.210   3.310   3.311   3.400   4.010

The pH plot also looks approximately normally distributed, with a skewness score of 0.19, which indicates that the distribution is a little skewed to the right. The middle half of the wines have a pH between 3.21 and 3.40.
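For readers who want to see what a skewness score measures, here is a minimal base R sketch of the moment-based definition: the third central moment divided by the second central moment raised to the power 3/2. As far as I understand, this is the same definition used by the skewness function in the moments package, but the helper below is my own illustration and not the package's code. The pH values are the first ten from the str() output, so the result will not equal the 0.19 reported for the full dataset.

```r
# Moment-based sample skewness: third central moment divided by the
# second central moment raised to the power 3/2 (hypothetical helper).
skewness_moment <- function(x) {
  x <- x[!is.na(x)]
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# First ten pH values from the str() output (illustrative sample only).
ph_sample <- c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35)
print(skewness_moment(ph_sample))

# A symmetric sample has zero skewness.
print(skewness_moment(c(1, 2, 3, 4, 5)))

# A sample with a long right tail has positive skewness.
print(skewness_moment(c(1, 1, 1, 2, 10)))
```

A value near zero, such as the 0.071 for density, means the left and right tails are almost balanced; larger positive values mean a longer right tail.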

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.5500  0.6200  0.6581  0.7300  2.0000

The sulphates plot is positively skewed with the median of 0.62 and the mean of 0.66.

## 
##              8.4              8.5              8.7              8.8 
##                2                1                2                2 
##                9             9.05              9.1              9.2 
##               30                1               23               72 
## 9.23333333333333             9.25              9.3              9.4 
##                1                1               59              103 
##              9.5             9.55 9.56666666666667              9.6 
##              139                2                1               59 
##              9.7              9.8              9.9             9.95 
##               54               78               49                1 
##               10 10.0333333333333             10.1             10.2 
##               67                2               47               46 
##             10.3             10.4             10.5            10.55 
##               33               41               67                2 
##             10.6             10.7            10.75             10.8 
##               28               27                1               42 
##             10.9               11 11.0666666666667             11.1 
##               49               59                1               27 
##             11.2             11.3             11.4             11.5 
##               36               32               32               30 
##             11.6             11.7             11.8             11.9 
##               15               23               29               20 
##            11.95               12             12.1             12.2 
##                1               21               13               12 
##             12.3             12.4             12.5             12.6 
##               12               13               21                6 
##             12.7             12.8             12.9               13 
##                9               17                9                6 
##             13.1             13.2             13.3             13.4 
##                2                1                3                3 
##             13.5 13.5666666666667             13.6               14 
##                1                1                4                7 
##             14.9 
##                1
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.20   10.42   11.10   14.90

The middle half of the red wines have an alcohol content between 9.5% and 11.1%, with a median of 10.2% and a mean of 10.4%. According to vincarta.com, the higher the amount of sugar during the fermentation process, the higher the amount of alcohol. As wines increase in alcohol content, they become rarer.

Univariate Analysis

What is the structure of your dataset?

There are 1599 red wine observations with 12 variables: 11 continuous chemical measurements and “quality”, an integer score. I treat “quality” as the output variable in this project.

Other observations:

  1. The majority of red wine is ranked 5 or 6 in quality. Both lower ranked and higher ranked wines are notably rarer, with the lowest quality wine being the rarest.
  2. Median volatile acidity is 0.52 grams. Too much volatile acidity results in an unpleasant, vinegar-like taste. Perhaps the lowest quality wines sit within the right tail, which represents the highest levels of volatile acidity. Is there a negative correlation with quality here?
  3. The most frequent citric acid value is 0 grams. Since citric acid imparts “freshness”, which I assumed would result in better taste, it is interesting to find a mode of 0, which means no citric acid at all!
  4. The middle half of the red wines have an alcohol content between 9.5% and 11.1%. The higher the alcohol, the rarer the wine becomes. We cannot determine yet whether alcohol content correlates with quality, as both low quality and high quality wines are rare.

What is/are the main feature(s) of interest in your dataset?

The main feature of interest is the quality variable for red wine and how each variable affects the quality of the wine. At this stage, it may be difficult to tell which variables directly affect the quality of the wine. Is it a single variable or a combination of variables that determine the quality of the wine?

Of the features you investigated, were there any unusual distributions?

For the citric acid histogram, as I changed the binwidth from 0.1 to 0.01, three spikes appeared on the plot at 0, 0.24 and 0.49 grams forming a multimodal distribution.

Bivariate Plots Section

As you can see from the correlation matrix, alcohol (0.48) and volatile acidity (-0.39) are the variables most strongly correlated with quality. The former is positively correlated and the latter is negatively correlated.

Sulphates (0.25) and citric acid (0.23) are weakly to moderately correlated with quality. Residual sugar is the least correlated with quality. Below, I will visualize the relationship between quality and alcohol, residual sugar, volatile acidity, density, and citric acid.
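A correlation matrix of this kind can be computed with cor(). The sketch below uses only the first ten rows of three columns, copied from the str() output, so that it runs without the data file; the coefficients it prints therefore will not match the values quoted above, which are based on all 1599 rows.

```r
# First ten rows of three columns, copied from the str() output
# (illustrative sample only, not the full dataset).
first_rows <- data.frame(
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Pearson correlation between every pair of columns.
cor_matrix <- cor(first_rows)
print(round(cor_matrix, 2))

# Correlation of each property with quality, strongest first.
quality_cor <- cor_matrix[, "quality"]
quality_cor <- quality_cor[names(quality_cor) != "quality"]
print(quality_cor[order(abs(quality_cor), decreasing = TRUE)])
```

Sorting by absolute value keeps strong negative correlations, such as the one for volatile acidity, near the top of the list instead of at the bottom.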

## wine$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   8.400   9.725   9.925   9.955  10.575  11.000 
## -------------------------------------------------------- 
## wine$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.00    9.60   10.00   10.27   11.00   13.10 
## -------------------------------------------------------- 
## wine$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     8.5     9.4     9.7     9.9    10.2    14.9 
## -------------------------------------------------------- 
## wine$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.80   10.50   10.63   11.30   14.00 
## -------------------------------------------------------- 
## wine$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.20   10.80   11.50   11.47   12.10   14.00 
## -------------------------------------------------------- 
## wine$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.80   11.32   12.15   12.09   12.88   14.00

As the quality of wine increases, the median and the lower and upper quartiles of alcohol content also increase, except that wines of quality 3 and 4 sit slightly above quality 5. Quality 5 stands out on this plot because it has several upper outliers. While there is a clear trend of increasing alcohol content as quality increases, alcohol is not the only factor that determines the quality of wine, since wines of differing quality can have the same or a similar alcohol content.
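Per-group summaries like the ones above can be produced by splitting one column by another. The layout of the output above resembles what by() prints, although the exact call used for the report is not shown. The sketch below applies tapply() and by() to the first ten rows from the str() output, so the numbers are illustrative only.

```r
# First ten rows of alcohol and quality, copied from the str() output
# (illustrative sample only).
first_rows <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Median alcohol content for each quality score present in the sample.
alcohol_median <- tapply(first_rows$alcohol, first_rows$quality, median)
print(alcohol_median)

# Full six-number summary per quality score.
alcohol_summary <- by(first_rows$alcohol, first_rows$quality, summary)
print(alcohol_summary)
```

Even in this tiny sample the median rises from quality 5 to quality 7, in line with the trend seen in the full data.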

I speculated initially that residual sugar of wine would correlate with quality of wine. Based on personal preference, I assumed sweeter wines would be of higher quality. It seems however that there is no correlation with quality.

## wine$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.4400  0.6475  0.8450  0.8845  1.0100  1.5800 
## -------------------------------------------------------- 
## wine$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.230   0.530   0.670   0.694   0.870   1.130 
## -------------------------------------------------------- 
## wine$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.180   0.460   0.580   0.577   0.670   1.330 
## -------------------------------------------------------- 
## wine$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1600  0.3800  0.4900  0.4975  0.6000  1.0400 
## -------------------------------------------------------- 
## wine$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3000  0.3700  0.4039  0.4850  0.9150 
## -------------------------------------------------------- 
## wine$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2600  0.3350  0.3700  0.4233  0.4725  0.8500

Quality and volatile acidity are clearly correlated in the box plot above. As quality increases, volatile acidity decreases, leveling off between quality 7 and 8, which share a median of 0.37.

The correlation matrix indicates that alcohol is correlated with density. To explore this further, I am going to use a scatter plot to examine the relationship between alcohol and density.

Alcohol and density appear to be negatively correlated: as alcohol increases, density decreases. Now, let us look at how density and quality are related, as depicted in the box plot below.

## wine$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9947  0.9961  0.9976  0.9975  0.9988  1.0008 
## -------------------------------------------------------- 
## wine$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9934  0.9957  0.9965  0.9965  0.9974  1.0010 
## -------------------------------------------------------- 
## wine$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9926  0.9962  0.9970  0.9971  0.9979  1.0031 
## -------------------------------------------------------- 
## wine$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9901  0.9954  0.9966  0.9966  0.9979  1.0037 
## -------------------------------------------------------- 
## wine$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9906  0.9948  0.9958  0.9961  0.9974  1.0032 
## -------------------------------------------------------- 
## wine$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9908  0.9942  0.9949  0.9952  0.9972  0.9988

As quality increases, the median and mean density generally decrease. There is an unexpected exception at quality 4, whose median and mean are lower than those of quality 5, though it is not a significant outlier and does not deviate strongly from the trend.

As shown in the above scatterplot, volatile acidity seems strongly and negatively correlated with citric acid: as volatile acidity increases, citric acid decreases.
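The exception to the downward trend can be located by differencing the group medians. This sketch uses the six median density values copied from the summaries above; the variable names are illustrative.

```r
# Median density per quality score, copied from the summaries above.
density_median <- c("3" = 0.9976, "4" = 0.9965, "5" = 0.9970,
                    "6" = 0.9966, "7" = 0.9958, "8" = 0.9949)

# Change in median from each quality score to the next one.
step <- diff(density_median)
print(step)

# Steps where the median rises instead of falls; each name marks the
# quality score at the end of the rising step.
rising <- names(step)[step > 0]
print(rising)
```

Only the step from quality 4 to quality 5 rises, which is the exception described above; every other step is a decrease.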

Let us now see the relationship between citric acid and quality of wine as depicted by a boxplot below.

## wine$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0050  0.0350  0.1710  0.3275  0.6600 
## -------------------------------------------------------- 
## wine$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0300  0.0900  0.1742  0.2700  1.0000 
## -------------------------------------------------------- 
## wine$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0900  0.2300  0.2437  0.3600  0.7900 
## -------------------------------------------------------- 
## wine$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0900  0.2600  0.2738  0.4300  0.7800 
## -------------------------------------------------------- 
## wine$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.3050  0.4000  0.3752  0.4900  0.7600 
## -------------------------------------------------------- 
## wine$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0300  0.3025  0.4200  0.3911  0.5300  0.7200

There is a consistent increase in the median and mean values of citric acid as quality increases. Though visually this may suggest a strong correlation between citric acid and quality, the correlation matrix shows that the correlation coefficient for citric acid is in fact relatively low at 0.23, compared to, for example, 0.48 for alcohol.

Bivariate Analysis

How did the feature(s) of interest vary with other features in the dataset?

In this section, I mainly investigated how quality is correlated with each variable using a correlation matrix, boxplots and scatter plots. From the correlation matrix, I found three variables which seem to correlate most with the quality of wine: alcohol, volatile acidity and citric acid. Using boxplots, we can see that as the quality of wine increases, the amount of alcohol and citric acid increases while the volatile acidity decreases.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

In the scatter plot of volatile acidity and citric acid, citric acid decreases as volatile acidity increases. This is interesting because the two variables are negatively correlated with each other while both correlate with the quality of wine, in opposite directions.

Multivariate Plots Section

## wine$quality: 3
##      low  mid low mid high     high 
##        2        5        3        0 
## -------------------------------------------------------- 
## wine$quality: 4
##      low  mid low mid high     high 
##       12       17       13       11 
## -------------------------------------------------------- 
## wine$quality: 5
##      low  mid low mid high     high 
##      298      221      119       43 
## -------------------------------------------------------- 
## wine$quality: 6
##      low  mid low mid high     high 
##      120      140      189      189 
## -------------------------------------------------------- 
## wine$quality: 7
##      low  mid low mid high     high 
##        4       21       51      123 
## -------------------------------------------------------- 
## wine$quality: 8
##      low  mid low mid high     high 
##        0        2        2       14

In order to plot the distribution of alcohol, I split the alcohol variable by quartile. For alcohol content, low is below the 25th percentile, mid low is between the 25th and 50th percentiles, mid high is between the 50th and 75th percentiles, and high is above the 75th percentile. As quality increases, the proportion of “high” alcohol wines increases. For quality 7, 123 of 199 wines (about 62%) are in the “high” category, and for quality 8 it is 14 of 18 (about 78%). Thus, with higher quality wine, higher alcohol content becomes more prevalent.
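The proportions in each row can be worked out directly from the counts printed above. The following sketch enters those counts as a matrix and converts each row to percentages with prop.table(); the object names are illustrative.

```r
# Counts of wines per quality score (rows) and alcohol level (columns),
# copied from the output above.
alcohol_by_quality <- matrix(
  c(  2,   5,   3,   0,
     12,  17,  13,  11,
    298, 221, 119,  43,
    120, 140, 189, 189,
      4,  21,  51, 123,
      0,   2,   2,  14),
  nrow = 6, byrow = TRUE,
  dimnames = list(
    quality = as.character(3:8),
    alcohol = c("low", "mid low", "mid high", "high")
  )
)

# Share of each alcohol level within a quality score (each row sums to 1).
row_share <- prop.table(alcohol_by_quality, margin = 1)
print(round(100 * row_share, 1))
```

Reading down the "high" column shows the share rising with quality, from about 6% at quality 5 to about 78% at quality 8.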

## wine$quality: 3
##      low  mid low mid high     high 
##        0        1        2        7 
## -------------------------------------------------------- 
## wine$quality: 4
##      low  mid low mid high     high 
##        5        8       12       28 
## -------------------------------------------------------- 
## wine$quality: 5
##      low  mid low mid high     high 
##       92      165      210      214 
## -------------------------------------------------------- 
## wine$quality: 6
##      low  mid low mid high     high 
##      185      186      156      111 
## -------------------------------------------------------- 
## wine$quality: 7
##      low  mid low mid high     high 
##      114       46       24       15 
## -------------------------------------------------------- 
## wine$quality: 8
##      low  mid low mid high     high 
##       10        4        3        1

As with the alcohol content variable, I split volatile acidity into quartiles. As depicted above, as quality increases, the proportion of wines with low volatile acidity increases and the proportion with high volatile acidity decreases.

## wine$quality: 3
##      low  mid low mid high     high 
##        7        0        1        2 
## -------------------------------------------------------- 
## wine$quality: 4
##      low  mid low mid high     high 
##       29       10        7        7 
## -------------------------------------------------------- 
## wine$quality: 5
##      low  mid low mid high     high 
##      173      246      133      129 
## -------------------------------------------------------- 
## wine$quality: 6
##      low  mid low mid high     high 
##      163      158      152      165 
## -------------------------------------------------------- 
## wine$quality: 7
##      low  mid low mid high     high 
##       28       14       71       86 
## -------------------------------------------------------- 
## wine$quality: 8
##      low  mid low mid high     high 
##        3        1        5        9

Although it is not as obvious as for the alcohol and volatile acidity variables, as seen above, the proportion of wine with high citric acid increases and the proportion with low citric acid decreases with higher quality wine. One thing to note is that at quality 6 all levels of citric acid are roughly equally represented.

In the following scatterplot, I would like to explore how quality is distributed among volatile acidity and citric acid variables.

Above, I grouped the quality variable into three levels (Low, Medium, High) so that quality can be shown on a discrete color scale in the plot.

After applying a square root transformation to volatile acidity, the relationship between volatile acidity and citric acid appears clearer.
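Grouping a numeric score into a few ordered levels can be done with cut(). The cut points used for the plot are not stated in this report, so the boundaries below (3 to 4 as Low, 5 to 6 as Medium, 7 to 8 as High) are purely an assumption for illustration; the quality vector is rebuilt from the frequency table in the univariate section.

```r
# Rebuild the quality column from the frequency table shown earlier.
quality <- rep(3:8, times = c(10, 53, 681, 638, 199, 18))

# ASSUMED cut points (not stated in the report):
# 3-4 = Low, 5-6 = Medium, 7-8 = High.
quality_level <- cut(
  quality,
  breaks = c(2, 4, 6, 8),
  labels = c("Low", "Medium", "High")
)

# Number of wines in each level under these assumed boundaries.
print(table(quality_level))
```

Under these assumed boundaries, the Low group contains only 63 wines, which would be consistent with the remark that there are not many data points for low quality wine.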

I observe that high quality wine is clustered around the top left with high citric acid and low volatile acidity. Although there are not many data points for low quality wine, they tend to cluster around the bottom right with higher volatile acidity and lower citric acid.

For the above scatterplot between volatile acidity and alcohol, high quality wines tend to cluster around the bottom right with higher alcohol and lower volatile acidity. In contrast, low quality wines tend to cluster around the upper left with higher volatile acidity and lower alcohol.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation.

Higher quality wines tend to have a higher amount of alcohol and citric acid and lower volatile acidity. Looking at the distribution of quality on the scatterplot of volatile acidity and citric acid, the higher quality wines tend to be located in the region of higher citric acid and lower volatile acidity.

Also by looking at the scatterplot between alcohol and volatile acidity, higher quality wines are clustered around the higher alcohol/lower volatile acidity area.

Did you create any new variables from existing variables in the dataset?

For alcohol, volatile acidity, and citric acid, I transformed each variable into a categorical variable with four levels (low, mid low, mid high, high) in order to observe how the proportion of wines at each level changes as quality increases.
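One way to build such a four-level variable is sketched below. It assumes the levels are split at the quartiles of the variable, which the text does not state; the column name `citric.level` and the toy data frame `wine_toy` are also hypothetical.

```r
# Toy stand-in for the wine data; values are invented for illustration only.
wine_toy <- data.frame(
  citric.acid = c(0.00, 0.05, 0.10, 0.20, 0.26, 0.30,
                  0.40, 0.45, 0.50, 0.60, 0.70, 0.80),
  quality     = c(5L, 5L, 5L, 5L, 6L, 6L, 6L, 6L, 7L, 7L, 7L, 7L)
)

# Assumption: the four levels are split at the quartiles of the variable.
breaks <- quantile(wine_toy$citric.acid, probs = c(0, 0.25, 0.5, 0.75, 1))
wine_toy$citric.level <- cut(
  wine_toy$citric.acid,
  breaks = breaks,
  labels = c("low", "mid low", "mid high", "high"),
  include.lowest = TRUE  # keep the minimum value inside the first level
)

# Cross-tabulate quality against level, then turn each row into proportions.
counts <- table(wine_toy$quality, wine_toy$citric.level)
props  <- prop.table(counts, margin = 1)
print(round(props, 2))
```

Working with row proportions rather than raw counts makes quality levels with very different sample sizes comparable.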


Final Plots and Summary

Plot One

Description One

Wine quality is unevenly distributed. While quality 5 has 681 observations, quality 3 has only 10. Both low quality and high quality wines are rare.

Plot Two

Description Two

As wine quality increases, alcohol content also tends to increase. One exception is quality 4, which has a higher mean and median alcohol content than quality 5. From quality 5 upward, alcohol content follows a consistent upward trend.
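A comparison like this can be checked numerically by computing the mean and median per quality level. The sketch below uses invented numbers (chosen to contain one dip, not copied from the dataset) and a hypothetical data frame `wine_toy`.

```r
# Toy data, invented for illustration; not the actual wine measurements.
wine_toy <- data.frame(
  quality = c(4L, 4L, 5L, 5L, 5L, 6L, 6L, 7L, 7L),
  alcohol = c(10.0, 10.4, 9.4, 9.8, 10.0, 10.2, 10.8, 11.2, 11.8)
)

# Mean and median alcohol content for each quality score.
mean_alcohol   <- tapply(wine_toy$alcohol, wine_toy$quality, mean)
median_alcohol <- tapply(wine_toy$alcohol, wine_toy$quality, median)
print(rbind(mean = mean_alcohol, median = median_alcohol))

# Quality scores whose mean is lower than that of the score just below them,
# i.e. the places where the upward trend is broken.
dips <- names(mean_alcohol)[-1][diff(as.vector(mean_alcohol)) < 0]
print(dips)
```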

Plot Three

Description Three

There are two main things to take from this scatterplot. 1) As citric acid increases, volatile acidity decreases. 2) High quality red wines tend to cluster around the top left, while low quality wines are mainly found on the bottom right. Thus, high quality wines tend to be high in citric acid and low in volatile acidity, and low quality wines tend to be low in citric acid and high in volatile acidity.


Reflection

For this project, I explored a red wine dataset with 1599 observations to determine which properties contribute to wine quality. In preparation, I researched how red wine is made by watching YouTube videos and reading online articles about the chemical properties of wine. Since my domain knowledge of wine was shallow before this analysis, this initial research gave me a better understanding of wine industry terminology.

My initial assumption was that low quality wine would be abundant and high quality wine rare. After analyzing the data, I learned that both ends of the spectrum are in fact relatively rare. In addition, I learned that high quality wines tend to be low in volatile acidity and high in citric acid, while low quality wines tend to be high in volatile acidity and low in citric acid. Among all properties, alcohol was the most strongly correlated with quality: the highest ranked wines tended to contain more alcohol.

One difficulty in this analysis was that the number of observations for wine quality 3 and 8 is limited. Thus, it was hard to draw a firm conclusion from each plot. However, as I investigated further, there was a clear pattern of high quality wine with higher citric acid and lower volatile acidity, and low quality wine with lower citric acid and higher volatile acidity.

In terms of tools, R provided me with numerous useful libraries. The corrplot library in particular enabled me to see correlations among all variables. With this tool, I was able to pick three chemical components which could influence wine quality. For future work, additional wine datasets with more observations of higher and lower quality wine could strengthen my findings. I would also like to conduct a linear regression analysis based on alcohol, volatile acidity and citric acid to see how well those variables explain the quality of wine.

In addition, since sulphates was moderately correlated with the quality of wine, I would like to analyze the relationship between quality and sulphates to observe how much this property influences the result of the linear regression.
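As a rough sketch of what that future regression could look like, the example below fits two nested models with `lm()` and compares them with `anova()`. Everything here is an assumption for illustration: the data are simulated, the coefficients used to generate them are arbitrary, and quality is treated as a plain numeric response even though it is an ordinal score.

```r
set.seed(1)  # reproducible toy data
n <- 60

# Simulated predictors using the dataset's column names; values are made up.
wine_toy <- data.frame(
  alcohol          = runif(n, 8.5, 14),
  volatile.acidity = runif(n, 0.1, 1.2),
  citric.acid      = runif(n, 0, 0.8),
  sulphates        = runif(n, 0.3, 1.2)
)

# Made-up response so that lm() has something to fit.
wine_toy$quality <- 2 + 0.3 * wine_toy$alcohol -
  1.0 * wine_toy$volatile.acidity + rnorm(n, sd = 0.5)

# Base model with the three variables singled out in the analysis.
fit_base <- lm(quality ~ alcohol + volatile.acidity + citric.acid,
               data = wine_toy)

# Same model with sulphates added, to see how much it changes the fit.
fit_full <- update(fit_base, . ~ . + sulphates)

print(summary(fit_base)$r.squared)
print(summary(fit_full)$r.squared)
print(anova(fit_base, fit_full))  # F-test for the added term
```

On the real data, the same two calls would be run against the actual wine data frame. The change in R-squared and the F-test give one view of how much sulphates contributes beyond the other three variables.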

Lastly, it would be interesting to see if the properties of white wine are similar to those of red wine, though I happen to prefer red wine. Cheers.


References

Aroma Dictionary:
- About sulfur dioxide: http://www.aromadictionary.com/articles/sulfurdioxide_article.html

Cookbook-r.com: http://www.cookbook-r.com/Graphs/Colors_(ggplot2)/

R-bloggers:
- Measures of skewness and kurtosis: https://www.r-bloggers.com/measures-of-skewness-and-kurtosis/
- How to set the plot title: https://www.r-bloggers.com/how-to-format-your-chart-and-axis-titles-in-ggplot2/

RStudio Pubs:
- Combination of mutate and ifelse statement: https://rstudio-pubs-static.s3.amazonaws.com/116317_e6922e81e72e4e3f83995485ce686c14.html#/5

Red Wine Dataset by P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis: https://s3.amazonaws.com/udacity-hosted-downloads/ud651/wineQualityInfo.txt

Stack Overflow:
- Changing the width of geom_line: https://stackoverflow.com/questions/14794599/how-the-change-line-width-in-ggplot
- Centering the main plot title: https://stackoverflow.com/questions/40675778/center-plot-title-in-ggplot2
- Rotating axis labels: https://stackoverflow.com/questions/1828742/rotating-axis-labels-in-r

STHDA:
- Correlation matrix: http://www.sthda.com/english/wiki/correlation-matrix-a-quick-start-guide-to-analyze-format-and-visualize-a-correlation-matrix-using-r-software
- Setting a background theme: http://www.sthda.com/english/wiki/ggplot2-themes-and-background-colors-the-3-elements

UC Davis:
- Fixed acidity: http://waterhouse.ucdavis.edu/whats-in-wine/fixed-acidity
- Volatile acidity: http://waterhouse.ucdavis.edu/whats-in-wine/volatile-acidity

Wine Makers Academy:
- About acidity: http://winemakersacademy.com/understanding-wine-acidity/

Wikipedia:
- Wine acidity: https://en.wikipedia.org/wiki/Acids_in_wine

YouTube:
- Wine making: https://www.youtube.com/watch?v=a0sb3dS5120