Overview

The file I have chosen to work with is the Red Wine Quality file.

I want to answer the question, ‘what features constitute a “good” red wine?’ I know from the research I carried out that the following relevant features contribute to the taste of wine:

- Alcohol levels contribute to the flavour of the wine

- Acid is responsible for the fresh, crisp taste of wine. In fact, acids impart the sourness or tartness that is a fundamental feature of wine taste. Wines lacking in acid are “flat.” This corresponds to the ‘fixed acidity’ variable in the data set.

- Wine spoilage is legally defined by volatile acidity. The amount of volatile acidity found in sound grapes [ought to be] negligible. It is a byproduct of microbial metabolism.

- Acetic acid bacteria require oxygen to grow, so eliminating air from wine containers and adding sulfur dioxide will limit their growth.

These insights indicate to me that ‘good’ red wine will have lower volatile acidity, higher fixed acidity, and some sulfur dioxide present. I also expect that the questions I ask of the data will reveal the optimal range of some other features, e.g. alcohol percentage.

These insights informed the questions I thought of in my conversation with the data set.

**********************************************

I began by taking a quick look at the data itself, by printing out the first five rows:

##   X fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 1           7.4             0.70        0.00            1.9     0.076
## 2 2           7.8             0.88        0.00            2.6     0.098
## 3 3           7.8             0.76        0.04            2.3     0.092
## 4 4          11.2             0.28        0.56            1.9     0.075
## 5 5           7.4             0.70        0.00            1.9     0.076
##   free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates alcohol
## 1                  11                   34  0.9978 3.51      0.56     9.4
## 2                  25                   67  0.9968 3.20      0.68     9.8
## 3                  15                   54  0.9970 3.26      0.65     9.8
## 4                  17                   60  0.9980 3.16      0.58     9.8
## 5                  11                   34  0.9978 3.51      0.56     9.4
##   quality
## 1       5
## 2       5
## 3       5
## 4       6
## 5       5

From a visual appraisal of the data, I can get a sense of the types of values that are present. This will inform the way I approach graphing the counts of the different pieces of data in the Univariate data section.

I then decided that I needed to know about the overall structure of the data, so I generated a quick look at aspects of the data:

## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000
## [1] 1599   13
##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"
## [1] 0
## [1] 132
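The snapshot above can be reproduced with a handful of base R calls. The sketch below is only an illustration: `wine_sample` is a hypothetical name, and it holds just the five rows (and a few of the columns) printed earlier rather than the full data set. The `n_missing` line is assumed to be the kind of NA check behind a count of 0.

```r
# Five rows copied from the head() output above; only a few columns are kept.
wine_sample <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4),
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70),
  pH               = c(3.51, 3.20, 3.26, 3.16, 3.51),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4),
  quality          = c(5L, 5L, 5L, 6L, 5L)
)

head(wine_sample, 5)   # first rows of the data
str(wine_sample)       # column types and a preview of the values
summary(wine_sample)   # min, quartiles, mean and max per column
dim(wine_sample)       # number of rows and columns
names(wine_sample)     # column names

# Total number of missing values across all columns.
n_missing <- sum(is.na(wine_sample))
n_missing
```

With the full data frame in place of `wine_sample`, the same calls would give the structure, summary and dimensions shown above.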

Observations - I made some observations on the question of ‘what does a snapshot of this data tell me?’

- The data set ‘wineQualityReds’ contains 13 variables, with 1599 observations.

- The data is all in numeric format, so no conversions need to occur (i.e. from string to numeric).

- I noted the minimum and maximum of each column to inform choices such as axis limits for outliers and bin widths. I will most likely need to set axis limits in my graphs, and to apply square root and log transformations, in order to get a better sense of the distribution of the data.

Note: I use different colours for each graph that follows, to visually separate out progress of the graphs.

Plotting

It was clear to me that the variable ‘quality’ was the most important variable, against which I would need to think about all other variables, in order to address my guiding question, ‘what makes a “good” red wine?’ I thought to myself, ‘what does the distribution of quality look like?’

I used qplot to look at the layout of ‘quality’; deciding to set the binwidth to 0.5 first, but then changing this option to 1, as quality is all in integers, so a binwidth lower than 1 doesn’t make sense:

using ggplot syntax

As I say above, I wanted to know what the distribution of ‘quality’ looks like. Looking at the initial ‘quality’ graph, I saw that all the wines fall within a certain range (3-8), so I set breaks to get a better idea of this:
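A minimal sketch of this kind of histogram, assuming ggplot2 is available. It uses only the first ten quality scores from the `str()` output earlier, so the picture it draws is not the real distribution. The colours and the `quality_counts` name are illustrative choices, not taken from the original plots.

```r
# First ten quality scores, copied from the str() output earlier in the report.
quality <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# Counts per score: this is what a binwidth of 1 displays for integer data.
quality_counts <- table(quality)
quality_counts

# Draw the plot only if ggplot2 is installed, so the sketch still runs without it.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(data.frame(quality = quality), aes(x = quality)) +
    geom_histogram(binwidth = 1, colour = "black", fill = "steelblue") +
    scale_x_continuous(breaks = seq(3, 8, by = 1))  # one tick per quality score
  print(p)
}
```

A binwidth below 1 would only add empty bins between the integer scores, which is why 1 is the natural choice here.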

Observations on Quality distribution:

From these graphs, I can surmise that quality falls between 3 and 8 in terms of points rankings - so, across 6 points.

The majority of wines have a quality of 5 or 6, or ‘medium quality’, if we group the scores into low (3-4), medium (5-6) and high (7-8) quality.
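The low, medium and high grouping can be expressed with `cut()`. This is a sketch on the first ten quality scores from the `str()` output only; `quality_band` and `band_counts` are hypothetical names.

```r
# First ten quality scores, copied from the str() output earlier in the report.
quality <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# Intervals are (2,4], (4,6] and (6,8], i.e. scores 3-4, 5-6 and 7-8.
quality_band <- cut(quality,
                    breaks = c(2, 4, 6, 8),
                    labels = c("low", "medium", "high"))

# Number of wines in each band.
band_counts <- table(quality_band)
band_counts
```

In this ten-row sample, eight wines land in the medium band and two in the high band, with none in the low band.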

Fixed Acidity graph

I changed the binwidth to 0.1 to get a more detailed representation of the data distribution. I also tried some different ‘step’ values for the break sequence, deciding that 1 is best in the end.

Fixed.acidity has a right-skewed, long-tailed distribution, with the highest counts at a fixed acidity of around 7.

From my research, I expect better quality wines to have medium to higher fixed acidity.

Volatile Acidity graph

I discovered earlier that the max is 1.58 and the min is 0.12 for volatile acidity.

I decided to set the bin width to 0.01, as this is a narrow range.

The peak lies between 0.25 and 0.75, where the count reaches 350 to 400.

I expect better quality wines to have low volatile acidity, as it is associated with spoilage.

Residual sugar graph

I wanted to answer the question, ‘what does the distribution of residual sugar look like?’

I set the bin width to 0.5, to get a clearer picture of the variation in counts.

I decided to log it in order to get an even clearer picture of the variations.

I notice that the residual sugar count is highest for lower amounts (min is 0.9, max is 15.5, as we learned earlier). It seems that circa two thirds of the wines have residual sugar of between about 0.8 and 1.6.

Citric Acid graph

I wanted to answer the question, ‘what does the distribution of citric acid look like?’

I set the bin width to 0.5, to get a clearer picture of the variation in counts. There is quite a variation in citric acid amounts, but the highest counts occur between 0 and 0.5.

Chlorides graph

I wanted to answer the question, ‘what does the distribution of chlorides look like?’

I set the bin width to 0.01, to get a clearer picture of the variation in counts.

This graph has a long tail, but the counts in the tail are negligible as they are so low, so I cut off some of these outliers to generate a graph without any skewing factors:

Free Sulfur Dioxide graph

I wanted to answer the question, ‘what does the distribution of free sulfur dioxide look like?’

I set the bin width to 1 in order to get a clear picture.

There is a long tail on the right side of this graph; the data is over-dispersed.

I transformed the x axis to compress the long tail, in order to better understand the distribution of free sulfur dioxide. I used log 10. I add ‘+1’ before taking the log so that values between 0 and 1 do not produce negative results (and a value of 0 does not produce an undefined one) that would skew the graph.
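The effect of the ‘+1’ can be checked directly. A small sketch, using the first ten free sulfur dioxide readings from the `str()` output; `free_so2` and `log_free_so2` are hypothetical names.

```r
# First ten free sulfur dioxide readings, copied from the str() output.
free_so2 <- c(11, 25, 15, 17, 11, 13, 15, 15, 9, 17)

# The transformation applied to the x axis: log base 10 of (value + 1).
log_free_so2 <- log10(free_so2 + 1)
round(log_free_so2, 2)

# Why the +1 matters for small values:
log10(c(0, 0.5, 1))      # -Inf, about -0.30, 0
log10(c(0, 0.5, 1) + 1)  # 0, about 0.18, about 0.30
```

Since the minimum free sulfur dioxide in this data set is 1, the plain log would already be non-negative here; the ‘+1’ is a safeguard rather than a necessity.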

Total Sulfur Dioxide graph

The distribution of total sulfur dioxide was interesting to me because it is supposed to help guard against a wine getting spoiled, so I wanted to plot its layout:

I set the bin width to 1, to get a clearer picture of the variation in counts.

There is a wide variation in total sulfur dioxide among the data.

I decided to limit the x axis, as there were outliers skewing the graph that I determined were not important.

This generated a much clearer picture of the distribution of the counts.

Density graph

Next, I moved on to the question, ‘how is density distributed?’

I set the bin width to 0.001, to get a clearer picture of the variation in counts, as the values are spread over a small window (0.9901 to 1.0037).

This generated a much clearer picture of the distribution of the counts.

pH graph

Since I researched it and found out about the ideal pH levels for red wine (3.3-3.6), I wanted to find out, ‘what does the distribution of pH look like?’

I set the bin width to 0.01, as I discovered through trial and error that this was the best binwidth to reflect changes across the window of the distribution. We see that the pH of the wines varies only from about 2.7 to 4.0, with many variations within this window. The most common band seems to be that between 3.2 and 3.7:

Sulphates graph

I wanted to answer the question, ‘what does the distribution of sulphates look like?’

I set the bin width to 0.1. This allowed me to observe that the distribution has a long tail to the right:

Alcohol graph

I was curious to know, ‘what does the distribution of alcohol look like?’, considering it affects the taste of wine.

The layout of ‘alcohol’ demonstrates that the highest count of alcohol level is circa 9.5 - 10.5 percent. However, the distribution is not sharply spiked, i.e. there is a decent distribution of values from 8.5 to 12.5 percent. This suggests to me that wine alcohol levels can vary and still maintain their quality. Perhaps alcohol level simply influences types of taste, rather than a certain amount being objectively ‘good’ or ‘bad.’

Univariate Plots Questions

Overview

Univariate Plots Section Questions

What is the structure of your dataset?

The data set ‘wineQualityReds’ contains 13 variables, with 1599 observations.

What is/are the main feature(s) of interest in your dataset?

In my opinion, from observing the distributions plotted above, the main feature is quality, against which all other features will be measured.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Other features which I think will be important are alcohol, acidity, sulfur dioxide and pH.

Did you create any new variables from existing variables in the dataset?

I subset the dataset to create a data frame of the wines which have higher quality, calling it ‘goodReds’; I generated this in the following section.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

I checked for NAs; I performed log calculations on some of the x axis data, in order to determine a better outline of the distribution.

************************************************************

Bivariate Analysis

Reflection on questions

In my research, I looked into the features of what is considered to be ‘good’ wine. From this research into blogs, magazines and other online resources, I was able to direct my questions on the data. For example, I discovered that pH levels of wine can affect taste. I also discovered that a pH level of 3.3 to 3.6 is best for red wines. So, for example, one question I wanted to investigate in my EDA was to compare quality of wines and pH level, to see if the data backs up this theory.

Bivariate Analysis - Quality vs. other metrics

Introduction

I wanted to know, ‘what metrics do good red wines have in common?’

So, I decided to subset the data based on quality, isolating the rows that contain the best quality - that of a 7 or 8, the two highest points listed in the ‘quality’ metric. I will use this to plot against other features to get a sense of what constitutes a ‘high’ quality wine in some of the work I carry out below.

I therefore created the following dataset and called it ‘goodReds’:

##         X fixed.acidity volatile.acidity citric.acid residual.sugar
## 1542 1542           7.4             0.25        0.29            2.2
## 1545 1545           8.4             0.37        0.43            2.3
## 1550 1550           7.4             0.36        0.30            1.8
## 1556 1556           7.0             0.56        0.17            1.7
## 1585 1585           6.7             0.32        0.44            2.4
##      chlorides free.sulfur.dioxide total.sulfur.dioxide density   pH
## 1542     0.054                  19                   49 0.99666 3.40
## 1545     0.063                  12                   19 0.99550 3.17
## 1550     0.074                  17                   24 0.99419 3.24
## 1556     0.065                  15                   24 0.99514 3.44
## 1585     0.061                  24                   34 0.99484 3.29
##      sulphates alcohol quality
## 1542      0.76   10.90       7
## 1545      0.81   11.20       7
## 1550      0.70   11.40       8
## 1556      0.68   10.55       7
## 1585      0.80   11.60       7
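A subset like ‘goodReds’ can be built with `subset()`. The sketch below assumes a data frame named `df`, as in the correlation output later in the report, and fills it with only ten rows and four columns copied from the printed tables, so it stands in for the full data set.

```r
# Ten rows copied from the printed output: the first five wines and five
# of the high-quality wines. Only four columns are kept for brevity.
df <- data.frame(
  X       = c(1, 2, 3, 4, 5, 1542, 1545, 1550, 1556, 1585),
  pH      = c(3.51, 3.20, 3.26, 3.16, 3.51, 3.40, 3.17, 3.24, 3.44, 3.29),
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 10.90, 11.20, 11.40, 10.55, 11.60),
  quality = c(5, 5, 5, 6, 5, 7, 7, 8, 7, 7)
)

# Keep only the wines rated 7 or 8.
goodReds <- subset(df, quality >= 7)
tail(goodReds, 5)
```

Row names are preserved by `subset()`, which is why the printed ‘goodReds’ rows keep their original row numbers.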

Quality vs. pH

As I mention in the overview above, ‘good’ red wine usually has a pH of between 3.3 and 3.6. The majority of wines fall into this window, but not just the ‘better quality’ wines. This suggests that this is a common basic feature of wine, rather than an indicator of good red wine. I will try another layout to be sure.

I decided to subset the wines with pH levels greater than 3.3 and less than 3.6:

##         X fixed.acidity volatile.acidity citric.acid residual.sugar
## 1595 1595           6.2            0.600        0.08            2.0
## 1596 1596           5.9            0.550        0.10            2.2
## 1597 1597           6.3            0.510        0.13            2.3
## 1598 1598           5.9            0.645        0.12            2.0
## 1599 1599           6.0            0.310        0.47            3.6
##      chlorides free.sulfur.dioxide total.sulfur.dioxide density   pH
## 1595     0.090                  32                   44 0.99490 3.45
## 1596     0.062                  39                   51 0.99512 3.52
## 1597     0.076                  29                   40 0.99574 3.42
## 1598     0.075                  32                   44 0.99547 3.57
## 1599     0.067                  18                   42 0.99549 3.39
##      sulphates alcohol quality
## 1595      0.58    10.5       5
## 1596      0.76    11.2       6
## 1597      0.75    11.0       6
## 1598      0.71    10.2       5
## 1599      0.66    11.0       6
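The pH window can be applied with a compound condition and then cross-checked against quality. This sketch uses fifteen rows copied from the printed tables, five of which were already selected for high quality, so the proportions it shows are not representative of the full data. `pH_window` and `in_window` are hypothetical names.

```r
# pH and quality for fifteen wines copied from the printed output
# (rows 1-5, five 'goodReds' rows, and rows 1595-1599).
df <- data.frame(
  pH      = c(3.51, 3.20, 3.26, 3.16, 3.51,
              3.40, 3.17, 3.24, 3.44, 3.29,
              3.45, 3.52, 3.42, 3.57, 3.39),
  quality = c(5, 5, 5, 6, 5,
              7, 7, 8, 7, 7,
              5, 6, 6, 5, 6)
)

# Wines with pH strictly between 3.3 and 3.6.
pH_window <- subset(df, pH > 3.3 & pH < 3.6)
nrow(pH_window)

# Cross-tabulate: inside the pH window or not, against high quality or not.
in_window <- df$pH > 3.3 & df$pH < 3.6
table(in_window = in_window, high_quality = df$quality >= 7)
```

Nine of the fifteen sample wines fall inside the window, and only two of those nine are rated 7 or higher.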

Overlaying boxplot on scatterplot of pH and Quality:

I then produced further graphs to investigate this question, to see if wines which have a pH between 3.3 and 3.6 fall in the ‘high quality’ category (i.e. have a quality of 7 or higher).
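One way to overlay a boxplot on a scatterplot, sketched under the assumption that ggplot2 is available. The data are fifteen rows copied from the printed tables, and the jitter width, transparency and dashed reference lines at 3.3 and 3.6 are illustrative choices rather than the settings of the original graphs.

```r
# pH and quality for fifteen wines copied from the printed output.
df <- data.frame(
  pH      = c(3.51, 3.20, 3.26, 3.16, 3.51,
              3.40, 3.17, 3.24, 3.44, 3.29,
              3.45, 3.52, 3.42, 3.57, 3.39),
  quality = c(5, 5, 5, 6, 5,
              7, 7, 8, 7, 7,
              5, 6, 6, 5, 6)
)

# Median pH per quality score: the centre line each box will show.
median_pH <- tapply(df$pH, df$quality, median)
median_pH

# Draw the plot only if ggplot2 is installed, so the sketch still runs without it.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(df, aes(x = factor(quality), y = pH)) +
    geom_jitter(width = 0.2, alpha = 0.5) +                     # the scatterplot layer
    geom_boxplot(alpha = 0.4, outlier.shape = NA) +             # boxplot drawn on top
    geom_hline(yintercept = c(3.3, 3.6), linetype = "dashed") + # the 'ideal' pH window
    labs(x = "quality", y = "pH")
  print(p)
}
```

Treating quality as a factor gives one box per score, which makes it easy to see whether the boxes for 7 and 8 sit inside the dashed window any more than the others do.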

Conclusion:

It doesn’t appear to be the case that pH is a strong indicator, as the bulk of what is considered the best pH for red wine occurs at the mid-level of quality (5-6). As I said earlier, a pH between 3.3 and 3.6 seems to be the norm rather than pointing to the ‘exceptional’ wine.

Quality vs. alcohol

In this section, I wanted to know, how does alcohol level influence wine quality?

I discovered that there is a positive correlation between amount of alcohol and quality of wine - that the more alcohol, the better the quality of wine.

Correlation: I used the Pearson product-moment correlation, which measures the strength of the linear relationship between two variables, and found a correlation coefficient of 0.4762. This indicates a moderate positive association, not causality, and the lower count of wines in the higher quality range in this dataset makes it difficult to be more precise.

## 
##  Pearson's product-moment correlation
## 
## data:  df$quality and df$alcohol
## t = 21.639, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4373540 0.5132081
## sample estimates:
##       cor 
## 0.4761663
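Output of this form comes from `cor.test()`. The sketch below runs it on fifteen quality and alcohol pairs copied from the printed tables, so its numbers will differ from the full-data result above. The last line squares the reported full-data coefficient to show how much of the variation in quality a linear relationship with alcohol would account for.

```r
# Quality and alcohol for fifteen wines copied from the printed output.
quality <- c(5, 5, 5, 6, 5, 7, 7, 8, 7, 7, 5, 6, 6, 5, 6)
alcohol <- c(9.4, 9.8, 9.8, 9.8, 9.4,
             10.90, 11.20, 11.40, 10.55, 11.60,
             10.5, 11.2, 11.0, 10.2, 11.0)

# Pearson product-moment correlation with its significance test.
result <- cor.test(quality, alcohol, method = "pearson")
result$estimate  # correlation coefficient r for this small sample
result$conf.int  # 95 percent confidence interval for r
result$p.value   # p-value for the null hypothesis r = 0

# Share of variance explained by the full-data coefficient reported above.
r_full <- 0.4761663
r_full^2         # about 0.227
```

An r of 0.476 therefore corresponds to roughly 23 percent of the variance, which is why the relationship is best described as moderate.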

Quality vs. fixed.acidity

From my research, I know that fixed acidity influences quality, so I wanted to know whether higher amounts of fixed acidity are present in higher quality wines.

I plotted the original dataset (‘df’) values of quality vs. fixed.acidity to see if there is any trend in relation to the fixed acidity of a wine and its quality. It does appear that between bad and medium quality there is an upward trend, i.e. the better the wine, the higher the fixed acidity. There are not enough higher quality reds in the data set (quality >= 7) to confirm this seeming upward trend, but it looks likely:

Quality vs total.sulfur.dioxide

From my research, I discovered that “the use of sulfur dioxide (SO2) is widely accepted as a useful winemaking aide. It is used as a preservative because of its anti-oxidative and anti-microbial properties in wine.” Therefore, I would expect to find a correlation between presence of sulfur dioxide and quality wine.

However, there seems to be a downward trend between quality and total sulfur dioxide: the higher the quality, the lower the sulfur dioxide. That said, within lower quality wines, there is also lower sulfur dioxide:

I tried plotting the data a few different ways to gain further insight into what that data is telling us. From applying the cor.test function, we see that there is a correlation of -0.185 between quality and total sulfur dioxide:

## 
##  Pearson's product-moment correlation
## 
## data:  df$quality and df$total.sulfur.dioxide
## t = -7.5271, df = 1597, p-value = 8.622e-14
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2320162 -0.1373252
## sample estimates:
##        cor 
## -0.1851003

Fixed Acidity

Using the facet wrap function to plot the distribution of fixed acidity over different quality levels.

Observations: There is an abundance of wines of quality level 5 and 6 that have higher fixed acidity. The graph seems to indicate that higher quality means a higher presence of fixed.acidity, though quality levels 3, 4 and 8 show very low counts:

Quality vs. Volatile Acidity

I check this visual appraisal by applying the Pearson product moment correlation.

There is a small but meaningful correlation of -0.3906 between the volatile acidity and the quality of the wine.

A similar downward trend can be observed using the point graph: the higher the quality, the lower the volatile acidity.

## 
##  Pearson's product-moment correlation
## 
## data:  df$quality and df$volatile.acidity
## t = -16.954, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.4313210 -0.3482032
## sample estimates:
##        cor 
## -0.3905578

Quality vs. Residual Sugar

There seems to be a strong baseline amount of sugar common across all levels of quality: each will have between 2 and 4 g/dm^3.

#### (maybe a faceted scatterplot with box plots)

Multivariate Analysis and Plots section questions

Overview - Correlations

Here I am using corrplot to create a correlation plot from all the metrics, in order to get an overview of how the variables (besides quality) relate to each other. The correlation ranges from 1 (shown in shades of blue) to -1 (shown in shades of red).
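The matrix that corrplot draws can be computed with base R's `cor()`. This is a sketch on ten rows and four columns copied from the printed tables, so its coefficients will not match those of the full data set. The `method` and `order` arguments passed to corrplot are assumptions about how such a plot might be configured, not the settings actually used.

```r
# Ten rows copied from the printed output (rows 1-5 and five 'goodReds' rows).
wine_sample <- data.frame(
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 8.4, 7.4, 7.0, 6.7),
  citric.acid   = c(0.00, 0.00, 0.04, 0.56, 0.00, 0.29, 0.43, 0.30, 0.17, 0.44),
  density       = c(0.9978, 0.9968, 0.9970, 0.9980, 0.9978,
                    0.99666, 0.99550, 0.99419, 0.99514, 0.99484),
  pH            = c(3.51, 3.20, 3.26, 3.16, 3.51, 3.40, 3.17, 3.24, 3.44, 3.29)
)

# Pairwise Pearson correlations between all columns.
corr_matrix <- cor(wine_sample)
round(corr_matrix, 2)

# Draw the plot only if corrplot is installed; "hclust" orders the variables
# by hierarchical clustering so that related variables sit next to each other.
if (requireNamespace("corrplot", quietly = TRUE)) {
  corrplot::corrplot(corr_matrix, method = "color", order = "hclust")
}
```

The matrix is symmetric with ones on the diagonal, so only one triangle needs to be read.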

Observations - these observations will be used to inform how I go about plotting the multivariate plots

Total and free sulfur dioxide have a strong positive correlation according to the graph, but this is not surprising since ‘free’ are counted as part of ‘total’ sulfur dioxide presence.

Fixed acidity and density seem to have a strong positive correlation

Fixed acidity and pH have a strong negative correlation

Density and residual sugar have a positive correlation

Volatile acidity and citric acid have a negative correlation

Citric acid and pH have a negative correlation

I decided to graph some of the relationships shown in the correlation overview above

Graph 1: Fixed Acidity , Density and Quality

I wanted to know, what is the relationship between density, fixed acidity and quality?

We can see from the hierarchical clustering graph above that there are properties that have negative and positive correlations with each other. So I decided it would be a good idea to graph them together with quality, to glean some information.

I observe that wines with lower density and higher fixed acidity seem to be the better quality wines.

Graph 2: Fixed acidity, pH and Quality

In exploring the question of the relationship between fixed acidity, pH and quality, I discovered that the best quality wines (7 and 8) occur most often in the region of 3-4 pH.

Graph 3: Quality, Citric Acid and Fixed Acidity

I wanted to know how citric acid readings shed light on quality and fixed acidity, since it is an additive.

It appears that better quality wines have higher citric acid, from the graph below:

Graph 4: Fixed Acidity, Volatile Acidity and Citric Acid

I was interested in knowing the relationship between the acidities, since one is bad and two are good to have in wine, as I have observed earlier: high volatile acidity is bad, but the others are indicators of good wine.

Sure enough, the graph showed this: there is a correlation between higher citric acid and higher fixed acidity:

Multivariate Analysis

Question: Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

Answer: From the graphs in this section, I observed that good wines tend to be those with more than 0.30 citric acid and a fixed acidity of circa 5 and upwards.

I also observed that the amount of citric acid increases where the fixed acidity rises and volatile acidity gets lower.

Question: Were there any interesting or surprising interactions between features?

Answer: I was surprised to find that fixed acidity and citric acid have a strong correlation of 0.671, and that they both have a distinct positive correlation with quality.

************************************************************

Final Plots and Summary (x3 plots)

Plot and description 1: Alcohol vs Quality

This graph demonstrates that there is a moderate upward trend between quality of wine and alcohol content, i.e. the nicer the wine, the higher the alcohol content, peaking mostly at 14 percent, with a small number of outliers with higher alcohol content.

Plot and description 2: Quality, Citric Acid and Fixed Acidity

Quality wines seem to fall within a fixed acidity of between 8 and 12, as well as a citric acid level of between 0.3 and 0.6, as shown by this graph.

Plot and description 3: Hierarchical Clustering Overview of Relationships Plot

I chose to include this graph as I felt it is a very good representative graph, showing an overview of the correlations between all pairs of variables.

Reflection, including overview of findings and limitations of the EDA

The dataset contains 1,599 observations of 13 variables. I decided to measure and plot everything that I did against quality, or else plot correlations that were particularly strong, which I got an overview of by generating a hierarchical clustering graph.

THANK YOU FOR YOUR TIME *** END OF REPORT