Overview
The file I have chosen to work with is the Red Wine Quality file.
I want to answer the question, ‘what features constitute a “good” red wine?’ I know from the research I carried out that the following relevant features contribute to the taste of wine:
- Alcohol levels contribute to the flavour of the wine
- Acid is responsible for the fresh, crisp taste of wine. In fact, acids impart the sourness or tartness that is a fundamental feature in wine taste. Wines lacking in acid are “flat.” This corresponds to the ‘fixed acidity’ variable in the data set.
- Wine spoilage is legally defined by volatile acidity. The amount of volatile acidity found in sound grapes [ought to be] negligible. It is a byproduct of microbial metabolism.
- Acetic acid bacteria require oxygen to grow, therefore, elimination of any air in wine containers and sulfur dioxide addition will limit their growth.
These insights indicate to me that ‘good’ red wine will have lower volatile acidity, higher fixed acidity, and some sulfur dioxide present. I also believe that the questions I ask of the data will reveal the optimal range of some other features, e.g. alcohol percentage.
These insights informed the questions I thought of in my conversation with the data set.
**********************************************
I began by taking a quick look at the data itself, by printing out the first 5 rows:
## X fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 1 7.4 0.70 0.00 1.9 0.076
## 2 2 7.8 0.88 0.00 2.6 0.098
## 3 3 7.8 0.76 0.04 2.3 0.092
## 4 4 11.2 0.28 0.56 1.9 0.075
## 5 5 7.4 0.70 0.00 1.9 0.076
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol
## 1 11 34 0.9978 3.51 0.56 9.4
## 2 25 67 0.9968 3.20 0.68 9.8
## 3 15 54 0.9970 3.26 0.65 9.8
## 4 17 60 0.9980 3.16 0.58 9.8
## 5 11 34 0.9978 3.51 0.56 9.4
## quality
## 1 5
## 2 5
## 3 5
## 4 6
## 5 5
From a visual appraisal of the data, I can get a sense of the types of values that are present. This will inform the way I approach graphing the counts of the different pieces of data in the Univariate data section.
I then decided that I needed to know about the overall structure of the data, so I generated a quick look at aspects of the data:
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
## [1] 1599 13
## [1] "X" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality"
## [1] 0
## [1] 132
Observations - I made some observations on the question of ‘what does a snapshot of this data tell me?’
- The data set ‘wineQualityReds’ contains 13 variables, with 1599 observations.
- The data is all in numeric format, so no conversions need to occur (i.e. from string to numeric).
- I noted the upper and lower limits of each column to inform choices such as axis limits for outliers and bin widths; I will most likely need to set limits for the data in my graphs, and to use transformations such as the square root and log of the data, in order to get a better sense of the distributions.
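A minimal sketch of the inspection steps above, in base R. The data frame name `wine_sample` is hypothetical, it holds only a subset of columns copied from the five rows printed earlier, and the calls are an assumption about what produced the output rather than the original code.

```r
# Illustrative sample only: five rows and six columns copied from the
# printed output above (the full data set has 1599 rows and 13 columns).
wine_sample <- data.frame(
  X                = 1:5,
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4),
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70),
  pH               = c(3.51, 3.20, 3.26, 3.16, 3.51),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4),
  quality          = c(5L, 5L, 5L, 6L, 5L)
)

head(wine_sample, 5)   # first rows of the data
str(wine_sample)       # column types and a preview of values
summary(wine_sample)   # min, quartiles, mean and max per column
dim(wine_sample)       # number of rows and columns
names(wine_sample)     # column names

# Count missing values across the whole data frame.
n_missing <- sum(is.na(wine_sample))
n_missing
```

Running the same calls on the full data frame should give the structure and summary tables shown above.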
Note: I use different colours for each graph that follows, to visually separate out progress of the graphs.
Plotting
It was clear to me that the variable ‘quality’ was the most important variable, against which I would need to think about all other variables, in order to address my guiding question, ‘what makes a “good” red wine?’ I thought to myself, ‘what does the distribution of quality look like?’
I used qplot to look at the layout of ‘quality’, setting the binwidth to 0.5 first, but then changing it to 1: quality is recorded in integers, so a binwidth lower than 1 doesn’t make sense:

using ggplot syntax
As I say above, I wanted to know what the distribution of ‘quality’ looked like. Looking at the initial ‘Quality’ graph, it seems that all the wine falls within a certain range (3-8). So I set breaks to get a better idea of this:

Observations on Quality distribution:
From these graphs, I can surmise that quality falls between 3 and 8 in terms of points rankings - so, across 6 points.
The majority of wines have a quality of 5 or 6, or ‘medium quality’, if we group the scores into low (3-4), medium (5-6) and high (7-8).
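The counting and grouping described above can be sketched as follows. The ten quality scores are copied from the structure output earlier in the report; the band labels and the `quality_band` name are illustrative assumptions, and the plot is drawn only if ggplot2 happens to be installed.

```r
# First ten quality scores, copied from the str() output above.
quality <- c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)

# Count wines at every score from 3 to 8, keeping empty levels visible.
quality_counts <- table(factor(quality, levels = 3:8))
quality_counts

# Group scores into low (3-4), medium (5-6) and high (7-8).
quality_band <- cut(quality,
                    breaks = c(2, 4, 6, 8),
                    labels = c("low", "medium", "high"))
band_counts <- table(quality_band)
band_counts

# Optional plot: one bar per integer score, with a tick at each score.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(data.frame(quality = quality), aes(x = quality)) +
    geom_histogram(binwidth = 1, colour = "white") +
    scale_x_continuous(breaks = 3:8)
  print(p)
}
```

Using `factor(..., levels = 3:8)` keeps scores with no wines in the table, which makes the sparsity at the extremes easy to see.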
Fixed Acidity graph
I changed the binwidth to 0.1 to get a more accurate representation of the data distribution. I also try some different ‘step’ values for the break sequence, deciding that 1 is best in the end.
Fixed.acidity has a right-skewed, long-tailed distribution, with the highest counts at around 7.
From my research, I expect better quality wines to have medium to higher fixed acidity.

Volatile Acidity graph
I discovered earlier that the max is 1.58 and the min is 0.12 for volatile acidity
I decided to set the bin width to 0.01, as this is a narrow range.
The peak is between 0.25 and 0.75, where the count reaches between 350 and 400.
I expect better quality wines to have low volatile acidity, as it is associated with spoilage

Residual sugar graph
I wanted to answer the question, ‘what does the distribution of residual sugar look like?’
I set the bin width to 0.5, to get a clearer picture of the variation in counts.
I decided to log it in order to get an even clearer picture of the variations.
I notice that the residual sugar count is highest for lower amounts (min is 0.9, max is 15.5, as we learned earlier). It seems that circa two thirds of the wines have residual sugar of between around 0.8 and 1.6.
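A small sketch of the log transformation mentioned above. It assumes a base-10 log (the report does not say which base was used) and uses the ten residual sugar values shown in the structure output; the bin edges are illustrative.

```r
# First ten residual sugar values, copied from the str() output above.
residual_sugar <- c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2.0, 6.1)

# Assumption: base-10 log. It compresses the long right tail, so the
# large value (6.1) sits much closer to the bulk of the data.
log_sugar <- log10(residual_sugar)

# Bin the raw and the logged values without drawing anything.
raw_hist <- hist(residual_sugar, breaks = seq(0, 16, by = 0.5), plot = FALSE)
log_hist <- hist(log_sugar, breaks = seq(-0.1, 1.2, by = 0.1), plot = FALSE)

# Distance of the largest value from the median, in units of the IQR,
# before and after the transformation.
spread <- function(x) (max(x) - median(x)) / IQR(x)
c(raw = spread(residual_sugar), logged = spread(log_sugar))
```

In ggplot2 the same effect is usually obtained by adding `scale_x_log10()` to the histogram, which transforms the axis rather than the data.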


Citric Acid graph
I wanted to answer the question, ‘what does the distribution of citric acid look like?’
I set the bin width to 0.5, to get a clearer picture of the variation in counts. There is quite a variation in citric acid amounts, but the highest counts fall between 0 and 0.5.

Chlorides graph
I wanted to answer the question, ‘what does the distribution of chlorides look like?’
I set the bin width to 0.01, to get a clearer picture of the variation in counts.
This graph has a long tail, but the values are negligible as they are so low, so I cut off some of these outliers to generate a graph without any skewing factors:
Free Sulfur Dioxide graph
I wanted to answer the question, ‘what does the distribution of free sulfur dioxide look like?’
I set the bin width to 1 in order to get a clear picture.
There is a long tail on the right side of this graph; the data is over-dispersed
Total sulfur dioxide distribution was interesting to me because it is supposed to help guard against a wine getting spoiled, so I wanted to plot its layout:
I set the bin width to 1, to get a clearer picture of the variation in counts.
There is a wide variation in total sulfur dioxide among the data.
I decided to limit the x axis, as there were outliers that were skewing the picture and that I determined were not important.
This generated a much clearer picture of the distribution of the counts
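One way to choose an x-axis limit like the one described above is to take an upper quantile of the variable. This is a sketch under assumptions: the 95th percentile cutoff is my illustrative choice, not the limit used in the report, and the ten values come from the structure output.

```r
# First ten total sulfur dioxide values, copied from the str() output above.
total_so2 <- c(34, 67, 54, 60, 34, 40, 59, 21, 18, 102)

# Assumption: use the 95th percentile as the upper axis limit.
upper_limit <- quantile(total_so2, probs = 0.95, names = FALSE)

# Values that remain visible under that limit.
trimmed <- total_so2[total_so2 <= upper_limit]
trimmed

# Optional plot. coord_cartesian() zooms the view without discarding rows,
# whereas xlim() would drop out-of-range values before binning.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(data.frame(total_so2 = total_so2), aes(x = total_so2)) +
    geom_histogram(binwidth = 1) +
    coord_cartesian(xlim = c(0, upper_limit))
  print(p)
}
```

Zooming rather than filtering keeps the outliers in the data, so later summaries and correlations are still computed on all observations.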
Density graph
Next, I moved on to the question, ‘how is density distributed?’
I set the bin width to 0.001, to get a clearer picture of the variation in counts, as the values are spread over a small window (0.9901 to 1.0037).
This generated a much clearer picture of the distribution of the counts

pH graph
Since I researched it and found out about the ideal pH levels for red wine (3.3-3.6), I wanted to find out, ‘what does the distribution of pH look like?’
I set the bin width to 0.01, as I discovered through trial and error that this binwidth best reflected the distribution. We see that the pH of the wines varies only from 2.74 to 4.01, with many variations within this window. The most common band seems to be that between 3.2 and 3.7:

Sulphates graph
I wanted to answer the question, ‘what does the distribution of sulphates look like?’
I set the bin width to 0.1. This allowed me to observe that the distribution is long-tailed to the right:

Alcohol graph
I was curious to know, ‘what does the distribution of alcohol look like?’, considering it affects the taste of wine.
The layout of ‘alcohol’ demonstrates that the highest count of alcohol level is circa 9.5 - 10.5 percent; however, the distribution is not sharply spiked, i.e. there is a decent spread of values from 8.5 to 12.5 percent. This suggests to me that wine alcohol levels can vary and still maintain their quality. Perhaps alcohol level simply influences types of taste, rather than a certain amount being objectively ‘good’ or ‘bad.’

Univariate Plots Questions
Overview
Univariate Plots Section Questions
What is the structure of your dataset?
The data set ‘wineQualityReds’ contains 13 variables, with 1599 observations.
What is/are the main feature(s) of interest in your dataset?
In my opinion, from observing the distributions plotted above, the main feature is quality, against which all other features will be measured.
What other features in the dataset do you think will help support your investigation into your feature(s) of interest?
Other features which I think will be important are alcohol, acidity, sulfur dioxide and pH.
Did you create any new variables from existing variables in the dataset?
I subset the dataset and created an outline of the wines which have higher quality, calling it ‘goodReds’; I generated this in the following section
Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?
I checked for NAs; I performed log calculations on some of the x axis data, in order to determine a better outline of the distribution.
************************************************************
Bivariate Analysis
Reflection on questions
In my research, I looked into the features of what is considered to be ‘good’ wine. From this research into blogs, magazines and other online resources, I was able to direct my questions on the data. For example, I discovered that pH levels of wine can affect taste. I also discovered that a pH level of 3.3 to 3.6 is best for red wines. So, for example, one question I wanted to investigate in my EDA was to compare quality of wines and pH level, to see if the data backs up this theory.
Bivariate Analysis - Quality vs. other metrics
Introduction
I wanted to know, ‘what metrics do good red wines have in common?’
So, I decided to subset the data based on quality, isolating the rows that contain the best quality - that of a 7 or 8, the two highest points listed in the ‘quality’ metric. I will use this to plot against other features to get a sense of what constitutes a ‘high’ quality wine in some of the work I carry out below.
I therefore created the following dataset and called it ‘goodReds’:
## X fixed.acidity volatile.acidity citric.acid residual.sugar
## 1542 1542 7.4 0.25 0.29 2.2
## 1545 1545 8.4 0.37 0.43 2.3
## 1550 1550 7.4 0.36 0.30 1.8
## 1556 1556 7.0 0.56 0.17 1.7
## 1585 1585 6.7 0.32 0.44 2.4
## chlorides free.sulfur.dioxide total.sulfur.dioxide density pH
## 1542 0.054 19 49 0.99666 3.40
## 1545 0.063 12 19 0.99550 3.17
## 1550 0.074 17 24 0.99419 3.24
## 1556 0.065 15 24 0.99514 3.44
## 1585 0.061 24 34 0.99484 3.29
## sulphates alcohol quality
## 1542 0.76 10.90 7
## 1545 0.81 11.20 7
## 1550 0.70 11.40 8
## 1556 0.68 10.55 7
## 1585 0.80 11.60 7
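The subsetting step can be sketched as below. The `wine_sample` data frame is a hypothetical stand-in built from rows printed in this report (the first five rows plus the five ‘goodReds’ rows above), with only a few columns kept.

```r
# Illustrative sample: rows and values copied from output printed in
# this report; only four of the thirteen columns are kept.
wine_sample <- data.frame(
  X       = c(1, 2, 3, 4, 5, 1542, 1545, 1550, 1556, 1585),
  pH      = c(3.51, 3.20, 3.26, 3.16, 3.51, 3.40, 3.17, 3.24, 3.44, 3.29),
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 10.90, 11.20, 11.40, 10.55, 11.60),
  quality = c(5, 5, 5, 6, 5, 7, 7, 8, 7, 7)
)

# Keep only the wines with the two highest quality scores (7 or 8).
goodReds <- subset(wine_sample, quality >= 7)

# Show the last rows of the subset; row names keep the original index.
tail(goodReds, 5)
```

`subset()` preserves the original row names, which is why the printed subset above is indexed by values such as 1542 rather than starting from 1.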
Quality vs. pH
As I mention in the overview above, ‘good’ red wine usually has a pH of between 3.3 and 3.6. The majority of wines fall into this window, but not just the ‘better quality’ wines. This suggests that this is a common basic feature of wine, rather than an indicator of good red wine. I will try another layout to be sure.

I decided to subset the pH levels greater than 3.3 and less than 3.6
## X fixed.acidity volatile.acidity citric.acid residual.sugar
## 1595 1595 6.2 0.600 0.08 2.0
## 1596 1596 5.9 0.550 0.10 2.2
## 1597 1597 6.3 0.510 0.13 2.3
## 1598 1598 5.9 0.645 0.12 2.0
## 1599 1599 6.0 0.310 0.47 3.6
## chlorides free.sulfur.dioxide total.sulfur.dioxide density pH
## 1595 0.090 32 44 0.99490 3.45
## 1596 0.062 39 51 0.99512 3.52
## 1597 0.076 29 40 0.99574 3.42
## 1598 0.075 32 44 0.99547 3.57
## 1599 0.067 18 42 0.99549 3.39
## sulphates alcohol quality
## 1595 0.58 10.5 5
## 1596 0.76 11.2 6
## 1597 0.75 11.0 6
## 1598 0.71 10.2 5
## 1599 0.66 11.0 6
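A sketch of the pH subset and of one way to check whether wines inside the 3.3 to 3.6 window are more often high quality. The sample is assembled from rows printed in this report, so the counts are illustrative only; the names `ph_window` and `window_by_quality` are hypothetical.

```r
# Illustrative sample: fifteen rows copied from output printed in this
# report (first five rows, the 'goodReds' rows, and the rows just above).
wine_sample <- data.frame(
  X       = c(1:5, 1542, 1545, 1550, 1556, 1585, 1595:1599),
  pH      = c(3.51, 3.20, 3.26, 3.16, 3.51, 3.40, 3.17, 3.24, 3.44, 3.29,
              3.45, 3.52, 3.42, 3.57, 3.39),
  quality = c(5, 5, 5, 6, 5, 7, 7, 8, 7, 7, 5, 6, 6, 5, 6)
)

# Wines whose pH lies strictly between 3.3 and 3.6.
ph_window <- subset(wine_sample, pH > 3.3 & pH < 3.6)

# Cross-tabulate 'inside the pH window' against 'quality of 7 or more'.
in_window    <- wine_sample$pH > 3.3 & wine_sample$pH < 3.6
high_quality <- wine_sample$quality >= 7
window_by_quality <- table(in_window = in_window, high_quality = high_quality)
window_by_quality

# Share of high-quality wines inside and outside the window.
prop.table(window_by_quality, margin = 1)
```

If the window were a marker of good wine, the share of high-quality wines would be clearly larger in the `TRUE` row than in the `FALSE` row.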
Overlaying boxplot on scatterplot of pH and Quality:

I then produced further graphs to investigate this question, to see if wines which have a pH between 3.3 and 3.6 fall in the ‘high quality’ category (i.e. have a quality of 7 or higher).
Conclusion:
It doesn’t appear to be the case that pH is a strong indicator, as the bulk of the wines with what is considered the best pH for red wine occur at the mid-level of quality (5-6). As I said earlier, a pH between 3.3 and 3.6 seems to be the norm rather than pointing to the ‘exceptional’ wine.
Quality vs. alcohol
In this section, I wanted to know, how does alcohol level influence wine quality?
I discovered that there is a positive correlation between amount of alcohol and quality of wine: on average, the more alcohol, the better the quality of wine.
Correlation: I used the Pearson product-moment correlation, which measures the strength of the linear relationship between two variables, and found a correlation coefficient of 0.4762. This indicates a moderate positive association rather than causality, and the lower count of wines in the higher quality range in this dataset makes it difficult to be more precise.

##
## Pearson's product-moment correlation
##
## data: df$quality and df$alcohol
## t = 21.639, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4373540 0.5132081
## sample estimates:
## cor
## 0.4761663
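The t statistic in this output follows directly from the correlation and the degrees of freedom, via t = r * sqrt(df) / sqrt(1 - r^2). The sketch below recomputes it from the reported numbers and then checks the same identity against `cor.test()` on a ten-row sample taken from the structure output; the variable names are illustrative.

```r
# Values reported in the output above.
r        <- 0.4761663
df_resid <- 1597            # n - 2, with n = 1599 observations

# t statistic and two-sided p-value implied by r and the degrees of freedom.
t_stat  <- r * sqrt(df_resid) / sqrt(1 - r^2)
p_value <- 2 * pt(-abs(t_stat), df = df_resid)
t_stat

# Same identity on a small sample (first ten rows from the str() output).
alcohol <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
quality <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
ct  <- cor.test(quality, alcohol, method = "pearson")
r_s <- unname(ct$estimate)
t_s <- r_s * sqrt(length(quality) - 2) / sqrt(1 - r_s^2)
```

With 1597 degrees of freedom even a moderate correlation gives a very large t value, which is why the p-value is reported only as being below 2.2e-16.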
Quality vs. fixed.acidity
From my research, I know that fixed acidity influences quality, so I wanted to know if there were higher amounts of fixed acidity present in higher quality wines.
I plotted the original dataset (‘df’) values of quality vs. fixed.acidity to see if there is any trend in relation to the fixed acidity of a wine and its quality. It does appear that between bad and medium quality there is an upward trend, i.e. the better the wine, the higher the fixed acidity. However, there are not enough higher quality reds in the data set (quality >= 7) to confirm this apparent upward trend, although it looks likely:

Quality vs total.sulfur.dioxide
From my research, I discovered that “the use of sulfur dioxide (SO2) is widely accepted as a useful winemaking aide. It is used as a preservative because of its anti-oxidative and anti-microbial properties in wine.” Therefore, I would expect to find a correlation between presence of sulfur dioxide and quality wine.
However, there seems to be a downward trend between quality and total sulfur dioxide - the higher the quality, the lower the sulfur dioxide. That said, within lower quality wines, there is also lower sulfur dioxide:

I tried plotting the data a few different ways to gain further insight into what that data is telling us. From applying the cor.test function, we see that there is a correlation of -0.185 between quality and total sulfur dioxide:
##
## Pearson's product-moment correlation
##
## data: df$quality and df$total.sulfur.dioxide
## t = -7.5271, df = 1597, p-value = 8.622e-14
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2320162 -0.1373252
## sample estimates:
## cor
## -0.1851003
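The 95 percent confidence interval in this output can be reproduced from the correlation and the sample size using the Fisher z transformation, which is the method `cor.test()` uses for Pearson correlations. A short sketch with the reported values:

```r
# Values reported in the output above.
r <- -0.1851003
n <- 1599                  # df + 2

# Fisher z transformation: atanh(r) is approximately normal with
# standard error 1 / sqrt(n - 3).
z  <- atanh(r)
se <- 1 / sqrt(n - 3)

# Build the interval on the z scale, then map it back with tanh().
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se)
ci
```

Because the whole interval lies below zero, the negative association is statistically clear, even though a correlation of about -0.19 is weak in practical terms.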
Fixed Acidity
Using the facet wrap function to plot the distribution of fixed acidity over different quality levels.
Observations: There is an abundance of wines of quality level 5 and 6 that have higher fixed acidity. The graph seems to indicate that higher quality means a higher presence of fixed.acidity, though quality levels 3, 4, and 8 have very little fixed acidity (these levels also contain far fewer wines):

Quality vs. Volatile Acidity
There appears to be a correlation between lower volatile acidity and better quality wine (7-8 points). In this graph, there is a downward volatile acidity pattern related to quality: the high quality wines (7-8) seem to have lower overall volatile acidity than the rest of the quality values.
I checked this visual appraisal by applying the Pearson product-moment correlation.
There is a moderate negative correlation of -0.3906 between the volatile acidity and the quality of the wine.
A similar downward trend can be observed using the point graph: the higher the quality, the lower the volatile acidity.

##
## Pearson's product-moment correlation
##
## data: df$quality and df$volatile.acidity
## t = -16.954, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.4313210 -0.3482032
## sample estimates:
## cor
## -0.3905578
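Alongside the correlation, a per-quality summary makes the downward pattern easier to read. This is a sketch on the ten rows shown in the structure output, so the numbers are illustrative and will differ from those of the full data set.

```r
# First ten rows, copied from the str() output earlier in the report.
volatile_acidity <- c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66, 0.60, 0.65, 0.58, 0.50)
quality          <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# Mean and median volatile acidity at each quality score.
mean_by_quality   <- tapply(volatile_acidity, quality, mean)
median_by_quality <- tapply(volatile_acidity, quality, median)

# Number of wines behind each summary, to flag thinly populated scores.
n_by_quality <- table(quality)

rbind(mean = mean_by_quality, median = median_by_quality, n = n_by_quality)
```

Reporting the group sizes next to the means is a reminder that summaries for sparsely populated scores rest on very few wines.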
Quality vs Residual Sugar
There seems to be a strong baseline amount of sugar common across all levels of quality: each will have between 2 and 4 g/dm^3.

Multivariate Analysis and Plots section questions
Overview - Correlations
Here I am using corrplot to create a correlation plot from all the metrics, in order to get an overview of how the variables (besides quality) relate to each other in terms of correlation. The correlation ranges from 1 (shown in blue shades) to -1 (shown in red shades).

Observations - these observations will be used to inform how I go about plotting the multivariate plots
Total and free sulfur dioxide have a strong positive correlation according to the graph, but this is not surprising since ‘free’ sulfur dioxide is counted as part of ‘total’ sulfur dioxide.
Fixed acidity and density seem to have a strong positive correlation.
Fixed acidity and pH have a strong negative correlation.
Density and residual sugar have a positive correlation.
Volatile acidity and citric acid have a negative correlation.
Citric acid and pH have a negative correlation.
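The correlation overview can be sketched as follows. The `wine_sample` data frame is hypothetical and holds only four columns of ten rows copied from the structure output, so the coefficients will not match those of the full data set; the `corrplot` options shown are an assumption about how the plot was configured, and the plot is skipped if the package is not installed.

```r
# Illustrative sample: ten rows of four columns from the str() output.
wine_sample <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66, 0.60, 0.65, 0.58, 0.50),
  citric.acid      = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  pH               = c(3.51, 3.20, 3.26, 3.16, 3.51, 3.51, 3.30, 3.39, 3.36, 3.35)
)

# Pairwise Pearson correlations between all columns.
cor_matrix <- cor(wine_sample, method = "pearson")
round(cor_matrix, 2)

# Optional plot; order = "hclust" groups variables by hierarchical clustering.
if (requireNamespace("corrplot", quietly = TRUE)) {
  corrplot::corrplot(cor_matrix, method = "circle", order = "hclust")
}
```

Reading the matrix numerically alongside the plot helps when two cells have similar colours but noticeably different coefficients.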
I decided to graph some of the relationships shown in the correlation overview above.
Graph 1: Fixed Acidity, Density and Quality
I wanted to know, what is the relationship between density, fixed acidity and quality?
Graph 2: Fixed acidity, pH and Quality
In exploring the question of the relationship between fixed acidity, pH and quality, I discovered that the best quality wines (7 and 8) most often occur in the region of 3-4 pH.

Graph 3: Quality, Citric Acid and Fixed Acidity
I wanted to know how citric acid readings shed light on quality and fixed acidity, since it is an additive.
It appears that better quality wines have higher citric acid, from the graph below:

Graph 4: Fixed Acidity, Volatile Acidity and Citric Acid
I was interested in knowing the relationship between the acidities, since one is bad and two are good to have in wine, as I have observed earlier: high volatile acidity values are bad, but the others are indicators of good wine.
Sure enough, the graph showed this: there is a correlation between higher citric acid and higher fixed acidity:

Multivariate Analysis
Question: Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?
Answer: From the graphs in this section, I observed that good wines tend to have more than 0.30 citric acid and circa 5 and upwards of fixed acidity.
I also observed that the amount of citric acid increases where the fixed acidity rises and volatile acidity gets lower.
Question: Were there any interesting or surprising interactions between features?
Answer: I was surprised to find that fixed acidity and citric acid have a strong correlation of 0.671, and that they both have distinct positive correlations with quality.
************************************************************
Final Plots and Summary (x3 plots)
Plot and description 1: Alcohol vs Quality
This graph demonstrates that there is a moderate upward trend between quality of wine and alcohol content, i.e. the nicer the wine, the higher the alcohol content, peaking mostly at 14 percent, with a small number of outliers with higher alcohol content.

Plot and description 2: Quality, Citric Acid and Fixed Acidity
Quality wines seem to fall within a fixed acidity of between 8 and 12, as well as a citric acid level of between 0.3 and 0.6, as shown by this graph.

Plot and description 3: Hierarchical Clustering Overview of Relationships Plot
Reflection, including overview of findings and limitations of the EDA
The dataset contains 1,599 observations of 13 variables. I decided to measure and plot everything that I did against quality, or else plot correlations that were particularly strong, which I got an overview of by generating a hierarchical clustering graph.
While the data yielded some interesting insights, the distribution of wines according to quality means that the majority of trends were more difficult to discern, as there were many more wines of quality 5 and 6 than there were of 7 and 8. I would have preferred to work with data where there was a more even spread of quality, to better understand what makes a wine good.
THANK YOU FOR YOUR TIME *** END OF REPORT