P4 Explore and Summarize Data: Red Wine EDA

by Jason Medina

for Udacity Data Analyst Nanodegree (DAND)

Introduction

This assignment explores univariate, bivariate, and multivariate relationships between variables using data analysis techniques in R. The observations and variables cover red wine samples only. The dataset is here, and this information document contains helpful descriptions and domain knowledge for the dataset.

Other helpful links for this work:
Wiki article on acids in wine
Types of variables
Outlier function from r-bloggers
Analysis on both White and Red Wine
GGcorr documentation
Legends GGplot cookbook
Wiki to reversal paradox
Package for reversal paradox
Diamonds example project
Git sample project 1
Git sample project 2

EDA with R for red wine data

There are 1599 observations and 13 variables in this data set.

## 'data.frame':  1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
X, the first variable, looks like a unique identifier.
Note that no observations have a quality greater than 8. This narrow spread of quality scores is not favorable for analysis.
Quality is an ordinal categorical variable. A new variable is created from quality for subsequent analysis. Read this for more on types of variables. From the information document: “The inputs include objective tests (e.g. PH values) and the output is based on sensory data (median of at least 3 evaluations made by wine experts). Each expert graded the wine quality between 0 (very bad) and 10 (very excellent)”.

Except for total.sulfur.dioxide and free.sulfur.dioxide, all other numeric variables are continuous.
Note total.sulfur.dioxide is the sum of free.sulfur.dioxide and bound forms; hence the two sulfur variables are related.

volatile.acidity is acetic acid, which is different from tartaric acid (fixed.acidity) and from citric.acid. Acetic acid gives wine a vinegar-like taste, while fixed acids do not evaporate easily. Citric acid is added to some wines for freshness or to increase acidity.

Univariate Plots Section

Outliers skew the mean. Running a version of this function identifies and plots outliers. The output shows how far the mean shifts once outliers are removed, most visibly where the proportion of outliers is high; see chlorides, residual sugar and sulphates for examples where the distribution improves after removing outliers.

## chlorides  outlier stats 
## 112 Outliers identified 
## 7.5 (%) propotion of outliers 
## 0.2 mean of the outliers 
## 0.09 mean without removing outliers 
## 0.08  mean with removeing outliers

## residual.sugar  outlier stats 
## 155 Outliers identified 
## 10.7 (%) propotion of outliers 
## 5.88 mean of the outliers 
## 2.54 mean without removing outliers 
## 2.18  mean with removeing outliers

## sulphates  outlier stats 
## 59 Outliers identified 
## 3.8 (%) propotion of outliers 
## 1.23 mean of the outliers 
## 0.66 mean without removing outliers 
## 0.64  mean with removeing outliers

## total.sulfur.dioxide  outlier stats 
## 55 Outliers identified 
## 3.6 (%) propotion of outliers 
## 143.89 mean of the outliers 
## 46.47 mean without removing outliers 
## 43  mean with removeing outliers

## fixed.acidity  outlier stats 
## 49 Outliers identified 
## 3.2 (%) propotion of outliers 
## 13.29 mean of the outliers 
## 8.32 mean without removing outliers 
## 8.16  mean with removeing outliers

The distributions below are closer to normal and less affected by outliers.

## citric.acid  outlier stats 
## 1 Outliers identified 
## 0.1 (%) propotion of outliers 
## 1 mean of the outliers 
## 0.27 mean without removing outliers 
## 0.27  mean with removeing outliers

## pH  outlier stats 
## 35 Outliers identified 
## 2.2 (%) propotion of outliers 
## 3.42 mean of the outliers 
## 3.31 mean without removing outliers 
## 3.31  mean with removeing outliers

## density  outlier stats 
## 45 Outliers identified 
## 2.9 (%) propotion of outliers 
## 1 mean of the outliers 
## 1 mean without removing outliers 
## 1  mean with removeing outliers

## volatile.acidity  outlier stats 
## 19 Outliers identified 
## 1.2 (%) propotion of outliers 
## 1.13 mean of the outliers 
## 0.53 mean without removing outliers 
## 0.52  mean with removeing outliers
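
The r-bloggers function itself is not reproduced in this report. As a minimal sketch of the same idea, assuming the usual boxplot rule (points more than 1.5 times the IQR beyond the hinges) and a hypothetical helper named `outlier_summary`, the statistics could be computed on a toy vector as follows. The percentages printed above appear to be taken relative to the observations left after removal (for chlorides, 112 / (1599 - 112) is about 7.5%), so the sketch follows that convention.

```r
# Hypothetical helper (not the r-bloggers function): summarise the outliers
# flagged by the 1.5 * IQR boxplot rule.
outlier_summary <- function(x) {
  x    <- x[!is.na(x)]
  out  <- boxplot.stats(x)$out      # values beyond the whiskers
  kept <- x[!(x %in% out)]          # data with the outliers removed
  list(
    n_outliers  = length(out),
    pct_of_kept = round(100 * length(out) / length(kept), 1),
    mean_out    = mean(out),        # mean of the outliers
    mean_all    = mean(x),          # mean without removing outliers
    mean_kept   = mean(kept)        # mean after removing outliers
  )
}

# Toy data, not the wine data: nine ordinary values and one extreme value
toy <- c(2.0, 2.1, 2.2, 2.3, 2.4, 2.5, 2.6, 2.7, 2.8, 15.5)
res <- outlier_summary(toy)
print(res)
```

On the wine data the call would be something like `outlier_summary(rw$chlorides)`, assuming the data frame is named `rw` as in the code later in this report.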

Univariate Analysis

This summary excludes the unique identifier variable X:

##  fixed.acidity   volatile.acidity  citric.acid    residual.sugar  
##  Min.   : 4.60   Min.   :0.1200   Min.   :0.000   Min.   : 0.900  
##  1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090   1st Qu.: 1.900  
##  Median : 7.90   Median :0.5200   Median :0.260   Median : 2.200  
##  Mean   : 8.32   Mean   :0.5278   Mean   :0.271   Mean   : 2.539  
##  3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420   3rd Qu.: 2.600  
##  Max.   :15.90   Max.   :1.5800   Max.   :1.000   Max.   :15.500  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.01200   Min.   : 1.00       Min.   :  6.00      
##  1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 22.00      
##  Median :0.07900   Median :14.00       Median : 38.00      
##  Mean   :0.08747   Mean   :15.87       Mean   : 46.47      
##  3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 62.00      
##  Max.   :0.61100   Max.   :72.00       Max.   :289.00      
##     density             pH          sulphates         alcohol     
##  Min.   :0.9901   Min.   :2.740   Min.   :0.3300   Min.   : 8.40  
##  1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50  
##  Median :0.9968   Median :3.310   Median :0.6200   Median :10.20  
##  Mean   :0.9967   Mean   :3.311   Mean   :0.6581   Mean   :10.42  
##  3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10  
##  Max.   :1.0037   Max.   :4.010   Max.   :2.0000   Max.   :14.90  
##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.636  
##  3rd Qu.:6.000  
##  Max.   :8.000

The quality variable ranges from 3 to 8, with a median of 6 and a mean of 5.6. Recall quality is on a 0 to 10 scale; hence both end points, zero and ten, are missing, as are one, two and nine, as seen in the histogram.

The citric acid minimum is 0. There are 132 zero values, and it is the only variable with any zero values.

Residual sugar, chlorides and the sulfur variables appear to have outliers.

Fixed acidity, residual sugar and alcohol have similar maximums (15.9, 15.5 and 14.9) but different means and medians.

Density and pH appear to have normal distributions.

Quality variable

The feature of main interest is quality, a numerically coded, ordinal categorical variable. Observations fall into three groups (bad, average or good) with respect to this variable. Because the scale is ordinal, it is not possible to place a value on the gap between scores, nor to say an average wine is three times better than a bad wine.

Citric acid as a supplement: some wines have no citric acid

There are 132 zero values in the citric acid variable, which account for 8.3 percent of observations. Citric acid is an inexpensive way to boost total acidity in wine. This wiki link says the use of citric acid for acidification is prohibited in the EU, though limited use is permitted for removing excess iron and copper from wine if potassium ferrocyanide is unavailable.
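
The count and share of zeros can be checked with a logical comparison. The sketch below uses a short invented vector in place of `rw$citric.acid`, then repeats the arithmetic with the two figures reported in this section (132 zeros out of 1599 observations).

```r
# Toy citric acid values (invented); the real column would be rw$citric.acid
citric.acid <- c(0, 0, 0.04, 0.56, 0, 0.06, 0.36, 0.02)

n_zero   <- sum(citric.acid == 0)                    # count of exact zeros
pct_zero <- round(100 * mean(citric.acid == 0), 1)   # share of zeros, percent

# The same arithmetic with the figures reported for the wine data
pct_reported <- round(100 * 132 / 1599, 1)

print(c(n_zero = n_zero, pct_zero = pct_zero, pct_reported = pct_reported))
```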

Template questions

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

There may be an association between residual.sugar and quality. It is unclear how the multiple acidity variables relate to each other and to quality. citric.acid is the only variable with observations equal to zero. I expect the strongest positive correlation with quality to be alcohol, and the strongest negative correlation to be volatile acidity.

Did you create any new variables from existing variables in the dataset?

Yes. I changed quality to an ordered factor, then created a new variable called rating to bucket wines as bad, average or good based on quality. In addition, the ttl.acidity variable sums fixed acidity, volatile acidity, and citric acid, since these are distinct acids. Volatile acidity (acetic acid) at high levels can lead to an unpleasant vinegar taste. I therefore expect at least a moderate negative correlation between volatile acidity and quality. On the other hand, citric acid in small quantities can add ‘freshness’ and flavor to wines, while fixed acids do not evaporate readily per the information document.

# change quality to ordered factor
rw$quality <- factor(rw$quality, ordered = TRUE)

# create total acidity variable
rw$ttl.acidity <- rw$citric.acid + rw$fixed.acidity + rw$volatile.acidity

# create rating variable wrt quality 
rw$rating <- ifelse(rw$quality < 5, 'bad', ifelse(
  rw$quality < 7, 'average', 'good'))
rw$rating <- ordered(rw$rating,
                     levels = c('bad', 'average', 'good'))
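
As a cross-check on the cut points, here is a minimal sketch that uses base R's `cut()` instead of the nested `ifelse()`. It runs on a toy vector of numeric scores rather than on `rw$quality`, so it would apply before quality is converted to a factor.

```r
# Toy quality scores covering the observed range 3 to 8 (not the wine data)
quality <- 3:8

# Same buckets as the ifelse() version: below 5 is bad, 5 to 6 is average,
# 7 and above is good. right = FALSE makes the intervals [0,5), [5,7), [7,Inf)
rating <- cut(quality,
              breaks = c(0, 5, 7, Inf),
              labels = c("bad", "average", "good"),
              right = FALSE,
              ordered_result = TRUE)

print(data.frame(quality, rating))
```

Scores 3 and 4 land in bad, 5 and 6 in average, and 7 and 8 in good.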

The summary for the rating variable is below. There are not many bad or good wines in the dataset, which limits model learning and training.

##     bad average    good 
##       4      82      14

The table shows the percent distribution.
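
These percentages follow from the group counts reported in the bivariate summaries later in this report (63 bad, 1319 average, 217 good). A small sketch, with the counts typed in by hand rather than computed from `rw`:

```r
# Counts per rating as reported in this analysis (the n column of the summaries)
counts <- c(bad = 63, average = 1319, good = 217)

# Share of each rating as a whole-number percentage
pct <- round(100 * prop.table(counts))
print(pct)
```

With the full data loaded, `round(100 * prop.table(table(rw$rating)))` should give the same figures.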

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

No tidying adjustments were made to the data. The narrow spread in the distribution of quality stands out. Other variables have long tails; see the sections above for additional commentary. Subsequent boxplots show outliers, while the rating feature helps classify observations in further analysis.

Bivariate Plots Section

(Recall: a correlation of 0.6 to 0.79 is strong, 0.4 to 0.59 is moderate, and less than 0.2 is very weak.)
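
To apply these labels consistently, the bands can be turned into a small lookup. This is only a sketch: `corr_strength` is a hypothetical helper, and the 0.2 to 0.39 "weak" and 0.8 and above "very strong" bands are assumptions that fill in the ranges the note leaves out.

```r
# Label the strength of a correlation from its absolute value.
# Assumed bands: [0, 0.2) very weak, [0.2, 0.4) weak, [0.4, 0.6) moderate,
# [0.6, 0.8) strong, [0.8, 1] very strong.
corr_strength <- function(r) {
  cut(abs(r),
      breaks = c(0, 0.2, 0.4, 0.6, 0.8, 1),
      labels = c("very weak", "weak", "moderate", "strong", "very strong"),
      right = FALSE,
      include.lowest = TRUE)
}

# Coefficients quoted in this report
r <- c(alcohol_quality = 0.48, citric_fixed = 0.67,
       citric_quality = 0.23, fixed_quality = 0.12)
print(data.frame(r, strength = corr_strength(r)))
```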

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Two variables, alcohol and volatile acidity, have roughly moderate correlations with quality. There is a moderate positive correlation of 0.48 between alcohol and quality. From the plot, quality increases at a moderate rate with higher alcohol.

```
## # A tibble: 3 x 4
##    rating alcohol_mean alcohol_median     n
##     <ord>        <dbl>          <dbl> <int>
## 1     bad     10.21587           10.0    63
## 2 average     10.25272           10.0  1319
## 3    good     11.51805           11.6   217
```
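
The call that produced this summary is not shown in the report. A minimal base R sketch of a grouped mean, median and count, run on toy data (the values are invented; only the column names follow the dataset), could look like this:

```r
# Toy stand-in for rw: invented values, real column names
toy <- data.frame(
  rating  = factor(c("bad", "bad", "average", "average", "good", "good"),
                   levels = c("bad", "average", "good"), ordered = TRUE),
  alcohol = c(9.5, 10.5, 10.0, 10.4, 11.2, 12.0)
)

# Mean, median and count of alcohol within each rating.
# aggregate() stores the three statistics as a matrix column named alcohol.
by_rating <- aggregate(alcohol ~ rating, data = toy,
                       FUN = function(x) c(mean   = mean(x),
                                           median = median(x),
                                           n      = length(x)))
print(by_rating)
```

The same pattern applies to the other variables summarised by rating in this section, by swapping the column on the left of the formula.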

Volatile acidity and quality have a negative correlation of -0.39, at the edge of the moderate range, which implies red wine quality tends to decrease as volatile acidity increases.

## # A tibble: 3 x 4
##    rating volatile.acidity_mean volatile.acidity_median     n
##     <ord>                 <dbl>                   <dbl> <int>
## 1     bad             0.7242063                    0.68    63
## 2 average             0.5385595                    0.54  1319
## 3    good             0.4055300                    0.37   217

This plot excludes the average wines so that alcohol and volatile acidity fall into two differently colored clusters. The trend lines make it easy to see the relationship between alcohol and volatile acidity by rating.

There exist weak positive correlations for both 1) quality and sulphates and 2) quality and citric acid. Quality trends in the same direction as both sulphates and citric acid at a weak rate.

## # A tibble: 3 x 4
##    rating sulphates_mean sulphates_median     n
##     <ord>          <dbl>            <dbl> <int>
## 1     bad      0.5922222             0.56    63
## 2 average      0.6472631             0.61  1319
## 3    good      0.7434562             0.74   217

## # A tibble: 3 x 4
##    rating citric.acid_mean citric.acid_median     n
##     <ord>            <dbl>              <dbl> <int>
## 1     bad        0.1736508               0.08    63
## 2 average        0.2582638               0.24  1319
## 3    good        0.3764977               0.40   217

Citric acid and fixed acidity have a strong positive correlation of 0.67; citric acid has a weak positive correlation of 0.23 with quality while fixed acidity has a very weak positive correlation of 0.12.

## # A tibble: 3 x 4
##    rating fixed.acidity_mean fixed.acidity_median     n
##     <ord>              <dbl>                <dbl> <int>
## 1     bad           7.871429                  7.5    63
## 2 average           8.254284                  7.8  1319
## 3    good           8.847005                  8.7   217

Volatile acidity and citric acid have a moderate negative correlation; in the plot, the volatile acidity values on the y-axis are scaled with the square root function.

## # A tibble: 3 x 4
##    rating volatile.acidity_mean volatile.acidity_median     n
##     <ord>                 <dbl>                   <dbl> <int>
## 1     bad             0.7242063                    0.68    63
## 2 average             0.5385595                    0.54  1319
## 3    good             0.4055300                    0.37   217

Other observations and notable correlations

The ttl.acidity variable is the sum of citric, volatile and fixed acidity. Ignore any correlation coefficients between ttl.acidity and these three variables.
Volatile acidity and citric acid have a moderate negative correlation; volatile acidity has a moderate negative correlation with quality. Citric acid has a weak positive correlation with quality. Likewise, citric acid and fixed acidity also share a strong correlation.

Citric acid and sulphates have weak positive correlations with quality.
Residual sugar correlates with only one variable, density.

Density and citric acid each correlate with five variables.

As expected, fixed.acidity and pH have a strong negative correlation; likewise, total.sulfur.dioxide and free.sulfur.dioxide have a strong positive correlation. A list of notable correlations is below.
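
A list like this can be pulled from a correlation matrix by keeping each pair once and filtering on the absolute coefficient. The sketch below is an assumption about how that might be done, not the code used for this report; it runs on simulated data that borrows three column names from the dataset, and it uses the 0.3 threshold mentioned for the final correlation plot.

```r
# Simulated toy data with the dataset's column names (not the wine data)
set.seed(1)
n <- 200
fixed.acidity <- rnorm(n, mean = 8.3, sd = 1.7)
toy <- data.frame(
  fixed.acidity = fixed.acidity,
  pH            = 3.9 - 0.07 * fixed.acidity + rnorm(n, sd = 0.1),
  alcohol       = rnorm(n, mean = 10.4, sd = 1)
)

# Pearson correlation matrix
cm <- cor(toy)

# Keep each pair once (upper triangle) and list those with |r| > 0.3
idx <- which(upper.tri(cm) & abs(cm) > 0.3, arr.ind = TRUE)
notable <- data.frame(
  var1 = rownames(cm)[idx[, "row"]],
  var2 = colnames(cm)[idx[, "col"]],
  r    = round(cm[idx], 2)
)
print(notable)
```

On the wine data, `cor()` would need the numeric columns only, for example after dropping X and the factor versions of quality and rating.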

Template questions

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

Yes. The Simpsons package clusters data into subsets to test whether the regression at the level of the group runs in the opposite direction to the regression at the level of the clusters. This package helps identify instances of Simpson's paradox.

Executing the Simpsons function on citric acid and fixed acidity detects several clusters to regress upon. Only two clusters correlate in the same direction as the group. The trend seen within the subgroups reverses or disappears when the subgroups are combined.

Opposite Trend Lines for subgroup and overall group

This is also known as the reversal or amalgamation paradox. See wiki page here for more examples. For the correlation with density and fixed acidity, the simpsons function identifies three clusters, two of which show no evidence for Simpson’s paradox.
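
To make the reversal concrete, here is a simulated illustration in base R. It does not use the Simpsons package or the wine data; the helper `make_cluster` and all numbers are invented so that the slope within each cluster is negative while the pooled slope is positive.

```r
# Simulated illustration of Simpson's paradox: within each cluster y falls
# as x rises, but the clusters are offset so that the pooled trend rises.
set.seed(42)
make_cluster <- function(id, x_center, y_center, n = 50) {
  x <- rnorm(n, mean = x_center, sd = 0.5)
  y <- y_center - 1.0 * (x - x_center) + rnorm(n, sd = 0.2)
  data.frame(cluster = id, x = x, y = y)
}
toy <- rbind(make_cluster("a", 2, 2),
             make_cluster("b", 5, 5),
             make_cluster("c", 8, 8))

# Slope of y on x for the pooled data
pooled_slope <- coef(lm(y ~ x, data = toy))[["x"]]

# Slope of y on x within each cluster
cluster_slopes <- sapply(split(toy, toy$cluster),
                         function(d) coef(lm(y ~ x, data = d))[["x"]])

print(pooled_slope)
print(cluster_slopes)
```

Comparing the sign of `pooled_slope` with the signs in `cluster_slopes` is the same check, in miniature, that the text describes for the clustered regressions.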

What was the strongest relationship you found?

Citric acid and fixed acidity have a strong positive correlation of 0.67, while citric acid and volatile acidity have a moderate negative correlation of -0.55.

Multivariate Plots Section

There is a moderate negative correlation of -0.5 between alcohol and density: higher alcohol goes with lower density. This makes sense since alcohol is less dense than water. The boxplot confirms wines rated good have higher alcohol content.

pH measures acidity. This multivariate plot shows that the negative correlation between pH and fixed acidity has little effect on rating.

Multivariate Analysis

Alcohol, sulphates and citric acid have the largest positive correlations with quality. Higher quality wines tend to be higher in alcohol, citric acid and sulphates. On the other hand, higher quality wines tend to be lower in volatile acidity.

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

The Simpsons package was applied to different variable pairs to identify lurking variables that affect the overall correlation coefficients. Residual sugar and pH had no meaningful correlation; this suggests that wine quality is largely about acidic profiles.

Were there any interesting or surprising interactions between features?

There is no meaningful correlation between residual sugar and alcohol; that and the positive correlation between pH and volatile acidity were unexpected.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

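pH measures acidity on a log scale, so stronger correlations between the acidity variables and pH make sense. A linear model can quantify how much of the pH variance the three acidity variables explain. The r-squared value shows that nearly 50% of the pH variance is explained by the acid variables, meaning other variable(s) contribute to the remaining variance.

library(ggplot2)

# fit pH on the log10 of the three acid variables;
# rows with citric.acid == 0 are excluded because log10(0) is -Inf
m <- lm(pH ~
            I(log10(fixed.acidity)) +
            I(log10(volatile.acidity)) +
            I(log10(citric.acid)),
        data = subset(rw, citric.acid > 0))
summary(m)

# relative prediction error; rows with citric.acid == 0 get non-finite
# predictions, which ggplot drops with a warning
rw$pH.predictions <- predict(m, rw)
rw$pH.error <- (rw$pH.predictions - rw$pH) / rw$pH

# boxplot of the pH prediction error by quality score
ggplot(data = rw, aes(x = quality, y = pH.error)) +
  geom_boxplot()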

Final Plots and Summary

Plot One: Alcohol and Wine Quality

Description One

This boxplot confirms wine quality increases with alcohol content, though plenty of outliers exist at quality 5.

Plot Two: Acidic and pH variable

Description Two

The trend lines for bad and good wines run in a different direction than the trend line for average wines. The average subgroup shows a negative correlation between alcohol and volatile acidity. The trend lines make it easy to see the relationship between alcohol and volatile acidity by rating.

Plot Three: Correlation Matrix

Highlight correlations with absolute value greater than 0.3

Correlation matrix for red wine data subset

Description Three

The correlation matrix makes it easy to identify correlations greater than 0.3 in absolute value; this visual is clean and highlights notable correlations.

Reflection

With this exploratory data analysis on the red wine dataset, I found the biggest challenge was sharing the right amount of information. Plots and visuals make it easier to see where to explore more. Internet research helped me overcome gaps in domain knowledge, though I can see how domain knowledge is very helpful during the EDA process.

Alcohol and volatile acidity have the largest correlations with quality. Citric acid and sulphates also have positive correlations. Sulphates, like fixed acidity, had several observations with high values but average wine ratings. This dataset was missing any wines rated above 8 or below 3. Having a dataset with a wider spread of quality scores would improve the analysis; some of the challenges with the data included factoring the quality variable and domain knowledge. Wine is all about the acids, so understanding the relationships between acids and sulfur could be helpful. Additional inferential statistics and modeling could be done to quantify and confirm the analysis. After this project, I understand why wine data is a fun way to explore data analysis techniques.