by Js Lims
December 26 2016

Introduction

The purpose of this project is to use EDA (Exploratory Data Analysis) techniques to examine distributions, outliers, relationships, and any other surprising patterns, moving from one variable to multiple variables. The goal is to find the variables that most influence the quality of red wine. The project is written in R.

A brief summary of the dataset

## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
## [1] 1599   13
##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000
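An overview like the one above can be produced with str(), dim() and summary(). The following is a minimal sketch, not the original code of this report: the data frame name wine_head is hypothetical, and it holds only the first ten rows of four columns copied from the str() output above, standing in for the full data set.

```r
# Toy stand-in: first ten rows of four columns, copied from the str() output.
# The full analysis would load all 1599 rows instead.
wine_head <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

str(wine_head)      # column types and first values
dim(wine_head)      # number of rows and columns
summary(wine_head)  # min, quartiles, mean and max for each column
```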

Univariate Plots Section

Fixed Acidity

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.60    7.10    7.90    8.32    9.20   15.90

As the plot above shows, fixed acidity is positively skewed: the mean (8.32) lies between the median (7.90) and the 3rd quartile (9.20).

Volatile Acidity

Volatile acidity reflects the condition of a wine. A moderate amount contributes to the wine’s scent, but too much can spoil the wine.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3900  0.5200  0.5278  0.6400  1.5800

The distribution of volatile acidity is close to normal, with a small tail on the right side of the plot. I wonder about the quality of the wines above the 3rd quartile.

Citric Acid

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.090   0.260   0.271   0.420   1.000

There are three peaks in this plot.

Residual sugar

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.900   2.200   2.539   2.600  15.500

It’s positively skewed, with a long tail on the right side. 75% of wines have residual sugar at or below 2.6 g/dm^3.

After removing outliers, residual sugar looks normally distributed.
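The report does not say which rule was used to remove outliers. One common choice, shown here only as an assumption, is to drop values above the 95th percentile. The sketch uses the first ten residual.sugar values from the str() output as toy data, and the names cutoff and trimmed are hypothetical.

```r
# Toy data: the first ten residual.sugar values from the str() output.
residual_sugar <- c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2.0, 6.1)

# Assumed rule (not confirmed by the report): values above the
# 95th percentile are treated as outliers and dropped.
cutoff  <- quantile(residual_sugar, 0.95)
trimmed <- residual_sugar[residual_sugar <= cutoff]

length(trimmed)  # number of values kept
range(trimmed)   # smallest and largest remaining value
```

In this toy vector only the value 6.1 lies above the cutoff, so it is the only one removed.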

Chlorides

This plot looks roughly normal, but it has a long tail on the right side. Later I want to see how those outliers affect wine quality.

After removing outliers, we can see the distribution looks normal.

Free sulfur dioxide

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    7.00   14.00   15.87   21.00   72.00

This plot is positively skewed. Sulfur dioxide is bad for the human body, so I wonder how it affects the quality of wine.

Total Sulfur Dioxide

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   22.00   38.00   46.47   62.00  289.00

This plot is also positively skewed. There are outliers near 300.

After removing outliers and applying a log scale, the distribution looks normal.

Density

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9901  0.9956  0.9968  0.9967  0.9978  1.0040

This plot is close to a normal distribution. The mean and median are nearly equal.

pH

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.740   3.210   3.310   3.311   3.400   4.010

This plot is also normally distributed.

Total Sulphates

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.5500  0.6200  0.6581  0.7300  2.0000

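The sulphates variable is positively (right) skewed: the mean (0.6581) is above the median (0.6200), and the maximum reaches 2.0.

With a log scale on the x-axis, the distribution looks normal.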
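To see why a log scale helps here, one can compare how far the maximum and the minimum sit from the median before and after taking log10. This is a rough sketch using only the three summary values printed above; the helper tail_ratio is hypothetical and is not a formal skewness measure.

```r
# Summary values for sulphates, copied from the output above.
s <- c(min = 0.33, median = 0.62, max = 2.00)

# Hypothetical helper: length of the right tail relative to the left tail.
# A value near 1 means the extremes are roughly symmetric around the median.
tail_ratio <- function(v) {
  unname((v["max"] - v["median"]) / (v["median"] - v["min"]))
}

raw_ratio <- tail_ratio(s)         # on the original scale
log_ratio <- tail_ratio(log10(s))  # on the log10 scale

c(raw = raw_ratio, log10 = log_ratio)
```

The ratio drops from roughly 4.8 on the raw scale to roughly 1.9 on the log10 scale, which is consistent with the distribution looking more symmetric after the transformation.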

Alcohol

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.20   10.42   11.10   14.90

The plot is positively (right) skewed. 75% of wines have an alcohol content at or below 11.10%.

Quality

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.636   6.000   8.000

I grouped the quality scores into three levels.

  • Quality 3 and 4 -> low
  • Quality 5 and 6 -> middle
  • Quality 7 and 8 -> high

Most wines fall in the middle level. The mean quality score is 5.636.
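The grouping above can be expressed with cut(). This is a minimal sketch, assuming the levels are stored as a factor. The name quality_level is hypothetical, and the input is the first ten quality scores from the str() output.

```r
# Toy data: the first ten quality scores from the str() output.
quality <- c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)

# Intervals are right-closed by default: (2,4] -> low, (4,6] -> middle,
# (6,8] -> high, which matches the grouping 3-4, 5-6 and 7-8.
quality_level <- cut(quality,
                     breaks = c(2, 4, 6, 8),
                     labels = c("low", "middle", "high"))

table(quality_level)  # count of wines per level
```

For these ten rows the table gives 0 low, 8 middle and 2 high wines.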

Univariate Analysis

What is the structure of your dataset?

There are 1599 observations and 13 attributes in this data set. All variables are numeric except quality, an integer score that I treat as categorical.

What is/are the main feature(s) of interest in your dataset?

Quality is the main feature. I need to figure out how the other variables affect it.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

According to some articles I read about wine, flavor and scent are important to the quality of wine.
Many other factors probably affect them, and the balance among these factors would also matter.
I think the variables below will support my investigation:
total acidity, fixed acidity, citric acid, and alcohol.

Did you create any new variables from existing variables in the dataset?

Not yet.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

Several plots were positively skewed.

  • Free sulfur dioxide plot
  • Total sulfur dioxide plot
  • Alcohol plot
  • Citric acid plot

Since this data is tidy, I didn’t perform any operations to change the form of the data.

Bivariate Plots Section

I’m going to check the relationships between features.
First, let’s look at them with a pair plot.
The plot is created from a subset of 500 samples taken from the whole dataset.
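Taking such a subset might look like the sketch below. It assumes the 500 rows are drawn at random without replacement, which the report does not state; the seed and the name sample_idx are arbitrary choices for illustration.

```r
set.seed(2016)                        # arbitrary seed, for reproducibility
n_rows <- 1599                        # number of observations reported above
sample_idx <- sample.int(n_rows, 500) # 500 distinct row numbers

# With the full data frame loaded as `wine`, the subset would be:
# wine_subset <- wine[sample_idx, ]
# pairs(wine_subset)   # base R pair plot of the subset

length(sample_idx)
```

Sampling keeps the pair plot readable and quick to draw while still reflecting the overall shape of the data.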

Pair plot

##  [1] "fixed.acidity"        "volatile.acidity"     "citric.acid"         
##  [4] "residual.sugar"       "chlorides"            "free.sulfur.dioxide" 
##  [7] "total.sulfur.dioxide" "density"              "pH"                  
## [10] "sulphates"            "alcohol"              "quality"

From the pair plot we can say:

  • The quality of wine appears to be related to volatile acidity, citric acid, sulphates, alcohol, free sulfur dioxide and total sulfur dioxide.
  • There are negative and positive correlations between some variables.

Let’s check them out.

Scatter plot

fixed acidity vs density, citric acid, pH

## 
##  Pearson's product-moment correlation
## 
## data:  wine$fixed.acidity and wine$density
## t = 35.877, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6399847 0.6943302
## sample estimates:
##       cor 
## 0.6680473
## 
##  Pearson's product-moment correlation
## 
## data:  wine$fixed.acidity and wine$citric.acid
## t = 36.234, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6438839 0.6977493
## sample estimates:
##       cor 
## 0.6717034
## 
##  Pearson's product-moment correlation
## 
## data:  wine$fixed.acidity and wine$pH
## t = -37.366, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.7082857 -0.6559174
## sample estimates:
##        cor 
## -0.6829782

Fixed acidity is positively correlated with density and citric acid, while negatively correlated with pH.
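Output like the blocks above comes from cor.test(), which defaults to Pearson's product-moment correlation. A minimal sketch on toy data follows: the vectors hold only the first ten fixed.acidity and pH values from the str() output, so the estimate will differ from the full-data value of -0.683.

```r
# Toy data: first ten rows from the str() output.
fixed_acidity <- c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5)
pH            <- c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35)

# Pearson correlation test (the default method of cor.test).
ct <- cor.test(fixed_acidity, pH)

ct$estimate   # sample correlation coefficient
ct$conf.int   # 95 percent confidence interval
ct$p.value    # p-value for the null hypothesis of zero correlation
```

Even in this small sample the estimate is negative, in line with more acidic wines having a lower pH.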

volatile acidity vs citric acid

## 
##  Pearson's product-moment correlation
## 
## data:  wine$volatile.acidity and wine$citric.acid
## t = -26.489, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.5856550 -0.5174902
## sample estimates:
##        cor 
## -0.5524957

Volatile acidity is negatively correlated with citric acid.

alcohol vs density

## 
##  Pearson's product-moment correlation
## 
## data:  wine$density and wine$alcohol
## t = -22.838, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.5322547 -0.4583061
## sample estimates:
##        cor 
## -0.4961798

Density is negatively correlated with alcohol. Since alcohol is less dense than water, a higher alcohol content lowers the density of the wine, which explains the negative correlation.

free sulfur dioxide vs total sulfur dioxide

There are 2 outliers on the right side, with no data points around them. So, before fitting a linear regression model, let’s remove them.

## 
## Call:
## lm(formula = free.sulfur.dioxide ~ total.sulfur.dioxide, data = wine[idx, 
##     ])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -24.600  -4.305  -1.693   3.605  34.972 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          5.656032   0.340589   16.61   <2e-16 ***
## total.sulfur.dioxide 0.220741   0.006074   36.34   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.723 on 1595 degrees of freedom
## Multiple R-squared:  0.453,  Adjusted R-squared:  0.4526 
## F-statistic:  1321 on 1 and 1595 DF,  p-value: < 2.2e-16
## 
##  Pearson's product-moment correlation
## 
## data:  wine[idx, ]$total.sulfur.dioxide and wine[idx, ]$free.sulfur.dioxide
## t = 36.341, df = 1595, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6452693 0.6989950
## sample estimates:
##      cor 
## 0.673019

Total sulfur dioxide and free sulfur dioxide are positively correlated.

Box plot

The quality of wine is positively correlated with alcohol, citric acid and sulphates and negatively correlated with volatile acidity, pH and density.

alcohol density plot

This chart shows how strongly alcohol content affects the quality level.
Wines with higher alcohol are more likely to be high quality wines.

Linear model

## 
## Call:
## lm(formula = quality ~ alcohol + volatile.acidity + sulphates + 
##     citric.acid + density + pH, data = wine)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.64141 -0.38701 -0.06721  0.45480  2.11572 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -11.77058   11.92162  -0.987 0.323631    
## alcohol            0.34190    0.01985  17.222  < 2e-16 ***
## volatile.acidity  -1.32197    0.11597 -11.399  < 2e-16 ***
## sulphates          0.65627    0.10367   6.330 3.17e-10 ***
## citric.acid       -0.37834    0.13479  -2.807 0.005064 ** 
## density           15.84518   11.88503   1.333 0.182655    
## pH                -0.47787    0.13381  -3.571 0.000366 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6563 on 1592 degrees of freedom
## Multiple R-squared:  0.3421, Adjusted R-squared:  0.3396 
## F-statistic:   138 on 6 and 1592 DF,  p-value: < 2.2e-16

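The linear model with 6 predictors explains 34.21% of the variability in quality. Density (p = 0.18) is not statistically significant, so once the other predictors are included there is little evidence of a relationship between density and quality. Citric acid is significant at the 0.01 level (p = 0.005), but its coefficient is negative in this model.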
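The "Multiple R-squared" figure is the share of the variance in quality that the fitted values account for, that is, 1 - RSS/TSS. The sketch below checks this on a toy model with a single predictor, using the first ten alcohol and quality values from the str() output. It is not the six-predictor model above, and its numbers are not comparable to the full-data result.

```r
# Toy data: first ten rows from the str() output.
alcohol <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
quality <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

fit <- lm(quality ~ alcohol)

rss <- sum(residuals(fit)^2)             # residual sum of squares
tss <- sum((quality - mean(quality))^2)  # total sum of squares
r_squared <- 1 - rss / tss               # share of variance explained

# Same value as reported by summary()
all.equal(r_squared, summary(fit)$r.squared)
```

Read this way, an R-squared of 0.3421 means that about two thirds of the variation in quality is left unexplained by the six predictors.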

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

I found relationships between some variables.

  • negative correlation
    • Fixed acidity vs pH
    • volatile acidity vs citric acid
    • alcohol vs density
  • positive correlation
    • fixed acidity vs density
    • fixed acidity vs citric acid
    • free sulfur dioxide vs total sulfur dioxide

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

The negative correlation between volatile acidity and citric acid is interesting.
It is not what I expected.

What was the strongest relationship you found?

The relationship between fixed acidity and pH is the strongest (r = -0.68).

Multivariate Plots Section

I grouped the quality scores into three levels (low, middle, high).

The polygons are drawn at a 0.95 confidence level.

volatile acidity, citric acid, quality

High quality wines have higher citric acid and lower volatile acidity, while low quality wines have lower citric acid and higher volatile acidity.

alcohol, citric acid, quality

High quality wines have higher alcohol and citric acid. Middle and low quality wines have similar alcohol, but middle quality wines have more citric acid.
There is no clear relationship between alcohol and citric acid.

volatile acidity, level of quality, alcohol

The relationship between volatile acidity and alcohol is positive at every quality level except the lowest. Also, the higher the volatile acidity, the worse the quality of the wine.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

Grouping the scatter plot of citric acid and volatile acidity by quality level clearly shows that higher citric acid and lower volatile acidity go with better wine quality.

The scatter plot alone shows no relationship between alcohol and citric acid. However, coloring it by quality level shows that alcohol is a really important variable for identifying high quality wines, and citric acid is also a fairly important variable for wine quality.

Were there any interesting or surprising interactions between features?

Among high quality wines, most of those with low alcohol have high citric acid and low volatile acidity. When high quality wines have low citric acid and high volatile acidity, they have a high level of alcohol.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

I created a linear model to predict wine quality in the bivariate plots section, using alcohol, volatile acidity, sulphates, citric acid, density and pH. However, it explains only 34.21% of the variability in quality, which means it is not accurate.

Final Plots and Summary

Plot One

Description One

Combining violin plots with box plots shows the distribution of volatile acidity for each quality score. As quality improves, volatile acidity is distributed at lower levels and citric acid at higher levels. The black lines connecting the medians of each quality score support a negative relationship between volatile acidity and quality, and a positive relationship between citric acid and quality.

Plot Two

Description Two

By drawing more ellipses in the plot on the right side, we can see the quality levels more clearly. Less volatile acidity and more citric acid go with better wine quality.
(confidence levels: 0.5, 0.4, 0.3, 0.2, 0.1, 0.05, 0.01)

Plot Three

Description Three

I added an ECDF plot on the right side. The curve for high quality wines begins to rise at a higher alcohol level than the others. In both plots there is no big difference between low and middle quality wines. However, high quality wines are clearly different from both middle and low quality wines.
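An ECDF gives, for each alcohol value, the fraction of wines in a group at or below that value, so a curve that rises later belongs to a group with more alcohol. This is a minimal sketch with base R's ecdf(), using the first ten rows from the str() output split by quality level. With only two high quality wines it is purely illustrative, and the names F_middle and F_high are hypothetical.

```r
# Toy data: first ten rows from the str() output.
alcohol <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
quality <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# Split alcohol by quality level (5-6 -> middle, 7-8 -> high).
middle_alcohol <- alcohol[quality %in% c(5, 6)]
high_alcohol   <- alcohol[quality %in% c(7, 8)]

# ecdf() returns a function: F(x) = share of values <= x.
F_middle <- ecdf(middle_alcohol)
F_high   <- ecdf(high_alcohol)

F_middle(9.4)  # share of middle quality wines with alcohol <= 9.4
F_high(9.4)    # share of high quality wines with alcohol <= 9.4
```

In this toy sample half of the middle quality wines are at or below 9.4% alcohol, while none of the high quality wines are, so the high quality curve starts rising later.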


Reflection

This data set contains a lot of surprising information on red wines and their chemical properties. Step by step, I carried out exploratory data analysis on one variable, two variables, and multiple variables, and found which features are related to wine quality.
I wish the data set included other variables, such as the price of the wine or the place where it was made. Such a data set would raise more interesting questions.

I was able to create a linear model to predict quality from new data, but that model was not accurate. If quality were a continuous variable in this dataset, the analysis would be more accurate. With a continuous target variable, we could scale the quality variable to get better visualizations. That would make the results clearer and would help improve the linear model. There may still be good ways to predict wine quality with another kind of model.

While exploring this dataset, I tried to make scatter plots. But because the dataset is large, the data points overlapped, which made the plots hard to read. Even adjusting color and opacity didn’t work well. I also struggled to make a bubble chart in the multivariate analysis. For this reason, I used the ‘stat ellipse’ and ‘stat smooth’ functions, which really helped me get better plots.