Red and White Wine Quality

by Daria Alekseeva

1. Citation

This dataset is publicly available for research. The details are described in [Cortez et al., 2009].

P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009. ISSN: 0167-9236.

Available at: [@Elsevier] http://dx.doi.org/10.1016/j.dss.2009.05.016 [Pre-press (pdf)] http://www3.dsi.uminho.pt/pcortez/winequality09.pdf [bib] http://www3.dsi.uminho.pt/pcortez/dss09.bib

2. About dataset

In the above reference, two datasets were created, using red and white wine samples. The inputs include objective tests (e.g. pH values) and the output is based on sensory data (median of at least 3 evaluations made by wine experts). Each expert graded the wine quality between 0 (very bad) and 10 (very excellent).

The two datasets are related to red and white variants of the Portuguese “Vinho Verde” wine. For more details, consult: http://www.vinhoverde.pt/en/ or the reference [Cortez et al., 2009]. Due to privacy and logistic issues, only physicochemical (inputs) and sensory (the output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.).

3. Number of Instances:

red wine - 1599; white wine - 4898.

4. Number of Attributes:

11 + output attribute

5. Attribute information:

   For more information, read [Cortez et al., 2009].

   Input variables (based on physicochemical tests):
   1 - fixed acidity (tartaric acid - g / dm^3)
   2 - volatile acidity (acetic acid - g / dm^3)
   3 - citric acid (g / dm^3)
   4 - residual sugar (g / dm^3)
   5 - chlorides (sodium chloride - g / dm^3)
   6 - free sulfur dioxide (mg / dm^3)
   7 - total sulfur dioxide (mg / dm^3)
   8 - density (g / cm^3)
   9 - pH
   10 - sulphates (potassium sulphate - g / dm^3)
   11 - alcohol (% by volume)
   Output variable (based on sensory data): 
   12 - quality (score between 0 and 10)

6. Description of attributes:

1 - fixed acidity: most acids involved with wine are fixed or nonvolatile (do not evaporate readily)

2 - volatile acidity: the amount of acetic acid in wine, which at too high levels can lead to an unpleasant, vinegar taste

3 - citric acid: found in small quantities, citric acid can add ‘freshness’ and flavor to wines

4 - residual sugar: the amount of sugar remaining after fermentation stops; it’s rare to find wines with less than 1 gram/liter, and wines with greater than 45 grams/liter are considered sweet

5 - chlorides: the amount of salt in the wine

6 - free sulfur dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine

7 - total sulfur dioxide: amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine

8 - density: the density of wine is close to that of water, depending on the percent alcohol and sugar content

9 - pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale

10 - sulphates: a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant

11 - alcohol: the percent alcohol content of the wine

Output variable (based on sensory data): 12 - quality (score between 0 and 10)

# set up working directory
setwd("/home/daria/Courses/R/Udacity/EDA_Course_Materials/FinalProject")
# load all packages used in this exploratory analysis
library(ggplot2)
library(GGally)

Analysis

Load the Data

red <- read.csv('wineQualityReds.csv')
white <- read.csv('wineQualityWhites.csv')

# add a categorical variable to both sets
red['color'] <- 'red'
white['color'] <- 'white'

# merge red wine and white wine datasets
data <- rbind(red, white)
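
As a side note, the `X` column read from each CSV appears to be a per-file row index (the summary further down shows it running from 1 to 4898), so after `rbind` it no longer identifies a row uniquely. Here is a minimal sketch of one way to handle this. The data frames `red_toy` and `white_toy` and their values are made up for illustration and are not the real data; dropping `X` is an optional step that the analysis here does not perform.

```r
# Toy stand-ins for the two CSV files (values are made up for illustration)
red_toy   <- data.frame(X = 1:3, quality = c(5, 5, 6))
white_toy <- data.frame(X = 1:2, quality = c(6, 7))

# Tag each set with its color, as in the analysis above
red_toy$color   <- "red"
white_toy$color <- "white"

# Stack the rows; both inputs must have the same column names
wines <- rbind(red_toy, white_toy)

# X restarts at 1 in the second file, so it is duplicated after merging
has_duplicate_index <- any(duplicated(wines$X))
has_duplicate_index

# Optionally drop the index and store color as a factor
wines$X <- NULL
wines$color <- factor(wines$color)
str(wines)
```

Storing `color` as a factor is a design choice: `summary()` would then report counts per color instead of only `Length` and `Class`.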

Summary of the Data Set

head(data)
##   X fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 1           7.4             0.70        0.00            1.9     0.076
## 2 2           7.8             0.88        0.00            2.6     0.098
## 3 3           7.8             0.76        0.04            2.3     0.092
## 4 4          11.2             0.28        0.56            1.9     0.075
## 5 5           7.4             0.70        0.00            1.9     0.076
## 6 6           7.4             0.66        0.00            1.8     0.075
##   free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates alcohol
## 1                  11                   34  0.9978 3.51      0.56     9.4
## 2                  25                   67  0.9968 3.20      0.68     9.8
## 3                  15                   54  0.9970 3.26      0.65     9.8
## 4                  17                   60  0.9980 3.16      0.58     9.8
## 5                  11                   34  0.9978 3.51      0.56     9.4
## 6                  13                   40  0.9978 3.51      0.56     9.4
##   quality color
## 1       5   red
## 2       5   red
## 3       5   red
## 4       6   red
## 5       5   red
## 6       5   red
tail(data)
##         X fixed.acidity volatile.acidity citric.acid residual.sugar
## 6492 4893           6.5             0.23        0.38            1.3
## 6493 4894           6.2             0.21        0.29            1.6
## 6494 4895           6.6             0.32        0.36            8.0
## 6495 4896           6.5             0.24        0.19            1.2
## 6496 4897           5.5             0.29        0.30            1.1
## 6497 4898           6.0             0.21        0.38            0.8
##      chlorides free.sulfur.dioxide total.sulfur.dioxide density   pH
## 6492     0.032                  29                  112 0.99298 3.29
## 6493     0.039                  24                   92 0.99114 3.27
## 6494     0.047                  57                  168 0.99490 3.15
## 6495     0.041                  30                  111 0.99254 2.99
## 6496     0.022                  20                  110 0.98869 3.34
## 6497     0.020                  22                   98 0.98941 3.26
##      sulphates alcohol quality color
## 6492      0.54     9.7       5 white
## 6493      0.50    11.2       6 white
## 6494      0.46     9.6       5 white
## 6495      0.46     9.4       6 white
## 6496      0.38    12.8       7 white
## 6497      0.32    11.8       6 white
dim(data)
## [1] 6497   14
names(data)
##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"              "color"
summary(data)
##        X        fixed.acidity    volatile.acidity  citric.acid    
##  Min.   :   1   Min.   : 3.800   Min.   :0.0800   Min.   :0.0000  
##  1st Qu.: 813   1st Qu.: 6.400   1st Qu.:0.2300   1st Qu.:0.2500  
##  Median :1650   Median : 7.000   Median :0.2900   Median :0.3100  
##  Mean   :2044   Mean   : 7.215   Mean   :0.3397   Mean   :0.3186  
##  3rd Qu.:3274   3rd Qu.: 7.700   3rd Qu.:0.4000   3rd Qu.:0.3900  
##  Max.   :4898   Max.   :15.900   Max.   :1.5800   Max.   :1.6600  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.600   Min.   :0.00900   Min.   :  1.00     
##  1st Qu.: 1.800   1st Qu.:0.03800   1st Qu.: 17.00     
##  Median : 3.000   Median :0.04700   Median : 29.00     
##  Mean   : 5.443   Mean   :0.05603   Mean   : 30.53     
##  3rd Qu.: 8.100   3rd Qu.:0.06500   3rd Qu.: 41.00     
##  Max.   :65.800   Max.   :0.61100   Max.   :289.00     
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.0        Min.   :0.9871   Min.   :2.720   Min.   :0.2200  
##  1st Qu.: 77.0        1st Qu.:0.9923   1st Qu.:3.110   1st Qu.:0.4300  
##  Median :118.0        Median :0.9949   Median :3.210   Median :0.5100  
##  Mean   :115.7        Mean   :0.9947   Mean   :3.219   Mean   :0.5313  
##  3rd Qu.:156.0        3rd Qu.:0.9970   3rd Qu.:3.320   3rd Qu.:0.6000  
##  Max.   :440.0        Max.   :1.0390   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality         color          
##  Min.   : 8.00   Min.   :3.000   Length:6497       
##  1st Qu.: 9.50   1st Qu.:5.000   Class :character  
##  Median :10.30   Median :6.000   Mode  :character  
##  Mean   :10.49   Mean   :5.818                     
##  3rd Qu.:11.30   3rd Qu.:6.000                     
##  Max.   :14.90   Max.   :9.000
str(data)
## 'data.frame':    6497 obs. of  14 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
##  $ color               : chr  "red" "red" "red" "red" ...

Observations from the Summary

Mean residual sugar level is 5.4 g/l, but there is one very sweet sample with 65.8 g/l (an outlier). Mean free sulfur dioxide is 30.5 ppm; the maximum of 289 ppm is quite high given that the third quartile (75%) is 41 ppm. The pH of the wines ranges from 2.7 to 4.0, with a mean of 3.2, so there are no basic wines in this dataset (no high pH levels). Alcohol: the lightest wine is 8%, the strongest is 14.9%. The minimum quality mark is 3, the mean is 5.8, and the highest is 9.
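
To make the word "outlier" a bit more concrete, the sketch below applies the common 1.5 * IQR rule to the quartiles printed by `summary(data)`. The rule and the helper name `upper_fence` are my assumptions for illustration; the analysis itself does not define a threshold.

```r
# Quartiles copied from the summary() output above
q1_so2   <- 17;  q3_so2   <- 41    # free.sulfur.dioxide
q1_sugar <- 1.8; q3_sugar <- 8.1   # residual.sugar

# Upper fence of the 1.5 * IQR rule (a convention, not the author's method)
upper_fence <- function(q1, q3) q3 + 1.5 * (q3 - q1)

fence_so2   <- upper_fence(q1_so2, q3_so2)       # 41 + 1.5 * 24
fence_sugar <- upper_fence(q1_sugar, q3_sugar)   # 8.1 + 1.5 * 6.3

# Both maxima from the summary lie far above their fences
289  > fence_so2
65.8 > fence_sugar
```

Under this rule, both the maximum free sulfur dioxide (289) and the maximum residual sugar (65.8) would be flagged.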

Understand the Distribution of Single Variables

First I’d like to look at single variables to get an idea of their distributions and decide which things to look at more deeply later.

Quality of Wine

summary(data$quality)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.818   6.000   9.000
table(data$quality)
## 
##    3    4    5    6    7    8    9 
##   30  216 2138 2836 1079  193    5
qplot(quality, data = data, fill = color, binwidth = 1) +
    scale_x_continuous(breaks = seq(3,10,1), lim = c(3,10)) +
    scale_y_sqrt()

We know that the numbers of observations for red and white wine differ in our dataset, but we can still see that for both colors the distribution is roughly normal, with peaks at almost the same quality points, 5 and 6.
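
The dominance of levels 5 and 6 can be quantified directly from the `table(data$quality)` output. The sketch below only re-enters those printed counts by hand; the variable names are mine.

```r
# Quality counts copied from the table(data$quality) output above
counts <- c("3" = 30, "4" = 216, "5" = 2138, "6" = 2836,
            "7" = 1079, "8" = 193, "9" = 5)

total <- sum(counts)        # should match the 6497 rows of the merged data
shares <- counts / total    # share of each quality level
round(shares, 3)

# Combined share of the two most common levels
share_5_6 <- sum(shares[c("5", "6")])
share_5_6
```

Levels 5 and 6 together cover roughly three quarters of all samples.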

Level of alcohol

summary(data$alcohol)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00    9.50   10.30   10.49   11.30   14.90
qplot(alcohol, data = data, fill = color, binwidth = 0.5) +
    scale_x_continuous(breaks = seq(8,15,0.5), lim = c(8,15))

The alcohol level distribution looks skewed. Again, the red wine sample is smaller, but it shows the same pattern of alcohol level distribution as the white wines. The most frequent alcohol level is 9.5%; the mean is 10.49%.
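
The direction of the skew can be checked from the printed summary alone. The sketch below compares mean and median and computes the quartile (Bowley) skewness coefficient; using this coefficient is my choice for illustration, not something the analysis relies on.

```r
# Values copied from summary(data$alcohol) above
q1 <- 9.5; med <- 10.3; q3 <- 11.3; avg <- 10.49

# A mean above the median hints at a longer right tail
avg > med

# Quartile (Bowley) skewness: positive values indicate right skew
bowley <- (q3 + q1 - 2 * med) / (q3 - q1)
bowley
```

Both checks point the same way: the distribution has a longer tail toward higher alcohol levels.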

Density of wine

summary(data$density)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9871  0.9923  0.9949  0.9947  0.9970  1.0390
qplot(density, data = data, fill = color, binwidth = 0.0002) +
    scale_x_log10(lim = c(min(data$density), 1.00370), 
                  breaks = seq(min(data$density), 1.00370, 0.002))

Looking at the ‘table’ summary we see that there are two outliers: 1.0103 and 1.03898. To see the distribution of density more clearly I used log10 and limited the data. Now we can see that the density distribution of white wine is bimodal and that of red wine is normal.

Level of Volatile Acidity

summary(data$volatile.acidity)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0800  0.2300  0.2900  0.3397  0.4000  1.5800
qplot(volatile.acidity, data = data, fill = color, binwidth = 0.001) +
    scale_x_log10(breaks = seq(min(data$volatile.acidity), 
                               max(data$volatile.acidity), 0.1))
## Warning: position_stack requires constant width: output may be incorrect

Volatile acidity has a normal distribution on the log10 scale. I also suppose that more acetic wines have worse marks, because high acidity can lead to an unpleasant taste.

Level of Chlorides

summary(data$chlorides)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00900 0.03800 0.04700 0.05603 0.06500 0.61100
qplot(chlorides, data = data, fill = color, binwidth = 0.01) +
    scale_x_log10(breaks = seq(min(data$chlorides), max(data$chlorides), 0.1))
## Warning: position_stack requires constant width: output may be incorrect

The chlorides distribution is initially skewed, so I used log10 to see the distribution more clearly.
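
A small worked example of why log10 helps here, using only the minimum, median and maximum from `summary(data$chlorides)`. The variable names are mine, and the ratio is just one informal way to describe how lopsided the range is.

```r
# Values copied from summary(data$chlorides) above
lo <- 0.009; med <- 0.047; hi <- 0.611

# Raw scale: distance from the median up to the max vs. down to the min
raw_ratio <- (hi - med) / (med - lo)

# log10 scale: the same two distances after the transform
log_ratio <- (log10(hi) - log10(med)) / (log10(med) - log10(lo))

c(raw = raw_ratio, log10 = log_ratio)
```

On the raw scale the upper tail is roughly 15 times longer than the lower one; after the transform the two sides differ by a factor of about 1.5, which is why the histogram looks far more symmetric.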

Understand the Distribution of Two and More Variables

Density of Quality by color

qplot(quality, data = data, binwidth = 1, color = color, geom = "density") + 
    scale_x_continuous(breaks = seq(3, 9, 1))

In our sample, red and white wines have almost the same share of quality ‘3’, ‘4’ and ‘9’; red wines have a larger share of quality ‘5’, and white wines a larger share of quality ‘6’, ‘7’ and ‘8’.

ggpairs(data)
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

I checked the correlation between all the variables in the wine dataset.

We can see some correlation in pairs like:

  • alcohol vs. density
  • fixed.acidity vs. density
  • residual.sugar vs. total.sulfur.dioxide
  • residual.sugar vs. density
  • residual.sugar vs. alcohol
  • chlorides vs. density
  • chlorides vs. sulphates
  • quality vs. alcohol

Compare objective parameters of wines

# use function for plotting with ggplot for simplicity of code
f <- function(dataset, x, y, z, opts=NULL) {
  ggplot(dataset, aes_string(x = x, y = y, color = z)) +
   geom_point(alpha = 1/5, position = position_jitter(h = 0), size = 2) +
   geom_smooth(method = 'lm')
}

# density vs. alcohol plot
p <- f(data, "density", "alcohol", "color")
p + coord_cartesian(xlim=c(min(data$density),1.005), ylim=c(8,15))

# density vs. fixed.acidity plot
p <- f(data, "density", "fixed.acidity", "color")
p + coord_cartesian(xlim=c(min(data$density),1.005))

# residual.sugar vs. total.sulfur.dioxide
p <- f(data, "residual.sugar", "total.sulfur.dioxide", "color")
p + scale_x_log10() +
    coord_cartesian(xlim=c(min(data$residual.sugar),30), 
                    ylim=c(min(data$total.sulfur.dioxide), 350))

# residual.sugar vs. density
p <- f(data, "residual.sugar", "density", "color")
p + coord_cartesian(xlim=c(min(data$residual.sugar),25), 
                    ylim=c(min(data$density), 1.005))

# residual.sugar vs. alcohol
p <- f(data, "residual.sugar", "alcohol", "color")
p + coord_cartesian(xlim=c(min(data$residual.sugar),25), 
                    ylim=c(min(data$alcohol), 15))

# chlorides vs. density
p <- f(data, "chlorides", "density", "color")
p + scale_x_log10() +
    coord_cartesian(ylim=c(min(data$density), 1.005))

# chlorides vs. sulphates
p <- f(data, "chlorides", "sulphates", "color")
p + scale_x_log10() +
    coord_cartesian(ylim=c(min(data$sulphates), 1))

After checking the correlated pairs I noticed that red and white wine behave differently in some graphs. The correlation for the same pair of variables can differ considerably between red and white wine. In the table below, “c.” stands for correlation.

                                                 RED      |     WHITE
* alcohol vs. density                     :    strong c.  :   strong c.
* fixed.acidity vs. density               :    strong c.  :     no c.
* residual.sugar vs. total.sulfur.dioxide :     weak c.   :    weak c.
* residual.sugar vs. density              :    strong c.  :   strong c.
* residual.sugar vs. alcohol              :     no c.     :   strong c.
* chlorides vs. density                   :    strong c.  :   strong c.
* chlorides vs. sulphates                 :    strong c.  :     no c.
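
The labels in this table come from looking at the fitted lines. To put numbers next to them, the correlation could be computed separately for each color, as in this sketch. `cor_by_color` is a hypothetical helper and the `toy` values are invented to mimic the "strong for red, none for white" case.

```r
# Hypothetical helper: Pearson correlation of two columns within each color
cor_by_color <- function(df, x, y) {
  sapply(split(df, df$color), function(d) cor(d[[x]], d[[y]]))
}

# Made-up data: density rises with fixed acidity for "red" only
toy <- data.frame(
  color         = rep(c("red", "white"), each = 4),
  fixed.acidity = c(6, 7, 8, 9, 6, 7, 8, 9),
  density       = c(0.994, 0.995, 0.996, 0.997,
                    0.995, 0.993, 0.993, 0.995)
)

by_color <- cor_by_color(toy, "fixed.acidity", "density")
round(by_color, 3)
```

Applied to the real `data`, the same call with other column names would give one coefficient per color for each row of the table.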

Compare Quality vs. Alcohol

ggplot(data = data, aes(y = alcohol, x = quality)) +
   geom_point(alpha = 1/4, position = position_jitter(h = 0), size = 4) +
   geom_smooth(method = 'lm') +
   facet_wrap(~ color)

My idea that volatile acidity affects the quality of wine turned out not to be correct. The only objective wine parameter that shows a correlation with quality is alcohol.

Significantly Different Parameters for Red and White Wine

Fixed Acidity vs. Color

qplot(x = color, y = fixed.acidity, data = data, geom = "boxplot")

Volatile Acidity vs. Color

qplot(x = color, y = volatile.acidity, data = data, geom = "boxplot")

Residual Sugar vs. Color

qplot(x = color, y = residual.sugar, data = data, geom = "boxplot")

Total Sulfur Dioxide vs. Color

qplot(x = color, y = total.sulfur.dioxide, data = data, geom = "boxplot")

These are the parameters that depend a lot on the color of the wine.

Quality of Red and White Wines

qplot(x = color, y = quality, data = data, geom = "boxplot")

The median and the 25% and 75% quartiles are similar for red and white wines.
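
The numbers behind such a boxplot can be printed per color with `tapply` and `quantile`. The sketch below uses a tiny invented data frame (`toy`), so the values shown are not those of the real dataset.

```r
# Made-up quality scores for illustration only
toy <- data.frame(color   = rep(c("red", "white"), each = 5),
                  quality = c(5, 5, 6, 6, 7,
                              5, 6, 6, 6, 7))

# Lower quartile, median and upper quartile of quality within each color
q <- tapply(toy$quality, toy$color, quantile, probs = c(0.25, 0.5, 0.75))
q[["red"]]
q[["white"]]
```

This has to run while `quality` is still numeric, that is, before it is converted to a factor in the next section.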

Exploring Wine Parameters and Quality

Chlorides and Sulphates

# converting 'quality' vector into factor variable
data$quality <- as.factor(data$quality)

ggplot(aes(x = chlorides, y = sulphates, color = quality), data = data) +
    facet_wrap(~color) + 
    geom_point(size = 3, alpha = 1/4) +
    scale_color_identity(guide = 'legend') +
    ylim(min(data$sulphates), quantile(data$sulphates, 0.95)) +
    xlim(min(data$chlorides), quantile(data$chlorides, 0.95)) 
## Warning: Removed 391 rows containing missing values (geom_point).
## Warning: Removed 182 rows containing missing values (geom_point).

Sulphates and chlorides of white wine are more spread out than those of red wine. The most frequent quality levels for both colors are 4, 5, 6 and 7.

Fixed Acidity and Volatile Acidity

ggplot(aes(x = fixed.acidity, 
           y = volatile.acidity, 
           color = quality), 
       data = data) +
    facet_wrap(~color) + 
    geom_point(size = 3, alpha = 1/4) +
    scale_color_identity(guide = 'legend') +
    ylim(min(data$volatile.acidity), 
         quantile(data$volatile.acidity, 0.99)) +
    xlim(min(data$fixed.acidity), 
         quantile(data$fixed.acidity, 0.99)) 
## Warning: Removed 115 rows containing missing values (geom_point).
## Warning: Removed 7 rows containing missing values (geom_point).

Red wine of quality 5 has fixed acidity between 6 and 12, white wine between 5 and 10. White wine samples of quality ‘6’ are highly concentrated around a volatile acidity of 0.2 and a fixed acidity of 6-7.

Free Sulfur Dioxide and Total Sulfur Dioxide

ggplot(aes(x = free.sulfur.dioxide, 
           y = total.sulfur.dioxide, 
           color = quality), 
       data = data) +
    facet_wrap(~color) + 
    geom_point(size = 3, alpha = 1/4) +
    scale_color_identity(guide = 'legend') +
    ylim(min(data$total.sulfur.dioxide), 
         quantile(data$total.sulfur.dioxide, 0.95)) +
    xlim(min(data$free.sulfur.dioxide), 
         quantile(data$free.sulfur.dioxide, 0.95)) 
## Warning: Removed 6 rows containing missing values (geom_point).
## Warning: Removed 523 rows containing missing values (geom_point).

We can see a clear positive correlation between total sulfur dioxide and free sulfur dioxide for both red and white wine. Total sulfur dioxide of white wine reaches higher values, mostly because of wine with quality ‘5’.
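
Part of this positive correlation is expected by construction, since total sulfur dioxide is described in the attribute list as the amount of free and bound forms together. A quick sketch with the first six rows printed by `head(data)` shows the bound part; the variable name `bound` is mine.

```r
# First six rows copied from the head(data) output above
free  <- c(11, 25, 15, 17, 11, 13)
total <- c(34, 67, 54, 60, 34, 40)

# Bound SO2 is what remains after removing the free form
bound <- total - free
bound

# Share of total SO2 that is free in each of these samples
round(free / total, 2)
```

Because `free` is a component of `total`, the more informative comparison may be `free` against `bound`.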

pH and Alcohol

ggplot(aes(x = pH, y = alcohol, color = quality), data = data) +
    facet_wrap(~color) + 
    geom_point(size = 3, alpha = 1/4) +
    scale_color_identity(guide = 'legend') +
    ylim(min(data$alcohol), quantile(data$alcohol, 0.95)) +
    xlim(min(data$pH), quantile(data$pH, 0.95)) 
## Warning: Removed 202 rows containing missing values (geom_point).
## Warning: Removed 372 rows containing missing values (geom_point).

Alcohol levels for both wines are much the same, but the minimum pH level is 2.9 for white wine and 3.1 for red wine. Only red wine with quality ‘5’ has a pH level around 3.

Citric Acid and Alcohol

ggplot(aes(x = citric.acid, y = alcohol, color = quality), 
       data = data) +
    facet_wrap(~color) + 
    geom_point(size = 3, alpha = 1/4) +
    scale_color_identity(guide = 'legend') +
    ylim(min(data$alcohol), quantile(data$alcohol, 0.95)) +
    xlim(min(data$citric.acid), quantile(data$citric.acid, 0.95)) 
## Warning: Removed 161 rows containing missing values (geom_point).
## Warning: Removed 430 rows containing missing values (geom_point).

In these plots we can notice that most red wine is clustered where citric acid is 0 - 0.2, while for white wine this is true at citric acid levels of 0.2 - 0.4.

Final Plots and Summary

Plot One: Quality of Wine

summary(data$quality)
##    3    4    5    6    7    8    9 
##   30  216 2138 2836 1079  193    5
table(data$quality)
## 
##    3    4    5    6    7    8    9 
##   30  216 2138 2836 1079  193    5
qplot(as.numeric(as.character(quality)), 
      data = data, 
      fill = color, 
      binwidth = 1, 
      origin = - 0.5, 
      main = "Quality of Red and White Wine") +
    scale_x_continuous(breaks = seq(2,10,1), lim = c(2,10)) +
    scale_y_sqrt(breaks = seq(0,5600,500)) +
    xlab('Quality') +
    ylab('Quantity')

Description

Let’s look again at the distribution of wine quality by color. We may say that the most frequent quality levels are 5 and 6 for both wine colors.

Plot Two: Correlation Between Objective Parameters

ggplot(data = data, 
       aes(x = density, y = alcohol, color = color)) +
   geom_point(alpha = 1/6, position = position_jitter(h = 0), size = 3) +
   geom_smooth(method = 'lm') +
   coord_cartesian(xlim=c(min(data$density),1.005), ylim=c(8,15)) +
   xlab('Density') +
   ylab('Alcohol') +
   ggtitle('Density vs. Alcohol correlation by Color')

Description

Density and alcohol show the strongest correlation among all wine parameters: -0.687 for red and white wine combined. Red wine is on average stronger than white wine: in this dataset, wines with a lower alcohol percentage are mostly white, while red wines mostly have a higher alcohol percentage.

ggplot(data = data, 
       aes(x = density, y = alcohol, color = factor(quality))) +
   geom_point(alpha = 1/2, position = position_jitter(h = 0), size = 2) +
   coord_cartesian(xlim=c(min(data$density),1.005), ylim=c(8,15)) +
   scale_color_brewer(type='qual') +
   xlab('Density') +
   ylab('Alcohol') +
   ggtitle('Density vs. Alcohol correlation by Quality')

ggplot(data = data, 
       aes(x = density, y = alcohol) )+
   facet_wrap( ~ quality) +
   geom_boxplot() +
   xlab('Density') +
   ylab('Alcohol') +
   ggtitle('Density vs. Alcohol correlation by Quality')

Description

Wine with a high alcohol percentage mostly has quality level 7, while wine with a lower alcohol percentage mostly has quality level 5. Wines with quality levels 6 and 8 show various combinations of alcohol and density.

Plot Three: Quality vs. Alcohol

ggplot(data = data, aes(y = alcohol, x = quality)) +
   geom_boxplot() +
   geom_smooth(method = 'lm') +
   facet_wrap(~ color) +
   xlab('Quality') +
   ylab('Alcohol') +
   ggtitle('How Alcohol Level Affects Wine Quality')
## geom_smooth: Only one unique x value each group.Maybe you want aes(group = 1)?
## geom_smooth: Only one unique x value each group.Maybe you want aes(group = 1)?

Description

Alcohol level and quality have a correlation of 0.4. This is the strongest correlation we have found between an objective wine parameter and wine quality. Since 0.4 is not a high correlation, alcohol alone is not enough to predict quality.
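
One way to see why 0.4 is weak for prediction: in a simple linear regression with a single predictor, the share of variance explained (R squared) equals the squared Pearson correlation. The sketch below only does that arithmetic with the rounded value quoted above.

```r
# Correlation between alcohol and quality, as quoted in the text
r <- 0.4

# Share of the variance in quality that a straight-line fit on alcohol
# would explain, and the share it would leave unexplained
r_squared   <- r^2
unexplained <- 1 - r_squared

c(explained = r_squared, unexplained = unexplained)
```

So a linear model on alcohol alone would leave about 84% of the variation in quality unexplained.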


Reflection

Analyzing the data, we can draw the following conclusions:

  1. When alcohol percentage decreases, density grows.
  2. In general, the alcohol level of red wine is higher than that of white wine.
  3. When fixed acidity increases, the density of red wine increases as well. White wine shows almost no correlation.
  4. Total sulfur dioxide and level of residual sugar are positively correlated. The correlation is higher for white wine.
  5. White wine density and residual sugar level have a positive correlation.
  6. The alcohol level of white wine decreases as the residual sugar level grows.
  7. Wine with a high alcohol percentage mostly has quality level 7; wine with a lower alcohol percentage mostly has quality level 5.
  8. The most frequent quality levels of red and white wine are 5 and 6.

According to my investigation, I may conclude that experts’ decisions on wine quality levels are based on their personal tastes or could depend on other variables such as year of production, grape type, wine brand, etc., as only one variable (alcohol level) correlates with wine quality.

For future exploration of this data I would pick one category of wine (for example, quality level 3-4, 5-7, or 8-9) and look at the patterns that appear in each of these three buckets. I would also normalize the data, because we have more white wine than red wine.
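
As a starting point for that follow-up, the three buckets could be built with `cut`. The sketch below applies it to the quality counts printed earlier; the bucket labels are my own naming.

```r
# Quality levels and their counts, copied from table(data$quality)
quality <- 3:9
counts  <- c(30, 216, 2138, 2836, 1079, 193, 5)

# Intervals are (2,4], (4,7] and (7,9], i.e. 3-4, 5-7 and 8-9
bucket <- cut(quality,
              breaks = c(2, 4, 7, 9),
              labels = c("low (3-4)", "medium (5-7)", "high (8-9)"))

# Number of wines and share of the sample in each bucket
bucket_counts <- tapply(counts, bucket, sum)
bucket_counts
round(bucket_counts / sum(bucket_counts), 3)
```

The low and high buckets are small compared with the medium one, so patterns found there would rest on few samples. For the normalization by color, something like `prop.table(table(data$color, bucket), margin = 1)` on a per-row `bucket` column would give within-color shares; that line is an untested suggestion.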