White Wine Quality Exploration by Dogan Askan

Univariate Plots Section

This report explores a dataset containing quality and attributes for approximately 4,900 white wines.

## 'data.frame':    4898 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...
##        X        fixed.acidity    volatile.acidity  citric.acid    
##  Min.   :   1   Min.   : 3.800   Min.   :0.0800   Min.   :0.0000  
##  1st Qu.:1225   1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700  
##  Median :2450   Median : 6.800   Median :0.2600   Median :0.3200  
##  Mean   :2450   Mean   : 6.855   Mean   :0.2782   Mean   :0.3342  
##  3rd Qu.:3674   3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900  
##  Max.   :4898   Max.   :14.200   Max.   :1.1000   Max.   :1.6600  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.600   Min.   :0.00900   Min.   :  2.00     
##  1st Qu.: 1.700   1st Qu.:0.03600   1st Qu.: 23.00     
##  Median : 5.200   Median :0.04300   Median : 34.00     
##  Mean   : 6.391   Mean   :0.04577   Mean   : 35.31     
##  3rd Qu.: 9.900   3rd Qu.:0.05000   3rd Qu.: 46.00     
##  Max.   :65.800   Max.   :0.34600   Max.   :289.00     
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  9.0        Min.   :0.9871   Min.   :2.720   Min.   :0.2200  
##  1st Qu.:108.0        1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100  
##  Median :134.0        Median :0.9937   Median :3.180   Median :0.4700  
##  Mean   :138.4        Mean   :0.9940   Mean   :3.188   Mean   :0.4898  
##  3rd Qu.:167.0        3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500  
##  Max.   :440.0        Max.   :1.0390   Max.   :3.820   Max.   :1.0800  
##     alcohol         quality     
##  Min.   : 8.00   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.40   Median :6.000  
##  Mean   :10.51   Mean   :5.878  
##  3rd Qu.:11.40   3rd Qu.:6.000  
##  Max.   :14.20   Max.   :9.000

It looks like there are just 19 zeros in Citric Acid, and no NA values in the entire dataset.

## [1] "Non-Zero values in columns"
## [1] "4898 X"
## [1] "4898 fixed.acidity"
## [1] "4898 volatile.acidity"
## [1] "4879 citric.acid" "19 citric.acid"  
## [1] "4898 residual.sugar"
## [1] "4898 chlorides"
## [1] "4898 free.sulfur.dioxide"
## [1] "4898 total.sulfur.dioxide"
## [1] "4898 density"
## [1] "4898 pH"
## [1] "4898 sulphates"
## [1] "4898 alcohol"
## [1] "4898 quality"
## [1] "Non-NA values in columns"
## [1] "4898 X"
## [1] "4898 fixed.acidity"
## [1] "4898 volatile.acidity"
## [1] "4898 citric.acid"
## [1] "4898 residual.sugar"
## [1] "4898 chlorides"
## [1] "4898 free.sulfur.dioxide"
## [1] "4898 total.sulfur.dioxide"
## [1] "4898 density"
## [1] "4898 pH"
## [1] "4898 sulphates"
## [1] "4898 alcohol"
## [1] "4898 quality"
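
The zero and NA counts above come from a simple per-column tally. Below is a minimal sketch of such a tally in Python (the report itself was produced in R). The `column_report` helper and the tiny `toy` table are hypothetical and are not taken from the dataset.

```python
import math

def is_missing(value):
    """Treat None and float NaN as missing (roughly what NA means in R)."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def column_report(columns):
    """Return {column name: (non-zero count, zero count, non-missing count)}."""
    report = {}
    for name, values in columns.items():
        present = [v for v in values if not is_missing(v)]
        zeros = sum(1 for v in present if v == 0)
        report[name] = (len(present) - zeros, zeros, len(present))
    return report

# Hypothetical toy table, only to show the shape of the result.
toy = {
    "citric.acid": [0.36, 0.0, 0.40, 0.32],
    "pH": [3.0, 3.3, float("nan"), 3.19],
}

for name, (non_zero, zeros, non_missing) in column_report(toy).items():
    print(f"{name}: {non_zero} non-zero, {zeros} zero, {non_missing} non-NA")
```

For Citric Acid in the real data, the output above reports 4879 non-zero and 19 zero values, which add up to the 4898 rows.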

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.800   6.300   6.800   6.855   7.300  14.200

Fixed Acidity looks approximately normally distributed, with very minor skewness and just a few outliers. Fixed Acidity may be helpful in further analysis due to its relation with pH and Volatile Acidity.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0800  0.2100  0.2600  0.2782  0.3200  1.1000

Volatile Acidity also looks roughly normally distributed, although it is positively skewed and has more than a few outliers. Volatile Acidity may be helpful in further analysis due to its relation with pH and Fixed Acidity.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.2700  0.3200  0.3342  0.3900  1.6600

Interestingly, there are relatively many wines with Citric Acid values of 0.49 and 0.74. Apart from that, the distributions look normal so far. There is an extreme outlier here with value 1.66. Since the acidity attributes (i.e. Fixed Acidity, Volatile Acidity and Citric Acid) are related to each other, I believe those three may be significant in further analysis.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.600   1.700   5.200   6.391   9.900  65.800

For Residual Sugar, values are most concentrated below 3 and half of all values are below 5.2, even though there are relatively huge outliers such as 65.8. As also seen above, Residual Sugar is highly skewed. So, it may be helpful to apply a log10 transformation for further analysis.
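
To make the proposed transformation concrete, here is a minimal sketch in Python (the report itself was produced in R). The `log10_transform` helper is a hypothetical name. The sample is the first ten Residual Sugar values shown in the structure output above, not the full column.

```python
import math

def log10_transform(values):
    """Apply a base-10 logarithm; every value must be strictly positive."""
    if any(v <= 0 for v in values):
        raise ValueError("log10 is undefined for zero or negative values")
    return [math.log10(v) for v in values]

# First ten residual.sugar values from the structure output (not the full data).
residual_sugar_head = [20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7, 20.7, 1.6, 1.5]

for raw, logged in zip(residual_sugar_head, log10_transform(residual_sugar_head)):
    print(f"{raw:5.1f} -> {logged:.3f}")

# The reported minimum and maximum on the log scale.
print(round(math.log10(0.6), 3), round(math.log10(65.8), 3))
```

Because the smallest Residual Sugar value is 0.6, every value is positive and the transformation is defined for the whole column. The long right tail up to 65.8 is compressed to about 1.82 on the log scale.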

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9871  0.9917  0.9937  0.9940  0.9961  1.0390
## [1] "Lower bound of range: 0.98878"
## [1] "Upper bound of range: 1.0003"
## [1] "Standard Deviation: 0.00279160061693161"

To get a better view of Density, the x-axis is limited to omit outliers. Although the standard deviation is very small, I think Density may be somewhat correlated with Quality. There are some peaks across the distribution. However, the mean and median are very close to each other, and the chart itself does not look skewed. Excluding outliers, the range and variance of Density are relatively small.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.720   3.090   3.180   3.188   3.280   3.820

pH also looks normally distributed, with no extreme outliers and no skewness. Half of the pH values are between 3.09 and 3.28, with mean 3.1882666.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00    9.50   10.40   10.51   11.40   14.20

Alcohol values are positively skewed. There are no extreme outliers, as expected. The values range from 8 to 14.2. From gut feeling, I believe Alcohol plays a crucial role in the quality of a white wine.
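
The report does not say how outliers were judged, so the following is only an assumed rule, not the method used for the charts. It is a minimal Python sketch of Tukey-style fences (1.5 and 3 times the interquartile range), applied to the summary values printed above. The helper names `iqr_fences` and `all_within` are hypothetical.

```python
def iqr_fences(q1, q3, k):
    """Return the (lower, upper) fences at k interquartile ranges from the quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def all_within(minimum, q1, q3, maximum, k):
    """True when both the minimum and the maximum lie inside the fences."""
    low, high = iqr_fences(q1, q3, k)
    return low <= minimum and maximum <= high

# (min, 1st quartile, 3rd quartile, max) copied from the summaries above.
summaries = {
    "residual.sugar": (0.600, 1.700, 9.900, 65.800),
    "pH": (2.720, 3.090, 3.280, 3.820),
    "alcohol": (8.00, 9.50, 11.40, 14.20),
}

for name, (mn, q1, q3, mx) in summaries.items():
    mild = all_within(mn, q1, q3, mx, k=1.5)
    extreme = all_within(mn, q1, q3, mx, k=3.0)
    print(f"{name}: inside 1.5*IQR fences = {mild}, inside 3*IQR fences = {extreme}")
```

Under this assumed rule, Alcohol stays inside even the 1.5 fences. pH has values beyond the 1.5 fences but none beyond the 3 fences. Residual Sugar's maximum of 65.8 lies beyond both.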

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.878   6.000   9.000

Quality also looks roughly normally distributed, as expected, with a range of 6 (from 3 to 9). It is interesting that there are just a few white wines with a Quality score of 3 or 9. So, there is no perfect white wine! There are also no white wines with a Quality score of 0, 1 or 2.

Univariate Analysis

The structure of the dataset

There are 4,898 white wines in the dataset with 12 features (Fixed Acidity, Volatile Acidity, Citric Acid, Residual Sugar, Chlorides, Free Sulfur Dioxide, Total Sulfur Dioxide, Density, pH, Sulphates, Alcohol, and Quality). The 13th column in the output above, X, is only a row index. All variables are numeric.

Other observations:

  • Fixed Acidity is mostly about 7.
  • 90% of Volatile Acidity values are below 0.4.
  • There are two unusual peaks in Citric Acid values at 0.49 and 0.74.
  • Residual Sugar is the most positively skewed attribute.
  • Although the maximum of Chlorides values is 0.346, 92% of Chlorides values are below 0.06.
  • The range of Free Sulfur Dioxide is 287, and the mean is 35.3080849.
  • The mean of Total Sulfur Dioxide is 138.3606574.
  • The range of Density is 17.3425658 times greater than its standard deviation.

The main features of interest in the dataset

The main features in the dataset are Residual Sugar, Density and Alcohol. I would like to determine which features matter most for predicting Quality. I do not think every variable in the dataset is needed to build a model that predicts Quality.

Other features in the dataset that will help support the investigation into the features of interest

The acidity attributes (i.e. Fixed Acidity, Volatile Acidity and Citric Acid), as well as pH, may also be significant for predicting Quality.

The new variable created from existing variables in the dataset

Later in the analysis, I will add a factor version of Quality and some bucketed variables.
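
The buckets themselves are not specified at this point, so the following Python sketch only illustrates the idea (the report itself was produced in R). The cut points `QUALITY_EDGES`, the labels and the `bucket` helper are all hypothetical choices, not the ones used later in the report.

```python
from bisect import bisect_left

# Hypothetical cut points: scores up to 4 are "low", 5 to 6 "medium", 7 and above "high".
QUALITY_EDGES = [4, 6]
QUALITY_LABELS = ["low", "medium", "high"]

def bucket(value, edges=QUALITY_EDGES, labels=QUALITY_LABELS):
    """Map a numeric value to a label using right-closed intervals."""
    if len(labels) != len(edges) + 1:
        raise ValueError("need exactly one more label than edges")
    return labels[bisect_left(edges, value)]

# Quality scores spanning the observed range of 3 to 9.
for score in range(3, 10):
    print(score, bucket(score))
```

Using `bisect_left` makes each interval include its upper edge, so a score of 4 still counts as "low" and a score of 6 as "medium".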

Unusual distributions, operations on the data to tidy, adjust, or change the form of the data

For better visualization, I mostly limit the x-axis to exclude outliers. I also apply a log10 transformation to Residual Sugar since it is highly skewed. After the transformation, it looks like a bimodal distribution with two peaks around 1.2 and 10.

Bivariate Plots Section

In the correlation plot, “X” marks the pairs with insignificant p-values. The Residual Sugar - Density and Alcohol - Density pairs have strong correlations, while the Density - Quality and Alcohol - Quality pairs are only weakly correlated.

## 
##  Pearson's product-moment correlation
## 
## data:  df$quality and df$fixed.acidity
## t = -8.005, df = 4896, p-value = 1.48e-15
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.14121974 -0.08592991
## sample estimates:
##        cor 
## -0.1136628

According to this box plot, lower Fixed Acidity goes with better Quality in the Quality range 3-7. However, although the correlation for this pair is statistically significant (p-value = 1.48e-15), it is very weak (r = -0.1136628). For better visualization, the lowest and highest 1% are omitted to remove outliers.
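
Omitting the lowest and highest 1% amounts to keeping only the values between the 1st and 99th percentiles. Here is a minimal Python sketch of that idea (the report itself was produced in R). The helper names are hypothetical, the interpolation rule is an assumption, and the input is a made-up stand-in rather than the wine data.

```python
from statistics import quantiles

def percentile_limits(values, lower_pct=1, upper_pct=99):
    """Return the (lower, upper) percentile cut points of the values."""
    # n=100 yields 99 cut points; index p - 1 holds the p-th percentile.
    cuts = quantiles(values, n=100, method="inclusive")
    return cuts[lower_pct - 1], cuts[upper_pct - 1]

def trim(values, lower_pct=1, upper_pct=99):
    """Keep only the values between the two percentile limits (inclusive)."""
    low, high = percentile_limits(values, lower_pct, upper_pct)
    return [v for v in values if low <= v <= high]

# Made-up stand-in data: the integers 1..101, not values from the dataset.
stand_in = list(range(1, 102))

low, high = percentile_limits(stand_in)
kept = trim(stand_in)
print(f"limits: {low} to {high}; kept {len(kept)} of {len(stand_in)} values")
```

The same limits could be passed to a plotting library as axis limits instead of filtering the data, which keeps the underlying values untouched.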

## 
##  Pearson's product-moment correlation
## 
## data:  df$quality and df$total.sulfur.dioxide
## t = -12.418, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2017563 -0.1474524
## sample estimates:
##        cor 
## -0.1747372

Although the correlation between Total Sulfur Dioxide and Quality is weak (r = -0.1747372), the box plot is interesting. The range tends to decrease as Quality increases. So, Total Sulfur Dioxide may also be useful for predicting the Quality score.
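
The test output above can be checked by hand from the correlation and the sample size alone. The following Python sketch (the report itself was produced in R) recomputes the t statistic and the 95 percent confidence interval using the standard t formula and Fisher's z transformation. The function names are hypothetical.

```python
import math
from statistics import NormalDist

def pearson_t(r, n):
    """t statistic for testing that a Pearson correlation is zero (n - 2 df)."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)

def fisher_ci(r, n, level=0.95):
    """Confidence interval for r via Fisher's z transformation."""
    z = math.atanh(r)
    half_width = NormalDist().inv_cdf(0.5 + level / 2) / math.sqrt(n - 3)
    return math.tanh(z - half_width), math.tanh(z + half_width)

# Values taken from the output above: quality vs total.sulfur.dioxide.
r, n = -0.1747372, 4898

t = pearson_t(r, n)
low, high = fisher_ci(r, n)
print(f"t = {t:.3f} on {n - 2} degrees of freedom")
print(f"95% CI: {low:.7f} to {high:.7f}")
```

With 4,898 observations, `sqrt(n - 2)` is about 70, so even a small correlation produces a large t statistic and a tiny p-value. The p-value therefore says the correlation is unlikely to be exactly zero, not that it is strong.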

## 
##  Pearson's product-moment correlation
## 
## data:  df$quality and df$density
## t = -22.581, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.3322718 -0.2815385
## sample estimates:
##        cor 
## -0.3071233

In the above box plot, the lowest 1% and highest 21% are omitted to remove outliers and make the visualization clearer. There is a weak negative correlation between Density and Quality, with the value -0.3071233. So, this attribute may be useful for predicting the Quality score. The relation is especially noticeable for Quality scores above 5.
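
One way to put these coefficients in perspective is to square them: r squared is the share of variance in Quality that a straight-line fit on a single attribute would account for. The short Python sketch below (the report itself was produced in R) does this for the three correlations reported so far. The `shared_variance` name is hypothetical.

```python
def shared_variance(r):
    """Squared Pearson correlation: share of variance explained by a linear fit."""
    return r * r

# Correlations with quality, copied from the test outputs above.
correlations = {
    "fixed.acidity": -0.1136628,
    "total.sulfur.dioxide": -0.1747372,
    "density": -0.3071233,
}

for name, r in correlations.items():
    print(f"{name}: r = {r:+.3f}, r^2 = {shared_variance(r):.3f}")
```

For Density, r squared is about 0.094, so Density alone would account for roughly 9% of the variation in Quality under a linear fit. This is consistent with calling the correlation weak.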

## 
##  Pearson's product-moment correlation
## 
## data:  df$quality and df$pH
## t = 6.9917, df = 4896, p-value = 3.081e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.07162022 0.12707983
## sample estimates:
##        cor 
## 0.09942725

pH decreases with Quality at low scores such as 3, 4 and 5, and then the relationship turns positive. However, with a coefficient of only 0.0994272, it is hard to say that there is a meaningful overall correlation.
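
A pattern that changes direction across Quality scores is easier to see from per-score medians, which is what the box plot's centre lines show. Below is a minimal Python sketch of that grouping (the report itself was produced in R). The `median_by_group` helper is hypothetical, and the rows are made-up toy values, not wine data.

```python
from collections import defaultdict
from statistics import median

def median_by_group(rows):
    """rows is an iterable of (group, value); return {group: median value}."""
    groups = defaultdict(list)
    for group, value in rows:
        groups[group].append(value)
    return {group: median(values) for group, values in sorted(groups.items())}

# Made-up (quality, pH) pairs, only to show the shape of the result.
toy_rows = [(3, 3.2), (3, 3.3), (5, 3.1), (5, 3.2), (7, 3.2), (7, 3.4)]

for quality, med in median_by_group(toy_rows).items():
    print(f"quality {quality}: median pH {med:.2f}")
```

A single Pearson coefficient averages over the whole range, so a relationship that falls and then rises can yield a value near zero even when the medians clearly move.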

## 
##  Pearson's product-moment correlation
## 
## data:  df$quality and df$alcohol
## t = 33.858, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4126015 0.4579941
## sample estimates:
##       cor 
## 0.4355747

The box plot suggests that Alcohol has some effect on Quality. It is negatively correlated at low Quality scores such as 3, 4 and 5, then it turns to a positive correlation, pH’s effect on Quality also