Red Wine Quality Exploratory Analysis

by Asmaa Jirani

Data by

P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236. Available at: [@Elsevier] http://dx.doi.org/10.1016/j.dss.2009.05.016 [Pre-press (pdf)] http://www3.dsi.uminho.pt/pcortez/winequality09.pdf [bib] http://www3.dsi.uminho.pt/pcortez/dss09.bib

This report explores a dataset of 1599 red wines, with 11 variables describing the chemical properties of each wine and a quality rating for each wine.

Data dimensions

## [1] 1599   13

Data structure

## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...

Data Summary

##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000

Our data consists of 1599 observations of 13 variables: the 11 chemical properties plus X, a row index, and quality.

Univariate Plots Section

Most observations have an average quality score of 5 or 6, and there are only a few samples of low-quality and high-quality wines, which may affect my analysis. What characteristics does a good-quality wine have? Is it the sourness, the sweetness or the level of alcohol that makes the best wines?

The distribution of fixed acidity is slightly right-skewed and centered around 8.

Volatile acidity is present in small amounts in our wines, with a mean of about 0.53, and the distribution looks bimodal, with peaks near 0.4 and 0.6.

## [1] 0.08255159

About 8.3% of our wines (132 of 1599) have no citric acid. Does that mean citric acid is not necessary in wines, or is there a problem in the data?
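The share printed above is the kind of figure that a logical comparison produces directly. A minimal sketch, using only the first ten citric.acid values from the structure printout (the full column has 1599 values, so the result here differs from the reported one); the variable names are my own:

```r
# First ten citric.acid values, copied from the str() printout above.
citric_acid <- c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36)

# The mean of a logical vector is the proportion of TRUE values.
zero_share <- mean(citric_acid == 0)
zero_count <- sum(citric_acid == 0)

print(zero_share)
print(zero_count)
```

On the full dataset the same expression would be applied to the citric.acid column of the data frame.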

Most of our wines have residual sugar close to the median of 2.2, and the third quartile is only 2.6. There are a lot of outliers in the higher range; if they are removed, the distribution looks roughly normal, as shown below.

Similar to residual sugar, the distribution of chlorides is long-tailed and concentrated at the lower values: 75% of the wines in our data have a chloride (salt) level below 0.09.

The distributions of both free sulfur dioxide and total sulfur dioxide are right-skewed, with a long tail and a few outliers. I removed the outliers to see the data more clearly. I wonder whether these variables affect the quality of the wine somehow?

Density and pH have symmetric distributions with a few outliers on both sides.

The distribution of sulphates is similar to those of residual sugar and chlorides: it is right-skewed and long-tailed, with outliers. The second pair of plots shows the data after removing outliers. I wonder if there is a correlation between these variables?

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.20   10.42   11.10   14.90

Most observations have an alcohol content between 9% and 12%, with a mean of 10.42%. The median (10.20%) and the mean are close.

Univariate Analysis

What is the structure of your dataset?

There are 1599 observations of wines in the dataset with 12 features, not counting the row index X. There is one categorical variable (quality, stored as an integer score); the others are numerical variables that describe the physical and chemical properties of the wine. Quality is a rating of each wine between 0 (very bad) and 10 (very excellent).

Other observations:

Most wines have an average quality score, with a median of 6.
75% of the wines in this data have a residual sugar level of 2.6 g/l or less.
Mean alcohol level is 10.42%.
Mean fixed acidity is 8.32 and mean pH is 3.31.

What is/are the main feature(s) of interest in your dataset?

The main feature of interest in this data is quality. I want to determine which wine characteristics affect the quality score.

Did you create any new variables from existing variables in the dataset?

I created a new variable named rating from the quality score to better categorize the quality and better study and visualize the different attributes of the wine.
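A minimal sketch of how such a rating variable could be derived, assuming the cut points given in the Reflection section (1-4 bad, 5-6 average, 7-10 good). The use of `cut()` is my assumption; the report does not say which function was used.

```r
# One example score for each value in the observed range (3 to 8).
quality <- c(3, 4, 5, 6, 7, 8)

# Intervals are (0,4], (4,6], (6,10]; labels follow the Reflection section.
rating <- cut(quality,
              breaks = c(0, 4, 6, 10),
              labels = c("bad", "average", "good"),
              ordered_result = TRUE)

print(data.frame(quality, rating))
```

Making the factor ordered keeps the levels in bad, average, good order in plots and grouped summaries.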

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

The distribution of citric acid was different from the rest, with 132 samples having no citric acid at all. I removed outliers from all variables that have long-tailed distributions.
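The report does not state which rule was used to remove outliers. One common option, assumed here purely for illustration, is to drop values above a high percentile before plotting. The 95% cutoff and the variable names are hypothetical; the values are the first ten residual.sugar entries from the structure printout.

```r
# First ten residual.sugar values, copied from the str() printout.
residual_sugar <- c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2, 6.1)

# Hypothetical rule: keep values at or below the 95th percentile.
cutoff <- quantile(residual_sugar, 0.95)
trimmed <- residual_sugar[residual_sugar <= cutoff]

print(cutoff)
print(trimmed)
```

Trimming like this is only for visualizing the bulk of the distribution; the dropped wines remain in the data used for the correlations and the model unless they are removed there too.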

Bivariate Plots Section

Correlation between all variables:

##                      fixed.acidity volatile.acidity citric.acid
## fixed.acidity           1.00000000     -0.256130895  0.67170343
## volatile.acidity       -0.25613089      1.000000000 -0.55249568
## citric.acid             0.67170343     -0.552495685  1.00000000
## residual.sugar          0.11477672      0.001917882  0.14357716
## chlorides               0.09370519      0.061297772  0.20382291
## free.sulfur.dioxide    -0.15379419     -0.010503827 -0.06097813
## total.sulfur.dioxide   -0.11318144      0.076470005  0.03553302
## density                 0.66804729      0.022026232  0.36494718
## pH                     -0.68297819      0.234937294 -0.54190414
## sulphates               0.18300566     -0.260986685  0.31277004
## alcohol                -0.06166827     -0.202288027  0.10990325
## quality                 0.12405165     -0.390557780  0.22637251
##                      residual.sugar    chlorides free.sulfur.dioxide
## fixed.acidity           0.114776724  0.093705186        -0.153794193
## volatile.acidity        0.001917882  0.061297772        -0.010503827
## citric.acid             0.143577162  0.203822914        -0.060978129
## residual.sugar          1.000000000  0.055609535         0.187048995
## chlorides               0.055609535  1.000000000         0.005562147
## free.sulfur.dioxide     0.187048995  0.005562147         1.000000000
## total.sulfur.dioxide    0.203027882  0.047400468         0.667666450
## density                 0.355283371  0.200632327        -0.021945831
## pH                     -0.085652422 -0.265026131         0.070377499
## sulphates               0.005527121  0.371260481         0.051657572
## alcohol                 0.042075437 -0.221140545        -0.069408354
## quality                 0.013731637 -0.128906560        -0.050656057
##                      total.sulfur.dioxide     density          pH
## fixed.acidity                 -0.11318144  0.66804729 -0.68297819
## volatile.acidity               0.07647000  0.02202623  0.23493729
## citric.acid                    0.03553302  0.36494718 -0.54190414
## residual.sugar                 0.20302788  0.35528337 -0.08565242
## chlorides                      0.04740047  0.20063233 -0.26502613
## free.sulfur.dioxide            0.66766645 -0.02194583  0.07037750
## total.sulfur.dioxide           1.00000000  0.07126948 -0.06649456
## density                        0.07126948  1.00000000 -0.34169933
## pH                            -0.06649456 -0.34169933  1.00000000
## sulphates                      0.04294684  0.14850641 -0.19664760
## alcohol                       -0.20565394 -0.49617977  0.20563251
## quality                       -0.18510029 -0.17491923 -0.05773139
##                         sulphates     alcohol     quality
## fixed.acidity         0.183005664 -0.06166827  0.12405165
## volatile.acidity     -0.260986685 -0.20228803 -0.39055778
## citric.acid           0.312770044  0.10990325  0.22637251
## residual.sugar        0.005527121  0.04207544  0.01373164
## chlorides             0.371260481 -0.22114054 -0.12890656
## free.sulfur.dioxide   0.051657572 -0.06940835 -0.05065606
## total.sulfur.dioxide  0.042946836 -0.20565394 -0.18510029
## density               0.148506412 -0.49617977 -0.17491923
## pH                   -0.196647602  0.20563251 -0.05773139
## sulphates             1.000000000  0.09359475  0.25139708
## alcohol               0.093594750  1.00000000  0.47616632
## quality               0.251397079  0.47616632  1.00000000
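A matrix like the one above can be produced with `cor()` after dropping the index column X. The sketch below uses a small stand-in data frame built from the first ten rows of the structure printout and only three of the variables, so its numbers will not match the full-data correlations.

```r
# Stand-in data: first ten rows of three columns from the str() printout.
rw <- data.frame(
  X = 1:10,
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Pearson correlations between all columns except the row index.
cor_matrix <- cor(rw[, names(rw) != "X"])

# Correlations with quality, strongest positive first.
quality_cor <- sort(cor_matrix["quality", ], decreasing = TRUE)

print(round(cor_matrix, 3))
print(round(quality_cor, 3))
```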

From the plot and the numbers above, I don’t see any strong correlation between quality and any single variable (the largest is 0.48, with alcohol), which suggests that the quality of the wine depends on a combination of these attributes.

I want to look more closely at each variable that appears to affect the quality score, mainly alcohol, fixed acidity and volatile acidity, and I will also look at the others.

First I will study the variables that have a positive correlation with quality.

From the boxplots above we can see that wines with high quality ratings have higher amounts of alcohol, fixed acidity, citric acid and sulphates.

The following set of variables has a negative correlation with quality.

As volatile acidity, pH and density decrease, the quality increases.

The following variables seem to have no direct effect on the quality of wines.

For these variables, a change doesn’t seem to affect the rating, except for chlorides, where I notice that good-quality wines seem to have smaller amounts.

The above plot includes the variables that correlate with quality and also illustrates the relationships between these variables.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

The positive correlation between alcohol and quality suggests that tasters prefer more alcohol in wines. It is also illustrated in the boxplot, where the median alcohol for good-quality wines is above the medians for bad and average quality.

Quality also correlates positively with citric acid and fixed acidity: wines with higher acidity seem to get better quality scores, and the “fresh” taste coming from the citric acid appears to be preferred by tasters. Wines with higher amounts of sulphates have slightly higher quality.

High amounts of volatile acidity are considered undesirable in wines, but a touch of it is no bad thing. That is confirmed in the boxplot above.

Quality and pH correlate negatively, although only weakly (-0.06): lower pH levels get slightly better quality scores. Good wines seem to have lower density, which is consistent with their higher level of alcohol.

The other variables like free sulfur dioxide, total sulfur dioxide, residual sugar and chlorides seem to have no direct effect on the quality of wines from this dataset.

As sweetness is one of the characteristics of wine, I was surprised that it does not affect the quality in this dataset.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

Fixed acidity and citric acid tend to correlate negatively with pH, which is expected, as pH is a measure of acidity. But what doesn’t make sense to me is that pH correlates positively with volatile acidity; there might be a lurking variable.

Also, the negative correlation between density and alcohol reflects the fact that wines with more alcohol are less dense.

The fact that citric acid is one of the predominant fixed acids explains the strong positive correlation between the two variables.

There is a positive correlation between free sulfur dioxide and total sulfur dioxide, as the total is the sum of the free and bound forms of SO2.

What was the strongest relationship you found?

The strongest relationship I found is between fixed acidity and pH, with a correlation of -0.68.
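The strongest pair can also be picked out of a correlation matrix programmatically rather than by eye. A small sketch, using four of the variables with values rounded from the matrix printed earlier; the helper names are my own:

```r
vars <- c("fixed.acidity", "citric.acid", "density", "pH")

# Correlations rounded to four decimals from the printed matrix.
m <- matrix(c(
   1.0000,  0.6717,  0.6680, -0.6830,
   0.6717,  1.0000,  0.3649, -0.5419,
   0.6680,  0.3649,  1.0000, -0.3417,
  -0.6830, -0.5419, -0.3417,  1.0000
), nrow = 4, byrow = TRUE, dimnames = list(vars, vars))

# Ignore the diagonal, then locate the largest absolute correlation.
off_diag <- m
diag(off_diag) <- 0
idx <- which(abs(off_diag) == max(abs(off_diag)), arr.ind = TRUE)[1, ]

strongest_pair <- c(rownames(m)[idx["row"]], colnames(m)[idx["col"]])
strongest_value <- m[idx["row"], idx["col"]]

print(strongest_pair)
print(strongest_value)
```

Comparing absolute values matters here, because the strongest relationship is a negative one.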

Multivariate Plots Section

In the next analysis I will combine the variables that correlate with each other and the feature of interest, quality, in the same plots.

These density plots seem to tell the same story as the boxplots in the previous analysis, but interestingly, in these density plots I see that alcohol, volatile acidity, citric acid and sulphates are the ones characterizing the good-quality wines. Let’s find out!

## rw$rating: bad
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9934  0.9957  0.9966  0.9967  0.9977  1.0010 
## -------------------------------------------------------- 
## rw$rating: average
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9901  0.9958  0.9968  0.9969  0.9979  1.0037 
## -------------------------------------------------------- 
## rw$rating: good
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9906  0.9947  0.9957  0.9960  0.9973  1.0032

The graph above compares alcohol to density; we can see that as alcohol increases, quality tends to get better. I notice that density has only a slight impact on quality, which may be because of the natural relationship between the two: the more alcohol a wine contains, the less dense it is. For this reason I will eliminate density as one of the variables that affect the rating of red wines.
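The per-rating density summaries printed before this paragraph have the layout that `by()` produces. A sketch of that kind of grouped summary, using toy density values that are made up for illustration and not taken from the dataset:

```r
# Toy values for illustration only, not taken from the dataset.
rw <- data.frame(
  density = c(0.9970, 0.9980, 0.9968, 0.9972, 0.9950, 0.9960),
  rating  = factor(c("bad", "bad", "average", "average", "good", "good"),
                   levels = c("bad", "average", "good"))
)

# Full five-number summary plus mean for each rating group.
density_by_rating <- by(rw$density, rw$rating, summary)

# A more compact alternative: one statistic per group.
median_by_rating <- tapply(rw$density, rw$rating, median)

print(density_by_rating)
print(median_by_rating)
```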

This graph compares fixed acidity to citric acid. We can see that most of the yellow points (good quality) are above the smooth line, where both variables increase, but citric acid in particular. I may need to look deeper into other acids to confirm this, but using this dataset I will keep citric acid as a variable contributing to the quality of the wine and discard fixed acidity in further analysis.

Wines with a higher level of alcohol seem to have lower amounts of volatile acidity, which supports the idea that keeping this variable low is important to the quality of wines.

Sulphates behave in the opposite way: a small amount can make wines better, and we see that it decreases slightly as alcohol increases.

Having arrived at 4 variables as major factors in the quality of wines, I am going to fit a linear model.

Linear model

## 
## Calls:
## m1: lm(formula = I(quality) ~ I(alcohol), data = rw)
## m2: lm(formula = I(quality) ~ I(alcohol) + citric.acid, data = rw)
## m3: lm(formula = I(quality) ~ I(alcohol) + citric.acid + volatile.acidity, 
##     data = rw)
## m4: lm(formula = I(quality) ~ I(alcohol) + citric.acid + volatile.acidity + 
##     sulphates, data = rw)
## 
## ============================================================================
##                          m1            m2            m3            m4       
## ----------------------------------------------------------------------------
##   (Intercept)           1.875***      1.830***      3.055***      2.646***  
##                        (0.175)       (0.171)       (0.194)       (0.201)    
##   I(alcohol)            0.361***      0.346***      0.314***      0.309***  
##                        (0.017)       (0.016)       (0.016)       (0.016)    
##   citric.acid                         0.730***      0.068        -0.079     
##                                      (0.090)       (0.103)       (0.104)    
##   volatile.acidity                                 -1.343***     -1.265***  
##                                                    (0.114)       (0.113)    
##   sulphates                                                       0.696***  
##                                                                  (0.103)    
## ----------------------------------------------------------------------------
##   R-squared             0.227         0.257         0.317         0.336     
##   adj. R-squared        0.226         0.256         0.316         0.334     
##   sigma                 0.710         0.696         0.668         0.659     
##   F                   468.267       276.595       246.976       201.777     
##   p                     0.000         0.000         0.000         0.000     
##   Log-likelihood    -1721.057     -1688.711     -1621.596     -1599.093     
##   Deviance            805.870       773.917       711.603       691.852     
##   AIC                3448.114      3385.421      3253.192      3210.186     
##   BIC                3464.245      3406.930      3280.078      3242.448     
##   N                  1599          1599          1599          1599         
## ============================================================================
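The calls listed at the top of the table show models that grow by one predictor at a time. A minimal sketch of that pattern with base R, using a stand-in data frame built from the first ten rows of the structure printout, so the coefficients will not match the table. The names fit_a and fit_b are hypothetical, and the `I()` wrappers from the printed calls are omitted because they have no effect on a bare variable.

```r
# Stand-in data: first ten rows of three columns from the str() printout.
rw <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  quality = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Start with a single predictor, then add one more to the same model.
fit_a <- lm(quality ~ alcohol, data = rw)
fit_b <- update(fit_a, . ~ . + volatile.acidity)

# Compare the fit of the nested models.
r2 <- c(fit_a = summary(fit_a)$r.squared,
        fit_b = summary(fit_b)$r.squared)

print(coef(fit_b))
print(round(r2, 3))
```

R-squared can only stay the same or rise when a predictor is added to a nested model, which is why the adjusted R-squared, AIC and BIC rows of the table are the more useful comparison.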

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

Holding alcohol level constant, density has little effect on the quality of wines, as other factors can contribute to density. The median density varies little across ratings, and the mean is almost equal in all of them.

Good-quality wines have more citric acid. This also applies to fixed acidity, but not with the same slope. 75% of good-quality wines have a high amount of citric acid.

A high amount of volatile acidity affects the quality negatively; on the other hand, a little more sulphates tends to affect it positively.

Were there any interesting or surprising interactions between features?

I didn’t notice any surprising interactions.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

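The R-squared values of the linear models I generated are pretty low (0.336 for the largest model), but in this field of study the numbers are acceptable. I notice that the adjusted R-squared doesn’t differ much from the R-squared itself, which indicates that the models are not being inflated by many irrelevant variables. The three stars ’***’ next to the coefficients for alcohol, volatile acidity and sulphates tell me that they are statistically significant, and the F statistic is pretty high. The exception is citric acid: its coefficient is significant in m2 but loses significance once volatile acidity is added in m3 and m4, probably because the two variables are correlated (-0.55). The model can be a reasonable fit.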
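The small gap between R-squared and adjusted R-squared follows from the sample size: with 1599 observations and at most 4 predictors, the penalty term is tiny. A short check using the standard adjustment formula and the rounded values from the model table (the function name is my own):

```r
# Adjusted R-squared: 1 - (1 - R^2) * (n - 1) / (n - p - 1),
# where n is the number of observations and p the number of predictors.
adj_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

# R-squared values for m3 and m4, as printed in the model table.
adj_m3 <- adj_r2(0.317, n = 1599, p = 3)
adj_m4 <- adj_r2(0.336, n = 1599, p = 4)

print(round(adj_m3, 3))
print(round(adj_m4, 3))
```

Because the inputs are already rounded to three decimals, the results agree with the table only to that precision.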

Final Plots and Summary

Plot One

Description One

Most observations in our dataset have a quality score of 5 or 6, which means average quality on a scale from 0 to 10.

Plot Two

Description two

Level of alcohol is one of the properties that affect the quality rating, as we see in this density plot.

Plot three

Description three

Wines that have a high level of alcohol with a touch of sulphates, good amounts of citric acid and a little volatile acidity are the wines that get high quality scores.

Reflection

The red wine quality data set contains information on 1599 wines across 13 variables: 11 are chemical properties of the wines, one is the quality score each wine received, and one is a row index. My goal in this project was to determine whether the quality score depends on the variables in the dataset; in other words, how changes in these variables affect the quality score.

I started by looking at the data set and getting to understand it through a summary of each variable, then plotting the distribution of each one. I plotted the variable of interest, quality, and noticed that most of the observations have scores of 5 and 6. The quality score ranges from 3 to 8 on a scale of 0 to 10. I created a variable named rating, where I categorized wines with a score of 1-4 as bad, 5-6 as average and 7-10 as good.

Next I looked into the correlations among all variables, which showed that quality correlates positively with alcohol and negatively with volatile acidity. It also correlates slightly with other variables. I noticed that some variables correlate strongly with each other, which is why I plotted these variables together to help me discard any irrelevant or repetitive variable. I found that pH was a lurking variable, as it gives the same information as acidity.

The limitations I found in this project were with my own assumptions: I assumed that sweetness would be one of the main characteristics changing the quality score, but studying this data set showed that it is not. I also based my analysis on some information found on the web, such as the fact that citric acid is one of the fixed acids in wines. I would like to study a bigger data set of wines with more variables, including for example all the different acids in the wines.