Introduction

The purpose of this project was a preliminary exploration of wine data, with the goal of selecting the right attributes for classifying wine into one of three quality categories: poor, normal and excellent. Since the data was already in a tidy structure, little attention was paid to data wrangling; the focus was on finding relationships among the attributes and between the attributes and wine quality.

Loading The Data set

The two datasets “wineQualityReds.csv” and “wineQualityWhites.csv” were joined into one larger dataset. Because the data was already tidy, the only wrangling done was adding the wine type column, removing the index variable X and randomizing the rows.
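The loading step described above could look roughly like the following in base R. This is a minimal sketch rather than the original code: the helper name combine_wines and the seed value are assumptions, and two tiny made-up data frames stand in for the CSV files so that the example runs on its own.

```r
# Hypothetical helper: join red and white wines, tag the type,
# drop the index column X and shuffle the rows.
combine_wines <- function(red, white, seed = 42) {
  red$type <- "red"
  white$type <- "white"
  wines <- rbind(red, white)
  wines$X <- NULL                                   # remove the index variable
  wines$type <- factor(wines$type, levels = c("red", "white"))
  set.seed(seed)                                    # assumed seed, for a reproducible shuffle
  wines <- wines[sample(nrow(wines)), , drop = FALSE]
  rownames(wines) <- NULL
  wines
}

# Toy stand-ins for the real files. With the real data one would use
# read.csv("wineQualityReds.csv") and read.csv("wineQualityWhites.csv").
red   <- data.frame(X = 1:2, alcohol = c(9.4, 9.8), quality = c(5L, 5L))
white <- data.frame(X = 1:3, alcohol = c(8.8, 9.5, 10.1), quality = c(6L, 6L, 6L))

wines <- combine_wines(red, white)
str(wines)
```

Storing type as a factor keeps the later per-type summaries and facets in a fixed red/white order.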

Introducing The Data Set

## Classes 'tbl_df' and 'data.frame':   6497 obs. of  13 variables:
##  $ fixed.acidity       : num  7.4 6.2 7.7 5.6 7 6.3 5.8 7.6 6.3 6.6 ...
##  $ volatile.acidity    : num  0.39 0.3 0.39 0.26 0.62 0.23 0.12 0.18 0.27 0.735 ...
##  $ citric.acid         : num  0.23 0.17 0.28 0 0.1 0.33 0.21 0.28 0.37 0.02 ...
##  $ residual.sugar      : num  7 2.8 4.9 10.2 1.4 6.9 1.3 7.1 7.9 7.9 ...
##  $ chlorides           : num  0.033 0.04 0.035 0.038 0.071 0.052 0.056 0.041 0.047 0.122 ...
##  $ free.sulfur.dioxide : num  29 24 36 13 27 23 35 29 58 68 ...
##  $ total.sulfur.dioxide: num  126 125 109 111 63 118 121 110 215 124 ...
##  $ density             : num  0.994 0.994 0.992 0.993 0.996 ...
##  $ pH                  : num  3.14 3.01 3.19 3.44 3.28 3.23 3.32 3.2 3.19 3.47 ...
##  $ sulphates           : num  0.42 0.46 0.58 0.46 0.61 0.46 0.33 0.42 0.48 0.53 ...
##  $ alcohol             : num  10.5 9 12.2 12.4 9.2 10.4 11.4 9.2 9.5 9.9 ...
##  $ quality             : int  5 5 7 6 5 6 6 6 6 5 ...
##  $ type                : Factor w/ 2 levels "red","white": 2 2 2 2 1 2 2 2 2 1 ...

##  fixed.acidity    volatile.acidity  citric.acid     residual.sugar  
##  Min.   : 3.800   Min.   :0.0800   Min.   :0.0000   Min.   : 0.600  
##  1st Qu.: 6.400   1st Qu.:0.2300   1st Qu.:0.2500   1st Qu.: 1.800  
##  Median : 7.000   Median :0.2900   Median :0.3100   Median : 3.000  
##  Mean   : 7.215   Mean   :0.3397   Mean   :0.3186   Mean   : 5.443  
##  3rd Qu.: 7.700   3rd Qu.:0.4000   3rd Qu.:0.3900   3rd Qu.: 8.100  
##  Max.   :15.900   Max.   :1.5800   Max.   :1.6600   Max.   :65.800  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.00900   Min.   :  1.00      Min.   :  6.0       
##  1st Qu.:0.03800   1st Qu.: 17.00      1st Qu.: 77.0       
##  Median :0.04700   Median : 29.00      Median :118.0       
##  Mean   :0.05603   Mean   : 30.53      Mean   :115.7       
##  3rd Qu.:0.06500   3rd Qu.: 41.00      3rd Qu.:156.0       
##  Max.   :0.61100   Max.   :289.00      Max.   :440.0       
##     density             pH          sulphates         alcohol     
##  Min.   :0.9871   Min.   :2.720   Min.   :0.2200   Min.   : 8.00  
##  1st Qu.:0.9923   1st Qu.:3.110   1st Qu.:0.4300   1st Qu.: 9.50  
##  Median :0.9949   Median :3.210   Median :0.5100   Median :10.30  
##  Mean   :0.9947   Mean   :3.219   Mean   :0.5313   Mean   :10.49  
##  3rd Qu.:0.9970   3rd Qu.:3.320   3rd Qu.:0.6000   3rd Qu.:11.30  
##  Max.   :1.0390   Max.   :4.010   Max.   :2.0000   Max.   :14.90  
##     quality         type     
##  Min.   :3.000   red  :1599  
##  1st Qu.:5.000   white:4898  
##  Median :6.000               
##  Mean   :5.818               
##  3rd Qu.:6.000               
##  Max.   :9.000

##  [1] "fixed.acidity"        "volatile.acidity"     "citric.acid"         
##  [4] "residual.sugar"       "chlorides"            "free.sulfur.dioxide" 
##  [7] "total.sulfur.dioxide" "density"              "pH"                  
## [10] "sulphates"            "alcohol"              "quality"             
## [13] "type"

There were 6497 observations in the dataset with 11 physicochemical variables, one type attribute and one output variable, quality, which records the score given by wine experts.

Input variables (based on physicochemical tests):

fixed acidity (tartaric acid - g / dm^3)
volatile acidity (acetic acid - g / dm^3)
citric acid (g / dm^3)
residual sugar (g / dm^3)
chlorides (sodium chloride - g / dm^3)
free sulfur dioxide (mg / dm^3)
total sulfur dioxide (mg / dm^3)
density (g / cm^3)
pH
sulphates (potassium sulphate - g / dm^3)
alcohol (% by volume)
type of wine (red or white)

Output variable (based on sensory data):

quality (score between 0 and 10)

Univariate Plots

## quality
##    3    4    5    6    7    8    9 
##   30  216 2138 2836 1079  193    5

The majority of wines are of normal quality; a few are extremely poor and a few are excellent. Since the goal is to label wine as poor, normal or excellent, and to have sufficient examples of each class to work with, a new variable, bucket.quality, is created later in this analysis.

Viewing Distributions Of Attribute Variables

##   red white 
##  1599  4898

There were more white wines than red, in a ratio of roughly 3 to 1.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.800   6.400   7.000   7.215   7.700  15.900

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0800  0.2300  0.2900  0.3397  0.4000  1.5800

The distributions of both fixed acidity and volatile acidity have long positive tails, which makes their means higher than their medians and makes the median a better measure of central value. Moreover, the volatile acidity distribution is slightly bimodal.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.2500  0.3100  0.3186  0.3900  1.6600

The citric acid distribution seems bimodal and there are a few outliers too. Outliers were removed by keeping citric acid values below the 99th percentile. Another interesting thing to note is the unusual spikes around 0.0 g/dm^3 and 0.5 g/dm^3, which may indicate that a few concentrations are more common than others.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.600   1.800   3.000   5.443   8.100  65.800

Residual sugar is highly positively skewed, so I removed the top 1% of the data points in the above figure. In addition, the plot contains two peaks, a feature common to a lot of the plots and mainly due to wine type.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00900 0.03800 0.04700 0.05603 0.06500 0.61100

The bimodal distribution of chlorides is due to wine type; the second, smaller peak comes from red wines. The plot shows the typical long positive tail, with the bulk of the values between 0.03 g/dm^3 and 0.10 g/dm^3.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   17.00   29.00   30.53   41.00  289.00

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     6.0    77.0   118.0   115.7   156.0   440.0

The distributions of SO2, both free and total, were positively skewed, so their extreme values were removed by discarding the top 1% of values. Free SO2 has quite a spiky distribution, which might indicate that some levels are more common than others, or a limitation of the measurement. Also, the total SO2 frequency distribution is clearly bimodal due to wine type, as shown later in the bivariate analysis.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9871  0.9923  0.9949  0.9947  0.9970  1.0390

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2200  0.4300  0.5100  0.5313  0.6000  2.0000

The density and sulphates distributions, like the others, had long tails. Outliers in density were removed by keeping the middle 98% of its values.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.720   3.110   3.210   3.219   3.320   4.010

The pH distribution is almost normal with a small standard deviation, indicating that almost all wines have similar pH values.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00    9.50   10.30   10.49   11.30   14.90

The alcohol distribution was slightly skewed. Its mean and median are almost the same, and all wines contain some alcohol, as the minimum is 8%.

Creating new Variable: bucket.quality

## bucket.quality
##      poor    normal excellent 
##  3.786363 76.558412 19.655225

quality -> bucket.quality where:

poor equals 3 to 4 quality rating wines
normal equals 5 to 6 quality rating wines
excellent equals 7 to 9 quality rating wines

As expected, the majority falls under the normal category.
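The bucketing rule above can be checked against the quality counts reported in the univariate section. The sketch below rebuilds the quality vector from those counts and bins it with cut(); the break points are my reading of the 3 to 4, 5 to 6 and 7 to 9 ranges, and the variable names other than bucket.quality are assumptions.

```r
# Quality counts as reported in the table of the quality variable.
quality_counts <- c(`3` = 30, `4` = 216, `5` = 2138, `6` = 2836,
                    `7` = 1079, `8` = 193, `9` = 5)
quality <- rep(as.integer(names(quality_counts)), times = quality_counts)

# Intervals are (2,4], (4,6] and (6,9], i.e. 3-4, 5-6 and 7-9.
bucket.quality <- cut(quality,
                      breaks = c(2, 4, 6, 9),
                      labels = c("poor", "normal", "excellent"))

# Percentage of wines in each bucket.
bucket_pct <- 100 * as.numeric(table(bucket.quality)) / length(quality)
names(bucket_pct) <- levels(bucket.quality)
print(round(bucket_pct, 2))
```

The resulting percentages agree with the bucket.quality proportions printed above (about 3.8%, 76.6% and 19.7%).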

Univariate Analysis

What is the structure of your dataset?

There were 6497 wine observations measuring 12 attributes and 2 quality outputs (the original quality score and the derived bucket.quality).

What is/are the main feature(s) of interest in your dataset?

The majority of wines were of normal quality, so in order to predict wine quality, the quality variable was grouped into broader classes. Moreover, a few attributes had bimodal distributions, caused by the different wine types. For instance, volatile acidity, citric acid, residual sugar and total sulfur dioxide each had roughly two peaks which, upon faceting the data (explored further in the bivariate analysis), turn out to be mainly due to differences between red and white wines.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Since the goal is to construct a wine classifier, the next step is to select the important features using bivariate and multivariate graphs. Variables like free SO2, total SO2 and sulphates might carry the same information, so choosing among them would narrow down the predictors. Similarly, pH and the acidity attributes might carry the same information, so I might have to use only one of them.

Did you create any new variables from existing variables in the dataset?

Yes. I created the bucket.quality variable. Since the quality output is measured by sensory perception, keeping the finer-grained score is, in my opinion, overkill.

Of the features you investigated, were there any unusual distributions?

Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

As I mentioned above, a lot of attributes were positively skewed, and several were bimodal. The bimodality was caused by wine type, which is explored further in the bivariate plots section. While plotting the distributions, I removed the values above the 99th percentile to get rid of outliers, and in some cases, such as the density attribute, I kept the middle 98% of the values.
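The two trimming rules mentioned above can be written as small helpers. This is a sketch under assumptions: the function names trim_top and keep_middle are hypothetical, strict inequalities are used to match "values less than the 99th percentile", and missing values are not handled.

```r
# Keep values strictly below the p-th quantile (drops the top 1% by default).
trim_top <- function(x, p = 0.99) {
  x[x < quantile(x, p)]
}

# Keep values strictly between two quantiles (the middle 98% by default).
keep_middle <- function(x, lower = 0.01, upper = 0.99) {
  bounds <- quantile(x, c(lower, upper))
  x[x > bounds[1] & x < bounds[2]]
}

# Toy data, not wine data: the integers 1 to 100.
values <- 1:100
n_top    <- length(trim_top(values))      # 99 values remain
n_middle <- length(keep_middle(values))   # 98 values remain
print(c(top_trimmed = n_top, middle_kept = n_middle))
```

Trimming like this affects only what is plotted; the summary tables in this report are computed on the full data.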

Bivariate Plots Section

Visualizing Correlation Matrix of Attributes

##                      fixed.acidity volatile.acidity citric.acid
## fixed.acidity           1.00000000       0.21900826  0.32443573
## volatile.acidity        0.21900826       1.00000000 -0.37798132
## citric.acid             0.32443573      -0.37798132  1.00000000
## residual.sugar         -0.11198128      -0.19601117  0.14245123
## chlorides               0.29819477       0.37712428  0.03899801
## free.sulfur.dioxide    -0.28273543      -0.35255731  0.13312581
## total.sulfur.dioxide   -0.32905390      -0.41447619  0.19524198
## density                 0.45890998       0.27129565  0.09615393
## pH                     -0.25270047       0.26145440 -0.32980819
## sulphates               0.29956774       0.22598368  0.05619730
## alcohol                -0.09545152      -0.03764039 -0.01049349
## quality                -0.07674321      -0.26569948  0.08553172
##                      residual.sugar   chlorides free.sulfur.dioxide
## fixed.acidity           -0.11198128  0.29819477         -0.28273543
## volatile.acidity        -0.19601117  0.37712428         -0.35255731
## citric.acid              0.14245123  0.03899801          0.13312581
## residual.sugar           1.00000000 -0.12894050          0.40287064
## chlorides               -0.12894050  1.00000000         -0.19504479
## free.sulfur.dioxide      0.40287064 -0.19504479          1.00000000
## total.sulfur.dioxide     0.49548159 -0.27963045          0.72093408
## density                  0.55251695  0.36261466          0.02571684
## pH                      -0.26731984  0.04470798         -0.14585390
## sulphates               -0.18592741  0.39559331         -0.18845725
## alcohol                 -0.35941477 -0.25691558         -0.17983843
## quality                 -0.03698048 -0.20066550          0.05546306
##                      total.sulfur.dioxide     density          pH
## fixed.acidity                 -0.32905390  0.45890998 -0.25270047
## volatile.acidity              -0.41447619  0.27129565  0.26145440
## citric.acid                    0.19524198  0.09615393 -0.32980819
## residual.sugar                 0.49548159  0.55251695 -0.26731984
## chlorides                     -0.27963045  0.36261466  0.04470798
## free.sulfur.dioxide            0.72093408  0.02571684 -0.14585390
## total.sulfur.dioxide           1.00000000  0.03239451 -0.23841310
## density                        0.03239451  1.00000000  0.01168608
## pH                            -0.23841310  0.01168608  1.00000000
## sulphates                     -0.27572682  0.25947850  0.19212341
## alcohol                       -0.26573964 -0.68674542  0.12124847
## quality                       -0.04138545 -0.30585791  0.01950570
##                         sulphates      alcohol     quality
## fixed.acidity         0.299567744 -0.095451523 -0.07674321
## volatile.acidity      0.225983680 -0.037640386 -0.26569948
## citric.acid           0.056197300 -0.010493492  0.08553172
## residual.sugar       -0.185927405 -0.359414771 -0.03698048
## chlorides             0.395593307 -0.256915580 -0.20066550
## free.sulfur.dioxide  -0.188457249 -0.179838435  0.05546306
## total.sulfur.dioxide -0.275726820 -0.265739639 -0.04138545
## density               0.259478495 -0.686745422 -0.30585791
## pH                    0.192123407  0.121248467  0.01950570
## sulphates             1.000000000 -0.003029195  0.03848545
## alcohol              -0.003029195  1.000000000  0.44431852
## quality               0.038485446  0.444318520  1.00000000

Observations About The Correlogram:

Of all the attributes, wine quality is most strongly correlated with alcohol (0.44) and density (-0.31). However, alcohol and density are themselves strongly negatively correlated (-0.69), so only one of them needs to be used as a wine quality predictor. Moreover, it is the alcohol that lowers the density, since alcohol is less dense than water, hence alcohol amount is the better choice as a wine quality predictor.
Wine quality is negatively correlated with volatile acidity, as too high levels of it lead to a vinegary taste, which supports the description of the data set.
As suspected, free SO2 and total SO2 are highly correlated with each other and negatively correlated with volatile acidity.
pH is negatively correlated with fixed acidity, citric acid, total SO2 and residual sugar. The negative correlation with residual sugar makes sense, since sugar has not yet oxidized into acids. Moreover, pH is positively correlated with volatile acidity, which is a bit counter-intuitive.
Residual sugar and density are also positively correlated, which makes sense, since adding sugar ought to increase the density.
According to the description, sulphates are added to produce SO2, which acts as an antimicrobial and antioxidant, and total SO2 is present for the same purpose. Why, then, are they negatively correlated? Perhaps one is converted into the other.
It is surprising to see a positive correlation between total SO2 and residual sugar; maybe more SO2 is added to prevent sugar from being converted, and thus make sure that the wine tastes a bit sugary.
It is nice to see that volatile acidity is negatively correlated with SO2, as SO2 is added to the wine to prevent acetic acid formation. https://en.wikipedia.org/wiki/Wine_fault#Sulfur_compounds
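To make the first two observations concrete, the quality column of the correlation matrix above can be ranked by absolute value. The numbers below are copied from that matrix; the variable names quality_cor and ranked are assumptions introduced for this sketch.

```r
# Correlation of each attribute with quality, copied from the matrix above.
quality_cor <- c(
  fixed.acidity        = -0.07674321,
  volatile.acidity     = -0.26569948,
  citric.acid          =  0.08553172,
  residual.sugar       = -0.03698048,
  chlorides            = -0.20066550,
  free.sulfur.dioxide  =  0.05546306,
  total.sulfur.dioxide = -0.04138545,
  density              = -0.30585791,
  pH                   =  0.01950570,
  sulphates            =  0.03848545,
  alcohol              =  0.44431852
)

# Rank by strength of the linear relationship, ignoring the sign.
ranked <- quality_cor[order(abs(quality_cor), decreasing = TRUE)]
print(round(head(ranked, 4), 3))
```

Alcohol comes first, followed by density, volatile acidity and chlorides, which is the ordering the later feature selection relies on.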

## Source: local data frame [3 x 4]
## Groups: type [1]
## 
##     type bucket.quality     n proportion
##   (fctr)         (fctr) (int)      (dbl)
## 1    red           poor    63 0.03939962
## 2    red         normal  1319 0.82489056
## 3    red      excellent   217 0.13570982

## Source: local data frame [3 x 4]
## Groups: type [1]
## 
##     type bucket.quality     n proportion
##   (fctr)         (fctr) (int)      (dbl)
## 1  white           poor   183 0.03736219
## 2  white         normal  3655 0.74622295
## 3  white      excellent  1060 0.21641486

Both white and red wines are mostly of normal quality; however, the proportion of excellent wines is higher in white (about 22%) than in red (about 14%). This might indicate that wine type is an important predictor.

## Source: local data frame [3 x 5]
## 
##   bucket.quality median.amount   iqr mean.amount       std
##           (fctr)         (dbl) (dbl)       (dbl)     (dbl)
## 1           poor         10.05   1.5    10.18435 0.9990347
## 2         normal         10.00   1.6    10.26528 1.0706263
## 3      excellent         11.50   1.7    11.43336 1.2156200

## 
##  Pearson's product-moment correlation
## 
## data:  alcohol and quality
## t = 39.97, df = 6495, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4245892 0.4636261
## sample estimates:
##       cor 
## 0.4443185

Excellent wines contain more alcohol than wines labelled normal or poor. However, judging by the IQR and standard deviation above, excellent wines also have more spread in alcohol content than the others. In addition, I found that, among all the attributes, alcohol amount had the strongest linear relationship with quality.
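The per-class summary tables in this section report a median, an IQR, a mean and a standard deviation for one attribute. A base R sketch of that computation is given below; the helper name summarise_by_class is an assumption, and the nine alcohol values are made up for illustration, not taken from the wine data.

```r
# Hypothetical helper: median, IQR, mean and sd of `values` for each class.
summarise_by_class <- function(values, classes) {
  stats <- function(v) {
    c(median.amount = median(v), iqr = IQR(v),
      mean.amount = mean(v), std = sd(v))
  }
  # split() keeps the factor level order; t() puts one class per row.
  t(sapply(split(values, classes), stats))
}

# Toy data only: three wines per class.
alcohol <- c(9.5, 10.0, 10.5, 9.6, 10.0, 10.9, 11.0, 11.5, 12.6)
classes <- factor(rep(c("poor", "normal", "excellent"), each = 3),
                  levels = c("poor", "normal", "excellent"))

class_summary <- summarise_by_class(alcohol, classes)
print(class_summary)
```

Reporting the median and IQR next to the mean and standard deviation is useful here because most attributes are skewed.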

## Source: local data frame [3 x 5]
## 
##   bucket.quality median.amount   iqr mean.amount       std
##           (fctr)         (dbl) (dbl)       (dbl)     (dbl)
## 1           poor          0.38  0.33   0.4651626 0.2456773
## 2         normal          0.30  0.19   0.3464234 0.1656597
## 3      excellent          0.27  0.14   0.2891699 0.1169600

As expected, excellent and normal quality wines have lower median volatile acidity than poor wines. Another thing to notice is that both the normal and the excellent classes contain a lot of vinegary-tasting wines (outliers). The fact that some wines with high volatile acidity were still marked as excellent suggests that other factors also make a wine excellent. Also, excellent and normal wines have less spread in this attribute than poor wines.

## Source: local data frame [3 x 5]
## 
##   bucket.quality median.amount     iqr mean.amount        std
##           (fctr)         (dbl)   (dbl)       (dbl)      (dbl)
## 1           poor         0.051 0.02775  0.06212602 0.04929806
## 2         normal         0.049 0.03000  0.05867431 0.03643728
## 3      excellent         0.039 0.01800  0.04457557 0.02101399

A large number of outliers are observed in the chloride values of normal quality wines. In addition, excellent wines are less salty than poor ones and have less spread than the other quality classes.

## Source: local data frame [3 x 5]
## 
##   bucket.quality median.amount   iqr mean.amount       std
##           (fctr)         (dbl) (dbl)       (dbl)     (dbl)
## 1           poor          0.27  0.24   0.2733740 0.1807343
## 2         normal          0.31  0.16   0.3167652 0.1506578
## 3      excellent          0.32  0.10   0.3346280 0.1100404

Normal and excellent wines have a slightly higher median citric acid amount than poor wines; however, they have a large number of outliers too, especially excellent wines. Another observation is that as wine quality improves, the spread in citric acid concentration decreases. Also, according to the description of the data set, citric acid is added for freshness, so it is reasonable to expect that excellent wines would have larger amounts of it.

## Source: local data frame [3 x 5]
## 
##   bucket.quality median.amount    iqr mean.amount      std
##           (fctr)         (dbl)  (dbl)       (dbl)    (dbl)
## 1           poor           102 108.25    105.7012 69.43677
## 2         normal           121  84.00    117.7441 57.86146
## 3      excellent           114  53.00    109.8912 47.12620

The median total SO2 level increases from poor to normal quality wines and then decreases in excellent wines. However, as the wine quality improves, the interquartile range, which measures spread, shrinks considerably.

## Source: local data frame [3 x 5]
## 
##   bucket.quality median.amount   iqr mean.amount       std
##           (fctr)         (dbl) (dbl)       (dbl)     (dbl)
## 1           poor         3.225  0.27    3.234797 0.1913128
## 2         normal         3.200  0.21    3.215346 0.1594609
## 3      excellent         3.220  0.22    3.227651 0.1590941

The median and mean pH levels are almost the same across wine qualities, though the spread is smaller in normal and excellent wines.

## Source: local data frame [3 x 5]
## 
##   bucket.quality median.amount   iqr mean.amount      std
##           (fctr)         (dbl) (dbl)       (dbl)    (dbl)
## 1           poor           2.2   4.1    4.273984 3.937832
## 2         normal           3.1   6.7    5.659087 4.935219
## 3      excellent           2.9   4.7    4.827721 4.063824

The median residual sugar level is slightly higher in normal and excellent wines, but the difference is not large enough to stand out. Moreover, normal wines have a larger spread in their residual sugar levels.

The above plots explain why we observed bimodal distributions when we plotted the attributes individually: the data set contains two different types of wine, with different levels of these components.

Finding Relations Among The Wine Attributes

## 
##  Pearson's product-moment correlation
## 
## data:  density and alcohol
## t = -76.14, df = 6495, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.6993829 -0.6736787
## sample estimates:
##        cor 
## -0.6867454

The plot suggests that a straight line is a good first approximation of the relation between alcohol amount and density, which is reflected in a reasonable correlation coefficient. However, a nonlinear relationship, shown by a polynomial fit of order 3, captures the relation better. In addition, a few outliers were removed from the density variable.
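The comparison between a straight line and an order-3 polynomial can be sketched with lm() and poly(). The data below are synthetic, generated only so that the example runs on its own; the coefficients used to simulate density are arbitrary assumptions and say nothing about the real wines.

```r
# Synthetic stand-in data (NOT the wine data): density falls with alcohol,
# with a mild curvature and some noise.
set.seed(1)
alcohol <- runif(200, min = 8, max = 14.9)
density <- 1.02 - 0.004 * alcohol + 0.0001 * (alcohol - 11)^2 +
  rnorm(200, sd = 0.001)

linear_fit <- lm(density ~ alcohol)            # straight line
cubic_fit  <- lm(density ~ poly(alcohol, 3))   # polynomial of order 3

# The cubic model nests the linear one, so its residual sum of squares
# can never be larger; anova() tests whether the reduction is meaningful.
rss <- c(linear = deviance(linear_fit), cubic = deviance(cubic_fit))
print(rss)
print(anova(linear_fit, cubic_fit))
```

On real data the F-test from anova() is a more careful guide than the plot alone, since a cubic always fits at least as well as a line.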

## 
##  Pearson's product-moment correlation
## 
## data:  total.sulfur.dioxide and free.sulfur.dioxide
## t = 83.84, df = 6495, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.7090475 0.7324111
## sample estimates:
##       cor 
## 0.7209341

There is a strong linear relationship between free and total SO2, so it makes sense to use only one of them as a predictor.
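The t statistics in the Pearson test outputs of this section follow directly from the correlation coefficient and the degrees of freedom, via t = r * sqrt(df) / sqrt(1 - r^2) with df = n - 2. The sketch below recomputes them from the reported values; the function name cor_t_statistic and the data frame reported are assumptions introduced here.

```r
# t statistic of a Pearson correlation r with df = n - 2 degrees of freedom.
cor_t_statistic <- function(r, df) {
  r * sqrt(df) / sqrt(1 - r^2)
}

# Correlations as printed in the three test outputs of this section.
reported <- data.frame(
  pair = c("alcohol vs quality", "density vs alcohol", "total SO2 vs free SO2"),
  r    = c(0.4443185, -0.6867454, 0.7209341)
)
reported$t <- cor_t_statistic(reported$r, df = 6497 - 2)
print(reported)
```

With 6495 degrees of freedom even a modest correlation gives a very large t value, which is why every p-value here is reported as below 2.2e-16; the size of r, not the p-value, is what distinguishes the three relationships.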

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

I found higher alcohol amounts and citric acid concentrations among the best quality wines, and lower volatile acidity. In addition, the best quality wines had less spread in most of their components. White wines also had a slightly higher share of excellent wines than red.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

I found a strong positive correlation between free SO2 and total SO2 and between residual sugar and density, and a strong negative correlation between alcohol amount and density. Moreover, I found the reason for the bimodal distribution of several wine components: the presence of both red and white wines in the data set.

What was the strongest relationship you found?

Among the attributes, free SO2 and total SO2 (0.72) and residual sugar and density (0.55) were strongly correlated, while between the attributes and the output, alcohol amount and quality had the highest correlation, 0.44.

Multivariate Plots Section

## Source: local data frame [6 x 6]
## Groups: bucket.quality [?]
## 
##   bucket.quality   type median.amount   iqr mean.amount       std
##           (fctr) (fctr)         (dbl) (dbl)       (dbl)     (dbl)
## 1           poor    red          10.0   1.4    10.21587 0.9181778
## 2           poor  white          10.1   1.4    10.17350 1.0275704
## 3         normal    red          10.0   1.4    10.25272 0.9723537
## 4         normal  white          10.0   1.6    10.26981 1.1040355
## 5      excellent    red          11.6   1.4    11.51805 0.9981532
## 6      excellent  white          11.5   1.7    11.41602 1.2552094

Excellent quality wines, irrespective of type, have higher levels of alcohol by volume than poor and normal quality wines. In addition, some normal wines have high alcohol amounts; these are outliers.

Plotting volatile acidity against alcohol and faceting over type didn’t produce a clear separation between all three classes for either red or white wines. However, general trends are visible: excellent quality wines are found in the bottom right section, while poor wines are more abundant in the top left corner, so there is a fairly good separation between the poor and excellent classes for both red and white wines.

Using citric acid (a measure of freshness) and volatile acidity (a measure of vinegar taste) didn’t separate the quality classes, and that holds for both types of wine.

It is possible to distinguish between red and white wines quite easily using attribute pairs like volatile acidity and sulphates, or total SO2 and fixed acidity; wine type is much easier to separate than wine quality.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

At the end of the bivariate analysis, alcohol amount, type of wine and volatile acidity stood out as possible avenues for further exploration, so in this section I focused on them. To reiterate, excellent wines had larger amounts of alcohol irrespective of color; the spread in volatile acidity was quite high in red wines compared to white ones, and the poor and normal classes had long, positively skewed tails. In addition, when working with volatile acidity and alcohol amount, I didn’t see large separations between the quality classes. I also explored citric acid, which is added for freshness, against the vinegar taste caused by volatile acidity, but once again a clear separation didn’t come forth. Combining them all, there was some separation, but not a large or clear one, although there was good separation between poor and excellent wines.

Were there any interesting or surprising interactions between features?

We can classify wine types much more easily than wine quality. For example, plotting volatile acidity against sulphates and coloring the points according to wine type led to nicely separated groups that might be classified using Linear Discriminant Analysis (LDA) or Quadratic Discriminant Analysis (QDA).

Simple KNN Model For Classification

##            class.labels
## prediction  excellent normal poor
##   excellent       757    524   28
##   normal          508   4329  163
##   poor             12    121   55

## [1] "training set accuracy (%) when using all the attributes"

## [1] 79.12883

##            class.labels
## prediction  excellent normal poor
##   excellent       429    337   10
##   normal          848   4636  236
##   poor              0      1    0

## [1] "training set accuracy (%) when using alcohol and volatile\n      acidity attributes "

## [1] 77.95906

Strengths and weaknesses of the model

Strengths:

If we had simply predicted the most common class, in this case wine of normal quality, we would have got a classification accuracy of about 76.6%. So the strength of the model is that it performs slightly better, with an accuracy of about 79.1% when just one nearest neighbor is used and all the attributes are taken into account.

Weakness:

The model is less interpretable. And when it is built using only the features selected through EDA, in this case alcohol amount and volatile acidity, its accuracy is about 78.0%, only a thin improvement over the base case.
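The accuracy figures and the base case quoted above can be recomputed from the two confusion matrices printed in this section. The counts below are copied from those matrices (rows are predictions, columns are true class labels); the helper names accuracy_pct and baseline_pct are assumptions.

```r
labels <- c("excellent", "normal", "poor")

# Confusion matrix when all attributes are used.
all_attrs <- matrix(c(757,  524,  28,
                      508, 4329, 163,
                       12,  121,  55),
                    nrow = 3, byrow = TRUE,
                    dimnames = list(prediction = labels, class.labels = labels))

# Confusion matrix when only alcohol and volatile acidity are used.
two_attrs <- matrix(c(429,  337,  10,
                      848, 4636, 236,
                        0,    1,   0),
                    nrow = 3, byrow = TRUE,
                    dimnames = list(prediction = labels, class.labels = labels))

# Accuracy: share of wines on the diagonal (prediction equals true label).
accuracy_pct <- function(cm) 100 * sum(diag(cm)) / sum(cm)
# Base case: always predict the most frequent true class.
baseline_pct <- function(cm) 100 * max(colSums(cm)) / sum(cm)

results <- c(all.attributes = accuracy_pct(all_attrs),
             two.attributes = accuracy_pct(two_attrs),
             baseline       = baseline_pct(all_attrs))
print(round(results, 2))
```

The second matrix also shows that the two-attribute model predicts "poor" for only one wine, so its overall accuracy hides the fact that it essentially never identifies poor wines.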

Final Plots and Summary

Plot One

##   type   quality occurances proportion.red  type   quality occurances
## 1  red      poor         63     0.03939962 white      poor        183
## 2  red    normal       1319     0.82489056 white    normal       3655
## 3  red excellent        217     0.13570982 white excellent       1060
##   proportion.white quality.ratio
## 1       0.03736219     0.9482879
## 2       0.74622295     0.9046327
## 3       0.21641486     1.5946883

Description One

I chose this plot because it shows the uneven representation of wine qualities in the data set irrespective of wine color; the majority of wines are of the normal class, with few wines of extreme quality among both reds and whites. As the above table shows, about 82% of red wines are normal, as are about 75% of white wines. This plot also shows that there are a lot more white wines in the data set, roughly 4900 compared to about 1600 red ones, so the two types might be sampled in equal numbers for further analysis. In addition, the proportion of excellent wines is about 1.59 times higher among whites than among reds, which suggested to me that wine type might be an important feature in classifying wine quality.
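The quality.ratio column in the table above is the white proportion divided by the red proportion for each quality class. A short sketch of that arithmetic, using the counts from the table, is given below; the matrix name counts is an assumption, and proportions is a local name for the per-type shares.

```r
# Wine counts per type and quality class, copied from the table above.
counts <- rbind(red   = c(poor = 63,  normal = 1319, excellent = 217),
                white = c(poor = 183, normal = 3655, excellent = 1060))

# Share of each quality class within each wine type (rows sum to 1).
proportions <- prop.table(counts, margin = 1)

# How much more common each class is among whites than among reds.
quality.ratio <- proportions["white", ] / proportions["red", ]
print(round(proportions, 4))
print(round(quality.ratio, 4))
```

Comparing proportions rather than raw counts matters here because there are about three times as many white wines as red ones.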

Plot Two

## Source: local data frame [6 x 6]
## Groups: type [?]
## 
##     type bucket.quality median.amount   iqr mean.amount       std
##   (fctr)         (fctr)         (dbl) (dbl)       (dbl)     (dbl)
## 1    red           poor          10.0   1.4    10.21587 0.9181778
## 2    red         normal          10.0   1.4    10.25272 0.9723537
## 3    red      excellent          11.6   1.4    11.51805 0.9981532
## 4  white           poor          10.1   1.4    10.17350 1.0275704
## 5  white         normal          10.0   1.6    10.26981 1.1040355
## 6  white      excellent          11.5   1.7    11.41602 1.2552094

## 
##  Pearson's product-moment correlation
## 
## data:  quality and alcohol
## t = 39.97, df = 6495, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4245892 0.4636261
## sample estimates:
##       cor 
## 0.4443185

Description Two

I selected this plot to show that alcohol amount stands out among all the attributes. First, it had the highest positive correlation with wine quality of any attribute. Second, when I plotted the wine attributes against bucket.quality, I observed that excellent wines had a considerably higher median alcohol amount, about 11.5%, than poor and normal quality wines, which had about 10%. On the above boxplots this created a nice separation between excellent wines and the rest, and the pattern held irrespective of wine type. This suggested to me that alcohol amount is an important feature to consider when building the model at a later stage. All I needed to do was find another attribute that would separate all the classes well, and that was my focus in the multivariate plots.

Plot Three

Description Three

At this stage I was looking to create a wide separation among the wine qualities by adding a few other attributes besides alcohol amount, which, as I mentioned above, is an important classifying attribute. Volatile acidity was the second candidate, as poor wines had larger median amounts of it and, once density is set aside for the reasons given in the analysis above, it had the second strongest correlation with quality. With this intuition I tried the above plot, but it did not produce a clear boundary separating the poor, normal and excellent wine qualities. There was a lot of mixing between the classes. I also added citric acid, because excellent wines had larger amounts of it and poor wines very little, but the result was the same. This also shows why my KNN classifier performs at only 78% accuracy when volatile acidity and alcohol amount are the only predictors. It also indicates how much of the information the variables I narrowed down carry, because using all of the attributes in the data set gave only a slightly better result with the KNN classifier, roughly 79% accuracy. In addition, this plot shows that it is easier to separate poor from excellent wines.

Furthermore, a few important things repeat in this plot too. First, excellent quality wines have higher alcohol amounts, less volatile acidity and a higher citric acid concentration than the poor class. In addition, the normal class is large and quite spread out, and blends with the poor and excellent quality wines. Conversely, poorly rated wines have lower alcohol amounts, higher volatile acidity and lower citric acid concentrations.

Reflections

As my main goal was to explore the data set with a view to building a wine quality classifier, I was able to improve slightly on just marking all the wines as normal, which I consider a success. However, the features I selected after EDA were not very rich. I tried several multivariate plots in order to find a good separation between wine qualities, but I failed to do that, as the quality classes overlapped heavily. I think that if I had more attributes available, for instance wine price, brand ratings or manufacturing information, I would have been able to construct a better classifier. Second, I think the data set should contain more wines of poor and excellent quality; with that, along with several other attributes, I would like to improve the classification further. Right now, I can build a good wine type classifier with this data set. For my future work, I would like to perform EDA on a much richer and more representative data set that includes wines from many other countries, not just the Portuguese “Vinho Verde” wine. Also, I have only used a K-nearest-neighbor classifier, which according to some people usually gets the job done. Exploring other models like logistic regression, LDA, QDA and support vector machines might well improve on the task, given a much richer and more complete data set.

In conclusion, using exploratory data analysis, I was able to see new relations, confirm the patterns already stated in the data set description, and reduce the number of features required in classification.

Exploratory Data Analysis On Wine Quality

by Bilal Mahmood