Red wine Analysis by Saurabh Sarkar

This analysis determines the effect of different factors on the quality of red wine. We take a descriptive statistical approach here to describe the collected data. A more complex approach, building a predictive model, is better suited to the next phase of analysis.

The collected Red Wine dataset has a basic structure with 1599 indexed observations and 13 variables. Of the 13 variables, one is just the index and one is the target (dependent) variable, which represents the quality score of the wines. The remaining 11 are features that represent the independent variables.

## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...

##   X fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 1           7.4             0.70        0.00            1.9     0.076
## 2 2           7.8             0.88        0.00            2.6     0.098
## 3 3           7.8             0.76        0.04            2.3     0.092
## 4 4          11.2             0.28        0.56            1.9     0.075
## 5 5           7.4             0.70        0.00            1.9     0.076
## 6 6           7.4             0.66        0.00            1.8     0.075
##   free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates alcohol
## 1                  11                   34  0.9978 3.51      0.56     9.4
## 2                  25                   67  0.9968 3.20      0.68     9.8
## 3                  15                   54  0.9970 3.26      0.65     9.8
## 4                  17                   60  0.9980 3.16      0.58     9.8
## 5                  11                   34  0.9978 3.51      0.56     9.4
## 6                  13                   40  0.9978 3.51      0.56     9.4
##   quality
## 1       5
## 2       5
## 3       5
## 4       6
## 5       5
## 6       5
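The structure and preview above are consistent with calls to `str()` and `head()`. The following is a minimal sketch under that assumption. The report does not name the input file, so the sketch reads the six preview rows from an inline string instead of a file on disk.

```r
# The six rows shown in the preview above, as inline CSV text.
# ASSUMPTION: the real analysis presumably calls read.csv() on a file
# whose name is not given in this report.
csv_text <- "X,fixed.acidity,volatile.acidity,citric.acid,residual.sugar,chlorides,free.sulfur.dioxide,total.sulfur.dioxide,density,pH,sulphates,alcohol,quality
1,7.4,0.70,0.00,1.9,0.076,11,34,0.9978,3.51,0.56,9.4,5
2,7.8,0.88,0.00,2.6,0.098,25,67,0.9968,3.20,0.68,9.8,5
3,7.8,0.76,0.04,2.3,0.092,15,54,0.9970,3.26,0.65,9.8,5
4,11.2,0.28,0.56,1.9,0.075,17,60,0.9980,3.16,0.58,9.8,6
5,7.4,0.70,0.00,1.9,0.076,11,34,0.9978,3.51,0.56,9.4,5
6,7.4,0.66,0.00,1.8,0.075,13,40,0.9978,3.51,0.56,9.4,5"

wine <- read.csv(text = csv_text)

str(wine)   # column types and leading values
head(wine)  # first rows of the data frame
```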

Explore

We start by looking at the distribution of the individual variables to determine whether the collected data is valid for assessing correlation with quality. The script below uses the R summary() function to show the quartiles and central measures of all the variables. This is a good gauge of the general distribution and helps identify possible outliers.

##  [1] "fixed.acidity"        "volatile.acidity"     "citric.acid"         
##  [4] "residual.sugar"       "chlorides"            "free.sulfur.dioxide" 
##  [7] "total.sulfur.dioxide" "density"              "pH"                  
## [10] "sulphates"            "alcohol"              "quality"

##  fixed.acidity   volatile.acidity  citric.acid    residual.sugar  
##  Min.   : 4.60   Min.   :0.1200   Min.   :0.000   Min.   : 0.900  
##  1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090   1st Qu.: 1.900  
##  Median : 7.90   Median :0.5200   Median :0.260   Median : 2.200  
##  Mean   : 8.32   Mean   :0.5278   Mean   :0.271   Mean   : 2.539  
##  3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420   3rd Qu.: 2.600  
##  Max.   :15.90   Max.   :1.5800   Max.   :1.000   Max.   :15.500  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide    density      
##  Min.   :0.01200   Min.   : 1.00       Min.   :  6.00       Min.   :0.9901  
##  1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 22.00       1st Qu.:0.9956  
##  Median :0.07900   Median :14.00       Median : 38.00       Median :0.9968  
##  Mean   :0.08747   Mean   :15.87       Mean   : 46.47       Mean   :0.9967  
##  3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 62.00       3rd Qu.:0.9978  
##  Max.   :0.61100   Max.   :72.00       Max.   :289.00       Max.   :1.0037  
##        pH          sulphates         alcohol         quality     
##  Min.   :2.740   Min.   :0.3300   Min.   : 8.40   Min.   :3.000  
##  1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50   1st Qu.:5.000  
##  Median :3.310   Median :0.6200   Median :10.20   Median :6.000  
##  Mean   :3.311   Mean   :0.6581   Mean   :10.42   Mean   :5.636  
##  3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :4.010   Max.   :2.0000   Max.   :14.90   Max.   :8.000
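The variable list and summary above no longer contain `X`, which suggests the index column was dropped before `names()` and `summary()` were called. A small sketch of that sequence, using only the first ten values of three columns copied from the `str()` output earlier in this report, so the numbers will not match the full-data summary:

```r
# First ten values of selected columns, copied from the str() output above.
wine <- data.frame(
  X             = 1:10,
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  alcohol       = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality       = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

wine$X <- NULL   # drop the index column; it carries no information
names(wine)      # remaining variable names
summary(wine)    # min, quartiles, median, mean and max per column

# The interquartile range can also be requested directly.
iqr_alcohol <- IQR(wine$alcohol)
iqr_alcohol
```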

Results: The quality variable is discrete and, as the target variable, will be correlated against the 11 different factors. The quality data consists of 6 discrete values, with 8 being the highest and 3 being the lowest. You can see below that the most common quality values are 5 and 6: 681 observations have a quality rating of 5 and 638 observations have a quality rating of 6.

## 
##   3   4   5   6   7   8 
##  10  53 681 638 199  18
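The frequency table above is the kind of output `table()` produces. As a sketch, the quality column can be rebuilt from the reported counts alone, which also allows a cross-check against the mean in the summary:

```r
# Rebuild the quality column from the counts reported above.
scores  <- 3:8
counts  <- c(10, 53, 681, 638, 199, 18)
quality <- rep(scores, times = counts)

table(quality)                        # frequency of each score
round(prop.table(table(quality)), 3)  # share of each score
length(quality)                       # total number of observations
mean(quality)                         # agrees with the summary mean of 5.636
```

Scores 5 and 6 together account for 1319 of the 1599 observations, which is why the analysis later regroups the scores into broader brackets.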

The summary showing the median/mean and IQR for the predictive variables indicates that these variables are continuous. Even before we plot histograms for these, we can make some assessments for the skews by looking at the median, mean, and Max/Min values:

For the variables [fixed.acidity, volatile.acidity, citric.acid, residual.sugar], the median and mean are close to each other, which suggests a roughly normal distribution; however, the relatively high max values point to possible outliers. Some of the other factors [free.sulfur.dioxide, total.sulfur.dioxide, sulphates, alcohol] indicate similar distributions, but will require visual representations to confirm.
The remaining factors [density and pH] do not indicate outliers, have close median and mean values, and probably exhibit a normal distribution.

We will confirm these findings in the next section, by observing the distribution of each variable by plotting histograms.

Univariate Plots Section

In this section we will show the visual distribution of the individual variables and identify whether transformations are needed before we start observing correlations. We will first better qualify the categories in the dependent variable quality, and then assess whether each independent variable is suitable for examining correlation with quality by observing its distribution. A variable with a normal distribution is a strong candidate to observe further, as is one whose distribution becomes normal once transformed on a log scale.

Adjust Dependent Variable: The variable quality is discrete, with most of the data at quality level 5 or 6 and far fewer examples for the higher or lower quality scores. To get a better frequency across the categories, it is beneficial to change the bracket size. Splitting into two brackets might be preferable if we were building a logistic regression model, but for descriptive purposes the quality scores are split into 3 brackets, which lets us use quality as a grouping variable when examining the other wine variables. Using basic ifelse statements, the quality variable is re-categorized into the 3 segments High, Average, and Low. The resulting distribution can be seen below.
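A minimal sketch of the nested `ifelse` re-categorization. The report does not state the cut points, so the thresholds below (3-4 Low, 5-6 Average, 7-8 High) and the name `quality.bracket` are assumptions chosen for illustration.

```r
# Quality scores rebuilt from the frequency table reported earlier.
quality <- rep(3:8, times = c(10, 53, 681, 638, 199, 18))

# ASSUMPTION: cut points are illustrative; the report does not give them.
quality.bracket <- ifelse(quality <= 4, "Low",
                   ifelse(quality <= 6, "Average", "High"))

# Store as an ordered set of levels so plots show Low, Average, High in order.
quality.bracket <- factor(quality.bracket, levels = c("Low", "Average", "High"))

bracket_counts <- table(quality.bracket)
bracket_counts
```

With these assumed thresholds the brackets hold 63, 1319 and 217 observations, so the two extreme brackets remain small compared with the middle one.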

Identify Normally Distributed Variables: With the dependent variable quality re-categorized, we can start the univariate analysis of the independent variables. We start with pH and density, as the summary statistics suggested a normal distribution for both, and this is confirmed by the histograms below.

Plot Note: The same distribution with a smaller bin width is provided next to the original plot to give a closer look at the distribution and to show that a normal distribution is rarely a perfectly bell-shaped curve.

For the remaining variables, we had previously identified possible skews. We will plot these to confirm the extent of the skews and to check whether transformations result in a normal distribution.

Plot Note: R allows for an iterative approach to plotting. The plots below were plotted by just identifying the x axis, and were then further refined, by adjusting factors like binwidth and breaks, to present the final visual distribution

Among the remaining variables, fixed.acidity is very close to a normal distribution. However, like the variables [‘residual.sugar’, ‘sulphates’, ‘total.sulfur.dioxide’, ‘chlorides’], its distribution has a long tail and would benefit from a transformation.

The variable free.sulfur.dioxide appears discrete once the bins are narrowed. This can probably be attributed to rounding. The distribution does seem to exhibit a long trailing tail, so a transformation should be attempted.

Plotting the remaining variables, we can see that [volatile.acidity, alcohol, citric.acid] have many peaks and do not have a trailing tail. These distributions are not expected to normalize after a log transformation. Further evaluation using subsets and groupings of these variables is needed to assess their effect on ‘quality’ in the Bivariate and Multivariate sections. The variable citric.acid also has a very large number of 0 values, so it is not a good candidate for determining correlation to quality.

Transformations for Normal Distributions: Here, we continue transforming the variables in an attempt to get a normal, bell-shaped distribution. For the variables [fixed.acidity, residual.sugar, sulphates, total.sulfur.dioxide, chlorides] we transformed the distributions using log10 and observed a more normal distribution.

Plot Note: It was also evident from the earlier plots that outliers are present for these variables, so only values up to the 99th percentile are used to represent the distribution.
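The trimming and transformation steps can be sketched as follows. The data here is simulated (a log-normal sample) purely as a stand-in for a long-tailed column such as residual.sugar; the variable names are hypothetical and none of the numbers come from the wine dataset.

```r
set.seed(42)
# ASSUMPTION: simulated right-skewed values standing in for a long-tailed variable.
x <- rlnorm(1000, meanlog = 0.8, sdlog = 0.4)

cap       <- quantile(x, 0.99)   # 99th percentile
x_trimmed <- x[x <= cap]         # keep values up to the cap, dropping the top 1%
x_log     <- log10(x_trimmed)    # log10 transformation

# Gap between mean and median before and after the transformation;
# a smaller gap is one rough sign of a more symmetric distribution.
gap_raw <- abs(mean(x_trimmed) - median(x_trimmed))
gap_log <- abs(mean(x_log) - median(x_log))
c(raw = gap_raw, log10 = gap_log)
```

A histogram of `x_log` remains the better check; the mean and median comparison only mirrors the quick assessment used earlier in this report.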

Readme Note: Previous plots were done using qplot. Plots moving forward will be done using ggplot allowing for the iterative approach to plotting while allowing for more control and refinement. For example in the plots below one variable [p1] is assigned the basic plot, and then the variation is created by l1 = p1 + coord_trans(x = ‘log10’)
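The iterative pattern described in the note can be sketched as below, assuming the ggplot2 package is installed. The data frame is simulated as a stand-in for a skewed column such as sulphates, and the bin width is an arbitrary choice; only the `p1` and `l1` naming and the `coord_trans(x = 'log10')` step come from the note itself.

```r
library(ggplot2)

set.seed(1)
# ASSUMPTION: simulated right-skewed data standing in for a column such as sulphates.
wine <- data.frame(sulphates = rlnorm(500, meanlog = -0.45, sdlog = 0.25))

# Base histogram, stored once in p1.
p1 <- ggplot(wine, aes(x = sulphates)) +
  geom_histogram(binwidth = 0.05)

# Variation: the same plot with a log10-transformed x axis.
l1 <- p1 + coord_trans(x = "log10")

print(p1)
print(l1)
```

Storing the base plot in a variable means each refinement (bin width, breaks, axis transformation) is a one-line addition rather than a rewritten plot call.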

Unable to Normalize: The plots below show the unsuccessful transformations of the variables free.sulfur.dioxide, alcohol and citric.acid, which could not be normalized. It is difficult to form a conclusion on these, and they will require further investigation in the bivariate analysis. In later analysis the data can be subset to remove any null values.

Univariate Analysis - Summary

Structure: The dataset consists of 1599 observations of 12 variables. The quality variable is categorical; it has been converted to a factor datatype and split into three categories for the different qualities of wine. The other 11 variables are features of the wine that might contribute to the quality of the red wine.
Main Features: The main feature of interest is the quality of the wine, and the goal is essentially to determine which of the supporting variables make the wine taste good or bad. The variables pH and density showed a normal distribution and are worth looking into. The different acid variables are also worth checking for correlation.
New Variables: The ‘quality’ variable was adjusted to be a factor variable with 3 distinct quality levels, which will help with grouping in the upcoming analysis.
Other adjustments: The first adjustment was to remove the index column, as it served no purpose. The variables [fixed.acidity, residual.sugar, sulphates, total.sulfur.dioxide, chlorides] were limited to values up to the 99th percentile, which means that outliers in the top 1% were excluded. A logarithmic transformation was then applied to obtain a more normal distribution.
Further investigation needed: The variables [alcohol, citric.acid, free.sulfur.dioxide] do not fit a normal distribution after the logarithmic transformation and will require further analysis.

Bivariate Plots Section

In this section we will observe the correlation of quality with the different variables. We had previously reduced the wine quality score to three values. We will compare these discrete quality values with the continuous wine variables and identify trends using box plots.
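A box plot of one continuous variable against the quality bracket might be built as in the sketch below, assuming ggplot2 is installed. It uses only the first ten observations from the `str()` output earlier in this report, so it illustrates the mechanics and not the trend; the cut points and the `quality.bracket` name are assumptions.

```r
library(ggplot2)

# First ten observations, copied from the str() output earlier in this report.
wine <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# ASSUMPTION: hypothetical cut points (3-4 Low, 5-6 Average, 7-8 High).
wine$quality.bracket <- cut(wine$quality, breaks = c(2, 4, 6, 8),
                            labels = c("Low", "Average", "High"))

# One box per quality bracket.
box_alcohol <- ggplot(wine, aes(x = quality.bracket, y = alcohol)) +
  geom_boxplot()
print(box_alcohol)

# Numeric counterpart of the plot: median alcohol per bracket
# (NA for a bracket with no observations in this small sample).
med_by_bracket <- tapply(wine$alcohol, wine$quality.bracket, median)
med_by_bracket
```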

In the plots below for fixed acidity, volatile acidity, citric acid, sulphates and alcohol, we can see a general trend in which the quality score changes as the value of the variable changes. Alcohol content is the one exception: ‘Low’ and ‘Average’ quality wines show little difference in alcohol content. It is still worth including as an independent variable, because ‘High’ quality wines tend to have a higher alcohol content.

In contrast, the plots for residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density and pH do not indicate a correlation with the quality score of the wine. You can see from the plots below that, for each of these variables, the box plots for the different quality scores are nearly identical, or the differences between the distributions are insignificant.

When building a model to identify correlation, it is important to identify bias by verifying that the variables are in fact independent. Below we check the correlation between variables using scatter plots.

It is good to see that, while fixed acidity and volatile acidity both showed an effect on the quality score, they have only a WEAK negative correlation with each other.

## 
##  Pearson's product-moment correlation
## 
## data:  fixed.acidity and volatile.acidity
## t = -10.589, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.3013681 -0.2097433
## sample estimates:
##        cor 
## -0.2561309
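The output above has the shape produced by `cor.test()`, and the `data:` line without a data frame prefix is consistent with calling it inside `with()`. A sketch on the first ten observations from the `str()` output (so the estimate differs from the full-data value), followed by a consistency check of the reported numbers using the relation t = r * sqrt(df / (1 - r^2)):

```r
# First ten observations, copied from the str() output earlier in this report.
wine <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5)
)

# Pearson correlation test; on ten rows the numbers differ from the full data.
res <- with(wine, cor.test(fixed.acidity, volatile.acidity, method = "pearson"))
res$estimate    # sample correlation
res$parameter   # degrees of freedom, n - 2

# Consistency check of the full-data result reported above.
r      <- -0.2561309
dof    <- 1597
t_stat <- r * sqrt(dof / (1 - r^2))
round(t_stat, 3)   # agrees with the reported t = -10.589
```

The very small p-value reflects the sample size as much as the strength of the relationship: with 1597 degrees of freedom even a weak correlation is statistically distinguishable from zero.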

The variables fixed acidity and citric acid, on the other hand, have a positive correlation that is just below STRONG.

## 
##  Pearson's product-moment correlation
## 
## data:  fixed.acidity and citric.acid
## t = 36.234, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6438839 0.6977493
## sample estimates:
##       cor 
## 0.6717034

The variables volatile acidity and citric acid are not strongly correlated either; their negative correlation is of medium strength.

## 
##  Pearson's product-moment correlation
## 
## data:  volatile.acidity and citric.acid
## t = -26.489, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.5856550 -0.5174902
## sample estimates:
##        cor 
## -0.5524957

Observing pH correlation with acid levels: It is interesting to see that ‘pH’ doesn’t have much of an effect on quality while the different acidity levels do. Below, we continue to look for correlations among the variables that influence quality.

Both citric.acid and fixed.acidity have the expected negative correlation with pH; the variable volatile.acidity, however, has a weak positive correlation of ~0.23. This suggests that volatile acidity contributes strongly to the quality of the wine while having little effect on its pH.

## 
##  Pearson's product-moment correlation
## 
## data:  volatile.acidity and pH
## t = 9.659, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1880823 0.2807254
## sample estimates:
##       cor 
## 0.2349373

Below, we continue looking at correlations among variables that do not affect quality. These are captured in case they become necessary as the model is enhanced when evaluating ‘White Wine’ data. For our purposes, looking at just ‘Red Wine’ data, we have seen that they do not affect quality scores. It is also worth noting that the univariate analysis had identified pH and density as prime candidates for independent variables because of their normal distributions, but they were rejected as valid candidates once their effect on the dependent variable quality proved insignificant.

Free sulfur dioxide and total sulfur dioxide have a strong positive correlation.

## 
##  Pearson's product-moment correlation
## 
## data:  free.sulfur.dioxide and total.sulfur.dioxide
## t = 35.84, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6395786 0.6939740
## sample estimates:
##       cor 
## 0.6676665

The variables total sulfur dioxide and sulphates, unlike the pair above, show no significant correlation.

## 
##  Pearson's product-moment correlation
## 
## data:  total.sulfur.dioxide and sulphates
## t = 1.7178, df = 1597, p-value = 0.08602
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.006087119  0.091774762
## sample estimates:
##        cor 
## 0.04294684

Even though sulphates does not have any correlation with total sulfur dioxide, it does seem to have a medium strength correlation with acid levels, primarily citric acid, as seen below.

## 
##  Pearson's product-moment correlation
## 
## data:  citric.acid and sulphates
## t = 13.159, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2678558 0.3563278
## sample estimates:
##     cor 
## 0.31277

density and alcohol have a medium strength negative correlation

## 
##  Pearson's product-moment correlation
## 
## data:  density and alcohol
## t = -22.838, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.5322547 -0.4583061
## sample estimates:
##        cor 
## -0.4961798

Bivariate Analysis - Summary

Features of Interest: The plots for fixed acidity, volatile acidity, citric acid, sulphates and alcohol indicate an influence on quality. Alcohol content is an exception among these, as it has no clear effect on the ‘Low’ and ‘Average’ quality wines. The distinction is seen with the ‘High’ quality wines, where a higher alcohol content indicates an improvement in quality.
Non-main features of interest: Citric acid has a negative correlation with volatile acidity and a positive correlation with fixed acidity. Alcohol and density have a medium strength negative correlation.
Strongest Relationship: volatile.acidity, whose effect on quality in the box plot shows the clearest separation of distributions across the quality scores. It is also worth noting that fixed acidity and citric acid are independent variables with a moderately strong correlation.

Multivariate Plots Section

In this section we are going to further enhance the analysis earlier and take steps to identify correlations while faceting by the quality factors.

We first look at the strong correlation between fixed.acidity and citric.acid. Once the data is separated by quality, we can see an indication of higher quality when the levels of fixed.acidity and citric.acid are higher.

Next we look at the MEDIUM strength correlation of volatile.acidity and citric.acid. A concentration of low volatile acidity also seems evident for ‘high’ quality wine.

Finally, we compare citric.acid to sulphates. Higher sulphates also improve quality, mainly in distinguishing ‘average’ quality from ‘low’ quality; increasing sulphates doesn’t seem to distinguish ‘high’ quality wines from ‘average’ ones.

The ambiguity around the influence of alcohol is evident in these plots: the points for both ‘average’ and ‘high’ quality wine show a high variance, but the concentration of points in the ‘high’ quality panel indicates that higher alcohol values influence quality.

It would be interesting to further observe the influence of alcohol. This time we plot alcohol against density. We confirm a similar concentration of points between 11 and 13 on the alcohol axis in the ‘high’ quality section.

The plots provide some useful insight, showing that the different levels of the variables [citric.acid, fixed.acidity, volatile.acidity, sulphates, alcohol] do in fact influence quality. Better insight can be achieved by creating additional variables that give a discrete view, splitting the continuous variables into buckets.

Using values up to the 99.5th percentile of the variables [citric.acid, fixed.acidity, volatile.acidity, sulphates, alcohol], each variable is split into 3 equal buckets.

Using the dplyr library, we create three equal-width buckets by taking the range up to the 99.5th percentile and splitting it into three with the cut() function.
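A base R sketch of the bucketing, without dplyr. The helper name `bucket3` and its `upper_p` argument are hypothetical, and the sample is just the first ten alcohol values from the `str()` output. The second part is a worked check: taking the alcohol minimum of 8.4 from the summary and the upper bound of 13.6 from the bucket labels quoted in this section, three equal-width buckets reproduce the reported boundaries.

```r
# Hypothetical helper: three equal-width buckets between the minimum and an
# upper percentile; values above the cap fall outside the breaks and become NA.
bucket3 <- function(x, upper_p = 0.995) {
  breaks <- seq(min(x), quantile(x, upper_p, names = FALSE), length.out = 4)
  cut(x, breaks = breaks, include.lowest = TRUE)
}

# First ten alcohol values, copied from the str() output earlier in this report.
alcohol <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
alcohol.bucket <- bucket3(alcohol)
table(alcohol.bucket, useNA = "ifany")

# Worked check against the bucket labels quoted in this section:
# min 8.4 (summary) and upper bound 13.6 (label of the highest alcohol bucket).
alcohol_breaks <- seq(8.4, 13.6, length.out = 4)
round(alcohol_breaks, 1)   # rounds to 8.4, 10.1, 11.9, 13.6
```

Equal-width buckets keep the boundaries easy to read on a plot legend, at the cost of unequal counts per bucket; quantile-based breaks would be the alternative if balanced group sizes mattered more.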

Using the buckets for fixed.acidity and plotting citric.acid against sulphates, we can identify the levels of fixed.acidity and citric.acid that improve quality.

Using fixed.acidity buckets: The combination of 0.25 citric acid and the ‘4.6-7.73’ fixed acidity level (in green) is common for ‘average’ level wine and almost non-existent for ‘high’ quality, which indicates a ‘sweet spot’ to avoid for that extra level of quality.

Using citric.acid buckets: The plot patterns are similar for ‘average’ and ‘high’, however the lower bucket of ‘citric acid’ (in green) is non existent in the ‘high’ quality plot just under the 3.4 level ‘pH’ mark, indicating another ‘sweet spot’

Using volatile.acidity buckets: Low levels of ‘volatile.acidity’ also affect the quality of the wine. The buckets of ‘volatile acidity’ give a clear picture of quality. The lowest ‘volatile acidity’ bucket (in green) is more prevalent in the ‘high’ quality wine, and the mid level bucket of ‘volatile acidity’ is also quite evident in the ‘average’ quality wine.

Using alcohol buckets: The low level alcohol bucket ‘8.4 - 10.1’ (in green) is almost non-existent for ‘high’ quality wine.

Using sulphates buckets: Mid level sulphates (in purple) seem to have the strongest influence on quality; sulphate levels also indicate a sweet spot. In ‘high’ quality wine, the mid level bucket for sulphates ‘0.673 - 1.02’ (in purple) is the most prevalent, while the lowest sulphate bucket (in green) is almost non existent

Another plot using the sulphates bucket: While the influence of the lower bucket is not confirmed in this plot, it is clear that the mid level ‘sulphate’ bucket has a clear influence on ‘high’ quality wine.

Multivariate Analysis - Summary

Features of Interest: There is some distinction that differentiates the three levels of alcohol. It should be noted that there isn’t much data for the two extremes of the quality spectrum (high and low). Wine quality improves with increasing citric.acid levels: ‘Mid’ level and ‘High’ level citric.acid are prominent among the ‘Average’ and ‘High’ quality wines. Wine quality improves with decreasing volatile.acidity levels: the mid range level for volatile.acidity (0.443-0.767) is very prominent among the ‘Average’ level wines, while volatile.acidity at LOW levels (0.12-0.443) is associated with ‘High’ wine quality. Wine quality improves with increasing alcohol levels: the highest range of alcohol levels (11.9-13.6) is quite prominent among the ‘High’ quality wines while almost non-existent among the ‘Low’ quality wines.

Interesting or surprising interactions between features: There are ‘sweet spots’ for the different variables that improve quality:
* The sweet spots for variables like fixed.acidity, citric.acid and alcohol are on the higher end, while the ‘sweet spot’ for sulphates is in the middle. The mid-range level for sulphates (0.673-1.02) is prominent in the ‘high’ quality wines and non-existent in the ‘low’ quality wines.
* fixed.acidity levels do not provide a clear enough distinction between the different quality levels. While the most common fixed.acidity level for ‘Low’ quality wine is on the lower end (4.6-7.73), that level is still prominent in the ‘Average’ and ‘High’ quality wines.

Summary of all Analysis

Plot One

The bi-variate plots for fixed acidity, volatile acidity, citric acid, sulphates indicate an influence on the quality.

Plot Two

The bivariate plot for alcohol is singled out because higher ‘alcohol’ volumes clearly distinguish ‘High’ quality wines, while ‘Average’ and ‘Low’ quality wines show little difference in alcohol volume.

Plot Three

The quality variable was the only discrete variable in this dataset, so it was important to create discrete versions of the continuous variables by splitting them into equal buckets. Among the buckets created for the independent variables, the mid range for sulphates (0.673 g/dm3 - 1.02 g/dm3) stands out as the ‘sweet spot’: it is prominent in ‘Average’ and ‘High’ quality wines while non-existent in ‘Low’ quality wines.

Reflection

This dataset for red wine consists of only 1599 observations. The majority of the collected data fell in the mid quality range (5 or 6), leaving very little data for the two extremes (low quality and high quality). This made it difficult to find definite proof of a strong correlation. Variables that influenced quality were apparent from the bivariate analysis. Multivariate analysis was difficult because every variable other than quality is continuous. Even after forming discrete variables with buckets, a clear indication of the influence was not certain, due to the very small numbers of ‘high’ and ‘low’ quality wines.

A relative approach comparing the influences of different variables on the different types of wine would probably prove more fruitful, such as determining the factors that affect red wine and white wine differently. The analysis here is a useful first step toward gaining insight into how to approach that analysis.