The goal is to determine the effect of different factors on the quality of red wine. We take a descriptive statistical approach here to describe the collected data. Building a predictive model is a more complex task that is better suited to the next phase of analysis.
The red wine dataset has a simple structure: 1599 indexed observations of 13 variables. One variable (X) is just the row index. Of the remaining 12, 11 are features that serve as the independent variables, and one, quality, is the target (dependent) variable holding the quality score of each wine.
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
## X fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 1 7.4 0.70 0.00 1.9 0.076
## 2 2 7.8 0.88 0.00 2.6 0.098
## 3 3 7.8 0.76 0.04 2.3 0.092
## 4 4 11.2 0.28 0.56 1.9 0.075
## 5 5 7.4 0.70 0.00 1.9 0.076
## 6 6 7.4 0.66 0.00 1.8 0.075
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol
## 1 11 34 0.9978 3.51 0.56 9.4
## 2 25 67 0.9968 3.20 0.68 9.8
## 3 15 54 0.9970 3.26 0.65 9.8
## 4 17 60 0.9980 3.16 0.58 9.8
## 5 11 34 0.9978 3.51 0.56 9.4
## 6 13 40 0.9978 3.51 0.56 9.4
## quality
## 1 5
## 2 5
## 3 5
## 4 6
## 5 5
## 6 5
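As a minimal sketch of the structure described above, the first six observations can be rebuilt by hand from the head() output and the index column dropped. The object names `wine` and `features` are assumptions for illustration; the original analysis presumably read the full 1599-row file instead.

```r
# First six observations, copied by hand from the head() output above.
# NOTE: `wine` and `features` are illustrative names, not taken from the original script.
wine <- data.frame(
  X                    = 1:6,
  fixed.acidity        = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4),
  volatile.acidity     = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66),
  citric.acid          = c(0.00, 0.00, 0.04, 0.56, 0.00, 0.00),
  residual.sugar       = c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8),
  chlorides            = c(0.076, 0.098, 0.092, 0.075, 0.076, 0.075),
  free.sulfur.dioxide  = c(11, 25, 15, 17, 11, 13),
  total.sulfur.dioxide = c(34, 67, 54, 60, 34, 40),
  density              = c(0.9978, 0.9968, 0.9970, 0.9980, 0.9978, 0.9978),
  pH                   = c(3.51, 3.20, 3.26, 3.16, 3.51, 3.51),
  sulphates            = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56),
  alcohol              = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4),
  quality              = c(5L, 5L, 5L, 6L, 5L, 5L)
)

str(wine)                      # same layout as the str() output above, with 6 rows

wine$X <- NULL                 # drop the index column, it carries no information
features <- setdiff(names(wine), "quality")
length(features)               # number of independent variables
```

Dropping X first keeps the index out of every later summary and plot.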
We start by looking at the distribution of the individual variables to determine whether the collected data is suitable for examining correlation with quality. The script below uses the R summary() function to show the quartiles and central measures (median and mean) of all the variables. This is a good way to gauge the general distribution and spot possible outliers.
## [1] "fixed.acidity" "volatile.acidity" "citric.acid"
## [4] "residual.sugar" "chlorides" "free.sulfur.dioxide"
## [7] "total.sulfur.dioxide" "density" "pH"
## [10] "sulphates" "alcohol" "quality"
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 4.60 Min. :0.1200 Min. :0.000 Min. : 0.900
## 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090 1st Qu.: 1.900
## Median : 7.90 Median :0.5200 Median :0.260 Median : 2.200
## Mean : 8.32 Mean :0.5278 Mean :0.271 Mean : 2.539
## 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420 3rd Qu.: 2.600
## Max. :15.90 Max. :1.5800 Max. :1.000 Max. :15.500
## chlorides free.sulfur.dioxide total.sulfur.dioxide density
## Min. :0.01200 Min. : 1.00 Min. : 6.00 Min. :0.9901
## 1st Qu.:0.07000 1st Qu.: 7.00 1st Qu.: 22.00 1st Qu.:0.9956
## Median :0.07900 Median :14.00 Median : 38.00 Median :0.9968
## Mean :0.08747 Mean :15.87 Mean : 46.47 Mean :0.9967
## 3rd Qu.:0.09000 3rd Qu.:21.00 3rd Qu.: 62.00 3rd Qu.:0.9978
## Max. :0.61100 Max. :72.00 Max. :289.00 Max. :1.0037
## pH sulphates alcohol quality
## Min. :2.740 Min. :0.3300 Min. : 8.40 Min. :3.000
## 1st Qu.:3.210 1st Qu.:0.5500 1st Qu.: 9.50 1st Qu.:5.000
## Median :3.310 Median :0.6200 Median :10.20 Median :6.000
## Mean :3.311 Mean :0.6581 Mean :10.42 Mean :5.636
## 3rd Qu.:3.400 3rd Qu.:0.7300 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :4.010 Max. :2.0000 Max. :14.90 Max. :8.000
Results: The quality variable is discrete and, as the target variable, will be compared against the 11 different factors. The quality data takes 6 discrete values, with 8 the highest and 3 the lowest. As the table below shows, the most common quality values are 5 and 6: 681 observations have a quality rating of 5 and 638 observations have a quality rating of 6.
##
## 3 4 5 6 7 8
## 10 53 681 638 199 18
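The summary showing the median, mean, and quartiles for the predictor variables indicates that these variables are continuous. Even before we plot histograms, we can make a first assessment of skew by comparing the median, mean, and min/max values. For example, pH and density have nearly identical mean and median (3.311 vs 3.310 and 0.9967 vs 0.9968), which suggests a symmetric distribution, while residual.sugar and total.sulfur.dioxide have a mean above the median and a maximum far above the third quartile, which suggests a long right tail.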
We will confirm these findings in the next section, by observing the distribution of each variable by plotting histograms.
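The reading of the summary table can be made a little more concrete with a rough sketch. The two indicators below, `mean_shift` and `max_reach`, are hypothetical helper columns introduced here for illustration, not standard statistics or part of the original analysis. They express the mean-median gap and the distance of the maximum above the third quartile in units of the IQR, using numbers copied from the summary() output.

```r
# Summary statistics copied from the summary() output above (five variables only).
stats <- data.frame(
  variable = c("pH", "density", "residual.sugar", "chlorides", "total.sulfur.dioxide"),
  q1       = c(3.210, 0.9956,  1.900, 0.07000,  22.00),
  median   = c(3.310, 0.9968,  2.200, 0.07900,  38.00),
  mean     = c(3.311, 0.9967,  2.539, 0.08747,  46.47),
  q3       = c(3.400, 0.9978,  2.600, 0.09000,  62.00),
  max      = c(4.010, 1.0037, 15.500, 0.61100, 289.00)
)

iqr <- stats$q3 - stats$q1

# ASSUMPTION: ad hoc indicators for a first look, not formal skewness measures.
stats$mean_shift <- (stats$mean - stats$median) / iqr  # near 0 suggests symmetry
stats$max_reach  <- (stats$max - stats$q3) / iqr       # large suggests a long right tail

print(stats[, c("variable", "mean_shift", "max_reach")], digits = 3)
```

A mean_shift close to zero is consistent with a symmetric distribution, while a large max_reach flags a long upper tail or outliers. Histograms remain the real check.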
In this section we show the visual distribution of the individual variables and identify whether transformations are needed before we start observing correlations. We first regroup the categories of the dependent variable quality, and then judge how suitable each independent variable is for examining correlation with quality by observing its distribution. A variable with a normal distribution is a strong candidate for further study, as is one whose distribution looks normal once transformed to a log scale.
Adjust dependent variable: The variable quality is discrete, with most of the data at quality level 5 or 6 and far fewer examples at the higher or lower scores. To get a more useful frequency in each category, it is beneficial to widen the brackets. Splitting the scores into two brackets might make sense if we were building a logistic regression model. For descriptive purposes, however, the quality scores are split into 3 brackets, which makes it easier to relate the quality rating to the other wine variables. Using basic ifelse statements, the quality variable is re-categorized into the 3 segments High, Average, and Low. The resulting distribution can be seen below.
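The re-categorization can be sketched with nested ifelse calls. The text does not state the cut points, so the mapping below (3-4 as Low, 5-6 as Average, 7-8 as High) is an assumption, as is the name `rating`. The quality vector is rebuilt from the frequency table shown earlier so that the example runs on its own.

```r
# Rebuild the quality scores from the frequency table above
# (10 threes, 53 fours, 681 fives, 638 sixes, 199 sevens, 18 eights).
quality <- rep(3:8, times = c(10, 53, 681, 638, 199, 18))

# ASSUMPTION: cut points 3-4 = Low, 5-6 = Average, 7-8 = High.
# The original text names the three segments but not their boundaries.
rating <- ifelse(quality <= 4, "Low",
          ifelse(quality <= 6, "Average", "High"))

# An ordered factor keeps the levels in a sensible order for tables and plots.
rating <- factor(rating, levels = c("Low", "Average", "High"), ordered = TRUE)

table(rating)
```

Under these assumed cut points the three groups hold 63, 1319, and 217 wines. The two extreme groups stay small however the scores are grouped.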
Identify normally distributed variables: With the dependent variable quality re-categorized, we can start the univariate analysis of the independent variables. We start with pH and density, as their summary statistics suggested a normal distribution, and the histograms below confirm this.
Plot Note: The same distribution with a smaller bin width is provided next to the original plot to give a closer look at the distribution, and to show that a normal distribution in real data is rarely a perfectly bell-shaped curve.
For the remaining variables, we had previously identified possible skews. We now plot them to confirm the extent of the skews and to check whether transformations result in a normal distribution.
Plot Note: R allows for an iterative approach to plotting. The plots below were first drawn by just specifying the x axis and were then refined, by adjusting factors like binwidth and breaks, to produce the final visual distribution.
Among the remaining variables, fixed.acidity is very close to a normal distribution. However, like the variables [‘residual.sugar’, ‘sulphates’, ‘total.sulfur.dioxide’, ‘chlorides’], its distribution has a long tail and would benefit from a transformation.
The variable free.sulfur.dioxide seems discrete once the bin width is reduced. This can probably be attributed to rounding. The distribution also exhibits a long trailing tail, so a transformation should be attempted.
Transformations for normal distributions: Here we continue by transforming the variables in an attempt to get a normal, bell-shaped distribution. For the variables [fixed.acidity, residual.sugar, sulphates, total.sulfur.dioxide, chlorides] we transformed the distributions using log10 and observed a more normal distribution.
Plot Note: It was also evident from the earlier plots that outliers are present for these variables, so only values up to the 99th percentile are used to represent the distribution.
Readme Note: Previous plots were done using qplot. Plots from here on are done using ggplot, which keeps the iterative approach to plotting while allowing more control and refinement. For example, in the plots below a variable p1 is assigned the basic plot, and the variation is then created with l1 = p1 + coord_trans(x = 'log10').
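The effect of trimming at the 99th percentile and applying log10 can be sketched in base R without any plotting package. The data here is simulated from a log-normal distribution as a stand-in for a long-tailed wine variable. It is not the real data, and the `skewness` helper is a simple moment-based estimate defined only for this example.

```r
set.seed(42)

# ASSUMPTION: simulated right-skewed values standing in for a variable
# such as residual.sugar; the real dataset is not loaded here.
x <- rlnorm(1599, meanlog = 1, sdlog = 0.5)

# Simple moment-based sample skewness (0 for a symmetric distribution).
skewness <- function(v) mean((v - mean(v))^3) / sd(v)^3

# Keep only values up to the 99th percentile, then transform to log10.
cutoff  <- quantile(x, 0.99, names = FALSE)
trimmed <- x[x <= cutoff]
logged  <- log10(trimmed)

c(raw = skewness(x), trimmed = skewness(trimmed), log10 = skewness(logged))

# hist(trimmed) and hist(logged) would show the same change visually.
```

In ggplot2, coord_trans() changes the coordinate system after the histogram bins have been computed, whereas transforming the values first (as above, or with scale_x_log10()) bins the data on the log scale. The two approaches can therefore produce differently shaped histograms.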
Unable to normalize: The plots below show the variables free sulfur dioxide, alcohol, and citric acid, whose distributions could not be normalized by the transformations applied. It is difficult to form a conclusion on these, and they will require further investigation in the bivariate analysis. In later analysis the data can be subset to remove any null values.
In this section we observe the correlation of quality with the different variables. We had previously reduced the wine quality score to three values. We compare these discrete quality values with the continuous wine variables and identify trends using box plots.
In the plots below for fixed acidity, volatile acidity, citric acid, sulphates, and alcohol, we can see a general trend: the quality score changes as the value of these variables changes. Alcohol content is a partial exception, since ‘bad’ and ‘good’ quality wines show little difference in alcohol content. It is still worth including as an independent variable, because the ‘best’ quality wines seem to be distinguished by higher alcohol content.
By contrast, the plots for residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, and pH do not indicate a correlation with the quality score of the wine. For each of these variables, the box plots for the different quality levels are nearly identical, or the differences in distribution are insignificant.
When building a model to identify correlation, it is important to guard against bias by verifying that the predictor variables are in fact independent of one another. Below we check the correlation between them using scatter plots.
It is good to see that, while fixed acidity and volatile acidity were both shown to affect the quality score, they have only a WEAK negative correlation with each other.
##
## Pearson's product-moment correlation
##
## data: fixed.acidity and volatile.acidity
## t = -10.589, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.3013681 -0.2097433
## sample estimates:
## cor
## -0.2561309
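To help read the test output above, the following sketch recomputes its t statistic, p-value, and confidence interval from just the reported correlation and degrees of freedom. It relies on the standard relationship between Pearson's r and the t statistic, and on the Fisher z transformation that R's cor.test documents as the basis for its confidence interval. The variable names are illustrative.

```r
# Values taken from the cor.test output above.
r  <- -0.2561309
df <- 1597
n  <- df + 2            # 1599 observations

# t statistic for testing whether the true correlation is zero.
t_stat  <- r * sqrt(df) / sqrt(1 - r^2)
p_value <- 2 * pt(-abs(t_stat), df)

# 95% confidence interval via the Fisher z transformation.
z  <- atanh(r)
se <- 1 / sqrt(n - 3)
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se)

round(t_stat, 3)        # compare with t = -10.589
round(ci, 4)            # compare with the reported interval
p_value < 2.2e-16       # R prints "< 2.2e-16" when the p-value is below this limit
```

With 1599 observations even a weak correlation is highly significant, so the size of r is a better guide to strength than the p-value.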
The variables fixed acidity and citric acid, on the other hand, have a correlation just below the threshold for a STRONG positive correlation.
##
## Pearson's product-moment correlation
##
## data: fixed.acidity and citric.acid
## t = 36.234, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6438839 0.6977493
## sample estimates:
## cor
## 0.6717034
The variables volatile acidity and citric acid are not strongly correlated either; their negative correlation is of medium strength.
##
## Pearson's product-moment correlation
##
## data: volatile.acidity and citric.acid
## t = -26.489, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.5856550 -0.5174902
## sample estimates:
## cor
## -0.5524957
Observing pH correlation with acid levels: It is interesting that pH doesn’t have much of an effect on quality while the different acidity levels do. Below, we continue to look for correlations among the variables that influence quality.
##
## Pearson's product-moment correlation
##
## data: volatile.acidity and pH
## t = 9.659, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1880823 0.2807254
## sample estimates:
## cor
## 0.2349373
Below, we continue with correlations among the other variables, which do not affect quality. These are recorded in case they become necessary when the model is extended to evaluate ‘White Wine’ data. For our purposes, looking at just ‘Red Wine’ data, we have already seen that they do not affect quality scores. It is also worth noting that the univariate analysis had identified pH and density as prime candidates for independent variables because of their normal distributions, but they were then rejected as valid candidates once their effect on the dependent variable quality turned out to be insignificant.
Free sulfur dioxide and total sulfur dioxide have a fairly strong positive correlation (r ≈ 0.67).
##
## Pearson's product-moment correlation
##
## data: free.sulfur.dioxide and total.sulfur.dioxide
## t = 35.84, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6395786 0.6939740
## sample estimates:
## cor
## 0.6676665
The variables total sulfur dioxide and sulphates, unlike the pair above, show no significant correlation (r ≈ 0.04, p = 0.086).
##
## Pearson's product-moment correlation
##
## data: total.sulfur.dioxide and sulphates
## t = 1.7178, df = 1597, p-value = 0.08602
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.006087119 0.091774762
## sample estimates:
## cor
## 0.04294684
Even though sulphates are not correlated with total sulfur dioxide, they do seem to have a medium-strength correlation with acid levels, primarily citric acid, as seen below.
##
## Pearson's product-moment correlation
##
## data: citric.acid and sulphates
## t = 13.159, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2678558 0.3563278
## sample estimates:
## cor
## 0.31277
Density and alcohol have a medium-strength negative correlation.
##
## Pearson's product-moment correlation
##
## data: density and alcohol
## t = -22.838, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.5322547 -0.4583061
## sample estimates:
## cor
## -0.4961798
In this section we extend the earlier analysis and look for correlations while faceting by the quality levels.
We first look at the strong correlation between fixed.acidity and citric.acid. Once separated by quality, we can see an indication of higher quality when the levels of fixed.acidity and citric.acid are higher.
Next we look at the MEDIUM strength correlation between volatile.acidity and citric.acid. A concentration of low volatile acidity is also evident for ‘high’ quality wine.
Finally, we compare citric.acid to sulphates. Higher sulphates also improve quality, but mainly as a way to distinguish ‘medium’ quality from ‘low’ quality; increasing sulphates doesn’t seem to distinguish ‘high’ quality wines from ‘medium’ ones.
The uncertainty about the influence of alcohol is evident from these plots, with high variance for both ‘average’ and ‘high’ quality wine. Still, the concentration of points in the ‘high’ quality plot indicates that higher alcohol values influence quality.
It would be interesting to further observe the influence of alcohol. This time we plot alcohol against density. We see a similar concentration of points between 11 and 13 on the alcohol axis in the ‘high’ quality panel.
These plots give useful insight: the different levels of the variables [citric.acid, fixed.acidity, volatile.acidity, sulphates, alcohol] do in fact influence quality. Better insight can be achieved by creating, for each continuous variable, another variable that gives a discrete view by splitting it into buckets.
We take the values up to the 99.5th percentile for each of the variables [citric.acid, fixed.acidity, volatile.acidity, sulphates, alcohol] and split them into 3 equal buckets.
Using the dplyr library, we create the three equal buckets by taking the 99.5th percentile as the upper bound and splitting the range into three using the cut() function.
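The bucketing step can be sketched in base R as follows. The helper name `bucket3` and the simulated `alcohol_sim` values are assumptions for illustration; the original used dplyr around cut(), and the real data is not loaded here. The last lines check that three equal-width buckets between the alcohol minimum of 8.4 and an upper bound of 13.6 reproduce the bucket edges quoted in the text (8.4 - 10.1 and 11.9 - 13.6).

```r
# Split a numeric vector into three equal-width buckets between its minimum
# and its 99.5th percentile; values above the cutoff become NA.
# NOTE: `bucket3` is a hypothetical helper, not a function from the original analysis.
bucket3 <- function(x, upper_q = 0.995) {
  upper  <- quantile(x, upper_q, names = FALSE)
  breaks <- seq(min(x), upper, length.out = 4)
  cut(x, breaks = breaks, include.lowest = TRUE, dig.lab = 3)
}

set.seed(1)
# ASSUMPTION: uniform stand-in values over the observed alcohol range (8.4 to 14.9).
alcohol_sim    <- runif(1000, min = 8.4, max = 14.9)
alcohol_bucket <- bucket3(alcohol_sim)

table(alcohol_bucket, useNA = "ifany")

# Equal-width edges between 8.4 and 13.6, rounded as in the bucket labels.
label_breaks <- round(seq(8.4, 13.6, length.out = 4), 1)
label_breaks
```

Rows that fall above the cutoff receive NA and can be dropped before plotting, for example with a subset on !is.na(alcohol_bucket).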
Using the buckets for fixed.acidity and plotting citric.acid against sulphates, we can identify the levels of fixed.acidity and citric.acid that improve quality.
Using fixed.acidity buckets: The combination of citric acid around 0.25 and the ‘4.6-7.73’ fixed acidity bucket (in green) is common for ‘average’ wine and almost non-existent for ‘high’ quality, indicating a spot to avoid for that extra level of quality.
Using citric.acid buckets: The plot patterns are similar for ‘average’ and ‘high’. However, the lower citric acid bucket (in green) is non-existent in the ‘high’ quality plot just under the 3.4 pH mark, indicating another ‘sweet spot’.
Using volatile.acidity buckets: Low levels of volatile.acidity also affect the quality of the wine, and the volatile acidity buckets give a clear picture. The lowest bucket (in green) is more prevalent in ‘high’ quality wine, while the mid-level bucket is quite evident in ‘average’ quality wine.
Using alcohol buckets: The low-level alcohol bucket ‘8.4 - 10.1’ (in green) is almost non-existent for ‘high’ quality wine.
Using sulphates buckets: Mid-level sulphates (in purple) seem to have the strongest influence on quality, so sulphate levels also indicate a sweet spot. In ‘high’ quality wine, the mid-level sulphates bucket ‘0.673 - 1.02’ (in purple) is the most prevalent, while the lowest sulphates bucket (in green) is almost non-existent.
Another plot using the sulphates buckets: While the influence of the lower bucket is not confirmed in this plot, it is clear that the mid-level sulphates bucket has an influence on ‘high’ quality wine.
Features of Interest: There is some distinction that differentiates the three levels of alcohol. It should be noted that there isn’t much data for the two extremes of the quality spectrum (high and low). Wine quality improves with increasing citric.acid levels: ‘Mid’ and ‘High’ level citric.acid are prominent among the ‘Average’ and ‘High’ quality wines. Wine quality improves as volatile.acidity levels decrease: the mid-range level for volatile.acidity (0.443-0.767) is very prominent among the ‘Average’ quality wines, while volatile.acidity at LOW levels (0.12-0.443) is associated with ‘High’ wine quality. Wine quality improves with increasing alcohol levels: the highest alcohol range (11.9-13.6) is quite prominent among the ‘High’ quality wines while almost non-existent among the ‘Low’ quality wines.
Interesting or surprising interactions between features: There are ‘sweet spots’ for the different variables that improve quality:
* The sweet spots for variables like fixed.acidity, citric acid, and alcohol are at the higher end, while the sweet spot for sulphates is in the middle. The mid-range sulphates level of 0.673-1.02 is prominent in the ‘high’ quality wines and non-existent in the ‘low’ quality wines.
* fixed.acidity levels do not provide a clear enough distinction between the quality levels. While the most common fixed.acidity level for ‘Low’ quality wine is at the lower end (4.6-7.73), that level is still prominent in the ‘Average’ and ‘High’ quality wines.
The bivariate plots for fixed acidity, volatile acidity, citric acid, and sulphates indicate an influence on quality.
The bivariate plot for alcohol stands out: high alcohol volume is clearly distinctive of ‘High’ quality wines, while ‘Average’ and ‘Low’ quality wines show little difference in alcohol volume.
The quality variable was the only discrete variable in this dataset, so it was important to create discrete versions of the continuous variables by splitting each into equal buckets. Among the buckets used for the independent variables, the mid range for sulphates (0.673 g/dm3 - 1.02 g/dm3) stood out as the ‘sweet spot’: it is associated with ‘Average’ and ‘High’ quality, while being non-existent in ‘Low’ quality.
This dataset for red wine consists of only 1599 observations. The majority of the collected data fell in the mid quality range (5 or 6), leaving very little data for the two extremes (low quality and high quality). This made it difficult to find definite proof of a strong correlation. Variables that influence quality were apparent from the bivariate analysis. Multivariate analysis was difficult because every variable other than quality is continuous. Even after forming discrete variables by bucketing, the influence could not be established with certainty because of the very small numbers of ‘high’ and ‘low’ quality wines.
A comparative approach, contrasting the influence of the variables on different types of wine, would probably prove more fruitful, for example determining which factors affect red wine and white wine differently. The analysis here is a useful first step toward gaining insight into how to approach that analysis.