Willem Popp, s3083899
Last updated: 25 October, 2020
Is there a relationship between the sugar and alcohol content of wine?
Wine is a common “social lubricant” that people use to socialise and relax. Having a glass of wine with dinner, or a bottle or two with friends at the weekend, is a common occurrence. The key ingredient in this metamorphosis is the alcohol component of the wine, which relaxes us and makes us more sociable. Thus, wine is a social relaxant.
But to generate the alcohol we need sugar, which is converted into alcohol during the wine fermentation process. Sugar is widely castigated for its negative effects on our health, including weight gain, which is an ever-growing social problem.
So one of society’s favourite relaxants, wine, is directly linked to one of our greatest social diseases, obesity. These two key social outcomes of wine are negatively linked at the societal level, but is there any statistical evidence to indicate that an increase in one outcome (relaxation through wine consumption) leads to an increase in the other outcome (obesity through wine consumption)?
My problem statement is that sugar and alcohol are linked. But the question is how. Does the residual sugar left over in wine after fermentation bear any relationship to the alcohol content of that wine?
The alcoholic component in any fermented drink is created by the conversion of sugar into that alcohol (Cortez et al., 2009, p. 552). The more sugar you start with, the more alcoholic the final drink.
But what happens to any remnant sugar? If the sugar is not converted to alcohol, does it remain in the wine making it sweeter? And if there is more residual sugar, is the wine less alcoholic?
This is the subject of my investigation. Is there an inverse relationship between the amount of alcohol in wine, and its sweetness as measured by its sugar content?
My investigation is based upon the assumption that each wine starts with the same amount of sugar, some of which is converted to alcohol (thus raising the alcoholic content) with the rest remaining as residual sugar in the wine making it sweeter.
I make this assumption as I need a baseline against which to measure the sugar levels in wine. My hypothesis is based on how much sugar remains in a wine after it has been fermented. My hypothesis is that more alcoholic wines have more sugar converted to alcohol, leaving less in the finished product.
I will investigate the relationship between the residual sugar and alcoholic content of wine using a range of statistical methods as outlined below.
I will review my data to determine which variables I need to analyse to test my hypothesis. I will then begin my statistical investigation with a review of the basic summary statistics of each variable. This should quickly show the range of values for each variable and where the distribution is centred (the mean and median).
I will then use a box plot to give a one-dimensional visualisation of the distribution of the values of my variables. This will graphically illustrate the distribution of the values for each variable and identify if there are any outliers.
I will then use a histogram to show the frequency of each group of values for each variable. This will help determine the spread of my values and outliers that could skew my results and thus influence my findings.
My third visualisation will be a scatter plot comparing my two key parameters. This will bring the two variables together in a two-dimensional plane to show any relationship between them, including whether the outliers for one variable coincide with the outliers for the other. This two-dimensional view also shows how the outliers of each variable align across the range of values for the other variable.
I will then use a simple linear regression model to statistically analyse any relationship between my variables. My aim is to show if there is any relationship between my predictor variable, residual sugar, and my dependent variable, the alcoholic content of wine.
I will then run a correlation test to double check any relationship I may find. I will use Pearson’s correlation test, which should show the strength of any linear relationship between my variables.
I obtained my data from the UCI repository (https://archive.ics.uci.edu/ml/datasets/Wine+Quality). The data is open source and has been prepared for statistical analysis. The statistical techniques used in the original study are beyond the remit of this study, so they provided little guidance other than how to prepare and present a professional analysis.
The data set is quite old, so the source website (https://www.vinhoverde.pt/en/) is no longer available. However, I was able to obtain the original article related to the data set (https://linkinghub.elsevier.com/retrieve/pii/S0167923609001377), from which I was able to get a better understanding of the data.
The data description indicates that the data was prepared to evaluate the correlation between 11 physicochemical variables and taste as the output variable. The original investigation included separate sets of data for white and red wine with 4898 and 1599 observations respectively. I have chosen to focus on red wine as I only need one data set for my hypothesis.
My first attempt to load the data placed all 12 variables into one column, which was obviously incorrect. I opened the data source file in Notepad and saw that the column headings were run together with inconsistent separators. I tried several different methods to import this data using different commands in R, but none were able to correctly separate the data into the requisite columns.
To overcome this, I imported my data but skipped the first row with column names. This allowed me to visually check the data to determine if there were any other data import errors. There were none, so I simply copied and pasted the header row from Notepad into my import function, and then manually cleaned up the column headings to make them more readable. I then viewed the data set as a visual check of its completeness.
Wine <- read.csv("Wine.csv", header = FALSE, sep = ";", skip = 1,
                 col.names = c("Fixed_Acidity", "Volatile_Acidity", "Citric_Acid", "Residual_Sugar",
                               "Chlorides", "Free_Sulfur_Dioxide", "Total_Sulfur_Dioxide", "Density",
                               "pH", "Sulphates", "Alcohol", "Quality"))
View(Wine)
The two variables I am interested in are “Residual Sugar” and “Alcohol” (content). Both are continuous numeric variables, or quantitative variables.
Residual Sugar is measured as grams of sugar per litre of wine (g/dm3) and ranges from 0.9 g/dm3 to 15.5 g/dm3 (Cortez et al., 2009, p. 549). Alcohol is measured as a percentage of volume and ranges from 8.4% to 14.9% (op. cit.). I have segregated my bins at whole number break points to make the data simple to refer to.
The data has already been pre-processed. There were no data issues such as missing or incorrect values, so the data did not require any further cleansing other than correctly assigning column names (see above).
I began my study with an examination of the basic summary statistics of each variable.
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   0.900   1.900   2.200   2.539   2.600  15.500
Looking at the interquartile range (IQR) between the first and third quartiles for Residual Sugar, I can see that the middle half of my values falls within a fairly small range from 1.9 to 2.6, with a median of 2.2. This indicates a fairly tight concentration of values across this IQR.
However, the maximum value is 15.5, which has pushed the mean up to 2.539, almost at my third quartile. This indicates that my data is skewed towards the right, but also that there are a few outliers that are having a disproportionate impact on my data.
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    8.40    9.50   10.20   10.42   11.10   14.90
Turning my attention to the alcoholic content of my data, I can see that the values for alcohol are much more evenly spread out. The first quartile is 9.5 and the third quartile is 11.1, with the median (10.2) and mean (10.42) both close to the centre of this IQR, suggesting a more symmetric distribution. At 8.4, my minimum is also quite close to the IQR. However, I still have a high maximum value at 14.9 that is pulling my data to the right.
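The summaries above are the kind of output `summary()` produces for a numeric column. The following is a minimal sketch rather than the code used for this report: so that it runs on its own, it uses a short vector of made-up values (`sugar_toy`, a hypothetical name) instead of `Wine$Residual_Sugar`.

```r
# Made-up values for illustration only (NOT taken from the wine data)
sugar_toy <- c(1.2, 1.8, 1.9, 2.0, 2.2, 2.3, 2.6, 3.1, 6.5, 13.0)

# Minimum, quartiles, median, mean and maximum in one call
sugar_summary <- summary(sugar_toy)
print(sugar_summary)

# IQR = third quartile minus first quartile
sugar_iqr <- IQR(sugar_toy)

# A mean that sits above the median is a quick hint of right skew
mean_minus_median <- mean(sugar_toy) - median(sugar_toy)
print(mean_minus_median)
```

With the real data, the same calls would be made on `Wine$Residual_Sugar` and `Wine$Alcohol`.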
Running a box plot clearly exposes my outliers. I can quickly see that the Residual Sugar variable has a string of outliers extending from my third quartile at 2.6 all the way up to 15.5.
My box plot of Alcohol is more evenly dispersed but still shows a smattering of outliers extending to the right.
I also notice that my IQR for Alcohol is proportionally much larger than it was for Residual Sugar, indicating that the outliers do not have as big an impact with Alcohol as they did with Residual Sugar.
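A box plot and the points it flags as outliers can be sketched as follows. This is an illustrative example under stated assumptions, not the code behind the plots discussed above: `sugar_toy` is a made-up vector standing in for `Wine$Residual_Sugar`, and the outlier rule is R's default for box plots (points more than 1.5 times the box height beyond the box).

```r
# Made-up values for illustration only (NOT taken from the wine data)
sugar_toy <- c(1.2, 1.8, 1.9, 2.0, 2.2, 2.3, 2.6, 3.1, 6.5, 13.0)

# Horizontal box plot; points beyond the whiskers are drawn individually
boxplot(sugar_toy, horizontal = TRUE,
        xlab = "Residual Sugar (g/dm3)",
        main = "Box Plot of Residual Sugar (toy data)")

# boxplot.stats() returns the values that the plot treats as outliers
sugar_outliers <- boxplot.stats(sugar_toy)$out
print(sugar_outliers)
```

Counting `length(sugar_outliers)` for each variable would give a numerical companion to the visual impression of how many outliers each box plot shows.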
I ran a set of histograms to visualise the distribution of data for both variables. I set my bins to 20 to reflect the whole number increments in Residual Sugar levels. I then overlaid a red line to show the mean of the distribution. This produced the first diagram, which shows a few extreme outliers to the right. I can also see that the mean aligns with the largest bin of values.
library(magrittr)  # provides the %>% pipe
Wine$Residual_Sugar %>% hist(col = "grey", xlim = c(0, 20), xlab = "Residual Sugar (g/dm3)",
                             main = "Histogram of Residual Sugar - Bins 20", breaks = 20)
Wine$Residual_Sugar %>% mean() %>% abline(v = ., col = "red", lwd = 2)  # red line marks the mean
However, I suspected that this was not the best representation of my data, as I have nearly 1600 observations which were being represented by fewer than a dozen ranges (my whole number bins). So I decreased my bin width to 10% of each whole number. This showed me that my mean was not in fact aligned to the largest group of values, indicating that the outliers were indeed having an undue impact on my data set.
Wine$Residual_Sugar %>% hist(col = "grey", xlim = c(0, 20), xlab = "Residual Sugar (g/dm3)",
                             main = "Histogram of Residual Sugar - Bins 100", breaks = 100)
Wine$Residual_Sugar %>% mean() %>% abline(v = ., col = "red", lwd = 2)
Interestingly, I can see that my main body of values is almost normally distributed, but I have a long tail to the right that is quite uneven, with many gaps between data ranges. This seems to indicate that I have a few very unusual samples of wine with very high residual sugar content. This would skew my data abnormally to the right.
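One detail worth noting is that when `breaks` is given as a single number, `hist()` treats it as a suggestion and rounds the break points to "pretty" values, so the number of bins drawn may differ from the number requested. A minimal sketch of one way to force an exact bin width is to pass the break points explicitly. The vector `sugar_toy` and the 0 to 16 range below are assumptions chosen for illustration, not values from the wine data.

```r
# Made-up values for illustration only (NOT taken from the wine data)
sugar_toy <- c(1.2, 1.8, 1.9, 2.0, 2.2, 2.3, 2.6, 3.1, 6.5, 13.0)

# Explicit break points give bins that are exactly 0.1 g/dm3 wide
bin_edges <- seq(0, 16, by = 0.1)

h <- hist(sugar_toy, breaks = bin_edges, col = "grey",
          xlab = "Residual Sugar (g/dm3)",
          main = "Histogram of Residual Sugar - bin width 0.1 (toy data)")
abline(v = mean(sugar_toy), col = "red", lwd = 2)  # red line marks the mean

# The returned object records the bins that were actually used
n_bins <- length(h$counts)
print(n_bins)
```

The break points must cover the full range of the data, otherwise `hist()` stops with an error.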
I then ran the same set of plots to visualise the alcoholic content of my wine samples.
Wine$Alcohol %>% hist(col = "grey", xlim = c(8, 15), xlab = "Alcohol (%)",
                      main = "Histogram of Alcohol - Bins 10", breaks = 10)
Wine$Alcohol %>% mean() %>% abline(v = ., col = "red", lwd = 2)
The first histogram shows a maximum frequency between 9 and 9.5 with a steady decline to 13, before dropping suddenly, and with one outlier at 14.5. I also have an outlier at the other end of my data at 8.0, with a few values at 8.5, before a sudden increase in alcoholic content at 9. My mean looks to be roughly centred just below 10.5, giving a sense of normality to the distribution.
Wine$Alcohol %>% hist(col = "grey", xlim = c(8, 15), xlab = "Alcohol (%)",
                      main = "Histogram of Alcohol - Bins 100", breaks = 100)
Wine$Alcohol %>% mean() %>% abline(v = ., col = "red", lwd = 2)
Running the same plot but with a smaller bin width (at 10%) clearly shows a number of abnormalities.
First, I can see that my values vary significantly jumping up and down across the range. My maximum values seem to be focussed on a very narrow range (9.3 – 9.4) with my mean isolated to the right.
I can also see now that I have extreme outliers at 14 and just below 15 that have pushed my data set to the right, but also a few outliers between 8.5 and 9 that may slightly offset this.
The next step was to use a scatter plot to see the spread of my variables against each other.
plot(Residual_Sugar ~ Alcohol, data = Wine, xlab = "Alcohol (%)", ylab = "Residual Sugar (g/dm3)",
     ylim = c(0, 16), col = "red", main = "Scatter Plot - Residual Sugar vs Alcohol")
grid()
The first scatter plot, with Residual Sugar on the Y axis and Alcohol on the X axis, shows a strong band of values between approximately 1g/dm3 and 3g/dm3 of Residual Sugar, stretching from just under 9% of Alcohol to 12%, where it tapers off slightly. The plot shows lots of outliers scattered above 5g/dm3 of Residual Sugar across the range of Alcohol.
The plot does not appear to show any trend between Residual Sugar and Alcohol. We could say that Residual Sugar is concentrated between 1g/dm3 and 3g/dm3 but there is no discernible change as Alcoholic content increases.
plot(Alcohol ~ Residual_Sugar, data = Wine, xlab = "Residual Sugar (g/dm3)", ylab = "Alcohol (%)",
     col = "red", main = "Scatter Plot - Alcohol vs Residual Sugar")
# abline() needs a fitted model (or an intercept and slope), not a data frame
abline(lm(Alcohol ~ Residual_Sugar, data = Wine), col = "red", lwd = 2)
Reversing the position of the variables in the second plot does little to change this picture. Again we have a concentration of values between 1g/dm3 and 3g/dm3 for Residual Sugar and between 9% and 12% of Alcohol, with most of the rest of our values scattered throughout the plot. And again, we can see the same group of outliers above 10g/dm3 of Residual Sugar.
I then ran a linear regression model to determine if there is a statistically significant relationship between Residual Sugar and Alcohol. I do this by injecting my sample data for my two variables (Residual Sugar and Alcohol) into a new data set (“Wine-model2”), running this through the linear model function in R, and then viewing the summary data.
My Null Hypothesis is that there is no linear relationship between Residual Sugar and Alcohol (the slope of the regression line is zero). Thus my alternate hypothesis is that there is a linear relationship (the slope is not zero). I run the model and discuss each component in turn to determine what it tells me about my data.
##
## Call:
## lm(formula = Alcohol ~ Residual_Sugar, data = Wine)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -2.0090 -0.8995 -0.2868  0.6767  4.3192
##
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)
## (Intercept)    10.34224    0.05487 188.476   <2e-16 ***
## Residual_Sugar  0.03180    0.01890   1.683   0.0926 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.065 on 1597 degrees of freedom
## Multiple R-squared: 0.00177, Adjusted R-squared: 0.001145
## F-statistic: 2.832 on 1 and 1597 DF, p-value: 0.09258
My analysis so far has not indicated any significant relationship between my two variables. This is confirmed by my R-Squared result of 0.00177, which shows that Residual Sugar explains less than 0.2% of the variation in Alcohol. In other words, there is at best a negligible linear relationship between my predictor variable (Residual Sugar) and my dependent variable (Alcohol).
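As a quick arithmetic check on what R-Squared measures: in a simple linear regression with a single predictor, R-Squared equals the square of Pearson's correlation coefficient. The sketch below uses the correlation of 0.04207544 reported for the full data set elsewhere in this report; the variable names are hypothetical.

```r
# Pearson correlation between Residual Sugar and Alcohol (full data set),
# as reported in this document
r <- 0.04207544

# With one predictor, R-squared is simply r squared
r_squared <- r^2
print(round(r_squared, 5))   # compare with "Multiple R-squared" in the summary

# Share of the variation in Alcohol explained by the model, in percent
pct_explained <- 100 * r_squared
print(round(pct_explained, 2))
```

R-Squared is therefore a proportion of variance explained, not a probability that a relationship exists.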
The F-Statistic is 2.832 on degrees of freedom 1 and 1597. On its own, the F-Statistic does not tell me anything about my data. However, I can use it to determine a p-value that I can then interpret.
## [1] 0.09259869
At 0.09, my p-value is greater than my 0.05 significance level, thus I fail to reject my null hypothesis. My p-value gives no statistically significant evidence of a linear relationship between residual sugar and alcohol.
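The step from the F-Statistic to the p-value can be sketched with `pf()`, the F distribution function in base R. This is a minimal example using the figures from the model summary above; the variable names are hypothetical and the 0.05 threshold is the significance level stated in this report.

```r
f_stat <- 2.832   # F-statistic from the model summary
df1    <- 1       # numerator degrees of freedom (one predictor)
df2    <- 1597    # denominator degrees of freedom

# Probability of an F value at least this large if the null hypothesis is true
p_value <- pf(f_stat, df1, df2, lower.tail = FALSE)
print(p_value)

# Decision at the 5% significance level
alpha <- 0.05
reject_null <- p_value < alpha
print(reject_null)
```

The result should agree, to rounding, with the p-value printed at the foot of the model summary.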
The final step of my statistical analysis is to validate the assumptions of linear regression to determine whether my model is appropriate. This checks the linearity of the relationship, the normality of the residuals, the constancy of their variance (homoscedasticity) and the presence of unduly influential observations. I do this by running the plot function on my model and displaying the key plots in a 2 by 2 matrix as shown below.
Linearity (Residuals vs Fitted): This plot suggests that the relationship between my variables is not linear, as the red line dips near the start and then drops off significantly in the second half of the plot. This casts doubt on whether a straight-line model is appropriate for Residual Sugar and Alcohol in my wine sample.
Normality (Q-Q plot): It appears that my residuals are not normal, as my QQ plot shows some variation from the reference line. So I cannot assume my residuals are normally distributed.
Homoscedasticity (Scale-Location): The red indicator line is not flat, indicating that my residuals do not have constant variance. My data is therefore heteroscedastic, supporting my finding that my data was not well suited to linear regression modelling.
My final test for linear regression is the test for Cook’s distance. I can see from the final plot, that I have at least two values that fall outside the acceptable range. This indicates that they could be influencing my analysis unduly.
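The four diagnostic plots discussed above are what R draws when `plot()` is called on a fitted `lm` object. The following is a sketch under stated assumptions, not the code used for this report: the `toy` data frame holds made-up values, and the 4/n cut-off for Cook's distance is a common rule of thumb assumed here for illustration, which is not the same as the 0.5 and 1 contours R draws on its own plot.

```r
# Made-up values for illustration only (NOT taken from the wine data)
toy <- data.frame(
  Residual_Sugar = c(1.2, 1.8, 1.9, 2.0, 2.2, 2.3, 2.6, 3.1, 6.5, 13.0),
  Alcohol        = c(9.4, 9.8, 10.5, 9.5, 11.2, 10.0, 9.2, 12.1, 10.8, 9.9)
)

model <- lm(Alcohol ~ Residual_Sugar, data = toy)

# Residuals vs Fitted, Q-Q, Scale-Location and Residuals vs Leverage
par(mfrow = c(2, 2))   # arrange the plots in a 2 by 2 matrix
plot(model)
par(mfrow = c(1, 1))   # restore the default layout

# Cook's distance for each observation, with an assumed 4/n rule of thumb
cooks <- cooks.distance(model)
influential <- which(cooks > 4 / nrow(toy))
print(influential)
```

Listing the influential rows by index makes it easier to inspect them individually before deciding whether to keep or remove them.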
Given I had so many outliers for residual sugar, I removed them to determine if they had any major impact on my research. I removed all outliers above 3g/dm3 of residual sugar. I re-ran the basic plots for the filtered data.
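The filtering step can be sketched with `subset()`. This is an illustrative example only: the `toy` data frame holds made-up values, `toy_filter` is a hypothetical name standing in for `Wine_Filter`, and keeping values of exactly 3 g/dm3 (`<=` rather than `<`) is an assumption.

```r
# Made-up values for illustration only (NOT taken from the wine data)
toy <- data.frame(
  Residual_Sugar = c(1.2, 1.8, 1.9, 2.0, 2.2, 2.3, 2.6, 3.1, 6.5, 13.0),
  Alcohol        = c(9.4, 9.8, 10.5, 9.5, 11.2, 10.0, 9.2, 12.1, 10.8, 9.9)
)

# Keep only the rows at or below the 3 g/dm3 cut-off
toy_filter <- subset(toy, Residual_Sugar <= 3)

# Record how many observations were dropped by the filter
n_removed <- nrow(toy) - nrow(toy_filter)
print(n_removed)
```

Reporting the number of rows removed alongside the results makes it clear how much of the sample the filtered analysis is based on.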
I start my fresh analysis with a new boxplot to visually show the spread of my filtered data across my new reduced range of values.
I can clearly see now that my data is much more evenly distributed. The box representing my IQR for Residual Sugar now spans a much larger share of the range, and my minimum and maximum values sit closer to it. I also only have one outlier, which will have a much smaller impact on my data set.
I then run a new histogram to show the frequency of my values for Residual Sugar.
Wine_Filter$Residual_Sugar %>% hist(col = "grey", xlim = c(1, 3), xlab = "Residual Sugar Filtered (g/dm3)",
                                    main = "Histogram of Residual Sugar - Bins 20", breaks = 20)
Wine_Filter$Residual_Sugar %>% mean() %>% abline(v = ., col = "red", lwd = 2)
The filtered histogram shows a much more normal distribution for Residual Sugar. There is only one low and one high outlier, but these are quite close to the main group of values. This main group is also quite normally distributed with what appears to be a normal bell curve shape. Finally, my mean (as shown by the red line) is near the centre of my visualisation, indicating that there are no extreme values (outliers) pulling it to any one side.
This gives me hope that I may now find a relationship between Residual Sugar and Alcohol, based on the assumption that it was the Residual Sugar outliers that prevented this correlation from becoming apparent in my first pass analysis.
For the sake of completeness, I also run a new scatter plot to try to visualise any correlation between my variables.
plot(Residual_Sugar ~ Alcohol, data = Wine_Filter, xlab = "Alcohol (%)", ylab = "Filtered Residual Sugar (g/dm3)",
     ylim = c(1, 3), col = "red", main = "Scatter Plot - Filtered Residual Sugar vs Alcohol")
grid()
Unfortunately, my scatter plot does not show any significant relationship. My values appear to be fairly evenly scattered throughout the plot with no discernible trend in any particular direction. This is consistent with my Null Hypothesis from my first pass that there is no relationship between Residual Sugar and Alcohol in my wine sample.
I also then ran my linear model against my filtered data.
##
## Call:
## lm(formula = Alcohol ~ Residual_Sugar, data = Wine_Filter)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -1.9849 -0.8437 -0.2818  0.6373  3.7182
##
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)
## (Intercept)     9.95189    0.16338  60.914  < 2e-16 ***
## Residual_Sugar  0.20620    0.07612   2.709  0.00684 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.042 on 1357 degrees of freedom
## Multiple R-squared: 0.005378, Adjusted R-squared: 0.004645
## F-statistic: 7.338 on 1 and 1357 DF, p-value: 0.006836
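My R-Squared is still very low at 0.005, indicating that Residual Sugar explains only about 0.5% of the variation in Alcohol. The p-value for the slope (0.00684) is below my 0.05 significance level, so the slope is statistically significant, but it is positive (0.206) rather than negative as my hypothesis predicted, and the relationship is far too weak to be of practical use.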
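The coefficient row for Residual_Sugar in the filtered model summary can be checked by hand. The sketch below takes the estimate, standard error and residual degrees of freedom from that summary (the variable names are hypothetical) and recomputes the t value, its two-sided p-value and the F-statistic.

```r
estimate  <- 0.20620   # slope for Residual_Sugar (filtered model)
std_error <- 0.07612   # standard error of the slope
df_resid  <- 1357      # residual degrees of freedom

# t value = estimate divided by its standard error
t_value <- estimate / std_error

# Two-sided p-value from the t distribution
p_value <- 2 * pt(abs(t_value), df = df_resid, lower.tail = FALSE)

# With a single predictor, the F-statistic is the square of the t value
f_value <- t_value^2

print(c(t = t_value, p = p_value, F = f_value))
```

The p-value tests whether the slope differs from zero, while R-Squared measures how much of the variation the model explains, so a model can have a small p-value and still explain very little.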
## [1] 0.04207544
Pearson’s correlation test shows a very low correlation between my variables in the full data set, coming in at 0.04. This is very close to zero, which indicates that there is only a very slight positive correlation between my variables. The p-value for this correlation is 0.0926, which is above my 0.05 significance level, so the correlation is not statistically significant and I cannot conclude that there is any correlation.
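Whether a correlation is statistically significant is judged from its p-value, not from the size of the coefficient itself. A minimal sketch of how that p-value follows from the coefficient and the sample size is given below, using the correlation and the 1599 red wine observations reported in this document; the variable names are hypothetical.

```r
r <- 0.04207544   # Pearson correlation (full data set)
n <- 1599         # number of observations

# Test statistic for the null hypothesis that the true correlation is zero
t_value <- r * sqrt(n - 2) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution with n - 2 degrees of freedom
p_value <- 2 * pt(abs(t_value), df = n - 2, lower.tail = FALSE)

print(c(t = t_value, p = p_value))
```

In practice, `cor.test(Wine$Residual_Sugar, Wine$Alcohol)` carries out the same test directly on the data. The t value here matches the one for the slope in the first regression, because the two tests are equivalent when there is a single predictor.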
library(Hmisc)
# Create a matrix of the variables to be correlated
bivariate <- as.matrix(dplyr::select(Wine, Residual_Sugar, Alcohol))
rcorr(bivariate, type = "pearson")
##                Residual_Sugar Alcohol
## Residual_Sugar           1.00    0.04
## Alcohol                  0.04    1.00
##
## n= 1599
##
##
## P
##                Residual_Sugar Alcohol
## Residual_Sugar                0.0926
## Alcohol        0.0926
## [1] -0.006960058 0.090909069
We can cross-check this by looking at the Confidence Interval of our correlation using the CIr function. This gives me a 95% CI ranging from -0.0070 to 0.0909. As this interval includes zero, I fail to reject my null hypothesis that there is no correlation between my variables.
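The confidence interval above can be reproduced in base R with the Fisher z-transformation, which is a standard way to build an interval for a correlation. This is a sketch under that assumption, with hypothetical variable names, using the correlation and sample size reported in this document.

```r
r <- 0.04207544   # Pearson correlation (full data set)
n <- 1599         # number of observations

# Fisher z-transformation and its standard error
z  <- atanh(r)
se <- 1 / sqrt(n - 3)

# 95% interval on the z scale, transformed back to the correlation scale
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se)
print(ci)

# The correlation is significant at the 5% level only if 0 lies outside the CI
contains_zero <- ci[1] < 0 && ci[2] > 0
print(contains_zero)
```

The interval is compared with zero, not with the p-value, because zero is the value of the correlation under the null hypothesis.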
My study set out to determine if there is a relationship between residual sugar and alcohol in wine. I wanted to find out if sweeter wine was more or less alcoholic, to test a common assumption that drinking too much wine leads to obesity.
My reasoning was that we drink to relax and socialise, with the alcohol in wine being the key relaxant. But the more alcohol we drink, the more sugar we consume, therefore the more weight we put on. So our efforts to relax can lead to obesity, which would make us more stressed and thus negate the intended effects of drinking.
I ran a battery of basic statistical visualisations that I thought would clearly show a relationship between the sugar content and alcohol content of wine. I started with the basic summary statistics of each variable to determine their normality. I then displayed the data in a one dimensional box plot to display the range of outliers before showing the frequency of each group of values through a histogram.
These two plots both showed that there were many outliers for Residual Sugar and that they could be having an undue impact on my distributions. I was able to see this by refining the bins of my histogram, which showed a far greater variability in the values for Residual Sugar than would be expected in a normal distribution.
The same set of plots for Alcohol did not show as much variability at the macro level, which was reassuring. However, my data was still not normally distributed. Refining the histogram bins again showed that my data was in fact quite variable, with the frequency for each reading of Alcohol jumping about from value to value.
My final visualisation, a set of scatter plots, confirmed these preliminary views, that there was little if any correlation between Residual Sugar and Alcohol in my wine sample.
Regardless, I wanted to push on with my analysis, hoping that a more mathematical study would provide a more accurate finding, based on the belief that while a diagram may be worth a thousand words, a precise numerical answer would be more definitive.
Alas, this was not to be the case, as my linear model did not provide the convincing evidence I was seeking that sweeter wines were less alcoholic. To the contrary, my linear regression model failed to reject my Null Hypothesis, giving me no evidence of a relationship between my variables.
Not to be outdone, I decided to remove the outliers that seemed to be having the greatest impact on my study. These were the large number of outliers for Residual Sugar that were skewing my data to the right.
I then re-ran my visualisations which appeared to show a more normal distribution. I hoped that this would then allow me to finally prove a correlation between the Residual Sugar and alcoholic content of my wine sample.
Temptingly, my final linear regression model returned a slope that was statistically significant at the 5% significance level (p = 0.007), although its R-Squared of 0.005 means it explains only about 0.5% of the variation in Alcohol.
But should I accept this as a meaningful relationship? The effect was tiny and, being positive, ran in the opposite direction to my hypothesis. What should I conclude?
Erring on the side of caution, I probably should not treat this finding as practically significant. I do this for a number of reasons as follows:
The very large number of outliers for Residual Sugar that seemed to skew my data so strongly. Should I have removed more of them? Or perhaps not remove them, but replace them with the median value?
The great variability of values for both Residual Sugar and Alcohol. Neither variable was normally distributed, and in fact, both had significant variation across their range of values.
The lack of any clear correlation between my variables. Surely if I wanted to prove my hypothesis to a lay person, I would need a far clearer visual relationship. And to prove it to a statistician, I would need a much stronger set of mathematical findings.
For all these reasons, I decided in the end that my findings were not practically significant.
Thus, my research results did not find any meaningful relationship between the Residual Sugar content in wine and its alcoholic content.
Although as a budding student of data science, I was disappointed that my results did not support my hypothesis, as a social person who likes to drink wine, I was initially pleased to discover that there was no meaningful correlation. I took this to mean that I could continue to enjoy the alcoholic effects of wine without gaining weight.
However, I quickly realised that my research question was fundamentally flawed for I was only looking at the residual sugar content of wine: that is the sugar that is left over after the fermentation process which had already converted a portion of the original sugar content into the alcohol in the wine.
Surely to understand the effects of sugar and alcohol, I needed to know the amount of sugar that the wine started with?
I tried to overcome this fundamental flaw with my hypothesis by making the assumption that all the wines in my sample started with the same sugar content. The logic was that the greater the portion converted to alcohol (leading to a higher alcoholic content), the less residual sugar would remain to make the wine sweet. This would therefore lead to an inverse relationship between Residual Sugar and Alcohol.
There were many problems with my study. I have listed my key reasons for rejecting my hypothesis above.
But the fundamental problem with my study is that I started with the wrong premise. I thought I could prove a relationship between the alcoholic content of wine and residual sugar, when in fact the relationship should be between the initial amount of sugar and alcoholic content. This was my fundamental flaw!
To prove my hypothesis properly, I would need to actually know the amount of sugar that each wine in my sample started with. I could then have properly established if there was a relationship between the total sugar content of wine and the alcoholic content.
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009, ISSN: 0167-9236.
https://linkinghub.elsevier.com/retrieve/pii/S0167923609001377