Introduction

Wine quality is the product of many interacting factors, among them pH, sulphate content, chloride content, density, and flatness. This raises a question: can wine quality be accurately predicted from such variables? This study investigated whether a wine’s pH and sulfur content could be used to accurately predict its quality.

The pH of a substance is a measure of its acidity, reflecting the balance of hydrogen and hydroxide ions in solution. pH is calculated as the negative base-10 logarithm of the hydrogen ion concentration. Substances with a pH above 7 are basic, whereas those with a pH below 7 are acidic.
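The logarithmic scale can be illustrated with a short R sketch that computes pH as the negative base-10 logarithm of a hydrogen ion concentration. The helper name `ph_from_concentration` and the example concentrations are assumptions made for illustration and are not part of the original analysis; strictly, pH is defined on hydrogen ion activity, which this sketch approximates by molar concentration.

```r
# Hypothetical helper: pH from hydrogen ion concentration (mol/L).
ph_from_concentration <- function(h_conc) {
  stopifnot(is.numeric(h_conc), all(h_conc > 0))
  -log10(h_conc)
}

# Illustrative concentrations (not taken from the dataset).
neutral_ph <- ph_from_concentration(1e-7)     # neutral solution
acidic_ph  <- ph_from_concentration(10^-3.3)  # an acidic solution
print(c(neutral = neutral_ph, acidic = acidic_ph))
```

Because the scale is logarithmic, each one-unit drop in pH corresponds to a tenfold increase in hydrogen ion concentration.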

Many experts note that a wine’s pH affects almost every aspect of the wine: flavor, aroma, color, and potentially even quality (http://winemakersacademy.com/importance-ph-wine-making/). pH has even been unofficially dubbed the “backbone of a wine.” Specifically, pH is said to affect the ability of bacteria to populate the wine and the saturation of the wine’s color. Since a wine’s quality is a product of its combined characteristics, and several of those characteristics depend on pH, a change in pH should tend to change quality as well. Therefore, an association between pH and quality is to be expected.

Sulphates are also regarded by experts as important factors in wine quality (http://www.decanter.com/learn/wine-terminology/sulfites-in-wine-friend-or-foe-295931/). In this report, “sulphate” is used loosely to mean a compound with sulfur as a key component of its molecular structure; strictly speaking, a sulphate is a salt containing the sulphate ion. Most of the sulfur compounds found in wine are sulfur dioxide molecules and sulfite ions.

Many experts believe that higher sulfur content causes a duller taste in wine, and that a high concentration of sulfite ions presents a health risk and speeds up the wine’s fermentation process. This suggests that higher sulphate content tends to correspond with lower wine quality. Another sulfur compound, sulfur dioxide, is thought to help rid the wine of a wide variety of bacteria (good and bad); this seems to lower wine quality as well because it dulls the wine’s fermentation process.

Since wine quality is a qualitative characteristic that is difficult to describe precisely, the dataset represents it as a numeric score, with values from 3 to 8 (3 being the lowest quality and 8 the highest). These values were established by the dataset’s creators through a series of meticulous sensory and visual tests intended to maximize the accuracy of each quality evaluation (http://www3.dsi.uminho.pt/pcortez/wine5.pdf). Some specific factors taken into account were consistency, clarity, and vibrance of taste.

Based on this research and expert opinion, it was predicted that changes in wine pH and sulfur content would correspond to changes in wine quality.

Methods

The UCI Machine Learning Repository (UCIMLR) is a collection of databases, domain theories, and data generators used in analysis of machine learning algorithms. The archive was created in 1987 by David Aha and fellow graduate students at UC Irvine.

The dataset used in this project was obtained from the module labeled “wine quality” on the UCIMLR. The module contained two different datasets: the red and white variants of the Portuguese “Vinho Verde” wine. For the purposes of this experiment, only the red wine data were used. No pricing, brand, or grape-type data were included. The quality classes were ordered but not balanced. The variables present in the dataset included acidity, residual sugar, chlorides, alcohol, pH, sulphate content, free sulfur dioxide, density, and quality. The dataset was retrieved from the following link: http://mlr.cs.umass.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv

The dataset was imported from a comma-separated values (CSV) file.

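# Set the working directory to the folder holding the CSV (path is machine-specific)
setwd("/Users/Michael/Desktop/STATSFINALPROJECT")
# If the file is semicolon-delimited, as the UCI original is, add sep = ";"
redwine <- read.csv("winequality-red.csv", header = TRUE)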

No columns were imported as factors, so no conversion was needed; all variables were numeric. The variables investigated in this project were wine pH (redwine$pH), sulphate content (redwine$sulphates), and free sulfur dioxide (redwine$free.sulfur.dioxide). Each was plotted individually against wine quality, and the resulting graph was used to assess the strength of that variable’s association with quality.
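A check of the column types after import can be sketched as follows. The inline CSV text, its values, and the name `toy_wine` are made up for illustration and stand in for the real file; only the column names mirror those used in this report.

```r
# Tiny inline stand-in for the CSV; the values are invented for illustration.
csv_text <- "pH,sulphates,free.sulfur.dioxide,quality
3.4,0.6,12,5
3.1,0.7,20,6
3.3,0.5,8,5
3.2,0.8,30,7"

toy_wine <- read.csv(text = csv_text, header = TRUE)

# TRUE for every column parsed as numeric (integer columns count as numeric).
numeric_cols <- sapply(toy_wine, is.numeric)
print(numeric_cols)

# A column read as factor or character would need converting before modeling.
all_numeric <- all(numeric_cols)
```

Running the same `sapply(..., is.numeric)` check on the imported data frame is one quick way to confirm that no conversion is needed.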

For each variable, a linear regression model was created to measure its predictive power over wine quality. In each graph, the variable in question was plotted on the X axis and quality on the Y axis. A linear fit was created for each relationship, and its R-squared value was recorded and compared with those of the other variables. It was pre-established that an R-squared value above 0.80 would be considered a strong predictive relationship between that variable and wine quality. Conversely, an R-squared value below 0.80 would suggest that the variable does not have sufficient predictive power over wine quality.
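The procedure described above can be sketched as a small helper that fits one model per predictor and compares each R-squared value with the 0.80 threshold. The helper `r_squared_for`, the data frame `toy`, and its columns are hypothetical and use synthetic data, not the wine dataset; this is one possible arrangement, not the code used in the study.

```r
# Hypothetical helper: fit response ~ predictor and return the R-squared.
r_squared_for <- function(data, predictor, response = "quality") {
  model_formula <- reformulate(predictor, response = response)
  fit <- lm(model_formula, data = data)
  summary(fit)$r.squared
}

# Threshold stated in the Methods section.
strong_threshold <- 0.80

# Synthetic data (not the wine dataset): one strong and one weak predictor.
set.seed(1)
n <- 200
toy <- data.frame(strong_x = rnorm(n), weak_x = rnorm(n))
toy$quality <- 2 * toy$strong_x + rnorm(n, sd = 0.3)

r2 <- sapply(c("strong_x", "weak_x"), r_squared_for, data = toy)
is_strong <- r2 > strong_threshold
print(round(r2, 3))
print(is_strong)
```

With the real data, the same helper could be applied to "pH", "sulphates", and "free.sulfur.dioxide" in a single call instead of refitting by hand.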

The variables were extracted into their own vectors as follows:

pH <- redwine$pH
sulphateContent <- redwine$sulphates
freeSulfurDioxide <- redwine$free.sulfur.dioxide

Results

None of the relationships modeled yielded R-squared values above 0.80. Likewise, none of the graphs displayed a clear association between the X variable and wine quality.

plot(redwine$quality ~ redwine$pH, main = "The Relationship of pH and Wine Quality",
     xlab = "pH",
     ylab = "wine quality")
fit <- lm(redwine$quality ~ redwine$pH)
abline(fit, col = "red")

summary(fit)$r.squared
## [1] 0.003332914

There was no clear association between red wine quality and pH. The points seemed to be arranged almost randomly, as was apparent from the roughly circular shape visible in the graph. The R-squared value of the graph’s fit was found to be 0.003. This was below the pre-established threshold for a “strong” association. Overall, these results suggested that almost no variation in quality (about 0.3%) could be explained by variation in pH, and thus that wine pH has very little predictive power over wine quality.

plot(redwine$quality ~ redwine$sulphates, main = "The Relationship of Sulphate Content and Wine Quality",
     xlab = "sulphates (g/dm^3)",
     ylab = "wine quality")
fit <- lm(redwine$quality ~ redwine$sulphates)
abline(fit, col = "red")

summary(fit)$r.squared
## [1] 0.06320049

There was no clear association between red wine quality and sulphate content. The points seemed to be scattered randomly with no clear pattern. The R-squared value of the graph’s fit was found to be 0.06. This was below the pre-established threshold for a “strong” association (R-squared above 0.80). These results suggested that only about 6% of the variation in quality could be explained by variation in sulphate content, and thus that sulphate content has very little predictive power over wine quality.

plot(redwine$quality ~ redwine$free.sulfur.dioxide,
     main = "The Relationship of Free Sulfur Dioxide Content and Wine Quality",
     xlab = "free sulfur dioxide (mg/dm^3)",
     ylab = "wine quality")
fit <- lm(redwine$quality ~ redwine$free.sulfur.dioxide)
abline(fit, col = "red")

summary(fit)$r.squared
## [1] 0.002566036

There was no clear association between red wine quality and free sulfur dioxide content. The points on the graph did not display any clear pattern, although they appeared slightly less random than in the previous two graphs. The R-squared value of the graph’s fit was found to be 0.003. This was below the pre-established threshold for a “strong” association. Overall, these results suggested that almost no variation in quality (about 0.3%) could be explained by variation in free sulfur dioxide, and thus that free sulfur dioxide content has very little predictive power over wine quality.

Discussion

The results of this study did not support its hypothesis. While it was predicted that pH, sulphate content, and free sulfur dioxide content would affect wine quality, no factor showed strong predictive power over wine quality, as all R-squared values were below the pre-established threshold for “strong” associations. Moreover, none of the graphs had any discernible trend; in each one, the points seemed to be scattered randomly. R-squared measures the proportion of variation in the dependent variable that can be explained by the explanatory variable, so a low R-squared value means that little of the variation was explained. In simple terms, a linear fit with a very small R-squared value, as seen in the graphs above, describes two variables that show little linear dependence on each other, so that one holds little predictive power over the other. As such, common beliefs about the ability of pH and sulphate content to affect wine quality were not supported by these data, as quality seemed to vary largely independently of pH level and of sulphate and free sulfur dioxide content.
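The interpretation of R-squared as a share of explained variation can be made concrete with a short sketch that computes it from its definition, one minus the residual sum of squares divided by the total sum of squares, and compares the result with the value reported by `summary()`. The data here are synthetic and the variable names are illustrative assumptions; nothing below uses the wine dataset.

```r
# Synthetic data for illustration only (not the wine dataset).
set.seed(42)
x <- runif(100, min = 2.5, max = 4.5)
y <- 5 + 0.2 * x + rnorm(100)

fit <- lm(y ~ x)

# R-squared from its definition: 1 - (residual SS / total SS).
ss_res <- sum(residuals(fit)^2)
ss_tot <- sum((y - mean(y))^2)
r2_manual <- 1 - ss_res / ss_tot

# For a one-predictor model with an intercept, this equals the squared correlation.
r2_reported <- summary(fit)$r.squared
r2_from_cor <- cor(x, y)^2
print(c(manual = r2_manual, reported = r2_reported, squared_cor = r2_from_cor))
```

The three values agree, which is why a near-zero R-squared in a one-predictor fit also implies a near-zero linear correlation between the two variables.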

Theoretically, these results made sense because, as mentioned in the introduction, experts tend to see wine quality as a composite of all of the wine’s other characteristics. In other words, whereas one wine might have had a low pH and a high quality, another wine could have had a higher pH but the same quality because its other characteristics were different. It is very likely that factors other than pH and sulphate content are at play in the determination of wine quality.
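The idea that quality depends on several characteristics at once can be illustrated with a sketch that compares a one-predictor fit with a fit using several predictors together. The data are synthetic, the columns `f1`, `f2`, and `f3` are made-up stand-ins for wine characteristics, and no multi-variable model was fitted to the wine dataset in this study, so the sketch shows only the general technique.

```r
# Synthetic data for illustration only (not the wine dataset).
set.seed(7)
n <- 300
toy <- data.frame(f1 = rnorm(n), f2 = rnorm(n), f3 = rnorm(n))
# The response depends on all three made-up factors at once.
toy$quality <- toy$f1 + toy$f2 + toy$f3 + rnorm(n, sd = 0.5)

single_fit <- lm(quality ~ f1, data = toy)
joint_fit  <- lm(quality ~ f1 + f2 + f3, data = toy)

r2_single <- summary(single_fit)$r.squared
r2_joint  <- summary(joint_fit)$r.squared
print(round(c(single = r2_single, joint = r2_joint), 3))
```

Adding predictors to a least-squares model never lowers R-squared, so a fair comparison on real data would also consider adjusted R-squared or held-out data.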

This experiment’s main limitation was that the wine data were taken from only one specific type of red wine. It is possible that the quality of wines other than Vinho Verde is, in fact, affected by changes in pH and/or sulfur content. In the future, it would be interesting to investigate this experiment’s hypothesis within a larger scope, perhaps combining data from hundreds of different types of wine. Also, since variables such as grape type, selling price, barrel wood type, and brand were not included in the dataset, it would be interesting to find a dataset with these variables present in order to investigate their association with wine quality. They may well have stronger predictive power over quality than pH or sulphate content.

Conclusion

The linear regression analysis suggested that changes in neither pH nor sulfur content (sulphate and/or free sulfur dioxide) were strongly associated with changes in the quality of Portuguese Vinho Verde wine. Of the linear fits created for each variable’s association with quality, none yielded an R-squared value above 0.80 (the predetermined threshold for “strong” predictive power), so it could not be concluded that any of these three variables held strong predictive power over wine quality.