Overview: This tidy data set contains 1,599 red wines with 11 variables on the chemical properties of the wine. At least 3 wine experts rated the quality of each wine, providing a rating between 0 (very bad) and 10 (very excellent).
The goal of the project is to identify the critical variables that significantly affect wine quality and then to derive the optimal condition of those variables, given the dataset. The project proceeds as follows. First, univariate analysis filters out variables that are unlikely to change the quality of wine, based on how much each variable varies. Second, bivariate analysis examines the relationship between each remaining variable and quality, and narrows the selection down to the critical factors. Third, multivariate analysis focuses on the correlation among the critical factors, leading to their optimal condition. Finally, we predict the quality of wine based on the derived condition.
The dataset that we analyze has 1,599 samples, each of which consists of 11 variables and quality as an output. The sample size affects the level of confidence we can place in the evaluation results.
##                    X        fixed.acidity     volatile.acidity
##                 1599                 1599                 1599
##          citric.acid       residual.sugar            chlorides
##                 1599                 1599                 1599
##  free.sulfur.dioxide total.sulfur.dioxide              density
##                 1599                 1599                 1599
##                   pH            sulphates              alcohol
##                 1599                 1599                 1599
##              quality
##                 1599
The name of each variable is as follows:
## [1] "X" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality"
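The counts and names above can be produced along these lines. This is a minimal sketch, not the original code: the tiny `wine` data frame is a hypothetical stand-in for the real file, and counting non-missing values with `colSums(!is.na(...))` is an assumption about how the counts were obtained.

```r
# Hypothetical stand-in for the wine data (the real set has 1,599 rows and 13 columns).
wine <- data.frame(
  X       = 1:4,
  alcohol = c(9.4, 9.8, NA, 11.2),
  quality = c(5L, 5L, 6L, 7L)
)

# Number of non-missing values in each column.
counts <- colSums(!is.na(wine))
print(counts)

# Column names, in file order.
print(names(wine))
```

If every count equals the number of rows, as in the output above, no column has missing values.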
The distribution of each variable gives us an idea of how much it can influence the evaluation of wine quality. The smaller the variation, the more difficult it is to differentiate quality by tasting and smelling. In other words, the variation should be large enough for evaluators to distinguish wines. The following figure shows the distributions of the 11 variables. Chlorides and density seem to vary too little to make a difference, while fixed acidity, volatile acidity, and alcohol show a decent spread.
The quantitative comparison of standard deviations in the figure below makes this clearer. Density varies the least of all variables, and chlorides is next. We therefore speculate that a change in either of these two variables is unlikely to be noticed by evaluators, and hence that they are likely irrelevant factors. We will evaluate this finding further. To build the figure, we calculated the standard deviation of each variable and collected the results in a table to draw a bar chart. For comparison purposes, the values are converted to a log scale.
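The standard-deviation table described above might be built as follows. This is a sketch under stated assumptions: the `wine` data frame is synthetic (three columns with made-up means and spreads), `sd_table` is an illustrative name, and base 10 is assumed for the log scale since the text does not say which base was used.

```r
# Hypothetical stand-in for the wine data: three columns with very different spreads.
set.seed(42)
wine <- data.frame(
  alcohol   = rnorm(200, mean = 10.4,  sd = 1.0),
  chlorides = rnorm(200, mean = 0.09,  sd = 0.05),
  density   = rnorm(200, mean = 0.997, sd = 0.002)
)

# Standard deviation of every column, collected into a table.
sd_table <- data.frame(
  variable = names(wine),
  sd       = unname(sapply(wine, sd))
)

# Log scale (base 10 assumed) so that spreads differing by orders of
# magnitude can be compared on one chart.
sd_table$log10_sd <- log10(sd_table$sd)

# Sort from the least variable column to the most variable one.
sd_table <- sd_table[order(sd_table$log10_sd), ]
print(sd_table)
```

A call such as `barplot(sd_table$log10_sd, names.arg = sd_table$variable)` would then draw the bar chart from this table.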
Scatter plots are used for bivariate analysis. The following figure shows the relationship between each variable and quality. Volatile acidity, sulphates, and alcohol are weakly correlated with quality: sulphates and alcohol have a positive relation with quality, while volatile acidity has a negative relation. In contrast, residual sugar, free sulfur dioxide, and pH, like chlorides and density, seem almost irrelevant to quality.
In detail, as shown in the figure of quantitative comparison, we rule out two additional variables: fixed acidity and total sulfur dioxide. Finally, we narrow down to the two most correlated variables, i.e., volatile acidity and alcohol. We may also take citric acid and sulphates into account. To build the figure, we calculated Pearson’s product-moment correlation of each variable with quality and collected the results in a table to draw a bar chart. For comparison purposes, the values are converted to a log scale.
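The correlation table can be sketched in the same way. The data below are synthetic, and the quality formula is invented purely so that the signs resemble those reported in the text. The variables are ranked by absolute correlation; the log conversion is left out because the text does not say how negative correlations were handled.

```r
# Hypothetical stand-in for the wine data.
set.seed(1)
n <- 1000
alcohol          <- rnorm(n, mean = 10.4, sd = 1.0)
volatile.acidity <- rnorm(n, mean = 0.53, sd = 0.18)
chlorides        <- rnorm(n, mean = 0.09, sd = 0.05)

# Invented relation: quality rises with alcohol, falls with volatile
# acidity, and ignores chlorides.
quality <- 5.6 + 0.4 * (alcohol - 10.4) -
  1.5 * (volatile.acidity - 0.53) + rnorm(n, mean = 0, sd = 0.6)
wine <- data.frame(alcohol, volatile.acidity, chlorides, quality)

# Pearson's product-moment correlation of each predictor with quality.
predictors <- setdiff(names(wine), "quality")
r <- sapply(predictors, function(v) {
  cor(wine[[v]], wine$quality, method = "pearson")
})

# Rank the variables by the strength of the correlation, whatever its sign.
cor_table <- data.frame(variable = predictors, r = unname(r))
cor_table$abs_r <- abs(cor_table$r)
cor_table <- cor_table[order(-cor_table$abs_r), ]
print(cor_table)
```

`cor.test()` could be used instead of `cor()` when a p-value or confidence interval for each correlation is also wanted.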
Volatile acidity and alcohol are chosen as the critical factors most correlated with quality. As shown in the following multivariate plot of quality by volatile acidity and alcohol, there are more wines of better quality as the level of volatile acidity gets lower. At the same time, the higher the alcohol percentage, the better the wine quality.
In addition to the critical factors, the correlation of sulphates is worth evaluating. As can be seen in the figure, wine quality improves as sulphates rise within the range from 0.50 to 0.90, together with an increase in the alcohol percentage. Beyond a sulphates level of 0.90, sulphates have little effect on quality.
By now, we have two critical factors, i.e., volatile acidity and alcohol, and two additional factors, i.e., citric acid and sulphates. With these factors, we fit four nested linear models of the log of quality, adding one factor at a time:
##
## Calls:
## m1: lm(formula = I(log(quality)) ~ I(alcohol), data = wine)
## m2: lm(formula = I(log(quality)) ~ I(alcohol) + volatile.acidity,
## data = wine)
## m3: lm(formula = I(log(quality)) ~ I(alcohol) + volatile.acidity +
## sulphates, data = wine)
## m4: lm(formula = I(log(quality)) ~ I(alcohol) + volatile.acidity +
## sulphates + citric.acid, data = wine)
##
## ================================================================
##                       m1          m2          m3          m4
## ----------------------------------------------------------------
##   (Intercept)        1.074***    1.302***    1.219***    1.229***
##                     (0.032)     (0.034)     (0.036)     (0.037)
##   I(alcohol)         0.062***    0.053***    0.052***    0.052***
##                     (0.003)     (0.003)     (0.003)     (0.003)
##   volatile.acidity              -0.258***   -0.230***   -0.243***
##                                 (0.017)     (0.018)     (0.021)
##   sulphates                                  0.116***    0.121***
##                                             (0.018)     (0.019)
##   citric.acid                                           -0.023
##                                                         (0.019)
## ----------------------------------------------------------------
##   R-squared            0.2         0.3         0.3         0.3
##   adj. R-squared       0.2         0.3         0.3         0.3
##   sigma                0.1         0.1         0.1         0.1
##   F                  412.1       344.9       248.9       187.1
##   p                    0.0         0.0         0.0         0.0
##   Log-likelihood     996.8      1100.5      1120.3      1121.1
##   Deviance            26.9        23.6        23.1        23.0
##   AIC              -1987.7     -2193.0     -2230.6     -2230.1
##   BIC              -1971.5     -2171.5     -2203.7     -2197.8
##   N                 1599        1599        1599        1599
## ================================================================
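Written out, the coefficients of m3 in the table give the fitted relation below. Choosing m3 for illustration is a reading of the table rather than a statement from the analysis: m3 has the lowest AIC and BIC of the four models, and the citric acid estimate added in m4 (-0.023, standard error 0.019) is small relative to its standard error. Here log is the natural logarithm, which is what R's `log()` computes by default.

```latex
\begin{aligned}
\log(\text{quality}) &\approx 1.219 + 0.052\,\text{alcohol}
  - 0.230\,\text{volatile.acidity} + 0.116\,\text{sulphates} \\
\text{quality} &\approx \exp\bigl(1.219 + 0.052\,\text{alcohol}
  - 0.230\,\text{volatile.acidity} + 0.116\,\text{sulphates}\bigr)
\end{aligned}
```

Because the response is on a log scale, each coefficient acts multiplicatively on quality. For example, one additional unit of alcohol multiplies the predicted quality by exp(0.052), roughly 1.053, with the other variables held fixed.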
Based on our findings, the selected values of the factors yield good-quality wine, i.e., a predicted Grade 7. The output below gives the predicted quality (fit) together with the lower (lwr) and upper (upr) bounds of the 95 percent interval.
## fit lwr upr
## 1 7.045481 5.559666 8.928379
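A prediction of this form can be obtained by fitting the model on log(quality) and exponentiating the result. The sketch below is not the original code. The data are synthetic and were generated with coefficients close to those of m3. The values in `new_wine` are hypothetical, since the inputs actually used are not shown. Using `interval = "prediction"` is an assumption about which 95 percent interval was reported.

```r
# Hypothetical stand-in for the wine data, generated so that log(quality)
# is roughly linear in the three predictors.
set.seed(7)
n <- 500
wine <- data.frame(
  alcohol          = runif(n, min = 8.5, max = 14),
  volatile.acidity = runif(n, min = 0.1, max = 1.2),
  sulphates        = runif(n, min = 0.3, max = 1.5)
)
wine$quality <- exp(
  1.2 + 0.05 * wine$alcohol - 0.23 * wine$volatile.acidity +
    0.12 * wine$sulphates + rnorm(n, mean = 0, sd = 0.1)
)

# Linear model on the log of quality, same structure as m3.
m3 <- lm(log(quality) ~ alcohol + volatile.acidity + sulphates, data = wine)

# Hypothetical condition to evaluate (not the values used in the analysis).
new_wine <- data.frame(alcohol = 13, volatile.acidity = 0.3, sulphates = 0.8)

# predict() returns fit, lwr and upr on the log scale; exp() maps all
# three back to the quality scale.
pred_log <- predict(m3, newdata = new_wine,
                    interval = "prediction", level = 0.95)
pred <- exp(pred_log)
print(pred)
```

Exponentiating keeps the interval bounds valid because exp() is monotonic, although the back-transformed fit is then closer to a median than a mean of the predicted quality.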
Quality as an output is represented by the grades that wine evaluators provide. It ranges from Grade 3 to Grade 8, and most wines fall in Grades 5 and 6. The distribution of quality appears to follow a Gaussian distribution with a mean of 5.636. In general, samples should be selected randomly for a proper evaluation. Here the grades are not evenly represented, which may limit the findings and reasoning drawn from the data analysis.
Alcohol is selected as one of the critical factors that determine wine quality. The following figure is the scatter plot that shows a decent positive correlation between alcohol and quality. As the alcohol percentage gets higher, quality gets better.
Volatile acidity is the other critical factor. In contrast to alcohol, it has a negative correlation with quality: we find better wine as the volatile acidity level gets lower. A higher percentage of alcohol combined with a lower level of volatile acidity is most likely to produce good-quality wine. This finding may help wine producers improve the taste of their wine. In addition to volatile acidity and alcohol, sulphates and citric acid are also considered candidate critical factors.
Throughout the analysis, we narrowed 11 variables down to two critical factors. We found that some variables, such as density, vary too subtly for evaluators to notice. Chlorides also turn out to be irrelevant to quality. Some variables have a decent correlation with quality, but unfortunately none of the correlations is very strong. Based on our findings and using a linear model, we predict wine quality for a given condition of the variables. Although we can predict wine quality, the reasoning is limited by the uneven sampling. There may also be other factors, such as how the wines are preserved, e.g., temperature and humidity. Nevertheless, data and models remain a great tool for analysis and prediction.