1 Background Information

Worldwide, red wine is a valuable and highly sought-after commodity. It is a type of wine made from dark-colored grape varieties. The color of red wine ranges from violet for younger wines to brick red for more mature wines. Red wine of high quality is highly valuable in the global market. According to YouGov, a global data company, red wine is more popular than white or rosé wine.

image by Jamie Ballard from YouGov

The quality of wine is determined by its chemical properties. The problem is that red wine has numerous chemical properties, each of which must be kept at a certain level to produce the highest quality wine. In this analysis post, we are going to look at and analyze several notable chemical properties (parameters) of red wine to find out which property values are associated with the highest quality.

2 Environment Setup and Dataset

2.1 Environment and Library Setup

To support the data wrangling and data visualization process, we are going to use tidyverse. Tidyverse is a package that bundles several packages commonly used in data science. Some of these packages are:

  • dplyr -> Data Wrangling
  • ggplot2 -> Visualization
  • readr -> Input & Output

Later on, we are also going to use plyr to calculate the median for each group and GGally to compute and plot the correlation analysis results.
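The setup described above can be sketched as follows. The load order is an assumption on my part rather than something stated in this post: attaching plyr before tidyverse keeps dplyr's verbs (such as summarise and mutate) from being masked by plyr's functions of the same name.

```r
# Attach plyr first; packages attached later take precedence on the search path,
# so dplyr's summarise(), mutate(), etc. stay the default versions.
suppressPackageStartupMessages({
  library(plyr)       # per-group medians later on
  library(tidyverse)  # dplyr, ggplot2, readr and friends
  library(GGally)     # correlation plots
})
```

If plyr were attached after tidyverse instead, calls such as `summarise()` inside a `group_by()` pipeline would silently use plyr's version and ignore the grouping.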

2.2 Data Explanation

## 
## -- Column specification --------------------------------------------------------
## cols(
##   `fixed acidity` = col_double(),
##   `volatile acidity` = col_double(),
##   `citric acid` = col_double(),
##   `residual sugar` = col_double(),
##   chlorides = col_double(),
##   `free sulfur dioxide` = col_double(),
##   `total sulfur dioxide` = col_double(),
##   density = col_double(),
##   pH = col_double(),
##   sulphates = col_double(),
##   alcohol = col_double(),
##   quality = col_double()
## )
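The column specification above is the message readr's read_csv prints when it guesses the column types. A minimal sketch of the loading step follows; the function name `read_wine`, the file name, and the comma delimiter are assumptions, since the post does not show where the file comes from. Declaring the types explicitly is optional and simply silences the message.

```r
library(readr)

# Read the red wine data with every column parsed as a double,
# matching the column specification printed by read_csv().
# `path` is a hypothetical location; point it at your own copy of the data.
read_wine <- function(path) {
  read_csv(path, col_types = cols(.default = col_double()))
}

# Hypothetical usage (file name is an assumption):
# wine <- read_wine("winequality-red.csv")
```

If your copy of the file is semicolon-delimited, `read_delim(path, delim = ";")` would be the call to swap in.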

To get a quick look at the red wine data, we can use summary.

However, the output of summary alone is cluttered and sometimes hard to read. As such, I used the code below to reduce every column to a few summary statistics (min, median, and max) and present them as a data frame.
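One way to build such a table is sketched below. This is not necessarily the code used for this post; `summary_table` is a hypothetical helper name, and it assumes a data frame whose numeric columns hold no missing values.

```r
library(dplyr)
library(tidyr)

# Reduce every numeric column to min, median, and max, one row per parameter.
summary_table <- function(df) {
  df %>%
    select(where(is.numeric)) %>%
    pivot_longer(everything(), names_to = "parameter", values_to = "value") %>%
    group_by(parameter) %>%
    dplyr::summarise(
      min = min(value),
      median = median(value),
      max = max(value),
      .groups = "drop"
    )
}

# Hypothetical usage, with `wine` being the loaded data frame:
# summary_table(wine)
```

Calling `dplyr::summarise` with its namespace avoids picking up plyr's function of the same name when both packages are attached.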

Looking at the table above, we still need to convert the quality column into a factor (categorical), because this column indicates the graded quality of each red wine sample.

##   3   4   5   6   7   8 
##  10  53 681 638 199  18
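The counts above are the kind of output `table()` gives once quality is a factor. A minimal sketch of the conversion, with `quality_to_factor` as a hypothetical helper name:

```r
# Convert the numeric quality score into a factor whose levels
# follow the numeric order of the scores present in the data.
quality_to_factor <- function(df) {
  df$quality <- factor(df$quality, levels = sort(unique(df$quality)))
  df
}

# Hypothetical usage, with `wine` being the loaded data frame:
# wine <- quality_to_factor(wine)
# table(wine$quality)
```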

3 Discussion

3.1 Data Exploration

Before any analysis, let's do a quick check of the distribution of each red wine parameter. Numerical and categorical columns are visualized differently.

The numerical columns are: fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, and alcohol.

There is only one categorical column, which is the quality column.

3.1.1 Visualize the Distribution of Numerical columns

To simplify the plotting process, I use a for loop to plot the distribution of every numerical column in the red wine dataset.
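A loop of this kind might look like the sketch below. The helper name `plot_distributions`, the choice of a histogram, and the bin count are my assumptions, not details taken from this post. The column name is injected with `!!` so that each plot keeps its own column rather than all plots pointing at the last value of the loop variable.

```r
library(ggplot2)

# Build one histogram per numeric column and return them as a named list.
plot_distributions <- function(df, bins = 30) {
  numeric_cols <- names(df)[vapply(df, is.numeric, logical(1))]
  plots <- list()
  for (col in numeric_cols) {
    # rlang::sym() turns the column name (even one with spaces) into a symbol;
    # !! inserts it into aes() immediately instead of looking up `col` later.
    plots[[col]] <- ggplot(df, aes(x = !!rlang::sym(col))) +
      geom_histogram(bins = bins) +
      labs(title = paste("Distribution of", col), x = col, y = "Count")
  }
  plots
}

# Hypothetical usage, with `wine` being the loaded data frame:
# for (p in plot_distributions(wine)) print(p)
```

Inside a for loop, ggplot objects are not drawn automatically, so an explicit `print()` is needed to display them.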

Distribution visualization is important for several reasons.

  1. Check for data distribution type

An interesting parameter is citric acid. Knowing only its median can be misleading because the distribution of citric acid is not normal. The mean (or median) is usually a useful indicator of where the center of the data lies. However, it is only representative when the data are roughly symmetric with a single peak, as in a normal distribution.

Visualizing the distribution gives us a better understanding of our data. Now that we have visualized the distribution, we know that most red wine samples have a citric acid value close to either 0 or 0.5.

  2. Get a quick grasp of data volume and spread

Visualizing the distribution also lets us quickly grasp the volume of data we are dealing with. We can check the maximum value and the range (min to max) of the data, and we can also spot outliers.

3.1.2 Visualize the Distribution of categorical column

Since quality is a categorical column, we are going to visualize it differently, with geom_col.

We can see that most wine samples are of average quality (5 and 6). There are some wine samples of good and very good quality (7 and 8), and some of low quality (3 and 4).
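Because geom_col draws bar heights taken directly from the data, the samples have to be counted per quality score first. A minimal sketch, with `plot_quality_counts` and `n_samples` as hypothetical names:

```r
library(dplyr)
library(ggplot2)

# Count the samples per quality score, then draw one bar per score.
plot_quality_counts <- function(df) {
  counts <- dplyr::count(df, quality, name = "n_samples")
  ggplot(counts, aes(x = factor(quality), y = n_samples)) +
    geom_col() +
    labs(x = "Quality score", y = "Number of samples")
}

# Hypothetical usage, with `wine` being the loaded data frame:
# plot_quality_counts(wine)
```

An alternative would be `geom_bar()`, which does the counting itself; geom_col is used here to match the text.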

3.2 Effect of chemical properties on red wine quality

We first run a correlation analysis of the red wine chemical properties against red wine quality with the help of the GGally package.

From the result above, no single parameter is strongly correlated with red wine quality. Fortunately, we found two parameters with moderate correlation values: alcohol and volatile acidity. We investigate these two parameters further with ggpairs to see their correlation plots.

I am interested in what differentiates high quality red wine from lower quality red wine. As such, I categorized the samples as follows:
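To rank the parameters by how strongly they correlate with quality, a plain numeric version of the analysis can be sketched in base R. The helper name `quality_correlations` is hypothetical, and Pearson correlation (the default of `cor`) is an assumption, since the post does not state which method was used.

```r
# Correlate every numeric parameter with quality and sort by absolute strength.
quality_correlations <- function(df) {
  # quality may already be a factor, so convert it back to its numeric score
  quality <- as.numeric(as.character(df$quality))
  predictors <- df[, setdiff(names(df), "quality"), drop = FALSE]
  predictors <- predictors[, vapply(predictors, is.numeric, logical(1)), drop = FALSE]
  r <- vapply(predictors, function(x) cor(x, quality), numeric(1))
  r[order(abs(r), decreasing = TRUE)]
}

# Hypothetical usage, with `wine` being the loaded data frame:
# quality_correlations(wine)
# The GGally plots could then be drawn with, for example:
# GGally::ggcorr(wine, label = TRUE)
# GGally::ggpairs(wine, columns = c("alcohol", "volatile acidity", "quality"))
```

Sorting by absolute value puts moderately negative correlations, such as the one for volatile acidity, next to the positive ones instead of at the bottom of the list.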

  1. High Quality = Quality score of 7 or 8
  2. Average Quality = Quality score of 5 or 6
  3. Low quality = Quality score 3 or 4
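The three groups above can be derived from the quality score with `cut()`. This is a minimal sketch; `quality_group` is a hypothetical helper name, and scores outside 3 to 8 would become NA.

```r
# Map quality scores to the three groups: 3-4 Low, 5-6 Average, 7-8 High.
quality_group <- function(quality) {
  # works for both numeric scores and a factor of scores
  q <- as.numeric(as.character(quality))
  # intervals are (2,4], (4,6], (6,8]
  cut(q, breaks = c(2, 4, 6, 8), labels = c("Low", "Average", "High"))
}

# Hypothetical usage, with `wine` being the loaded data frame:
# wine$quality_group <- quality_group(wine$quality)
```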

3.2.1 Alcohol Levels

Here we can see how strongly alcohol level (content) relates to red wine quality. High quality red wine has, on average, a higher alcohol level than average and low quality red wine. Interestingly, low and average quality red wine share the same median value of 10. It is important to note, however, that 6 samples in the average quality group fall among the outliers. This shows us that alcohol level is important in determining red wine quality, but other parameters also contribute to overall quality. Let's look at another parameter.
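The per-group medians and the boxplot behind this comparison can be sketched as follows. The helper names and the `quality_group` column name are assumptions; the data frame is expected to already carry the Low/Average/High grouping.

```r
library(plyr)
library(ggplot2)

# Median of one parameter within each quality group, computed with plyr.
median_by_group <- function(df, value_col, group_col = "quality_group") {
  ddply(df, group_col, function(d) data.frame(median = median(d[[value_col]])))
}

# Boxplot of one parameter split by quality group.
plot_by_group <- function(df, value_col, group_col = "quality_group") {
  ggplot(df, aes(x = .data[[group_col]], y = .data[[value_col]])) +
    geom_boxplot() +
    labs(x = "Quality group", y = value_col)
}

# Hypothetical usage, with `wine` holding a quality_group column:
# median_by_group(wine, "alcohol")
# plot_by_group(wine, "alcohol")
```

The same two helpers apply unchanged to other parameters, for example `median_by_group(wine, "volatile acidity")`.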

3.2.2 Volatile Acidity

The boxplot results support our correlation analysis, which indicates a moderate negative correlation between red wine quality and volatile acidity. But what is volatile acidity, and why does it matter so much for red wine quality?

According to Ellie Douglas from Decanter, in simple terms, volatile acidity refers to the acidic elements of wine that are gaseous rather than liquid, and can therefore be sensed and judged by smell. A high level of volatile acid in wine is generally undesirable because it gives a characteristic acidic smell and, in some cases, an even worse rotten smell. A high level of volatile acidity could also be an indicator of unclean winemaking, which is a procedural fault.

4 Conclusion

  1. Many chemical parameters contribute to red wine quality; no single parameter is strongly correlated with it.
  2. Most red wine samples are of average quality, and only a select few are of high quality (a score of 7 or above).
  3. Alcohol level is an indicator of high quality. Higher alcohol levels are usually found in higher quality red wine.
  4. A high amount of volatile acidity is undesirable in red wine. High quality red wine usually has a low level of volatile acidity.