Key questions of my project

  1. Which chemical properties differentiate high and low quality red wine?
  2. Is there a chemical composition pattern in high and low quality red wine?
  3. What chemical properties can we use to identify the color of the wine?

How much high, medium, and low quality wine do we have in the dataset?

Although the quality distributions are similar, the white wine dataset has almost 3x as many observations as the red wine dataset.

Let’s sample the white wine dataset to match the number of red wine observations for further analysis.
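As a minimal sketch of this down-sampling step, the base R snippet below draws as many white wine rows as there are red wine rows. The data frame name `white.wine` and the toy data are assumptions made for illustration, not the real dataset.

```r
# Toy stand-ins for the real data frames (values are made up for illustration).
set.seed(42)
red.wine   <- data.frame(quality = sample(3:8, 100, replace = TRUE))
white.wine <- data.frame(quality = sample(3:9, 300, replace = TRUE))

# Pick as many white wine row indices as there are red wine rows,
# without replacement, then keep only those rows.
keep <- sample(nrow(white.wine), size = nrow(red.wine))
white.sample <- white.wine[keep, , drop = FALSE]

# Both datasets now have the same number of observations.
nrow(white.sample) == nrow(red.wine)
```

A plain random sample like this only approximately preserves the quality distribution; a stratified sample keeps it closer to the original.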

What are the top chemical properties that differentiate high and low quality wine?

Let’s categorize all the wines into high, medium, and low quality based on the quality score quartiles.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.636   6.000   8.000

Based on the quality score quartiles, a score of 7 or higher is defined as high quality, a score lower than 5 as low quality, and the rest as medium quality. Here are the counts by quality category.

table(red.wine$qual.category)
## 
##   High    Low Medium 
##    217     63   1319
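The categorization rule above can be sketched as a small helper. The function name `categorize.quality` is hypothetical, and the snippet assumes the score lives in a numeric column such as `quality`.

```r
# Hypothetical helper: map numeric quality scores to the three categories.
#   score >= 7 -> "High", score < 5 -> "Low", everything else -> "Medium"
categorize.quality <- function(quality) {
  ifelse(quality >= 7, "High",
         ifelse(quality < 5, "Low", "Medium"))
}

# One example score from each part of the observed 3 to 8 range.
scores <- c(3, 4, 5, 6, 7, 8)
categorize.quality(scores)
```

Under that assumption, the category column could be created with `red.wine$qual.category <- categorize.quality(red.wine$quality)`.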

Let’s compare each chemical property by high and low wine quality to identify any pattern or distinction.

There are two patterns I am hoping to find: differences in distribution and differences in density. However, it’s difficult to identify patterns in distribution by looking at the histograms. In addition, the histograms give a misleading impression of density because the numbers of high and low quality wines differ so much. There are 217 high quality wines but only 63 low quality wines, which is more than a 3x difference.

Let’s look at the data at a more aggregated level by using boxplots. They help us compare the central values (the line inside each box marks the median) and the overall distribution.

Note that there is a significant difference in the distribution of Volatile Acidity and Alcohol by quality compared to properties such as Fixed Acidity or Chlorides.
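A minimal sketch of such a boxplot comparison in base R graphics follows. The toy values are made up, and the column names `alcohol` and `qual.category` are assumed; the original plots may well have been drawn with a different plotting library.

```r
# Made-up toy data: unequal group sizes, as in the real high/low split.
set.seed(1)
toy <- data.frame(
  qual.category = rep(c("High", "Low"), times = c(30, 10)),
  alcohol = c(rnorm(30, mean = 11.5, sd = 1),
              rnorm(10, mean = 10.2, sd = 1))
)

# boxplot() draws one box per category and invisibly returns the
# five-number summary of each group in $stats (one column per group).
bp <- boxplot(alcohol ~ qual.category, data = toy,
              xlab = "Quality category", ylab = "Alcohol")

# Row 3 of $stats is the median of each group.
medians <- setNames(bp$stats[3, ], bp$names)
medians
```

Because each box summarizes its own group, the comparison is not distorted by the difference in group sizes the way raw histogram counts are.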

The following chemical properties have distinct value distributions (the boxes are clearly offset from each other) and central values between high and low quality red wine:

  1. Volatile Acidity
  2. Sulphates
  3. Alcohol

Let’s analyze the relationship between these chemical properties further by wine quality.

Can we see a distinction between the key chemical properties among high and low quality wine?

To find chemical distinctions, we can analyze the pair-wise composition of the key chemical properties we identified above. For example, we want to see whether most high quality red wines have alcohol > 10 and volatile acidity < 0.4.

Let’s use scatter plots since we are exploring relationships between two variables. To identify potential composition patterns by quality, let’s color each point by wine quality.
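The example rule above (alcohol > 10 and volatile acidity < 0.4) can be checked numerically before plotting anything. This is a sketch on made-up values; the column names `alcohol` and `volatile.acidity` are assumed.

```r
# Made-up toy rows, for illustration only.
toy <- data.frame(
  qual.category    = c("High", "High", "High", "Low", "Low"),
  alcohol          = c(11.2, 12.0, 9.8, 9.5, 10.4),
  volatile.acidity = c(0.30, 0.35, 0.38, 0.70, 0.62)
)

# TRUE for every wine that satisfies both conditions of the rule.
meets.rule <- toy$alcohol > 10 & toy$volatile.acidity < 0.4

# Share of wines in each quality category that satisfy the rule
# (the mean of a logical vector is the proportion of TRUE values).
share <- tapply(meets.rule, toy$qual.category, mean)
share
```

A high share for the high quality category combined with a low share for the low quality category would support the rule.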

It’s interesting to see clustering by wine quality in each chemical property pair. For example, in the Sulphates and Alcohol graph, most of the low quality red wine is clustered in the lower left corner while high quality wine is around the mid-right region.

The clustering can be used to predict the quality of a red wine given the pairing of the key chemical properties. When a new wine is given, we can predict that it is high quality if its data point lands in the high quality region of each pair-wise plot.

What chemical property can we use to differentiate red and white wine without looking at the color or names?

We can use the same approach we used to compare the chemical properties of high and low quality red wine. Let’s compare the central values and value distributions of each chemical property by red and white wine.

Note that there is a significant difference in the distribution of Volatile Acidity, which is much lower in white wine.

Note that there are significant differences in the distributions of Residual Sugar, Total Sulfur Dioxide, and Density between red and white wine.

The following chemical properties differ significantly between red and white wine in both value distribution and means:

  1. Volatile Acidity - white wine is lower
  2. Residual Sugar - there are a lot of outliers in red wine, but white wine is higher (sweeter) in general
  3. Total Sulfur Dioxide - white wine is much higher
  4. Density - white wine is lower

Let’s dive deeper to see how each chemical property varies by the quality of the wine.

Do these key chemical properties vary significantly by the quality of the wine?

Let’s use frequency diagrams to analyze the distribution of each chemical property by wine color and quality, hoping to see distinctions in concentration and value range. Also, let’s add a vertical line at the average of each chemical property for a more insightful comparison.

The distribution of Volatile Acidity, including the locations of the peaks, differs noticeably among high, medium, and low quality red wine. In contrast, the distributions across white wine quality categories are quite similar.

Based on the graphs, some chemical properties have distributions and concentrations that depend on wine quality. For example, density in high and medium quality white wine has very different profiles. The density of high quality white wine is concentrated to the left, below the average, while it is close to a normal bell-shaped distribution for medium quality. Similarly, volatile acidity in red wine displays the same kind of difference.
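As a rough sketch of a frequency diagram with a mean reference line, the snippet below uses base R graphics on made-up values. The original figures may have been produced differently (for example with a plotting package), so treat this only as an illustration of the idea.

```r
# Made-up values standing in for one chemical property of one wine group.
set.seed(7)
volatile.acidity <- rnorm(200, mean = 0.5, sd = 0.15)

# Frequency histogram; hist() also returns the bin counts in $counts.
h <- hist(volatile.acidity, breaks = 20,
          main = "Volatile Acidity", xlab = "Volatile Acidity")

# Dashed vertical line at the average, for easier comparison across panels.
abline(v = mean(volatile.acidity), lty = 2)
```

Repeating this per wine color and quality category, with shared axis limits, gives panels that can be compared side by side.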

Do we also see clustering in pair-wise chemical composition by wine quality and color?

Let’s use scatter plot matrices to save work since there are four variables.

Also, let’s adjust the alpha to reduce overplotting and use transparency to emphasize overlapping areas. The alpha of each graph should differ based on the total number of observations. For example, medium quality has the highest number of observations, so its alpha should be the lowest to further reduce overplotting.

There is apparent clustering in each pair-wise composition by wine color. However, many plots have significant overlapping areas.

Some chemical compositions have clear boundaries between red and white wine across all three quality tiers, such as Volatile Acidity vs. Total Sulfur Dioxide. Some compositions have less distinctive separation across all quality tiers, such as the Density-Residual Sugar pair, where many red and grey points overlap.
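One possible way to choose the alpha per panel is to scale it inversely with the number of observations. The helper `alpha.for` and its default values are hypothetical choices for illustration, not the values used in the original plots; the counts are the red wine category counts shown earlier.

```r
# Hypothetical helper: alpha shrinks as the number of points n grows.
#   n.ref     - the group size that should receive alpha.ref
#   alpha.ref - alpha used for the reference group
#   alpha.min - lower bound so points never become invisible
alpha.for <- function(n, n.ref, alpha.ref = 0.5, alpha.min = 0.05) {
  pmax(pmin(alpha.ref * n.ref / n, 1), alpha.min)
}

# Red wine counts by quality category, taken from the table earlier.
counts <- c(High = 217, Low = 63, Medium = 1319)

# Use the smallest group as the reference.
alphas <- alpha.for(counts, n.ref = min(counts))
alphas
```

With these settings the medium quality panel gets the lowest alpha and the low quality panel the highest, matching the reasoning above.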

Final Plots and Summary

Plot #1: Wine Quality Distribution by Wine Color

Description: This histogram aims to provide an overview of the distribution of wine quality in the dataset. The wine quality has a roughly bell-shaped distribution for both red and white wine. The majority of the red and white wines have a medium quality score of 5 or 6.

Plot #2: Unique Composition of Key Chemical Properties

Description: This group of scatter plots aims to demonstrate unique chemical compositions that distinguish high and low quality red wine. Each scatter plot draws the values of one pair-wise combination of Sulphates, Alcohol, and Volatile Acidity, colored by wine quality (red = high quality, blue = low quality).

In general, the high and low quality wines show clustering in all pair-wise chemical compositions. However, the clusters have some overlap in the Alcohol-Volatile Acidity and Sulphates-Volatile Acidity combinations. In contrast, the boundary between high and low quality wine in the Alcohol-Sulphates composition is quite distinctive: roughly a straight boundary line with intercepts at about y = 0.8 and x = 13.5.

Plot #3: Key Chemical Properties to Differentiate Red and White Wine

Description: This group of boxplots aims to highlight the key chemical properties that differentiate red and white wine. To identify such chemicals, we can compare the respective mean values and ranges of the value distributions side by side. Volatile Acidity, Residual Sugar, Total Sulfur Dioxide, and Density all demonstrated significant differences in mean values between red and white wine, with a minimum 50% gap (difference over the lower mean value).
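The gap measure described above (difference over the lower mean value) can be written as a one-line helper. The function name `relative.gap` is hypothetical and the inputs below are illustrative numbers, not values from the dataset.

```r
# Hypothetical helper: absolute difference of two means divided by the
# lower of the two, so 0.5 corresponds to a 50% gap.
relative.gap <- function(mean.a, mean.b) {
  abs(mean.a - mean.b) / pmin(mean.a, mean.b)
}

# Illustrative inputs only: means of 2 and 3 differ by 1, and 1 / 2 = 0.5.
relative.gap(2, 3)
```

In practice the two arguments would be the per-color means of a property, for example computed with `tapply()` over a combined red and white data frame.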

Reflection

In retrospect, this project came with challenges, successes, and ideas for enriching the data in future work. In terms of challenges, I wasn’t sure how to handle the differences in the number of high, medium, and low quality wine observations. There are significantly more medium quality wines than high and low quality ones. I hoped to have the same number of observations in each quality category in order to make a fair comparison and avoid misleading visualizations. My first approach was to sample 18 wine observations from each quality category, considering there were only 18 high quality wines based on my categorization. However, doing this would over-represent high quality wines and under-represent medium quality wines, and would leave too few data points for analysis.

I decided to take another approach after discussing with the course coach. I kept the data in all quality categories to maintain the data distribution. However, the white wine dataset is about twice as large as the red wine dataset, so I used proportional stratified sampling to extract 50% of the data while keeping the percentage breakdown by wine quality the same. For example, if the percentage breakdown by quality is 20%, 50%, and 30% in the original dataset, the sample with half of the dataset size will have the same breakdown.
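A minimal base R sketch of proportional stratified sampling is shown below, using the 20% / 50% / 30% breakdown from the example. The helper `stratified.sample` and the toy data frame are hypothetical; the original analysis may have used a different function or package.

```r
# Hypothetical helper: keep the same fraction of rows from every stratum.
stratified.sample <- function(df, strata, fraction) {
  # Row indices of df, grouped by the value of the stratum column.
  groups <- split(seq_len(nrow(df)), df[[strata]])
  keep <- unlist(lapply(groups, function(idx) {
    n.keep <- round(length(idx) * fraction)
    # Index via sample.int() so a stratum with a single row is handled safely.
    idx[sample.int(length(idx), n.keep)]
  }))
  df[keep, , drop = FALSE]
}

# Toy data with a 20% / 50% / 30% breakdown by quality category.
set.seed(123)
toy <- data.frame(
  qual.category = rep(c("High", "Medium", "Low"), times = c(20, 50, 30)),
  id = 1:100
)

# Half of the rows, with the same breakdown: 10 High, 15 Low, 25 Medium.
half <- stratified.sample(toy, "qual.category", 0.5)
table(half$qual.category)
```

Because the fraction is applied within each category, the sampled counts stay proportional to the original ones up to rounding.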

To avoid misleading visualizations and overplotting due to the difference in dataset sizes, I used transparency in the scatter plot comparisons by adjusting the alpha accordingly.

Regarding success, I was glad to identify distinct differences in chemical composition in high and low quality red wine as well as key chemical properties to identify red and white wine.

In the future, the analysis can be enriched by incorporating attributes such as the country where the grapes were grown, the years of operation of the wine manufacturer, the retail price of the wine, or the sales volume. With these additional data, we can further understand how the origin of the grapes and the wine maker’s experience influence wine quality. We can also analyze how consumer preference relates to wine quality and pricing.

For instance, we can create histograms by wine quality and grape origin to understand whether good wines largely depend on grapes from certain regions. Also, we can apply simple supervised learning, such as the Gaussian Naive Bayes method, to create boundary functions for predicting wine quality and color based on the composition of chemical properties.
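To make the Gaussian Naive Bayes idea concrete, here is a small hand-written sketch in base R, fitted on made-up data. The function names `gnb.fit` and `gnb.predict`, the column names, and the toy values are all assumptions for illustration; a real analysis would more likely rely on an established package and on held-out data for evaluation.

```r
# Fit: per-class prior plus per-class mean and sd of every feature.
gnb.fit <- function(x, y) {
  x <- as.matrix(x)
  classes <- sort(unique(as.character(y)))
  per.class <- function(f) {
    do.call(rbind, lapply(classes, function(k) f(x[y == k, , drop = FALSE])))
  }
  list(
    classes = classes,
    prior   = vapply(classes, function(k) mean(y == k), numeric(1)),
    mu      = per.class(colMeans),
    sigma   = per.class(function(m) apply(m, 2, sd))
  )
}

# Predict: pick the class with the highest log prior + summed log densities.
gnb.predict <- function(model, newx) {
  newx <- as.matrix(newx)
  scores <- sapply(seq_along(model$classes), function(k) {
    # t(newx) is features x rows, so mean/sd recycle feature by feature.
    log.lik <- dnorm(t(newx), mean = model$mu[k, ],
                     sd = model$sigma[k, ], log = TRUE)
    log(model$prior[k]) + colSums(log.lik)
  })
  scores <- matrix(scores, nrow = nrow(newx))
  model$classes[max.col(scores, ties.method = "first")]
}

# Made-up, well separated toy clusters (not the real wine data).
set.seed(2024)
n <- 50
toy <- data.frame(
  alcohol       = c(rnorm(n, 12, 0.5),   rnorm(n, 9.5, 0.5)),
  sulphates     = c(rnorm(n, 0.8, 0.05), rnorm(n, 0.5, 0.05)),
  qual.category = rep(c("High", "Low"), each = n)
)

model <- gnb.fit(toy[, c("alcohol", "sulphates")], toy$qual.category)
new.wines <- data.frame(alcohol = c(12.5, 9.0), sulphates = c(0.85, 0.45))
predicted <- gnb.predict(model, new.wines)
predicted
```

The method assumes the features are independent and normally distributed within each class, which is a simplification worth checking against the plots before trusting the predictions.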