Red Wine Exploration by Tzu-Hao Wang

The dataset is related to red variants of the Portuguese “Vinho Verde” wine. There are 11 independent variables in total and 1 dependent variable, quality.

## 'data.frame':    1599 obs. of  12 variables:
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : Factor w/ 6 levels "3","4","5","6",..: 3 3 3 4 3 3 3 5 5 3 ...

Univariate Plots Section

To begin, we explore the distribution of each variable. First, let us check the distribution of the response, quality. Quality levels 5 and 6 together account for more than 80 percent of the wines, which means most wines are rated 5 or 6. Only about one percent of the wines reach level 8, the best quality in the dataset.

##        3        4        5        6        7        8 
##  "0.63%"  "3.31%" "42.59%"  "39.9%" "12.45%"  "1.13%"
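A table like the one above can be produced by tabulating the quality factor and converting counts to shares. The sketch below is a minimal illustration, not the original code. The per-level counts are reconstructed from the printed percentages and the 1599 observations, and all variable names are assumptions.

```r
# Reconstruct per-level counts from the printed percentages (an assumption:
# the original vector is not shown, only its percentages and n = 1599).
pct <- c("3" = 0.63, "4" = 3.31, "5" = 42.59, "6" = 39.90, "7" = 12.45, "8" = 1.13)
counts <- round(pct / 100 * 1599)

# Rebuild a quality factor with those counts, then tabulate it as shares.
quality <- factor(rep(names(counts), times = counts))
shares <- prop.table(table(quality))

# Format the shares as percentage labels with two decimals.
pct_labels <- paste0(round(100 * shares, 2), "%")
names(pct_labels) <- names(shares)
print(pct_labels)

# Combined share of the two middle levels, 5 and 6.
mid_share <- sum(shares[c("5", "6")])
print(mid_share)
```

The combined share of levels 5 and 6 is what backs the "more than 80 percent" statement.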

Next, we go through all the independent variables to check their distributions. By exploring each variable, we can check whether the characteristics stated in the dataset description hold in this data.

The first graph is fixed acidity, which has a roughly bell-shaped distribution with a slight right skew (the mean, 8.32, is above the median, 7.90).

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.60    7.10    7.90    8.32    9.20   15.90

The second graph is volatile acidity. Compared to fixed acidity, it is not right skewed and is close to a normal distribution.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3900  0.5200  0.5278  0.6400  1.5800

The third graph is citric acid. Its distribution shows a decreasing trend that is divided into two parts: one from 0 to 0.2 and the other from 0.2 to 0.75.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.090   0.260   0.271   0.420   1.000

The fourth graph is residual sugar. The dataset description states that it is rare to find wines with less than 1 gram/liter. In this dataset, most residual sugar values are between 1 and 4. To handle the outliers and the right-skewed distribution, I apply a log transform to the x-axis.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.900   2.200   2.539   2.600  15.500
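To illustrate why a log transform helps with a right-skewed variable, the sketch below compares the skewness of raw and log10-transformed residual sugar. It uses only the first ten values shown in the `str()` output, so the numbers are illustrative and do not describe the full dataset. The `skewness` helper is a hypothetical function defined here, not part of base R.

```r
# First ten residual.sugar values, taken from the str() output at the top.
sugar <- c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2.0, 6.1)

# Moment-based sample skewness (hypothetical helper, not a base R function).
skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

raw_skew <- skewness(sugar)
log_skew <- skewness(log10(sugar))
print(c(raw = raw_skew, log10 = log_skew))

# Bin the log-transformed values without drawing, to inspect the counts.
h <- hist(log10(sugar), breaks = 5, plot = FALSE)
print(h$counts)
```

In ggplot2 the same idea is usually expressed by adding `scale_x_log10()` to a histogram, which transforms the axis rather than the data column.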

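The fifth graph is the distribution of chlorides. Apart from the 1% of outliers, most chloride values fall in a narrow band: the summary below shows that the middle half of the wines lies between 0.070 and 0.090, while the maximum is 0.611.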

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100

The sixth graph is the distribution of free SO2, which is severely right skewed.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    7.00   14.00   15.87   21.00   72.00

The seventh graph is the distribution of total SO2. Like the distribution of free SO2, it is clearly severely right skewed.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   22.00   38.00   46.47   62.00  289.00

The eighth graph shows that the density of most wines is between 0.9925 and just above 1.00 (the maximum is 1.0037). This distribution does not resemble those of residual sugar and alcohol. We return to this in a later part.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9901  0.9956  0.9968  0.9967  0.9978  1.0037

The ninth graph is the distribution of pH values. According to the description, most wines are between 3 and 4. In this dataset, most wines are between 3 and 3.7.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.740   3.210   3.310   3.311   3.400   4.010

The tenth graph is sulphates. Its distribution is slightly right skewed, apart from the outliers.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.5500  0.6200  0.6581  0.7300  2.0000

The last graph shows the distribution of alcohol, which is right skewed.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.20   10.42   11.10   14.90
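The skew statements in this section can be cross-checked against the printed summaries. The sketch below copies the quartiles, medians and means from the outputs above and computes (mean - median) / IQR for each variable. This indicator is a crude heuristic chosen here for illustration, not a measure used in the original analysis. Positive values point to a right skew.

```r
# Quartiles, medians and means copied from the summary() outputs in this section.
s <- data.frame(
  variable = c("fixed.acidity", "volatile.acidity", "citric.acid",
               "residual.sugar", "chlorides", "free.sulfur.dioxide",
               "total.sulfur.dioxide", "density", "pH", "sulphates", "alcohol"),
  q1     = c(7.10, 0.39, 0.09, 1.9, 0.070, 7, 22, 0.9956, 3.21, 0.55, 9.5),
  median = c(7.90, 0.52, 0.26, 2.2, 0.079, 14, 38, 0.9968, 3.31, 0.62, 10.2),
  mean   = c(8.32, 0.5278, 0.271, 2.539, 0.08747, 15.87, 46.47, 0.9967,
             3.311, 0.6581, 10.42),
  q3     = c(9.20, 0.64, 0.42, 2.6, 0.090, 21, 62, 0.9978, 3.40, 0.73, 11.1)
)

# Crude skew indicator (an assumption of this sketch): mean-median gap over IQR.
s$gap <- (s$mean - s$median) / (s$q3 - s$q1)

# Largest positive gap first.
s <- s[order(-s$gap), ]
print(s[, c("variable", "gap")], row.names = FALSE)
```

Because the indicator only sees the quartiles and the mean, it can miss a long tail that the histogram shows, so it complements the plots rather than replacing them.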

Besides the original attributes, I create several new attributes based on the original dataset. One is called freeSO2Ratio, which is the ratio of free sulfur dioxide to total sulfur dioxide.
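A minimal sketch of how such derived attributes could be computed is shown below, using the first ten rows from the `str()` output. Only the name freeSO2Ratio comes from the report. The other two columns mirror attributes used later (bound sulfur dioxide and total acid), but their exact definitions are not given, so the column names `boundSO2` and `totalAcid` and their formulas are assumptions.

```r
# First ten rows of the relevant columns, taken from the str() output.
wine <- data.frame(
  fixed.acidity        = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  volatile.acidity     = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  citric.acid          = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  free.sulfur.dioxide  = c(11, 25, 15, 17, 11, 13, 15, 15, 9, 17),
  total.sulfur.dioxide = c(34, 67, 54, 60, 34, 40, 59, 21, 18, 102)
)

# Share of total SO2 that is free.
wine$freeSO2Ratio <- wine$free.sulfur.dioxide / wine$total.sulfur.dioxide

# Assumed definition: bound SO2 is whatever part of the total is not free.
wine$boundSO2 <- wine$total.sulfur.dioxide - wine$free.sulfur.dioxide

# Assumed definition: total acid is the sum of the three acidity columns.
wine$totalAcid <- wine$fixed.acidity + wine$volatile.acidity + wine$citric.acid

print(head(wine[, c("freeSO2Ratio", "boundSO2", "totalAcid")], 3))
```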

Univariate Analysis

What is the structure of your dataset?

The original dataset consists of 1599 observations of 11 different attributes plus the quality rating. Eight of the attributes are measured in their own individual units and the remaining three are percentages.

What is/are the main feature(s) of interest in your dataset?

According to the description, several attributes directly affect the quality and flavor of the wine. For example, volatile acidity causes an unpleasant, vinegar taste, and citric acid can add ‘freshness’ and flavor to wine.

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

Other physical properties such as pH or density might lead to some surprising findings that can improve the quality of wine.

Did you create any new variables from existing variables in the dataset?

I created some new variables, such as the ratio of free SO2 (FreeSO2Ratio) and the sum of the acidity attributes, to see whether they reveal anything interesting.

Bivariate Plots Section

In the bivariate plots part, we explore the relationship between two different attributes, or between an attribute and the response. To begin, it is important to know the correlation between each pair of variables.

From the plot matrix, we can find the four correlations with the largest absolute values: pH to fixed acidity, citric acid to fixed acidity, total sulfur dioxide to free sulfur dioxide, and density to fixed acidity. The relationships among the acids are discussed later, so let us plot the other three pairs. Free sulfur dioxide to total sulfur dioxide and fixed acidity to density are positively correlated, while fixed acidity to pH is negatively correlated. It is easy to understand why free sulfur dioxide is positively related to the total, because total SO2 is composed of free and bound SO2. We are therefore more interested in the remaining two graphs.
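Ranking pairs by absolute correlation can also be done numerically rather than by reading the plot matrix. The sketch below is one way to do it, run on the first ten rows from the `str()` output, so its ranking will not match the full dataset. Density is left out because the `str()` output truncates its values. The names `pair_table` and `top` are assumptions of this sketch.

```r
# First ten rows of five columns, taken from the str() output.
wine <- data.frame(
  fixed.acidity        = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  citric.acid          = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  free.sulfur.dioxide  = c(11, 25, 15, 17, 11, 13, 15, 15, 9, 17),
  total.sulfur.dioxide = c(34, 67, 54, 60, 34, 40, 59, 21, 18, 102),
  pH                   = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35)
)

# Pearson correlation matrix of all columns.
r <- cor(wine)

# Keep each unordered pair once by taking the upper triangle.
idx <- which(upper.tri(r), arr.ind = TRUE)
pair_table <- data.frame(
  var1 = rownames(r)[idx[, "row"]],
  var2 = colnames(r)[idx[, "col"]],
  r    = r[idx],
  row.names = NULL
)

# Sort by absolute correlation and keep the four strongest pairs.
top <- head(pair_table[order(-abs(pair_table$r)), ], 4)
print(top)
```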

This graph shows the scatterplot of pH against fixed acidity. We added an auxiliary line showing the mean fixed acidity at each pH value. At low pH values, a few outliers make the line look irregular. However, the line clearly shows a negative relationship for pH between about 3.1 and 3.5.
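The auxiliary line is a conditional mean: for each pH value, the average fixed acidity of the wines with that pH. The sketch below computes it with base R on the first ten rows from the `str()` output. It also counts the wines behind each point, which shows why the line can look irregular where only one or two wines share a pH value. The variable names are assumptions.

```r
# First ten pH and fixed.acidity values, taken from the str() output.
ph    <- c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35)
fixed <- c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5)

# Mean fixed acidity at each distinct pH value (the points of the line).
mean_by_ph <- tapply(fixed, ph, mean)

# Number of wines behind each point; a single wine makes the mean noisy.
n_by_ph <- table(ph)

print(data.frame(
  pH = as.numeric(names(mean_by_ph)),
  mean_fixed_acidity = as.vector(mean_by_ph),
  n = as.vector(n_by_ph)
))
```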

This graph shows the scatterplot of density against fixed acidity. As in the previous graph, we added an auxiliary line showing the corresponding mean value. Although the correlation between the two attributes is high, we cannot find a direct correspondence between the two variables in this graph, and the auxiliary line makes the graph messier.

Some attributes also show an apparent relationship with quality, which we can check in the plot matrix above. We took four attributes, fixed acidity, volatile acidity, pH and alcohol, and looked at their boxplots.

Let us explore these four boxplots. The first one, quality to fixed acidity, shows that, except for quality level 3, the IQR becomes larger up to quality level 8. In the second graph, quality to volatile acidity, the median volatile acidity becomes lower and lower as the quality level increases. The third graph, quality to pH, shows that pH decreases overall as the quality level rises. In the final one, quality to alcohol, the median stays in the same range for quality levels 3 to 5 but rises sharply from level 6 to 8. These four boxplots suggest that the attributes mentioned above have relatively more effect on quality than the other attributes.
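The medians and IQRs read off the boxplots can also be computed directly per quality level. The sketch below does this for alcohol on the first ten rows from the `str()` output. The quality values 5, 5, 5, 6, 5, 5, 5, 7, 7, 5 are decoded from the factor codes shown there. With so few rows the numbers only illustrate the mechanics.

```r
# First ten alcohol values and decoded quality levels from the str() output.
alcohol <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
quality <- factor(c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5))

# Median and interquartile range of alcohol within each quality level.
med <- tapply(alcohol, quality, median)
iqr <- tapply(alcohol, quality, IQR)
print(rbind(median = med, IQR = iqr))

# The same five-number summaries a boxplot would draw, without plotting.
b <- boxplot(alcohol ~ quality, plot = FALSE)
print(b$stats)
```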

In the following part, we explore the relationships around citric acid, which provides freshness to wine. The graph below shows that the pH value becomes lower as citric acid increases, which indicates a negative correlation.

The next graph shows the relationship between citric acid and quality. Lower-quality wines tend to have less citric acid, but the pattern is not obvious among higher-quality wines.

In the following bivariate analysis, we explore some similar attributes. The graphs below compare the three different acidity attributes. Citric acid is positively correlated with fixed acidity but negatively correlated with volatile acidity, and there is no correlation between volatile acidity and fixed acidity.

The graphs below show the correlation between free sulfur dioxide, bound sulfur dioxide, total sulfur dioxide and sulphates. There do not seem to be many relationships among these four attributes.

At the end of the bivariate analysis part, we make boxplots of the attributes I created, bound sulfur dioxide, total acid and free SO2 ratio, to see how they are related to quality. These plots let me check my assumption about them. Unfortunately, these attributes do not show useful clues.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

In this bivariate analysis, we can confirm that some reasonable inferences still make sense in this dataset: more acid leads to a lower pH value, and more free SO2 is positively related to the total amount of SO2. Among the attributes compared against quality, fixed acidity, volatile acidity, pH and alcohol affect the quality level to some degree.

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

The relationships between pH and alcohol and between density and fixed acidity were surprising to me. Comparing similar variables (SO2, acid) with each other also gave interesting results: fixed acidity and citric acid are positively related, but many of the other pairs are not related.

What was the strongest relationship you found?

As I said at the beginning of the bivariate part, the four correlations with the largest absolute values are pH to fixed acidity, citric acid to fixed acidity, total sulfur dioxide to free sulfur dioxide, and density to fixed acidity. These are the four strongest relationships in the original dataset.

Multivariate Plots Section

We now combine the thoughts and observations above to plot the following multivariate graphs. In the last section we found four pairs of attributes with strong correlation.

The first graph shows pH against fixed acidity, with points colored by quality. Wines of quality level 5 tend toward lower pH and relatively higher fixed acidity.

We also draw a line plot of the mean fixed acidity. Interestingly, in quality level 8 the fixed acidity drops sharply from 8 to under 6 at a pH of around 3.5. Furthermore, quality level 8 shows a stronger negative correlation than the other levels.
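A line plot like this needs one mean per combination of pH and quality level. The sketch below builds that table with `aggregate()` on the first ten rows from the `str()` output, with quality decoded from the factor codes shown there. The names `wine` and `line_data` are assumptions, and the result only illustrates the shape of the data, not the full-dataset values.

```r
# First ten rows of the three columns involved, taken from the str() output.
wine <- data.frame(
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  pH            = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35),
  quality       = factor(c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5))
)

# Mean fixed acidity for every (pH, quality) combination present in the data.
line_data <- aggregate(fixed.acidity ~ pH + quality, data = wine, FUN = mean)

# Order so that each quality level forms one line running left to right.
line_data <- line_data[order(line_data$quality, line_data$pH), ]
print(line_data)
```

Each quality level then becomes its own line when this table is plotted with color mapped to `quality`.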

In the graph of free sulfur dioxide against total sulfur dioxide, although the two attributes are strongly correlated, coloring each point by quality does not reveal any new finding.

In the line plot of free sulfur dioxide against mean total sulfur dioxide, we can find two interesting points. First, in quality level 6, total sulfur dioxide increases steadily as free sulfur dioxide rises from near 50 to the end of the plot. Second, there is one point where the line shoots up and then suddenly drops when free sulfur dioxide is near 35.

In the scatter plot of fixed acidity against density, most quality level 5 wines are centered around (x, y) = (7, 0.996), but it is relatively hard to see a pattern for the other quality levels in this plot. Let us check with a line plot.

In the line plot with fixed acidity on the x-axis and median density on the y-axis, we can see that in quality level 3 the density does not go upward when fixed acidity goes from near 8 to 10. The other quality levels all show a similar upward trend.

This graph is the scatterplot of citric acid against fixed acidity. Most quality level 5 and 6 wines show a positive relationship and are scattered along citric acid values from 0 to 0.50, but more quality level 6 wines than quality level 5 wines are located near citric acid values of 0.4 to 0.6. Furthermore, wines of quality level 8 are often located at citric acid around 0.50.

The scatter plot of bound sulfur dioxide against total sulfur dioxide shows that, at every quality level, bound sulfur dioxide is positively related to total sulfur dioxide, with slight differences between levels. Most quality level 6 wines are located at bound sulfur dioxide around 30.

The plot below shows citric acid against the median of fixed acidity. For quality levels 5 and 6, the values on both axes increase with low variation until citric acid is close to 0.38.

We also look at the graph of pH against citric acid. Most quality level 5 wines are distributed at lower citric acid values than quality level 6 wines.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

Overall, the attribute pairs with strong correlation did not perform as well as I expected once quality was added as a third variable. The multivariate analysis shows that there is not much strong segmentation by quality. However, there are still useful distributions and relationships in quality levels 5 and 6.


Final Plots and Summary

I now conclude the exploration with three graphs.

Plot One

Description One

The first graph is bound SO2 against total SO2, colored by quality. We can observe a strong correlation between the two variables, with different characteristics in quality levels 5 and 6. Wines of quality level 5 are distributed along the regression line, but most wines of quality level 6 are located around the points where bound SO2 equals 30. This suggests that most quality level 6 wines tend to have lower bound SO2.

Plot Two

Description Two

The second graph is the scatterplot of citric acid against fixed acidity. Most quality level 5 and 6 wines show a positive relationship and are scattered along citric acid values from 0 to 0.50, but more quality level 6 wines than quality level 5 wines are located near citric acid values of 0.4 to 0.6. Furthermore, wines of quality level 7 are often located at citric acid around 0.40, and those of quality level 8 near 0.50. However, wines of quality level 3 are divided into two clusters: one with citric acid around 0 and the other around 0.45. These observations suggest that a good winemaking process matters, because a wine with higher citric acid can still fail to reach high quality.

Plot Three

Description Three

The third graph is the line graph of fixed acidity against median density. Overall, density is strongly correlated with fixed acidity, as the upward trend shows. However, there are some interesting findings. Consider quality level 3 first: when fixed acidity goes from near 8 to 10, the density does not go upward, although outside that zone it still shows a clear increasing trend. Second, the line for quality level 8 is at the bottom most of the time, and the line for quality level 7 runs close to it. This indicates that, at a given fixed acidity, high-quality wines tend to have lower density than lower-quality wines.

Reflection

There are some reminders for the next time I explore a dataset. First, this project took much more time than I expected, because I had to become more familiar with ggplot syntax, and coming up with useful clues to build a good inference is difficult, not as easy as I thought. Second, I think the analysis could go further by applying machine learning techniques such as K-means clustering or classification. The result of the exploration might be a big clue for a winery trying to make high quality wine. Third, the number of wines at each quality level is uneven, so the information and inferences about the less common quality levels are not very accurate. This matters especially because a winery wants to know more about how to make high quality wine than a mediocre one.