Li Zeng 曾黎
## [1] "X" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality"
The white wine file contains 4898 sample records and 13 variables:
X: ID
fixed.acidity: Mainly Malic Acid
volatile.acidity: Acetic acid produced by bacteria during fermentation, with a characteristic odor
citric.acid: one of the fixed acids
residual.sugar: Composed of fructose and glucose, the natural sugars in grapes. Most of the sugar is converted to alcohol during fermentation, and the sugar that is not converted is the residual sugar content
chlorides: Chloride content makes the wine a little salty
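free.sulfur.dioxide: free SO2 (FSO2), the portion of the SO2 in the wine that is not bound
total.sulfur.dioxide: total SO2 (TSO2), the free SO2 plus the bound SO2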
density: The higher the ethanol content in the wine, the lower the density,
but the density does not change much.
sulphates: The sulfate produced by bacteria during the fermentation process.
Generally, white wine has a higher sulfate content than red wine.
alcohol: the alcohol content of the wine
quality: comprehensive score of the wine
All variables are numeric; only quality can also be treated as categorical data. Density is strongly affected by acidity, alcohol, sugar, and SO2 content, and the relationship between density and these measurements is explored in the following sections. If density is treated as a physical indicator, does it indirectly reflect the wine scores? Among people who are not professional tasters, a sweet-and-sour taste is popular; does this sweet-and-sour profile lead the judges to award more points? If not, which factors affect the score?
Univariate Plots Section
The distribution of the quality column is shown below. Most of the scores fall between 5 and 8. Only 5 of the 4898 samples were rated 9 points, and 20 samples received the lowest score, 3 points.
##
## 3 4 5 6 7 8 9
## 20 163 1457 2198 880 175 5
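The counts in the table can be checked with a few lines of arithmetic. The following is a minimal Python sketch, not the code used to produce the report; the dictionary simply copies the table above, and the helper name `share` is made up for illustration.

```python
# Counts copied from the quality table above (score -> number of samples).
quality_counts = {3: 20, 4: 163, 5: 1457, 6: 2198, 7: 880, 8: 175, 9: 5}

# The counts should add up to the number of sample records.
total = sum(quality_counts.values())

def share(scores):
    """Fraction of all samples whose quality score is in `scores`."""
    return sum(quality_counts[s] for s in scores) / total

print(total)                          # total number of samples
print(round(share(range(5, 9)), 3))   # share of samples scored 5 to 8
```

The share for scores 5 to 8 puts a number on the statement that most of the scores fall in that range.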

Based on my own experience, the taste of white wine is affected by the content of the following substances, which define the wine at the chemical level: residual.sugar, alcohol, fixed.acidity, volatile.acidity, citric.acid, and total.sulfur.dioxide. In turn, these substances determine the density and pH of the wine. How do they affect the wine’s score?
Second, does the SO2 added during winemaking as an antiseptic and antioxidant affect the score? Let’s first look at the distribution of each chemical component in the wine. Are there any outliers?
After excluding the outliers and checking whether the data are normally distributed, let’s explore whether the questions above are consistent with my experience.
The residual sugar content (residual.sugar) has a maximum of 65, a minimum of 0, a median of 5.2, and a mean of 6.3. Let’s plot a graph to show the distribution.

Figure 1 shows that when the X axis covers the full range of residual sugar values, the data are not normally distributed and tail off gradually to the right (right-skewed). Limiting the X axis to 0-30 enlarges the main data area so it can be seen more clearly. After a log10 transformation, the residual sugar shows a bimodal distribution. Further analysis will be performed.
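To illustrate the log10 transformation, here is a minimal Python sketch that counts values per log10-scaled bin, which is what a histogram on a log10 axis does. The function name `log10_histogram`, the bin width, and the sample readings are all hypothetical and are not taken from the dataset.

```python
import math
from collections import Counter

def log10_histogram(values, bin_width=0.1):
    """Count positive values per log10 bin; keys are the lower bin edges."""
    counts = Counter()
    for v in values:
        if v <= 0:
            continue  # log10 is undefined for zero or negative values
        edge = math.floor(math.log10(v) / bin_width) * bin_width
        counts[round(edge, 10)] += 1  # round to avoid float noise in keys
    return dict(sorted(counts.items()))

# Hypothetical residual sugar readings, for illustration only.
sample = [1.2, 1.5, 1.6, 8.0, 9.5, 12.0, 14.0, 65.0]
print(log10_histogram(sample, bin_width=0.5))
```

On a log10 scale the long right tail is compressed, so separate clusters of low and high sugar values appear as distinct peaks instead of one skewed mass.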

The second parameter worth noting is the alcohol content. The maximum alcohol value is 14.2, the minimum is 8, and the mean is 10.51. White wines above 13 degrees are a minority; most white wines fall between 9 and 11 degrees.

Next, explore the content of acetic acid, malic acid, and citric acid, which are either originally present or produced by bacterial metabolism during wine fermentation. As the figure shows, the three acidity variables are roughly normally distributed, and most of the values are concentrated around the mean.

SO2 is added during the brewing process. Because fermentation is a biochemical reaction, the free SO2 (FSO2) and total SO2 (TSO2) contents in the grape juice differ. When the TSO2 level falls, SO2 needs to be added to prevent the wine from spoiling completely. A very small ratio of FSO2 to TSO2 usually indicates that the wine is chemically and/or microbially unstable. TSO2 can act as a “buffer” for FSO2: when free SO2 is lost, the chemical equilibrium in the wine may shift, so some bound SO2 may be released to its free state. The more TSO2 in the wine, the more stable the FSO2.

The correlation coefficient shows that the more total SO2, the higher the free SO2 content.
## [1] "C"
##
## Pearson's product-moment correlation
##
## data: x and y
## t = 54.645, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.5977994 0.6326026
## sample estimates:
## cor
## 0.615501
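The test statistic and confidence interval in the output above can be reproduced from the correlation coefficient and the sample size alone. The sketch below is a minimal Python illustration, assuming the standard formulas for a Pearson test: t = r * sqrt(df / (1 - r^2)) with df = n - 2, and a Fisher z-transform interval. The function name `pearson_t_and_ci` is hypothetical.

```python
import math

def pearson_t_and_ci(r, n, z_crit=1.959964):
    """Return (t, lower, upper) for a Pearson r computed from n pairs.

    z_crit is the two-sided 95% normal quantile.
    """
    df = n - 2
    t = r * math.sqrt(df / (1.0 - r * r))
    # Fisher z-transform: atanh(r) is roughly normal with SE 1/sqrt(n - 3).
    z = math.atanh(r)
    half_width = z_crit / math.sqrt(n - 3)
    return t, math.tanh(z - half_width), math.tanh(z + half_width)

# r and n taken from the output above (df = 4896 means n = 4898).
t, lower, upper = pearson_t_and_ci(0.615501, 4898)
print(round(t, 3), round(lower, 4), round(upper, 4))
```

With 4898 samples the interval is narrow, so the moderate positive correlation between free and total SO2 is estimated quite precisely.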

The proportion of free SO2 in total SO2 has little to do with the score: different levels of this ratio appear in every quality category.

The density of most samples is lower than 1, which may be due to the presence of alcohol.
The distribution of pH is shown in the figure. The pH of the samples is concentrated between 3.0 and 3.4. pH is one of the main indicators of how smoothly the fermentation process is going.

Univariate Analysis
What is the structure of your dataset?
This dataset contains 4898 observations of 12 characteristics (fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol, and quality). Eleven of these are numeric variables that reflect the physical and chemical characteristics of the wine, and the last one (quality) is the wine score. Although quality is a numerical value, we treat the score as a categorical variable that measures quality.
What is/are the main feature(s) of interest in your dataset?
Can the amounts of sugar, alcohol, and citric acid alone determine the score? These are the three main factors that affect our experience when tasting wine. Does the content of the preservative SO2 also affect the score? During the brewing process, the more SO2 is added to the wine, the more total SO2 the wine contains and the more stable the FSO2. Exploring the relationship between FSO2 and quality helps clarify this correlation.
What other features in the dataset do you think will help support your investigation into your feature(s) of interest?
pH value can be used as an indicator to monitor the brewing environment
Did you create any new variables from existing variables in the dataset?
A new variable, so2.rate, was created: the ratio of FSO2 to TSO2. It is used to check whether this ratio is positively correlated with quality.
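As a minimal Python sketch of the derived variable, the ratio can be computed row by row as below. The function name `so2_rate`, the handling of a non-positive total, and the example rows are assumptions for illustration; the report does not show how so2.rate was actually computed.

```python
def so2_rate(free_so2, total_so2):
    """Ratio of free to total SO2; None when total SO2 is not positive."""
    if total_so2 <= 0:
        return None  # assumption: skip rows where the ratio is undefined
    return free_so2 / total_so2

# Illustrative rows: (free.sulfur.dioxide, total.sulfur.dioxide)
rows = [(45.0, 170.0), (14.0, 132.0), (30.0, 97.0)]
rates = [so2_rate(free, total) for free, total in rows]
print([round(r, 3) for r in rates])
```

Because free SO2 is part of total SO2, the ratio should lie between 0 and 1, which makes it comparable across wines with very different absolute SO2 levels.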
Bivariate Plots Section
Before discussing the correlation between pairs of measured variables, let’s look at the correlation coefficients.
##
## CORRELATIONS
## ============
## - correlation type: pearson
## - correlations shown only when both variables are numeric
##
## X fixed.acidity volatile.acidity citric.acid
## X . -0.256 0.003 -0.150
## fixed.acidity -0.256 . -0.023 0.289
## volatile.acidity 0.003 -0.023 . -0.149
## citric.acid -0.150 0.289 -0.149 .
## residual.sugar 0.007 0.089 0.064 0.094
## chlorides -0.046 0.023 0.071 0.114
## free.sulfur.dioxide -0.012 -0.049 -0.097 0.094
## total.sulfur.dioxide -0.162 0.091 0.089 0.121
## density -0.186 0.265 0.027 0.150
## pH -0.116 -0.426 -0.032 -0.164
## sulphates 0.010 -0.017 -0.036 0.062
## alcohol 0.214 -0.121 0.068 -0.076
## quality 0.036 -0.114 -0.195 -0.009
## so2.rate 0.125 -0.139 -0.196 0.016
## residual.sugar chlorides free.sulfur.dioxide
## X 0.007 -0.046 -0.012
## fixed.acidity 0.089 0.023 -0.049
## volatile.acidity 0.064 0.071 -0.097
## citric.acid 0.094 0.114 0.094
## residual.sugar . 0.089 0.299
## chlorides 0.089 . 0.101
## free.sulfur.dioxide 0.299 0.101 .
## total.sulfur.dioxide 0.401 0.199 0.616
## density 0.839 0.257 0.294
## pH -0.194 -0.090 -0.001
## sulphates -0.027 0.017 0.059
## alcohol -0.451 -0.360 -0.250
## quality -0.098 -0.210 0.008
## so2.rate 0.051 -0.033 0.739
## total.sulfur.dioxide density pH sulphates alcohol
## X -0.162 -0.186 -0.116 0.010 0.214
## fixed.acidity 0.091 0.265 -0.426 -0.017 -0.121
## volatile.acidity 0.089 0.027 -0.032 -0.036 0.068
## citric.acid 0.121 0.150 -0.164 0.062 -0.076
## residual.sugar 0.401 0.839 -0.194 -0.027 -0.451
## chlorides 0.199 0.257 -0.090 0.017 -0.360
## free.sulfur.dioxide 0.616 0.294 -0.001 0.059 -0.250
## total.sulfur.dioxide . 0.530 0.002 0.135 -0.449
## density 0.530 . -0.094 0.074 -0.780
## pH 0.002 -0.094 . 0.156 0.121
## sulphates 0.135 0.074 0.156 . -0.017
## alcohol -0.449 -0.780 0.121 -0.017 .
## quality -0.175 -0.307 0.099 0.054 0.436
## so2.rate -0.013 -0.066 0.001 -0.022 0.064
## quality so2.rate
## X 0.036 0.125
## fixed.acidity -0.114 -0.139
## volatile.acidity -0.195 -0.196
## citric.acid -0.009 0.016
## residual.sugar -0.098 0.051
## chlorides -0.210 -0.033
## free.sulfur.dioxide 0.008 0.739
## total.sulfur.dioxide -0.175 -0.013
## density -0.307 -0.066
## pH 0.099 0.001
## sulphates 0.054 -0.022
## alcohol 0.436 0.064
## quality . 0.197
## so2.rate 0.197 .
From the correlation coefficients above, the following are worth exploring further:
- Total SO2 with density (0.530) and with residual sugar (0.401)
- Alcohol content with chlorides (-0.360), residual sugar (-0.451), and total SO2 (-0.449), all with magnitudes greater than 0.2; the coefficient with fixed acidity (-0.121) is weaker
- Quality with magnitude < 0.2: volatile acidity (-0.195), total SO2 (-0.175)
- Quality with magnitude > 0.2: density (-0.307), alcohol (0.436), chlorides (-0.210)
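Picking out the coefficients above a threshold does not have to be done by eye. The following is a minimal Python sketch, assuming the quality row of the matrix has been copied into a dictionary (the X and so2.rate columns are left out); the function name `strong_pairs` and the 0.2 default are illustrative choices.

```python
# Correlations with quality, copied from the matrix above.
quality_corr = {
    "fixed.acidity": -0.114, "volatile.acidity": -0.195,
    "citric.acid": -0.009, "residual.sugar": -0.098,
    "chlorides": -0.210, "free.sulfur.dioxide": 0.008,
    "total.sulfur.dioxide": -0.175, "density": -0.307,
    "pH": 0.099, "sulphates": 0.054, "alcohol": 0.436,
}

def strong_pairs(corr, threshold=0.2):
    """Variables whose |r| exceeds the threshold, strongest first."""
    kept = [(name, r) for name, r in corr.items() if abs(r) > threshold]
    return sorted(kept, key=lambda item: abs(item[1]), reverse=True)

for name, r in strong_pairs(quality_corr):
    print(f"{name}: {r:+.3f}")
```

Filtering on the absolute value keeps negative correlations such as density, which would be missed by a plain greater-than comparison.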
The correlation between density and both residual sugar and alcohol is relatively easy to understand: when the sugar content is high, the density increases, and because alcohol is less dense than water, more alcohol lowers the density. The correlation with pH is low. The key point for us to explore is the relationship of density, alcohol, and chlorides with the quality score.

The figure above shows that the higher the total SO2 content, the greater the density: the two are positively correlated.

Samples with high total SO2 content tend to have higher residual sugar content

The correlation between chlorides and alcohol content is weak, and the chloride content does not change much across samples with different alcohol levels.

The link between alcohol and fixed acidity is also weak

In the samples with high alcohol content, the amount of residual sugar is generally low, and the samples with more complete fermentation have higher alcohol content and low amount of residual sugar.

In samples with high alcohol content, the total SO2 content is low. In samples where fermentation goes smoothly, SO2 is not added again to adjust the fermentation reaction. This is consistent with the control logic of SO2 during brewing.
Bivariate Analysis
Based on general knowledge of biochemistry, the correlation between some parameters can be anticipated; for example, sugar content and alcohol content both affect the density. I also wanted to understand how the content of each chemical component in the wine correlates with density, and whether there are more correlations than expected. After listing the correlation coefficient matrix, I plotted the pairs with coefficients above 0.2 to see whether the coefficients and the plots agree. During this exploration, we found that density is correlated with both the amount of residual sugar and total SO2.
The correlation coefficient matrix shows that quality has a stronger correlation with the following factors: density, alcohol, chlorides, and total SO2 content. These are the focus of further exploration.
The strongest correlation is between the density and the amount of residual sugar, with a coefficient of 0.839
Multivariate Plots Section

As we expected, the higher the sugar content of the sample, the higher the density. For the higher-scoring samples (8 and 9 points), the density tended to be lower. Among samples with the same sugar content, those with lower density score relatively higher.

As expected, alcohol content is negatively correlated with the density of the wine. During fermentation, sugar is slowly converted to alcohol, so the sugar decreases, the alcohol increases, and the overall density decreases. In addition, high alcohol content indicates that the fermentation went well, which increases the chance of a high score. There are more blue or blue-green data points in the high-alcohol region of the scatter plot.

Samples with high alcohol content, such as those above 10.5%, have a slightly lower residual sugar content, below 9 g/L. There are more high-scoring samples in this range (more blue points).

Overall, samples with low total SO2 content and higher alcohol content tend to have a higher score.

The figure above shows that when the chloride content is low and the alcohol content is high, the sample score will be higher

There is no obvious difference in fixed acidity among samples with different alcohol content, and no obvious trend in the score across samples with different acidity.
Multivariate Analysis
Fixed acidity, malic acid, citric acid, and residual sugar have no positive effect on the score. High alcohol samples have a tendency to get high scores.
Under the influence of alcohol, the wine tastes more full-bodied, while the amount of residual sugar does not help raise the score. With too much sugar, other flavors tend to be overpowered by the sweetness.
Final Plots and Summary
Plot One

Description One
The acetic acid, malic acid, citric acid, consumed sugar, and alcohol produced in the winemaking process, together with the SO2 used for sterilization and preservation, are all reflected in one physical parameter: density. Taking an alcohol content of roughly 10.5-11 as a reference, we can observe that low-alcohol samples have higher densities, and their scores are mainly shown in crimson, indicating a wine quality of 3-4. In samples with an alcohol content of 11 or higher, the density shows a decreasing trend, with blue and blue-green points representing scores of 7, 8, and 9. Green data points make up most of the samples, and the high-scoring samples are concentrated in the area with low density and high alcohol content. Alcohol content and wine density are negatively correlated. During fermentation, sugar is slowly converted to alcohol, so the sugar content falls and the alcohol content rises, which further decreases the density. At the same time, high alcohol content indicates that the fermentation went well, and the probability of a high score increases.
Plot Two

Description Two
This plot examines how the relationship between the SO2 added during brewing and the alcohol content affects the score. In the chart, we can see that as the alcohol content increases, the number of high-scoring samples increases and the total SO2 decreases. To maintain freshness and stop oxidation, the level of added SO2 needs to be monitored. When fermentation goes smoothly, a certain amount of SO2 is consumed and kept in balance with free SO2. When fermentation does not go smoothly, SO2 needs to be added to maintain a normal pH and prevent the brew from failing. Samples with a lower total SO2 content tend to have a higher overall score, and the total SO2 content is generally low in samples with an alcohol content above 11%.
Plot Three

Description Three
In the earlier data exploration, the residual sugar data showed a right-skewed distribution. For the linear regression analysis, the data were transformed with the log10 function, which made the correlation with alcohol content easier to see. Across all scored samples, the higher the score, the lower the residual sugar. We can also observe that as alcohol content increases, the number of samples with a high residual sugar amount decreases.
Reflection
The overall process is to understand the data types and distributions by exploring each variable on its own, and then to explore the correlation between pairs of variables. In the next step, we show which factors affect the score. It is a gradual process. At each step, adjusting the plot parameters and the displayed data range makes the image clearer and easier to discuss.
Exploring this data requires a general understanding of winemaking, such as why the SO2 in the sample needs to be in balance and how it affects the density, pH, alcohol content, and fermentation of the wine. When a plot is not consistent with expectations, this background provides clues for finding the cause of the inconsistency.
Over the course of the data exploration, the influence of alcohol content on the score became more and more obvious. Alcohol is the end product of fermentation. When fermentation goes smoothly, the amount of sugar from the grapes gradually decreases, the density decreases, the alcohol content increases, and the organic substances in the alcohol affect the taste and body of the wine. Avoiding excessive SO2 also prevents a biochemical imbalance, so the total SO2 content remains low.
Difficulties encountered: The list of correlation coefficients was long, which made it hard to decide which pairs of variables to use as a starting point. I therefore learned some basic knowledge of brewing and wine tasting and related it to the data in the correlation list. For example, the maximum correlation coefficient is 0.839 and several coefficients are around 0.5, so we decided to visually inspect the pairs with coefficients above 0.2 in the next stage of exploration. Because it would have taken longer to work out how to construct a graph that filters the coefficients, I chose to understand the background first and manually pick the pairs of variables to explore.
Lesson learned: When drawing multivariate plots, I found that with the original quality classification the colors are hard to tell apart, because there are few samples with scores of 8 or 9 and quality in that range gets blurred out. I combined the samples with quality 8 and 9 and reduced the original 7 levels to 4 levels, which made the groups much easier to distinguish.
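The regrouping of quality levels can be sketched as follows. This is a minimal Python illustration under a stated assumption: the report only says that scores 8 and 9 were merged and that 7 levels became 4, so the other cut points used here (3-4, 5-6, 7) are hypothetical, as is the function name `quality_group`. The counts come from the quality table in the univariate section.

```python
# Counts per original quality score, from the quality table in the report.
quality_counts = {3: 20, 4: 163, 5: 1457, 6: 2198, 7: 880, 8: 175, 9: 5}

def quality_group(score):
    """Map a 3-9 score to one of 4 groups (cut points are an assumption)."""
    if score <= 4:
        return "3-4"
    if score <= 6:
        return "5-6"
    if score == 7:
        return "7"
    return "8-9"  # 8 and 9 merged, as described in the text

# Sum the sample counts that fall into each group.
grouped = {}
for score, n in quality_counts.items():
    label = quality_group(score)
    grouped[label] = grouped.get(label, 0) + n
print(grouped)
```

Merging sparse levels gives each color in a plot enough points to be visible, at the cost of no longer separating the rare 9-point wines from the 8-point ones.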