by Js Lims
December 26 2016
The purpose of this project is to use exploratory data analysis (EDA) techniques to examine distributions, outliers, relationships, and any other surprises in the data, moving from one variable to multiple variables. The goal is to find the variables that most influence the quality of red wine. The analysis is written in R.
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
## [1] 1599 13
##        X          fixed.acidity   volatile.acidity  citric.acid
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000
##  residual.sugar     chlorides        free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00
##  Median : 2.200   Median :0.07900   Median :14.00
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00
##  total.sulfur.dioxide    density            pH           sulphates
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000
##     alcohol         quality
##  Min.   : 8.40   Min.   :3.000
##  1st Qu.: 9.50   1st Qu.:5.000
##  Median :10.20   Median :6.000
##  Mean   :10.42   Mean   :5.636
##  3rd Qu.:11.10   3rd Qu.:6.000
##  Max.   :14.90   Max.   :8.000
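The output above is what R's standard inspection calls produce. The following is a minimal sketch rather than the original code: it assumes the data lives in a data frame called `wine`, and to stay runnable without the CSV file it types in only the first ten rows printed by `str()` above, for three of the columns.

```r
# First ten rows as printed by str() above (three columns only).
# With the full file, one would build `wine` with read.csv() instead.
wine <- data.frame(
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  alcohol       = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality       = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

str(wine)      # column types and first values
dim(wine)      # number of rows and columns
summary(wine)  # min, quartiles, median, mean and max for each column
```

On the full data set the same three calls would report 1599 rows and 13 columns, as shown above.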
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.60 7.10 7.90 8.32 9.20 15.90
As the plot above shows, fixed acidity is positively skewed: the mean (8.32) lies between the median (7.90) and the third quartile (9.20).
Volatile acidity reflects the condition of a wine. A moderate amount contributes to the wine's aroma, but too much can spoil it.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3900 0.5200 0.5278 0.6400 1.5800
The distribution of volatile acidity is close to normal, with a small tail on the right side of the plot. I wonder about the quality of the wines above the third quartile (0.64).
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.090 0.260 0.271 0.420 1.000
There are three peaks in this plot.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.900 2.200 2.539 2.600 15.500
The distribution is positively skewed, with a long tail on the right. 75% of wines have residual sugar at or below 2.6 g/dm^3.
After removing outliers, residual sugar looks approximately normally distributed.
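The report does not say which rule was used to remove outliers. As one possible approach, the sketch below assumes that everything above the 95th percentile counts as an outlier; the cutoff is an assumption, and the ten values are the residual sugar readings printed by `str()` earlier.

```r
# Residual sugar for the first ten wines, copied from the str() output.
residual.sugar <- c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2, 6.1)

# Assumed rule: treat everything above the 95th percentile as an outlier.
cutoff  <- quantile(residual.sugar, 0.95)
trimmed <- residual.sugar[residual.sugar <= cutoff]

summary(residual.sugar)  # before trimming
summary(trimmed)         # after trimming; the 6.1 reading is dropped
hist(trimmed, main = "Residual sugar (trimmed)", xlab = "g/dm^3")
```

A percentile cutoff is easy to explain and reproduce, but it always removes a fixed share of the data, so the threshold is worth reporting next to the plot.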
This distribution looks roughly normal, but it has a long tail on the right. Later I will look at how those outliers relate to wine quality.
After removing the outliers, the distribution looks normal.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 7.00 14.00 15.87 21.00 72.00
This plot is positively skewed. Sulfur dioxide is harmful to the human body, so I wonder how it affects the quality of wine.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 22.00 38.00 46.47 62.00 289.00
This plot is also positively skewed, with outliers near 300.
After removing the outliers and applying a log scale, the distribution looks normal.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9901 0.9956 0.9968 0.9967 0.9978 1.0040
This distribution is close to normal. The mean (0.9967) and median (0.9968) are nearly equal.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.740 3.210 3.310 3.311 3.400 4.010
This distribution is also approximately normal.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.5500 0.6200 0.6581 0.7300 2.0000
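The sulphates variable is right-skewed (positively skewed): the mean (0.6581) is above the median (0.6200), and the maximum reaches 2.0.
With a log scale on the x-axis, the distribution looks normal.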
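A log-scaled x-axis like the one described here can be sketched with ggplot2 as follows. This is an illustration under assumptions, not the original plotting code: it uses only the ten sulphates values printed by `str()` earlier, and the bin count is arbitrary.

```r
library(ggplot2)

# Sulphates for the first ten wines, copied from the str() output.
df <- data.frame(
  sulphates = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8)
)

# scale_x_log10() transforms the axis, so the bins are formed on the log scale.
p <- ggplot(df, aes(x = sulphates)) +
  geom_histogram(bins = 5) +
  scale_x_log10() +
  labs(x = "Sulphates (log10 scale)", y = "Count")
print(p)
```

The log transform needs strictly positive values, which holds for sulphates (minimum 0.33) but not for a variable such as citric acid, whose minimum is 0.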
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.20 10.42 11.10 14.90
The distribution is right-skewed: the mean (10.42) is above the median (10.20). 75% of wines have an alcohol content at or below 11.10%.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.636 6.000 8.000
I grouped the quality scores into a categorical level attribute.
Most wines fall in the middle quality level. The mean quality score is 5.636.
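The grouping can be sketched with `cut()`. The cut points below (3-4 low, 5-6 middle, 7-8 high) are an assumption, since the report names the levels but not their boundaries; the scores are the first ten quality values printed by `str()` earlier.

```r
# Quality scores of the first ten wines, copied from the str() output.
quality <- c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)

# Assumed cut points: (2,4] = low, (4,6] = middle, (6,8] = high.
level <- cut(quality,
             breaks = c(2, 4, 6, 8),
             labels = c("low", "middle", "high"),
             ordered_result = TRUE)

table(level)  # count of wines per level
```

Making the factor ordered keeps low < middle < high in plots and legends instead of alphabetical order.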
There are 1,599 observations and 13 attributes in this data set (X is only a row index). All variables are numeric except quality, which is categorical.
Quality is the main feature of interest. We need to figure out how the other variables affect it.
According to articles I have read about wine, flavor and scent are important to a wine's quality.
Many other factors probably affect them, and the balance among these factors would be important.
I think the following variables will support my investigation:
Total acidity, fixed acidity, citric acid, and alcohol.
Not yet.
Several variables have positively skewed distributions.
Since the data is already tidy, I did not need to reshape it.
Next, I'm going to check the relationships between features.
First, let's look at them with a pair plot.
The plot is drawn from a sample of 500 observations taken from the whole dataset.
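Drawing a 500-row sample before plotting keeps a pair plot readable and fast. A minimal sketch, assuming simple random sampling without replacement; the data frame `wine_sim` is simulated stand-in data (not the wine measurements), and base R's `pairs()` is used here in place of whatever plotting function the report used.

```r
set.seed(1)  # make the sample reproducible

# Simulated stand-in with the same number of rows as the wine data.
wine_sim <- data.frame(
  alcohol = rnorm(1599, mean = 10.4, sd = 1),
  pH      = rnorm(1599, mean = 3.3, sd = 0.15)
)
wine_sim$density <- 1.02 - 0.002 * wine_sim$alcohol + rnorm(1599, sd = 0.001)

# Pick 500 distinct row indices, then subset.
sample_rows <- sample(nrow(wine_sim), 500)
wine_sample <- wine_sim[sample_rows, ]

pairs(wine_sample)  # scatter plot matrix of every pair of columns
```

Fixing the seed matters here: without it, each knit of the report would draw a different sample and a slightly different plot.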
## [1] "fixed.acidity" "volatile.acidity" "citric.acid"
## [4] "residual.sugar" "chlorides" "free.sulfur.dioxide"
## [7] "total.sulfur.dioxide" "density" "pH"
## [10] "sulphates" "alcohol" "quality"
The pair plot suggests several relationships between variables.
Let's check them out.
##
## Pearson's product-moment correlation
##
## data: wine$fixed.acidity and wine$density
## t = 35.877, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6399847 0.6943302
## sample estimates:
## cor
## 0.6680473
##
## Pearson's product-moment correlation
##
## data: wine$fixed.acidity and wine$citric.acid
## t = 36.234, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6438839 0.6977493
## sample estimates:
## cor
## 0.6717034
##
## Pearson's product-moment correlation
##
## data: wine$fixed.acidity and wine$pH
## t = -37.366, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.7082857 -0.6559174
## sample estimates:
## cor
## -0.6829782
Fixed acidity is positively correlated with density (r = 0.668) and citric acid (r = 0.672), and negatively correlated with pH (r = -0.683).
##
## Pearson's product-moment correlation
##
## data: wine$volatile.acidity and wine$citric.acid
## t = -26.489, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.5856550 -0.5174902
## sample estimates:
## cor
## -0.5524957
Volatile acidity is negatively correlated with citric acid.
##
## Pearson's product-moment correlation
##
## data: wine$density and wine$alcohol
## t = -22.838, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.5322547 -0.4583061
## sample estimates:
## cor
## -0.4961798
Density is negatively correlated with alcohol (r = -0.496). This is expected: alcohol is less dense than water, so a higher alcohol content lowers the density of the wine.
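Output in the format shown above comes from `cor.test()`. The sketch below runs it on only the first ten rows printed by `str()` earlier, so its numbers will not match the full-data results; it is meant to show which parts of the result hold the estimate, the confidence interval and the p-value.

```r
# First ten wines, copied from the str() output.
fixed.acidity <- c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5)
pH            <- c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35)

# Pearson correlation with a two-sided test of "true correlation = 0".
result <- cor.test(fixed.acidity, pH, method = "pearson")

result$estimate   # sample correlation coefficient
result$conf.int   # 95 percent confidence interval
result$p.value    # p-value of the test
result$parameter  # degrees of freedom, n - 2
```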
There are two outliers on the right side of the plot, with no other data points near them. Let's remove them before fitting a linear regression model.
##
## Call:
## lm(formula = free.sulfur.dioxide ~ total.sulfur.dioxide, data = wine[idx,
## ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.600 -4.305 -1.693 3.605 34.972
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.656032 0.340589 16.61 <2e-16 ***
## total.sulfur.dioxide 0.220741 0.006074 36.34 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.723 on 1595 degrees of freedom
## Multiple R-squared: 0.453, Adjusted R-squared: 0.4526
## F-statistic: 1321 on 1 and 1595 DF, p-value: < 2.2e-16
##
## Pearson's product-moment correlation
##
## data: wine[idx, ]$total.sulfur.dioxide and wine[idx, ]$free.sulfur.dioxide
## t = 36.341, df = 1595, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6452693 0.6989950
## sample estimates:
## cor
## 0.673019
Total sulfur dioxide and free sulfur dioxide are positively correlated.
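The `wine[idx, ]` in the calls above suggests a logical index that excludes the two outliers. How `idx` was built is not shown, so the sketch below is an assumption: it uses simulated data with two extreme points appended and a hypothetical threshold of 250 on total sulfur dioxide.

```r
set.seed(2)  # simulated data, for illustration only

# 200 ordinary points plus two extreme values on the right.
total <- c(runif(200, min = 6, max = 160), 278, 289)
free  <- 5 + 0.2 * total + rnorm(length(total), sd = 5)
wine_sim <- data.frame(total.sulfur.dioxide = total,
                       free.sulfur.dioxide  = free)

# Assumed threshold: keep rows below 250, which drops only the two extremes.
idx <- wine_sim$total.sulfur.dioxide < 250

fit <- lm(free.sulfur.dioxide ~ total.sulfur.dioxide, data = wine_sim[idx, ])
summary(fit)

cor.test(wine_sim[idx, ]$total.sulfur.dioxide,
         wine_sim[idx, ]$free.sulfur.dioxide)
```

For a regression with a single predictor, the R-squared equals the squared Pearson correlation, which matches the output above (0.673^2 is about 0.453).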
The quality of wine is positively correlated with alcohol, citric acid and sulphates and negatively correlated with volatile acidity, pH and density.
This chart shows how strongly alcohol content affects the quality level.
Wines with higher alcohol content are more likely to be high-quality wines.
##
## Call:
## lm(formula = quality ~ alcohol + volatile.acidity + sulphates +
## citric.acid + density + pH, data = wine)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.64141 -0.38701 -0.06721 0.45480 2.11572
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -11.77058 11.92162 -0.987 0.323631
## alcohol 0.34190 0.01985 17.222 < 2e-16 ***
## volatile.acidity -1.32197 0.11597 -11.399 < 2e-16 ***
## sulphates 0.65627 0.10367 6.330 3.17e-10 ***
## citric.acid -0.37834 0.13479 -2.807 0.005064 **
## density 15.84518 11.88503 1.333 0.182655
## pH -0.47787 0.13381 -3.571 0.000366 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6563 on 1592 degrees of freedom
## Multiple R-squared: 0.3421, Adjusted R-squared: 0.3396
## F-statistic: 138 on 6 and 1592 DF, p-value: < 2.2e-16
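The linear model with six predictors explains 34.21% of the variability in quality. Density is not statistically significant (p = 0.183), so once the other predictors are accounted for there is little evidence that it is related to quality. Citric acid is significant at the 0.01 level (p = 0.005), but its coefficient is negative (-0.378), the opposite of its positive pairwise correlation with quality, which likely reflects its correlation with other predictors such as volatile acidity.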
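The R-squared and p-values discussed here can be pulled out of a fitted model programmatically rather than read off the printed table. A minimal sketch on simulated data (the data frame `sim` and its coefficients are made up for illustration; `density` is generated with no effect on quality):

```r
set.seed(3)  # simulated data, for illustration only
n <- 300
sim <- data.frame(
  alcohol          = rnorm(n, mean = 10.4, sd = 1),
  volatile.acidity = rnorm(n, mean = 0.53, sd = 0.18),
  density          = rnorm(n, mean = 0.9967, sd = 0.002)
)
sim$quality <- 2 + 0.35 * sim$alcohol - 1.3 * sim$volatile.acidity +
  rnorm(n, sd = 0.65)

fit <- lm(quality ~ alcohol + volatile.acidity + density, data = sim)
s   <- summary(fit)

s$r.squared                        # share of variance explained
p_values <- coef(s)[, "Pr(>|t|)"]  # one p-value per coefficient
names(p_values)[p_values < 0.05]   # terms significant at the 0.05 level
```

Checking each term against a stated threshold this way avoids misreading the significance stars in a long coefficient table.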
I found relationships between several pairs of variables.
The negative correlation between volatile acidity and citric acid is interesting.
It is not what I expected.
The relationship between fixed acidity and pH is the strongest (r = -0.683).
I grouped the quality scores into a categorical level attribute.
The polygons are drawn at the 0.95 confidence level.
High-quality wines have higher citric acid and lower volatile acidity, while low-quality wines have lower citric acid and higher volatile acidity.
High-quality wines have higher alcohol and citric acid. Middle- and low-quality wines have similar alcohol content, but middle-quality wines have more citric acid.
There is no relationship between alcohol and citric acid.
Except for the lowest-quality wines, the relationship between volatile acidity and alcohol is positive within each quality level. Also, the higher the volatile acidity, the worse the quality of the wine.
Grouping the wines by quality in the scatter plot of citric acid and volatile acidity shows clearly that higher citric acid and lower volatile acidity go with better quality.
The scatter plot alone shows no relationship between alcohol and citric acid. However, coloring it by quality level shows that alcohol is a really important variable for identifying high-quality wines, and that citric acid is also a fairly important variable for wine quality.
Among high-quality wines, most of those with low alcohol have high citric acid and low volatile acidity. High-quality wines with low citric acid and high volatile acidity tend to have a high level of alcohol.
In the bivariate plots section, I created a linear model to predict wine quality from alcohol, volatile acidity, sulphates, citric acid, density and pH. However, it explains only 34.21% of the variability in quality, which means it is not very accurate.
The violin plots with overlaid box plots show the distribution of volatile acidity and citric acid for each quality score. As quality improves, volatile acidity is distributed at lower levels and citric acid at higher levels. The black lines connecting the medians of each quality score support the negative relationship between volatile acidity and quality, and the positive relationship between citric acid and quality.
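A violin plot with an overlaid box plot and a line through the medians can be sketched in ggplot2 as below. The data is simulated with a built-in downward trend, so it only illustrates the layering; the `fun` argument of `stat_summary()` assumes ggplot2 3.3.0 or later.

```r
library(ggplot2)

set.seed(5)  # simulated data, for illustration only
sim <- data.frame(quality = rep(3:8, each = 40))
# Build in a downward trend so the median line has something to show.
sim$volatile.acidity <- 1.1 - 0.1 * sim$quality + rnorm(nrow(sim), sd = 0.1)
sim$quality <- factor(sim$quality)

p <- ggplot(sim, aes(x = quality, y = volatile.acidity)) +
  geom_violin() +
  geom_boxplot(width = 0.15) +
  # group = 1 joins the per-quality medians into a single line
  stat_summary(aes(group = 1), fun = median, geom = "line", colour = "black")
print(p)

# The same medians as numbers, one per quality score.
medians <- tapply(sim$volatile.acidity, sim$quality, median)
medians
```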
The additional ellipses on the right-hand plot make the quality levels easier to see. Lower volatile acidity and higher citric acid go with better wine quality.
(confidence levels: 0.5, 0.4, 0.3, 0.2, 0.1, 0.05, 0.01)
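Ellipses like these can be drawn with ggplot2's `stat_ellipse()`, whose `level` argument sets the confidence level. The sketch below uses simulated data with three made-up quality levels; it shows a single 0.95 ellipse per group and then nested ellipses at the levels listed above.

```r
library(ggplot2)

set.seed(4)  # simulated data, for illustration only
sim <- data.frame(
  level = factor(rep(c("low", "middle", "high"), each = 100),
                 levels = c("low", "middle", "high"))
)
centre <- c(low = 0.15, middle = 0.25, high = 0.40)[as.character(sim$level)]
sim$citric.acid      <- rnorm(300, mean = centre, sd = 0.08)
sim$volatile.acidity <- rnorm(300, mean = 0.9 - centre, sd = 0.08)

base <- ggplot(sim, aes(citric.acid, volatile.acidity, colour = level))

# One ellipse per quality level at the 0.95 level.
p95 <- base + geom_point(alpha = 0.3) + stat_ellipse(level = 0.95)

# Nested ellipses: one layer per confidence level.
ellipse_levels <- c(0.5, 0.4, 0.3, 0.2, 0.1, 0.05, 0.01)
p_nested <- base
for (lv in ellipse_levels) {
  p_nested <- p_nested + stat_ellipse(level = lv)
}

print(p95)
print(p_nested)
```

Summarising each group by ellipses instead of individual points sidesteps the overplotting that a scatter plot of many overlapping points would show.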
I added an ECDF plot on the right side. The cumulative proportion for high-quality wines begins to rise at a higher alcohol content than for the other levels. In both plots, there is no big difference between low- and middle-quality wines. High-quality wines, however, differ clearly from both middle- and low-quality wines.
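An empirical cumulative distribution function (ECDF) gives, for each alcohol value, the share of wines at or below it. A minimal base-R sketch on simulated data (the two groups and their means are made up for illustration):

```r
set.seed(6)  # simulated data, for illustration only
alcohol_high  <- rnorm(200, mean = 11.5, sd = 1.0)
alcohol_other <- rnorm(200, mean = 10.2, sd = 0.9)

# ecdf() returns a function: F(x) = proportion of values <= x.
F_high  <- ecdf(alcohol_high)
F_other <- ecdf(alcohol_other)

F_high(11)   # share of simulated high-quality wines at or below 11%
F_other(11)  # share of the other simulated wines at or below 11%

plot(F_other, main = "ECDF of alcohol", xlab = "Alcohol (%)",
     ylab = "Cumulative proportion")
lines(F_high, col = "red")  # a curve shifted right rises at higher alcohol
```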
This data set contains a lot of surprising information on red wines and their chemical properties. I carried out the exploratory data analysis step by step, from one variable to two variables to multiple variables, and found which features are related to wine quality.
I wish the data set included other variables, such as the price of the wine or the place where it was made. Such a data set would raise more interesting questions.
I was able to create a linear model to predict quality from new data, but the model was not accurate. If quality were recorded as a continuous variable, the analysis could be more accurate. With a continuous target variable, we could rescale quality to get better visualizations, which would make the results clearer and help improve the linear model. Another kind of model might still predict wine quality well.
While exploring this dataset, I tried to make scatter plots, but because the dataset is large, the data points overlap and the plots are hard to read. Even adjusting color and opacity didn't help much, and I also struggled to make a bubble chart in the multivariate analysis. For this reason, I used the ‘stat_ellipse’ and ‘stat_smooth’ functions, which really helped me get better plots.