I decided to explore the Red Wine Quality data set because I am interested in which chemical properties indicate good wine quality. This tidy data set contains 1,599 red wines with 11 variables describing the chemical properties of each wine. At least three wine experts rated the quality of each wine on a scale from 0 (very bad) to 10 (excellent). My analysis focuses on one question: which chemical properties influence the quality of red wines?
The raw table has 1,599 records and 13 columns: “X”, “fixed.acidity”, “volatile.acidity”, “citric.acid”, “residual.sugar”, “chlorides”, “free.sulfur.dioxide”, “total.sulfur.dioxide”, “density”, “pH”, “sulphates”, “alcohol”, “quality”. “X” is only a row index, which is why the outputs in this report show 12 variables.
It is also important to keep the description of the variables in mind:
1 - fixed acidity: most acids involved with wine are fixed or nonvolatile (do not evaporate readily)
2 - volatile acidity: the amount of acetic acid in wine, which at too high of levels can lead to an unpleasant, vinegar taste
3 - citric acid: found in small quantities, citric acid can add ‘freshness’ and flavor to wines
4 - residual sugar: the amount of sugar remaining after fermentation stops, it’s rare to find wines with less than 1 gram/liter and wines with greater than 45 grams/liter are considered sweet
5 - chlorides: the amount of salt in the wine
6 - free sulfur dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine
7 - total sulfur dioxide: amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine
8 - density: the density of wine is close to that of water, depending on the percent alcohol and sugar content
9 - pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale
10 - sulphates: a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant
11 - alcohol: the percent alcohol content of the wine
Output variable (based on sensory data):
12 - quality (score between 0 and 10)
As a first step, I computed some descriptive statistics to get to know the data.
## [1] 1599 12
## [1] "fixed.acidity" "volatile.acidity" "citric.acid"
## [4] "residual.sugar" "chlorides" "free.sulfur.dioxide"
## [7] "total.sulfur.dioxide" "density" "pH"
## [10] "sulphates" "alcohol" "quality"
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 4.60 Min. :0.1200 Min. :0.000 Min. : 0.900
## 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090 1st Qu.: 1.900
## Median : 7.90 Median :0.5200 Median :0.260 Median : 2.200
## Mean : 8.32 Mean :0.5278 Mean :0.271 Mean : 2.539
## 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420 3rd Qu.: 2.600
## Max. :15.90 Max. :1.5800 Max. :1.000 Max. :15.500
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.01200 Min. : 1.00 Min. : 6.00
## 1st Qu.:0.07000 1st Qu.: 7.00 1st Qu.: 22.00
## Median :0.07900 Median :14.00 Median : 38.00
## Mean :0.08747 Mean :15.87 Mean : 46.47
## 3rd Qu.:0.09000 3rd Qu.:21.00 3rd Qu.: 62.00
## Max. :0.61100 Max. :72.00 Max. :289.00
## density pH sulphates alcohol
## Min. :0.9901 Min. :2.740 Min. :0.3300 Min. : 8.40
## 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500 1st Qu.: 9.50
## Median :0.9968 Median :3.310 Median :0.6200 Median :10.20
## Mean :0.9967 Mean :3.311 Mean :0.6581 Mean :10.42
## 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300 3rd Qu.:11.10
## Max. :1.0037 Max. :4.010 Max. :2.0000 Max. :14.90
## quality
## Min. :3.000
## 1st Qu.:5.000
## Median :6.000
## Mean :5.636
## 3rd Qu.:6.000
## Max. :8.000
## 'data.frame': 1599 obs. of 12 variables:
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
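The outputs above are the kind produced by base R's `dim()`, `names()`, `summary()`, and `str()`. The following is a minimal sketch, not necessarily the code used for this report. It assumes the data frame is called `pf` (the name that appears in the correlation output of this report). The toy rows are copied from the `str()` output above, and the commented-out reading step is an assumption about how the file would be loaded.

```r
# Toy stand-in for the wine data: the first ten rows of four columns,
# copied from the str() output in this report.
pf <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# With the real file one would presumably read it like this instead
# (assumption: the CSV sits in the working directory and its first
# column X is only a row index):
# pf <- read.csv("wineQualityReds.csv")
# pf$X <- NULL

dim(pf)      # number of rows and columns
names(pf)    # column names
summary(pf)  # minimum, quartiles, mean, and maximum of each column
str(pf)      # column types and first values
```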
Most of the data are numerical: eleven numeric variables and one integer variable (quality). There are no string data in the data set. Every record gives the quality of a wine together with its chemical properties, which is exactly the information this analysis needs.
The main features of interest are the chemical properties that indicate good wine quality. I would like to look at how each property correlates with quality.
Plots such as histograms and line charts can support the data analysis visually. I will therefore look for dimensions or measures that are suited to visual analysis.
I created a new variable for the quality type with the levels excellent, average, and poor. This helps me to understand which type of quality I am looking at and which properties each quality type has.
Now I will analyse every element separately. In this analysis I will also look at the Pearson correlation. I am interested in the elements that have a high correlation with quality.
##
## Pearson's product-moment correlation
##
## data: pf$quality and pf$citric.acid
## t = 9.2875, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1793415 0.2723711
## sample estimates:
## cor
## 0.2263725
There is a weak positive correlation between quality and citric.acid (r = 0.23). The distribution of citric.acid is right-skewed.
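The Pearson tests in this section have the form of the output of base R's `cor.test()`. As a minimal sketch, the following runs the same kind of test on the first ten `quality` and `citric.acid` values from the `str()` output earlier in this report. With so few rows the numbers differ from the full-data result above.

```r
# First ten values of each variable, copied from the str() output.
quality     <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
citric.acid <- c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36)

# Pearson's product-moment correlation with a two-sided test of r = 0.
res <- cor.test(quality, citric.acid, method = "pearson")

res$estimate  # correlation coefficient r
res$p.value   # p-value for the null hypothesis r = 0
res$conf.int  # 95 percent confidence interval for r
```

Reading the coefficient, the p-value, and the confidence interval together helps to separate "weak but clearly non-zero" from "indistinguishable from zero".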
##
## Pearson's product-moment correlation
##
## data: pf$quality and pf$volatile.acidity
## t = -16.954, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.4313210 -0.3482032
## sample estimates:
## cor
## -0.3905578
There is a weak to moderate negative correlation between quality and volatile.acidity (r = -0.39). The distribution is approximately normal.
##
## Pearson's product-moment correlation
##
## data: pf$quality and pf$alcohol
## t = 21.639, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4373540 0.5132081
## sample estimates:
## cor
## 0.4761663
There is a moderate positive correlation between quality and alcohol (r = 0.48), and the distribution is approximately normal.
##
## Pearson's product-moment correlation
##
## data: pf$quality and pf$residual.sugar
## t = 0.5488, df = 1597, p-value = 0.5832
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.03531327 0.06271056
## sample estimates:
## cor
## 0.01373164
The correlation is very weak (r = 0.01) and not statistically significant (p = 0.58). The distribution is right-skewed with a long right tail.
##
## Pearson's product-moment correlation
##
## data: pf$quality and pf$pH
## t = -2.3109, df = 1597, p-value = 0.02096
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.106451268 -0.008734972
## sample estimates:
## cor
## -0.05773139
The correlation is close to zero (r = -0.06). The distribution is approximately normal.
##
## Pearson's product-moment correlation
##
## data: pf$quality and pf$total.sulfur.dioxide
## t = -7.5271, df = 1597, p-value = 8.622e-14
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2320162 -0.1373252
## sample estimates:
## cor
## -0.1851003
The correlation is weak and negative (r = -0.19). The distribution is right-skewed.
##
## Pearson's product-moment correlation
##
## data: pf$quality and pf$sulphates
## t = 10.38, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2049011 0.2967610
## sample estimates:
## cor
## 0.2513971
The correlation is weak and positive (r = 0.25). The distribution is right-skewed with a long right tail.
##
## Pearson's product-moment correlation
##
## data: pf$quality and pf$density
## t = -7.0997, df = 1597, p-value = 1.875e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2220365 -0.1269870
## sample estimates:
## cor
## -0.1749192
The correlation is weak and negative (r = -0.17). The distribution has no clearly recognisable shape.
##
## Pearson's product-moment correlation
##
## data: pf$quality and pf$free.sulfur.dioxide
## t = -2.0269, df = 1597, p-value = 0.04283
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.099430290 -0.001638987
## sample estimates:
## cor
## -0.05065606
The correlation is close to zero (r = -0.05). The distribution is right-skewed.
I am also curious whether I can recognise more interesting correlations by building a scatterplot matrix.
The scatterplot matrix gives me the big picture. However, I want to know more details, so I will do further statistical analysis.
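A scatterplot matrix and the matching correlation matrix can be sketched with base R's `pairs()` and `cor()`. This is only an illustration, and the report itself may have used a different plotting package. The toy rows are the first ten values from the `str()` output earlier in this report, and the data frame name `pf` follows the correlation output.

```r
# Toy stand-in: first ten rows of four columns from the str() output.
pf <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  citric.acid      = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Pairwise Pearson correlations between all columns.
cors <- cor(pf)

# Correlation of every variable with quality, rounded for reading.
round(cors[, "quality"], 2)

# One scatterplot per pair of variables.
pairs(pf)
```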
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.636 6.000 8.000
According to the summary, the quality scores range from 3 to 8 and are concentrated around 5 and 6 (the first quartile is 5, the median and the third quartile are 6), so the distribution is roughly normal.
In this section I would like to look at the quality of wine and try to answer the question: does the quality of wine depend on the percentage of alcohol? For this analysis I will create boxplots for each quality level.
This boxplot chart shows the distribution of alcohol (% by volume) per quality level. For example, the wines with a quality score of 5 have the smallest box: their interquartile range is narrow, and the middle 50% of wines with this score lie inside this box. The mean alcohol content (% by volume) for this quality level lies above the median (see the red dot in the box).
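A boxplot per quality level with the group mean marked as a red dot can be sketched in base R as follows. This is an illustration under assumptions: the report's own chart may have been drawn with a different package, and the ten toy values are the first rows from the `str()` output earlier in this report.

```r
# First ten values of each variable, copied from the str() output.
alcohol <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
quality <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# One box per quality score; the box spans the interquartile range
# and the thick line inside it marks the median.
boxplot(alcohol ~ quality,
        xlab = "quality (score)",
        ylab = "alcohol (% by volume)")

# Group means, drawn as red dots on top of the boxes.
means <- tapply(alcohol, quality, mean)
points(seq_along(means), means, col = "red", pch = 19)
```

A red dot above the median line indicates that a few wines with high alcohol content pull the mean upwards.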
The strongest relationship I could find is the one between quality and alcohol, with a correlation of 0.48.
Now I will create a new variable called “quality.type”, which divides the wines into the categories “poor”, “average”, and “excellent”. This grouping will help me to detect the differences between the groups more easily. I also want to know the distribution within each quality type. For this I will create boxplots.
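One way to derive such a grouping is base R's `cut()`. The sketch below is an illustration, not the report's actual code. The lower bound of 7 for “excellent” matches the summary of the excellent wines shown in this report, while the boundary between “poor” and “average” (scores up to 4 count as poor) is an assumption made only for this example.

```r
# Example quality scores covering the range observed in the data (3 to 8).
quality <- c(3L, 4L, 5L, 6L, 7L, 8L)

# Intervals are closed on the right: (-Inf, 4], (4, 6], (6, Inf).
# The 4/5 boundary is an assumption; excellent >= 7 follows the report.
quality.type <- cut(quality,
                    breaks = c(-Inf, 4, 6, Inf),
                    labels = c("poor", "average", "excellent"))

# Ordered factor with the level order used in the report's output.
quality.type <- factor(quality.type,
                       levels = c("excellent", "average", "poor"),
                       ordered = TRUE)

table(quality.type)
```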
It is obvious that the red wines of the type “excellent” have a low level of volatile acidity. Most red wines of average quality have fairly high values, and most red wines of poor quality have the highest values of volatile acidity. This relationship points to a negative correlation, which I calculated above: -0.39.
Furthermore, I would like to see all values for the wines of the type “excellent”. This will help me to understand which chemical properties are typical for this type.
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 4.900 Min. :0.1200 Min. :0.0000 Min. :1.200
## 1st Qu.: 7.400 1st Qu.:0.3000 1st Qu.:0.3000 1st Qu.:2.000
## Median : 8.700 Median :0.3700 Median :0.4000 Median :2.300
## Mean : 8.847 Mean :0.4055 Mean :0.3765 Mean :2.709
## 3rd Qu.:10.100 3rd Qu.:0.4900 3rd Qu.:0.4900 3rd Qu.:2.700
## Max. :15.600 Max. :0.9150 Max. :0.7600 Max. :8.900
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.01200 Min. : 3.00 Min. : 7.00
## 1st Qu.:0.06200 1st Qu.: 6.00 1st Qu.: 17.00
## Median :0.07300 Median :11.00 Median : 27.00
## Mean :0.07591 Mean :13.98 Mean : 34.89
## 3rd Qu.:0.08500 3rd Qu.:18.00 3rd Qu.: 43.00
## Max. :0.35800 Max. :54.00 Max. :289.00
## density pH sulphates alcohol
## Min. :0.9906 Min. :2.880 Min. :0.3900 Min. : 9.20
## 1st Qu.:0.9947 1st Qu.:3.200 1st Qu.:0.6500 1st Qu.:10.80
## Median :0.9957 Median :3.270 Median :0.7400 Median :11.60
## Mean :0.9960 Mean :3.289 Mean :0.7435 Mean :11.52
## 3rd Qu.:0.9973 3rd Qu.:3.380 3rd Qu.:0.8200 3rd Qu.:12.20
## Max. :1.0032 Max. :3.780 Max. :1.3600 Max. :14.00
## quality quality.type
## Min. :7.000 excellent:217
## 1st Qu.:7.000 average : 0
## Median :7.000 poor : 0
## Mean :7.083
## 3rd Qu.:7.000
## Max. :8.000
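Here I see that there are 217 samples of red wine of the “excellent” type. Compared with the summary of all wines, excellent wines have on average more alcohol (11.52 vs. 10.42), higher fixed acidity (8.85 vs. 8.32), more citric acid (0.38 vs. 0.27) and more sulphates (0.74 vs. 0.66), but less volatile acidity (0.41 vs. 0.53) and less total sulfur dioxide (34.89 vs. 46.47). The variables are measured on different scales, so their raw magnitudes cannot be compared with each other directly.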
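A comparison of one quality type with the whole data set can be sketched with `subset()` and `colMeans()`. The data frame name `pf` follows the correlation output of this report, the toy rows are the first ten from the `str()` output, and the helper columns are illustrative only.

```r
# Toy stand-in: first ten rows of three columns from the str() output.
pf <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Mark wines with a score of 7 or more as excellent, as in the report.
pf$quality.type <- ifelse(pf$quality >= 7, "excellent", "other")

# Keep only the excellent wines and summarise them.
excellent <- subset(pf, quality.type == "excellent")
summary(excellent[, c("volatile.acidity", "alcohol", "quality")])

# Put the group means next to the overall means for a direct comparison.
vars <- c("volatile.acidity", "alcohol")
comparison <- rbind(excellent = colMeans(excellent[, vars]),
                    all       = colMeans(pf[, vars]))
comparison
```

Comparing each mean with the overall mean of the same variable avoids comparing variables that are measured on different scales.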
In this section I would like to focus on sugar, alcohol, and quality. I will therefore work with the variables residual sugar, alcohol, and quality type. Residual sugar is the amount of sugar remaining after fermentation stops; it is rare to find wines with less than 1 gram/liter, and wines with more than 45 grams/liter are considered sweet. I would like to know whether wine of good quality is usually sweet or not and how much alcohol it contains.
From the chart it looks as if a wine of excellent quality is usually not sweet and does not contain much alcohol. The wines of the quality type “average” seem to contain much more alcohol and a bit more sugar. The wines of the quality type “poor” seem to contain less sugar and more alcohol. I would like to look at the numbers behind this chart in detail.
## [1] "Alcohol by quality type:"
## # A tibble: 3 x 2
## quality.type mean
## <ord> <dbl>
## 1 excellent 11.5
## 2 average 10.3
## 3 poor 10.2
## [1] "Sugar by type:"
## # A tibble: 3 x 2
## quality.type mean
## <ord> <dbl>
## 1 excellent 2.71
## 2 average 2.50
## 3 poor 2.68
These numbers are interesting, because the values are difficult to estimate exactly from the visual presentation. Now we have a different picture: the type “excellent” contains the most sugar and the most alcohol on average. The type “poor” contains the least alcohol, but more sugar than the type “average”. The type “average” contains the least sugar on average, and its alcohol percentage is in the middle range: more than “poor” and less than “excellent”.
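Group means like the ones in the tables above can be computed in base R with `aggregate()`. The tibble output suggests that the report used a different package, so this is only an equivalent sketch. The six rows below are hypothetical values, chosen so that the group means resemble the rounded means in the tables.

```r
# Hypothetical toy data: two wines per quality type.
pf <- data.frame(
  quality.type   = factor(c("poor", "poor", "average", "average",
                            "excellent", "excellent"),
                          levels = c("excellent", "average", "poor"),
                          ordered = TRUE),
  alcohol        = c(10.0, 10.4, 10.2, 10.4, 11.0, 12.0),
  residual.sugar = c(2.6, 2.8, 2.4, 2.6, 2.5, 2.9)
)

# Mean alcohol and mean residual sugar for every quality type.
type.means <- aggregate(cbind(alcohol, residual.sugar) ~ quality.type,
                        data = pf, FUN = mean)
type.means
```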
I am also interested in the density and the quality of the wine. Density is generally used as a measure of the conversion of sugar to alcohol. The must, with sugar but no alcohol, has a high density. The finished wine has less sugar but lots of alcohol and thus has a lower density. The difference between the two is used to calculate the alcohol content. I would like to know whether density is relevant for quality. Let’s have a look at whether we can recognise this pattern (less alcohol means higher density, and vice versa) in our data set.
You can see that wines with a high density have less alcohol, and vice versa. A trend line on this chart would help us to understand the data better. I put alcohol on the y axis and density on the x axis, which helps me to analyse the values of density and alcohol visually.
The trend line makes the trend clear. Now you can be sure: a higher alcohol value goes along with a lower density, and vice versa. The correlation between these two properties therefore has to be negative. Let’s run a Pearson correlation test to see the results:
##
## Pearson's product-moment correlation
##
## data: pf$density and pf$alcohol
## t = -22.838, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.5322547 -0.4583061
## sample estimates:
## cor
## -0.4961798
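A scatter plot with a linear trend line and the matching Pearson test can be sketched in base R as shown below. The values are hypothetical and constructed so that density falls as alcohol rises; they are not taken from the wine data. The axis assignment follows the text (density on the x axis, alcohol on the y axis).

```r
# Hypothetical values: density decreases as alcohol increases,
# with a small alternating deviation added to each point.
alcohol <- seq(9, 13, length.out = 20)
density <- 1.005 - 0.0008 * alcohol + rep(c(-0.0003, 0.0003), 10)

# Scatter plot with density on the x axis and alcohol on the y axis.
plot(density, alcohol,
     xlab = "density (g/cm^3)",
     ylab = "alcohol (% by volume)")

# Linear trend line fitted by least squares.
fit <- lm(alcohol ~ density)
abline(fit, col = "red")

# Pearson correlation test for the same pair of variables.
res <- cor.test(density, alcohol)
res$estimate
```

A downward-sloping trend line and a negative coefficient describe the same relationship: the higher the density, the lower the alcohol content.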
Let’s discover other properties, such as salt and sugar:
## [1] "Salt by type:"
## # A tibble: 3 x 2
## quality.type mean
## <ord> <dbl>
## 1 excellent 0.0759
## 2 average 0.0890
## 3 poor 0.0957
## [1] "Sugar by type:"
## # A tibble: 3 x 2
## quality.type mean
## <ord> <dbl>
## 1 excellent 2.71
## 2 average 2.50
## 3 poor 2.68
All types contain much less salt than sugar. There is least salt in the “excellent” type and most salt in the “poor” type.
##
## Pearson's product-moment correlation
##
## data: pf$residual.sugar and pf$chloride
## t = 2.2257, df = 1597, p-value = 0.02617
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.006606405 0.104346223
## sample estimates:
## cor
## 0.05560954
The correlation between salt and sugar is very weak: 0.056.
I observed negative relationships between quality and salt (chlorides) and between quality and volatile acidity, and positive correlations between quality and alcohol and between quality and citric acid. The correlation between quality and alcohol is 0.48. It is a moderate correlation and the strongest of all the correlation values I compared. I also noticed that red wine of excellent quality has, on average, the highest proportion of alcohol and sugar and the lowest proportion of salt.
The last comparison I would like to make is the relationship between pH and the quality type of the wine. pH describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3 and 4 on the pH scale. I want to check whether I can see this in the data.
I calculated the correlation between quality and all other chemical properties. I observed that most chemical properties show only a very weak relationship with quality or none at all; alcohol (0.48) and volatile acidity (-0.39) are the exceptions.
I think this is a very powerful visualisation, because I can immediately recognise the distribution and easily compare the average values of every quality level. I see that the wines of high quality (7-8) have the highest average alcohol value in comparison with the other quality levels. The quartiles and median values of the boxplots give me an opportunity for interpretation. 75% of the wines of middle quality (5) have about 10% alcohol, which is the lowest average alcohol value. 75% of the wines of the highest quality (8) have about 13% alcohol, which is the highest value in comparison with the other quality levels.
This line chart helps me to understand the relationship between pH, density, and wine quality in a simple way. Each line stands for one quality level; the highest level is dark blue and the lowest level is light blue. pH describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3 and 4 on the pH scale. I can clearly see this in the chart. I can also see that pH values between 3.3 and 3.5 have the highest density for all quality levels. Density here appears to mean the height of each curve, that is, how common a pH value is, because the physical density of these wines stays close to 1. The curve for the highest quality level reaches a density of around 3 at a pH of about 3.1. The wine with the lowest quality level has the lowest values of density and pH.
This scatter plot with a trend line is a very effective visualisation. You can easily recognise the negative correlation between alcohol and density. According to the plot, the wines of the poor type have low values of both density and alcohol. The wines of excellent quality usually have the highest level of alcohol and, in line with the negative correlation, a comparatively low density (mean 0.9960 versus 0.9967 for all wines). Density is generally used as a measure of the conversion of sugar to alcohol. The must, with sugar but no alcohol, has a high density. The finished wine has less sugar but lots of alcohol and thus has a lower density.
The wine data set contains information on 1,599 wines across twelve variables from around 2009. I started by understanding the individual variables in the data set, and then I explored interesting questions. I wanted to know:
- What characteristics (chemical properties) make an “excellent” wine?
- Does an excellent red wine contain more alcohol or sugar?
- Does an excellent red wine contain more salt?
- What about the freshness in the taste: how high is the citric acid content?
- Which characteristics give wine an unpleasant, vinegar taste: how high is the volatile.acidity?
- And last but not least: what pH value does an excellent wine have?
I could answer all of these questions with my analysis. I found out that an “excellent” red wine is associated with more alcohol, sugar, and citric acid and with less volatile acidity and a lower pH.
In addition, I compared the correlation between quality and all other chemical characteristics. I found out that we mostly have very weak or no correlation.
There are very few wines that are rated as low or high quality. We could improve the quality of our analysis by collecting more data and by creating more variables that may contribute to the quality of wine. This would likely improve the accuracy of any prediction model built on the data. Having said that, we have identified features that are associated with the quality of red wine, visualised their relationships, and summarised their statistics.