Dilyana Bossenz



I decided to explore the Red Wine Quality data set because I am interested in which chemical properties indicate good wine quality. This tidy data set contains 1,599 red wines with 11 variables on the chemical properties of the wine. At least 3 wine experts rated the quality of each wine on a scale from 0 (very bad) to 10 (excellent). In analysing this data set I focus on one question: Which chemical properties influence the quality of red wines?

Univariate Plots Section

The raw table has 1599 records and 13 columns. These are the column names: “X”, “fixed.acidity”, “volatile.acidity”, “citric.acid”, “residual.sugar”, “chlorides”, “free.sulfur.dioxide”, “total.sulfur.dioxide”, “density”, “pH”, “sulphates”, “alcohol”, “quality”. The output further down lists 12 columns because the row index “X” is not included.

It is also important to keep the following description of the variables in mind:

1 - fixed acidity: most acids involved with wine are fixed or nonvolatile (do not evaporate readily)

2 - volatile acidity: the amount of acetic acid in wine, which at too high of levels can lead to an unpleasant, vinegar taste

3 - citric acid: found in small quantities, citric acid can add ‘freshness’ and flavor to wines

4 - residual sugar: the amount of sugar remaining after fermentation stops, it’s rare to find wines with less than 1 gram/liter and wines with greater than 45 grams/liter are considered sweet

5 - chlorides: the amount of salt in the wine

6 - free sulfur dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine

7 - total sulfur dioxide: amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine

8 - density: the density of wine is close to that of water, depending on the percent alcohol and sugar content

9 - pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale

10 - sulphates: a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant

11 - alcohol: the percent alcohol content of the wine

Output variable (based on sensory data): 12 - quality (score between 0 and 10)

As a first step I computed some descriptive statistics in order to get to know the data.

## [1] 1599   12

##  [1] "fixed.acidity"        "volatile.acidity"     "citric.acid"         
##  [4] "residual.sugar"       "chlorides"            "free.sulfur.dioxide" 
##  [7] "total.sulfur.dioxide" "density"              "pH"                  
## [10] "sulphates"            "alcohol"              "quality"

##  fixed.acidity   volatile.acidity  citric.acid    residual.sugar  
##  Min.   : 4.60   Min.   :0.1200   Min.   :0.000   Min.   : 0.900  
##  1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090   1st Qu.: 1.900  
##  Median : 7.90   Median :0.5200   Median :0.260   Median : 2.200  
##  Mean   : 8.32   Mean   :0.5278   Mean   :0.271   Mean   : 2.539  
##  3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420   3rd Qu.: 2.600  
##  Max.   :15.90   Max.   :1.5800   Max.   :1.000   Max.   :15.500  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.01200   Min.   : 1.00       Min.   :  6.00      
##  1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 22.00      
##  Median :0.07900   Median :14.00       Median : 38.00      
##  Mean   :0.08747   Mean   :15.87       Mean   : 46.47      
##  3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 62.00      
##  Max.   :0.61100   Max.   :72.00       Max.   :289.00      
##     density             pH          sulphates         alcohol     
##  Min.   :0.9901   Min.   :2.740   Min.   :0.3300   Min.   : 8.40  
##  1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50  
##  Median :0.9968   Median :3.310   Median :0.6200   Median :10.20  
##  Mean   :0.9967   Mean   :3.311   Mean   :0.6581   Mean   :10.42  
##  3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10  
##  Max.   :1.0037   Max.   :4.010   Max.   :2.0000   Max.   :14.90  
##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.636  
##  3rd Qu.:6.000  
##  Max.   :8.000

## 'data.frame':    1599 obs. of  12 variables:
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
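The output above can be produced with a few base R calls. The following is only a minimal sketch: it writes a tiny stand-in file to a temporary path instead of reading the real `wineQualityReds.csv`, it uses just the first five rows of three columns as printed by `str()`, and dropping `X` with `pf$X <- NULL` is my assumption about how the 13 raw columns became 12.

```r
# First five rows of three columns as printed by str() above, plus the
# row-index column X that the raw file contains.
raw <- data.frame(
  X = 1:5,
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4),
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4),
  quality = c(5L, 5L, 5L, 6L, 5L)
)

# Temporary stand-in for wineQualityReds.csv (assumption for this sketch)
csv_path <- tempfile(fileext = ".csv")
write.csv(raw, csv_path, row.names = FALSE)

pf <- read.csv(csv_path)
pf$X <- NULL   # drop the row index so only measured variables remain

dim(pf)        # number of rows and columns
names(pf)      # column names
summary(pf)    # five-number summary and mean per column
str(pf)        # column types and first values
```

With the real file, the same four calls give the dimensions, names, summary and structure shown above.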

Univariate Analysis

What is the structure of your dataset?

All variables are numerical: eleven numeric chemical measurements and one integer quality score. There are no string data in the data set. Every record gives the quality of a wine together with its chemical properties, which is exactly the information I need.

What is/are the main feature(s) of interest in your dataset?

All chemical properties that are associated with good wine quality. I would like to look at the correlation of each property with quality.

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

Visualisations such as histograms or line charts can support the data analysis visually. I will therefore look for dimensions or measures that lend themselves to visual analysis.

Did you create any new variables from existing variables in the dataset?

Yes, I did. I created quality types: excellent, average, poor. This makes it possible to see which quality type a wine belongs to and to compare the chemical properties of each type.

Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?

Now I will analyse every element separately. In this analysis I will also look at the Pearson correlation. I am interested in the elements that have a high correlation with quality.

## 
##  Pearson's product-moment correlation
## 
## data:  pf$quality and pf$citric.acid
## t = 9.2875, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1793415 0.2723711
## sample estimates:
##       cor 
## 0.2263725

There is a weak positive correlation between quality and citric.acid (r = 0.23). The distribution is right skewed.

## 
##  Pearson's product-moment correlation
## 
## data:  pf$quality and pf$volatile.acidity
## t = -16.954, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.4313210 -0.3482032
## sample estimates:
##        cor 
## -0.3905578

There is a weak to moderate negative correlation between quality and volatile.acidity (r = -0.39). The distribution is approximately normal.

## 
##  Pearson's product-moment correlation
## 
## data:  pf$quality and pf$alcohol
## t = 21.639, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4373540 0.5132081
## sample estimates:
##       cor 
## 0.4761663

There is a moderate positive correlation between quality and alcohol (r = 0.48), and the distribution is approximately normal.
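As a cross-check, the t statistic and the 95 percent confidence interval in the alcohol test output can be recomputed from the reported correlation and the sample size alone. This is a sketch of the standard formulas (t from r, and the Fisher z transformation for the interval), not a replacement for `cor.test()`.

```r
r <- 0.4761663   # sample correlation between quality and alcohol, as reported
n <- 1599        # number of wines, so df = n - 2 = 1597

# t statistic for H0: true correlation is 0
t_stat <- r * sqrt((n - 2) / (1 - r^2))

# 95% confidence interval via the Fisher z transformation
z  <- atanh(r)
se <- 1 / sqrt(n - 3)
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se)

round(t_stat, 3)
round(ci, 4)
```

The values agree with the printed output (t = 21.639, interval 0.4374 to 0.5132) up to rounding.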

## 
##  Pearson's product-moment correlation
## 
## data:  pf$quality and pf$residual.sugar
## t = 0.5488, df = 1597, p-value = 0.5832
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.03531327  0.06271056
## sample estimates:
##        cor 
## 0.01373164

The correlation is very weak (r = 0.01) and not statistically significant (p = 0.58). The distribution is right skewed with a long tail.

## 
##  Pearson's product-moment correlation
## 
## data:  pf$quality and pf$pH
## t = -2.3109, df = 1597, p-value = 0.02096
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.106451268 -0.008734972
## sample estimates:
##         cor 
## -0.05773139

The correlation is close to zero (r = -0.06). The distribution is normal.

## 
##  Pearson's product-moment correlation
## 
## data:  pf$quality and pf$total.sulfur.dioxide
## t = -7.5271, df = 1597, p-value = 8.622e-14
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2320162 -0.1373252
## sample estimates:
##        cor 
## -0.1851003

The correlation is very weak (r = -0.19). The distribution is right skewed.

## 
##  Pearson's product-moment correlation
## 
## data:  pf$quality and pf$sulphates
## t = 10.38, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2049011 0.2967610
## sample estimates:
##       cor 
## 0.2513971

The correlation is weak (r = 0.25). The distribution is right skewed, with a long right tail.

## 
##  Pearson's product-moment correlation
## 
## data:  pf$quality and pf$density
## t = -7.0997, df = 1597, p-value = 1.875e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2220365 -0.1269870
## sample estimates:
##        cor 
## -0.1749192

The correlation is very weak (r = -0.17). The distribution has no recognisable shape.

## 
##  Pearson's product-moment correlation
## 
## data:  pf$quality and pf$free.sulfur.dioxide
## t = -2.0269, df = 1597, p-value = 0.04283
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.099430290 -0.001638987
## sample estimates:
##         cor 
## -0.05065606

The correlation is close to zero (r = -0.05). The distribution is right skewed.
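To compare the nine tests at a glance, the reported coefficients can be collected in one vector and ranked by absolute size. The sketch below only reuses the `cor` values printed above; the cut-offs for "negligible", "weak", "moderate" and "strong" are one common rule of thumb and an assumption of mine, not part of the test output.

```r
# Correlations with quality, copied from the cor.test() outputs above
r <- c(
  citric.acid          =  0.2263725,
  volatile.acidity     = -0.3905578,
  alcohol              =  0.4761663,
  residual.sugar       =  0.01373164,
  pH                   = -0.05773139,
  total.sulfur.dioxide = -0.1851003,
  sulphates            =  0.2513971,
  density              = -0.1749192,
  free.sulfur.dioxide  = -0.05065606
)

# Rank by strength, ignoring the sign
ranked <- r[order(abs(r), decreasing = TRUE)]

# Rule-of-thumb labels (assumed cut-offs: 0.1, 0.3, 0.5)
strength <- cut(abs(ranked),
                breaks = c(0, 0.1, 0.3, 0.5, 1),
                labels = c("negligible", "weak", "moderate", "strong"),
                include.lowest = TRUE)

data.frame(r = round(ranked, 3), strength)
```

Alcohol and volatile.acidity come out on top, which is why the later sections concentrate on them.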

Bivariate Plots Section

I am also curious whether I can recognise more interesting correlations in the data by building a scatterplot matrix.

The scatterplot matrix gives me the big picture. However, I want to know more details, so I will do further statistical analysis.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.636   6.000   8.000

According to the summary, red wine quality is concentrated around 5 and 6 (median 6, mean 5.64), and the distribution looks approximately normal.

Bivariate Analysis

In this section I look at the quality of wine and try to answer the question: Does the quality of wine depend on the percentage of alcohol? For this analysis I will create boxplots for each quality level.

This boxplot chart shows the distribution of alcohol (% by volume) per quality level. For example, the red wines with a quality score of 5 have the smallest box, i.e. the first and third quartiles are close to each other, and the middle 50% of wines with this score lie inside this box. The mean alcohol content (% by volume) for this quality level lies above the median (see the red dot in the box).
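A chart of this kind can be sketched in base R. The example below is an illustration only: it uses the first ten rows of `alcohol` and `quality` as printed by `str()` earlier, so the boxes will not look like those of the full data set, and the original chart may well have been drawn with another plotting package.

```r
# First ten rows as printed by str(); the full data set has 1599 rows
pf <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Mean alcohol per quality level (the red dots)
means <- tapply(pf$alcohol, pf$quality, mean)

pdf(NULL)  # discard the graphics output; remove this line to see the plot
boxplot(alcohol ~ quality, data = pf,
        xlab = "Quality score", ylab = "Alcohol (% by volume)")
points(seq_along(means), means, col = "red", pch = 19)
invisible(dev.off())

means
```

Plotting the mean as a separate point makes it easy to see whether it lies above or below the median line of each box.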

What was the strongest relationship you found?

The strongest relationship I found is between quality and alcohol. The correlation is 0.48.

Now I will create a new variable called “quality.type”, which divides the wines into the categories “poor”, “average”, and “excellent”. This grouping will help me detect the differences between the groups more easily. I also want to know the distribution within each quality type. For this I will create boxplots.
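One way to derive such a variable is `cut()`. This is a hedged sketch: the summary of the “excellent” group shows that it covers the scores 7 and 8, but the boundary between “poor” and “average” is not stated anywhere in this report, so the break at 4 is purely my assumption. The scores are the first ten values printed by `str()`.

```r
# First ten quality scores as printed by str()
quality <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# Assumed bins: poor = up to 4, average = 5-6, excellent = 7 and above
quality.type <- cut(quality,
                    breaks = c(0, 4, 6, 10),
                    labels = c("poor", "average", "excellent"),
                    ordered_result = TRUE)

table(quality.type)
```

Because the result is an ordered factor, boxplots and grouped summaries keep the categories in a meaningful order.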

It is obvious that red wines of the type “excellent” have a low level of volatile acidity. Most red wines of average quality have higher values, and most red wines of poor quality have the highest values of volatile acidity. This relationship indicates a negative correlation, which I calculated above: -0.39.

Furthermore, I would like to see all values for the wines of the type “excellent”. This will help me understand which chemical properties are characteristic of this type.

##  fixed.acidity    volatile.acidity  citric.acid     residual.sugar 
##  Min.   : 4.900   Min.   :0.1200   Min.   :0.0000   Min.   :1.200  
##  1st Qu.: 7.400   1st Qu.:0.3000   1st Qu.:0.3000   1st Qu.:2.000  
##  Median : 8.700   Median :0.3700   Median :0.4000   Median :2.300  
##  Mean   : 8.847   Mean   :0.4055   Mean   :0.3765   Mean   :2.709  
##  3rd Qu.:10.100   3rd Qu.:0.4900   3rd Qu.:0.4900   3rd Qu.:2.700  
##  Max.   :15.600   Max.   :0.9150   Max.   :0.7600   Max.   :8.900  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.01200   Min.   : 3.00       Min.   :  7.00      
##  1st Qu.:0.06200   1st Qu.: 6.00       1st Qu.: 17.00      
##  Median :0.07300   Median :11.00       Median : 27.00      
##  Mean   :0.07591   Mean   :13.98       Mean   : 34.89      
##  3rd Qu.:0.08500   3rd Qu.:18.00       3rd Qu.: 43.00      
##  Max.   :0.35800   Max.   :54.00       Max.   :289.00      
##     density             pH          sulphates         alcohol     
##  Min.   :0.9906   Min.   :2.880   Min.   :0.3900   Min.   : 9.20  
##  1st Qu.:0.9947   1st Qu.:3.200   1st Qu.:0.6500   1st Qu.:10.80  
##  Median :0.9957   Median :3.270   Median :0.7400   Median :11.60  
##  Mean   :0.9960   Mean   :3.289   Mean   :0.7435   Mean   :11.52  
##  3rd Qu.:0.9973   3rd Qu.:3.380   3rd Qu.:0.8200   3rd Qu.:12.20  
##  Max.   :1.0032   Max.   :3.780   Max.   :1.3600   Max.   :14.00  
##     quality         quality.type
##  Min.   :7.000   excellent:217  
##  1st Qu.:7.000   average  :  0  
##  Median :7.000   poor     :  0  
##  Mean   :7.083                  
##  3rd Qu.:7.000                  
##  Max.   :8.000

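Here I see that there are 217 samples of red wine of the “excellent” type. Compared with the whole data set, these wines have a higher mean alcohol (11.52 vs 10.42) and fixed.acidity (8.85 vs 8.32), but a lower mean total.sulfur.dioxide (34.89 vs 46.47). The large absolute numbers for sulfur dioxide only reflect its unit, so a high sulfur dioxide level does not characterise this type.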
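Since the variables are measured on very different scales, it helps to compare the means of the “excellent” wines with the means of all wines in relative terms. The sketch below only reuses means printed in the two `summary()` outputs of this report; the choice of six variables is mine.

```r
# Means copied from the summary() of all wines and of the "excellent" subset
means <- data.frame(
  variable  = c("fixed.acidity", "volatile.acidity", "citric.acid",
                "total.sulfur.dioxide", "sulphates", "alcohol"),
  all_wines = c(8.32, 0.5278, 0.271, 46.47, 0.6581, 10.42),
  excellent = c(8.847, 0.4055, 0.3765, 34.89, 0.7435, 11.52)
)

# Relative difference of the excellent group, in percent
means$pct_change <- round(
  100 * (means$excellent - means$all_wines) / means$all_wines, 1)

# Largest relative differences first
means[order(-abs(means$pct_change)), ]
```

A positive value means the excellent wines have more of that property on average, a negative value means they have less.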

Multivariate Plots Section

In this section I focus on sugar, alcohol and quality, so I will work with the variables residual sugar, alcohol and quality type. Residual sugar is the amount of sugar remaining after fermentation stops; it’s rare to find wines with less than 1 gram/liter, and wines with greater than 45 grams/liter are considered sweet. I would like to know whether wine of good quality is usually sweet or not, and how much alcohol it contains.

In the chart it looks as if a wine of excellent quality is usually not sweet and doesn’t contain much alcohol. Wine of the quality type “average” seems to contain much more alcohol and a bit more sugar. Wine of the quality type “poor” seems to contain less sugar and more alcohol. I would like to look at this chart in more detail.

## [1] "Alcohol by quality type:"

## # A tibble: 3 x 2
##   quality.type  mean
##   <ord>        <dbl>
## 1 excellent     11.5
## 2 average       10.3
## 3 poor          10.2

## [1] "Sugar by type:"

## # A tibble: 3 x 2
##   quality.type  mean
##   <ord>        <dbl>
## 1 excellent     2.71
## 2 average       2.50
## 3 poor          2.68

It is interesting to see these numbers, because exact values are difficult to estimate from the visual presentation. Now we have a different picture: the type “excellent” contains the most sugar and alcohol on average. The type “poor” contains the least alcohol, but more sugar than the type “average”. The type “average” contains the least sugar on average, and its alcohol percentage is in the middle range: more than “poor” and less than “excellent”.
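Group means like the ones in the two tables can be computed with `aggregate()` in base R. This is a sketch under assumptions: it uses only the first ten rows printed by `str()`, the “poor”/“average” boundary at 4 is my guess, and the tibble output suggests that the original tables were produced with a different package.

```r
# First ten rows as printed by str(); the full data set has 1599 rows
pf <- data.frame(
  residual.sugar = c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2, 6.1),
  alcohol        = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality        = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Assumed bins: poor = up to 4, average = 5-6, excellent = 7 and above
pf$quality.type <- cut(pf$quality,
                       breaks = c(0, 4, 6, 10),
                       labels = c("poor", "average", "excellent"),
                       ordered_result = TRUE)

# Mean alcohol and mean residual sugar per quality type
by_type <- aggregate(cbind(alcohol, residual.sugar) ~ quality.type,
                     data = pf, FUN = mean)
by_type
```

Types without any rows (here “poor”) do not appear in the result, so the numbers from this small sample are not comparable to the tables above.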

I am also interested in the density and quality of the wine. Density is generally used as a measure of the conversion of sugar to alcohol. The must, with sugar but no alcohol, has a high density. The finished wine has less sugar but lots of alcohol and thus has a lower density. The difference between the two is used to calculate the alcohol content. I would like to know whether density is relevant for quality. Let’s see whether we can recognise this pattern (less alcohol -> higher density, and vice versa) in our data set.

You can see that when density is high, alcohol is low, and vice versa. A trend line on this chart would help us understand the data better. I put alcohol on the y axis and density on the x axis, which helps me analyse the values of density and alcohol visually.

The trend line makes the trend clear. Now you can be sure: higher values of density go with lower values of alcohol, and vice versa. The correlation between these two properties therefore has to be negative. Let’s run a Pearson correlation test to see the results:

## 
##  Pearson's product-moment correlation
## 
## data:  pf$density and pf$alcohol
## t = -22.838, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.5322547 -0.4583061
## sample estimates:
##        cor 
## -0.4961798
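The scatter plot with trend line and the sign of the correlation can be sketched in base R with `lm()` and `abline()`. The example is only illustrative: `str()` prints just the first five density values, already rounded, so the sketch is limited to those five rows and its numbers differ from the test output above.

```r
# First five rows as printed (and rounded) by str()
pf <- data.frame(
  density = c(0.998, 0.997, 0.997, 0.998, 0.998),
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4)
)

# Linear trend of alcohol on density
fit <- lm(alcohol ~ density, data = pf)

pdf(NULL)  # discard the graphics output; remove this line to see the plot
plot(alcohol ~ density, data = pf,
     xlab = "Density", ylab = "Alcohol (% by volume)")
abline(fit, col = "red")   # trend line
invisible(dev.off())

coef(fit)[["density"]]          # slope of the trend line
cor(pf$density, pf$alcohol)     # Pearson correlation of the same rows
```

A negative slope and a negative correlation coefficient describe the same downward trend.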

Let’s discover other properties like salt and sugar:

## [1] "Salt by type:"

## # A tibble: 3 x 2
##   quality.type   mean
##   <ord>         <dbl>
## 1 excellent    0.0759
## 2 average      0.0890
## 3 poor         0.0957

## [1] "Sugar by type:"

## # A tibble: 3 x 2
##   quality.type  mean
##   <ord>        <dbl>
## 1 excellent     2.71
## 2 average       2.50
## 3 poor          2.68

All types contain much less salt than sugar. The “excellent” type contains the least salt and the “poor” type the most.

## 
##  Pearson's product-moment correlation
## 
## data:  pf$residual.sugar and pf$chloride
## t = 2.2257, df = 1597, p-value = 0.02617
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.006606405 0.104346223
## sample estimates:
##        cor 
## 0.05560954

The correlation between salt and sugar is very weak: 0.056

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

I observed negative relationships between quality and salt (chlorides) and between quality and volatile acidity, and positive correlations between quality and alcohol and between quality and citric acid. The correlation between quality and alcohol is 0.48. It is a moderate correlation and the strongest of all the correlation values I compared. I also noticed that red wine of excellent quality has, on average, the highest proportion of alcohol and sugar and the lowest proportion of salt.

The last comparison I would like to make is the relationship between pH and the quality type of wine. pH describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale. I want to check whether I can see this in the data.

Were there any interesting or surprising interactions between features?

I calculated the correlation between quality and each of the other chemical properties. I observed that most chemical properties show a very weak relationship with quality or none at all.

OPTIONAL: Did you create any models with your dataset? Discuss the
strengths and limitations of your model.

Final Plots and Summary

Plot One

Description One

I think this is a very powerful visualisation, as I can immediately recognise the distribution. I can very easily compare the average values of every quality level. I see that the wines of high quality (7-8) have the highest average alcohol values in comparison with the other quality levels. The quartiles and median values of the boxplot give me an opportunity for interpretation. 75% of the wines of middle quality (5) have ca. 10% alcohol or less; this is the lowest average alcohol value. 75% of the wines of the highest quality (8) have ca. 13% alcohol or less; this is the highest value in comparison with the other quality levels.

Plot Two

Description Two

This line chart helps me understand a very complex relationship between pH, density and quality of wine in a very simple way. Every line stands for one quality level. The highest level is dark blue and the lowest level is light blue. pH describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale. I can clearly see this in this visualisation. I can also see that pH values between 3.3 and 3.5 have the highest density for all quality levels. The highest quality level has a density around 3 at a pH around 3.1. The wine with the lowest quality level has the lowest values of density and pH.

Plot Three

Description Three

This scatter plot with trend line is a very effective visualisation. You can easily recognise the negative correlation between alcohol and density. You can recognise that the poor quality type has both the lowest values of density and the lowest values of alcohol. Wines of excellent quality usually have the highest level of alcohol and, in line with the negative correlation, a lower density (mean 0.9960 versus 0.9967 for all wines). Density is generally used as a measure of the conversion of sugar to alcohol. The must, with sugar but no alcohol, has a high density. The finished wine has less sugar but lots of alcohol and thus has a lower density.

Reflection

The wine data set contains information on 1599 wines across twelve variables from around 2009. I started by understanding the individual variables in the data set, and then I explored interesting questions. I wanted to know:

- What characteristics (chemical properties) make an “excellent” wine?
- Does an excellent red wine contain more alcohol or sugar?
- Does an excellent red wine contain more salt?
- What about the freshness in the taste: how much citric acid is present?
- Which characteristics give wine an unpleasant, vinegar taste: how much volatile.acidity is present?
- And last but not least: what pH value does an excellent wine have?

I was able to answer all these questions with my analysis. I found that red wines with an “excellent” rating tend to have more alcohol, sugar and citric acid, less volatile acidity and a lower pH.

In addition, I compared the correlation between quality and all other chemical characteristics. I found that the correlations are mostly very weak or absent.

There are very few wines that are rated as low or high quality. We could improve the quality of our analysis by collecting more data, and creating more variables that may contribute to the quality of wine. This will certainly improve the accuracy of the prediction models. Having said that, we have successfully identified features that impact the quality of red wine, visualized their relationships and summarized their statistics.