Dataset Overview: This tidy data set contains 1,599 red wines with 11 variables on the chemical properties of the wine. The inputs are objective tests (e.g. pH values) and the output is based on sensory data. At least 3 wine experts rated the quality of each wine between 0 (very bad) and 10 (excellent), and the quality score is the median of those ratings.

Variables: Input variables (based on physicochemical tests): fixed acidity (tartaric acid - g / dm^3), volatile acidity (acetic acid - g / dm^3), citric acid (g / dm^3), residual sugar (g / dm^3), chlorides (sodium chloride - g / dm^3), free sulfur dioxide (mg / dm^3), total sulfur dioxide (mg / dm^3), density (g / cm^3), pH, sulphates (potassium sulphate - g / dm^3), alcohol (% by volume).

Output variable (based on sensory data): quality (score between 0 and 10)

Scope of Analysis: In this analysis we will try to evaluate the chemical properties which affect the quality of red wine.

Checking first few rows of the dataset

##   ID fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1  1           7.4             0.70        0.00            1.9     0.076
## 2  2           7.8             0.88        0.00            2.6     0.098
## 3  3           7.8             0.76        0.04            2.3     0.092
## 4  4          11.2             0.28        0.56            1.9     0.075
## 5  5           7.4             0.70        0.00            1.9     0.076
## 6  6           7.4             0.66        0.00            1.8     0.075
##   free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates alcohol
## 1                  11                   34  0.9978 3.51      0.56     9.4
## 2                  25                   67  0.9968 3.20      0.68     9.8
## 3                  15                   54  0.9970 3.26      0.65     9.8
## 4                  17                   60  0.9980 3.16      0.58     9.8
## 5                  11                   34  0.9978 3.51      0.56     9.4
## 6                  13                   40  0.9978 3.51      0.56     9.4
##   quality
## 1       5
## 2       5
## 3       5
## 4       6
## 5       5
## 6       5

The data file summary

##        ID         fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000

Observation: The summary reports no NA's count for any variable, so there are no missing values in the dataset.
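The missing-value check can also be made explicit rather than read off the summary. The sketch below is only an illustration: `wine_head` is a hypothetical name, and it holds just the first six rows of four columns copied from the head output above, not the full data file.

```r
# First six rows of four columns, copied from the head output above.
# `wine_head` is an illustrative name, not an object from the original analysis.
wine_head <- data.frame(
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4),
  citric.acid   = c(0.00, 0.00, 0.04, 0.56, 0.00, 0.00),
  alcohol       = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4),
  quality       = c(5, 5, 5, 6, 5, 5)
)

# Count missing values per column; a zero means that column has no NA.
na_counts <- colSums(is.na(wine_head))
print(na_counts)

# One logical answer for the whole data frame.
has_missing <- anyNA(wine_head)
print(has_missing)
```

Applied to the full data frame, the same two calls would confirm the observation directly.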

Univariate Plot Section

In the univariate analysis we will try to evaluate individual attributes in our dataframe and do some initial data exploration by individual attribute.

Plotting the pH variable

Observation: The bar chart above shows the distribution of the pH variable. The distribution has break points, so the values are not spread uniformly across the range. The data also does not look normally distributed.

Transforming the data to make a normal distribution

Summary before the transformation

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.740   3.210   3.310   3.311   3.400   4.010

Summary after the transformation

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.008   1.166   1.197   1.196   1.224   1.389

Observation: From the above visual and summary statistics we can infer that the distribution is now approximately normal. After the log transformation the scale on the x axis has reduced from 2.7 to 4.0 into 1.0 to 1.4.
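The before and after summaries are consistent with a natural log, i.e. `log()` in R. This is inferred from the numbers, since the report does not state the base. A minimal sketch that reproduces the transformed quantiles from the untransformed ones (`ph_summary` and `log_summary` are illustrative names):

```r
# Quantiles of pH before the transformation, copied from the summary above.
# The mean is left out on purpose: the log of a mean is not the mean of the logs.
ph_summary <- c(Min = 2.740, Q1 = 3.210, Median = 3.310, Q3 = 3.400, Max = 4.010)

# log() is monotonic, so it maps each quantile (approximately, where the
# quantile is interpolated) onto the matching quantile of the logged data.
log_summary <- round(log(ph_summary), 3)
print(log_summary)
```

The rounded values agree with the "after the transformation" summary, which supports reading the x axis range as roughly 1.0 to 1.4.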

Plotting fixed.acidity variable

Observation: The bar chart above shows the distribution of the fixed.acidity variable. The distribution has break points, so the values are not spread uniformly across the range. The data also does not look normally distributed; the bar chart shows that the distribution is right skewed.

Transforming the data into a normal distribution

Summary before the transformation

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.60    7.10    7.90    8.32    9.20   15.90

Summary after the transformation

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.526   1.960   2.067   2.098   2.219   2.766

Plotting the volatile.acidity variable

Observation: The bar chart above shows the distribution of the volatile.acidity variable. The distribution has break points, so the values are not spread uniformly across the range. The data also does not look normally distributed; the bar chart shows that the distribution is right skewed.

Transforming the data into a normal distribution

Summary before the transformation

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3900  0.5200  0.5278  0.6400  1.5800

Summary after the transformation

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -2.1200 -0.9416 -0.6539 -0.6985 -0.4463  0.4574

Plotting the citric.acid variable

Observation: The bar chart above shows the distribution of the citric.acid variable. The distribution has break points, so the values are not spread uniformly across the range. The data also does not look normally distributed; the bar chart shows that the distribution is right skewed.

Data transformation

Summarizing the data before transformation

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.090   0.260   0.271   0.420   1.000

Summarizing the data after transformation

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    -Inf -2.4080 -1.3470    -Inf -0.8675  0.0000

Observation: From the above visual and summary statistics we can infer that the distribution is still not normally distributed. After the log transformation the scale on the x axis has changed from 0 to 1 into -Inf to 0: citric.acid contains zeros, and the log of zero is -Inf, which is also why the mean is -Inf. An alternate transformation is needed to bring this attribute closer to a normal distribution.
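Two common alternatives that stay finite at zero are `log1p()`, which computes log(1 + x), and `sqrt()`. The sketch below only shows that they avoid the -Inf problem. It is not the transformation used in this report, and neither is guaranteed to produce a normal distribution. The six values are the citric.acid entries from the head output above.

```r
# citric.acid values of the first six rows; four of them are exactly zero.
citric <- c(0.00, 0.00, 0.04, 0.56, 0.00, 0.00)

# Plain log: zeros become -Inf, which also drags the mean to -Inf.
plain_log <- log(citric)
print(summary(plain_log))

# log1p(x) = log(1 + x) is finite at zero.
shifted_log <- log1p(citric)
print(summary(shifted_log))

# The square root is another option that is defined at zero.
root <- sqrt(citric)
print(summary(root))
```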

Plotting the residual.sugar variable

Observation: The bar chart above shows the distribution of the residual.sugar variable. The distribution has break points, so the values are not spread uniformly across the range. The data also does not look normally distributed; the bar chart shows that the distribution is right skewed.

Transforming the data to a normal distribution

Summarizing the data before data transformation

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.900   2.200   2.539   2.600  15.500

Summarizing the data after data transformation

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -0.1054  0.6419  0.7885  0.8502  0.9555  2.7410

Observation: From the above visual and summary statistics we can infer that the distribution is still not normally distributed. After the log transformation the scale on the x axis has reduced from 0.9 to 15.5 into -0.1 to 2.7. An alternate transformation is needed to bring this attribute closer to a normal distribution.

Plotting the chloride variable

Observation: The bar chart above shows the distribution of the chlorides variable. The distribution has break points, so the values are not spread uniformly across the range. The data also does not look normally distributed; the bar chart shows that the distribution is right skewed.

Transforming to a normal distribution

Summarizing before the transformation

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100

Summarizing after the transformation

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -4.4230 -2.6590 -2.5380 -2.5050 -2.4080 -0.4927

Plotting the free.sulfur.dioxide variable

Observation: The bar chart above shows the distribution of the free.sulfur.dioxide variable. The distribution has break points, so the values are not spread uniformly across the range. The data also does not look normally distributed; the bar chart shows that the distribution is right skewed.

Plotting after the transformation to a normal distribution

Summarizing before the transformation

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    7.00   14.00   15.87   21.00   72.00

Summarizing after the transformation

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   1.946   2.639   2.546   3.045   4.277

Observation: From the above visual and summary statistics we can infer that the distribution is still not normally distributed. After the log transformation the scale on the x axis has reduced from 1 to 72 into 0 to 4.3. An alternate transformation is needed to bring this attribute closer to a normal distribution.

Plotting total.sulfur.dioxide variable

Observation: The bar chart above shows the distribution of the total.sulfur.dioxide variable. The distribution has break points, so the values are not spread uniformly across the range. The data also does not look normally distributed; the bar chart shows that the distribution is right skewed.

Plotting total.sulfur.dioxide after the data transformation

Summarizing the data before transformation

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   22.00   38.00   46.47   62.00  289.00

Summarizing after the transformation

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.792   3.091   3.638   3.601   4.127   5.666

Observation: From the above visual and summary statistics we can infer that the distribution is now approximately normal. After the log transformation the scale on the x axis has reduced from 6 to 289 into 1.8 to 5.7.

Plotting density variable

Observation: The bar chart above shows the distribution of the density variable. The distribution has break points, so the values are not spread uniformly across the range. The data also does not look normally distributed.

Plotting Density after the data transformation

Summarizing density before transformation

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9901  0.9956  0.9968  0.9967  0.9978  1.0040

Summarizing after the transformation

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -0.009980 -0.004410 -0.003255 -0.003260 -0.002167  0.003683

Plotting sulphates variable

Observation: The bar chart above shows the distribution of the sulphates variable. The distribution has break points, so the values are not spread uniformly across the range. The data also does not look normally distributed; in fact the distribution is right skewed.

Plotting sulphates variable after the transformation

Summarizing before transformation

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.5500  0.6200  0.6581  0.7300  2.0000

Summarizing after the transformation

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -1.1090 -0.5978 -0.4780 -0.4453 -0.3147  0.6931

Observation: From the above visual and summary statistics we can infer that the distribution is still not normally distributed. After the log transformation the scale on the x axis has changed from 0.3 to 2 into -1.1 to 0.7. An alternate transformation is needed to bring this attribute closer to a normal distribution.

Plotting alcohol variable

Observation: The bar chart above shows the distribution of the alcohol variable. The distribution has break points, so the values are not spread uniformly across the range. The data also does not look normally distributed; in fact the distribution is right skewed.

Plotting alcohol variable after the transformation

Summarizing alcohol variable before the transformation

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.20   10.42   11.10   14.90

Summarizing alcohol variable after the transformation

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.128   2.251   2.322   2.339   2.407   2.701

Observation: From the above visual and summary statistics we can infer that the distribution is still not normally distributed. After the log transformation the scale on the x axis has reduced from 8.4 to 14.9 into 2.1 to 2.7. An alternate transformation is needed to bring this attribute closer to a normal distribution.

Plotting quality variable

Observation: The bar chart above shows the distribution of the quality variable. We observe no break points in the distribution, so the scores in the observed range are all represented, although the quartiles below show that they concentrate at 5 and 6. Since quality is ordinal data we won’t perform any log transformation for the quality attribute.

Summarizing the quality variable

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.636   6.000   8.000

Observation: From the above visual and summary statistics we can infer that the quality distribution is not normally distributed. A log transformation would only compress the scale on the x axis from 3 to 8 into about 1.1 to 2.1, so it does not help for this ordinal attribute.

Bivariate Plots Section

In the bivariate analysis we will try to evaluate the relationship among different attributes in our dataframe and do some data exploration by comparing two attributes.

Bivariate Analysis

Correlation Matrix

##                      fixed.acidity volatile.acidity citric.acid
## fixed.acidity                 1.00            -0.26        0.67
## volatile.acidity             -0.26             1.00       -0.55
## citric.acid                   0.67            -0.55        1.00
## residual.sugar                0.11             0.00        0.14
## chlorides                     0.09             0.06        0.20
## free.sulfur.dioxide          -0.15            -0.01       -0.06
## total.sulfur.dioxide         -0.11             0.08        0.04
## density                       0.67             0.02        0.36
## pH                           -0.68             0.23       -0.54
## sulphates                     0.18            -0.26        0.31
## alcohol                      -0.06            -0.20        0.11
## quality                       0.12            -0.39        0.23
##                      residual.sugar chlorides free.sulfur.dioxide
## fixed.acidity                  0.11      0.09               -0.15
## volatile.acidity               0.00      0.06               -0.01
## citric.acid                    0.14      0.20               -0.06
## residual.sugar                 1.00      0.06                0.19
## chlorides                      0.06      1.00                0.01
## free.sulfur.dioxide            0.19      0.01                1.00
## total.sulfur.dioxide           0.20      0.05                0.67
## density                        0.36      0.20               -0.02
## pH                            -0.09     -0.27                0.07
## sulphates                      0.01      0.37                0.05
## alcohol                        0.04     -0.22               -0.07
## quality                        0.01     -0.13               -0.05
##                      total.sulfur.dioxide density    pH sulphates alcohol
## fixed.acidity                       -0.11    0.67 -0.68      0.18   -0.06
## volatile.acidity                     0.08    0.02  0.23     -0.26   -0.20
## citric.acid                          0.04    0.36 -0.54      0.31    0.11
## residual.sugar                       0.20    0.36 -0.09      0.01    0.04
## chlorides                            0.05    0.20 -0.27      0.37   -0.22
## free.sulfur.dioxide                  0.67   -0.02  0.07      0.05   -0.07
## total.sulfur.dioxide                 1.00    0.07 -0.07      0.04   -0.21
## density                              0.07    1.00 -0.34      0.15   -0.50
## pH                                  -0.07   -0.34  1.00     -0.20    0.21
## sulphates                            0.04    0.15 -0.20      1.00    0.09
## alcohol                             -0.21   -0.50  0.21      0.09    1.00
## quality                             -0.19   -0.17 -0.06      0.25    0.48
##                      quality
## fixed.acidity           0.12
## volatile.acidity       -0.39
## citric.acid             0.23
## residual.sugar          0.01
## chlorides              -0.13
## free.sulfur.dioxide    -0.05
## total.sulfur.dioxide   -0.19
## density                -0.17
## pH                     -0.06
## sulphates               0.25
## alcohol                 0.48
## quality                 1.00

Correlation Plot

Observation: From the above visualization we can infer that the quality attribute is positively correlated with alcohol, sulphates and citric.acid. Among these 3 chemical properties the correlation coefficient with quality is highest for alcohol at 0.48, followed by sulphates at 0.25 and citric.acid at 0.23. None of these coefficients indicates a strong relationship (a strong positive correlation usually has a coefficient of 0.5 or above), but among the available attributes these 3 variables have the largest positive correlations with the quality of red wine. The largest negative correlation is with volatile.acidity at -0.39, which is not explored further here. Therefore we narrow the scope of the analysis and explore these 3 attributes in more detail as far as quality of red wine is concerned.
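A matrix like the one above can be produced with `cor()` and `round()`. The sketch below is illustrative only: `wine_head` is a hypothetical name for the first six rows from the head output, so its coefficients will not match the full-data matrix, and Pearson correlation (the `cor()` default) is an assumption about the original analysis.

```r
# First six rows of the three shortlisted attributes plus quality,
# copied from the head output above (illustration only).
wine_head <- data.frame(
  alcohol     = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4),
  sulphates   = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56),
  citric.acid = c(0.00, 0.00, 0.04, 0.56, 0.00, 0.00),
  quality     = c(5, 5, 5, 6, 5, 5)
)

# Pairwise Pearson correlations, rounded to two decimals as in the matrix above.
cor_matrix <- round(cor(wine_head), 2)
print(cor_matrix)

# Correlations of each attribute with quality, largest first.
keep <- rownames(cor_matrix) != "quality"
quality_cor <- sort(cor_matrix[keep, "quality"], decreasing = TRUE)
print(quality_cor)
```

Sorting by `abs(quality_cor)` instead would rank negative correlations such as volatile.acidity alongside the positive ones.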

Anova Analysis: ANOVA tests whether the attribute under consideration differs across wines of different quality, and whether those differences are statistically significant.

Anova Analysis between alcohol and quality: Here are the null and alternative hypotheses.

Null Hypothesis (H0): There is no relationship between alcohol and quality of a wine.

Alternate Hypothesis (H1): There is a relationship between alcohol and quality of a wine.

Observation: To account for overplotting we have used geom_jitter instead of geom_point, with the transparency set to 0.0000000005, to observe the distribution a bit more clearly.
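A minimal sketch of such a jittered plot with ggplot2 follows. The axis assignment (quality on x, alcohol on y), the jitter width and the `alpha` value of 0.3 are assumptions chosen so the points stay visible on six rows; they are not the settings of the original plot.

```r
library(ggplot2)

# First six rows from the head output above (illustration only).
wine_head <- data.frame(
  quality = c(5, 5, 5, 6, 5, 5),
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4)
)

# geom_jitter adds a small random offset so identical points do not overlap.
# height = 0 keeps the alcohol values exact; alpha controls the transparency.
jitter_plot <- ggplot(wine_head, aes(x = quality, y = alcohol)) +
  geom_jitter(width = 0.2, height = 0, alpha = 0.3) +
  labs(x = "quality (score)", y = "alcohol (% by volume)")

print(jitter_plot)
```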

## Analysis of Variance Table
## 
## Response: log(alcohol)
##             Df Sum Sq Mean Sq F value    Pr(>F)    
## quality      1  3.556  3.5560  470.34 < 2.2e-16 ***
## Residuals 1597 12.074  0.0076                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Observation: As the p value is less than 0.05 we reject the null hypothesis; there is a statistically significant relationship between the quality of wine and alcohol.
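The table header "Response: log(alcohol)" and the single degree of freedom for quality suggest a call of the form `anova(lm(log(alcohol) ~ quality))`, with quality treated as a number. That call is inferred from the output and is not shown in the report. A sketch on the first six rows from the head output, where `wine_head` is an illustrative name and the resulting numbers will differ from the full-data table:

```r
# First six rows from the head output above (illustration only).
wine_head <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4),
  quality = c(5, 5, 5, 6, 5, 5)
)

# With quality as a numeric predictor the model has one slope, hence Df = 1.
# Using factor(quality) instead would compare group means and give k - 1
# degrees of freedom for k distinct quality scores.
fit <- lm(log(alcohol) ~ quality, data = wine_head)
anova_table <- anova(fit)
print(anova_table)

# p value of the F test for the quality term.
p_value <- anova_table[["Pr(>F)"]][1]
print(p_value)
```

With a numeric predictor the F test asks whether log(alcohol) changes linearly with the quality score, which is a narrower question than whether the group means differ.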

Anova Analysis between sulphates and quality: Here are the null and alternative hypotheses.

Null Hypothesis (H0): There is no relationship between sulphates and quality of a wine.

Alternate Hypothesis (H1): There is a relationship between sulphates and quality of a wine.

Observation: To account for overplotting we have used geom_jitter instead of geom_point, with the transparency set to 0.0000000005, to observe the distribution a bit more clearly.

## Analysis of Variance Table
## 
## Response: log(sulphates)
##             Df Sum Sq Mean Sq F value    Pr(>F)    
## quality      1  7.609  7.6085  168.15 < 2.2e-16 ***
## Residuals 1597 72.263  0.0452                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Observation: As p < 0.05 we reject the null hypothesis; there is a statistically significant relationship between sulphates and the quality of a wine.

Anova Analysis between citric.acid and quality: Here are the null and alternative hypotheses.

Null Hypothesis (H0): There is no relationship between citric.acid and quality of a wine.

Alternate Hypothesis (H1): There is a relationship between citric.acid and quality of a wine.

Observation: To account for overplotting we have used geom_jitter instead of geom_point, with the transparency set to 0.0000000005, to observe the distribution a bit more clearly.

## Analysis of Variance Table
## 
## Response: citric.acid
##             Df Sum Sq Mean Sq F value    Pr(>F)    
## quality      1  3.107 3.10747  86.258 < 2.2e-16 ***
## Residuals 1597 57.533 0.03603                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Observation: As p < 0.05 we reject the null hypothesis and accept the alternate hypothesis. Therefore, there is a statistically significant relationship between citric.acid and the quality of a wine.

Multivariate Plots Section

In the multivariate analysis we will try to evaluate the relationship among different attributes in our dataframe and do some data exploration by comparing more than 2 attributes at a time. Since we have reduced our scope to 3 attributes (alcohol, sulphates and citric.acid), we will try to observe what relationship these variables share with each other along with the quality attribute in our dataframe.

Multivariate Analysis

Comparing Citric.Acid vs Sulphates By Quality

Observation: In the above scatterplot visualization we can infer that citric.acid has a positive linear relationship with the sulphates attribute (correlation coefficient 0.31 in the matrix above). We have used the lm function to fit a linear model. The linear regression line is plotted with a 95% confidence interval.
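The regression line and its 95% confidence band can also be computed directly with `lm()` and `predict()`. In the sketch below the choice of sulphates as the response and citric.acid as the predictor is an assumption, and `wine_head` again holds only the first six rows from the head output, so the fitted values are purely illustrative.

```r
# First six rows from the head output above (illustration only).
wine_head <- data.frame(
  citric.acid = c(0.00, 0.00, 0.04, 0.56, 0.00, 0.00),
  sulphates   = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56)
)

# Simple linear model: sulphates as a linear function of citric.acid.
fit <- lm(sulphates ~ citric.acid, data = wine_head)
print(coef(fit))

# Fitted line with a 95% confidence band on a small grid of x values.
new_x <- data.frame(citric.acid = seq(0, 0.75, by = 0.25))
band <- predict(fit, newdata = new_x, interval = "confidence", level = 0.95)
print(cbind(new_x, band))
```

The `lwr` and `upr` columns correspond to the shaded band that a plot draws around a fitted line when a 95% confidence interval is requested.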

Comparing Alcohol vs Citric.Acid By Quality

Observation: In the above scatterplot visualization we can infer that alcohol has a positive linear relationship with the citric.acid attribute, but the relationship is not strong. We have used the lm function to fit a linear model. The linear regression line is plotted with a 95% confidence interval.

Comparing Sulphates vs Alcohol By Quality

Observation: In the above scatterplot visualization we can infer that sulphates has a positive linear relationship with the alcohol attribute, but the relationship is not strong. We have used the lm function to fit a linear model. The linear regression line is plotted with a 95% confidence interval.

Comparing Citric.acid vs Sulphates vs Alcohol By Quality

Observation: In the above scatterplot visualization we can infer that sulphates has a positive linear relationship with the citric.acid attribute. We have used the lm function to fit a linear model. The linear regression line is plotted with a 95% confidence interval. We can conclude that most red wines with high alcohol content and high quality have sulphates between 0.5 and 1 and citric.acid between 0 and 0.75.

Final Plots and Summary

In this section I have shortlisted three visualizations of the highest significance which helped me analyze the chemical properties which affect the quality of red wine.

Plot One

Description One

Improvements made in the visualization:
1. A title was added to the visualization and title alignment was appropriately set.

Reason for selection of the visualization:
1. The correlation plot helped to reduce the scope of analysis by providing us the 3 variables which had the strongest positive relationship with the quality attribute.

Plot Two

Description Two

Improvements made in the visualization:
1. A title was added to the visualization and title alignment was appropriately set.
2. Appropriate axis labels were included for both the x and y axis.

Reason for selection of the visualization:
1. After the correlation plot I wanted to understand the relationships shared among the three variables. This visualization helped me to understand that a positive linear relationship exists between citric.acid and sulphates. Therefore if we layer other attributes on top of this visualization it will be easier to understand which chemical factors affect the quality of red wine, and to what extent.

Plot Three

Description Three

Improvements made in the visualization:
1. A title was added to the visualization and title alignment was appropriately set.
2. Appropriate axis labels were included for both the x and y axis.

Reason for selection of the visualization:
1. After understanding that citric acid and sulphates have a positive linear relationship I wanted to include the other attributes of interest. Since alcohol is a chemical property of interest, we mapped it to point size. This helped us to conclude that most red wines with high alcohol content and high quality have sulphates between 0.5 and 1 (g/dm^3) and citric.acid between 0 and 0.75 (g/dm^3).

Reflection

What went well? Since the dataset was clean and without missing values it was easy to start the data exploration piece without any significant cleaning.

What was surprising? Before starting the data exploration, with my limited understanding of wines, I thought residual.sugar along with alcohol could be the attributes of interest as far as quality of wine is concerned. While evaluating the data it became clear that residual.sugar does not have a strong correlation with quality. The second surprise was that alcohol had the strongest correlation coefficient with the quality of red wine.

Future scope: Machine learning algorithms could be applied to the dataset. For example, outlier detection algorithms could be used to detect the few excellent or poor wines. We could also use a classification algorithm to classify excellent and poor wines.


Red Wine Quality Analysis

Author : Animesh Chowdhury
