Dataset Overview: This tidy data set contains 1,599 red wines with 11 variables on the chemical properties of the wine. The inputs are objective tests (e.g. pH values) and the output is based on sensory data: at least 3 wine experts rated the quality of each wine on a scale from 0 (very bad) to 10 (very excellent), and the median of their evaluations is used.
Variables: Input variables (based on physicochemical tests): fixed acidity (tartaric acid - g / dm^3), volatile acidity (acetic acid - g / dm^3), citric acid (g / dm^3), residual sugar (g / dm^3), chlorides (sodium chloride - g / dm^3), free sulfur dioxide (mg / dm^3), total sulfur dioxide (mg / dm^3), density (g / cm^3), pH, sulphates (potassium sulphate - g / dm^3), alcohol (% by volume).
Output variable (based on sensory data): quality (score between 0 and 10)
Scope of Analysis: In this analysis we will try to evaluate the chemical properties which affect the quality of red wine.
Checking first few rows of the dataset
## ID fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 1 7.4 0.70 0.00 1.9 0.076
## 2 2 7.8 0.88 0.00 2.6 0.098
## 3 3 7.8 0.76 0.04 2.3 0.092
## 4 4 11.2 0.28 0.56 1.9 0.075
## 5 5 7.4 0.70 0.00 1.9 0.076
## 6 6 7.4 0.66 0.00 1.8 0.075
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol
## 1 11 34 0.9978 3.51 0.56 9.4
## 2 25 67 0.9968 3.20 0.68 9.8
## 3 15 54 0.9970 3.26 0.65 9.8
## 4 17 60 0.9980 3.16 0.58 9.8
## 5 11 34 0.9978 3.51 0.56 9.4
## 6 13 40 0.9978 3.51 0.56 9.4
## quality
## 1 5
## 2 5
## 3 5
## 4 6
## 5 5
## 6 5
The data file summary
## ID fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
Observation: The summary output reports no NA's count for any variable, so there are no missing values in the dataset.
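The missing-value check can also be made explicit by counting NA values per column. The following is a minimal sketch, assuming the data frame is called `wine` (a hypothetical name); a small stand-in built from the first rows shown above is used in place of the full file.

```r
# Stand-in data frame using values from the first rows shown above;
# `wine` is an assumed name for the full data set.
wine <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2),
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28),
  quality          = c(5, 5, 5, 6)
)

# Count missing values in each column; all zeros means no NAs.
na_counts <- colSums(is.na(wine))
print(na_counts)

# Single TRUE/FALSE answer for the whole data frame.
has_missing <- anyNA(wine)
print(has_missing)

# summary() drops NAs before computing the mean and reports them in a
# separate "NA's" entry, so that entry is the thing to look for.
s <- summary(c(7.4, NA, 7.8))
print(s)
```

If any column had missing values, `na_counts` would show a positive count for it and `summary()` would add an `NA's` entry for that column.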
In the univariate analysis we will try to evaluate individual attributes in our dataframe and do some initial data exploration by individual attribute.
Plotting the pH variable
Observation: The above bar chart shows the distribution of the pH variable. The break points in the distribution indicate that it is not uniform across all values, and the data does not appear to be normally distributed.
Transforming the data to make a normal distribution
Summary before the transformation
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.740 3.210 3.310 3.311 3.400 4.010
Summary after the transformation
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.008 1.166 1.197 1.196 1.224 1.389
Observation: From the above visual and summary statistics we can infer that the transformed distribution is approximately normal. After the log transformation the range on the x axis has shrunk from [2.74, 4.01] to [1.01, 1.39].
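The transformed summary is consistent with a natural logarithm: log(2.74) is about 1.008 and log(4.01) is about 1.389. The sketch below reproduces that check and then applies the same transformation to a simulated stand-in for the pH column (the simulation parameters are assumptions, not values from the data set). The Shapiro-Wilk test is added here as a numeric complement to the visual check; it is not part of the original analysis.

```r
# The endpoints of the pH summary reproduce under the natural logarithm.
ph_range <- c(2.74, 4.01)
log_range <- round(log(ph_range), 3)
print(log_range)  # 1.008 1.389

# Simulated stand-in for the pH column (assumption: the real column is not
# available here); the mean follows the summary, the sd is a rough guess.
set.seed(1)
ph <- rnorm(500, mean = 3.31, sd = 0.15)
log_ph <- log(ph)

# Compare the summaries before and after the transformation.
print(summary(ph))
print(summary(log_ph))

# Shapiro-Wilk normality test; a large p-value is compatible with normality.
sw <- shapiro.test(log_ph)
print(sw$p.value)
```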
Plotting fixed.acidity variable
Observation: The above bar chart shows the distribution of the fixed.acidity variable. The break points in the distribution indicate that it is not uniform across all values, and the data is not normally distributed. In fact, the bar chart shows that the distribution is right skewed.
Transforming the data into a normal distribution
Summary before the transformation
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.60 7.10 7.90 8.32 9.20 15.90
Summary after the transformation
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.526 1.960 2.067 2.098 2.219 2.766
Observation: From the above visual and summary statistics we can infer that the transformed distribution is approximately normal. After the log transformation the range on the x axis has shrunk from [4.6, 15.9] to [1.53, 2.77].
Plotting the volatile.acidity variable
Observation: The above bar chart shows the distribution of the volatile.acidity variable. The break points in the distribution indicate that it is not uniform across all values, and the data is not normally distributed. In fact, the bar chart shows that the distribution is right skewed.
Transforming the data into a normal distribution
Summary before the transformation
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3900 0.5200 0.5278 0.6400 1.5800
Summary after the transformation
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -2.1200 -0.9416 -0.6539 -0.6985 -0.4463 0.4574
Observation: From the above visual and summary statistics we can infer that the transformed distribution is approximately normal. After the log transformation the range on the x axis has changed from [0.12, 1.58] to [-2.12, 0.46].
Plotting the citric.acid variable
Observation: The above bar chart shows the distribution of the citric.acid variable. The break points in the distribution indicate that it is not uniform across all values, and the data is not normally distributed. In fact, the bar chart shows that the distribution is right skewed.
Data transformation
Summarizing the data before transformation
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.090 0.260 0.271 0.420 1.000
Summarizing the data after transformation
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -Inf -2.4080 -1.3470 -Inf -0.8675 0.0000
Observation: From the above visual and summary statistics we can infer that the distribution is still not normal. Because citric.acid contains zeros and log(0) is -Inf, the transformed range runs from -Inf to 0 and the mean is also -Inf. An alternative transformation is needed to bring this attribute closer to a normal distribution.
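The -Inf values come from the zeros in citric.acid. The sketch below uses the citric.acid values from the first six rows shown earlier to reproduce the problem, and shows `log1p()` (which computes log(1 + x)) as one possible alternative. This is an illustration of one option, not the transformation the analysis settled on.

```r
# Values taken from the citric.acid column in the first rows shown above.
citric <- c(0.00, 0.00, 0.04, 0.56, 0.00, 0.00)

# log(0) is -Inf, which then propagates into the mean.
plain_log <- log(citric)
print(mean(plain_log))          # -Inf

# log1p(x) computes log(1 + x), which is defined (and equal to 0) at x = 0.
shifted_log <- log1p(citric)
print(summary(shifted_log))

# Number of exact zeros that cause the problem.
n_zero <- sum(citric == 0)
print(n_zero)                   # 4
```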
Plotting the residual.sugar variable
Observation: The above bar chart shows the distribution of the residual.sugar variable. The break points in the distribution indicate that it is not uniform across all values, and the data is not normally distributed. In fact, the bar chart shows that the distribution is right skewed.
Transforming the data to a normal distribution
Summarizing the data before data transformation
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.900 2.200 2.539 2.600 15.500
Summarizing the data after data transformation
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.1054 0.6419 0.7885 0.8502 0.9555 2.7410
Observation: From the above visual and summary statistics we can infer that the distribution is still not normal. After the log transformation the range on the x axis has changed from [0.9, 15.5] to [-0.11, 2.74]. An alternative transformation is needed to bring this attribute closer to a normal distribution.
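One way to look for an alternative transformation is to compare the sample skewness under a few candidates. The sketch below does this on simulated right-skewed data; the data, the distribution parameters, and the `skewness` helper are all assumptions made for illustration, so the numbers say nothing about the real residual.sugar column.

```r
# Moment-based sample skewness (hypothetical helper, not from a package).
skewness <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / (mean(centered^2)^1.5)
}

# Simulated right-skewed stand-in for residual.sugar (assumption: the real
# column is not available here).
set.seed(42)
sugar <- rlnorm(1000, meanlog = 0.85, sdlog = 0.35) + rexp(1000, rate = 2)

# Skewness under a few candidate transformations; closer to 0 is more symmetric.
candidates <- c(
  raw        = skewness(sugar),
  log        = skewness(log(sugar)),
  sqrt       = skewness(sqrt(sugar)),
  reciprocal = skewness(1 / sugar)
)
print(round(candidates, 3))
```

The candidate with skewness closest to zero is the most symmetric, which is a necessary but not sufficient condition for normality.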
Plotting the chlorides variable
Observation: The above bar chart shows the distribution of the chlorides variable. The break points in the distribution indicate that it is not uniform across all values, and the data is not normally distributed. In fact, the bar chart shows that the distribution is right skewed.
Transforming to a normal distribution
Summarizing before the transformation
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100
Summarizing after the transformation
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -4.4230 -2.6590 -2.5380 -2.5050 -2.4080 -0.4927
Observation: From the above visual and summary statistics we can infer that the transformed distribution is approximately normal. After the log transformation the range on the x axis has changed from [0.012, 0.611] to [-4.42, -0.49].
Plotting the free.sulfur.dioxide variable
Observation: The above bar chart shows the distribution of the free.sulfur.dioxide variable. The break points in the distribution indicate that it is not uniform across all values, and the data is not normally distributed. In fact, the bar chart shows that the distribution is right skewed.
Plotting after the transformation to a normal distribution
Summarizing before the transformation
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 7.00 14.00 15.87 21.00 72.00
Summarizing after the transformation
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 1.946 2.639 2.546 3.045 4.277
Observation: From the above visual and summary statistics we can infer that the distribution is still not normal. After the log transformation the range on the x axis has shrunk from [1, 72] to [0, 4.28]. An alternative transformation is needed to bring this attribute closer to a normal distribution.
Plotting total.sulfur.dioxide variable
Observation: The above bar chart shows the distribution of the total.sulfur.dioxide variable. The break points in the distribution indicate that it is not uniform across all values, and the data is not normally distributed. In fact, the bar chart shows that the distribution is right skewed.
Plotting total.sulfur.dioxide after the data transformation
Summarizing the data before transformation
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 22.00 38.00 46.47 62.00 289.00
Summarizing after the transformation
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.792 3.091 3.638 3.601 4.127 5.666
Observation: From the above visual and summary statistics we can infer that the transformed distribution is approximately normal. After the log transformation the range on the x axis has shrunk from [6, 289] to [1.79, 5.67].
Plotting density variable
Observation: The above bar chart shows the distribution of the density variable. The break points in the distribution indicate that it is not uniform across all values, and the data does not appear to be normally distributed.
Plotting Density after the data transformation
Summarizing density before transformation
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9901 0.9956 0.9968 0.9967 0.9978 1.0040
Summarizing after the transformation
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.009980 -0.004410 -0.003255 -0.003260 -0.002167 0.003683
Observation: From the above visual and summary statistics we can infer that the transformed distribution is approximately normal. After the log transformation the range on the x axis has changed from [0.9901, 1.0040] to [-0.0100, 0.0037].
Plotting sulphates variable
Observation: The above bar chart shows the distribution of the sulphates variable. The break points in the distribution indicate that it is not uniform across all values, and the data is not normally distributed. In fact, the distribution is right skewed.
Plotting sulphates variable after the transformation
Summarizing before transformation
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.5500 0.6200 0.6581 0.7300 2.0000
Summarizing after the transformation
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.1090 -0.5978 -0.4780 -0.4453 -0.3147 0.6931
Observation: From the above visual and summary statistics we can infer that the distribution is still not normal. After the log transformation the range on the x axis has changed from [0.33, 2] to [-1.11, 0.69]. An alternative transformation is needed to bring this attribute closer to a normal distribution.
Plotting alcohol variable
Observation: The above bar chart shows the distribution of the alcohol variable. The break points in the distribution indicate that it is not uniform across all values, and the data is not normally distributed. In fact, the distribution is right skewed.
Plotting alcohol variable after the transformation
Summarizing alcohol variable before the transformation
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.20 10.42 11.10 14.90
Summarizing alcohol variable after the transformation
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.128 2.251 2.322 2.339 2.407 2.701
Observation: From the above visual and summary statistics we can infer that the distribution is still not normal. After the log transformation the range on the x axis has shrunk from [8.4, 14.9] to [2.13, 2.70]. An alternative transformation is needed to bring this attribute closer to a normal distribution.
Plotting quality variable
Observation: The above bar chart shows the distribution of the quality variable. There are no break points between the observed scores, but the summary below shows that the scores are concentrated around 5 and 6 rather than spread uniformly. Since quality is ordinal data, we won’t perform any log transformation for the quality attribute.
Summarizing the quality variable
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.636 6.000 8.000
Observation: From the above visual and summary statistics we can see that the observed quality scores range from 3 to 8 with a median of 6. Because quality is ordinal, no log transformation is applied to it.
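For an ordinal score such as quality, a frequency table is often more informative than a transformation. A minimal sketch, using only the quality values from the first six rows shown earlier as a stand-in for the full column:

```r
# Quality scores from the first six rows shown above (stand-in for the full column).
quality <- c(5, 5, 5, 6, 5, 5)

# Treat quality as an ordered factor over the observed range 3 to 8.
quality_f <- factor(quality, levels = 3:8, ordered = TRUE)

# Counts per score, including scores with zero wines in this small sample.
counts <- table(quality_f)
print(counts)

# The median is a suitable summary for ordinal data.
median_score <- median(quality)
print(median_score)  # 5
```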
In the bivariate analysis we will try to evaluate the relationship among different attributes in our dataframe and do some data exploration by comparing two attributes.
Correlation Matrix
## fixed.acidity volatile.acidity citric.acid
## fixed.acidity 1.00 -0.26 0.67
## volatile.acidity -0.26 1.00 -0.55
## citric.acid 0.67 -0.55 1.00
## residual.sugar 0.11 0.00 0.14
## chlorides 0.09 0.06 0.20
## free.sulfur.dioxide -0.15 -0.01 -0.06
## total.sulfur.dioxide -0.11 0.08 0.04
## density 0.67 0.02 0.36
## pH -0.68 0.23 -0.54
## sulphates 0.18 -0.26 0.31
## alcohol -0.06 -0.20 0.11
## quality 0.12 -0.39 0.23
## residual.sugar chlorides free.sulfur.dioxide
## fixed.acidity 0.11 0.09 -0.15
## volatile.acidity 0.00 0.06 -0.01
## citric.acid 0.14 0.20 -0.06
## residual.sugar 1.00 0.06 0.19
## chlorides 0.06 1.00 0.01
## free.sulfur.dioxide 0.19 0.01 1.00
## total.sulfur.dioxide 0.20 0.05 0.67
## density 0.36 0.20 -0.02
## pH -0.09 -0.27 0.07
## sulphates 0.01 0.37 0.05
## alcohol 0.04 -0.22 -0.07
## quality 0.01 -0.13 -0.05
## total.sulfur.dioxide density pH sulphates alcohol
## fixed.acidity -0.11 0.67 -0.68 0.18 -0.06
## volatile.acidity 0.08 0.02 0.23 -0.26 -0.20
## citric.acid 0.04 0.36 -0.54 0.31 0.11
## residual.sugar 0.20 0.36 -0.09 0.01 0.04
## chlorides 0.05 0.20 -0.27 0.37 -0.22
## free.sulfur.dioxide 0.67 -0.02 0.07 0.05 -0.07
## total.sulfur.dioxide 1.00 0.07 -0.07 0.04 -0.21
## density 0.07 1.00 -0.34 0.15 -0.50
## pH -0.07 -0.34 1.00 -0.20 0.21
## sulphates 0.04 0.15 -0.20 1.00 0.09
## alcohol -0.21 -0.50 0.21 0.09 1.00
## quality -0.19 -0.17 -0.06 0.25 0.48
## quality
## fixed.acidity 0.12
## volatile.acidity -0.39
## citric.acid 0.23
## residual.sugar 0.01
## chlorides -0.13
## free.sulfur.dioxide -0.05
## total.sulfur.dioxide -0.19
## density -0.17
## pH -0.06
## sulphates 0.25
## alcohol 0.48
## quality 1.00
Correlation Plot
Observation: From the above visualization we can infer that quality is positively correlated with alcohol, sulphates and citric.acid. Among these 3 chemical properties, the correlation coefficient with quality is highest for alcohol at 0.48, followed by sulphates at 0.25 and citric.acid at 0.23. None of these coefficients indicates a strong relationship (a strong positive correlation usually has a coefficient of 0.5 or above), but among the available attributes these 3 variables are the most positively correlated with the quality of red wine. Therefore we narrow the scope of the analysis and explore these 3 attributes further in relation to the quality of red wine.
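A correlation matrix like the one above can be produced with `cor()` and rounded to two decimals. The sketch below runs on a six-row stand-in taken from the first rows shown earlier, so its coefficients will not match the full-data matrix; `wine` is an assumed name for the data frame.

```r
# Small stand-in built from the first six rows shown above; `wine` is an
# assumed name for the full data frame.
wine <- data.frame(
  alcohol   = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4),
  sulphates = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56),
  pH        = c(3.51, 3.20, 3.26, 3.16, 3.51, 3.51),
  quality   = c(5, 5, 5, 6, 5, 5)
)

# Pearson correlation matrix, rounded to two decimals as in the output above.
cor_matrix <- round(cor(wine), 2)
print(cor_matrix)

# Correlation of every input with quality, sorted from strongest positive
# to strongest negative.
with_quality <- cor_matrix[, "quality"]
with_quality <- sort(with_quality[names(with_quality) != "quality"], decreasing = TRUE)
print(with_quality)
```

Sorting the quality column this way makes it easy to read off both the most positive and the most negative coefficients.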
Anova Analysis: ANOVA compares the mean of the attribute under consideration across wines of different quality and checks whether the differences are statistically significant.
Anova Analysis between alcohol and quality: The null and alternative hypotheses are as follows.
Null Hypothesis (H0): There is no relationship between alcohol and quality of a wine.
Alternate Hypothesis (H1): There is a relationship between alcohol and quality of a wine.
Observation: To reduce overplotting we used geom_jitter instead of geom_point, with a transparency set to 0.0000000005, so that the distribution can be observed more clearly.
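Because quality is an integer score, a plain scatterplot stacks many points on the same x position. The sketch below shows the jitter-plus-transparency idea on simulated data. The data frame, the jitter width, and the alpha value of 0.3 are illustrative assumptions and are not the settings used for the plot above; the plot is only built if ggplot2 is installed.

```r
# Stand-in data: quality is an integer score, so many points share the same x value.
set.seed(7)
wine <- data.frame(
  quality = sample(3:8, 300, replace = TRUE),
  alcohol = runif(300, min = 8.4, max = 14.9)
)

# Number of distinct x positions without jitter: at most one per quality score.
n_positions <- length(unique(wine$quality))
print(n_positions)

# Build the plot only if ggplot2 is installed.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(wine, ggplot2::aes(x = quality, y = log(alcohol))) +
    # Horizontal jitter spreads points around each score; alpha adds transparency.
    ggplot2::geom_jitter(width = 0.25, height = 0, alpha = 0.3)
}
```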
## Analysis of Variance Table
##
## Response: log(alcohol)
## Df Sum Sq Mean Sq F value Pr(>F)
## quality 1 3.556 3.5560 470.34 < 2.2e-16 ***
## Residuals 1597 12.074 0.0076
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Observation: As the p value is less than 0.05, we reject the null hypothesis; there is a statistically significant relationship between alcohol and the quality of wine.
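The `Response: log(alcohol)` line and the single degree of freedom for `quality` in the table above suggest a call of the form `anova(lm(log(alcohol) ~ quality))` with quality treated as a number. This is an inference from the output, not the confirmed original code. The sketch below runs that form on simulated data (all parameters are assumptions) and contrasts it with treating quality as a factor, which compares group means.

```r
# Simulated stand-in (assumption): alcohol rises slightly with quality.
set.seed(123)
quality <- sample(3:8, 400, replace = TRUE)
alcohol <- exp(2.1 + 0.04 * quality + rnorm(400, sd = 0.09))
wine <- data.frame(quality = quality, alcohol = alcohol)

# Numeric quality: a single slope is tested, so quality has 1 degree of freedom.
fit_numeric <- lm(log(alcohol) ~ quality, data = wine)
tab_numeric <- anova(fit_numeric)
print(tab_numeric)

# Quality as a factor: group means are compared (classic one-way ANOVA),
# so quality has (number of observed levels - 1) degrees of freedom.
fit_factor <- lm(log(alcohol) ~ factor(quality), data = wine)
tab_factor <- anova(fit_factor)
print(tab_factor)

# p-value for the quality term in the numeric model.
p_value <- tab_numeric[["Pr(>F)"]][1]
print(p_value < 0.05)
```

With numeric quality the F test asks whether a linear trend exists; with `factor(quality)` it asks whether any group mean differs, which is closer to the textbook description of ANOVA.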
Anova Analysis between sulphates and quality: The null and alternative hypotheses are as follows.
Null Hypothesis (H0): There is no relationship between sulphates and quality of a wine.
Alternate Hypothesis (H1): There is a relationship between sulphates and quality of a wine.
Observation: To reduce overplotting we used geom_jitter instead of geom_point, with a transparency set to 0.0000000005, so that the distribution can be observed more clearly.
## Analysis of Variance Table
##
## Response: log(sulphates)
## Df Sum Sq Mean Sq F value Pr(>F)
## quality 1 7.609 7.6085 168.15 < 2.2e-16 ***
## Residuals 1597 72.263 0.0452
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Observation: As p < 0.05, we reject the null hypothesis; there is a statistically significant relationship between sulphates and the quality of a wine.
Anova Analysis between citric.acid and quality: The null and alternative hypotheses are as follows.
Null Hypothesis (H0): There is no relationship between citric.acid and quality of a wine.
Alternate Hypothesis (H1): There is a relationship between citric.acid and quality of a wine.
Observation: To reduce overplotting we used geom_jitter instead of geom_point, with a transparency set to 0.0000000005, so that the distribution can be observed more clearly.
## Analysis of Variance Table
##
## Response: citric.acid
## Df Sum Sq Mean Sq F value Pr(>F)
## quality 1 3.107 3.10747 86.258 < 2.2e-16 ***
## Residuals 1597 57.533 0.03603
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Observation: As p < 0.05, we reject the null hypothesis in favor of the alternate hypothesis. Therefore, there is a statistically significant relationship between citric.acid and the quality of a wine.
In the multivariate analysis we will try to evaluate the relationships among different attributes in our dataframe and do some data exploration by comparing more than 2 attributes at a time. Since we have reduced our scope to 3 attributes (alcohol, sulphates and citric.acid), we will try to observe what relationships these variables share with each other and with the quality attribute in our dataframe.
Comparing Citric.Acid vs Sulphates By Quality
Observation: In the above scatterplot we can see that citric.acid has a positive linear relationship with the sulphates attribute (correlation coefficient 0.31 in the matrix above), the strongest among the three shortlisted attributes. We have used the lm function to fit linear models. The linear regression line is plotted with a 95% confidence interval.
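The fitted line and its 95% confidence interval can also be inspected numerically with `lm()` and `predict()`. The sketch below uses simulated citric.acid and sulphates values (the slope, intercept and noise level are assumptions), so the coefficients are illustrative only. If the plots were drawn with ggplot2's `geom_smooth(method = "lm")`, which is an assumption here, the shaded band is by default this same 95% confidence interval for the mean response.

```r
# Simulated stand-in (assumption) for the two columns.
set.seed(2024)
citric.acid <- runif(200, min = 0, max = 1)
sulphates <- 0.55 + 0.25 * citric.acid + rnorm(200, sd = 0.15)
wine <- data.frame(citric.acid = citric.acid, sulphates = sulphates)

# Fit a straight line of sulphates on citric.acid.
fit <- lm(sulphates ~ citric.acid, data = wine)
print(coef(fit))

# Fitted line with a 95% confidence band for the mean response.
grid <- data.frame(citric.acid = seq(0, 1, by = 0.25))
band <- predict(fit, newdata = grid, interval = "confidence", level = 0.95)
print(cbind(grid, band))

# Strength of the linear relationship.
r <- cor(wine$citric.acid, wine$sulphates)
print(r)
```

The `lwr` and `upr` columns bound the estimated mean of sulphates at each citric.acid value; they do not bound individual wines.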
Comparing Alcohol vs Citric.Acid By Quality
Observation: In the above scatterplot we can see that alcohol has a positive linear relationship with the citric.acid attribute, but the relationship is weak (correlation coefficient 0.11 in the matrix above). We have used the lm function to fit linear models. The linear regression line is plotted with a 95% confidence interval.
Comparing Sulphates vs Alcohol By Quality
Observation: In the above scatterplot we can see that sulphates has a positive linear relationship with the alcohol attribute, but the relationship is weak (correlation coefficient 0.09 in the matrix above). We have used the lm function to fit linear models. The linear regression line is plotted with a 95% confidence interval.
Comparing Citric.acid vs Sulphates vs Alcohol By Quality
Observation: In the above scatterplot we can see that sulphates has a positive linear relationship with the citric.acid attribute. We have used the lm function to fit linear models. The linear regression line is plotted with a 95% confidence interval. We can conclude that most red wines with high alcohol content and high quality fall in the sulphates range 0.5-1 and the citric.acid range 0-0.75.
In this section I have shortlisted three visualizations of the highest significance which helped me analyze the chemical properties which affect the quality of red wine.
Improvements made in the visualization:
1. A title was added to the visualization and title alignment was appropriately set.
Reason for selection of the visualization:
1. The correlation plot helped to reduce the scope of analysis by identifying the 3 variables with the strongest positive correlation with the quality attribute.
Improvements made in the visualization:
1. A title was added to the visualization and title alignment was appropriately set.
2. Appropriate axis labels were included for both the x and y axes.
Reason for selection of the visualization:
1. After the correlation plot I wanted to understand the relationships shared among the three variables. This visualization helped me to see that citric.acid and sulphates share the clearest positive linear relationship of the three pairs. Therefore, if we layer other attributes on top of this visualization, it will be easier to understand which chemical factors affect the quality of red wine, and to what extent.
Improvements made in the visualization:
1. A title was added to the visualization and title alignment was appropriately set.
2. Appropriate axis labels were included for both the x and y axes.
Reason for selection of the visualization:
1. After seeing that citric.acid and sulphates have a positive linear relationship, I wanted to include the other attributes of interest. Since alcohol is a chemical property of interest, it was mapped to point size. This helped us to conclude that most red wines with high alcohol content and high quality fall in the sulphates range 0.5-1 (g/dm^3) and the citric.acid range 0-0.75 (g/dm^3).
What went well? Since the dataset was clean and had no missing values, it was easy to start the data exploration piece without any significant cleaning.
What was surprising? Before starting the data exploration, with my limited understanding of wines, I thought residual.sugar along with alcohol would be the attributes of interest as far as wine quality is concerned. While evaluating the data it became clear that residual.sugar does not have a strong correlation with quality. The second surprise was that alcohol had the strongest correlation coefficient with the quality of red wine.
Future scope: Machine learning algorithms could be applied to the dataset. For example, outlier detection algorithms could be used to detect the few excellent or poor wines, and a classification algorithm could be used to classify excellent and poor wines.