Introduction:
For this study I will analyze a red wine dataset created by Paulo Cortez (Univ. Minho), Antonio Cerdeira, Fernando Almeida, Telmo Matos and Jose Reis (CVRVV) in 2009. The dataset contains eleven input variables (fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, and alcohol) and one output variable, quality.
Descriptions of the variables are below:
1 - Fixed Acidity: Most acids involved with wine are fixed or nonvolatile (do not evaporate readily) (tartaric acid - g/dm^3)
2 - Volatile Acidity: The amount of acetic acid in wine, which at high levels can lead to an unpleasant, vinegar taste. (acetic acid - g/dm^3)
3 - Citric Acid: Found in small quantities, citric acid can add ‘freshness’ and flavor to wines. (g/dm^3)
4 - Residual Sugar: The amount of sugar remaining after fermentation stops. It is rare to find wines with less than 1 gram/liter, and wines with greater than 45 grams/liter are considered sweet. (g/dm^3)
5 - Chlorides: The amount of salt in the wine. (sodium chloride - g/dm^3)
6 - Free Sulfur Dioxide: The free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine. (mg/dm^3)
7 - Total Sulfur Dioxide: Amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine. (mg/dm^3)
8 - Density: The density of wine is close to that of water, depending on the percent alcohol and sugar content. (g/cm^3)
9 - pH: Describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale
10 - Sulphates: A wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant. (g/dm^3)
11 - Alcohol: The percent alcohol content of the wine
Output variable (based on sensory data):
12 - Quality (Score between 0 and 10)
The goal of the project is to explore the data and see what inferences can be drawn from how the variables interact with each other.
## [1] "Looking at structure of the data"
## 'data.frame': 1599 obs. of 12 variables:
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
This bar plot shows that most of the wines are rated either 5 or 6. The plot helped me see how the wine ratings are grouped together. For example, we can see that no wines are rated 1, 2, 9, or 10.
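The counts behind a bar plot like this one can be tabulated directly. The sketch below is illustrative only: it uses the first ten quality scores printed in the str() output above rather than the full 1599 rows, and the variable names are my own.

```r
# First ten quality scores, copied from the str() output above
# (a stand-in for the full 1599-row column).
quality <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# Fixing the levels to 0..10 keeps unused scores in the table with a
# count of zero, so gaps in the rating scale stay visible.
counts <- table(factor(quality, levels = 0:10))
print(counts)

# Scores that never occur in this sample.
unused <- names(counts)[as.vector(counts) == 0]
print(unused)

# barplot(counts, xlab = "quality", ylab = "count") would draw the chart.
```

Run on the full quality column, the same table would show at a glance which scores on the 0-10 scale are never used.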
## [1] Normal Normal Normal Normal Normal Normal Normal Good Good Normal
## Levels: Good < Normal < Poor
This new quality rating column categorizes a rating of 1-4 as Poor, 5-6 as Normal, and 7-10 as Good.
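One way to build such a column is with cut(). This is a minimal sketch, not necessarily the code used for the report, and it is shown on the first ten quality scores from the str() output. The levels printed above are in alphabetical order (Good < Normal < Poor); passing the labels in ranked order, as assumed here, gives Poor < Normal < Good instead.

```r
# First ten quality scores, copied from the str() output above.
quality <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# cut() uses right-closed intervals by default: (0,4], (4,6], (6,10].
# Labels are listed from worst to best so the ordered factor ranks them
# as Poor < Normal < Good.
quality_rating <- cut(
  quality,
  breaks = c(0, 4, 6, 10),
  labels = c("Poor", "Normal", "Good"),
  ordered_result = TRUE
)

print(quality_rating)
```

An ordered factor with the levels in this order keeps Poor, Normal, and Good in a sensible sequence on plot axes and in summary tables.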
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 4.60 Min. :0.1200 Min. :0.000 Min. : 0.900
## 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090 1st Qu.: 1.900
## Median : 7.90 Median :0.5200 Median :0.260 Median : 2.200
## Mean : 8.32 Mean :0.5278 Mean :0.271 Mean : 2.539
## 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420 3rd Qu.: 2.600
## Max. :15.90 Max. :1.5800 Max. :1.000 Max. :15.500
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.01200 Min. : 1.00 Min. : 6.00
## 1st Qu.:0.07000 1st Qu.: 7.00 1st Qu.: 22.00
## Median :0.07900 Median :14.00 Median : 38.00
## Mean :0.08747 Mean :15.87 Mean : 46.47
## 3rd Qu.:0.09000 3rd Qu.:21.00 3rd Qu.: 62.00
## Max. :0.61100 Max. :72.00 Max. :289.00
## density pH sulphates alcohol
## Min. :0.9901 Min. :2.740 Min. :0.3300 Min. : 8.40
## 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500 1st Qu.: 9.50
## Median :0.9968 Median :3.310 Median :0.6200 Median :10.20
## Mean :0.9967 Mean :3.311 Mean :0.6581 Mean :10.42
## 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300 3rd Qu.:11.10
## Max. :1.0037 Max. :4.010 Max. :2.0000 Max. :14.90
## quality quality.rating
## Min. :3.000 Good : 217
## 1st Qu.:5.000 Normal:1319
## Median :6.000 Poor : 63
## Mean :5.636
## 3rd Qu.:6.000
## Max. :8.000
Most of the variables have a similar median and mean, which leads me to believe that their distributions should be roughly symmetrical.
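A quick way to check this is to compare the mean and median of each variable directly. The sketch below copies a handful of values from the summary() output above; the data frame and column names are my own. It suggests the symmetry assumption holds well for density, pH, and alcohol, but less so for residual sugar, chlorides, and total sulfur dioxide, where the mean sits noticeably above the median.

```r
# Median and mean for a few variables, copied from the summary() output above.
stats <- data.frame(
  variable = c("residual.sugar", "chlorides", "total.sulfur.dioxide",
               "density", "pH", "alcohol"),
  median = c(2.200, 0.07900, 38.00, 0.9968, 3.310, 10.20),
  mean   = c(2.539, 0.08747, 46.47, 0.9967, 3.311, 10.42)
)

# Relative gap between mean and median, in percent. A mean well above the
# median hints at a long right tail rather than a symmetric distribution.
stats$gap_pct <- round(100 * (stats$mean - stats$median) / stats$median, 1)

# Largest gaps first.
stats <- stats[order(-abs(stats$gap_pct)), ]
print(stats, row.names = FALSE)
```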
I wanted to view histograms of all the variables to help me determine which ones may be good to analyze. During the process I determined that some of the variables needed to be transformed to better approximate a normal distribution.
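The report does not say which transformation was applied, so the following is only a sketch under the assumption that a log10 transform is used, which is a common choice for right-skewed measurements. It uses the first ten residual.sugar values from the str() output, and the skewness helper is a hypothetical function written for this example.

```r
# First ten residual.sugar values, copied from the str() output above.
residual_sugar <- c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2.0, 6.1)

# Simple moment-based skewness: 0 for a symmetric sample, positive when
# the right tail is longer. (Hypothetical helper, not from the report.)
skewness <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

skew_raw <- skewness(residual_sugar)
skew_log <- skewness(log10(residual_sugar))

# The log-transformed values are less right-skewed than the raw ones.
print(c(raw = skew_raw, log10 = skew_log))

# hist(log10(residual_sugar)) would draw the transformed histogram.
```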
There are 1599 observations and 12 attributes in this data set; all of the variables are numeric.
Other observations include:
Most of the wines have a quality rating of 5 or 6 on the scale of 0-10. The middle half of the wines have a pH between 3.2 and 3.4.
The main feature of interest for me is quality. I would like to know what variables will likely lead to a better quality wine.
I would guess that sugar and alcohol content will be important in the investigation.
I created a new quality rating column that categorizes a rating of 1-4 as Poor, 5-6 as Normal, and 7-10 as Good.
I removed the first column, as it was an index column and was not needed.
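Dropping an index column might look like the sketch below. The CSV layout is an assumption (an unnamed leading column of row numbers), and the two sample rows use values from the str() output; the real file has many more columns.

```r
# Two rows in the shape I assume the CSV has: an unnamed leading index
# column followed by the measurements (values taken from the str() output).
csv_text <- '"","fixed.acidity","alcohol","quality"
"1",7.4,9.4,5
"2",7.8,9.8,5'

# read.csv() gives an unnamed column the name "X".
wine <- read.csv(text = csv_text)
print(names(wine))

# Drop the index column by name so the step is safe to re-run.
wine$X <- NULL
print(names(wine))
```

Removing the column by name rather than by position means running the step twice does not accidentally delete a real measurement.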
This is a graph of quality compared to alcohol. We can see an upward trend: as alcohol increases, quality increases. This falls in line with my assumption that quality increases with the percentage of alcohol.
This graph shows how, on average, the quality of the wine increases as alcohol increases. The Normal box should be higher, but since I combined ratings 5 and 6 there are now a lot of outliers, which are not showing in the box plot.
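The centre line of each box is the group median, which can be computed without plotting. This sketch uses only the first ten rows from the str() output, so the numbers are illustrative rather than the report's actual medians, and the cut() call is my assumption about how the rating was built.

```r
# First ten rows of alcohol and quality, copied from the str() output above.
alcohol <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10.0, 9.5, 10.5)
quality <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

rating <- cut(quality, breaks = c(0, 4, 6, 10),
              labels = c("Poor", "Normal", "Good"), ordered_result = TRUE)

# Median alcohol per rating: the centre line of each box in a box plot.
# A rating with no rows in this small sample comes back as NA.
median_alcohol <- tapply(alcohol, rating, median)
print(median_alcohol)

# boxplot(alcohol ~ rating) would draw the corresponding plot.
```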
## fixed.acidity volatile.acidity citric.acid
## fixed.acidity 1.0 -0.3 0.7
## volatile.acidity -0.3 1.0 -0.6
## citric.acid 0.7 -0.6 1.0
## residual.sugar 0.1 0.0 0.1
## chlorides 0.1 0.1 0.2
## free.sulfur.dioxide -0.2 0.0 -0.1
## total.sulfur.dioxide -0.1 0.1 0.0
## density 0.7 0.0 0.4
## pH -0.7 0.2 -0.5
## sulphates 0.2 -0.3 0.3
## alcohol -0.1 -0.2 0.1
## quality 0.1 -0.4 0.2
## residual.sugar chlorides free.sulfur.dioxide
## fixed.acidity 0.1 0.1 -0.2
## volatile.acidity 0.0 0.1 0.0
## citric.acid 0.1 0.2 -0.1
## residual.sugar 1.0 0.1 0.2
## chlorides 0.1 1.0 0.0
## free.sulfur.dioxide 0.2 0.0 1.0
## total.sulfur.dioxide 0.2 0.0 0.7
## density 0.4 0.2 0.0
## pH -0.1 -0.3 0.1
## sulphates 0.0 0.4 0.1
## alcohol 0.0 -0.2 -0.1
## quality 0.0 -0.1 -0.1
## total.sulfur.dioxide density pH sulphates alcohol
## fixed.acidity -0.1 0.7 -0.7 0.2 -0.1
## volatile.acidity 0.1 0.0 0.2 -0.3 -0.2
## citric.acid 0.0 0.4 -0.5 0.3 0.1
## residual.sugar 0.2 0.4 -0.1 0.0 0.0
## chlorides 0.0 0.2 -0.3 0.4 -0.2
## free.sulfur.dioxide 0.7 0.0 0.1 0.1 -0.1
## total.sulfur.dioxide 1.0 0.1 -0.1 0.0 -0.2
## density 0.1 1.0 -0.3 0.1 -0.5
## pH -0.1 -0.3 1.0 -0.2 0.2
## sulphates 0.0 0.1 -0.2 1.0 0.1
## alcohol -0.2 -0.5 0.2 0.1 1.0
## quality -0.2 -0.2 -0.1 0.3 0.5
## quality
## fixed.acidity 0.1
## volatile.acidity -0.4
## citric.acid 0.2
## residual.sugar 0.0
## chlorides -0.1
## free.sulfur.dioxide -0.1
## total.sulfur.dioxide -0.2
## density -0.2
## pH -0.1
## sulphates 0.3
## alcohol 0.5
## quality 1.0
The correlation matrix displays the correlation between each pair of variables. Since my variable of interest is quality, I am looking in the quality column for the value with the largest magnitude (other than quality itself), which is alcohol at 0.5.
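Reading the quality column can be made mechanical by ranking the correlations by absolute value. The values below are copied from the quality column of the matrix above; the vector name is my own.

```r
# Correlations with quality, copied from the quality column of the matrix above.
cor_with_quality <- c(
  fixed.acidity = 0.1, volatile.acidity = -0.4, citric.acid = 0.2,
  residual.sugar = 0.0, chlorides = -0.1, free.sulfur.dioxide = -0.1,
  total.sulfur.dioxide = -0.2, density = -0.2, pH = -0.1,
  sulphates = 0.3, alcohol = 0.5
)

# Rank by absolute value so strong negative relationships are not overlooked.
ranked <- cor_with_quality[order(abs(cor_with_quality), decreasing = TRUE)]
print(head(ranked, 3))
```

Ranking by magnitude rather than by raw value also surfaces volatile acidity (-0.4) as the second strongest relationship with quality, just behind alcohol.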
Alcohol had the strongest correlation with quality (0.5); sugar, surprisingly, did not correlate with quality (0.0).
Density correlates strongly with fixed acidity (0.7).
Besides the acids and the sulfur dioxides, which were strongly correlated with each other, I found that density positively correlates with fixed acidity.
This graph compares alcohol content to density. We can see that as alcohol increases and density decreases, quality tends to get slightly better. While this is an interesting observation, I will not be using it for my final analysis.
This graph compares fixed acidity to citric acid. We can see that as both acids increase, the quality slightly increases. While this is an interesting observation, I will not be using it for my final analysis.
I decided to look at density vs alcohol content and saw that as alcohol increases and density decreases, quality tends to get slightly better.
I also decided to look at fixed acidity vs citric acid and saw that as both acids increase, the quality slightly increases.
I found it interesting and surprising that as fixed acidity and citric acid increased, the overall quality of the wine increased.
This barplot shows that most of the wines are rated either a 5 or 6.
This graph shows how, on average, the quality of the wine increases as alcohol increases. The Normal box should be higher, but since I combined ratings 5 and 6 there are now a lot of outliers, which are not showing in the box plot.
This graph shows a positive correlation (about 0.5) between alcohol content and quality.
This was an interesting dataset to work with as we were able to explore the different things that make up the quality of wine.
I started out by investigating each of the variables and plotting them on histograms. I created a bar plot of the quality levels and saw that most of the wines were rated 5 or 6. I created a new quality rating column that categorizes a rating of 1-4 as Poor, 5-6 as Normal, and 7-10 as Good. I then created histograms of the variables to view their distributions.
From there I created a correlation plot, which showed a positive correlation between alcohol content and quality. After using the correlation plot to come up with the variables that would probably be the most successful to test, I put together a box plot outlining the effects of alcohol on wine quality. Finally, I plotted a scatter plot that displayed an overall increase in quality as alcohol increases.
One of the difficulties that I ran into during this project was my own assumptions. Coming into this, I had the assumption that sugar content would be a strong contributing factor to the quality of the wine. Also, this is my first attempt at using R, and I was surprised at how similar it is to using the different plotting libraries in Python.
One thing I would like to investigate further if I had the chance would be to find out if there is a specific reason that the wine ratings data range from only 3 to 8. Did they not include data outside that range, or did none of the wines warrant a score below 3 or above 8?