Red Wine Analysis by Kyle Santana

Introduction:

For this study I will analyze a Red Wine dataset created by Paulo Cortez (Univ. Minho), Antonio Cerdeira, Fernando Almeida, Telmo Matos and Jose Reis (CVRVV) in 2009. This data set contains the following input variables: fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, and alcohol, plus the output variable quality.

Descriptions of the variables are below:

1 - Fixed Acidity: Most acids involved with wine are fixed or nonvolatile (do not evaporate readily). (tartaric acid - g/dm^3)

2 - Volatile Acidity: The amount of acetic acid in wine, which at high levels can lead to an unpleasant, vinegar taste. (acetic acid - g/dm^3)

3 - Citric Acid: Found in small quantities, citric acid can add ‘freshness’ and flavor to wines. (g/dm^3)

4 - Residual Sugar: The amount of sugar remaining after fermentation stops. It’s rare to find wines with less than 1 gram/liter, and wines with greater than 45 grams/liter are considered sweet. (g/dm^3)

5 - Chlorides: The amount of salt in the wine. (sodium chloride - g/dm^3)

6 - Free Sulfur Dioxide: The free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine. (mg/dm^3)

7 - Total Sulfur Dioxide: Amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine. (mg/dm^3)

8 - Density: The density of wine is close to that of water, depending on the percent alcohol and sugar content. (g/cm^3)

9 - pH: Describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale

10 - Sulphates: A wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant. (g/dm^3)

11 - Alcohol: The percent alcohol content of the wine

Output variable (based on sensory data):

12 - Quality (Score between 0 and 10)

The goal of the project is to explore the data and see what inferences can be drawn from how the variables interact with each other.

Univariate Plots Section

## [1] "Looking at structure of the data"
## 'data.frame':    1599 obs. of  12 variables:
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...

Looking at structure of the data

This barplot shows that most of the wines are rated either a 5 or 6. This plot helped me to see how the wine ratings are grouped together. For example, we can see that no wines are rated 1, 2, 9, or 10.

##  [1] Normal Normal Normal Normal Normal Normal Normal Good   Good   Normal
## Levels: Good < Normal < Poor

I created a new quality rating column that categorizes a rating of 1-4 as Poor, 5-6 as Normal, and 7-10 as Good.
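The binning described above can be sketched in base R. This is only an illustration, not the code used in this report: it uses the first ten quality values from the structure output, and the use of `cut()` with these break points is an assumption. The output above lists the levels alphabetically (Good < Normal < Poor), so this sketch sets the order explicitly to Poor < Normal < Good.

```r
# First ten quality scores, copied from the structure output above.
quality <- c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)

# Assumed binning: up to 4 -> Poor, 5-6 -> Normal, 7-10 -> Good.
# include.lowest = TRUE means a score of 0 would also fall into Poor.
quality.rating <- cut(
  quality,
  breaks = c(0, 4, 6, 10),
  labels = c("Poor", "Normal", "Good"),
  include.lowest = TRUE,
  ordered_result = TRUE  # ordered factor: Poor < Normal < Good
)

print(quality.rating)
```

With an ordered factor, plots and summaries list the groups from Poor to Good instead of alphabetically.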

##  fixed.acidity   volatile.acidity  citric.acid    residual.sugar  
##  Min.   : 4.60   Min.   :0.1200   Min.   :0.000   Min.   : 0.900  
##  1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090   1st Qu.: 1.900  
##  Median : 7.90   Median :0.5200   Median :0.260   Median : 2.200  
##  Mean   : 8.32   Mean   :0.5278   Mean   :0.271   Mean   : 2.539  
##  3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420   3rd Qu.: 2.600  
##  Max.   :15.90   Max.   :1.5800   Max.   :1.000   Max.   :15.500  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.01200   Min.   : 1.00       Min.   :  6.00      
##  1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 22.00      
##  Median :0.07900   Median :14.00       Median : 38.00      
##  Mean   :0.08747   Mean   :15.87       Mean   : 46.47      
##  3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 62.00      
##  Max.   :0.61100   Max.   :72.00       Max.   :289.00      
##     density             pH          sulphates         alcohol     
##  Min.   :0.9901   Min.   :2.740   Min.   :0.3300   Min.   : 8.40  
##  1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50  
##  Median :0.9968   Median :3.310   Median :0.6200   Median :10.20  
##  Mean   :0.9967   Mean   :3.311   Mean   :0.6581   Mean   :10.42  
##  3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10  
##  Max.   :1.0037   Max.   :4.010   Max.   :2.0000   Max.   :14.90  
##     quality      quality.rating
##  Min.   :3.000   Good  : 217   
##  1st Qu.:5.000   Normal:1319   
##  Median :6.000   Poor  :  63   
##  Mean   :5.636                 
##  3rd Qu.:6.000                 
##  Max.   :8.000

Most of the variables have a similar median and mean, which leads me to believe that their distributions should be roughly symmetrical.

I wanted to view histograms of all the variables to help me determine which ones may be good to analyze. During the process I determined that some of the variables needed to be transformed to fit a more normal distribution.
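As a rough illustration of why a transformation can help, the sketch below compares the gap between mean and median before and after a log10 transform. The report does not say which variables were transformed or which transformation was used, so both the choice of residual sugar and the choice of log10 are assumptions. The values are the first ten rows from the structure output, so the numbers are illustrative only.

```r
# First ten residual.sugar values, copied from the structure output above.
residual.sugar <- c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2.0, 6.1)

# Mean-median gap in standard deviation units: a simple skew indicator.
# Values near 0 suggest symmetry; positive values suggest a long right tail.
skew_indicator <- function(x) (mean(x) - median(x)) / sd(x)

skew_raw <- skew_indicator(residual.sugar)
skew_log <- skew_indicator(log10(residual.sugar))  # assumed transform

print(c(raw = skew_raw, log10 = skew_log))
```

On this small sample the indicator is positive in both cases and smaller after the transform, which is the kind of change one would hope to see in the histograms.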

Univariate Analysis

What is the structure of your dataset?

There are 1599 observations of 12 attributes in this data set, and all of the variables are numeric.

Other observations include:

Most of the wines have a quality rating of 5 or 6 on the scale of 0-10. The middle half of the wines have a pH between 3.21 and 3.40.

What is/are the main feature(s) of interest in your dataset?

The main feature of interest for me is quality. I would like to know what variables will likely lead to a better quality wine.

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

Sugar and alcohol content are figures that I would guess would be important in the investigation.

Did you create any new variables from existing variables in the dataset?

I created a new quality rating column that categorizes a rating of 1-4 as Poor, 5-6 as Normal, and 7-10 as Good.

Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the
form of the data? If so, why did you do this?

I removed the first column as it was an index column and was not needed.
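Dropping an index column can be sketched as follows. The data frame here is a three-row stand-in, and the column name `X` is an assumption (it is the name `read.csv()` gives an unnamed first column); the report does not show the actual name or the code used.

```r
# Hypothetical stand-in for the loaded data, with an index column named X.
wine <- data.frame(
  X = 1:3,
  fixed.acidity = c(7.4, 7.8, 7.8),
  quality = c(5L, 5L, 5L)
)

# Drop the first column by position; drop = FALSE keeps a data frame
# even if only one column remains.
wine <- wine[, -1, drop = FALSE]

print(names(wine))
```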

Bivariate Plots Section

This is a graph of quality compared to alcohol. We can see an upward trend: as alcohol increases, quality increases. This falls in line with my assumption that quality increases with the percentage of alcohol.

This graph shows how, on average, the quality of the wine increases as alcohol increases. The Normal boxplot should be higher, but since I combined 5 and 6 there are now a lot of outliers which aren’t showing in the box plot.
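One way to check this trend numerically is to average alcohol within each rating group. The sketch below is a minimal example, not the code behind the plot: it uses only the first ten rows from the structure output, so the group means are illustrative, and the `cut()` binning is an assumption that follows the Poor/Normal/Good definition given earlier.

```r
# First ten rows of alcohol and quality, copied from the structure output.
alcohol <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
quality <- c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)

# Assumed binning into the three rating groups.
rating <- cut(
  quality,
  breaks = c(0, 4, 6, 10),
  labels = c("Poor", "Normal", "Good"),
  include.lowest = TRUE,
  ordered_result = TRUE
)

# Mean alcohol per group; Poor is NA here because none of these
# ten rows has a Poor rating.
mean_alcohol <- tapply(alcohol, rating, mean)
print(mean_alcohol)
```

On the full data set the same call would give one mean per group, which can be compared against the medians drawn in the boxplot.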

##                      fixed.acidity volatile.acidity citric.acid
## fixed.acidity                  1.0             -0.3         0.7
## volatile.acidity              -0.3              1.0        -0.6
## citric.acid                    0.7             -0.6         1.0
## residual.sugar                 0.1              0.0         0.1
## chlorides                      0.1              0.1         0.2
## free.sulfur.dioxide           -0.2              0.0        -0.1
## total.sulfur.dioxide          -0.1              0.1         0.0
## density                        0.7              0.0         0.4
## pH                            -0.7              0.2        -0.5
## sulphates                      0.2             -0.3         0.3
## alcohol                       -0.1             -0.2         0.1
## quality                        0.1             -0.4         0.2
##                      residual.sugar chlorides free.sulfur.dioxide
## fixed.acidity                   0.1       0.1                -0.2
## volatile.acidity                0.0       0.1                 0.0
## citric.acid                     0.1       0.2                -0.1
## residual.sugar                  1.0       0.1                 0.2
## chlorides                       0.1       1.0                 0.0
## free.sulfur.dioxide             0.2       0.0                 1.0
## total.sulfur.dioxide            0.2       0.0                 0.7
## density                         0.4       0.2                 0.0
## pH                             -0.1      -0.3                 0.1
## sulphates                       0.0       0.4                 0.1
## alcohol                         0.0      -0.2                -0.1
## quality                         0.0      -0.1                -0.1
##                      total.sulfur.dioxide density   pH sulphates alcohol
## fixed.acidity                        -0.1     0.7 -0.7       0.2    -0.1
## volatile.acidity                      0.1     0.0  0.2      -0.3    -0.2
## citric.acid                           0.0     0.4 -0.5       0.3     0.1
## residual.sugar                        0.2     0.4 -0.1       0.0     0.0
## chlorides                             0.0     0.2 -0.3       0.4    -0.2
## free.sulfur.dioxide                   0.7     0.0  0.1       0.1    -0.1
## total.sulfur.dioxide                  1.0     0.1 -0.1       0.0    -0.2
## density                               0.1     1.0 -0.3       0.1    -0.5
## pH                                   -0.1    -0.3  1.0      -0.2     0.2
## sulphates                             0.0     0.1 -0.2       1.0     0.1
## alcohol                              -0.2    -0.5  0.2       0.1     1.0
## quality                              -0.2    -0.2 -0.1       0.3     0.5
##                      quality
## fixed.acidity            0.1
## volatile.acidity        -0.4
## citric.acid              0.2
## residual.sugar           0.0
## chlorides               -0.1
## free.sulfur.dioxide     -0.1
## total.sulfur.dioxide    -0.2
## density                 -0.2
## pH                      -0.1
## sulphates                0.3
## alcohol                  0.5
## quality                  1.0

This correlation matrix displays the pairwise correlation between the different variables. Since my variable of interest is quality, I am looking for the value in the quality column that is closest to one (other than quality itself), which is alcohol at 0.5.
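Reading the quality column by eye works, but the ranking can also be done in code. The sketch below copies the rounded values from the quality column of the matrix above and sorts them by absolute value, so that strong negative correlations are not overlooked. The variable names `quality_cor` and `ranked` are only for illustration.

```r
# Rounded correlations with quality, copied from the matrix above.
quality_cor <- c(
  fixed.acidity = 0.1, volatile.acidity = -0.4, citric.acid = 0.2,
  residual.sugar = 0.0, chlorides = -0.1, free.sulfur.dioxide = -0.1,
  total.sulfur.dioxide = -0.2, density = -0.2, pH = -0.1,
  sulphates = 0.3, alcohol = 0.5
)

# Sort by absolute value so negative relationships rank alongside positive ones.
ranked <- quality_cor[order(abs(quality_cor), decreasing = TRUE)]

print(head(ranked, 3))
```

With the full data frame loaded, the same column could be obtained from `cor()` instead of being typed in by hand. Ranked this way, volatile acidity (-0.4) and sulphates (0.3) follow alcohol.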

Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features
in the dataset?

Alcohol had the strongest correlation with quality (about 0.5), while sugar surprisingly did not correlate with quality (about 0.0).

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

Density correlates strongly with fixed acidity.

What was the strongest relationship you found?

Besides the acids and dioxides that were strongly correlated with each other, I found that density positively correlates with fixed acidity (about 0.7).

Multivariate Plots Section

This graph compares alcohol content to density. We can see that as alcohol increases and density decreases, quality tends to get slightly better. While this is an interesting observation, I will not be using it for my final analysis.

This graph compares fixed acidity to citric acid. We can see that as both acids increase, the quality slightly increases. While this is an interesting observation, I will not be using it for my final analysis.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

I decided to look at density vs alcohol content and saw that as alcohol increases and density decreases, quality tends to get slightly better.

I also decided to look at fixed acidity vs citric acid and saw that as both acids increase, the quality slightly increases.

Were there any interesting or surprising interactions between features?

I found it interesting and surprising that as fixed acidity and citric acid increased, the overall quality of the wine increased.


Final Plots and Summary

Plot One

Plot One Summary

This barplot shows that most of the wines are rated either a 5 or 6.

Plot Two

Description Two

This graph shows how, on average, the quality of the wine increases as alcohol increases. The Normal boxplot should be higher, but since I combined 5 and 6 there are now a lot of outliers which aren’t showing in the box plot.

Plot Three

Description Three

This graph shows a positive correlation between alcohol content and quality (about 0.5, the largest of any variable).


Reflection

This was an interesting dataset to work with as we were able to explore the different things that make up the quality of wine.

I started out by investigating each of the variables. I created a barplot of the quality levels and saw that most of the wines were rated either a 5 or 6. I created a new quality rating column that categorizes a rating of 1-4 as Poor, 5-6 as Normal, and 7-10 as Good. I then created histograms of the variables to view their distributions.

From there I created a correlation plot, which showed that alcohol had the strongest positive correlation with quality. After using the correlation plot to come up with the variables that would probably be the most successful to test, I put together a box plot outlining the effects of alcohol on wine quality. Finally, I plotted a scatter plot that displayed an overall increase in quality as alcohol increases.

One of the difficulties that I ran into during this project was my own assumptions. Coming into this, I had the assumption that sugar content would be a strong contributing factor to the quality of the wine. Also, this is my first attempt at using R, and I was surprised by how similar it is to using the different plotting libraries in Python.

One thing I would like to investigate further if I had the chance would be to find out if there is a specific reason that the wine ratings data range from only 3 to 8. Did they not include data outside that range, or did none of the wines warrant a score below 3 or above 8?