Explore and Summarize Red Wine Data

Author: Carl (Andrew) Perkins

This report explores a data set of 1,599 red wines, each described by 11 variables capturing its chemical properties. Three experts also rated the quality of each wine on a scale from 0 (very bad) to 10 (excellent).

The data is explored using R, with visualizations along the way. It is important to define the individual variables first, so the descriptions provided with the data set are listed below. Description of attributes:

1 - fixed acidity: most acids involved with wine are fixed or nonvolatile (do not evaporate readily)

2 - volatile acidity: the amount of acetic acid in wine, which at too high of levels can lead to an unpleasant, vinegar taste

3 - citric acid: found in small quantities, citric acid can add ‘freshness’ and flavor to wines

4 - residual sugar: the amount of sugar remaining after fermentation stops; it’s rare to find wines with less than 1 gram/liter, and wines with greater than 45 grams/liter are considered sweet

5 - chlorides: the amount of salt in the wine

6 - free sulfur dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine

7 - total sulfur dioxide: amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine

8 - density: the density of wine is close to that of water, depending on the percent alcohol and sugar content

9 - pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale

10 - sulphates: a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant

11 - alcohol: the percent alcohol content of the wine

12 - quality (score between 0 and 10) | Output variable (based on sensory data)

Below are some initial statistics on the variables. The first output is the number of rows, which is the number of red wines recorded in the data set.

## [1] 1599

Below are the variable names, followed by summary statistics for each variable across all red wines in the data set.

##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"
##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000

The output above summarizes the initial statistics for each variable in the red wine data set. Note that free.sulfur.dioxide and total.sulfur.dioxide have maximums (72 and 289) far above their means (15.87 and 46.47), which suggests that there are outliers.
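One common way to make this suspicion concrete is the 1.5 x IQR rule, which flags values lying more than 1.5 times the interquartile range beyond the quartiles. Using the quartiles printed above, the upper fence is 62 + 1.5 x (62 - 22) = 122 for total.sulfur.dioxide and 21 + 1.5 x (21 - 7) = 42 for free.sulfur.dioxide, so the maximums of 289 and 72 both fall outside it. The sketch below applies the same rule in R. The helper name `flag_outliers` and the sample vector are made up for illustration and are not part of the original analysis.

```r
# Hypothetical helper: TRUE for values more than 1.5 * IQR beyond the quartiles.
flag_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  fence <- 1.5 * (q[2] - q[1])
  x < q[1] - fence | x > q[2] + fence
}

# Made-up values loosely shaped like total.sulfur.dioxide (illustration only).
so2 <- c(22, 30, 38, 41, 46, 55, 62, 70, 278, 289)

# Print only the flagged values.
so2[flag_outliers(so2)]
```

On the real data, the same helper could be applied column by column, for example to the total.sulfur.dioxide column of the wine data frame.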

Univariate Plots

The graph above shows 2 outliers in total sulfur dioxide, a variable recorded for each red wine in the data set. It also appears that no wine is rated higher than 8 for quality.

Both plots above show the count of wines at each level of fixed.acidity. The first plot is a standard qplot histogram. The second uses a smaller binwidth and scale_x_continuous to show a bit more detail. The majority of fixed.acidity values fall between 6 and 10.
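A histogram with a smaller bin width and explicit axis breaks might be built along the following lines. This is only a sketch: it assumes the ggplot2 package is installed, the data frame is simulated rather than the real wine data, and the bin width of 0.5 and the break positions are illustrative choices, not the values used for the plots in this report.

```r
library(ggplot2)  # assumption: ggplot2 is installed

# Simulated stand-in for the wine data (not the real values).
set.seed(1)
wine <- data.frame(fixed.acidity = round(rnorm(200, mean = 8.3, sd = 1.7), 1))

# Narrower bins plus explicit axis breaks show more detail than the defaults.
p <- ggplot(wine, aes(x = fixed.acidity)) +
  geom_histogram(binwidth = 0.5, colour = "black", fill = "grey70") +
  scale_x_continuous(breaks = seq(4, 16, by = 1))
print(p)
```

Choosing the bin width by hand is a trade-off: narrower bins reveal more structure but also more noise.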

Both plots above show the count of wines at each level of volatile.acidity. I initially wonder how volatile.acidity and fixed.acidity (another variable in the data set) relate to one another.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.090   0.260   0.271   0.420   1.000

The summary and plots above show the count of wines at each level of citric.acid. There appears to be an outlier at 1.0 for citric.acid in the charts above.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9901  0.9956  0.9968  0.9967  0.9978  1.0037

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.740   3.210   3.310   3.311   3.400   4.010

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.20   10.42   11.10   14.90

The first plot and summary above look at the count of wines by density. The second plot looks at the count of wines by pH. Lastly, the two remaining plots look at the percentage of alcohol in each red wine; the last plot is broken into bucket categories. pH and density are by far the most normally distributed of the variables.

Univariate Analysis

What is the structure of your dataset?

There are 1599 observations and 13 variables within the dataset (X, fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, density, pH, sulphates, alcohol, quality).

Other observations:

Quality is scored on a scale from 0 to 10, but the observed ratings range from 3 to 8, with a mean of about 5.6

Max density is 1.0037

The max alcohol is 14.9%

What is/are the main feature(s) of interest in your dataset?

The main feature of interest I want to find is what correlates to the quality rating of the individual wine itself.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Largely, I think that the alcohol level will contribute to the quality through the boldness of the wine. The acidity level will most likely affect the bitterness of the wine, and the citric acid level will affect its freshness.

Did you create any new variables from existing variables in the dataset?

Yes, I created buckets for the alcohol percentage. I used ranges of 0-5, 5-10 and 10-15.
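Buckets like these can be created with base R's `cut()`. The sketch below is an illustration rather than the exact code used for this report: the input vector holds only the six alcohol values printed in the summary output, and the choice of right-closed intervals is an assumption.

```r
# Alcohol values copied from the summary output (min, quartiles, mean, max).
alcohol <- c(8.4, 9.5, 10.2, 10.42, 11.1, 14.9)

# Right-closed intervals: (0,5], (5,10], (10,15].
alcohol_buckets <- cut(alcohol, breaks = c(0, 5, 10, 15))

# Count how many values land in each bucket.
table(alcohol_buckets)
```

Because the minimum alcohol value in the data set is 8.4, the 0-5 bucket stays empty, so in practice only two buckets carry data.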

Of the features you investigated, were there any unusual distributions?

The pH and density were approximately normally distributed; most of the others were skewed to the right.
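A quick, informal check of this observation is to compare each variable's mean with its median, since a mean well above the median hints at a right skew. The sketch below uses four mean and median pairs copied from the summary output earlier in this report. The `rel_gap` column is a made-up name for illustration, and the comparison is a rough heuristic rather than a formal test of normality.

```r
# Means and medians copied from the summary() output earlier in the report.
stats <- data.frame(
  variable = c("pH", "density", "residual.sugar", "total.sulfur.dioxide"),
  mean     = c(3.311, 0.9967, 2.539, 46.47),
  median   = c(3.310, 0.9968, 2.200, 38.00)
)

# Relative gap between mean and median; a large positive gap hints at right skew.
stats$rel_gap <- (stats$mean - stats$median) / stats$median

# Largest gaps first.
stats[order(-stats$rel_gap), ]
```

For pH and density the gap is close to zero, while residual.sugar and total.sulfur.dioxide have means well above their medians.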

Bivariate Plots

##                                 X fixed.acidity volatile.acidity
## X                     1.000000000    0.00000000      0.724669575
## fixed.acidity        -0.268483920    1.00000000      0.000000000
## volatile.acidity     -0.008815099   -0.25613089      1.000000000
## citric.acid          -0.153551355    0.67170343     -0.552495685
## residual.sugar       -0.031260835    0.11477672      0.001917882
## chlorides            -0.119868519    0.09370519      0.061297772
## free.sulfur.dioxide   0.090479643   -0.15379419     -0.010503827
## total.sulfur.dioxide -0.117849669   -0.11318144      0.076470005
## density              -0.368372087    0.66804729      0.022026232
## pH                    0.136005328   -0.68297819      0.234937294
## sulphates            -0.125306999    0.18300566     -0.260986685
## alcohol               0.245122841   -0.06166827     -0.202288027
## quality               0.066452608    0.12405165     -0.390557780
##                        citric.acid residual.sugar     chlorides
## X                     6.744481e-10   2.115297e-01  1.533390e-06
## fixed.acidity         0.000000e+00   4.199465e-06  1.751746e-04
## volatile.acidity      0.000000e+00   9.389168e-01  1.422491e-02
## citric.acid           1.000000e+00   8.083723e-09  2.220446e-16
## residual.sugar        1.435772e-01   1.000000e+00  2.617079e-02
## chlorides             2.038229e-01   5.560954e-02  1.000000e+00
## free.sulfur.dioxide  -6.097813e-02   1.870490e-01  5.562147e-03
## total.sulfur.dioxide  3.553302e-02   2.030279e-01  4.740047e-02
## density               3.649472e-01   3.552834e-01  2.006323e-01
## pH                   -5.419041e-01  -8.565242e-02 -2.650261e-01
## sulphates             3.127700e-01   5.527121e-03  3.712605e-01
## alcohol               1.099032e-01   4.207544e-02 -2.211405e-01
## quality               2.263725e-01   1.373164e-02 -1.289066e-01
##                      free.sulfur.dioxide total.sulfur.dioxide
## X                           2.915917e-04         2.297726e-06
## fixed.acidity               6.335579e-10         5.709033e-06
## volatile.acidity            6.747011e-01         2.213857e-03
## citric.acid                 1.473916e-02         1.555454e-01
## residual.sugar              4.685141e-14         2.220446e-16
## chlorides                   8.241238e-01         5.809120e-02
## free.sulfur.dioxide         1.000000e+00         0.000000e+00
## total.sulfur.dioxide        6.676665e-01         1.000000e+00
## density                    -2.194583e-02         7.126948e-02
## pH                          7.037750e-02        -6.649456e-02
## sulphates                   5.165757e-02         4.294684e-02
## alcohol                    -6.940835e-02        -2.056539e-01
## quality                    -5.065606e-02        -1.851003e-01
##                            density            pH    sulphates      alcohol
## X                     0.000000e+00  4.770847e-08 4.992031e-07 0.000000e+00
## fixed.acidity         0.000000e+00  0.000000e+00 1.648681e-13 1.364868e-02
## volatile.acidity      3.787554e-01  0.000000e+00 0.000000e+00 3.330669e-16
## citric.acid           0.000000e+00  0.000000e+00 0.000000e+00 1.059462e-05
## residual.sugar        0.000000e+00  6.065915e-04 8.252134e-01 9.258425e-02
## chlorides             5.551115e-16  0.000000e+00 0.000000e+00 0.000000e+00
## free.sulfur.dioxide   3.804985e-01  4.869975e-03 3.888321e-02 5.492314e-03
## total.sulfur.dioxide  4.354284e-03  7.818341e-03 8.601835e-02 1.110223e-16
## density               1.000000e+00  0.000000e+00 2.418474e-09 0.000000e+00
## pH                   -3.416993e-01  1.000000e+00 2.109424e-15 1.110223e-16
## sulphates             1.485064e-01 -1.966476e-01 1.000000e+00 1.783053e-04
## alcohol              -4.961798e-01  2.056325e-01 9.359475e-02 1.000000e+00
## quality              -1.749192e-01 -5.773139e-02 2.513971e-01 4.761663e-01
##                           quality
## X                    7.857465e-03
## fixed.acidity        6.495635e-07
## volatile.acidity     0.000000e+00
## citric.acid          0.000000e+00
## residual.sugar       5.832180e-01
## chlorides            2.313383e-07
## free.sulfur.dioxide  4.283398e-02
## total.sulfur.dioxide 8.615331e-14
## density              1.874945e-12
## pH                   2.096278e-02
## sulphates            0.000000e+00
## alcohol              0.000000e+00
## quality              1.000000e+00

I found a function online (linked below) that allowed me to compare every pair of variables in the red wine data. In the matrix above, the values below the diagonal are correlation coefficients and the values above the diagonal are the corresponding p-values.

https://stat.ethz.ch/pipermail/r-help/2001-November/016201.html
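A matrix with correlations below the diagonal and p-values above it can be assembled in base R roughly as follows. This is a sketch in the same spirit and not the function from the linked post: the helper name `cor_with_p` is hypothetical, the data frame is simulated, and the p-values assume the usual two-sided t-test for a Pearson correlation.

```r
# Hypothetical helper: correlations below the diagonal, p-values above it.
cor_with_p <- function(dat) {
  r <- cor(dat)
  n <- nrow(dat)
  # t statistic for H0: rho = 0, with n - 2 degrees of freedom
  t_stat <- r * sqrt((n - 2) / (1 - r^2))
  p <- 2 * pt(-abs(t_stat), df = n - 2)
  out <- r
  out[upper.tri(out)] <- p[upper.tri(p)]
  out
}

# Simulated data (not the wine data): b depends on a, c is independent noise.
set.seed(42)
x <- rnorm(50)
dat <- data.frame(a = x, b = x + rnorm(50, sd = 0.5), c = rnorm(50))

m <- cor_with_p(dat)
round(m, 3)
```

Reading the result, `m["b", "a"]` is the correlation between `a` and `b`, while `m["a", "b"]` is its p-value. For a single pair, `cor.test()` reports the same two numbers.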

This plot represents the correlations among all variables in the red wine data. As for comparing pairs of variables, I first wanted to dive deeper into the correlation between alcohol and quality.

These plots show quality against alcohol percentage in the red wine data. Overall, the relationship between alcohol level and quality rating is moderate rather than strong (r ≈ 0.48 in the correlation matrix), although it is the largest correlation with quality in the matrix.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.60    7.10    7.90    8.32    9.20   15.90

This plot and summary show fixed.acidity against alcohol percentage in the red wine data. The two variables take values on a similar numeric scale (fixed.acidity ranges from 4.6 to 15.9 and alcohol from 8.4 to 14.9), but the correlation matrix shows almost no linear relationship between them (r ≈ -0.06).

This plot shows fixed.acidity against citric.acid in the red wine data. The graph shows a positive relationship between fixed acidity and citric acid (r ≈ 0.67). Since citric acid is itself an acid, this matches what I expected to see in these correlations.

This plot shows fixed.acidity against density in the red wine data. There appears to be a positive relationship between fixed acidity and density as well (r ≈ 0.67).

This box plot shows pH for each level of the alcohol_buckets variable in the red wine data. There seems to be a slight relationship between pH and the alcohol buckets as well.
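A box plot of pH by alcohol bucket, with one fill color per bucket, could be drawn in base R along these lines. This is only a sketch on simulated data, not the real wine values: the colors and axis labels are arbitrary choices, and the bucket breaks follow the 0-5, 5-10 and 10-15 ranges described earlier.

```r
# Simulated stand-in data (not the real wine values).
set.seed(7)
wine <- data.frame(
  alcohol = runif(120, min = 8.4, max = 14.9),
  pH      = rnorm(120, mean = 3.3, sd = 0.15)
)
wine$alcohol_buckets <- cut(wine$alcohol, breaks = c(0, 5, 10, 15))

# Drop the empty (0,5] level so only populated buckets are drawn.
wine$alcohol_buckets <- droplevels(wine$alcohol_buckets)

# One fill color per bucket, in factor-level order.
bucket_cols <- c("tomato", "steelblue")
bp <- boxplot(pH ~ alcohol_buckets, data = wine, col = bucket_cols,
              xlab = "Alcohol bucket (% by volume)", ylab = "pH")
```

In ggplot2, the analogous approach is to map the bucket variable to the fill aesthetic, for example `aes(x = alcohol_buckets, y = pH, fill = alcohol_buckets)` combined with `geom_boxplot()`.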

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Referring to quality vs alcohol: there does appear to be a slight relationship between alcohol buckets and quality. Generally, the higher the alcohol, the better the quality, since more of the highly rated data points fall in the higher alcohol range.

As pH decreases, citric acid increases (r ≈ -0.54). Because a lower pH means a more acidic solution, this is what one would expect, and the data backs it up.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

Fixed acidity and density showed a strong relationship.

What was the strongest relationship you found?

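Among the pairs I plotted, fixed acidity and density (r ≈ 0.67), closely matched by fixed acidity and citric acid (r ≈ 0.67). In the full correlation matrix, the largest correlation in magnitude is between fixed acidity and pH (r ≈ -0.68).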

Multivariate Plots Section

This plot shows density against fixed.acidity in the red wine data, with the quality rating shown by color. After seeing the previous graphs, I knew that density increases as fixed acidity increases, and I wanted to see if there was any relationship with quality. Each quality rating has its own color, as shown in the legend to the right.

This plot shows alcohol percentage against citric.acid in the red wine data, with the quality rating shown by color. There does not appear to be a direct relationship between citric acid and alcohol. However, there do seem to be more highly rated wines in the 11 to 13 percent alcohol range, as well as where the citric acid level is around 0.5.

This plot shows pH against volatile.acidity in the red wine data, with the quality rating shown by color. The graph summarizes quality across the range of pH and volatile acidity. Note that the correlation matrix gives a weak positive correlation for this pair (r ≈ 0.23) rather than an inverse one.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

There is a relationship between alcohol percentage and the quality rating of the wine. As a general tendency, the higher the alcohol, the better the wine. We can also see that higher-rated wines shift towards less volatile acidity.

Were there any interesting or surprising interactions between features?

The most interesting thing to see was the relationship between alcohol level and quality, as well as how citric acid also played a role, most likely through the freshness of the wine.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.


Final Plots and Summary

Plot One

Description One

This plot shows density against fixed.acidity in the red wine data, with the quality rating shown by color. It shows that fixed acidity increases as density increases and, loosely, that higher-quality wines also tend to be more acidic.

Plot Two

Description Two

This plot shows the number of wines at each percentage of alcohol by volume in the red wine data. Generally speaking, the count peaks at around the 9.5 percent mark. With a different data set, this could of course change.

Plot Three

Description Three

This box plot shows pH across the alcohol bucket categories in the red wine data. The buckets are a new variable that I created specifically to form this graph. There is roughly some relationship: as the alcohol level rises, so does the pH level.


Reflection

Overall, I found this project and report to be very interesting and valuable for learning to use statistics and charts in R. The most interesting thing I found is that, based upon the quality rating, the more highly rated red wines generally have higher alcohol levels, with many falling in the 11 to 13 percent range. Additionally, I tried to get the boxplot to have different colors based upon the bucket variable that I had created. This was the most difficult syntax for me. I ended up just making them all one color.

Also, the process of learning R was honestly a lot of fun. I have an analytic mindset, and I firmly believe that I will use the tools and concepts I learned in the future. I know SQL, Python, HTML, CSS and JavaScript, and R was the quickest language for me to learn relative to its difficulty.

If I were to analyze data like this again, I would love to see the names, prices and regions where the grapes were grown. I would like to see how region and price compare to the overall quality. The names would be a great value add for future personal experimentation. I think this data could be leveraged for additional insights in the future.