Description of attributes:

1 - fixed acidity: most acids involved with wine are fixed or nonvolatile (do not evaporate readily)

2 - volatile acidity: the amount of acetic acid in wine, which at high levels can lead to an unpleasant, vinegar taste

3 - citric acid: found in small quantities, citric acid can add ‘freshness’ and flavor to wines

4 - residual sugar: the amount of sugar remaining after fermentation stops; it’s rare to find wines with less than 1 gram/liter, and wines with greater than 45 grams/liter are considered sweet

5 - chlorides: the amount of salt in the wine

6 - free sulfur dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine

7 - total sulfur dioxide: amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine

8 - density: the density of wine is close to that of water, depending on the percent alcohol and sugar content

9 - pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale

10 - sulphates: a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant

11 - alcohol: the percent alcohol content of the wine

Univariate Plots Section:

Looking at the data set and table structure

## spec_tbl_df[,13] [1,599 x 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ x                   : num [1:1599] 1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num [1:1599] 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num [1:1599] 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num [1:1599] 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num [1:1599] 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num [1:1599] 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num [1:1599] 11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num [1:1599] 34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num [1:1599] 0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num [1:1599] 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num [1:1599] 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num [1:1599] 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : num [1:1599] 5 5 5 6 5 5 5 7 7 5 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   x = col_double(),
##   ..   fixed.acidity = col_double(),
##   ..   volatile.acidity = col_double(),
##   ..   citric.acid = col_double(),
##   ..   residual.sugar = col_double(),
##   ..   chlorides = col_double(),
##   ..   free.sulfur.dioxide = col_double(),
##   ..   total.sulfur.dioxide = col_double(),
##   ..   density = col_double(),
##   ..   pH = col_double(),
##   ..   sulphates = col_double(),
##   ..   alcohol = col_double(),
##   ..   quality = col_double()
##   .. )
##        x          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00       Min.   :  6.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 22.00      
##  Median : 2.200   Median :0.07900   Median :14.00       Median : 38.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87       Mean   : 46.47      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 62.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00       Max.   :289.00      
##     density             pH          sulphates         alcohol     
##  Min.   :0.9901   Min.   :2.740   Min.   :0.3300   Min.   : 8.40  
##  1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50  
##  Median :0.9968   Median :3.310   Median :0.6200   Median :10.20  
##  Mean   :0.9967   Mean   :3.311   Mean   :0.6581   Mean   :10.42  
##  3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10  
##  Max.   :1.0037   Max.   :4.010   Max.   :2.0000   Max.   :14.90  
##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.636  
##  3rd Qu.:6.000  
##  Max.   :8.000
##  [1] "x"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"

The “wineQualityReds” data set contains 1599 observations of 13 columns: a row index (x), 11 variables describing the chemical properties of the wine, and a quality rating.
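A minimal sketch of the load-and-inspect step that produces output like the listing above. The `spec_tbl_df` class suggests the data was read with readr, but this sketch uses base `read.csv` to avoid a package dependency. The three-row extract and the temporary file are assumptions made only so the example runs on its own; the values are copied from the `str()` output.

```r
# Hypothetical three-row, five-column extract; values copied from the str() output.
csv_lines <- c(
  "x,fixed.acidity,volatile.acidity,citric.acid,quality",
  "1,7.4,0.70,0.00,5",
  "2,7.8,0.88,0.00,5",
  "3,7.8,0.76,0.04,5"
)

# Write the extract to a temporary file so the sketch does not need the real CSV.
tmp <- tempfile(fileext = ".csv")
writeLines(csv_lines, tmp)

wine <- read.csv(tmp)

str(wine)      # column types and the first few values
summary(wine)  # min, quartiles, mean and max per column
names(wine)    # column names
dim(wine)      # rows x columns
```

With the full file, the same four calls should report 1599 rows and 13 columns.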

Wine Variable Histograms:

A note about citric acid: Citric acid is naturally present in extremely small amounts in grapes and is easily converted into other materials during the wine making process. Citric acid is sometimes added to wine to give the wine a fresh flavor. However, it is not unusual to encounter no citric acid in a wine. The data is not missing (null); its value is simply “0”. These zeros are valid measurements, so I will not remove them from the plots or any calculations.
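The distinction between a missing value and a measured zero can be checked directly. This is a small sketch using the first ten citric acid values shown in the `str()` output; the variable names are illustrative, not the author's.

```r
# First ten citric acid values, copied from the str() output.
citric_acid <- c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36)

n_missing <- sum(is.na(citric_acid))  # NA would mean the value was not recorded
n_zero    <- sum(citric_acid == 0)    # 0 means citric acid was measured as absent

n_missing
n_zero
```

In this ten-row sample there are no missing values and five zeros, so the zeros take part in summaries and plots like any other measurement.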

I will be using the quality levels of the wine to compare the other variables. I would like to find some correlations between these variables and good quality red wines. I will build a factored variable to accomplish this.

## [1] "poor"    "average" "good"
##    poor average    good 
##      63    1319     217

Of the 1599 observations, 63 are rated as poor quality (<5), 1319 are rated as average quality (5 or 6) and 217 are rated as good quality (7 or higher).
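One way to build such a factored variable is `cut()`. The sketch below is an assumption about how the grouping could be done, not necessarily the code used for this report; `quality_level` is an illustrative name, and the input is the first ten quality scores from the `str()` output.

```r
# First ten quality scores, copied from the str() output.
quality <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# Intervals are closed on the right by default:
# (0, 4] -> poor, (4, 6] -> average, (6, 10] -> good
quality_level <- cut(
  quality,
  breaks = c(0, 4, 6, 10),
  labels = c("poor", "average", "good")
)

levels(quality_level)
table(quality_level)
```

On these ten rows the table shows 0 poor, 8 average and 2 good wines. Applied to the full quality column, the same breaks would put scores 3 and 4 in poor, 5 and 6 in average, and 7 and 8 in good.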

Several of the histograms show long tails and outliers. I will take a closer look at these and attempt to transform these to a more normal distribution and reduce the outliers.

Fixed Acidity

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.6628  0.8513  0.8976  0.9112  0.9638  1.2014

The log summary shows no negative numbers or infinity. The log function gives the histogram a more normal distribution. I modified the x-axis to show only the range between 0.7 and 1.05.

Volatile Acidity

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -0.9208 -0.4089 -0.2840 -0.3034 -0.1938  0.1987

The log summary shows negative numbers, so I will add 1 to the values before taking the log. The log function does give the histogram a more normal distribution. I modified the x-axis to show only the range between 0.05 and 0.3.
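The summary values above match base-10 logarithms of the raw volatile acidity values, so the sketch below assumes `log10`. It compares the plain transform with the shifted one on the five summary points (minimum, quartiles, maximum) of the raw variable. The variable names are illustrative.

```r
# Min, 1st quartile, median, 3rd quartile and max of raw volatile acidity,
# copied from the summary() output.
volatile_acidity <- c(0.12, 0.39, 0.52, 0.64, 1.58)

log_plain <- log10(volatile_acidity)      # negative wherever the raw value is below 1
log_shift <- log10(volatile_acidity + 1)  # adding 1 keeps every result above 0

round(log_plain, 4)
round(log_shift, 4)

# A raw value of exactly 0 (as in citric acid) gives -Inf without the shift.
log10(0)
log10(0 + 1)
```

The shift changes the scale of the axis but not the order of the observations, which is why the shape of the histogram is still comparable.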

Citric Acid

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.090   0.260   0.271   0.420   1.000

The log function does not help to transform this histogram.

Residual Sugar

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -0.04576  0.27875  0.34242  0.36925  0.41497  1.19033

The log summary shows negative numbers, so I will add 1 to the values before taking the log. The log function does give the histogram a more normal distribution. I modified the x-axis to reduce the long tail.

Chlorides

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100

There is a very long tail on this histogram.
Removing the outliers does give the histogram a more normal distribution. I modified the x-axis to reduce the long tail.
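The report does not state which axis limits were used, so the following is only a sketch of one common approach: dropping values above an upper quantile before plotting. The helper `trim_upper_tail` and the 95th percentile cutoff are assumptions. The data are the first ten chloride values from the `str()` output plus the maximum from the summary.

```r
# First ten chloride values from str(), plus the maximum (0.611) from summary().
chlorides <- c(0.076, 0.098, 0.092, 0.075, 0.076, 0.075,
               0.069, 0.065, 0.073, 0.071, 0.611)

# Hypothetical helper: keep only values at or below the chosen upper quantile.
trim_upper_tail <- function(x, prob = 0.95) {
  x[x <= quantile(x, prob, names = FALSE)]
}

chlorides_trimmed <- trim_upper_tail(chlorides)

range(chlorides)          # includes the extreme value
range(chlorides_trimmed)  # extreme value removed
```

The trimmed vector can then be passed to `hist()`. Trimming only changes what is displayed; the untrimmed values remain available for any calculations.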

Free Sulfur Dioxide

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.8451  1.1461  1.1058  1.3222  1.8573

The log summary shows a minimum of 0 and no negative numbers or infinity, so no offset is needed. The log function gives the histogram a more normal distribution.

Total Sulfur Dioxide

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.7782  1.3424  1.5798  1.5638  1.7924  2.4609

No negative numbers or infinity. The log function gives the histogram a more normal distribution.

Density

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9901  0.9956  0.9968  0.9967  0.9978  1.0037

This variable already has a pretty normal distribution. I will not transform the data. Removing the outliers gives the histogram a slightly more normal distribution. I modified the x-axis to reduce the tails on both sides.

pH

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.740   3.210   3.310   3.311   3.400   4.010

This variable already has a pretty normal distribution. I will not alter the data. Removing the outliers gives the histogram a slightly more normal distribution. I modified the x-axis to reduce the tails on both sides.

Sulphates

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.5500  0.6200  0.6581  0.7300  2.0000

There is a very long tail on this histogram.
Removing the outliers gives the histogram a more normal distribution. I modified the x-axis to reduce the long tail.

Alcohol

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.20   10.42   11.10   14.90

Neither the log nor the sqrt function helps to transform this histogram. It has a positively skewed distribution. Removing the outliers does give the histogram a more normal distribution. I modified the x-axis to reduce the long tail.

Bivariate Analysis

Fixed Acidity

The boxplot for fixed acidity shows some extreme outliers. I modified the x-axis to show only the range between 0.7 and 1.05. A closer look at the boxplot reveals that the median fixed acidity of good quality wines is higher than that of the average and poor quality wines. There may be a correlation between this variable and the quality of wine. Further investigation may be needed.

Volatile Acidity

The boxplot for volatile acidity shows a few extreme outliers. As with fixed acidity, there may be a correlation with wine quality, which warrants further investigation.

Citric Acid

The boxplot for citric acid shows a few extreme outliers. Like the other acids, there appears to be a correlation between citric acid and wine quality.

Residual Sugar

The boxplot reveals several extreme outliers. There does not seem to be an association between residual sugar and wine quality.

Chlorides

The boxplot reveals several extreme outliers. There does not seem to be a strong association between chlorides and wine quality.

Free Sulfur Dioxide

There are not many outliers after transforming with the log function.
There does not seem to be an association between free sulfur dioxide and wine quality.

Total Sulfur Dioxide

There are not many outliers after transforming with the log function.
There does not seem to be an association between total sulfur dioxide and wine quality.

Density

This variable already has a pretty normal distribution. I will not alter the data. The boxplot reveals several extreme outliers.

The median density of good quality wines does seem to be slightly less than that of average or poor quality wines. However, the numbers are all very close and there isn’t a clear association between density and wine quality.

pH

This variable already has a pretty normal distribution. I will not alter the data. The boxplot reveals several extreme outliers. There does not seem to be a strong association between pH and wine quality.

Sulphates

There is a very long tail on this histogram.
Removing the outliers may reveal a normal distribution without transforming the numbers. The boxplot reveals several extreme outliers. There does appear to be an association between sulphates and wine quality.

Alcohol

Neither the log nor the sqrt function helps to transform this histogram. It has a positively skewed distribution. The boxplot reveals several extreme outliers. There does appear to be an association between alcohol and wine quality. The median values for both poor and average quality wines are lower than the median value for good quality wine.
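The median comparison read off the boxplots can also be computed directly. This sketch uses the first ten rows from the `str()` output, so the numbers are illustrative only and do not represent the full data set. The name `quality_level` and the grouping breaks are assumptions.

```r
# First ten alcohol and quality values, copied from the str() output.
alcohol <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
quality <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# Same grouping as the factored quality variable: poor (<5), average (5-6), good (7+).
quality_level <- cut(quality, breaks = c(0, 4, 6, 10),
                     labels = c("poor", "average", "good"))

# Median alcohol per group; a group with no rows yields NA.
median_alcohol <- sapply(split(alcohol, quality_level), median)
median_alcohol
```

In this sample the good wines have a median of 9.75 against 9.6 for the average wines, and the poor group is empty. The same call on the full columns would give the three medians drawn in the boxplot.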

From the boxplots on the 11 variables, we see some correlations between quality and the following:

fixed acidity, volatile acidity, citric acid, sulphates, alcohol and possibly residual sugar

I will run correlation coefficients to confirm or deny my theories about these variables.

Correlation Coefficient Test:

##        fixed acidity     volatile acidity          citric acid 
##           0.12405165          -0.39055778           0.22637251 
##       residual sugar            chlorides  free sulfur dioxide 
##           0.01373164          -0.12890656          -0.05065606 
## total sulfur dioxide              density                   pH 
##          -0.18510029          -0.17491923          -0.05773139 
##            sulphates              alcohol 
##           0.25139708           0.47616632

According to the correlation test, the variables with the strongest correlations to quality are volatile acidity, citric acid, sulphates and alcohol.
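A sketch of how coefficients like these can be computed, assuming Pearson correlation via `cor()` (the report does not say which function or method was used). It runs on the first ten rows from the `str()` output with three of the predictors, so its results will not match the table above. The names `wine_head`, `predictors` and `cors` are illustrative.

```r
# First ten rows of four columns, copied from the str() output.
wine_head <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  sulphates        = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Correlate every column except quality with quality (Pearson by default).
predictors <- setdiff(names(wine_head), "quality")
cors <- sapply(wine_head[predictors], cor, y = wine_head$quality)

# Sort by absolute strength so the strongest relationships come first.
cors[order(abs(cors), decreasing = TRUE)]
```

Sorting by absolute value makes it easier to see which variables matter most regardless of sign, which is how the table above is read in the text.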

A few other variables have weaker correlations with quality, with absolute values between roughly 0.12 and 0.19. They are fixed acidity, chlorides, total sulfur dioxide and density.

I will look at and plot some of the relationships between these variables and quality.

Multivariate Analysis

In this scatterplot I observed that the good quality wines grouped together in the upper left-hand corner. This would suggest that wines with higher citric acid and lower volatile acidity are of better quality.

In this plot, many of the poor quality wines have higher volatile acidity and low sulphates. Although the good quality wines show quite a bit of variance, many are plotted with a sulphate level near 0.75 and volatile acidity under 0.5.

Good quality wines appear to have a higher alcohol content and low volatile acidity.

This plot seems to show no association between citric acid, sulphates and wine quality.

Most of the good quality wines are grouped together in the upper right corner of this scatterplot, while several of the poor quality wines are in the lower left corner.

As with the previous plot, most of the good quality wines are grouped together in the upper right corner of this scatterplot, while several of the poor quality wines are in the lower left corner.

Final Plots and Summaries

This is the polished histogram for the density variable. The variations in density are very small. When I adjusted the bin size, the normal distribution became much clearer.

You can see on this scatterplot how higher alcohol content and higher sulphates correlate to the better quality wines.

I thought it was interesting to overlay the box plots with the data points to get a better visualization of where the points lie in relation to the boxes. This plot is the alcohol content plotted by the quality rankings.

Reflections

The wine quality data contains information on 1599 different wines with 11 chemical property variables. This data set was made available via Cortez et al., 2009.

Not having any personal interest in data about the quality of red wine, I didn’t expect to be particularly interested in the results of this data analysis. To my surprise, it was actually quite interesting trying to figure out how to transform the variables and find correlations between them. I struggled quite a bit with the scatterplots, and it took a lot of thought and experimentation to find the relationships between all the variables. I don’t think that the frequency polygons were very useful. I don’t feel that I was able to glean any additional information from them. The boxplots seemed to be most helpful, for me, in being able to visualize the data. I would love to explore a larger data set with a more equal number of poor, average and good quality wines.