library(ggplot2)
library(gridExtra)
library(readr)
library(dplyr)
1 - fixed acidity: most acids involved with wine are fixed or nonvolatile (do not evaporate readily)
2 - volatile acidity: the amount of acetic acid in wine, which at too high of levels can lead to an unpleasant, vinegar taste
3 - citric acid: found in small quantities, citric acid can add ‘freshness’ and flavor to wines
4 - residual sugar: the amount of sugar remaining after fermentation stops, it’s rare to find wines with less than 1 gram/liter and wines with greater than 45 grams/liter are considered sweet
5 - chlorides: the amount of salt in the wine
6 - free sulfur dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine
7 - total sulfur dioxide: amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine
8 - density: the density of wine is close to that of water depending on the percent alcohol and sugar content
9 - pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale
10 - sulphates: a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant
11 - alcohol: the percent alcohol content of the wine
Looking at the data set and table structure
## spec_tbl_df[,13] [1,599 x 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ x : num [1:1599] 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num [1:1599] 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num [1:1599] 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num [1:1599] 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num [1:1599] 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num [1:1599] 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num [1:1599] 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num [1:1599] 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num [1:1599] 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num [1:1599] 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num [1:1599] 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num [1:1599] 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : num [1:1599] 5 5 5 6 5 5 5 7 7 5 ...
## - attr(*, "spec")=
## .. cols(
## .. x = col_double(),
## .. fixed.acidity = col_double(),
## .. volatile.acidity = col_double(),
## .. citric.acid = col_double(),
## .. residual.sugar = col_double(),
## .. chlorides = col_double(),
## .. free.sulfur.dioxide = col_double(),
## .. total.sulfur.dioxide = col_double(),
## .. density = col_double(),
## .. pH = col_double(),
## .. sulphates = col_double(),
## .. alcohol = col_double(),
## .. quality = col_double()
## .. )
## x fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00 Min. : 6.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00 1st Qu.: 22.00
## Median : 2.200 Median :0.07900 Median :14.00 Median : 38.00
## Mean : 2.539 Mean :0.08747 Mean :15.87 Mean : 46.47
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00 3rd Qu.: 62.00
## Max. :15.500 Max. :0.61100 Max. :72.00 Max. :289.00
## density pH sulphates alcohol
## Min. :0.9901 Min. :2.740 Min. :0.3300 Min. : 8.40
## 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500 1st Qu.: 9.50
## Median :0.9968 Median :3.310 Median :0.6200 Median :10.20
## Mean :0.9967 Mean :3.311 Mean :0.6581 Mean :10.42
## 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300 3rd Qu.:11.10
## Max. :1.0037 Max. :4.010 Max. :2.0000 Max. :14.90
## quality
## Min. :3.000
## 1st Qu.:5.000
## Median :6.000
## Mean :5.636
## 3rd Qu.:6.000
## Max. :8.000
## [1] "x" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality"
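The structure, summary, and column listing above are the kind of output produced by `str()`, `summary()`, and `names()`. A minimal sketch follows. The file name `wineQualityReds.csv` is an assumption based on the data set name, the object name `wineInfo` is borrowed from the plotting code later in this report, and a three-row stand-in file is written to a temporary path so the snippet runs without the real data.

```r
library(readr)

# Stand-in for the real file: the header plus the first three rows shown in
# the str() output above (density is rounded there, so it is rounded here too).
csv_lines <- c(
  "x,fixed.acidity,volatile.acidity,citric.acid,residual.sugar,chlorides,free.sulfur.dioxide,total.sulfur.dioxide,density,pH,sulphates,alcohol,quality",
  "1,7.4,0.70,0.00,1.9,0.076,11,34,0.998,3.51,0.56,9.4,5",
  "2,7.8,0.88,0.00,2.6,0.098,25,67,0.997,3.20,0.68,9.8,5",
  "3,7.8,0.76,0.04,2.3,0.092,15,54,0.997,3.26,0.65,9.8,5"
)
path <- tempfile(fileext = ".csv")
writeLines(csv_lines, path)

# With the real data this would be read_csv("wineQualityReds.csv") (assumed path).
wineInfo <- read_csv(path, show_col_types = FALSE)

str(wineInfo)      # column types and first values
summary(wineInfo)  # five-number summary plus mean for each column
names(wineInfo)    # column names
```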
The “wineQualityReds” data set contains 11 variables (associated with the chemical properties of wine) and 1599 observations.
I will be using the quality levels of the wine to compare the other variables. I would like to find some correlations between these variables and good quality red wines. I will build a factored variable to accomplish this.
## [1] "poor" "average" "good"
## poor average good
## 63 1319 217
Of the 1599 observations, 63 are rated as poor quality (below 5), 1319 are rated as average quality (5 or 6) and 217 are rated as good quality (7 or above).
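A factored variable like this can be built with `cut()`. The sketch below assumes the split is poor for scores below 5, average for 5 or 6, and good for 7 or above; the vector `quality` and the name `quality_level` are hypothetical stand-ins, not names taken from the original code.

```r
# Hypothetical quality scores standing in for the quality column.
quality <- c(3, 4, 5, 6, 7, 8, 5, 6)

# cut() uses right-closed intervals by default: (0,4], (4,6], (6,10].
quality_level <- cut(quality,
                     breaks = c(0, 4, 6, 10),
                     labels = c("poor", "average", "good"))

levels(quality_level)  # the three level names, in order
table(quality_level)   # number of wines per level
```

Because the breaks are right-closed, a score of exactly 4 falls in "poor" and a score of exactly 6 falls in "average".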
Several of the histograms show long tails and outliers. I will take a closer look at these and attempt to transform these to a more normal distribution and reduce the outliers.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.6628 0.8513 0.8976 0.9112 0.9638 1.2014
No negative numbers or infinity. Log function gives the histogram a more normal distribution. Modified the x-axis to only show the range between 0.7 and 1.05.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.9208 -0.4089 -0.2840 -0.3034 -0.1938 0.1987
The summary shows negative numbers, so I will add 1 to the variable before taking the log. The log function does give the histogram a more normal distribution. Modified the x-axis to only show the range between 0.05 and 0.3.
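One reading of the step above is that 1 is added to the variable before the log is taken, which keeps every transformed value non-negative. The summary values are consistent with a base-10 log, so `log10()` is assumed here. The three inputs are the minimum, median, and maximum of volatile acidity from the earlier summary table.

```r
# Minimum, median, and maximum of volatile acidity from the summary table.
volatile_acidity <- c(0.12, 0.52, 1.58)

log_plain   <- log10(volatile_acidity)      # negative for values below 1
log_shifted <- log10(volatile_acidity + 1)  # non-negative for values >= 0

round(log_plain, 4)
round(log_shifted, 4)
```

The shift changes the shape of the transform slightly, so it is a convenience for plotting rather than an exact equivalent of the plain log.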
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.090 0.260 0.271 0.420 1.000
The log function does not help to transform this histogram.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.04576 0.27875 0.34242 0.36925 0.41497 1.19033
The summary shows negative numbers, so I will add 1 to the variable before taking the log. The log function does give the histogram a more normal distribution. Modified the x-axis to reduce the long tail.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100
There is a very long tail on this histogram.
Removing the outliers does give the histogram a more normal distribution. Modified the x-axis to reduce the long tail.
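There are two common ways to deal with a long tail like this, sketched below on simulated data. The 95th percentile cutoff and the simulated `chlorides` vector are assumptions for illustration; the report does not say which cutoff was used.

```r
set.seed(1)
# Hypothetical right-skewed values standing in for the chlorides column.
chlorides <- rlnorm(500, meanlog = log(0.08), sdlog = 0.4)

# Option 1: drop the values above a chosen percentile (changes the data).
cutoff  <- unname(quantile(chlorides, 0.95))
trimmed <- chlorides[chlorides <= cutoff]

# Option 2: keep every value and only zoom the axis (data unchanged).
hist(chlorides, breaks = 40, xlim = c(0, cutoff),
     main = "Chlorides (zoomed)", xlab = "chlorides")
```

In ggplot2, `coord_cartesian(xlim = ...)` gives the second behaviour: it zooms the view without removing rows, so the bin counts are still computed from the full data.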
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.8451 1.1461 1.1058 1.3222 1.8573
The minimum here is 0 rather than negative, but I will add 1 to the variable before taking the log as before. The log function gives the histogram a more normal distribution.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.7782 1.3424 1.5798 1.5638 1.7924 2.4609
No negative numbers or infinity. The log function gives the histogram a more normal distribution.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9901 0.9956 0.9968 0.9967 0.9978 1.0037
This variable already has a pretty normal distribution. I will not transform the data. Removing the outliers gives the histogram a slightly more normal distribution. Modified the x-axis to reduce the tails on both sides.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.740 3.210 3.310 3.311 3.400 4.010
This variable already has a pretty normal distribution. I will not alter the data. Removing the outliers gives the histogram a slightly more normal distribution. Modified the x-axis to reduce the tails on both sides.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.5500 0.6200 0.6581 0.7300 2.0000
There is a very long tail on this histogram.
Removing the outliers gives the histogram a more normal distribution. Modified the x-axis to reduce the long tail.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.20 10.42 11.10 14.90
Neither the log nor the sqrt function helps to transform this histogram. It has a positively skewed distribution. Removing the outliers does give the histogram a more normal distribution. Modified the x-axis to reduce the long tail.
The boxplot for fixed acidity shows some extreme outliers. Modified the axis to only show the range between 0.7 and 1.05. A closer look at the boxplot reveals that the median fixed acidity of good quality wines is higher than that of average and poor quality wines. There may be a correlation between this variable and the quality of wine. Further investigation may be needed.
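The medians read off a boxplot can also be computed directly per quality group. This is a minimal sketch with dplyr; the rows are invented for illustration and `quality_level` is an assumed name for the factored quality variable.

```r
library(dplyr)

# Hypothetical rows; with the real data this would be the wine table with
# the factored quality variable attached.
wines <- data.frame(
  fixed.acidity = c(7.4, 7.8, 8.1, 7.9, 8.3, 9.0, 8.8, 9.4, 10.2),
  quality_level = factor(rep(c("poor", "average", "good"), each = 3),
                         levels = c("poor", "average", "good"))
)

# One row per quality level with the group median.
medians <- wines %>%
  group_by(quality_level) %>%
  summarise(median_fixed_acidity = median(fixed.acidity), .groups = "drop")

medians
```

Comparing the numbers this way avoids judging small differences by eye when the boxes overlap.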
The boxplot for volatile acidity shows a few extreme outliers. As with fixed acidity, there may be a correlation with wine quality, which warrants further investigation.
The boxplot for citric acid shows a few extreme outliers. Like the other acids, there appears to be a correlation between citric acid and wine quality.
The boxplot reveals several extreme outliers. There does not seem to be an association between residual sugar and wine quality.
The boxplot reveals several extreme outliers. There does not seem to be a strong association between chlorides and wine quality.
There are not many outliers after transforming with log function.
There does not seem to be an association between free sulfur dioxide and wine quality.
There are not many outliers after transforming with log function.
There does not seem to be an association between total sulfur dioxide and wine quality.
This variable already has a pretty normal distribution. I will not alter the data. The boxplot reveals several extreme outliers.
The median density of good quality wines does seem to be slightly less than that of average or poor quality wines. However, the numbers are all very close and there isn’t a clear association between density and wine quality.
This variable already has a pretty normal distribution. I will not alter the data. The boxplot reveals several extreme outliers. There does seem to be an association between pH and wine quality.
There is a very long tail on this histogram.
Removing the outliers may reveal a normal distribution without transforming the numbers. The boxplot reveals several extreme outliers. There does appear to be an association between sulphates and wine quality.
Neither the log nor the sqrt function helps to transform this histogram. It has a positively skewed distribution. The boxplot reveals several extreme outliers. There does appear to be an association between alcohol and wine quality. The median values for both poor and average quality wines are lower than the median value for good quality wines.
From the boxplots on the 11 variables, we see some correlations between quality and the following:
fixed acidity, volatile acidity, citric acid, sulphates, alcohol and possibly residual sugar
I will run correlation coefficients to confirm or deny my theories about these variables.
## fixed acidity volatile acidity citric acid
## 0.12405165 -0.39055778 0.22637251
## residual sugar chlorides free sulfur dioxide
## 0.01373164 -0.12890656 -0.05065606
## total sulfur dioxide density pH
## -0.18510029 -0.17491923 -0.05773139
## sulphates alcohol
## 0.25139708 0.47616632
According to the correlation test, the variables with the strongest correlations to quality are alcohol (0.48), volatile acidity (-0.39), sulphates (0.25) and citric acid (0.23).
A few other variables have weaker correlations to quality: total sulfur dioxide (-0.19), density (-0.17), chlorides (-0.13) and fixed acidity (0.12).
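Coefficients like those listed above can be computed by running `cor.test()` for each variable against quality. The sketch below uses a handful of invented rows and only two predictors; the data frame name `wines` is hypothetical, and Pearson correlation (the `cor.test()` default) is assumed.

```r
# Hypothetical rows; with the real data, `wines` would be the full wine table.
wines <- data.frame(
  alcohol          = c(9.4, 9.8, 10.0, 10.5, 11.2, 12.0, 12.8),
  volatile.acidity = c(0.70, 0.88, 0.65, 0.50, 0.42, 0.36, 0.30),
  quality          = c(5, 5, 5, 6, 6, 7, 8)
)

predictors <- setdiff(names(wines), "quality")

# Pearson correlation of each predictor with quality.
correlations <- sapply(predictors, function(v) {
  unname(cor.test(wines[[v]], wines$quality)$estimate)
})

round(correlations, 3)
```

The sign shows the direction of the relationship and the magnitude its strength, so a negative value such as the one for volatile acidity means quality tends to fall as the variable rises.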
I will look at and plot some of the relationships between these variables and quality.
In this scatterplot I observed that the good quality wines grouped together in the upper left hand corner. This would suggest that wines with higher citric acid and lower volatile acidity are a better quality.
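A scatterplot of this kind can be sketched as follows, coloring each point by its quality level. The axis assignment (volatile acidity on x, citric acid on y) is inferred from the description of the upper left corner. The rows and the name `quality_level` are invented for illustration.

```r
library(ggplot2)

# Hypothetical rows standing in for the wine table.
wines <- data.frame(
  volatile.acidity = c(0.70, 0.88, 0.60, 0.50, 0.35, 0.30),
  citric.acid      = c(0.00, 0.02, 0.10, 0.25, 0.45, 0.50),
  quality_level    = factor(c("poor", "poor", "average", "average", "good", "good"),
                            levels = c("poor", "average", "good"))
)

# Color encodes the quality level; alpha reduces overplotting on real data.
p <- ggplot(wines, aes(x = volatile.acidity, y = citric.acid,
                       color = quality_level)) +
  geom_point(alpha = 0.6, size = 2) +
  labs(x = "Volatile acidity", y = "Citric acid", color = "Quality") +
  ggtitle("Citric Acid vs. Volatile Acidity by Quality")

print(p)
```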
In this plot, many of the poor quality wines have higher volatile acidity and low sulphates. Although the good quality wines show quite a bit of variance, many are plotted with a sulphate level near 0.75 and volatile acidity under 0.5.
Good quality wines appear to have a higher alcohol content and low volatile acidity.
This plot shows no apparent association between citric acid, sulphates and wine quality.
Most of the good quality wines are grouped together in the upper right corner of this scatterplot, while several of the poor quality wines are in the lower left corner.
As with the previous plot, most of the good quality wines are grouped together in the upper right corner of this scatterplot, while several of the poor quality wines are in the lower left corner.
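# Polished histogram of density; wineInfo is the wine data frame
ggplot(wineInfo, aes(x = density)) +
  geom_histogram(color = "white",
                 fill = "orange",
                 binwidth = 0.00025) +  # narrow bins, since density varies so little
  ggtitle("Density Histogram") +
  coord_cartesian(xlim = c(0.9925, 1.001))  # zoom the x-axis without dropping rows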
This is the polished histogram for the density variable. Density varies only within a very narrow range, so once I reduced the bin width the roughly normal distribution became much easier to see.
You can see on this scatterplot how higher alcohol content and higher sulphates correlate to the better quality wines.
I thought it was interesting to overlay the box plots with the data points to get a better visualization of where the points lie in relation to the boxes. This plot shows the alcohol content plotted by the quality rankings.
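An overlay like this can be sketched by adding a jittered point layer on top of the boxplot layer. The data below are simulated, and the name `quality_level`, the jitter width, and the colors are assumptions rather than the settings used for the original plot.

```r
library(ggplot2)

set.seed(42)
# Simulated alcohol values for three unequal quality groups.
wines <- data.frame(
  quality_level = factor(rep(c("poor", "average", "good"), times = c(10, 30, 15)),
                         levels = c("poor", "average", "good")),
  alcohol = c(rnorm(10, 10.0, 0.6), rnorm(30, 10.3, 0.8), rnorm(15, 11.5, 0.9))
)

p <- ggplot(wines, aes(x = quality_level, y = alcohol)) +
  # Hide the boxplot's own outlier markers; the jitter layer draws every point.
  geom_boxplot(outlier.shape = NA) +
  geom_jitter(width = 0.2, alpha = 0.4, color = "orange") +
  labs(x = "Quality", y = "Alcohol (% by volume)") +
  ggtitle("Alcohol by Quality Level")

print(p)
```

Hiding the boxplot outliers avoids drawing the same observation twice, once as an outlier marker and once as a jittered point.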
The wine quality data contains information on 1599 different wines with 11 chemical property variables. This data set was made available via Cortez et al., 2009.
Not having any personal interest in data about the quality of red wine, I didn’t expect to be particularly interested in the results of this data analysis. To my surprise, it was actually quite interesting trying to figure out how to transform the variables and find correlations between them. I struggled quite a bit with the scatterplots, and it took a lot of thought and experimentation to find the relationships between all the variables. I don’t think that the frequency polygons were very useful; I don’t feel that I was able to glean any additional information from them. The boxplots were the most helpful for me in visualizing the data. I would love to explore a larger data set with a more equal number of poor, average and good quality wines.