The goal of this project is to conduct exploratory data analysis on a dataset containing wine information and to explore the variables, structure, patterns, oddities, and underlying relationships in the dataset. I will try to answer the following questions:
Which chemical properties are correlated?
Is there a relation between quality and the alcohol level?
Are there any parameters which strongly influence the alcohol level in wine?
The project uses the white wine dataset published by P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. The dataset is available at Elsevier, in Pre-press (pdf), and bib. Let's load the dataset and look at its structure:
#load libraries
library(ggplot2)
library(GGally)
library(scales)
library(memisc)
library(gridExtra)
library(RColorBrewer)
#load dataset
white_wine <- read.csv('wineQualityWhites.csv')
#check structure
str(white_wine)
## 'data.frame': 4898 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
This dataset includes 4898 observations of 13 variables. The first variable (X) is just the observation number. The other variables are as follows:
fixed acidity: a fundamental property of wine, imparting sourness and resistance to microbial infection (tartaric acid - g / dm^3)
volatile acidity: the amount of acetic acid in wine, which at too high levels can lead to an unpleasant, vinegar taste (acetic acid - g / dm^3)
citric acid: found in small quantities, citric acid can add ‘freshness’ and flavor to wines (g / dm^3)
residual sugar: the amount of sugar remaining after fermentation stops, it’s rare to find wines with less than 1 gram/liter and wines with greater than 45 grams/liter are considered sweet (g / dm^3)
chlorides: the amount of salt in the wine (sodium chloride - g / dm^3)
free sulfur dioxide: the free form of SO2 existing in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion, which prevents microbial growth and wine oxidation (mg / dm^3)
total sulfur dioxide: used as a preservative because of its anti-oxidative and anti-microbial properties (mg / dm^3)
density: wine density (g / cm^3)
pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic)
sulphates: a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant (potassium sulphate - g / dm^3)
alcohol: the percent of alcoholic content (% by volume)
quality (score between 0 and 10)
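Since X only numbers the rows, it can be left out before computing summaries or correlations. A minimal sketch, using the first two rows of the dataset as a stand-in for the full data frame (the names first_rows and wine_vars are hypothetical):

```r
# Stand-in for the full dataset: the first two rows of wineQualityWhites.csv
first_rows <- data.frame(
  X = c(1L, 2L),
  fixed.acidity = c(7.0, 6.3),
  volatile.acidity = c(0.27, 0.30),
  citric.acid = c(0.36, 0.34),
  residual.sugar = c(20.7, 1.6),
  chlorides = c(0.045, 0.049),
  free.sulfur.dioxide = c(45, 14),
  total.sulfur.dioxide = c(170, 132),
  density = c(1.0010, 0.9940),
  pH = c(3.00, 3.30),
  sulphates = c(0.45, 0.49),
  alcohol = c(8.8, 9.5),
  quality = c(6L, 6L)
)

# Keep every column except the row index X
wine_vars <- first_rows[, setdiff(names(first_rows), "X")]
names(wine_vars)
ncol(wine_vars)
```

Selecting columns by name rather than by position avoids accidentally keeping X or dropping quality.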
head(white_wine)
## X fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 1 7.0 0.27 0.36 20.7 0.045
## 2 2 6.3 0.30 0.34 1.6 0.049
## 3 3 8.1 0.28 0.40 6.9 0.050
## 4 4 7.2 0.23 0.32 8.5 0.058
## 5 5 7.2 0.23 0.32 8.5 0.058
## 6 6 8.1 0.28 0.40 6.9 0.050
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol
## 1 45 170 1.0010 3.00 0.45 8.8
## 2 14 132 0.9940 3.30 0.49 9.5
## 3 30 97 0.9951 3.26 0.44 10.1
## 4 47 186 0.9956 3.19 0.40 9.9
## 5 47 186 0.9956 3.19 0.40 9.9
## 6 30 97 0.9951 3.26 0.44 10.1
## quality
## 1 6
## 2 6
## 3 6
## 4 6
## 5 6
## 6 6
summary(white_wine$quality)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.878 6.000 9.000
table(white_wine$quality)
##
## 3 4 5 6 7 8 9
## 20 163 1457 2198 880 175 5
ggplot(data=white_wine, aes(x=quality))+
#constant colours are set outside aes() so they are not treated as a mapped variable
geom_bar(colour="red", fill="red")+
ggtitle('Distribution of quality')+
xlab("Wine quality")+ylab("Count")
The distribution of quality is roughly bell-shaped, with a mean of 5.9 on a scale of 0 to 10. The scores in the provided dataset range from 3 to 9.
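The shape of this distribution can be checked numerically from the counts printed by table(white_wine$quality). A small sketch that rebuilds the scores from those counts (the variable names are hypothetical):

```r
# Counts per quality score, copied from the table() output above
counts <- c("3" = 20, "4" = 163, "5" = 1457, "6" = 2198,
            "7" = 880, "8" = 175, "9" = 5)

# Rebuild one score per wine from the counts
quality <- rep(as.integer(names(counts)), times = counts)

n <- length(quality)
mean_quality <- mean(quality)

# Percentage of wines at each score
shares <- round(100 * counts / sum(counts), 1)

# Share of wines scored 5, 6 or 7
middle_share <- sum(counts[c("5", "6", "7")]) / sum(counts)

n
mean_quality
shares
middle_share
```

The rebuilt vector has 4898 entries and a mean that matches the summary() output, and roughly 93% of the wines fall in the scores 5 to 7, so the extremes are thinly represented.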
ggplot(data=white_wine, aes(x=alcohol))+
geom_histogram(color='red', fill='red', alpha=0.5)+
scale_x_continuous(breaks = seq(0,15,0.5))+
ggtitle('Distribution of alcohol in dataset')
summary(white_wine$alcohol)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.50 10.40 10.51 11.40 14.20
This distribution is bimodal, with a median of 10.4 and a mean of 10.5. The most frequent alcohol level is between 9% and 9.5%.
Summary for fixed acidity
summary(white_wine$fixed.acidity)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.800 6.300 6.800 6.855 7.300 14.200
Summary for volatile acidity
summary(white_wine$volatile.acidity)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0800 0.2100 0.2600 0.2782 0.3200 1.1000
Summary for citric acid
summary(white_wine$citric.acid)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.2700 0.3200 0.3342 0.3900 1.6600
h1 <- ggplot(data=white_wine, aes(x=fixed.acidity))+
geom_histogram(binwidth=0.1 ,color='red', fill='red', alpha=0.5)+
#limits and breaks are set in one scale call, because a separate xlim() replaces the scale and drops the breaks
scale_x_continuous(limits = c(3,11), breaks = seq(3,11,1))+
ggtitle('fixed acidity')
h2 <- ggplot(data=white_wine, aes(x=volatile.acidity))+
geom_histogram(binwidth=0.01 ,color='red', fill='red', alpha=0.5)+
scale_x_continuous(limits = c(0,0.7), breaks = seq(0,0.7,0.1))+
ggtitle('volatile acidity')
h3 <- ggplot(data=white_wine, aes(x=citric.acid))+
geom_histogram(binwidth=0.01 ,color='red', fill='red', alpha=0.5)+
scale_x_continuous(limits = c(0,0.9), breaks = seq(0,0.9,0.1))+
ggtitle('citric acid')
#showing a grid of 3 histograms
grid.arrange(h1,h2,h3, ncol=1)
All three distributions look approximately normal over the plotted ranges.
Summary for free sulfur dioxide
summary(white_wine$free.sulfur.dioxide)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 23.00 34.00 35.31 46.00 289.00
Summary for total sulfur dioxide
summary(white_wine$total.sulfur.dioxide)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.0 108.0 134.0 138.4 167.0 440.0
Plotting histograms for free sulfur dioxide and total sulfur dioxide
h4 <- ggplot(data=white_wine, aes(x=free.sulfur.dioxide))+
geom_histogram(binwidth=1 ,color='red', fill='red', alpha=0.5)+
#limits and breaks are set in one scale call, because a separate xlim() replaces the scale and drops the breaks
scale_x_continuous(limits = c(0,289), breaks = seq(0,289,20))+
ggtitle('free sulfur dioxide')
h5 <- ggplot(data=white_wine, aes(x=total.sulfur.dioxide))+
geom_histogram(binwidth=1 ,color='red', fill='red', alpha=0.5)+
scale_x_continuous(limits = c(0,440), breaks = seq(0,440,40))+
ggtitle('total sulfur dioxide')
#showing a grid of 2 histograms
grid.arrange(h4,h5, ncol=1)
Both distributions are positively skewed.
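The skew can also be read off the summary() output without a plot: in a positively skewed variable the mean sits above the median and the upper tail is longer than the lower one. A rough sketch using the figures printed above (the data frame and column names are hypothetical):

```r
# Five-number summaries and means copied from the summary() output above
summaries <- data.frame(
  variable = c("free.sulfur.dioxide", "total.sulfur.dioxide"),
  min      = c(2, 9),
  q1       = c(23, 108),
  median   = c(34, 134),
  mean     = c(35.31, 138.4),
  q3       = c(46, 167),
  max      = c(289, 440)
)

# Positive when the mean is pulled up by large values
summaries$mean_minus_median <- summaries$mean - summaries$median

# Length of each tail, measured from the nearest quartile
summaries$lower_tail <- summaries$q1 - summaries$min
summaries$upper_tail <- summaries$max - summaries$q3

summaries[, c("variable", "mean_minus_median", "lower_tail", "upper_tail")]
```

For both variables the mean exceeds the median and the upper tail is longer than the lower tail, which agrees with the histograms.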
summary(white_wine$density)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9917 0.9937 0.9940 0.9961 1.0390
ggplot(data=white_wine, aes(x=density))+
geom_histogram(binwidth=0.0001 ,color='red', fill='red', alpha=0.5)+
ggtitle('Density')
Most of the distribution is normal, with a few outliers.
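The outliers here were judged by eye. One common rule of thumb, which is an assumption of this sketch and not something the analysis above applies, is Tukey's 1.5 x IQR fence. It can be evaluated directly from the summary() figures for density:

```r
# Values copied from summary(white_wine$density)
dens_min <- 0.9871
dens_q1  <- 0.9917
dens_q3  <- 0.9961
dens_max <- 1.0390

# Tukey's fences: 1.5 interquartile ranges beyond the quartiles
iqr <- dens_q3 - dens_q1
lower_fence <- dens_q1 - 1.5 * iqr
upper_fence <- dens_q3 + 1.5 * iqr

c(lower_fence = lower_fence, upper_fence = upper_fence)

# Is the maximum above the upper fence? Is the minimum below the lower one?
dens_max > upper_fence
dens_min < lower_fence
```

Under this rule the maximum density lies above the upper fence while the minimum stays inside the lower fence, so the outliers are on the high side only.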
ggplot(data=white_wine, aes(x=residual.sugar))+
geom_histogram(binwidth=0.1 ,color='red', fill='red', alpha=0.5)+
ggtitle('Residual sugar in wine')
Zoom in on the levels from 0 to 4:
ggplot(data=white_wine, aes(x=residual.sugar))+
geom_histogram(binwidth=0.1 ,color='red', fill='red', alpha=0.5)+
xlim(0,4)+
ggtitle('Residual sugar in wine, limit up to 4')
The distribution looks normal in this part, but there are some outliers with higher values. Let’s look at other variables:
ggplot(data=white_wine, aes(x=chlorides))+
geom_histogram(binwidth=0.005 ,color='red', fill='red', alpha=0.5)+
ggtitle('Chlorides in wine')
The distribution is mostly normal, with a few outliers.
ggplot(data=white_wine, aes(x=sulphates))+
geom_histogram(binwidth=0.01 ,color='red', fill='red', alpha=0.5)+
ggtitle('Sulphates in wine')
ggplot(data=white_wine, aes(x=pH))+
geom_histogram(binwidth=0.01 ,color='red', fill='red', alpha=0.5)+
ggtitle('pH factor in wine')
pH is roughly normally distributed.
Bivariate analysis will start with plotting ggpairs and correlation heat maps.
theme_set(theme_minimal(11))
g <- ggpairs(data = white_wine,
columns=2:13, #skip the row index X and include quality
lower = list(
continuous = wrap("smooth", alpha = 0.5, color = "red")),
upper = list(continuous = wrap("cor", size = 2))
)
# Applying a theme
g <- g + theme(
axis.text = element_text(size = 5),
axis.title = element_text(size = 5),
legend.background = element_rect(fill = "white"),
panel.grid.major = element_line(colour = NA),
panel.grid.minor = element_blank(),
panel.background = element_rect(fill = "grey95")
)
# Showing the plot
print(g, bottomHeightProportion = 0.5, leftWidthProportion = .5)
ggcorr(white_wine[, 2:13], layout.exp = 1)
Alternatively, a single correlation can be computed as: cor(white_wine$residual.sugar, white_wine$density)
The strongest correlations between pairs of variables are:
residual.sugar vs density 0.839
free.sulfur.dioxide vs total.sulfur.dioxide 0.616 (will ignore, as one contains the other)
density vs total.sulfur.dioxide 0.53
density vs alcohol -0.78
residual.sugar vs alcohol -0.451
total.sulfur.dioxide vs alcohol -0.449
The strongest correlations with quality are:
alcohol 0.436
density -0.307
chlorides -0.21
volatile.acidity -0.195
Overall, correlations of individual variables with quality are not strong (all below 0.5 in absolute value), so I will need to look at combinations of variables to see if there is a connection with quality.
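A ranked list of pairwise correlations like the one above can be produced directly from cor() instead of being read off the plot. A minimal sketch on invented toy data (the data frame toy and its coefficients are made up for illustration and are not the wine data):

```r
set.seed(42)

# Toy data: density loosely driven by sugar (up) and alcohol (down)
n <- 200
sugar <- runif(n, 1, 20)
alcohol <- runif(n, 8, 14)
toy <- data.frame(
  residual.sugar = sugar,
  alcohol = alcohol,
  density = 0.99 + 0.0004 * sugar - 0.001 * (alcohol - 8) +
    rnorm(n, sd = 0.0005)
)

# Full correlation matrix, then keep each pair only once
corr <- cor(toy)
corr[upper.tri(corr, diag = TRUE)] <- NA

# Long format: one row per pair, sorted by absolute correlation
ranked <- as.data.frame(as.table(corr), stringsAsFactors = FALSE)
names(ranked) <- c("var1", "var2", "r")
ranked <- ranked[!is.na(ranked$r), ]
ranked <- ranked[order(-abs(ranked$r)), ]
ranked
```

On the real data the same steps would be applied to the numeric columns of white_wine, leaving out the row index X.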
I will write a function to make plotting each individual pair easier.
#plotting function for bivariate analysis
ploting_point <- function(dataset, x, y, title, xlab, ylab) {
ggplot(dataset, aes_string(x = x, y = y)) +
geom_point(alpha = 0.1, size = 1, color = 'blue') +
geom_smooth(method = 'lm', color = 'green') +
labs(title = title, x = xlab, y = ylab)
}
plot1 <- ploting_point(white_wine, "residual.sugar", "density",
"Correlation between sugar and density",
"Residual sugar", "Density")
plot1
plot2 <- ploting_point(white_wine, "free.sulfur.dioxide", "total.sulfur.dioxide",
"Correlation between free and total sulfur dioxide",
"free sulfur dioxide", "total sulfur dioxide")
plot2
plot3 <- ploting_point(white_wine, "density", "total.sulfur.dioxide",
"Correlation between density and total sulfur dioxide",
"density", "total sulfur dioxide")
plot3
density vs alcohol -0.78
plot4 <- ploting_point(white_wine, "density", "alcohol",
"Correlation between density and alcohol",
"density", "alcohol")
plot4
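Newer ggplot2 releases mark aes_string() as deprecated. A possible alternative, sketched here under the assumption of ggplot2 3.0 or later, keeps the string arguments but looks the columns up through the .data pronoun. The helper name plot_pair and the toy data are hypothetical:

```r
library(ggplot2)

# Hypothetical variant of the helper above; column names are still strings
plot_pair <- function(dataset, x, y, title, xlab, ylab) {
  ggplot(dataset, aes(x = .data[[x]], y = .data[[y]])) +
    geom_point(alpha = 0.1, size = 1, color = 'blue') +
    geom_smooth(method = 'lm', formula = y ~ x, color = 'green') +
    labs(title = title, x = xlab, y = ylab)
}

# Invented toy data, only to show the call
toy <- data.frame(
  density = c(0.991, 0.993, 0.995, 0.997, 0.999),
  alcohol = c(12.5, 11.8, 10.9, 10.1, 9.2)
)

p <- plot_pair(toy, "density", "alcohol",
               "Density vs alcohol (toy data)",
               "density", "alcohol")
```

The call signature is the same as before, so the existing calls would only need the function name changed.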
ggplot(aes(quality, alcohol), data = white_wine) +
geom_point(alpha = 0.1, position = position_jitter(h = 0),
size = 1, color = 'red') +
scale_x_continuous(breaks=seq(1,14,1)) +
labs(title = 'Correlation pair quality and alcohol', y = 'Alcohol', x = 'Quality')
And box plots for the same pair
box_plot <- ggplot(data = white_wine, aes(y = alcohol, x = quality, group = quality)) +
geom_boxplot(fill = "red", colour = "blue") +
scale_x_continuous(breaks=seq(1,14,1)) +
labs(title = 'Correlation between quality and alcohol',
y = 'Alcohol (%)', x = 'Quality')
box_plot
The median alcohol content moves up slightly with higher quality, but this is a weak relationship (note the outliers at quality 5).
From the ggpairs plot, quality is not strongly correlated with any of the variables. The strongest correlations are with alcohol (0.436), density (-0.307), chlorides (-0.21) and volatile.acidity (-0.195).
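The medians drawn in the box plot can be tabulated with tapply(), and a rank correlation gives a single number for the trend. A small sketch on invented values (the data frame toy is hypothetical and far smaller than the wine data):

```r
# Invented toy data: three wines at each of three quality scores
toy <- data.frame(
  quality = c(5, 5, 5, 6, 6, 6, 7, 7, 7),
  alcohol = c(9.2, 9.5, 10.1, 10.0, 10.4, 11.0, 11.2, 11.5, 12.3)
)

# Median alcohol per quality score, as drawn by the box plot centre lines
median_by_quality <- tapply(toy$alcohol, toy$quality, median)
median_by_quality

# Spearman rank correlation copes with the discrete quality scale
rho <- cor(toy$quality, toy$alcohol, method = "spearman")
rho
```

With the real data the same two calls would use white_wine$alcohol and white_wine$quality.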
summary(white_wine$density)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9917 0.9937 0.9940 0.9961 1.0390
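#include.lowest=TRUE keeps quality 3 in the first group; otherwise cut() turns it into NA
white_wine$quality.cut <- cut(white_wine$quality, breaks=c(3,5,6,9), include.lowest=TRUE)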
a <- ggplot(white_wine, aes(x = density, colour=quality.cut))
a + geom_density(aes(y = ..count..), fill="lightgrey") +
ggtitle('Density by quality')+
xlim(0.987,1.005)
For better visualization I grouped quality into 3 categories: [3-5], [6], [7-9]. The outcome is the following: better wines tend to have lower density.
The outcome of the density plot above can be confirmed with a violin plot.
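One detail of cut() matters for this grouping: intervals are right-closed by default and the smallest break is excluded, so a value equal to the lowest break becomes NA unless include.lowest = TRUE is set. A short sketch using the quality counts from the table() output earlier (the labels and variable names are hypothetical):

```r
# Rebuild the quality scores from the counts printed by table()
counts <- c("3" = 20, "4" = 163, "5" = 1457, "6" = 2198,
            "7" = 880, "8" = 175, "9" = 5)
quality <- rep(as.integer(names(counts)), times = counts)

# Default behaviour: (3,5], (5,6], (6,9], so every score of 3 is lost
default_cut <- cut(quality, breaks = c(3, 5, 6, 9))
sum(is.na(default_cut))

# Closing the first interval keeps all wines; labels name the three groups
closed_cut <- cut(quality, breaks = c(3, 5, 6, 9),
                  include.lowest = TRUE,
                  labels = c("3-5", "6", "7-9"))
table(closed_cut)
```

Without include.lowest the 20 wines rated 3 drop out as NA; with it the groups contain 1640, 2198 and 1060 wines, which adds up to all 4898 observations.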
ggplot(white_wine, aes(factor(quality), density)) +
geom_violin()
The above violin plot shows that in general, better wines have lower densities, with a few outliers that have a quality of 6. This relationship is strongest for wines rated to have the highest quality in the dataset (9), where nearly all of the wines with a rating of 9 have a density of about 0.99. The range in density for the other quality ratings has a larger spread.
ggplot(white_wine, aes(factor(quality), alcohol)) +
geom_violin()
Outcome 2: alcohol content tends to increase as wine quality gets higher.
Adding one more layer (quality) to the strongest correlation pairs
#plotting residual sugar density vs quality
ggplot(aes(x = residual.sugar, y = density, color = factor(quality)), data = white_wine) +
geom_point(alpha = 0.8, size = 1, position = position_jitter(h=0)) +
scale_color_brewer(type = 'seq', palette = 'OrRd',
guide = guide_legend(title = 'quality', reverse = TRUE, #to reverse the order of colors
override.aes = list(alpha = 0.8, size = 5))) +
ggtitle('resid.sugar , density and quality')+
ylim(0.99,1.005)+
xlim(0,25)
White wines of higher quality are slightly less dense for the same sugar level. I will split the plots by quality and add a linear model to each plot:
ggplot(data = white_wine,
aes(x = residual.sugar, y = density)) +
geom_point(alpha = 1/3, position = position_jitter(h = 0),
size = 2, color = 'red') +
facet_wrap(~quality) +
geom_smooth(method = 'lm', color = 'green') +
labs(title = 'Sugar vs density correlation by quality',
x = 'Sugar', y = 'Density')
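The green lines in the faceted plot are separate linear fits per quality score. Their intercepts and slopes can be extracted with lm() on each subset. A minimal sketch on invented toy data (the data frame toy and its coefficients are made up and are not estimates from the wine data):

```r
set.seed(1)

# Toy data: 50 wines at each of three quality scores
toy <- data.frame(
  quality = rep(c(5, 6, 7), each = 50),
  residual.sugar = runif(150, 1, 20)
)
toy$density <- 0.992 + 0.0004 * toy$residual.sugar -
  0.0005 * (toy$quality - 5) + rnorm(150, sd = 0.0005)

# One linear model per quality score, mirroring facet_wrap(~quality)
fits <- lapply(split(toy, toy$quality), function(d) {
  coef(lm(density ~ residual.sugar, data = d))
})

# One row per quality score: intercept and slope
coef_table <- do.call(rbind, fits)
coef_table
```

Comparing the intercepts across rows shows whether one group sits lower in density at the same sugar level, which is what the coloured scatter plot suggests visually.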
#plotting alcohol density vs quality
ggplot(aes(x = alcohol, y = density, color = factor(quality)), data = white_wine) +
geom_point(alpha = 0.8, size = 1, position = position_jitter(h=0)) +
scale_color_brewer(type = 'seq', palette = 'OrRd',
guide = guide_legend(title = 'quality', reverse = TRUE, #to reverse the order of colors
override.aes = list(alpha = 0.8, size = 5))) +
ggtitle('alcohol , density and quality')+
ylim(0.99,1.005)+
xlim(8,14)
Overall, it looks like wine has better quality if alcohol content is higher, with lower density.
ggplot(aes(x = alcohol, y = citric.acid, color = factor(quality)), data = white_wine) +
geom_point(alpha = 0.8, size = 1, position = position_jitter(h=0)) +
scale_color_brewer(type = 'seq', palette = 'OrRd',
guide = guide_legend(title = 'quality', reverse = TRUE, #to reverse the order of colors
override.aes = list(alpha = 0.8, size = 5))) +
ggtitle('alcohol , citric acid and quality')+
ylim(0.01,0.8)+
xlim(9,14)
Using the same quality groups as above: [3-5], [6], [7-9]
# Cut into intervals by quality (include.lowest=TRUE keeps quality 3 in the first group)
white_wine$quality.cut <- cut(white_wine$quality, breaks=c(3,5,6,9), include.lowest=TRUE)
ggplot(aes(x = alcohol, y = citric.acid, color = quality.cut), data = white_wine) +
geom_point(alpha = 0.5, size = 1.5, position = position_jitter(h=0)) +
scale_color_brewer(type = 'seq', palette = 'OrRd',
guide = guide_legend(title = 'Quality range', reverse = TRUE, #to reverse the order of colors
override.aes = list(alpha = 0.8, size = 5))) +
ggtitle('Alcohol , citric.acid and quality')+
ylim(0.1,0.6)+
xlim(9,14)
In this plot, higher-quality white wines mostly have citric acid levels between 0.2 and 0.5 g / dm^3 and alcohol content between 11% and 13%.
I will have a closer look at the above plots, adding extra variables and limiting the axes.
ggplot(data = white_wine,
aes(y = chlorides, x = alcohol, color = factor(quality.cut))) +
coord_cartesian(
ylim=c(0.012,0.075),
#excluding outliers on edges
xlim=c(quantile(white_wine$alcohol,.01),quantile(white_wine$alcohol,.99))
) +
geom_point(alpha = 0.1, position = position_jitter(h = 0), size = 2) +
#loess (local regression) smoother, not a linear fit
geom_smooth(method = loess, se=FALSE) +
scale_color_brewer(palette = "Set1") +
labs(title = 'chlorides and alcohol by quality',
y = 'Chlorides', x = 'Alcohol') +
guides(color=guide_legend(title="Quality range", reverse = T), size=FALSE)
ggplot(data = white_wine,
aes(y = residual.sugar, x = alcohol, color = factor(quality.cut))) +
coord_cartesian(
ylim=c(2,20),
xlim=c(quantile(white_wine$alcohol,.01),quantile(white_wine$alcohol,.99))
) +
geom_point(alpha = 0.1, position = position_jitter(h = 0), size = 2) +
geom_smooth(method = loess, se=FALSE) +
scale_color_brewer(palette = "Set1") +
labs(title = 'resid sugar and alcohol by quality',
y = 'resid. sugar', x = 'Alcohol') +
guides(color=guide_legend(title="Quality range", reverse = T), size=FALSE)
The level of chlorides in relation to alcohol concentration is less variable for high-quality wines. Also, the level of residual sugar in relation to alcohol is slightly higher for high-quality and medium-quality wines than for low-quality wines.
The purpose of this data exploration is to identify the variables to be used to build a model to predict wine quality. We find that no single variable indicates wine quality well enough. Using combinations of variables, we can get a better idea of the wine quality.
From this exploratory analysis, we see that good wine tends to have more alcohol and lower density.