All the packages used in the analysis are listed below:
library(ggplot2)
library(GGally) # for ggpairs
library(memisc)
library(gridExtra)
The dataset consists of 13 variables, eleven of which are chemical characteristics of red wine that potentially influence its quality. The first variable, X, is an observation index (it runs from 1 to 1599, one value per wine sample), and the last one, quality, is the evaluation of perceived wine quality.
## [1] "X" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality"
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
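The loading step is not shown in this report. The sketch below illustrates how a listing of this shape might be produced with names() and summary(). The helper name load_wine and the two-row file are made up for illustration, and the real file path is not given in the source. It also shows one way a column named X can arise: read.csv assigns that name to an unnamed row-index column.

```r
# Hypothetical helper (not from the original analysis): read a CSV and
# check that the expected response column is present.
load_wine <- function(path) {
  df <- read.csv(path)
  stopifnot("quality" %in% names(df))
  df
}

# Made-up two-row file standing in for the real dataset.
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(fixed.acidity = c(7.4, 7.8),
                     quality = c(5L, 6L)),
          tmp)  # row names are written as an unnamed first column

demo <- load_wine(tmp)
print(names(demo))    # the unnamed index column is read back as "X"
print(summary(demo))
```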
The variable quality is stored as an integer, which is inconvenient for grouping and plotting. So, first of all, I convert it to a factor and add it to the data frame as a new variable, quality.factor. In addition, I create a second variable, quality.cat, with three categories of quality: good (>= 7), bad (<= 4), and medium (5 and 6).
df$quality.factor <- factor(df$quality)
df$quality.cat <- NA
df$quality.cat <- ifelse(df$quality>=7, 'good', NA)
df$quality.cat <- ifelse(df$quality<=4, 'bad', df$quality.cat)
df$quality.cat <- ifelse(df$quality==5, 'medium', df$quality.cat)
df$quality.cat <- ifelse(df$quality==6, 'medium', df$quality.cat)
df$quality.cat <- factor(df$quality.cat, levels = c("bad", "medium", "good"))
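As a possible alternative to the chain of ifelse() calls, the same three bins could be built in one step with cut(). This is only a sketch on toy scores covering the observed range 3 to 8, not the author's code. The cross-tabulation at the end is a quick way to check that each raw score lands in the intended category.

```r
# Toy scores covering the observed range 3..8 (illustrative, not the real data).
quality <- c(3L, 4L, 5L, 6L, 7L, 8L)

# Intervals are right-closed by default: (-Inf, 4], (4, 6], (6, Inf).
quality.cat <- cut(quality,
                   breaks = c(-Inf, 4, 6, Inf),
                   labels = c("bad", "medium", "good"))

# Cross-tabulate raw scores against categories to check the binning.
print(table(quality, quality.cat))
```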
For each variable in the dataset, I plot a frequency histogram to examine its distribution. Boxplots showing how perceived quality changes with each characteristic follow later.
The following variables have a normal or close-to-normal distribution: fixed.acidity, volatile.acidity, density, pH and alcohol. The distribution of citric.acid is not normal, but I do not transform it for this analysis.
n1 <- qplot(x = fixed.acidity, data = df,
binwidth = 0.1) +
scale_x_continuous(breaks = seq(4, 16, 1))
n2 <- qplot(x = volatile.acidity, data = df,
binwidth = 0.01) +
scale_x_continuous(breaks = seq(0.12, 1.58, 0.1))
n3 <- qplot(x = citric.acid, data = df,
binwidth = 0.01) +
scale_x_continuous(breaks = seq(0, 1, 0.1))
n4 <- qplot(x = density, data = df)
n5 <- qplot(x = pH, data = df)
n6 <- qplot(x = alcohol, data = df)
grid.arrange(n1, n2, n3, n4, n5, n6, ncol = 2)
The following variables do not have a normal or close-to-normal distribution: residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide and sulphates. The histograms of all these variables are strongly right-skewed and need a transformation.
l1 <- qplot(x = residual.sugar, data = df) +
scale_x_continuous(breaks = seq(0, 16, 0.5))
l2 <- qplot(x = chlorides, data = df,
binwidth = 0.01) +
scale_x_continuous(breaks = seq(0, 1, 0.1))
l3 <- qplot(x = free.sulfur.dioxide, data = df,
binwidth = 0.5)
l4 <- qplot(x = total.sulfur.dioxide, data = df,
binwidth = 0.5)
l5 <- qplot(x = sulphates, data = df)
grid.arrange(l1, l2, l3, l4, l5, ncol = 2)
I apply a log10 transformation to these variables to bring their distributions closer to normal.
l1a <- qplot(x = log10(residual.sugar), data = df)
l2a <- qplot(x = log10(chlorides), data = df)
l3a <- qplot(x = log10(free.sulfur.dioxide), data = df)
l4a <- qplot(x = log10(total.sulfur.dioxide), data = df)
l5a <- qplot(x = log10(sulphates), data = df)
grid.arrange(l1a, l2a, l3a, l4a, l5a, ncol = 2)
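The effect of the log10 transformation can also be checked numerically rather than by eye. The sketch below uses a moment-based skewness (one common definition among several) on synthetic right-skewed values. The helper skewness and the simulated data are assumptions for illustration and do not reproduce the wine dataset.

```r
# Moment-based sample skewness (one common definition; others exist).
skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

set.seed(1)
# Synthetic right-skewed values standing in for a variable such as residual.sugar.
x <- rlnorm(1000, meanlog = 1, sdlog = 0.5)

raw_skew <- skewness(x)
log_skew <- skewness(log10(x))
print(c(raw = raw_skew, log10 = log_skew))
```

A skewness close to zero after the transformation would support treating the transformed variable as roughly symmetric.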
In this section, I analyse relationships between wine characteristics and its perceived quality, as well as possible correlations between different characteristics.
In order to compare statistical data of each variable visually, I use boxplots.
The following set of boxplots shows the cases where perceived wine quality increases as the value of a characteristic increases.
p1up = qplot(x = quality.cat, y = alcohol,
data = df,
geom = "boxplot")
p2up = qplot(x = quality.cat, y = sulphates,
data = df,
geom = "boxplot")
p3up = qplot(x = quality.cat, y = citric.acid,
data = df,
geom = "boxplot")
p4up = qplot(x = quality.cat, y = fixed.acidity,
data = df,
geom = "boxplot")
grid.arrange(p1up, p2up, p3up, p4up, ncol = 2)
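The trend read from the boxplots can be backed by a small numeric check: the median per quality category is the centre line of each box. The sketch below uses a made-up toy data frame with the same column names as the report, so the values themselves are not results.

```r
# Toy data frame (made-up values) with the same column names as the report.
toy <- data.frame(
  quality.cat = factor(rep(c("bad", "medium", "good"), each = 3),
                       levels = c("bad", "medium", "good")),
  alcohol = c(9.0, 9.5, 10.0, 9.8, 10.2, 10.6, 11.0, 11.5, 12.0)
)

# Median alcohol per quality category: the centre line of each box.
med <- sapply(split(toy$alcohol, toy$quality.cat), median)
print(med)

# TRUE when the medians rise from "bad" to "good".
increasing <- all(diff(med) > 0)
print(increasing)
```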
The next set of boxplots, by contrast, shows the cases where perceived wine quality decreases as the value of a variable increases.
p1d = qplot(x = quality.cat, y = volatile.acidity,
data = df,
geom = "boxplot")
p2d = qplot(x = quality.cat, y = pH,
data = df,
geom = "boxplot")
p3d = qplot(x = quality.cat, y = density,
data = df,
geom = "boxplot")
p4d = qplot(x = quality.factor, y = density,
data = df,
geom = "boxplot")
grid.arrange(p1d, p2d, p3d, p4d, ncol = 2)
The ggpairs output uses grouped histograms for qualitative/qualitative pairs of variables and scatterplots for quantitative/quantitative pairs in the lower triangle of the plot. In the upper triangle, it provides boxplots for the qualitative/quantitative pairs of variables, and correlation coefficients for quantitative/quantitative pairs.
df.subset <- df[,2:13]
ggpairs(df.subset,
        lower = list(continuous = wrap("points", shape = I('.'))),
        upper = list(combo = wrap("box", outlier.shape = I('.'))))
By focusing on the pH column, I see that there could be a relationship between density and pH, as well as between pH and citric.acid. There is also a relationship between pH and fixed.acidity. The correlation between pH and these three variables is similar and always negative.
dens <- ggplot(aes(x = pH, y = density), data = df) +
geom_point(alpha = 1/5, position = position_jitter(h = 0)) +
coord_trans(x = "log10") +
geom_smooth(method = "lm", color = "red")
citr.ac <- ggplot(aes(x = pH, y = citric.acid), data = df) +
geom_point(alpha = 1/5, position = position_jitter(h = 0)) +
coord_trans(x = "log10") +
geom_smooth(method = "lm", color = "red")
fix.ac <- ggplot(aes(x = pH, y = fixed.acidity), data = df) +
geom_point(alpha = 1/5, position = position_jitter(h = 0)) +
coord_trans(x = "log10") +
geom_smooth(method = "lm", color = "red")
grid.arrange(dens, citr.ac, fix.ac, ncol = 2)
The highest positive correlations, according to the ggpairs matrix, are between density and fixed.acidity and between fixed.acidity and citric.acid.
pos1 <- ggplot(aes(x = fixed.acidity, y = density), data = df) +
geom_jitter(alpha = 1/5) +
geom_smooth(method = "lm", color = "red")
pos2 <- ggplot(aes(x = fixed.acidity, y = citric.acid), data = df) +
geom_jitter(alpha = 1/5) +
geom_smooth(method = "lm", color = "red")
grid.arrange(pos1, pos2, ncol = 2)
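Instead of scanning the ggpairs matrix by eye, the strongest pairs can be ranked directly from a correlation matrix. The sketch below does this on synthetic data whose signs merely mimic the relationships described above. The simulated numbers are assumptions and are not the correlations of the wine dataset.

```r
set.seed(42)
n <- 200
# Synthetic stand-ins (made-up numbers) that mimic the signs described above.
fixed.acidity <- rnorm(n, mean = 8.3, sd = 1.7)
toy <- data.frame(
  fixed.acidity = fixed.acidity,
  density = 0.9967 + 0.0007 * (fixed.acidity - 8.3) + rnorm(n, sd = 0.001),
  pH = 3.31 - 0.06 * (fixed.acidity - 8.3) + rnorm(n, sd = 0.1)
)

# Pearson correlation matrix; keep each pair once via the upper triangle.
cm <- cor(toy)
idx <- which(upper.tri(cm), arr.ind = TRUE)
pairs <- data.frame(
  var1 = rownames(cm)[idx[, "row"]],
  var2 = colnames(cm)[idx[, "col"]],
  r = cm[idx],
  stringsAsFactors = FALSE
)

# Sort from most positive to most negative.
pairs <- pairs[order(-pairs$r), ]
print(pairs)
```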
The previous boxplots show that higher levels of both sulphates and alcohol are associated with higher perceived quality of red wine. Now, I create a scatterplot to see whether a combination of these two may help distinguish between different quality levels.
ggplot(aes(x = log10(sulphates), y = alcohol, colour = quality.factor),
data = df) +
geom_point(aes(size = quality.factor)) +
scale_color_brewer(type = 'div', palette="Set1") +
scale_x_continuous(lim=c(quantile(log10(df$sulphates), 0.01),
quantile(log10(df$sulphates), 0.99)))+
scale_y_continuous(lim=c(quantile(df$alcohol, 0.01),
quantile(df$alcohol, 0.99)))
The plot reveals a clear pattern: most of the orange and yellow dots (high-quality wine) lie where both alcohol and sulphates levels are high. There is also a visible band of violet dots in the middle of the plot, and a zone of mostly green dots in the bottom-left corner. This implies that this combination of variables helps distinguish between the two levels of medium-quality wine (5 and 6).
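The scatterplot code restricts both axes to the 1% to 99% quantile range. With default settings, ggplot2 scale limits drop points outside the limits rather than just zooming, so it can be useful to know how many observations remain. The sketch below counts them on synthetic data. The helper in_central_range and the simulated values are assumptions for illustration.

```r
# Flag values inside the [lo, hi] quantile window (hypothetical helper).
in_central_range <- function(x, lo = 0.01, hi = 0.99) {
  q <- quantile(x, probs = c(lo, hi), na.rm = TRUE)
  !is.na(x) & x >= q[[1]] & x <= q[[2]]
}

set.seed(7)
# Synthetic stand-ins for alcohol and sulphates (made-up values).
alcohol <- rnorm(1000, mean = 10.4, sd = 1.1)
sulphates <- rlnorm(1000, meanlog = log(0.62), sdlog = 0.2)

# A point is kept only if it falls inside the window on both axes.
keep <- in_central_range(alcohol) & in_central_range(log10(sulphates))
print(mean(keep))  # share of points that stay inside both windows
```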
The previous plots show a positive correlation between density and fixed.acidity, so I create a scatterplot to see whether these two variables explain the quality changes well.
ggplot(aes(x = fixed.acidity, y = density, colour = quality.factor),
data = df) +
geom_point(size = 4) +
#geom_point() +
scale_color_brewer(type = 'div', palette="Set1") +
scale_x_continuous(lim=c(quantile(df$fixed.acidity, 0.01),
quantile(df$fixed.acidity, 0.99))) +
scale_y_continuous(lim=c(quantile(df$density, 0.01),
quantile(df$density, 0.99)))
Although the plot is not very clear, it reveals some patterns in the data. The majority of green and violet dots are concentrated in the upper part, while the majority of orange dots are concentrated in the bottom part of the plot. Thus, this combination of variables may be useful for distinguishing medium-quality wine from high-quality wine.
Finally, I analyze the influence of pH, total.sulfur.dioxide and free.sulfur.dioxide on the quality of red wine.
The left plot below shows the impact of the pH-total.sulfur.dioxide combination on quality. The zone of green dots (medium-quality wine) is immediately visible here.
The variable total.sulfur.dioxide is highly correlated with free.sulfur.dioxide, so I create the right plot to see whether adding another variable adds any value. It seems that the area of orange dots (high-quality wine) is more visible on the right plot (bottom-left corner), while the area of green dots is still clearly distinguishable.
p1 <- ggplot(aes(x = pH, y = total.sulfur.dioxide, colour = quality.factor),
data = df) +
geom_point(aes(size = quality.factor)) +
scale_color_brewer(type = 'div', palette="Set1") +
scale_x_continuous(lim=c(quantile(df$pH, 0.01),
quantile(df$pH, 0.99))) +
scale_y_continuous(lim=c(quantile(df$total.sulfur.dioxide, 0.01),
quantile(df$total.sulfur.dioxide, 0.99)))
p2 <- ggplot(aes(x = log10(total.sulfur.dioxide),
y = log10(free.sulfur.dioxide), colour = quality.factor),
data = df) +
geom_point(aes(size = quality.factor)) +
#geom_point(aes(size = 12)) +
scale_color_brewer(type = 'div', palette="Set1") +
scale_x_continuous(lim=c(quantile(log10(df$total.sulfur.dioxide),
0.01),
quantile(log10(df$total.sulfur.dioxide),
0.99))) +
scale_y_continuous(lim=c(quantile(log10(df$free.sulfur.dioxide),
0.01),
quantile(log10(df$free.sulfur.dioxide),
0.99)))
grid.arrange(p1, p2, ncol = 2)
I mostly use combinations of two or more variables for the multiple regression model predicting the quality of red wine. The first combination consists of all the variables whose increasing levels go along with increasing quality. The next combination is density and fixed.acidity, as their scatterplot suggested they may help predict quality. Next comes volatile.acidity, as this variable has the strongest negative correlation with quality. The last combination consists of pH, total.sulfur.dioxide and free.sulfur.dioxide, based on the last step of the EDA above.
m1 <- lm(quality ~ alcohol*sulphates*citric.acid*fixed.acidity, data = df)
m2 <- update(m1, ~ . + density*fixed.acidity)
m3 <- update(m2, ~ . + volatile.acidity)
m4 <- update(m3, ~ . + pH*total.sulfur.dioxide*free.sulfur.dioxide)
mtable(m1, m2, m3, m4)
##
## Calls:
## m1: lm(formula = quality ~ alcohol * sulphates * citric.acid * fixed.acidity,
## data = df)
## m2: lm(formula = quality ~ alcohol + sulphates + citric.acid + fixed.acidity +
## density + alcohol:sulphates + alcohol:citric.acid + sulphates:citric.acid +
## alcohol:fixed.acidity + sulphates:fixed.acidity + citric.acid:fixed.acidity +
## fixed.acidity:density + alcohol:sulphates:citric.acid + alcohol:sulphates:fixed.acidity +
## alcohol:citric.acid:fixed.acidity + sulphates:citric.acid:fixed.acidity +
## alcohol:sulphates:citric.acid:fixed.acidity, data = df)
## m3: lm(formula = quality ~ alcohol + sulphates + citric.acid + fixed.acidity +
## density + volatile.acidity + alcohol:sulphates + alcohol:citric.acid +
## sulphates:citric.acid + alcohol:fixed.acidity + sulphates:fixed.acidity +
## citric.acid:fixed.acidity + fixed.acidity:density + alcohol:sulphates:citric.acid +
## alcohol:sulphates:fixed.acidity + alcohol:citric.acid:fixed.acidity +
## sulphates:citric.acid:fixed.acidity + alcohol:sulphates:citric.acid:fixed.acidity,
## data = df)
## m4: lm(formula = quality ~ alcohol + sulphates + citric.acid + fixed.acidity +
## density + volatile.acidity + pH + total.sulfur.dioxide +
## free.sulfur.dioxide + alcohol:sulphates + alcohol:citric.acid +
## sulphates:citric.acid + alcohol:fixed.acidity + sulphates:fixed.acidity +
## citric.acid:fixed.acidity + fixed.acidity:density + pH:total.sulfur.dioxide +
## pH:free.sulfur.dioxide + total.sulfur.dioxide:free.sulfur.dioxide +
## alcohol:sulphates:citric.acid + alcohol:sulphates:fixed.acidity +
## alcohol:citric.acid:fixed.acidity + sulphates:citric.acid:fixed.acidity +
## pH:total.sulfur.dioxide:free.sulfur.dioxide + alcohol:sulphates:citric.acid:fixed.acidity,
## data = df)
##
## ==============================================================================================
## m1 m2 m3 m4
## ----------------------------------------------------------------------------------------------
## (Intercept) 6.004 -16.407 -28.327 -47.730
## (8.287) (52.025) (50.640) (52.164)
## alcohol -0.248 -0.031 -0.006 0.166
## (0.791) (0.788) (0.767) (0.772)
## sulphates -1.259 1.733 1.544 1.326
## (11.824) (11.869) (11.550) (11.605)
## citric.acid 24.816 31.954 33.083 36.242*
## (18.651) (18.861) (18.354) (18.386)
## fixed.acidity 0.078 10.120 8.308 7.842
## (1.148) (5.646) (5.497) (5.526)
## alcohol x sulphates 0.400 0.151 0.199 0.126
## (1.120) (1.121) (1.091) (1.098)
## alcohol x citric.acid -1.790 -2.483 -2.711 -3.038
## (1.769) (1.784) (1.736) (1.742)
## sulphates x citric.acid -55.541* -64.136* -62.580* -61.781*
## (25.491) (25.795) (25.101) (25.161)
## alcohol x fixed.acidity 0.004 -0.029 -0.014 -0.039
## (0.111) (0.111) (0.108) (0.108)
## sulphates x fixed.acidity -0.679 -1.025 -0.781 -0.833
## (1.639) (1.640) (1.596) (1.602)
## citric.acid x fixed.acidity -3.359 -4.130 -4.076 -4.447*
## (2.308) (2.325) (2.262) (2.263)
## alcohol x sulphates x citric.acid 4.484 5.231* 5.177* 5.190*
## (2.400) (2.427) (2.362) (2.369)
## alcohol x sulphates x fixed.acidity 0.052 0.082 0.048 0.067
## (0.158) (0.158) (0.154) (0.155)
## alcohol x citric.acid x fixed.acidity 0.273 0.349 0.342 0.385
## (0.221) (0.222) (0.216) (0.216)
## sulphates x citric.acid x fixed.acidity 6.979* 7.874* 7.330* 7.300*
## (3.168) (3.191) (3.105) (3.109)
## alcohol x sulphates x citric.acid x fixed.acidity -0.597* -0.675* -0.621* -0.634*
## (0.302) (0.304) (0.295) (0.296)
## density 19.745 31.965 55.838
## (52.053) (50.668) (52.270)
## fixed.acidity x density -9.685 -7.953 -7.355
## (5.584) (5.436) (5.465)
## volatile.acidity -1.106*** -0.985***
## (0.117) (0.119)
## pH -1.566***
## (0.375)
## total.sulfur.dioxide -0.072**
## (0.024)
## free.sulfur.dioxide -0.207**
## (0.066)
## pH x total.sulfur.dioxide 0.021**
## (0.007)
## pH x free.sulfur.dioxide 0.065**
## (0.020)
## total.sulfur.dioxide x free.sulfur.dioxide 0.003***
## (0.001)
## pH x total.sulfur.dioxide x free.sulfur.dioxide -0.001***
## (0.000)
## ----------------------------------------------------------------------------------------------
## R-squared 0.333 0.342 0.377 0.391
## adj. R-squared 0.327 0.335 0.370 0.381
## sigma 0.663 0.659 0.641 0.635
## F 52.712 48.362 53.221 40.373
## p 0.000 0.000 0.000 0.000
## Log-likelihood -1602.739 -1591.863 -1547.716 -1530.317
## Deviance 695.015 685.624 648.792 634.825
## AIC 3239.478 3221.725 3135.432 3114.633
## BIC 3330.889 3323.891 3242.975 3259.816
## N 1599 1599 1599 1599
## ==============================================================================================
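The stepwise build-up with update() can also be compared formally. For nested linear models, anova() gives an F-test of whether the added terms improve the fit, and adjusted R-squared penalises extra terms. The sketch below runs on synthetic data with made-up coefficients; the names t1 and t2 are hypothetical and the numbers are not results for the wine dataset.

```r
set.seed(123)
n <- 300
# Synthetic predictors and response (made-up coefficients, not the wine data).
toy <- data.frame(
  alcohol = rnorm(n, 10.4, 1.1),
  sulphates = rnorm(n, 0.66, 0.17),
  volatile.acidity = rnorm(n, 0.53, 0.18)
)
toy$quality <- 2 + 0.3 * toy$alcohol + 1.0 * toy$sulphates -
  1.2 * toy$volatile.acidity + rnorm(n, sd = 0.6)

# Build the models step by step, as with update() in the report.
t1 <- lm(quality ~ alcohol * sulphates, data = toy)
t2 <- update(t1, ~ . + volatile.acidity)

# F-test for whether the added term improves the fit.
cmp <- anova(t1, t2)
print(cmp)

# Adjusted R-squared penalises extra terms, unlike plain R-squared.
adj <- c(t1 = summary(t1)$adj.r.squared, t2 = summary(t2)$adj.r.squared)
print(adj)
```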
The final model (m4) explains about 39% of the variance in quality (R-squared = 0.391). Most of this, R-squared = 0.333, comes from the first combination of parameters (alcohol, sulphates, citric.acid, fixed.acidity). Each of the next three sets of features adds roughly 0.01 to 0.04 to the previous R-squared value.
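The contribution of each step can be worked out directly from the fit statistics in the mtable output. The short sketch below copies the R-squared, AIC and BIC values from that table and computes the gain per step, along with the model each information criterion would favour.

```r
# R-squared, AIC and BIC values copied from the mtable output above.
fit <- data.frame(
  model = c("m1", "m2", "m3", "m4"),
  r2 = c(0.333, 0.342, 0.377, 0.391),
  aic = c(3239.478, 3221.725, 3135.432, 3114.633),
  bic = c(3330.889, 3323.891, 3242.975, 3259.816),
  stringsAsFactors = FALSE
)

# Gain in R-squared contributed by each added set of terms.
gain <- round(diff(fit$r2), 3)
names(gain) <- fit$model[-1]
print(gain)

# Model preferred by each information criterion (lower is better).
best_aic <- fit$model[which.min(fit$aic)]
best_bic <- fit$model[which.min(fit$bic)]
print(c(AIC = best_aic, BIC = best_bic))
```

The largest gain comes from adding volatile.acidity in m3. AIC is lowest for m4 while BIC is lowest for m3, which may suggest that the extra sulfur dioxide and pH terms are only weakly justified under BIC's stricter penalty for model size.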
This model has limitations. It is based on limited data that contain no very high (above 8) or very low (below 3) quality scores. Collecting data with more extreme scores, as well as more wines with the existing low scores (3 and 4), could significantly improve the model’s predictive power.
Alcohol and citric acid are the two characteristics that increase the perceived quality of wine the most. pH and volatile acidity, by contrast, reduce perceived quality the most.
Alcohol and sulphates, together with the other quality-increasing characteristics, contribute the most to predicting red wine quality.
The multiple regression model explains up to 39% of the variance in quality in this dataset. An additional dataset with more extreme-quality cases (both high and low) should help improve the results of this model. Moreover, more sophisticated prediction models should be able to provide more accurate predictions of wine quality based on its chemical characteristics.