This document investigates the relationship between the ranking of red wine quality by human experts and 11 variables resulting from physicochemical tests carried out on the wines.
The data is taken from the University of California, Irvine (UCI) Machine Learning Repository.
Citation: This dataset is publicly available for research. The details are described in [Cortez et al., 2009]. P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.
There are 11 input (explanatory) variables that were collected as measurements through a variety of tests on the wines:
- fixed acidity
- volatile acidity
- citric acid
- residual sugar
- chlorides
- free sulfur dioxide
- total sulfur dioxide
- density
- pH
- sulphates
- alcohol
All input variables are continuous variables.
The response variable is the quality ranking, an integer in the range 0 (very bad) to 10 (very excellent). For the purpose of this investigation we will treat the quality response as a continuous variable so that simple linear modelling techniques can be used, with the proposal that a rounding function could then produce an appropriate ranking from the linear model's predictions. However, it is recognised that quality is actually an ordinal variable (a form of categorical variable). This is expected to result in a less than optimal linear model, because fine-grained relationship information between the inputs and the response may be lost when continuous response values are clustered into the 11 discrete integer rankings. It is also expected that some form of generalised linear model, such as ordinal logistic regression, may provide better relationship information (I just don't know how to do that yet).
There are 1599 samples with no missing data.
The null hypothesis (H0) is that none of the variance in the quality ranking is explained by the input variables. The alternative hypothesis (H1) is that the input variables' contribution to the variance in the quality ranking is significantly different from 0.
#load in the red wine data from the UCI ML website
redwq<-read.csv('http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv',sep=';')
# and have a quick look at the data
head(redwq)
## fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 7.4 0.70 0.00 1.9 0.076
## 2 7.8 0.88 0.00 2.6 0.098
## 3 7.8 0.76 0.04 2.3 0.092
## 4 11.2 0.28 0.56 1.9 0.075
## 5 7.4 0.70 0.00 1.9 0.076
## 6 7.4 0.66 0.00 1.8 0.075
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol
## 1 11 34 0.9978 3.51 0.56 9.4
## 2 25 67 0.9968 3.20 0.68 9.8
## 3 15 54 0.9970 3.26 0.65 9.8
## 4 17 60 0.9980 3.16 0.58 9.8
## 5 11 34 0.9978 3.51 0.56 9.4
## 6 13 40 0.9978 3.51 0.56 9.4
## quality
## 1 5
## 2 5
## 3 5
## 4 6
## 5 5
## 6 5
summary(redwq)
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 4.60 Min. :0.1200 Min. :0.000 Min. : 0.900
## 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090 1st Qu.: 1.900
## Median : 7.90 Median :0.5200 Median :0.260 Median : 2.200
## Mean : 8.32 Mean :0.5278 Mean :0.271 Mean : 2.539
## 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420 3rd Qu.: 2.600
## Max. :15.90 Max. :1.5800 Max. :1.000 Max. :15.500
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.01200 Min. : 1.00 Min. : 6.00
## 1st Qu.:0.07000 1st Qu.: 7.00 1st Qu.: 22.00
## Median :0.07900 Median :14.00 Median : 38.00
## Mean :0.08747 Mean :15.87 Mean : 46.47
## 3rd Qu.:0.09000 3rd Qu.:21.00 3rd Qu.: 62.00
## Max. :0.61100 Max. :72.00 Max. :289.00
## density pH sulphates alcohol
## Min. :0.9901 Min. :2.740 Min. :0.3300 Min. : 8.40
## 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500 1st Qu.: 9.50
## Median :0.9968 Median :3.310 Median :0.6200 Median :10.20
## Mean :0.9967 Mean :3.311 Mean :0.6581 Mean :10.42
## 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300 3rd Qu.:11.10
## Max. :1.0037 Max. :4.010 Max. :2.0000 Max. :14.90
## quality
## Min. :3.000
## 1st Qu.:5.000
## Median :6.000
## Mean :5.636
## 3rd Qu.:6.000
## Max. :8.000
The summary indicates that the range of quality rankings for the red wine is quite narrow. There are no extreme rankings (Min = 3 and Max = 8) and the distribution is centred around the middle rankings (Q1 = 5 and Q3 = 6).
# see what a couple of graphs show about the shape of the quality ranking distribution
par(mfrow=c(2,1))
# normal Q-Q plot against standardised quality rankings
qqnorm(scale(redwq$quality))
qqline(scale(redwq$quality))
# and a histogram
hist(redwq$quality)
The histogram looks sort of normal with very thin tails - far more rankings in the “ordinary” range than in the “very good” or “very bad” range.
The normal Q-Q plot looks a bit strange due to the clustering around the ordinal integer rankings. For the purpose of this investigation we will assume “fit”, because the line passes through roughly the centre of each cluster except at the ends, where the thin tails depart from the line.
But we really need to work out some way of treating this variable as an ordinal.
As mentioned we will use a linear model. As all the input variables are continuous we will use Linear Regression in preference to ANOVA.
All available input variables are used in the first model.
attach(redwq)
redmdl<-lm(formula=quality~alcohol+chlorides+citric.acid+density+fixed.acidity+free.sulfur.dioxide+pH+residual.sugar+sulphates+total.sulfur.dioxide+volatile.acidity)
# have a look at the results of the regression
redmdl
##
## Call:
## lm(formula = quality ~ alcohol + chlorides + citric.acid + density +
## fixed.acidity + free.sulfur.dioxide + pH + residual.sugar +
## sulphates + total.sulfur.dioxide + volatile.acidity)
##
## Coefficients:
## (Intercept) alcohol chlorides
## 21.965208 0.276198 -1.874225
## citric.acid density fixed.acidity
## -0.182564 -17.881164 0.024991
## free.sulfur.dioxide pH residual.sugar
## 0.004361 -0.413653 0.016331
## sulphates total.sulfur.dioxide volatile.acidity
## 0.916334 -0.003265 -1.083590
#summarise the model
summary(redmdl)
##
## Call:
## lm(formula = quality ~ alcohol + chlorides + citric.acid + density +
## fixed.acidity + free.sulfur.dioxide + pH + residual.sugar +
## sulphates + total.sulfur.dioxide + volatile.acidity)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.68911 -0.36652 -0.04699 0.45202 2.02498
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.197e+01 2.119e+01 1.036 0.3002
## alcohol 2.762e-01 2.648e-02 10.429 < 2e-16 ***
## chlorides -1.874e+00 4.193e-01 -4.470 8.37e-06 ***
## citric.acid -1.826e-01 1.472e-01 -1.240 0.2150
## density -1.788e+01 2.163e+01 -0.827 0.4086
## fixed.acidity 2.499e-02 2.595e-02 0.963 0.3357
## free.sulfur.dioxide 4.361e-03 2.171e-03 2.009 0.0447 *
## pH -4.137e-01 1.916e-01 -2.159 0.0310 *
## residual.sugar 1.633e-02 1.500e-02 1.089 0.2765
## sulphates 9.163e-01 1.143e-01 8.014 2.13e-15 ***
## total.sulfur.dioxide -3.265e-03 7.287e-04 -4.480 8.00e-06 ***
## volatile.acidity -1.084e+00 1.211e-01 -8.948 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.648 on 1587 degrees of freedom
## Multiple R-squared: 0.3606, Adjusted R-squared: 0.3561
## F-statistic: 81.35 on 11 and 1587 DF, p-value: < 2.2e-16
#some graphs to check residuals
par(mfrow=c(2,2))
plot(redmdl)
# histogram of the residuals
par(mfrow=c(1,1))
hist(redmdl$residuals)
The linear regression gives the following fitted model for the response variable quality:
quality = 21.965208 + 0.276198(alcohol) - 1.874225(chlorides) - 0.182564(citric.acid) - 17.881164(density) + 0.024991(fixed.acidity) + 0.004361(free.sulfur.dioxide) - 0.413653(pH) + 0.016331(residual.sugar) + 0.916334(sulphates) - 0.003265(total.sulfur.dioxide) - 1.083590(volatile.acidity)
The summary indicates that the p-values for citric.acid, density, fixed.acidity, and residual.sugar are greater than 0.05. As such, we cannot reject, at the 5% significance level, the null hypothesis that each of these input variables makes no contribution to the variance of the quality ranking. We will remove these from the regression.
Adjusted R2 is not high at 0.3561, but the p-value of the overall F-test is < 0.05, so we reject the null hypothesis at the 5% significance level and conclude that a relationship does exist between at least some of the input variables and the quality ranking.
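As a rough cross-check, the overall F-statistic and adjusted R2 printed by summary() can be recomputed from the multiple R-squared, the sample size, and the number of input variables. The sketch below uses only the rounded figures reported above (the variable names are illustrative), so it should agree with the printed values only to about two decimal places.

```r
# Figures taken from the summary(redmdl) output above (rounded as printed)
r2 <- 0.3606   # multiple R-squared
n  <- 1599     # number of samples
p  <- 11       # number of input variables

# Overall F-statistic: explained variance per predictor divided by
# unexplained variance per residual degree of freedom
f_stat <- (r2 / p) / ((1 - r2) / (n - p - 1))

# Adjusted R-squared penalises R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

# p-value of the overall F-test (upper tail of the F distribution)
p_value <- pf(f_stat, df1 = p, df2 = n - p - 1, lower.tail = FALSE)

print(c(F = f_stat, adj.R2 = adj_r2, p.value = p_value))
```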
Residuals vs Fitted indicates constant variance (although the residuals are clustered by integer ranking).
Normal Q-Q indicates normally distributed residuals. The histogram of residuals also looks like a normal distribution centred around 0. Assumptions of regression: OK.
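The clustering seen in the Residuals vs Fitted plot follows directly from the integer response: for a fixed ranking y, residual = y - fitted, so the points for each ranking fall on a straight line of slope -1. A small sketch on simulated data (not the wine data; x, y and fit are illustrative names) reproduces the pattern.

```r
set.seed(1)
x <- rnorm(300)
# Integer-valued response built by rounding a noisy linear signal
y <- round(5.5 + 0.6 * x + rnorm(300, sd = 0.7))
fit <- lm(y ~ x)

# One diagonal band per distinct integer value of y
plot(fitted(fit), resid(fit), col = as.integer(factor(y)),
     xlab = "Fitted values", ylab = "Residuals")

# Within each band, residual + fitted recovers the integer response
bands_ok <- isTRUE(all.equal(unname(resid(fit) + fitted(fit)), y))
print(bands_ok)
```

The bands are therefore an artefact of the discrete response rather than evidence of non-constant variance in themselves.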
Input variables citric.acid, density, fixed.acidity, and residual.sugar are removed from the regression.
redmdl2<-lm(formula=quality~alcohol+chlorides+free.sulfur.dioxide+pH+sulphates+total.sulfur.dioxide+volatile.acidity)
# have a look at the results of the regression
redmdl2
##
## Call:
## lm(formula = quality ~ alcohol + chlorides + free.sulfur.dioxide +
## pH + sulphates + total.sulfur.dioxide + volatile.acidity)
##
## Coefficients:
## (Intercept) alcohol chlorides
## 4.430099 0.289303 -2.017814
## free.sulfur.dioxide pH sulphates
## 0.005077 -0.482661 0.882665
## total.sulfur.dioxide volatile.acidity
## -0.003482 -1.012753
#summarise the model
summary(redmdl2)
##
## Call:
## lm(formula = quality ~ alcohol + chlorides + free.sulfur.dioxide +
## pH + sulphates + total.sulfur.dioxide + volatile.acidity)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.68918 -0.36757 -0.04653 0.46081 2.02954
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.4300987 0.4029168 10.995 < 2e-16 ***
## alcohol 0.2893028 0.0167958 17.225 < 2e-16 ***
## chlorides -2.0178138 0.3975417 -5.076 4.31e-07 ***
## free.sulfur.dioxide 0.0050774 0.0021255 2.389 0.017 *
## pH -0.4826614 0.1175581 -4.106 4.23e-05 ***
## sulphates 0.8826651 0.1099084 8.031 1.86e-15 ***
## total.sulfur.dioxide -0.0034822 0.0006868 -5.070 4.43e-07 ***
## volatile.acidity -1.0127527 0.1008429 -10.043 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6477 on 1591 degrees of freedom
## Multiple R-squared: 0.3595, Adjusted R-squared: 0.3567
## F-statistic: 127.6 on 7 and 1591 DF, p-value: < 2.2e-16
#some graphs to check residuals
par(mfrow=c(2,2))
plot(redmdl2)
# histogram of the residuals
par(mfrow=c(1,1))
hist(redmdl2$residuals)
The linear regression gives the following fitted model for the response variable quality:
quality = 4.430099 + 0.289303(alcohol) - 2.017814(chlorides) + 0.005077(free.sulfur.dioxide) - 0.482661(pH) + 0.882665(sulphates) - 0.003482(total.sulfur.dioxide) - 1.012753(volatile.acidity)
The summary indicates that none of the p-values for the remaining input variables are greater than 0.05. As such, we reject, at the 5% significance level, the null hypothesis that these input variables make no contribution to the variance of the quality ranking.
Adjusted R2 has risen slightly but is still not high at 0.3567. The p-value of the overall F-test is still < 0.05, so we again conclude that a relationship does exist between at least some of the input variables and the quality ranking.
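Dropping four variables on the basis of their individual t-tests can also be checked jointly with a partial F-test comparing the full and reduced models. With both fitted models in the session, anova(redmdl2, redmdl) would do this directly; the sketch below instead approximates the test from the rounded R-squared values printed in the two summaries, so the result is indicative only. The variable names are illustrative.

```r
r2_full    <- 0.3606  # multiple R-squared, all 11 input variables
r2_reduced <- 0.3595  # multiple R-squared, 7 input variables
n      <- 1599        # number of samples
p_full <- 11          # predictors in the full model
q      <- 4           # predictors dropped from the full model

df_resid <- n - p_full - 1   # residual degrees of freedom of the full model

# Partial F: loss in explained variance per dropped predictor, relative to
# the unexplained variance per residual degree of freedom of the full model
f_partial <- ((r2_full - r2_reduced) / q) / ((1 - r2_full) / df_resid)
p_value   <- pf(f_partial, df1 = q, df2 = df_resid, lower.tail = FALSE)

print(c(F = f_partial, p.value = p_value))
```

A p-value well above 0.05 here is consistent with the decision to drop the four variables: taken together they add little explanatory power.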
Residuals vs Fitted indicates constant variance (although the residuals are clustered by integer ranking).
Normal Q-Q indicates normally distributed residuals. The histogram of residuals also looks like a normal distribution centred around 0. Assumptions of regression: OK.
Test the predictive power of the model by using real input variable values.
# we expect a rank of 6
predict.lm(redmdl2, data.frame( alcohol=9.8, chlorides=0.075, free.sulfur.dioxide=17.0, pH=3.16, sulphates=0.58, total.sulfur.dioxide=60,volatile.acidity=0.280), type="response")
## 1
## 5.694475
# we expect a rank of 5
predict.lm(redmdl2, data.frame( alcohol=9.4, chlorides=0.076, free.sulfur.dioxide=11.0, pH=3.51, sulphates=0.56, total.sulfur.dioxide=34,volatile.acidity=0.7), type="response")
## 1
## 5.024869
# we expect a rank of 7
predict.lm(redmdl2, data.frame( alcohol=10, chlorides=0.065, free.sulfur.dioxide=15.0, pH=3.39, sulphates=0.47, total.sulfur.dioxide=21,volatile.acidity=0.65), type="response")
## 1
## 5.315343
By rounding the results of the predictions we can see that the model predicted the correct ranking for the first two sets of input variables (5.69 rounds to 6 and 5.02 rounds to 5) but was incorrect where the expected ranking was 7 (5.32 rounds to 5).
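The rounding step can be made explicit. The sketch below re-evaluates the fitted equation for the three test wines using the rounded coefficients printed for redmdl2, then rounds each prediction and clamps it to the 0 to 10 ranking scale. The helper name to_rank and the matrix layout are illustrative choices, not part of the original analysis.

```r
# Coefficients as printed for redmdl2 (intercept first)
coefs <- c(intercept = 4.430099, alcohol = 0.289303, chlorides = -2.017814,
           free.sulfur.dioxide = 0.005077, pH = -0.482661,
           sulphates = 0.882665, total.sulfur.dioxide = -0.003482,
           volatile.acidity = -1.012753)

# The three test wines, one per row, columns in the same order as coefs
# (the leading 1 multiplies the intercept)
wines <- rbind(c(1,  9.8, 0.075, 17, 3.16, 0.58, 60, 0.28),
               c(1,  9.4, 0.076, 11, 3.51, 0.56, 34, 0.70),
               c(1, 10.0, 0.065, 15, 3.39, 0.47, 21, 0.65))
expected <- c(6, 5, 7)

# Hypothetical helper: round to the nearest integer and clamp to 0..10
to_rank <- function(x) pmin(pmax(round(x), 0), 10)

raw <- as.vector(wines %*% coefs)
result <- data.frame(expected, raw = round(raw, 3), predicted = to_rank(raw))
print(result)
```

Because the coefficients are rounded, the raw values match the predict.lm output only to about four decimal places, which does not affect the rounded rankings.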
Some suggestions for explanations for the results are:
Further investigations continue into logistic regression techniques for ordinal response variables.
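One possible direction is a proportional-odds (ordinal logistic) model, which the polr() function in the MASS package fits. The sketch below is only an illustration on simulated stand-in data, not the wine data: the single predictor, the three rank labels, and the cut points are all assumptions made so that the example is self-contained.

```r
library(MASS)  # polr() fits proportional-odds (ordinal logistic) models

# Simulated stand-in data, NOT the wine data: one predictor, three ordered ranks
set.seed(42)
n <- 500
alcohol <- runif(n, 8.4, 14.9)
latent  <- 0.9 * (alcohol - 11) + rlogis(n)
quality <- cut(latent, breaks = c(-Inf, -1, 1, Inf),
               labels = c("low", "medium", "high"), ordered_result = TRUE)
sim <- data.frame(alcohol, quality)

# Hess = TRUE keeps the Hessian so that summary() can report standard errors
fit <- polr(quality ~ alcohol, data = sim, Hess = TRUE)
print(summary(fit))

# Predicted probability of each rank, and the most likely rank, for new values
new_wines <- data.frame(alcohol = c(9, 11, 13))
probs <- predict(fit, newdata = new_wines, type = "probs")
ranks <- predict(fit, newdata = new_wines, type = "class")
print(probs)
print(ranks)
```

For the wine data, an equivalent call would presumably convert the response first, for example factor(quality, ordered = TRUE), and list the input variables on the right-hand side of the formula with data = redwq; that has not been run here.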