Red wine preferences from physicochemical properties

This document investigates the relationship between the quality ranking given to red wines by human experts and 11 variables obtained from physicochemical tests carried out on the wines.

The data were collected from the University of California, Irvine (UCI) Machine Learning Repository.

Citation: This dataset is publicly available for research. The details are described in [Cortez et al., 2009]. P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.

Variables

Inputs

There are 11 input (explanatory) variables that were collected as measurements through a variety of tests on the wines:
- fixed acidity
- volatile acidity
- citric acid
- residual sugar
- chlorides
- free sulfur dioxide
- total sulfur dioxide
- density
- pH
- sulphates
- alcohol

All input variables are continuous variables.

Response

The response variable is the quality ranking, an integer in the range 0 (very bad) to 10 (very excellent). For the purpose of this investigation we will treat quality as a continuous variable so that simple linear modelling techniques can be used, with the proposal that a rounding function could convert predictions from the linear model into a ranking.

However, it is recognised that quality is actually an ordinal variable (a form of categorical variable). We expect this to produce a less than optimal linear model, because fine-grained relationship information between the inputs and the response may be lost when a continuous response is clustered into the 11 discrete integer rankings. We also expect that some form of generalised linear model, such as ordinal logistic regression, may describe the relationship better (I just don't know how to do that yet).
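
One possible way to treat quality as an ordinal response is a proportional-odds (ordinal logistic) model. The sketch below is illustrative only and is not part of the analysis in this document. It assumes the MASS package (normally installed alongside R) and fits `polr` to simulated toy data, so the names `toy`, `x` and `y` are hypothetical.

```r
library(MASS)  # provides polr(); assumed to be installed

# Simulated toy data (NOT the wine data): one predictor and a
# three-level ordered response cut from a latent logistic variable.
set.seed(1)
n <- 300
x <- rnorm(n)
latent <- 1.5 * x + rlogis(n)
y <- cut(latent, breaks = c(-Inf, -1, 1, Inf),
         labels = c("low", "mid", "high"), ordered_result = TRUE)
toy <- data.frame(x = x, y = y)

# Proportional-odds logistic regression; Hess = TRUE keeps the Hessian
# so that summary(fit) can report standard errors.
fit <- polr(y ~ x, data = toy, Hess = TRUE)

# Predict the most likely class for three new predictor values.
pred <- predict(fit, newdata = data.frame(x = c(-2, 0, 2)), type = "class")
print(pred)

# Untested assumption: the same idea applied to the wine data might be
#   polr(factor(quality, ordered = TRUE) ~ alcohol + sulphates, data = redwq, Hess = TRUE)
```

Unlike rounding a linear prediction, this kind of model returns a probability for each ranking, which respects the ordering of the levels without assuming they are equally spaced.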

Data

There are 1599 samples with no missing data.
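
The sample count and the absence of missing values can be checked directly in R. The helper below is a minimal sketch: `check_complete` is a hypothetical name, and the toy data frame contains made-up values (including one deliberate `NA`) purely to show what the helper reports.

```r
# Hypothetical helper: count the rows and the missing cells in a data frame.
check_complete <- function(df) {
  list(rows = nrow(df), missing = sum(is.na(df)))
}

# Toy frame with made-up values and one deliberate NA.
toy <- data.frame(pH = c(3.51, 3.20, NA), quality = c(5L, 5L, 6L))
res <- check_complete(toy)
print(res)

# Once the wine data has been loaded as redwq, the same call would be:
# check_complete(redwq)
```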

Hypothesis

The null hypothesis (H₀) is that none of the variance in the quality ranking is explained by the input variables. The alternative hypothesis (H₁) is that the input variables' contribution to the variance in the quality ranking is significantly different from 0.

#load in the red wine data from the UCI ML website
redwq<-read.csv('http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv',sep=';')

# and have a quick look at the data
head(redwq)

##   fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1           7.4             0.70        0.00            1.9     0.076
## 2           7.8             0.88        0.00            2.6     0.098
## 3           7.8             0.76        0.04            2.3     0.092
## 4          11.2             0.28        0.56            1.9     0.075
## 5           7.4             0.70        0.00            1.9     0.076
## 6           7.4             0.66        0.00            1.8     0.075
##   free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates alcohol
## 1                  11                   34  0.9978 3.51      0.56     9.4
## 2                  25                   67  0.9968 3.20      0.68     9.8
## 3                  15                   54  0.9970 3.26      0.65     9.8
## 4                  17                   60  0.9980 3.16      0.58     9.8
## 5                  11                   34  0.9978 3.51      0.56     9.4
## 6                  13                   40  0.9978 3.51      0.56     9.4
##   quality
## 1       5
## 2       5
## 3       5
## 4       6
## 5       5
## 6       5

summary(redwq)

##  fixed.acidity   volatile.acidity  citric.acid    residual.sugar  
##  Min.   : 4.60   Min.   :0.1200   Min.   :0.000   Min.   : 0.900  
##  1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090   1st Qu.: 1.900  
##  Median : 7.90   Median :0.5200   Median :0.260   Median : 2.200  
##  Mean   : 8.32   Mean   :0.5278   Mean   :0.271   Mean   : 2.539  
##  3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420   3rd Qu.: 2.600  
##  Max.   :15.90   Max.   :1.5800   Max.   :1.000   Max.   :15.500  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.01200   Min.   : 1.00       Min.   :  6.00      
##  1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 22.00      
##  Median :0.07900   Median :14.00       Median : 38.00      
##  Mean   :0.08747   Mean   :15.87       Mean   : 46.47      
##  3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 62.00      
##  Max.   :0.61100   Max.   :72.00       Max.   :289.00      
##     density             pH          sulphates         alcohol     
##  Min.   :0.9901   Min.   :2.740   Min.   :0.3300   Min.   : 8.40  
##  1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50  
##  Median :0.9968   Median :3.310   Median :0.6200   Median :10.20  
##  Mean   :0.9967   Mean   :3.311   Mean   :0.6581   Mean   :10.42  
##  3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10  
##  Max.   :1.0037   Max.   :4.010   Max.   :2.0000   Max.   :14.90  
##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.636  
##  3rd Qu.:6.000  
##  Max.   :8.000

The summary indicates that the spread of quality rankings for the red wine is quite narrow. There are no extreme rankings (Min = 3 and Max = 8) and the distribution is centred on the middle rankings (Q1 = 5 and Q3 = 6).

# see what a couple of graphs show about the shape of the quality ranking distribution
par(mfrow=c(2,1))

# normal Q-Q plot against standardised quality rankings
qqnorm(scale(redwq$quality))
qqline(scale(redwq$quality))

# and a histogram
hist(redwq$quality)

The histogram looks sort of normal with very thin tails - far more rankings in the “ordinary” range than in the “very good” or “very bad” range.

The normal Q-Q plot looks a bit strange due to the clustering around the ordinal integer rankings. For the purpose of this investigation we will assume “fit”, because the line passes through roughly the centre of each cluster except at the ends, where the thin tails depart from the line.
But we really need to work out some way of treating this variable as an ordinal variable.

Model

As mentioned we will use a linear model. As all the input variables are continuous we will use Linear Regression in preference to ANOVA.

Regression 1

All available input variables used.

attach(redwq)
redmdl<-lm(formula=quality~alcohol+chlorides+citric.acid+density+fixed.acidity+free.sulfur.dioxide+pH+residual.sugar+sulphates+total.sulfur.dioxide+volatile.acidity)

# have a look at the results of the regression
redmdl

## 
## Call:
## lm(formula = quality ~ alcohol + chlorides + citric.acid + density + 
##     fixed.acidity + free.sulfur.dioxide + pH + residual.sugar + 
##     sulphates + total.sulfur.dioxide + volatile.acidity)
## 
## Coefficients:
##          (Intercept)               alcohol             chlorides  
##            21.965208              0.276198             -1.874225  
##          citric.acid               density         fixed.acidity  
##            -0.182564            -17.881164              0.024991  
##  free.sulfur.dioxide                    pH        residual.sugar  
##             0.004361             -0.413653              0.016331  
##            sulphates  total.sulfur.dioxide      volatile.acidity  
##             0.916334             -0.003265             -1.083590

#summarise the model
summary(redmdl)

## 
## Call:
## lm(formula = quality ~ alcohol + chlorides + citric.acid + density + 
##     fixed.acidity + free.sulfur.dioxide + pH + residual.sugar + 
##     sulphates + total.sulfur.dioxide + volatile.acidity)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.68911 -0.36652 -0.04699  0.45202  2.02498 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           2.197e+01  2.119e+01   1.036   0.3002    
## alcohol               2.762e-01  2.648e-02  10.429  < 2e-16 ***
## chlorides            -1.874e+00  4.193e-01  -4.470 8.37e-06 ***
## citric.acid          -1.826e-01  1.472e-01  -1.240   0.2150    
## density              -1.788e+01  2.163e+01  -0.827   0.4086    
## fixed.acidity         2.499e-02  2.595e-02   0.963   0.3357    
## free.sulfur.dioxide   4.361e-03  2.171e-03   2.009   0.0447 *  
## pH                   -4.137e-01  1.916e-01  -2.159   0.0310 *  
## residual.sugar        1.633e-02  1.500e-02   1.089   0.2765    
## sulphates             9.163e-01  1.143e-01   8.014 2.13e-15 ***
## total.sulfur.dioxide -3.265e-03  7.287e-04  -4.480 8.00e-06 ***
## volatile.acidity     -1.084e+00  1.211e-01  -8.948  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.648 on 1587 degrees of freedom
## Multiple R-squared:  0.3606, Adjusted R-squared:  0.3561 
## F-statistic: 81.35 on 11 and 1587 DF,  p-value: < 2.2e-16

#some graphs to check residuals 
par(mfrow=c(2,2))
plot(redmdl)

# histogram of the residuals
par(mfrow=c(1,1))
hist(redmdl$residuals)

Regression 1 results:

The linear regression says that the response variable quality can be modelled as

quality = 21.965208 + 0.276198(alcohol) - 1.874225(chlorides) - 0.182564(citric.acid) - 17.881164(density) + 0.024991(fixed.acidity) + 0.004361(free.sulfur.dioxide) - 0.413653(pH) + 0.016331(residual.sugar) + 0.916334(sulphates) - 0.003265(total.sulfur.dioxide) - 1.083590(volatile.acidity)

The summary indicates that the p-values for citric.acid, density, fixed.acidity and residual.sugar are greater than 0.05. As such, at the 5% significance level we cannot reject the null hypothesis that each of these input variables makes no contribution to the variance of the quality ranking once the other inputs are in the model. We will remove these from the regression.

Adjusted R2 is not high at 0.3561, but the p-value of the overall F-test is < 0.05, so at the 5% significance level we conclude that a relationship does exist between at least some of the input variables and the quality ranking.
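
The adjusted R2 and the overall F-statistic reported by the summary can be reproduced by hand from the multiple R2, the sample size and the number of inputs. The sketch below uses the rounded figures printed in the summary (R2 = 0.3606, n = 1599, 11 inputs), so it should agree with the summary only to about three significant figures.

```r
# Rounded values taken from the summary(redmdl) output.
r2 <- 0.3606   # multiple R-squared
n  <- 1599     # number of samples
p  <- 11       # number of input variables

df_resid <- n - p - 1                          # residual degrees of freedom

# Adjusted R-squared penalises R-squared for the number of inputs.
adj_r2 <- 1 - (1 - r2) * (n - 1) / df_resid

# Overall F-statistic: explained variance per input divided by
# unexplained variance per residual degree of freedom.
f_stat <- (r2 / p) / ((1 - r2) / df_resid)

# Upper-tail p-value of the F-test that all slopes are zero.
p_value <- pf(f_stat, p, df_resid, lower.tail = FALSE)

print(c(df_resid = df_resid, adj_r2 = adj_r2, f_stat = f_stat, p_value = p_value))
```

This makes explicit that the p-value quoted with R2 belongs to the F-test of the whole model rather than to R2 itself.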

Graphs

Residuals vs Fitted indicates constant variance (although clustered into integer rankings).
Normal Q-Q indicates normally distributed residuals. The histogram of residuals also looks like a normal distribution centred around 0. Assumptions of regression: OK.

Regression 2

Input variables citric.acid, density, fixed.acidity and residual.sugar removed from the regression.

redmdl2<-lm(formula=quality~alcohol+chlorides+free.sulfur.dioxide+pH+sulphates+total.sulfur.dioxide+volatile.acidity)

# have a look at the results of the regression
redmdl2

## 
## Call:
## lm(formula = quality ~ alcohol + chlorides + free.sulfur.dioxide + 
##     pH + sulphates + total.sulfur.dioxide + volatile.acidity)
## 
## Coefficients:
##          (Intercept)               alcohol             chlorides  
##             4.430099              0.289303             -2.017814  
##  free.sulfur.dioxide                    pH             sulphates  
##             0.005077             -0.482661              0.882665  
## total.sulfur.dioxide      volatile.acidity  
##            -0.003482             -1.012753

#summarise the model
summary(redmdl2)

## 
## Call:
## lm(formula = quality ~ alcohol + chlorides + free.sulfur.dioxide + 
##     pH + sulphates + total.sulfur.dioxide + volatile.acidity)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.68918 -0.36757 -0.04653  0.46081  2.02954 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           4.4300987  0.4029168  10.995  < 2e-16 ***
## alcohol               0.2893028  0.0167958  17.225  < 2e-16 ***
## chlorides            -2.0178138  0.3975417  -5.076 4.31e-07 ***
## free.sulfur.dioxide   0.0050774  0.0021255   2.389    0.017 *  
## pH                   -0.4826614  0.1175581  -4.106 4.23e-05 ***
## sulphates             0.8826651  0.1099084   8.031 1.86e-15 ***
## total.sulfur.dioxide -0.0034822  0.0006868  -5.070 4.43e-07 ***
## volatile.acidity     -1.0127527  0.1008429 -10.043  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6477 on 1591 degrees of freedom
## Multiple R-squared:  0.3595, Adjusted R-squared:  0.3567 
## F-statistic: 127.6 on 7 and 1591 DF,  p-value: < 2.2e-16

#some graphs to check residuals 
par(mfrow=c(2,2))
plot(redmdl2)

# histogram of the residuals
par(mfrow=c(1,1))
hist(redmdl2$residuals)

Regression 2 Results:

The linear regression says that the response variable quality can be modelled as

quality = 4.430099 + 0.289303(alcohol) - 2.017814(chlorides) + 0.005077(free.sulfur.dioxide) - 0.482661(pH) + 0.882665(sulphates) - 0.003482(total.sulfur.dioxide) - 1.012753(volatile.acidity)
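
The fitted equation can be checked by hand against the coefficient table in the summary(redmdl2) output. The sketch below copies those coefficients and applies them to the wine shown in row 4 of the head(redwq) output (observed quality 6). The names `coefs` and `wine` are introduced here for illustration only.

```r
# Coefficients copied from the summary(redmdl2) output.
coefs <- c(
  "(Intercept)"        =  4.4300987,
  alcohol              =  0.2893028,
  chlorides            = -2.0178138,
  free.sulfur.dioxide  =  0.0050774,
  pH                   = -0.4826614,
  sulphates            =  0.8826651,
  total.sulfur.dioxide = -0.0034822,
  volatile.acidity     = -1.0127527
)

# Input values for row 4 of head(redwq).
wine <- c(
  alcohol = 9.8, chlorides = 0.075, free.sulfur.dioxide = 17,
  pH = 3.16, sulphates = 0.58, total.sulfur.dioxide = 60,
  volatile.acidity = 0.28
)

# Intercept plus the sum of coefficient * value, matched by name.
pred <- coefs[["(Intercept)"]] + sum(coefs[names(wine)] * wine)
print(pred)
```

Because the coefficients are rounded to seven decimal places, the result should match the value returned by predict.lm for the same inputs to about five decimal places.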

The summary indicates that none of the p-values for the remaining input variables is greater than 0.05. As such we reject the null hypothesis that these input variables make no contribution to the variance of the quality ranking.

Adjusted R2 has risen slightly but is still not high at 0.3567. The p-value of the overall F-test is still < 0.05, so at the 5% significance level we again conclude that a relationship exists between at least some of the input variables and the quality ranking.
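
Whether dropping the four inputs together cost any explanatory power can be judged with a nested-model F-test. The sketch below approximates that test from the rounded R2 values in the two summaries (0.3606 with 1587 residual degrees of freedom for the full model, 0.3595 for the reduced model), so the numbers are indicative only. With both models fitted, `anova(redmdl2, redmdl)` would give the exact version.

```r
# Rounded values taken from the two model summaries.
r2_full    <- 0.3606   # Regression 1 (11 inputs)
df_full    <- 1587     # residual degrees of freedom of Regression 1
r2_reduced <- 0.3595   # Regression 2 (7 inputs)
q          <- 4        # number of inputs dropped

# F-statistic for the joint null hypothesis that all four dropped
# coefficients are zero.
f_drop <- ((r2_full - r2_reduced) / q) / ((1 - r2_full) / df_full)

# Upper-tail p-value; a large value means no evidence that the
# dropped inputs were contributing.
p_drop <- pf(f_drop, q, df_full, lower.tail = FALSE)

print(c(f_drop = f_drop, p_drop = p_drop))
```

An F-statistic below 1 with a p-value well above 0.05 is consistent with the slight rise in adjusted R2: the simpler model explains the quality ranking about as well as the full one.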

Graphs

Predictions

Test the predictive power of the model using real input variable values.

# we expect a rank of 6
predict.lm(redmdl2, data.frame( alcohol=9.8, chlorides=0.075, free.sulfur.dioxide=17.0, pH=3.16, sulphates=0.58, total.sulfur.dioxide=60,volatile.acidity=0.280), type="response")

##        1 
## 5.694475

# we expect a rank of 5
predict.lm(redmdl2, data.frame( alcohol=9.4, chlorides=0.076, free.sulfur.dioxide=11.0, pH=3.51, sulphates=0.56, total.sulfur.dioxide=34,volatile.acidity=0.7), type="response")

##        1 
## 5.024869

# we expect a rank of 7
predict.lm(redmdl2, data.frame( alcohol=10, chlorides=0.065, free.sulfur.dioxide=15.0, pH=3.39, sulphates=0.47, total.sulfur.dioxide=21,volatile.acidity=0.65), type="response")

##        1 
## 5.315343

Results

By rounding the predictions we can see that the model predicted the correct rankings for the first two sets of input variables (6 and 5) but was incorrect where the expected ranking was 7 (the prediction of 5.315343 rounds to 5).
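
The rounding step can be written as a small function. The sketch below is one possible version: `to_rank` is a hypothetical name, it rounds halves upwards (R's own `round` sends exact halves to the nearest even number), and it clamps the result to the 0 to 10 scale. The inputs are the three predictions printed above.

```r
# Hypothetical helper: turn continuous predictions into integer rankings.
to_rank <- function(pred, lo = 0, hi = 10) {
  r <- floor(pred + 0.5)             # round half up
  as.integer(pmin(hi, pmax(lo, r)))  # clamp to the ranking scale
}

preds    <- c(5.694475, 5.024869, 5.315343)  # predictions from redmdl2
expected <- c(6L, 5L, 7L)                    # rankings we expected

ranks <- to_rank(preds)
print(data.frame(prediction = preds, rank = ranks,
                 expected = expected, correct = ranks == expected))
```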

Some suggested explanations for the results are:

- The loss of relationship information from treating the ordinal response variable as a continuous variable.
- The narrow distribution of response values (Q1 = 5, Q3 = 6), which gave the linear regression little information for determining the relationships for values outside the 1st and 3rd quartiles.
- The adjusted R2 is not high at 0.3567, although we do not have extensive experience with this type of data.

Summary

Further investigation continues into logistic regression techniques for ordinal response variables.

Dominic Mackenzie 96052826

Sunday, March 08, 2015
