Introduction

This is part II of the wine data analysis. In part I, I compared the chemical properties of red wines and white wines. In this part, I analyze the relationship between each chemical property and the quality score given by wine experts. Red wines and white wines are analyzed separately because I assume that experts evaluate them by different standards.

Exploratory Data Analysis

Let’s first take a look at the correlations. The following table shows the correlation between each chemical property and quality for red wines. Alcohol, volatile acidity, sulphates and citric acid are the four chemical properties whose correlation with quality is larger than 0.2 in absolute value.

##                          quality
## fixed.acidity         0.12405165
## volatile.acidity     -0.39055778
## citric.acid           0.22637251
## residual.sugar        0.01373164
## chlorides            -0.12890656
## free.sulfur.dioxide  -0.05065606
## total.sulfur.dioxide -0.18510029
## density              -0.17491923
## pH                   -0.05773139
## sulphates             0.25139708
## alcohol               0.47616632
## quality               1.00000000

The following table shows the correlation between each chemical property and quality for white wines. Alcohol, density and chlorides are the three chemical properties whose correlation with quality is larger than 0.2 in absolute value.

##                           quality
## fixed.acidity        -0.113662831
## volatile.acidity     -0.194722969
## citric.acid          -0.009209091
## residual.sugar       -0.097576829
## chlorides            -0.209934411
## free.sulfur.dioxide   0.008158067
## total.sulfur.dioxide -0.174737218
## density              -0.307123313
## pH                    0.099427246
## sulphates             0.053677877
## alcohol               0.435574715
## quality               1.000000000
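A table like the two above can be produced with base R's `cor()`. The sketch below is only an illustration on a small made-up data frame: `toy` and its values are hypothetical stand-ins for the `red` and `white` data frames, not the actual wine data.

```r
# Hypothetical toy data standing in for the wine data frames (values made up).
toy <- data.frame(
  alcohol          = c(9.4, 9.8, 10.5, 11.2, 12.0, 12.8),
  volatile.acidity = c(0.70, 0.66, 0.52, 0.45, 0.36, 0.30),
  quality          = c(5, 5, 6, 6, 7, 7)
)

# Pearson correlation matrix of all columns; keep only the "quality" column.
# drop = FALSE keeps the result as a one-column matrix with row names.
quality_cor <- cor(toy)[, "quality", drop = FALSE]
print(quality_cor)

# Properties whose correlation with quality exceeds 0.2 in absolute value,
# leaving out quality's correlation with itself.
strong <- rownames(quality_cor)[abs(quality_cor[, "quality"]) > 0.2]
strong <- setdiff(strong, "quality")
print(strong)
```

Filtering on the absolute value matters here because a strongly negative correlation is as informative as a strongly positive one.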

I split the wines into ten equal-sized groups by the value of each chemical property, so that each group holds about 10% of the wines. Then I calculate the average quality for each group to see whether there is a relationship. I wrote the following function to keep my code clear and succinct.

library(dplyr)      # ntile()
library(ggplot2)    # ggplot()
library(gridExtra)  # grid.arrange()
library(grid)       # textGrob(), gpar()

# Bin each chemical property into ten groups, then average the property and
# the quality score within each group. The results are assigned with <<- so
# that reddf, whitedf, redplot and whiteplot are available after the call.
fun <- function(redchem, redquality, whitechem, whitequality) {
  redfacut <- ntile(redchem, 10)
  whitefacut <- ntile(whitechem, 10)
  reddf <<- data.frame(red=tapply(redchem, redfacut, mean), 
                        quality=tapply(redquality, redfacut, mean))
  whitedf <<- data.frame(white=tapply(whitechem, whitefacut, mean), 
                      quality=tapply(whitequality, whitefacut, mean))
  redplot <<- ggplot(aes(x=red, y=quality, colour = I("#CC0000")), data = reddf) + geom_point() + geom_line() +
    theme(legend.position="none", axis.title.x = element_text(colour = I("#CC0000"))) + 
    xlab("red wine") + ylab("average quality")
  whiteplot <<- ggplot(aes(x=white,y=quality, colour = I("#33CCFF")), data = whitedf) + geom_point() + geom_line() + 
    theme(legend.position="none", axis.title.x = element_text(color = I("#33CCFF"))) + 
    xlab("white wine") + ylab("average quality") 
}
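
To make the grouping step concrete, here is a minimal sketch in base R on made-up numbers. The helper `decile_means` is hypothetical and not part of the original analysis. It approximates what `ntile()` does by ranking the values and cutting the ranks into ten groups, and it may assign leftover rows differently from `ntile()` when the number of rows is not a multiple of ten.

```r
# Hypothetical helper: bin `chem` into `n` roughly equal-sized rank groups and
# return the mean of `chem` and of `quality` within each group.
decile_means <- function(chem, quality, n = 10) {
  # Rank the values (ties broken by order of appearance), then map the ranks
  # onto groups 1..n.
  r <- rank(chem, ties.method = "first")
  grp <- floor(n * (r - 1) / length(chem)) + 1
  data.frame(
    chem    = as.numeric(tapply(chem, grp, mean)),
    quality = as.numeric(tapply(quality, grp, mean))
  )
}

# Made-up example: 20 wines, so each of the 10 groups holds 2 wines.
chem    <- 20:1
quality <- 5 + chem / 20   # invented scores that rise with the property
res <- decile_means(chem, quality)
print(res)
```

Each row of `res` corresponds to one group, ordered from the lowest to the highest values of the property, which is the same layout as the tables printed in the sections below.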

Fixed Acidity

fun(red$fixed.acidity, red$quality, white$fixed.acidity, white$quality)
grid.arrange(redplot, whiteplot, ncol = 2, top = textGrob("Quality by Fixed Acidity",gp=gpar(fontsize=15,font=1)))

print(cbind(reddf,whitedf)) 
##          red  quality    white  quality
## 1   5.965000 5.625000 5.510612 6.057143
## 2   6.764375 5.543750 6.041633 5.881633
## 3   7.111875 5.575000 6.309796 5.873469
## 4   7.415000 5.431250 6.510918 5.861224
## 5   7.745625 5.550000 6.698160 5.887526
## 6   8.120625 5.593750 6.874898 5.926531
## 7   8.635625 5.581250 7.090000 5.934694
## 8   9.252500 5.743750 7.334490 5.957143
## 9  10.185000 5.850000 7.674286 5.767347
## 10 12.023899 5.867925 8.506135 5.631902

Fixed acidity is positively related to quality for red wines, while the relationship is negative for white wines. However, neither relationship is very pronounced.

Volatile Acidity

fun(red$volatile.acidity, red$quality, white$volatile.acidity, white$quality)
grid.arrange(redplot, whiteplot, ncol = 2, top = textGrob("Quality by Volatile Acidity",gp=gpar(fontsize=15,font=1)))

print(cbind(reddf,whitedf)) 
##          red  quality     white  quality
## 1  0.2589688 6.243750 0.1476633 6.183673
## 2  0.3406875 6.062500 0.1853061 6.132653
## 3  0.3926250 5.818750 0.2102959 5.987755
## 4  0.4402187 5.600000 0.2324082 5.918367
## 5  0.4958750 5.681250 0.2525562 5.852761
## 6  0.5465000 5.643750 0.2730408 5.755102
## 7  0.5927500 5.418750 0.2956122 5.777551
## 8  0.6375312 5.437500 0.3243571 5.755102
## 9  0.6996562 5.306250 0.3664082 5.808163
## 10 0.8755660 5.144654 0.4951534 5.607362

The relationship between volatile acidity and quality is more pronounced: the two are negatively related for both red wines and white wines. This result makes sense because volatile acidity measures the amount of acetic acid in wine, which at high levels can lead to an unpleasant, vinegar-like taste. So a higher level of volatile acidity tends to come with a lower score.
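
One way to put a number on how consistent this downward trend is (not part of the original analysis) is a Spearman rank correlation across the ten group means. The sketch below copies the red wine columns from the table above. Because it works on group averages rather than on individual wines, it only describes the trend across groups.

```r
# Red wine group means copied from the table above.
volatile_acidity <- c(0.2589688, 0.3406875, 0.3926250, 0.4402187, 0.4958750,
                      0.5465000, 0.5927500, 0.6375312, 0.6996562, 0.8755660)
avg_quality      <- c(6.243750, 6.062500, 5.818750, 5.600000, 5.681250,
                      5.643750, 5.418750, 5.437500, 5.306250, 5.144654)

# Spearman correlation uses ranks only, so it measures how monotonic the
# trend across groups is rather than how linear it is.
rho <- cor(volatile_acidity, avg_quality, method = "spearman")
print(rho)
```

A value close to -1 means that average quality falls almost every time volatile acidity moves up one group.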

Citric Acid

fun(red$citric.acid, red$quality, white$citric.acid, white$quality)
grid.arrange(redplot, whiteplot, ncol = 2, top = textGrob("Quality by Citric Acid",gp=gpar(fontsize=15,font=1)))

print(cbind(reddf,whitedf))
##          red  quality     white  quality
## 1  0.0017500 5.481250 0.1528571 5.446939
## 2  0.0355625 5.400000 0.2372245 5.697959
## 3  0.0945625 5.537500 0.2664694 5.991837
## 4  0.1695625 5.412500 0.2860612 6.148980
## 5  0.2330000 5.381250 0.3048671 6.116564
## 6  0.2874375 5.568750 0.3265510 6.108163
## 7  0.3523750 5.875000 0.3512449 5.997959
## 8  0.4267500 6.006250 0.3867143 5.910204
## 9  0.4926250 5.718750 0.4487347 5.736735
## 10 0.6183019 5.981132 0.5816360 5.623722

Citric acid is positively related to quality for red wines. For white wines, average quality rises up to the fourth group and then falls, so no clear linear relationship emerges.

Residual Sugar

fun(red$residual.sugar, red$quality, white$residual.sugar, white$quality)
grid.arrange(redplot, whiteplot, ncol = 2, top = textGrob("Quality by Residual Sugar",gp=gpar(fontsize=15,font=1)))

print(cbind(reddf,whitedf))
##         red  quality     white  quality
## 1  1.512500 5.468750  1.075714 5.744898
## 2  1.764375 5.768750  1.389490 5.846939
## 3  1.893125 5.537500  1.738163 5.991837
## 4  2.011875 5.587500  2.559490 6.171429
## 5  2.130625 5.606250  4.404090 6.110429
## 6  2.247813 5.637500  6.169082 5.918367
## 7  2.404062 5.831250  7.783061 5.744898
## 8  2.594375 5.618750  9.853265 5.763265
## 9  3.024062 5.643750 12.546735 5.751020
## 10 5.825786 5.660377 16.411452 5.736196

There is no obvious relationship between residual sugar and quality for either red wines or white wines.

Chlorides

fun(red$chlorides, red$quality, white$chlorides, white$quality)
grid.arrange(redplot, whiteplot, ncol = 2, top = textGrob("Quality by Chlorides",gp=gpar(fontsize=15,font=1)))

print(cbind(reddf,whitedf))
##           red  quality      white  quality
## 1  0.05186250 5.931250 0.02560816 6.342857
## 2  0.06420625 5.962500 0.03205306 6.206122
## 3  0.07000000 5.675000 0.03552857 6.153061
## 4  0.07435000 5.675000 0.03846531 6.000000
## 5  0.07758750 5.587500 0.04139059 5.807771
## 6  0.08077500 5.450000 0.04430612 5.848980
## 7  0.08470000 5.575000 0.04704694 5.669388
## 8  0.09051250 5.618750 0.05023265 5.624490
## 9  0.09987500 5.462500 0.05456531 5.657143
## 10 0.18138365 5.421384 0.08860532 5.468303

Chlorides measure the amount of salt in the wine. Chlorides and quality appear to be negatively related for both red wines and white wines. It seems that experts do not like salty wines.

Free Sulfur Dioxide

fun(red$free.sulfur.dioxide, red$quality, white$free.sulfur.dioxide, white$quality)
grid.arrange(redplot, whiteplot, ncol = 2, top = textGrob("Quality by Free Sulfur Dioxide",gp=gpar(fontsize=15,font=1)))

print(cbind(reddf,whitedf))
##          red  quality    white  quality
## 1   4.037500 5.625000 10.50918 5.420408
## 2   5.759375 5.843750 18.23673 5.848980
## 3   7.443750 5.568750 23.28265 5.908163
## 4   9.931250 5.631250 27.56837 6.057143
## 5  12.256250 5.712500 31.57566 6.077710
## 6  14.937500 5.562500 35.60612 6.024490
## 7  17.493750 5.731250 40.14694 6.051020
## 8  21.693750 5.600000 45.89388 5.961224
## 9  27.162500 5.506250 52.59694 5.785714
## 10 38.172956 5.578616 67.72290 5.644172

There is no obvious linear relationship between free sulfur dioxide and quality for either red wines or white wines. For white wines, average quality is highest in the middle groups and lower at both ends.
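
Even without a steady upward or downward trend, it can be useful to see which group has the highest average quality. The sketch below is an illustration only, using the white wine columns copied from the table above. The variable names are my own.

```r
# White wine group means copied from the table above.
free_so2    <- c(10.50918, 18.23673, 23.28265, 27.56837, 31.57566,
                 35.60612, 40.14694, 45.89388, 52.59694, 67.72290)
avg_quality <- c(5.420408, 5.848980, 5.908163, 6.057143, 6.077710,
                 6.024490, 6.051020, 5.961224, 5.785714, 5.644172)

# Index of the group with the highest average quality.
peak <- which.max(avg_quality)
cat("Peak at group", peak, "with mean free sulfur dioxide", free_so2[peak], "\n")

# A hump shape means quality at the peak exceeds quality at both ends.
is_hump <- avg_quality[peak] > avg_quality[1] &&
  avg_quality[peak] > avg_quality[length(avg_quality)]
print(is_hump)
```

A peak in the interior groups is a pattern that a single correlation coefficient cannot capture, which is consistent with the near-zero correlation reported for white wines.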

Total Sulfur Dioxide

fun(red$total.sulfur.dioxide, red$quality, white$total.sulfur.dioxide, white$quality)
grid.arrange(redplot, whiteplot, ncol = 2, top = textGrob("Quality by Total Sulfur Dioxide",gp=gpar(fontsize=15,font=1)))

print(cbind(reddf,whitedf))
##          red  quality     white  quality
## 1   11.01250 5.681250  70.95714 5.818367
## 2   16.51875 5.837500  95.17755 6.167347
## 3   21.86875 5.737500 107.95306 6.034694
## 4   27.21875 5.837500 118.23776 6.095918
## 5   33.96875 5.668750 128.84867 6.022495
## 6   41.57500 5.731250 140.50408 5.928571
## 7   49.99375 5.631250 153.20816 5.906122
## 8   62.22500 5.506250 167.28469 5.744898
## 9   81.55000 5.493750 184.51224 5.575510
## 10 119.20126 5.232704 217.06442 5.484663

Total sulfur dioxide is generally negatively related to quality for red wines. White wines show a similar pattern, except for the 0%-10% group.

Density

fun(red$density, red$quality, white$density, white$quality)
grid.arrange(redplot, whiteplot, ncol = 2, top = textGrob("Quality by Density",gp=gpar(fontsize=15,font=1)))

print(cbind(reddf,whitedf))
##          red  quality     white  quality
## 1  0.9933612 6.075000 0.9895554 6.526531
## 2  0.9949803 6.000000 0.9908483 6.330612
## 3  0.9956052 5.625000 0.9917164 6.112245
## 4  0.9961056 5.506250 0.9925161 5.908163
## 5  0.9965189 5.456250 0.9933160 5.815951
## 6  0.9969276 5.500000 0.9942000 5.667347
## 7  0.9973394 5.556250 0.9951781 5.642857
## 8  0.9978478 5.425000 0.9961439 5.518367
## 9  0.9985786 5.618750 0.9975082 5.597959
## 10 1.0002238 5.597484 0.9993006 5.658487

Density and quality show a negative relationship for both red wines and white wines, although the decline levels off in the higher-density groups.

pH

fun(red$pH, red$quality, white$pH, white$quality)
grid.arrange(redplot, whiteplot, ncol = 2, top = textGrob("Quality by pH",gp=gpar(fontsize=15,font=1)))

print(cbind(reddf,whitedf))
##         red  quality    white  quality
## 1  3.044437 5.712500 2.945061 5.848980
## 2  3.157000 5.631250 3.036327 5.773469
## 3  3.208250 5.737500 3.085959 5.761224
## 4  3.254375 5.737500 3.125714 5.851020
## 5  3.293187 5.537500 3.159305 5.740286
## 6  3.328750 5.675000 3.194531 5.816327
## 7  3.365813 5.631250 3.233755 5.851020
## 8  3.402563 5.650000 3.280490 5.953061
## 9  3.464000 5.475000 3.343408 6.120408
## 10 3.594528 5.572327 3.478650 6.063395

pH and quality do not show an obvious linear relationship for either red wines or white wines.

Sulphates

fun(red$sulphates, red$quality, white$sulphates, white$quality)
grid.arrange(redplot, whiteplot, ncol = 2, top = textGrob("Quality by Sulphates",gp=gpar(fontsize=15,font=1)))

print(cbind(reddf,whitedf))
##          red  quality     white  quality
## 1  0.4613125 5.137500 0.3303878 5.914286
## 2  0.5215625 5.250000 0.3785306 5.938776
## 3  0.5506250 5.425000 0.4082245 5.883673
## 4  0.5788750 5.437500 0.4370612 5.820408
## 5  0.6059375 5.556250 0.4614315 5.744376
## 6  0.6356875 5.662500 0.4890000 5.793878
## 7  0.6756875 5.818750 0.5144286 5.744898
## 8  0.7306250 6.056250 0.5493265 5.875510
## 9  0.8014375 6.125000 0.6014286 6.014286
## 10 1.0220126 5.893082 0.7290798 6.049080

Sulphates and quality show a positive relationship for red wines, except in the 90%-100% group, while they do not show any linear relationship for white wines.

Alcohol

fun(red$alcohol, red$quality, white$alcohol, white$quality)
grid.arrange(redplot, whiteplot, ncol = 2, top = textGrob("Quality by Alcohol",gp=gpar(fontsize=15,font=1)))

print(cbind(reddf,whitedf))
##          red  quality     white  quality
## 1   9.137708 5.256250  8.853469 5.597959
## 2   9.393125 5.200000  9.193469 5.471429
## 3   9.526667 5.300000  9.459796 5.434694
## 4   9.750000 5.306250  9.774796 5.555102
## 5  10.010104 5.500000 10.153885 5.666667
## 6  10.354375 5.587500 10.533231 5.824490
## 7  10.724687 5.831250 10.939415 6.057143
## 8  11.101667 5.825000 11.381735 6.110204
## 9  11.625938 6.143750 12.033701 6.434694
## 10 12.619287 6.415094 12.823149 6.627812

The result clearly shows that alcohol and quality are positively related for both red and white wines.
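
As a rough check on the direction of this trend, one could fit a straight line through the ten group means. This is a sketch that was not part of the original analysis, and the data frame names `red_tab` and `white_tab` are my own. The values are copied from the table above. Because the fit uses group averages rather than individual wines, it says nothing about how well alcohol predicts the score of a single wine.

```r
# Group means copied from the table above.
red_tab <- data.frame(
  alcohol = c(9.137708, 9.393125, 9.526667, 9.750000, 10.010104,
              10.354375, 10.724687, 11.101667, 11.625938, 12.619287),
  quality = c(5.256250, 5.200000, 5.300000, 5.306250, 5.500000,
              5.587500, 5.831250, 5.825000, 6.143750, 6.415094)
)
white_tab <- data.frame(
  alcohol = c(8.853469, 9.193469, 9.459796, 9.774796, 10.153885,
              10.533231, 10.939415, 11.381735, 12.033701, 12.823149),
  quality = c(5.597959, 5.471429, 5.434694, 5.555102, 5.666667,
              5.824490, 6.057143, 6.110204, 6.434694, 6.627812)
)

# Least-squares slope of average quality on average alcohol content.
slope_red   <- unname(coef(lm(quality ~ alcohol, data = red_tab))["alcohol"])
slope_white <- unname(coef(lm(quality ~ alcohol, data = white_tab))["alcohol"])
print(c(red = slope_red, white = slope_white))
```

A positive slope for both wine types agrees with the positive correlations reported at the start of this post.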

In the next part, I will test different models to predict wine quality based on chemical properties.

Thanks for reading.