This is Part II of the wine data analysis. In Part I, I compared the chemical properties of red wines and white wines. In this part, I analyze the relationship between each chemical property and the quality score given by experts. Red wines and white wines are analyzed separately because I assume that wine experts evaluate them by different standards.
Let’s first take a look at the correlations. The following table shows the correlation between each chemical property and quality for red wines. Alcohol, volatile acidity, sulphates and citric acid are the four chemical properties whose correlations with quality are larger than 0.2 in absolute value.
## quality
## fixed.acidity 0.12405165
## volatile.acidity -0.39055778
## citric.acid 0.22637251
## residual.sugar 0.01373164
## chlorides -0.12890656
## free.sulfur.dioxide -0.05065606
## total.sulfur.dioxide -0.18510029
## density -0.17491923
## pH -0.05773139
## sulphates 0.25139708
## alcohol 0.47616632
## quality 1.00000000
The following table shows the correlation between each chemical property and quality for white wines. Alcohol, density and chlorides are the three chemical properties whose correlations with quality are larger than 0.2 in absolute value.
## quality
## fixed.acidity -0.113662831
## volatile.acidity -0.194722969
## citric.acid -0.009209091
## residual.sugar -0.097576829
## chlorides -0.209934411
## free.sulfur.dioxide 0.008158067
## total.sulfur.dioxide -0.174737218
## density -0.307123313
## pH 0.099427246
## sulphates 0.053677877
## alcohol 0.435574715
## quality 1.000000000
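The code behind these two tables is not shown, so here is a minimal sketch of one way such a table could be produced with base R's `cor()`. It assumes a data frame whose columns are all numeric and include `quality`. The `toy` data frame is invented for illustration and only stands in for `red` or `white`; the name `quality_cor` is also mine.

```r
# Synthetic stand-in for `red` / `white`; these values are made up for illustration.
toy <- data.frame(
  alcohol          = c(9.4, 9.8, 10.5, 11.2, 12.0, 12.8),
  volatile.acidity = c(0.70, 0.66, 0.50, 0.45, 0.32, 0.28),
  pH               = c(3.3, 3.2, 3.4, 3.1, 3.2, 3.3),
  quality          = c(5, 5, 6, 6, 7, 7)
)

# Pearson correlation of every column with quality, kept as a one-column matrix.
quality_cor <- cor(toy)[, "quality", drop = FALSE]
print(quality_cor)

# Properties whose correlation with quality exceeds 0.2 in absolute value,
# so that strong negative correlations are picked up as well.
is_strong <- abs(quality_cor[, "quality"]) > 0.2
strong <- setdiff(rownames(quality_cor)[is_strong], "quality")
print(strong)
```

Filtering on the absolute value matters here because several of the strongest correlations in the tables above are negative.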
For each chemical property, I split the wines into ten equal-sized groups ordered by the value of that property, so each group holds 10% of the wines. Then I calculate the average quality for each group to see whether there is a relationship. I wrote the following function to keep my code clear and succinct.
library(dplyr)      # ntile()
library(ggplot2)    # plotting
library(grid)       # textGrob(), gpar()
library(gridExtra)  # grid.arrange()

# Bin a chemical property into ten groups and plot the average quality per group.
# Results are assigned globally with <<- (reddf, whitedf, redplot, whiteplot).
fun <- function(redchem, redquality, whitechem, whitequality) {
  # Group membership (1 to 10) for every wine
  redfacut <- ntile(redchem, 10)
  whitefacut <- ntile(whitechem, 10)
  # Mean property value and mean quality within each group
  reddf <<- data.frame(red=tapply(redchem, redfacut, mean),
                       quality=tapply(redquality, redfacut, mean))
  whitedf <<- data.frame(white=tapply(whitechem, whitefacut, mean),
                         quality=tapply(whitequality, whitefacut, mean))
  redplot <<- ggplot(aes(x=red, y=quality, colour = I("#CC0000")), data = reddf) + geom_point() + geom_line() +
    theme(legend.position="none", axis.title.x = element_text(colour = I("#CC0000"))) +
    xlab("red wine") + ylab("average quality")
  whiteplot <<- ggplot(aes(x=white, y=quality, colour = I("#33CCFF")), data = whitedf) + geom_point() + geom_line() +
    theme(legend.position="none", axis.title.x = element_text(colour = I("#33CCFF"))) +
    xlab("white wine") + ylab("average quality")
}
fun(red$fixed.acidity, red$quality, white$fixed.acidity, white$quality)
grid.arrange(redplot, whiteplot, ncol = 2, top = textGrob("Quality by Fixed Acidity",gp=gpar(fontsize=15,font=1)))
print(cbind(reddf,whitedf))
## red quality white quality
## 1 5.965000 5.625000 5.510612 6.057143
## 2 6.764375 5.543750 6.041633 5.881633
## 3 7.111875 5.575000 6.309796 5.873469
## 4 7.415000 5.431250 6.510918 5.861224
## 5 7.745625 5.550000 6.698160 5.887526
## 6 8.120625 5.593750 6.874898 5.926531
## 7 8.635625 5.581250 7.090000 5.934694
## 8 9.252500 5.743750 7.334490 5.957143
## 9 10.185000 5.850000 7.674286 5.767347
## 10 12.023899 5.867925 8.506135 5.631902
Fixed acidity is positively related to quality for red wines, while the relationship is negative for white wines. However, neither relationship is very pronounced.
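To see the grouping step on its own, here is a minimal sketch of a helper that returns the per-group means as a data frame instead of assigning globals with `<<-`. The name `decile_means` and the synthetic data are assumptions for illustration and not part of the analysis above; `ntile()` is the same dplyr function used in `fun`.

```r
library(dplyr)  # for ntile()

# Hypothetical helper: split `chem` into `n` rank-based groups of (near) equal size
# and return the mean of `chem` and of `quality` within each group.
decile_means <- function(chem, quality, n = 10) {
  grp <- ntile(chem, n)
  data.frame(
    group   = sort(unique(grp)),
    chem    = as.vector(tapply(chem, grp, mean)),
    quality = as.vector(tapply(quality, grp, mean))
  )
}

# Synthetic data, invented for illustration: 100 wines whose quality rises with the property.
chem    <- seq(8, 14, length.out = 100)
quality <- 3 + 0.25 * chem

res <- decile_means(chem, quality)
print(res)
```

Returning the data frame makes it possible to call the helper once per wine type and keep the results in separate variables, at the cost of building the plots in a separate step.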
fun(red$volatile.acidity, red$quality, white$volatile.acidity, white$quality)
grid.arrange(redplot, whiteplot, ncol = 2, top = textGrob("Quality by Volatile Acidity",gp=gpar(fontsize=15,font=1)))
print(cbind(reddf,whitedf))
## red quality white quality
## 1 0.2589688 6.243750 0.1476633 6.183673
## 2 0.3406875 6.062500 0.1853061 6.132653
## 3 0.3926250 5.818750 0.2102959 5.987755
## 4 0.4402187 5.600000 0.2324082 5.918367
## 5 0.4958750 5.681250 0.2525562 5.852761
## 6 0.5465000 5.643750 0.2730408 5.755102
## 7 0.5927500 5.418750 0.2956122 5.777551
## 8 0.6375312 5.437500 0.3243571 5.755102
## 9 0.6996562 5.306250 0.3664082 5.808163
## 10 0.8755660 5.144654 0.4951534 5.607362
The relationship between volatile acidity and quality is clearer: the two are negatively related for both red and white wines. This result makes sense because volatile acidity measures the amount of acetic acid in wine, which at high levels can lead to an unpleasant, vinegary taste. So a higher amount of volatile acidity is associated with a lower score.
fun(red$citric.acid, red$quality, white$citric.acid, white$quality)
grid.arrange(redplot, whiteplot, ncol = 2, top = textGrob("Quality by Citric Acid",gp=gpar(fontsize=15,font=1)))
print(cbind(reddf,whitedf))
## red quality white quality
## 1 0.0017500 5.481250 0.1528571 5.446939
## 2 0.0355625 5.400000 0.2372245 5.697959
## 3 0.0945625 5.537500 0.2664694 5.991837
## 4 0.1695625 5.412500 0.2860612 6.148980
## 5 0.2330000 5.381250 0.3048671 6.116564
## 6 0.2874375 5.568750 0.3265510 6.108163
## 7 0.3523750 5.875000 0.3512449 5.997959
## 8 0.4267500 6.006250 0.3867143 5.910204
## 9 0.4926250 5.718750 0.4487347 5.736735
## 10 0.6183019 5.981132 0.5816360 5.623722
Citric acid is positively related to quality for red wines. For white wines the relationship is harder to summarize: average quality rises over the first four groups and then declines.
fun(red$residual.sugar, red$quality, white$residual.sugar, white$quality)
grid.arrange(redplot, whiteplot, ncol = 2, top = textGrob("Quality by Residual Sugar",gp=gpar(fontsize=15,font=1)))
print(cbind(reddf,whitedf))
## red quality white quality
## 1 1.512500 5.468750 1.075714 5.744898
## 2 1.764375 5.768750 1.389490 5.846939
## 3 1.893125 5.537500 1.738163 5.991837
## 4 2.011875 5.587500 2.559490 6.171429
## 5 2.130625 5.606250 4.404090 6.110429
## 6 2.247813 5.637500 6.169082 5.918367
## 7 2.404062 5.831250 7.783061 5.744898
## 8 2.594375 5.618750 9.853265 5.763265
## 9 3.024062 5.643750 12.546735 5.751020
## 10 5.825786 5.660377 16.411452 5.736196
There is no obvious relationship between residual sugar and quality for either red wines or white wines.
fun(red$chlorides, red$quality, white$chlorides, white$quality)
grid.arrange(redplot, whiteplot, ncol = 2, top = textGrob("Quality by Chlorides",gp=gpar(fontsize=15,font=1)))
print(cbind(reddf,whitedf))
## red quality white quality
## 1 0.05186250 5.931250 0.02560816 6.342857
## 2 0.06420625 5.962500 0.03205306 6.206122
## 3 0.07000000 5.675000 0.03552857 6.153061
## 4 0.07435000 5.675000 0.03846531 6.000000
## 5 0.07758750 5.587500 0.04139059 5.807771
## 6 0.08077500 5.450000 0.04430612 5.848980
## 7 0.08470000 5.575000 0.04704694 5.669388
## 8 0.09051250 5.618750 0.05023265 5.624490
## 9 0.09987500 5.462500 0.05456531 5.657143
## 10 0.18138365 5.421384 0.08860532 5.468303
Chlorides measure the amount of salt in the wine. Chlorides and quality appear to be negatively related for both red and white wines. It seems that experts do not like salty wines.
fun(red$free.sulfur.dioxide, red$quality, white$free.sulfur.dioxide, white$quality)
grid.arrange(redplot, whiteplot, ncol = 2, top = textGrob("Quality by Free Sulfur Dioxide",gp=gpar(fontsize=15,font=1)))
print(cbind(reddf,whitedf))
## red quality white quality
## 1 4.037500 5.625000 10.50918 5.420408
## 2 5.759375 5.843750 18.23673 5.848980
## 3 7.443750 5.568750 23.28265 5.908163
## 4 9.931250 5.631250 27.56837 6.057143
## 5 12.256250 5.712500 31.57566 6.077710
## 6 14.937500 5.562500 35.60612 6.024490
## 7 17.493750 5.731250 40.14694 6.051020
## 8 21.693750 5.600000 45.89388 5.961224
## 9 27.162500 5.506250 52.59694 5.785714
## 10 38.172956 5.578616 67.72290 5.644172
There is no obvious linear relationship between free sulfur dioxide and quality for either red wines or white wines, although for white wines the lowest and highest groups have the lowest average quality.
fun(red$total.sulfur.dioxide, red$quality, white$total.sulfur.dioxide, white$quality)
grid.arrange(redplot, whiteplot, ncol = 2, top = textGrob("Quality by Total Sulfur Dioxide",gp=gpar(fontsize=15,font=1)))
print(cbind(reddf,whitedf))
## red quality white quality
## 1 11.01250 5.681250 70.95714 5.818367
## 2 16.51875 5.837500 95.17755 6.167347
## 3 21.86875 5.737500 107.95306 6.034694
## 4 27.21875 5.837500 118.23776 6.095918
## 5 33.96875 5.668750 128.84867 6.022495
## 6 41.57500 5.731250 140.50408 5.928571
## 7 49.99375 5.631250 153.20816 5.906122
## 8 62.22500 5.506250 167.28469 5.744898
## 9 81.55000 5.493750 184.51224 5.575510
## 10 119.20126 5.232704 217.06442 5.484663
Total sulfur dioxide is generally negatively related to quality for red wines. White wines show a similar relationship, except for the 0%-10% group.
fun(red$density, red$quality, white$density, white$quality)
grid.arrange(redplot, whiteplot, ncol = 2, top = textGrob("Quality by Density",gp=gpar(fontsize=15,font=1)))
print(cbind(reddf,whitedf))
## red quality white quality
## 1 0.9933612 6.075000 0.9895554 6.526531
## 2 0.9949803 6.000000 0.9908483 6.330612
## 3 0.9956052 5.625000 0.9917164 6.112245
## 4 0.9961056 5.506250 0.9925161 5.908163
## 5 0.9965189 5.456250 0.9933160 5.815951
## 6 0.9969276 5.500000 0.9942000 5.667347
## 7 0.9973394 5.556250 0.9951781 5.642857
## 8 0.9978478 5.425000 0.9961439 5.518367
## 9 0.9985786 5.618750 0.9975082 5.597959
## 10 1.0002238 5.597484 0.9993006 5.658487
Density and quality show a generally negative relationship for both red wines and white wines, although the decline levels off in the higher-density groups.
fun(red$pH, red$quality, white$pH, white$quality)
grid.arrange(redplot, whiteplot, ncol = 2, top = textGrob("Quality by pH",gp=gpar(fontsize=15,font=1)))
print(cbind(reddf,whitedf))
## red quality white quality
## 1 3.044437 5.712500 2.945061 5.848980
## 2 3.157000 5.631250 3.036327 5.773469
## 3 3.208250 5.737500 3.085959 5.761224
## 4 3.254375 5.737500 3.125714 5.851020
## 5 3.293187 5.537500 3.159305 5.740286
## 6 3.328750 5.675000 3.194531 5.816327
## 7 3.365813 5.631250 3.233755 5.851020
## 8 3.402563 5.650000 3.280490 5.953061
## 9 3.464000 5.475000 3.343408 6.120408
## 10 3.594528 5.572327 3.478650 6.063395
pH and quality do not show an obvious linear relationship for either red wines or white wines.
fun(red$sulphates, red$quality, white$sulphates, white$quality)
grid.arrange(redplot, whiteplot, ncol = 2, top = textGrob("Quality by Sulphates",gp=gpar(fontsize=15,font=1)))
print(cbind(reddf,whitedf))
## red quality white quality
## 1 0.4613125 5.137500 0.3303878 5.914286
## 2 0.5215625 5.250000 0.3785306 5.938776
## 3 0.5506250 5.425000 0.4082245 5.883673
## 4 0.5788750 5.437500 0.4370612 5.820408
## 5 0.6059375 5.556250 0.4614315 5.744376
## 6 0.6356875 5.662500 0.4890000 5.793878
## 7 0.6756875 5.818750 0.5144286 5.744898
## 8 0.7306250 6.056250 0.5493265 5.875510
## 9 0.8014375 6.125000 0.6014286 6.014286
## 10 1.0220126 5.893082 0.7290798 6.049080
Sulphates and quality show a positive relationship for red wines, except for the 90%-100% group, while they do not show any linear relationship for white wines.
fun(red$alcohol, red$quality, white$alcohol, white$quality)
grid.arrange(redplot, whiteplot, ncol = 2, top = textGrob("Quality by Alcohol",gp=gpar(fontsize=15,font=1)))
print(cbind(reddf,whitedf))
## red quality white quality
## 1 9.137708 5.256250 8.853469 5.597959
## 2 9.393125 5.200000 9.193469 5.471429
## 3 9.526667 5.300000 9.459796 5.434694
## 4 9.750000 5.306250 9.774796 5.555102
## 5 10.010104 5.500000 10.153885 5.666667
## 6 10.354375 5.587500 10.533231 5.824490
## 7 10.724687 5.831250 10.939415 6.057143
## 8 11.101667 5.825000 11.381735 6.110204
## 9 11.625938 6.143750 12.033701 6.434694
## 10 12.619287 6.415094 12.823149 6.627812
The result clearly shows that alcohol and quality are positively related for both red and white wines.
In the next part, I will test different models to predict wine quality based on chemical properties.
Thanks for reading.