P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236. Available at: [@Elsevier] http://dx.doi.org/10.1016/j.dss.2009.05.016 [Pre-press (pdf)] http://www3.dsi.uminho.pt/pcortez/winequality09.pdf [bib] http://www3.dsi.uminho.pt/pcortez/dss09.bib
This report explores a dataset of 1599 red wines, with 11 variables describing the chemical properties of each wine and one quality rating per wine.
Data dimensions
## [1] 1599 13
Data structure
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
Data Summary
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
Our data consists of 1599 observations of 13 variables: the 11 chemical properties, plus X (a row index) and quality.
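The overview above can be reproduced with base R. This is a minimal sketch, assuming the data frame is called `rw` (the name that appears in the model calls later in this report). To stay self-contained it rebuilds only the first ten rows of four columns from the structure listing instead of reading the full file, so its output will not match the full-data numbers.

```r
# First ten rows of four columns, copied from the structure listing above.
# The full report uses all 1599 rows; `rw` here is a small stand-in.
rw <- data.frame(
  X           = 1:10,
  citric.acid = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  alcohol     = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality     = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

dim(rw)      # number of rows and columns
str(rw)      # type and first values of each column
summary(rw)  # quartiles, mean, minimum and maximum per column
```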
Most observations have an average quality of 5 or 6, and there are only a few samples of low-quality and high-quality wines, which may affect my analysis. What characteristics does a good-quality wine have? Is it the sourness, the sweetness, or the level of alcohol that makes the best wines?
The distribution of fixed acidity is slightly right-skewed and centered around 8.
Volatile acidity is present in small amounts in our wines, with a mean of about 0.53, and the distribution looks bimodal, with peaks near 0.4 and 0.6.
## [1] 0.08255159
About 8.3% of our wines (132 of 1599) have no citric acid. Does that mean citric acid is not necessary in wines, or is there a problem in the data?
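A share like this can be computed by averaging a logical vector. The sketch below uses only the citric acid values of the first ten wines from the structure listing, so the sample share differs from the full-data figure; the last line repeats the full-data calculation from the count of 132 zero values reported later in this report.

```r
# Citric acid values of the first ten wines, from the structure listing.
citric <- c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36)

# Comparing to zero gives TRUE/FALSE; the mean of a logical vector is the
# share of TRUE values.
share_zero <- mean(citric == 0)
share_zero

# Full data set: 132 of 1599 wines have zero citric acid.
132 / 1599
```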
Most of our wines have a residual sugar level around the median of 2.2, and the third quartile is only 2.6. There are many outliers in the higher range; if they are removed, we obtain a distribution that looks normal, as shown below.
As with residual sugar, the distribution of chlorides is long-tailed and concentrated at the low end: 75% of the wines in our data have a chloride (salt) level of 0.09 or less.
The distributions of free sulfur dioxide and total sulfur dioxide are both right-skewed, with a long tail and a few outliers. I removed the outliers to see the data more clearly. I wonder whether sulfur dioxide affects the quality of the wine.
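The report does not state which rule was used to remove outliers. One common option is to drop everything above an upper quantile. The helper `trim_upper` and its 0.9 cutoff below are hypothetical, chosen only to illustrate the idea on the total sulfur dioxide values of the first ten wines.

```r
# Hypothetical helper: keep values at or below the p-th quantile.
# The 0.9 cutoff is an assumption, not the rule used in this report.
trim_upper <- function(x, p = 0.9) {
  x[x <= quantile(x, p)]
}

# Total sulfur dioxide of the first ten wines, from the structure listing.
total_so2 <- c(34, 67, 54, 60, 34, 40, 59, 21, 18, 102)
trimmed <- trim_upper(total_so2)

range(total_so2)  # includes the value 102 in the upper tail
range(trimmed)    # upper tail removed
```

A quantile cutoff keeps the rule independent of the variable's units, which matters here because the long-tailed variables are on very different scales.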
Density and pH have symmetric distributions with a few outliers on both sides.
The distribution of sulphates is similar to those of residual sugar and chlorides: right-skewed and long-tailed, with outliers. The second pair of plots shows the data after removing outliers. I wonder whether these variables are correlated.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.20 10.42 11.10 14.90
Most observations have an alcohol content between 9% and 12%, with a mean of 10.42%. The median (10.20%) and the mean are close.
There are 1599 observations of wines in the dataset with 12 features (not counting the index X). One variable, quality, is treated as categorical even though it is stored as an integer; the others are numerical variables describing the physical and chemical properties of the wine. Quality is a rating between 0 (very bad) and 10 (very excellent).
Other observations:
The main feature of this data is quality. I want to determine which wine characteristics affect the quality score.
I created a new variable named rating from the quality score to categorize quality more coarsely and to make the different attributes of the wine easier to study and visualize.
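A rating variable of this kind can be built with `cut()`. The sketch below is one possible construction, using the cut points given in the closing summary of this report (up to 4 is bad, 5-6 is average, 7-10 is good) and the quality scores of the first ten wines. Setting `include.lowest = TRUE` so that a score of 0 would also fall in the lowest bin is an assumption.

```r
# Quality scores of the first ten wines, from the structure listing.
quality <- c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)

# Bins: [0, 4] = bad, (4, 6] = average, (6, 10] = good.
rating <- cut(quality,
              breaks = c(0, 4, 6, 10),
              labels = c("bad", "average", "good"),
              include.lowest = TRUE)

table(rating)  # count of wines per rating
```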
The distribution of citric acid differed from the rest: 132 samples have a citric acid value of 0. I removed outliers from all variables that have a long-tailed distribution.
Correlation between all variables:
## fixed.acidity volatile.acidity citric.acid
## fixed.acidity 1.00000000 -0.256130895 0.67170343
## volatile.acidity -0.25613089 1.000000000 -0.55249568
## citric.acid 0.67170343 -0.552495685 1.00000000
## residual.sugar 0.11477672 0.001917882 0.14357716
## chlorides 0.09370519 0.061297772 0.20382291
## free.sulfur.dioxide -0.15379419 -0.010503827 -0.06097813
## total.sulfur.dioxide -0.11318144 0.076470005 0.03553302
## density 0.66804729 0.022026232 0.36494718
## pH -0.68297819 0.234937294 -0.54190414
## sulphates 0.18300566 -0.260986685 0.31277004
## alcohol -0.06166827 -0.202288027 0.10990325
## quality 0.12405165 -0.390557780 0.22637251
## residual.sugar chlorides free.sulfur.dioxide
## fixed.acidity 0.114776724 0.093705186 -0.153794193
## volatile.acidity 0.001917882 0.061297772 -0.010503827
## citric.acid 0.143577162 0.203822914 -0.060978129
## residual.sugar 1.000000000 0.055609535 0.187048995
## chlorides 0.055609535 1.000000000 0.005562147
## free.sulfur.dioxide 0.187048995 0.005562147 1.000000000
## total.sulfur.dioxide 0.203027882 0.047400468 0.667666450
## density 0.355283371 0.200632327 -0.021945831
## pH -0.085652422 -0.265026131 0.070377499
## sulphates 0.005527121 0.371260481 0.051657572
## alcohol 0.042075437 -0.221140545 -0.069408354
## quality 0.013731637 -0.128906560 -0.050656057
## total.sulfur.dioxide density pH
## fixed.acidity -0.11318144 0.66804729 -0.68297819
## volatile.acidity 0.07647000 0.02202623 0.23493729
## citric.acid 0.03553302 0.36494718 -0.54190414
## residual.sugar 0.20302788 0.35528337 -0.08565242
## chlorides 0.04740047 0.20063233 -0.26502613
## free.sulfur.dioxide 0.66766645 -0.02194583 0.07037750
## total.sulfur.dioxide 1.00000000 0.07126948 -0.06649456
## density 0.07126948 1.00000000 -0.34169933
## pH -0.06649456 -0.34169933 1.00000000
## sulphates 0.04294684 0.14850641 -0.19664760
## alcohol -0.20565394 -0.49617977 0.20563251
## quality -0.18510029 -0.17491923 -0.05773139
## sulphates alcohol quality
## fixed.acidity 0.183005664 -0.06166827 0.12405165
## volatile.acidity -0.260986685 -0.20228803 -0.39055778
## citric.acid 0.312770044 0.10990325 0.22637251
## residual.sugar 0.005527121 0.04207544 0.01373164
## chlorides 0.371260481 -0.22114054 -0.12890656
## free.sulfur.dioxide 0.051657572 -0.06940835 -0.05065606
## total.sulfur.dioxide 0.042946836 -0.20565394 -0.18510029
## density 0.148506412 -0.49617977 -0.17491923
## pH -0.196647602 0.20563251 -0.05773139
## sulphates 1.000000000 0.09359475 0.25139708
## alcohol 0.093594750 1.00000000 0.47616632
## quality 0.251397079 0.47616632 1.00000000
From the plot and the numbers above, I don’t see any strong correlation with quality (the largest is alcohol, at about 0.48), which suggests that the quality of a wine depends on a combination of these attributes.
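A correlation matrix like the one above can be computed with `cor()`. The sketch below is illustrative only: it uses the first ten rows of four columns from the structure listing, so its values will not match the full-data matrix.

```r
# First ten rows of four columns, from the structure listing.
rw <- data.frame(
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  pH            = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35),
  alcohol       = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality       = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Pearson correlation between every pair of columns.
cm <- cor(rw)
round(cm, 2)

# Correlation of each predictor with quality, strongest first.
with_quality <- cm[rownames(cm) != "quality", "quality"]
sort(abs(with_quality), decreasing = TRUE)
```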
I want to look more closely at each variable that appears to affect the quality score, mainly alcohol, fixed acidity and volatile acidity, and I will also look at the others.
First I will study the variables that have a positive correlation with quality.
From the boxplots above, we can see that wines with high quality ratings have higher amounts of alcohol, fixed acidity, citric acid and sulphates.
The following set of variables has a negative correlation with quality.
As volatile acidity, pH and density decrease, quality tends to increase.
The following variables seem to have no direct effect on the quality of wines.
For these variables, changes don’t seem to affect the rating. The exception is chlorides: good-quality wines seem to have smaller amounts.
The above plot includes the variables that correlate with quality and also illustrates the relationships between these variables.
The positive correlation between alcohol and quality suggests that tasters prefer wines with more alcohol. The boxplot shows the same thing: the median alcohol level of good-quality wines is above the medians for bad and average quality.
Quality also correlates positively with citric acid and fixed acidity: wines with higher acidity seem to get better quality scores, and tasters appear to prefer the “fresh” taste that comes from citric acid. Higher amounts of sulphates are associated with slightly higher quality.
High amounts of volatile acidity are considered undesirable in wines, but a touch of it is no bad thing. The boxplot above confirms this.
Quality and pH correlate negatively, although only weakly (about -0.06): lower pH levels get slightly better quality scores. Good wines seem to have lower density, which is consistent with their higher level of alcohol.
The other variables like free sulfur dioxide, total sulfur dioxide, residual sugar and chlorides seem to have no direct effect on the quality of wines from this dataset.
As sweetness is one of the characteristics of wine, I was surprised that it does not affect quality in this dataset.
Fixed acidity and pH correlate negatively with each other, which is expected, as pH is a measure of acidity. What doesn’t make sense to me is that pH correlates positively with volatile acidity; a lurking variable might be involved.
The negative correlation between density and alcohol reflects the fact that alcohol is less dense than water, so wines with more alcohol have a lower density.
Citric acid is one of the predominant fixed acids, which explains the strong positive correlation between citric acid and fixed acidity.
There is a positive correlation between free sulfur dioxide and total sulfur dioxide, as the total is the sum of the free and bound forms of SO2.
The strongest relationship I found is between fixed acidity and pH (a correlation of about -0.68).
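The strongest pair can be read off the correlation matrix programmatically. The sketch below copies four rows and columns from the matrix reported earlier (the pairs with the largest values) and looks for the largest absolute off-diagonal entry. Restricting the search to these four variables is a simplification for brevity.

```r
vars <- c("fixed.acidity", "citric.acid", "density", "pH")

# Values copied from the correlation matrix reported above.
cm <- matrix(c(
   1.00000000,  0.67170343,  0.66804729, -0.68297819,
   0.67170343,  1.00000000,  0.36494718, -0.54190414,
   0.66804729,  0.36494718,  1.00000000, -0.34169933,
  -0.68297819, -0.54190414, -0.34169933,  1.00000000
), nrow = 4, byrow = TRUE, dimnames = list(vars, vars))

# Zero out the diagonal and the duplicated lower triangle.
cm[!upper.tri(cm)] <- 0

# Row and column index of the largest absolute correlation.
idx <- which(abs(cm) == max(abs(cm)), arr.ind = TRUE)
strongest <- c(rownames(cm)[idx[1, "row"]], colnames(cm)[idx[1, "col"]])

strongest  # the pair of variable names
cm[idx]    # the correlation itself
```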
In the next analysis I will plot the variables that correlate with each other together with the feature of interest, quality.
These density plots seem to tell the same story as the boxplots in the previous analysis. Interestingly, in these density plots alcohol, volatile acidity, citric acid and sulphates are the variables that characterize good-quality wines. Let’s find out!
## rw$rating: bad
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9934 0.9957 0.9966 0.9967 0.9977 1.0010
## --------------------------------------------------------
## rw$rating: average
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9901 0.9958 0.9968 0.9969 0.9979 1.0037
## --------------------------------------------------------
## rw$rating: good
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9906 0.9947 0.9957 0.9960 0.9973 1.0032
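The header lines of the output above (`rw$rating: bad` and so on) match the print format of `by()`, so a per-rating summary along these lines could be produced as sketched below. This is an assumption about how the output was generated. The sketch uses alcohol instead of density, because the structure listing shows ten alcohol values but only five density values, and the first ten wines contain no "bad" rating.

```r
# Alcohol and quality of the first ten wines, from the structure listing.
rw <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Only "average" (5-6) and "good" (7 and above) occur in these ten rows.
rw$rating <- factor(ifelse(rw$quality >= 7, "good", "average"),
                    levels = c("average", "good"))

# One summary() per rating group.
by(rw$alcohol, rw$rating, summary)

# The group medians alone, as a named vector.
med <- tapply(rw$alcohol, rw$rating, median)
med
```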
The graph above compares alcohol to density: as alcohol increases, quality tends to get better. Density has only a slight impact on quality, which may simply reflect the natural relationship that the more alcohol a wine contains, the lower its density. For this reason I will drop density from the variables that affect the rating of red wines.
This graph compares fixed acidity to citric acid. Most of the yellow points (good quality) lie above the smooth line, where both variables increase, citric acid in particular. I may need to look more closely at other acids to confirm this, but for this dataset I will keep citric acid as a variable contributing to the quality of the wine and disregard fixed acidity in further analysis.
Wines with higher levels of alcohol seem to have lower amounts of volatile acidity, which supports the idea that a low level of volatile acidity is important to the quality of wines.
Sulphates behave the opposite way: a slightly higher amount seems to make wines better, although the sulphate level decreases slightly as alcohol increases.
Having narrowed the analysis down to four variables as the major keys to wine quality, I am going to generate a linear model.
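The four nested models can be fitted with `lm()` and `update()`. The sketch below is a small stand-in, not the analysis itself: it fits the same formulas to the first ten wines from the structure listing, so its coefficients will differ from the full-data table. The plain formula `quality ~ alcohol` fits the same model as the `I(quality) ~ I(alcohol)` form.

```r
# First ten wines, from the structure listing.
rw <- data.frame(
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  citric.acid      = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  sulphates        = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8)
)

# Each model adds one predictor to the previous one.
m1 <- lm(quality ~ alcohol, data = rw)
m2 <- update(m1, . ~ . + citric.acid)
m3 <- update(m2, . ~ . + volatile.acidity)
m4 <- update(m3, . ~ . + sulphates)

coef(m4)                # intercept plus four slopes
summary(m1)$r.squared   # share of variance explained by alcohol alone
summary(m4)$r.squared   # never lower than for a nested smaller model
```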
##
## Calls:
## m1: lm(formula = I(quality) ~ I(alcohol), data = rw)
## m2: lm(formula = I(quality) ~ I(alcohol) + citric.acid, data = rw)
## m3: lm(formula = I(quality) ~ I(alcohol) + citric.acid + volatile.acidity,
## data = rw)
## m4: lm(formula = I(quality) ~ I(alcohol) + citric.acid + volatile.acidity +
## sulphates, data = rw)
##
## ============================================================================
## m1 m2 m3 m4
## ----------------------------------------------------------------------------
## (Intercept) 1.875*** 1.830*** 3.055*** 2.646***
## (0.175) (0.171) (0.194) (0.201)
## I(alcohol) 0.361*** 0.346*** 0.314*** 0.309***
## (0.017) (0.016) (0.016) (0.016)
## citric.acid 0.730*** 0.068 -0.079
## (0.090) (0.103) (0.104)
## volatile.acidity -1.343*** -1.265***
## (0.114) (0.113)
## sulphates 0.696***
## (0.103)
## ----------------------------------------------------------------------------
## R-squared 0.227 0.257 0.317 0.336
## adj. R-squared 0.226 0.256 0.316 0.334
## sigma 0.710 0.696 0.668 0.659
## F 468.267 276.595 246.976 201.777
## p 0.000 0.000 0.000 0.000
## Log-likelihood -1721.057 -1688.711 -1621.596 -1599.093
## Deviance 805.870 773.917 711.603 691.852
## AIC 3448.114 3385.421 3253.192 3210.186
## BIC 3464.245 3406.930 3280.078 3242.448
## N 1599 1599 1599 1599
## ============================================================================
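To see what the m4 coefficients mean in practice, the sketch below applies them by hand to the first wine in the structure listing, whose recorded quality is 5. The coefficients are the rounded values from the table, so the result is approximate.

```r
# Coefficients of model m4, copied from the table above.
b <- c(intercept = 2.646, alcohol = 0.309, citric.acid = -0.079,
       volatile.acidity = -1.265, sulphates = 0.696)

# First wine in the structure listing; its recorded quality is 5.
wine <- c(alcohol = 9.4, citric.acid = 0, volatile.acidity = 0.7,
          sulphates = 0.56)

# Linear prediction: intercept plus the sum of coefficient * value.
pred <- b[["intercept"]] + sum(b[names(wine)] * wine)
pred  # about 5.05
```

The prediction is close to the recorded score for this wine, but with a residual standard error (sigma) of 0.659 individual predictions can easily be off by more than half a point.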
Holding the alcohol level constant, density has little effect on the quality of wines, as other factors also contribute to density. The median density varies little between ratings, and the mean is almost equal across all of them.
Good-quality wines have more citric acid. The same applies to fixed acidity, but not with the same slope. 75% of good-quality wines have a high amount of citric acid.
A high amount of volatile acidity affects quality negatively; in contrast, slightly more sulphates tends to affect it positively.
I didn’t notice any surprising interactions.
The R-squared values of the linear models I generated are fairly low (0.336 for the largest model, m4), but in this field of study such numbers are acceptable. The adjusted R-squared barely differs from the R-squared itself, which suggests that the models are not overloaded with irrelevant variables. Most coefficients carry three stars (’***’), which means they are statistically significant; the exception is citric acid, which loses its stars once volatile acidity is added (models m3 and m4). The F statistics are also high. The model can be a reasonable fit.
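The R-squared figures in the table can be cross-checked from numbers already in this report. For a model with a single predictor, R-squared equals the squared correlation between predictor and response, and the adjusted value follows from the standard formula 1 - (1 - R²)(n - 1)/(n - p - 1). The helper name `adj_r2` below is hypothetical.

```r
n <- 1599                # number of observations
r_alcohol <- 0.47616632  # cor(alcohol, quality) from the correlation matrix

# With one predictor, R-squared is the squared correlation.
r2_m1 <- r_alcohol^2

# Adjusted R-squared penalises the number of predictors p.
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

round(r2_m1, 3)                # m1 R-squared: 0.227
round(adj_r2(r2_m1, n, 1), 3)  # m1 adjusted R-squared: 0.226
round(adj_r2(0.336, n, 4), 3)  # m4, from its rounded R-squared: 0.334
```

With 1599 observations and at most four predictors, the penalty factor (n - 1)/(n - p - 1) is very close to 1, so the small gap between R-squared and adjusted R-squared is largely a consequence of the sample size.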
Most observations in our dataset have a quality score of 5 or 6, which means average quality on a scale from 0 to 10.
The level of alcohol is one of the properties that affect the quality rating, as we see in this density plot.
Wines with a high level of alcohol, a touch of sulphates, a good amount of citric acid and only a little volatile acidity are the wines that get high quality scores.
The red wine quality data set contains information on 1599 wines across 13 variables: 11 are chemical properties of the wines, one is the quality score each wine received, and one is a row index. My goal in this project was to determine whether the quality score depends on the variables in the dataset, in other words, how changes in these variables affect the quality score.
I started by getting to know the data set: I produced a summary of each variable and then plotted the distribution of each one. I plotted the variable of interest, quality, and noticed that most observations have scores of 5 and 6. The quality score ranges from 3 to 8 on a scale of 0 to 10. I created a variable named rating that categorizes wines with a score of 1-4 as bad, 5-6 as average and 7-10 as good.
Next I looked at the correlations between all variables, which showed that quality correlates positively with alcohol and negatively with volatile acidity. It also correlates slightly with other variables. I noticed that some variables correlate strongly with each other, so I plotted these variables together to help me discard any irrelevant or redundant variable. I found that pH was largely redundant, as it gives the same information as acidity.
The limitations I found in this project lay in my own assumptions: I assumed that sweetness would be one of the main characteristics changing the quality score, but this data set showed that it is not. I also based my analysis on information found on the web, such as the fact that citric acid is one of the fixed acids in wines. I would like to study a bigger data set of wines with more variables, for example one that includes all the different acids in the wines.