The term “quality” is regularly used by those who produce, discuss and consume wine. Equally, as the following extract from a letter written by a consumer to the wine magazine Decanter shows, it may be used by the public: “In fact, it is as important for a wine aficionado to know that certain wine [sic] is top quality as knowing that certain vintage [sic] of a famous wine is clearly below the average, especially if the wine is expensive, because in this case he may save a lot of money” (Martinez, 1999)[^1].
Quality, as an element of wine, has a long history. The Egyptians were apparently describing wine as ‘very good quality’ by the death of Tutankhamun in 1352 BC (Johnson, 1989), and the Romans subsequently specified the best regions and the greatest vintages (Johnson, 1989). The wine trade today adopts various mechanisms to grade wine, such as informal consumer magazine tastings and wine shows.
The concept of wine quality is very important to the wine industry and to wine consumers. Therefore, the perception that one buys or experiences “quality” by one’s choice may have a significant influence on the consumer decision-making process.
We would like to explore the following questions with this project:

- Which property of wine has the highest density? Which factor makes the quality of white wine better, and how does it compare to the factor that makes consumers buy more white wine?
- What could improve the quality of wine, given the properties available?
- Which property has the highest grade rate, and why?
- How do we analyse each property and its behaviour with respect to the overall quality of wine?
The data source we chose is the Wine Quality dataset from the UCI Machine Learning Repository by Forina, M. et al, PARVUS[^2]. The UCI Machine Learning Repository is a collection of databases and domain theories used by the machine learning community. The archive was created in 1987 by David Aha and fellow graduate students at UC Irvine.
Quality has been an important issue within the field of consumer behaviour (Holbrook & Corfman, 1985; Olson, 1977; Steenkamp, 1990; Sweeney & Soutar, 1995; Zeithaml, 1988), although its nature and relationship to other factors such as price and value are subject to debate. This study could give us insight into quality analysis of wine, help us understand consumer behaviour, and show which property of wine could attract more buyers.
## Parsed with column specification:
## cols(
## fixed_acidity = col_double(),
## volatile_acidity = col_double(),
## citric_acid = col_double(),
## residual_sugar = col_double(),
## chlorides = col_double(),
## free_sulfur_dioxide = col_double(),
## total_sulfur_dioxide = col_double(),
## density = col_double(),
## pH = col_double(),
## sulphates = col_double(),
## alcohol = col_double(),
## quality = col_integer()
## )
## fixed_acidity volatile_acidity citric_acid residual_sugar
## nbr.val 4.898000e+03 4.898000e+03 4.898000e+03 4.898000e+03
## nbr.null 0.000000e+00 0.000000e+00 1.900000e+01 0.000000e+00
## nbr.na 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
## min 3.800000e+00 8.000000e-02 0.000000e+00 6.000000e-01
## max 1.420000e+01 1.100000e+00 1.660000e+00 6.580000e+01
## range 1.040000e+01 1.020000e+00 1.660000e+00 6.520000e+01
## sum 3.357475e+04 1.362825e+03 1.636870e+03 3.130515e+04
## median 6.800000e+00 2.600000e-01 3.200000e-01 5.200000e+00
## mean 6.854788e+00 2.782411e-01 3.341915e-01 6.391415e+00
## SE.mean 1.205772e-02 1.440216e-03 1.729207e-03 7.247276e-02
## CI.mean.0.95 2.363854e-02 2.823469e-03 3.390022e-03 1.420791e-01
## var 7.121136e-01 1.015954e-02 1.464579e-02 2.572577e+01
## std.dev 8.438682e-01 1.007945e-01 1.210198e-01 5.072058e+00
## coef.var 1.231064e-01 3.622561e-01 3.621271e-01 7.935736e-01
## chlorides free_sulfur_dioxide total_sulfur_dioxide
## nbr.val 4.898000e+03 4.898000e+03 4.898000e+03
## nbr.null 0.000000e+00 0.000000e+00 0.000000e+00
## nbr.na 0.000000e+00 0.000000e+00 0.000000e+00
## min 9.000000e-03 2.000000e+00 9.000000e+00
## max 3.460000e-01 2.890000e+02 4.400000e+02
## range 3.370000e-01 2.870000e+02 4.310000e+02
## sum 2.241930e+02 1.729390e+05 6.776905e+05
## median 4.300000e-02 3.400000e+01 1.340000e+02
## mean 4.577236e-02 3.530808e+01 1.383607e+02
## SE.mean 3.121775e-04 2.430087e-01 6.072391e-01
## CI.mean.0.95 6.120080e-04 4.764061e-01 1.190461e+00
## var 4.773337e-04 2.892427e+02 1.806085e+03
## std.dev 2.184797e-02 1.700714e+01 4.249806e+01
## coef.var 4.773180e-01 4.816783e-01 3.071543e-01
## density pH sulphates alcohol
## nbr.val 4.898000e+03 4.898000e+03 4.898000e+03 4.898000e+03
## nbr.null 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
## nbr.na 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
## min 9.871100e-01 2.720000e+00 2.200000e-01 8.000000e+00
## max 1.038980e+00 3.820000e+00 1.080000e+00 1.420000e+01
## range 5.187000e-02 1.100000e+00 8.600000e-01 6.200000e+00
## sum 4.868746e+03 1.561613e+04 2.399270e+03 5.149888e+04
## median 9.937400e-01 3.180000e+00 4.700000e-01 1.040000e+01
## mean 9.940274e-01 3.188267e+00 4.898469e-01 1.051427e+01
## SE.mean 4.273596e-05 2.157592e-03 1.630702e-03 1.758388e-02
## CI.mean.0.95 8.378166e-05 4.229848e-03 3.196907e-03 3.447230e-02
## var 8.945524e-06 2.280118e-02 1.302471e-02 1.514427e+00
## std.dev 2.990907e-03 1.510006e-01 1.141258e-01 1.230621e+00
## coef.var 3.008878e-03 4.736135e-02 2.329827e-01 1.170429e-01
## quality group
## nbr.val 4.898000e+03 NA
## nbr.null 0.000000e+00 NA
## nbr.na 0.000000e+00 NA
## min 3.000000e+00 NA
## max 9.000000e+00 NA
## range 6.000000e+00 NA
## sum 2.879000e+04 NA
## median 6.000000e+00 NA
## mean 5.877909e+00 NA
## SE.mean 1.265456e-02 NA
## CI.mean.0.95 2.480862e-02 NA
## var 7.843557e-01 NA
## std.dev 8.856386e-01 NA
## coef.var 1.506724e-01 NA
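The row names in the summary table above match those produced by `pastecs::stat.desc()`. As a minimal sketch, and assuming the usual definitions (standard error as `sd / sqrt(n)`, a t-based 95% interval, and coefficient of variation as `sd / mean`), the derived rows for `fixed_acidity` can be reproduced in base R from the reported count, mean and standard deviation:

```r
# Values for fixed_acidity copied from the summary table above.
n       <- 4898
mean_fa <- 6.854788
sd_fa   <- 0.8438682

# Standard error of the mean: sd / sqrt(n).
se_mean <- sd_fa / sqrt(n)

# Half-width of the 95% confidence interval, based on the t distribution.
ci_half <- qt(0.975, df = n - 1) * se_mean

# Coefficient of variation: standard deviation relative to the mean.
coef_var <- sd_fa / mean_fa

print(c(SE.mean = se_mean, CI.mean.0.95 = ci_half, coef.var = coef_var))
```

The three results agree with the `SE.mean`, `CI.mean.0.95` and `coef.var` rows reported for `fixed_acidity`, up to rounding.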
The dataset for this project comes from the UCI Machine Learning Repository and includes the following properties:

- Fixed Acidity (the acid that contributes to the conservation of wine)
- Volatile Acidity (the amount of acetic acid in wine, which at high levels leads to the unpleasant taste of vinegar)
- Citric Acid (found in small amounts; it can add freshness to the wine)
- Residual Sugar (the amount of sugar remaining at the end of fermentation)
- Chlorides (the amount of salt in wine)
- Free Sulphur Dioxide (prevents the growth of microbes and the oxidation of the wine during fermentation)
- Total Sulphur Dioxide (gives the aroma and also a tinge of flavour to the wine)
- Density (the density of the wine, which is close to that of water and depends on the alcohol and sugar content)
- pH (how acidic or basic a wine is on the pH scale)
- Sulphates (an additive that acts as an antioxidant)
- Alcohol (the percentage of alcohol present in wine)

To explore the research questions, we used bar charts, line plots, boxplots and histograms to find trends and correlations with the quality of white wine. Some of the data was straightforward and easy to understand, but we had to manipulate it for analysis and visualization. For example, we took the range of the sensory property column (quality) and created a group that classifies wines as high, medium or low quality.
This dataset was not in CSV format, so we converted it to CSV before using it for analysis. After conversion we had:

- White wine data set (4898 observations of 13 variables)
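The original file format is not stated, so the following is only a sketch of one possible conversion. It assumes a semicolon-delimited text file with spaces in the column names; the file contents, the paths and the names `wine_raw` and `wine` are illustrative, not the project's actual files.

```r
# Hypothetical raw file: a few semicolon-delimited rows written to a temp file.
raw_path <- tempfile(fileext = ".txt")
writeLines(c(
  "fixed acidity;residual sugar;quality",
  "7.0;20.7;6",
  "6.3;1.6;6",
  "8.1;6.9;6"
), raw_path)

# Read with an explicit separator; check.names = FALSE keeps the original headers.
wine_raw <- read.table(raw_path, header = TRUE, sep = ";", check.names = FALSE)

# Replace spaces in column names with underscores.
names(wine_raw) <- gsub(" ", "_", names(wine_raw))

# Write a comma-separated copy and read it back to confirm the round trip.
csv_path <- tempfile(fileext = ".csv")
write.csv(wine_raw, csv_path, row.names = FALSE)
wine <- read.csv(csv_path)
str(wine)
```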
The most straightforward analysis we could do was to check the distribution of all the wine properties. We plotted the distribution of each property and could see that chlorides and residual sugar are right-skewed (most values sit at the low end, with a long tail to the right), so we reduce the skew by taking the log of these values.
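A minimal sketch of the effect of the log transform, using synthetic log-normal values as a stand-in for residual sugar (not the real data). The `skewness` helper is defined here for illustration; it is a plain moment-based estimate, not a function from the project.

```r
set.seed(42)
# Synthetic right-skewed stand-in for residual_sugar (log-normal draws).
residual_sugar <- rlnorm(1000, meanlog = 1.5, sdlog = 0.8)

# Moment-based sample skewness: positive values indicate a long right tail.
skewness <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

skew_raw <- skewness(residual_sugar)
skew_log <- skewness(log10(residual_sugar))

print(c(raw = skew_raw, log10 = skew_log))
```

The transform only works for strictly positive values. The summary table reports 19 zero values for `citric_acid`, so that column would need different handling.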
The highest counts occur for properties such as fixed acidity and volatile acidity. To examine this further, we ran density plots to check how these properties are associated with the quality of wine.
This plot shows the density of each property with respect to wine quality. The highest wine quality score is 9, and it falls between 12 and 14 on the density plot's scale.
We are also trying to understand where residual sugar falls across the quality scores. The highest concentration falls between 0 and 10. Residual sugar has a positively skewed distribution with a long tail, with a concentration between 1 g/dm³ and 1.5 g/dm³.
We grouped wine quality into three levels: high, medium and low. As the plot below shows, there are significantly more medium quality wines than high or low quality ones. However, this grouping over-represents medium quality wines and under-represents high and low quality wines, leaving few data points for analysis in the smaller groups.
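The grouping can be sketched with `cut()`. The cut points used here (3 to 4 as Low, 5 to 6 as Medium, 7 to 9 as High) are an assumption, since the text does not state them, and the scores are illustrative.

```r
# Illustrative quality scores; the real column ranges from 3 to 9.
quality <- c(3, 4, 5, 5, 6, 6, 6, 7, 8, 9)

# Assumed cut points: 3-4 = Low, 5-6 = Medium, 7-9 = High.
group <- cut(
  quality,
  breaks = c(-Inf, 4, 6, Inf),
  labels = c("Low", "Medium", "High")
)

# Count wines per group to see how unbalanced the classes are.
print(table(group))
```

Because `cut()` returns an ordered set of factor levels, later summaries and plots list the groups as Low, Medium, High rather than alphabetically.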
Since alcohol was a major contributor in the density plots, we use boxplots to see how many outliers are present. Outliers appear among the low quality wines, where some alcohol percentages are very high.
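As a rough sketch of how the points drawn beyond the boxplot whiskers can be listed, `boxplot.stats()` returns them in its `out` element. The alcohol values below are synthetic, with two unusually high values added on purpose.

```r
set.seed(1)
# Synthetic alcohol values for one quality group, plus two unusually high values.
alcohol <- c(rnorm(50, mean = 10, sd = 0.5), 13.2, 13.5)

# boxplot.stats() applies the 1.5 * IQR rule that boxplot() uses for its whiskers.
stats <- boxplot.stats(alcohol)
outliers <- stats$out

print(outliers)
```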
Properties such as residual sugar, chlorides and sulphur dioxide are transformed here using log10. With the log10 transformation, we can observe a bimodal distribution for residual sugar, with one peak between 1 and 3 and another between 8 and 12.
For chlorides, the peak is at about 0.04 g/dm³, and the distribution is more concentrated.
Free sulfur dioxide shows distinct characteristics in the graph above. Its distribution is approximately normal between 1 and 150 mg/dm³, with a concentration between 20 and 40 mg/dm³.
Total sulfur dioxide also shows distinct characteristics. Its distribution is approximately normal between 25 and 275 mg/dm³, with a concentration between 90 and 160 mg/dm³.
The graph above shows an approximately normal distribution for density, although the peaks occur at different values. As this distribution has a few exceptionally high values, we will omit the 1% of wines with the highest density.
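One way to drop the top 1% is to use the 99th percentile as a cutoff. This is a sketch on synthetic values; the variable name `wine_density` and the numbers are illustrative only.

```r
set.seed(7)
# Synthetic density values with a few extreme highs; not the real measurements.
wine_density <- c(rnorm(995, mean = 0.994, sd = 0.003),
                  1.01, 1.02, 1.03, 1.035, 1.039)

# 99th percentile used as the cutoff; values above it are dropped.
cutoff <- quantile(wine_density, probs = 0.99)
density_trimmed <- wine_density[wine_density <= cutoff]

print(c(before = length(wine_density), after = length(density_trimmed)))
```

Trimming by percentile always removes a fixed share of rows, whether or not they are true anomalies, so it is worth checking how many of the dropped values are genuinely extreme.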
The pH distribution above is approximately normal, with a peak at about 3.2.
Sulphates show a slightly positively skewed distribution, with a peak at about 0.4 g/dm³.
Alcohol has a fairly dense, positively skewed distribution. Values range between 8.5% and 14%, with a concentration between 9% and 10.5%.
## White Wine:
##
## Call:
## lm(formula = quality ~ alcohol, data = subset(white_wine))
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.5317 -0.5286 0.0012 0.4996 3.1579
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.582009 0.098008 26.34 <2e-16 ***
## alcohol 0.313469 0.009258 33.86 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7973 on 4896 degrees of freedom
## Multiple R-squared: 0.1897, Adjusted R-squared: 0.1896
## F-statistic: 1146 on 1 and 4896 DF, p-value: < 2.2e-16
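To make the fitted line concrete, the sketch below plugs a few alcohol levels into the coefficients reported above. The alcohol levels are chosen for illustration.

```r
# Coefficients copied from the regression output above.
intercept <- 2.582009
slope     <- 0.313469

# Predicted quality for a few alcohol levels (percent by volume).
alcohol <- c(9, 10.5, 12)
predicted_quality <- intercept + slope * alcohol

print(data.frame(alcohol, predicted_quality))
```

Each additional percentage point of alcohol raises the predicted quality by about 0.31 points. With an R-squared of 0.1897, alcohol alone accounts for roughly 19% of the variance in quality, so the predictions are only a coarse guide.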
## White Wine:
##
## Call:
## lm(formula = density ~ alcohol, data = subset(white_wine))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.005475 -0.001238 -0.000153 0.001156 0.047201
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.014e+00 2.300e-04 4407.87 <2e-16 ***
## alcohol -1.896e-03 2.173e-05 -87.25 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.001871 on 4896 degrees of freedom
## Multiple R-squared: 0.6086, Adjusted R-squared: 0.6085
## F-statistic: 7613 on 1 and 4896 DF, p-value: < 2.2e-16
## White Wine:
##
## Call:
## lm(formula = density ~ fixed_acidity, data = subset(white_wine))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.008935 -0.002318 -0.000348 0.002011 0.044064
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.876e-01 3.373e-04 2927.91 <2e-16 ***
## fixed_acidity 9.404e-04 4.884e-05 19.26 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.002884 on 4896 degrees of freedom
## Multiple R-squared: 0.0704, Adjusted R-squared: 0.07021
## F-statistic: 370.8 on 1 and 4896 DF, p-value: < 2.2e-16
## White Wine:
##
## Call:
## lm(formula = density ~ residual_sugar, data = subset(white_wine))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.0056862 -0.0011059 0.0001726 0.0011523 0.0155617
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.909e-01 3.742e-05 26480.7 <2e-16 ***
## residual_sugar 4.947e-04 4.586e-06 107.9 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.001628 on 4896 degrees of freedom
## Multiple R-squared: 0.7039, Adjusted R-squared: 0.7038
## F-statistic: 1.164e+04 on 1 and 4896 DF, p-value: < 2.2e-16
We can observe a linear trend in the scatter plots above, except for the relationship between density and fixed acidity. Although these properties show little correlation with quality, we will examine their distributions using our custom quality classification.
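For a regression with a single predictor, R-squared equals the squared Pearson correlation, which links the regression summaries above to the strength of the linear trend. The sketch below checks this on synthetic data with a negative slope, standing in for alcohol and density; the coefficients and noise level are made up.

```r
set.seed(3)
# Synthetic data with a negative linear relationship.
alcohol <- runif(200, min = 8, max = 14)
wine_density <- 1.014 - 0.0019 * alcohol + rnorm(200, sd = 0.0015)

fit <- lm(wine_density ~ alcohol)
r_squared <- summary(fit)$r.squared
r <- cor(alcohol, wine_density)

# For a single predictor, R-squared equals the squared Pearson correlation.
print(c(r = r, r_squared = r_squared, r_times_r = r^2))
```

Applied to the reported output for density against alcohol, an R-squared of 0.6086 with a negative slope corresponds to a correlation of about -0.78.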
## White Wine
## group: Low
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.40 10.10 10.17 10.80 13.50
## --------------------------------------------------------
## group: Medium
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.40 10.00 10.27 11.00 14.00
## --------------------------------------------------------
## group: High
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.50 10.70 11.50 11.42 12.40 14.20
High quality wines have, on average, more alcohol than medium or low quality wines.
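Per-group summaries of this kind can be produced with `by()` or `tapply()`. A small sketch with made-up values follows; the data frame `wine` here is illustrative, not the project's data.

```r
# Small illustrative data frame; values are made up.
wine <- data.frame(
  alcohol = c(9.4, 10.1, 9.8, 10.0, 11.0, 10.4, 11.5, 12.4, 12.0),
  group = factor(
    rep(c("Low", "Medium", "High"), each = 3),
    levels = c("Low", "Medium", "High")
  )
)

# by() applies summary() to alcohol within each level of group.
alcohol_by_group <- by(wine$alcohol, wine$group, summary)
print(alcohol_by_group)

# tapply() gives just the group means as a named vector.
group_means <- tapply(wine$alcohol, wine$group, mean)
print(group_means)
```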
## White Wine
## group: Low
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9892 0.9926 0.9941 0.9943 0.9960 1.0004
## --------------------------------------------------------
## group: Medium
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9872 0.9923 0.9944 0.9945 0.9966 1.0390
## --------------------------------------------------------
## group: High
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9905 0.9917 0.9924 0.9936 1.0006
As there is a negative correlation between alcohol and density, we expect a distribution similar to the previous one, in which wines classified as high quality have, on average, lower density than wines of medium or low quality.
## White Wine
## group: Low
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.200 6.400 6.900 7.181 7.650 11.800
## --------------------------------------------------------
## group: Medium
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.800 6.300 6.800 6.876 7.400 14.200
## --------------------------------------------------------
## group: High
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.900 6.200 6.700 6.725 7.200 9.200
There seems to be a slight inverse relationship: the lower the average fixed acidity, the higher the quality.
## White Wine
## group: Low
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.700 1.350 2.700 4.821 7.500 17.550
## --------------------------------------------------------
## group: Medium
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.600 1.700 6.200 6.798 10.500 65.800
## --------------------------------------------------------
## group: High
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.800 1.800 3.875 5.262 7.400 19.250
For the residual sugar, there seems to be no linear relationship to the quality of the white wine. The quality seems to be associated with the right “sweetness”.
## Correlation between residual sugar and density for high quality wine:
##
## Call:
## lm(formula = density ~ residual_sugar, data = subset(white_wine,
## group == "High"))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.0039636 -0.0012119 -0.0000708 0.0011311 0.0042220
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.896e-01 7.712e-05 12832.69 <2e-16 ***
## residual_sugar 5.298e-04 1.136e-05 46.64 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.001586 on 1058 degrees of freedom
## Multiple R-squared: 0.6727, Adjusted R-squared: 0.6724
## F-statistic: 2175 on 1 and 1058 DF, p-value: < 2.2e-16
## Correlation between residual sugar and density for medium quality wine:
##
## Call:
## lm(formula = density ~ residual_sugar, data = subset(white_wine,
## group == "Medium"))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.0056575 -0.0008969 0.0001570 0.0009854 0.0163569
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.912e-01 4.028e-05 24606.0 <2e-16 ***
## residual_sugar 4.770e-04 4.691e-06 101.7 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.001488 on 3653 degrees of freedom
## Multiple R-squared: 0.7389, Adjusted R-squared: 0.7389
## F-statistic: 1.034e+04 on 1 and 3653 DF, p-value: < 2.2e-16
## Correlation between residual sugar and density for low quality wine:
##
## Call:
## lm(formula = density ~ residual_sugar, data = subset(white_wine,
## group == "Low"))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.0042671 -0.0011050 0.0000907 0.0011534 0.0038391
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.923e-01 1.870e-04 5305.26 <2e-16 ***
## residual_sugar 4.291e-04 2.892e-05 14.84 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.001687 on 181 degrees of freedom
## Multiple R-squared: 0.5488, Adjusted R-squared: 0.5463
## F-statistic: 220.1 on 1 and 181 DF, p-value: < 2.2e-16
As shown above, both the mean and the median percentage of alcohol are higher for high quality wines. While the average alcohol content in low and medium quality wines is around 10%, in high quality wines it rises to approximately 11.4%.
One of the chemical properties related to density is fixed acidity. In this scatter plot we removed the medium quality wine points to make the difference between high and low quality wines easier to see. We can observe that high quality wines tend to have lower density for a given fixed acidity. Fixed acidity explains 63% of the variance in density (R^2 = 0.6308) in the high quality wines and 54% (R^2 = 0.5369) in the low quality wines.
The chemical property most closely related to density is residual sugar. In this scatter plot, the medium quality wine points were again removed to make the difference between high and low quality wines easier to see. We can observe that high quality wines tend to have lower density for a given level of residual sugar. Residual sugar explains about 67% of the variance in density (R^2 = 0.6727) in high quality whites and about 55% (R^2 = 0.5488) in low quality whites.
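The per-group comparison can be sketched by fitting the same model within each quality group and collecting the R-squared values. The data below are synthetic, and the slope, intercept and noise level are made up for illustration.

```r
set.seed(11)
# Synthetic stand-in for the wine data.
n <- 150
wine <- data.frame(
  residual_sugar = runif(n, min = 1, max = 18),
  group = factor(rep(c("Low", "High"), length.out = n),
                 levels = c("Low", "High"))
)
wine$density <- 0.991 + 0.0005 * wine$residual_sugar + rnorm(n, sd = 0.0015)

# Fit density ~ residual_sugar separately within each group and collect R-squared.
r_squared_by_group <- sapply(split(wine, wine$group), function(d) {
  summary(lm(density ~ residual_sugar, data = d))$r.squared
})

print(round(r_squared_by_group, 3))
```

R-squared here is the share of the variance in density accounted for by residual sugar within each group, which is the quantity compared between the high and low quality wines.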
In sum, the highest quality wines tend to have lower density for a given level of residual sugar. However, this research was conducted only on white wine and covers only a few facets of wine quality. There are many more questions to explore, and the more data we collect, the better our analysis can be.

## Reference

[^1] Study by Steve Charters, MA (Oxon): Perceptions of Wine Quality http://ro.ecu.edu.au/cgi/viewcontent.cgi?article=1115&context=theses
[^2] UCI Machine Learning Repository for Wine quality data https://archive.ics.uci.edu/ml/datasets/wine