1. Introduction

The term “quality” is regularly used by those who produce, discuss and consume wine. Equally, as the following extract from a letter written by a consumer to the wine magazine Decanter shows, it may be used by the public: “In fact, it is as important for a wine aficionado to know that certain wine [sic] is top quality as knowing that certain vintage [sic] of a famous wine is clearly below the average, especially if the wine is expensive, because in this case he may save a lot of money” (Martinez, 1999)[^1].

Quality, as an element of wine, has a long history. The Egyptians were apparently describing wine as ‘very good quality’ by the death of Tutankhamun in 1352 BC (Johnson, 1989), and the Romans subsequently specified the best regions and the greatest vintages (Johnson, 1989). The wine trade today adopts various mechanisms to grade wine, such as informal consumer magazine tastings and wine shows.

The concept of wine quality is very important to the wine industry and to wine consumers. Therefore, the perception that one buys or experiences “quality” by one’s choice may have a significant influence on the consumer decision-making process.

We would like to explore the following questions with this project:

- Which property of wine has the highest density? Which factor makes the quality of white wine better and leads consumers to buy more white wine?
- What could improve the quality of wine with the given properties?
- Which property has the highest grade rate? Why?
- How do we analyse each property and its behaviour with respect to the overall quality of wine?

2. Data and Methods

The data source we chose is the Wine quality dataset of the UCI Machine Learning Repository by Forina, M. et al, PARVUS[^2]. The UCI Machine Learning Repository is a collection of databases and domain theories used by the machine learning community. The archive was created in 1987 by David Aha and fellow graduate students at UC Irvine.

Quality has been an important issue within the field of consumer behaviour (Holbrook & Corfman, 1985; Olson, 1977; Steenkamp, 1990; Sweeney & Soutar, 1995; Zeithaml, 1988), although its nature and its relationship to other factors such as price and value are subject to debate. This study could give us insight into quality analysis for wine, help us understand consumer behaviour, and show which property of wine could attract more buyers.

## Parsed with column specification:
## cols(
##   fixed_acidity = col_double(),
##   volatile_acidity = col_double(),
##   citric_acid = col_double(),
##   residual_sugar = col_double(),
##   chlorides = col_double(),
##   free_sulfur_dioxide = col_double(),
##   total_sulfur_dioxide = col_double(),
##   density = col_double(),
##   pH = col_double(),
##   sulphates = col_double(),
##   alcohol = col_double(),
##   quality = col_integer()
## )

##              fixed_acidity volatile_acidity  citric_acid residual_sugar
## nbr.val       4.898000e+03     4.898000e+03 4.898000e+03   4.898000e+03
## nbr.null      0.000000e+00     0.000000e+00 1.900000e+01   0.000000e+00
## nbr.na        0.000000e+00     0.000000e+00 0.000000e+00   0.000000e+00
## min           3.800000e+00     8.000000e-02 0.000000e+00   6.000000e-01
## max           1.420000e+01     1.100000e+00 1.660000e+00   6.580000e+01
## range         1.040000e+01     1.020000e+00 1.660000e+00   6.520000e+01
## sum           3.357475e+04     1.362825e+03 1.636870e+03   3.130515e+04
## median        6.800000e+00     2.600000e-01 3.200000e-01   5.200000e+00
## mean          6.854788e+00     2.782411e-01 3.341915e-01   6.391415e+00
## SE.mean       1.205772e-02     1.440216e-03 1.729207e-03   7.247276e-02
## CI.mean.0.95  2.363854e-02     2.823469e-03 3.390022e-03   1.420791e-01
## var           7.121136e-01     1.015954e-02 1.464579e-02   2.572577e+01
## std.dev       8.438682e-01     1.007945e-01 1.210198e-01   5.072058e+00
## coef.var      1.231064e-01     3.622561e-01 3.621271e-01   7.935736e-01
##                 chlorides free_sulfur_dioxide total_sulfur_dioxide
## nbr.val      4.898000e+03        4.898000e+03         4.898000e+03
## nbr.null     0.000000e+00        0.000000e+00         0.000000e+00
## nbr.na       0.000000e+00        0.000000e+00         0.000000e+00
## min          9.000000e-03        2.000000e+00         9.000000e+00
## max          3.460000e-01        2.890000e+02         4.400000e+02
## range        3.370000e-01        2.870000e+02         4.310000e+02
## sum          2.241930e+02        1.729390e+05         6.776905e+05
## median       4.300000e-02        3.400000e+01         1.340000e+02
## mean         4.577236e-02        3.530808e+01         1.383607e+02
## SE.mean      3.121775e-04        2.430087e-01         6.072391e-01
## CI.mean.0.95 6.120080e-04        4.764061e-01         1.190461e+00
## var          4.773337e-04        2.892427e+02         1.806085e+03
## std.dev      2.184797e-02        1.700714e+01         4.249806e+01
## coef.var     4.773180e-01        4.816783e-01         3.071543e-01
##                   density           pH    sulphates      alcohol
## nbr.val      4.898000e+03 4.898000e+03 4.898000e+03 4.898000e+03
## nbr.null     0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
## nbr.na       0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
## min          9.871100e-01 2.720000e+00 2.200000e-01 8.000000e+00
## max          1.038980e+00 3.820000e+00 1.080000e+00 1.420000e+01
## range        5.187000e-02 1.100000e+00 8.600000e-01 6.200000e+00
## sum          4.868746e+03 1.561613e+04 2.399270e+03 5.149888e+04
## median       9.937400e-01 3.180000e+00 4.700000e-01 1.040000e+01
## mean         9.940274e-01 3.188267e+00 4.898469e-01 1.051427e+01
## SE.mean      4.273596e-05 2.157592e-03 1.630702e-03 1.758388e-02
## CI.mean.0.95 8.378166e-05 4.229848e-03 3.196907e-03 3.447230e-02
## var          8.945524e-06 2.280118e-02 1.302471e-02 1.514427e+00
## std.dev      2.990907e-03 1.510006e-01 1.141258e-01 1.230621e+00
## coef.var     3.008878e-03 4.736135e-02 2.329827e-01 1.170429e-01
##                   quality group
## nbr.val      4.898000e+03    NA
## nbr.null     0.000000e+00    NA
## nbr.na       0.000000e+00    NA
## min          3.000000e+00    NA
## max          9.000000e+00    NA
## range        6.000000e+00    NA
## sum          2.879000e+04    NA
## median       6.000000e+00    NA
## mean         5.877909e+00    NA
## SE.mean      1.265456e-02    NA
## CI.mean.0.95 2.480862e-02    NA
## var          7.843557e-01    NA
## std.dev      8.856386e-01    NA
## coef.var     1.506724e-01    NA
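The derived rows of the table above (SE.mean, CI.mean.0.95, coef.var) follow directly from the count, mean and standard deviation. The sketch below reproduces them in base R; `describe` is a hypothetical helper name and `x` is made-up toy data, not the wine dataset. The last lines check the arithmetic against the reported fixed_acidity column.

```r
# Hypothetical helper reproducing a few rows of the descriptive table
# with base R only. `x` below is toy data, not the wine dataset.
describe <- function(x) {
  n  <- length(x)
  m  <- mean(x)
  s  <- sd(x)
  se <- s / sqrt(n)                      # SE.mean
  ci <- qt(0.975, df = n - 1) * se       # CI.mean.0.95 (half-width)
  c(nbr.val = n, mean = m, std.dev = s,
    SE.mean = se, CI.mean.0.95 = ci, coef.var = s / m)
}

x <- c(6.8, 7.0, 6.3, 8.1, 7.2, 6.6, 7.4, 6.9)
round(describe(x), 4)

# Check against the reported fixed_acidity column: std.dev / sqrt(nbr.val)
se_fixed_acidity <- 0.8438682 / sqrt(4898)
se_fixed_acidity
```

The coefficient of variation (std.dev divided by mean) is useful here because the properties are on very different scales; it shows, for example, that residual sugar varies far more relative to its mean than density does.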

2a. Data set Construction

The dataset for this project comes from the UCI Machine Learning Repository and includes the following properties:

- Fixed Acidity (the acid that contributes to the conservation of wine)
- Volatile Acidity (the amount of acetic acid in wine, which at high levels leads to an unpleasant vinegar taste)
- Citric Acid (found in small amounts; it can add freshness to the wine)
- Residual Sugar (the amount of sugar remaining at the end of fermentation)
- Chlorides (the amount of salt in wine)
- Free Sulfur Dioxide (prevents the growth of microbes and the oxidation of the wine during fermentation)
- Total Sulfur Dioxide (gives aroma and a tinge of flavour to the wine)
- Density (the density of the wine, which depends on its water and sugar content)
- pH (how acidic or basic a wine is on the pH scale)
- Sulphates (an additive that acts as an antioxidant)
- Alcohol (the percentage of alcohol present in wine)

2b. Data Analysis Methods

To explore the research questions, we used bar charts, line plots, boxplots and histograms to find trends and correlations with the quality of white wine. Some of the data was straightforward to understand, but other parts had to be manipulated before analysis and visualization. For example, the quality column holds a sensory score, which we binned by range into high, medium and low groups.

3. Data Analysis and Visualization

The dataset was not in CSV format, so we converted it to CSV before using it for analysis. The resulting white wine data set has 4898 observations of 13 variables.

Distribution of Properties of Wine

The most straightforward analysis we could do was to check the distribution of all the wine properties. We plotted the distribution of each property and could see that chlorides and residual sugar are right-skewed (most values sit on the left, with a long tail to the right), so we fix the skew by taking the log of these values.

The highest counts occur for properties such as fixed acidity and volatile acidity. To examine this further, we use density plots to check how these properties are associated with wine quality.

Density Plots for Quality analysis

This plot shows the density of each property with respect to wine quality. The highest wine quality score is 9, and its density curve falls between 12 and 14 on the horizontal scale.

We are also trying to understand the range in which residual sugar falls across the quality scores. The highest concentration falls somewhere between 0 and 10. Residual sugar has a positively skewed distribution with a long tail, with values concentrated between 1 g/dm³ and 1.5 g/dm³.

Grouping wine Quality

We have grouped the wine quality into 3 levels: high, medium and low. As the plot below shows, there are significantly more medium quality wines than high or low quality ones. This grouping therefore leaves the classes unbalanced: medium quality wines are over-represented, while the smallest groups have too few data points for analysis.
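The grouping step can be sketched with base R's `cut()`. The report does not state the cut points, so the thresholds below (3 to 4 = Low, 5 to 6 = Medium, 7 to 9 = High) are an assumption, and `quality` here is a toy vector rather than the real column.

```r
# Toy quality scores (not the real data) on the 3-9 scale seen in the summary.
quality <- c(3, 4, 5, 5, 6, 6, 6, 7, 8, 9)

# Assumed cut points: 3-4 = Low, 5-6 = Medium, 7-9 = High.
# cut() uses right-closed intervals by default: (-Inf,4], (4,6], (6,Inf].
group <- cut(quality,
             breaks = c(-Inf, 4, 6, Inf),
             labels = c("Low", "Medium", "High"))

# Count how many wines land in each group to see the imbalance.
table(group)
```

Because `cut()` returns an ordered set of factor levels, later summaries and plots list the groups in the order Low, Medium, High rather than alphabetically.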

Since alcohol was a major contributor in the density plots, we use boxplots to see how many outliers are present. Outliers appear among the low quality wines, some of which have very high alcohol percentages.

Fixing skewness in variables

Skewed properties such as residual sugar, chlorides and sulfur dioxide are transformed using log10 here. After the log10 transformation, residual sugar shows a bimodal distribution, with one peak between 1 and 3 and another between 8 and 12.

For chlorides, the peak occurs at about 0.04 g/dm³. The transformed variable also shows a denser distribution.
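To illustrate why a log10 transformation helps with a long right tail, here is a minimal sketch in base R. The values are toy numbers, not the wine data, and `skewness` is a hypothetical helper written out by hand to avoid extra packages.

```r
# Toy right-skewed values (not the wine data): many small values, a few large.
residual_sugar <- c(1.2, 1.4, 1.6, 2.0, 2.5, 5.0, 8.0, 12.0, 20.0, 65.8)

# Simple moment-based skewness: third central moment over variance^(3/2).
skewness <- function(x) {
  z <- x - mean(x)
  mean(z^3) / mean(z^2)^1.5
}

skew_raw <- skewness(residual_sugar)         # strongly positive
skew_log <- skewness(log10(residual_sugar))  # much closer to zero
c(raw = skew_raw, log10 = skew_log)
```

In a ggplot2 histogram the same effect can be obtained on the axis itself with `scale_x_log10()`, which keeps the original units on the tick labels.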

Below Analysis for Free Sulfur Dioxide

Free sulfur dioxide shows distinct characteristics in the graph above. Its distribution is approximately normal between 1 and 150 mg/dm³, with a concentration between 20 and 40 mg/dm³.

Below Analysis for Total Sulfur Dioxide

Total sulfur dioxide also shows distinct characteristics. It has an approximately normal distribution between 25 and 275 mg/dm³, with a concentration between 90 and 160 mg/dm³.

The graph for density also shows an approximately normal distribution. As this distribution has a few exceptionally high values, we omit the 1% of wines with the highest density.

Below Analysis for pH

The pH distribution shown above is approximately normal, with a peak at about 3.2.

Sulphates show a slightly positively skewed distribution, with a peak at about 0.4 g/dm³.

Alcohol shows a slightly denser, positively skewed distribution. Values range between 8.5% and 14%, with a concentration between 9% and 10.5%.

Univariate Analysis of White Wine between alcohol and Quality.

## White Wine:

## 
## Call:
## lm(formula = quality ~ alcohol, data = subset(white_wine))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5317 -0.5286  0.0012  0.4996  3.1579 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2.582009   0.098008   26.34   <2e-16 ***
## alcohol     0.313469   0.009258   33.86   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7973 on 4896 degrees of freedom
## Multiple R-squared:  0.1897, Adjusted R-squared:  0.1896 
## F-statistic:  1146 on 1 and 4896 DF,  p-value: < 2.2e-16
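The fitted line above can be read as a simple prediction rule. The sketch below plugs the reported coefficients into that rule; `predict_quality` is a hypothetical helper name, and the alcohol values passed to it are chosen for illustration only.

```r
# Coefficients copied from the regression output above.
intercept <- 2.582009
slope     <- 0.313469

# Hypothetical helper: predicted quality for a given alcohol percentage.
predict_quality <- function(alcohol) intercept + slope * alcohol

# Predictions at 9%, at the mean alcohol level (10.51427), and at 12%.
predict_quality(c(9, 10.51427, 12))
```

At the mean alcohol level the prediction is approximately the mean quality from the summary table (5.877909), which is a general property of least squares fits. Each additional percentage point of alcohol raises the predicted quality by about 0.31 points, but with R-squared at 0.19 alcohol alone accounts for only about 19% of the variance in quality.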

Correlation between Alcohol and Density

## White Wine:

## 
## Call:
## lm(formula = density ~ alcohol, data = subset(white_wine))
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.005475 -0.001238 -0.000153  0.001156  0.047201 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.014e+00  2.300e-04 4407.87   <2e-16 ***
## alcohol     -1.896e-03  2.173e-05  -87.25   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.001871 on 4896 degrees of freedom
## Multiple R-squared:  0.6086, Adjusted R-squared:  0.6085 
## F-statistic:  7613 on 1 and 4896 DF,  p-value: < 2.2e-16

Correlation between Density and Fixed Acidity

## Warning: Unknown or uninitialised column: 'acidity'.

## White Wine:

## 
## Call:
## lm(formula = density ~ fixed_acidity, data = subset(white_wine))
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.008935 -0.002318 -0.000348  0.002011  0.044064 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   9.876e-01  3.373e-04 2927.91   <2e-16 ***
## fixed_acidity 9.404e-04  4.884e-05   19.26   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.002884 on 4896 degrees of freedom
## Multiple R-squared:  0.0704, Adjusted R-squared:  0.07021 
## F-statistic: 370.8 on 1 and 4896 DF,  p-value: < 2.2e-16

Correlation between Density and Residual Sugar

## White Wine:

## 
## Call:
## lm(formula = density ~ residual_sugar, data = subset(white_wine))
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -0.0056862 -0.0011059  0.0001726  0.0011523  0.0155617 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    9.909e-01  3.742e-05 26480.7   <2e-16 ***
## residual_sugar 4.947e-04  4.586e-06   107.9   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.001628 on 4896 degrees of freedom
## Multiple R-squared:  0.7039, Adjusted R-squared:  0.7038 
## F-statistic: 1.164e+04 on 1 and 4896 DF,  p-value: < 2.2e-16

We can observe a linear trend in the scatter plots above, except for the relation between density and fixed acidity. Although these properties show little correlation with quality, we will examine their distributions using our own quality classification.
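For a linear model with a single predictor, the correlation coefficient can be recovered from the output as the square root of R-squared with the sign of the slope. The sketch below applies this to the three density regressions above; the numbers are copied from those outputs and the data frame name `fits` is just an illustrative choice.

```r
# Slopes and R-squared values copied from the three regression outputs above.
fits <- data.frame(
  predictor = c("alcohol", "fixed_acidity", "residual_sugar"),
  slope     = c(-1.896e-03, 9.404e-04, 4.947e-04),
  r_squared = c(0.6086, 0.0704, 0.7039)
)

# For a one-predictor linear model, r = sign(slope) * sqrt(R^2).
fits$r <- sign(fits$slope) * sqrt(fits$r_squared)
fits
```

This gives roughly -0.78 for alcohol, 0.27 for fixed acidity and 0.84 for residual sugar, which matches the visual impression that fixed acidity has the weakest linear relation with density.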

Relation between Alcohol and quality group

## White Wine

## group: Low
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00    9.40   10.10   10.17   10.80   13.50 
## -------------------------------------------------------- 
## group: Medium
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00    9.40   10.00   10.27   11.00   14.00 
## -------------------------------------------------------- 
## group: High
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.50   10.70   11.50   11.42   12.40   14.20

High quality wines have, on average, more alcohol than medium or low quality wines.
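Per-group summaries like the one above can be produced with base R's `by()` or `tapply()`. This is a minimal sketch on a toy data frame; the values and the name `toy` are hypothetical and do not come from the wine data.

```r
# Toy data frame (hypothetical values, not the wine data).
toy <- data.frame(
  alcohol = c(9.4, 10.1, 10.8, 9.4, 10.0, 11.0, 10.7, 11.5, 12.4),
  group   = factor(rep(c("Low", "Medium", "High"), each = 3),
                   levels = c("Low", "Medium", "High"))
)

# Min, quartiles, median and mean for each quality group.
by(toy$alcohol, toy$group, summary)

# Group means only, as a named vector.
group_means <- tapply(toy$alcohol, toy$group, mean)
group_means
```

Setting the factor levels explicitly keeps the groups in the order Low, Medium, High in the printed output.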

Relation between Density and quality group

## White Wine

## group: Low
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9892  0.9926  0.9941  0.9943  0.9960  1.0004 
## -------------------------------------------------------- 
## group: Medium
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9872  0.9923  0.9944  0.9945  0.9966  1.0390 
## -------------------------------------------------------- 
## group: High
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9871  0.9905  0.9917  0.9924  0.9936  1.0006

As there is a negative correlation between alcohol and density, we expect a pattern that mirrors the previous one: wines classified as high quality have, on average, lower density than wines of medium or low quality.

Relation between Fixed Acidity and quality group

## White Wine

## group: Low
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.200   6.400   6.900   7.181   7.650  11.800 
## -------------------------------------------------------- 
## group: Medium
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.800   6.300   6.800   6.876   7.400  14.200 
## -------------------------------------------------------- 
## group: High
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.900   6.200   6.700   6.725   7.200   9.200

There seems to be a slight inverse relationship: the lower the average fixed acidity, the higher the quality.

Relation between Residual Sugar and quality group

## White Wine

## group: Low
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.700   1.350   2.700   4.821   7.500  17.550 
## -------------------------------------------------------- 
## group: Medium
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.600   1.700   6.200   6.798  10.500  65.800 
## -------------------------------------------------------- 
## group: High
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.800   1.800   3.875   5.262   7.400  19.250

For the residual sugar, there seems to be no linear relationship to the quality of the white wine. The quality seems to be associated with the right “sweetness”.

Multivariate Plot sections

## Correlation between residual sugar and density for high quality wine:

## 
## Call:
## lm(formula = density ~ residual_sugar, data = subset(white_wine, 
##     group == "High"))
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -0.0039636 -0.0012119 -0.0000708  0.0011311  0.0042220 
## 
## Coefficients:
##                 Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)    9.896e-01  7.712e-05 12832.69   <2e-16 ***
## residual_sugar 5.298e-04  1.136e-05    46.64   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.001586 on 1058 degrees of freedom
## Multiple R-squared:  0.6727, Adjusted R-squared:  0.6724 
## F-statistic:  2175 on 1 and 1058 DF,  p-value: < 2.2e-16

## Correlation between residual sugar and density for medium quality wine:

## 
## Call:
## lm(formula = density ~ residual_sugar, data = subset(white_wine, 
##     group == "Medium"))
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -0.0056575 -0.0008969  0.0001570  0.0009854  0.0163569 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    9.912e-01  4.028e-05 24606.0   <2e-16 ***
## residual_sugar 4.770e-04  4.691e-06   101.7   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.001488 on 3653 degrees of freedom
## Multiple R-squared:  0.7389, Adjusted R-squared:  0.7389 
## F-statistic: 1.034e+04 on 1 and 3653 DF,  p-value: < 2.2e-16

## Correlation between residual sugar and density for low quality wine:

## 
## Call:
## lm(formula = density ~ residual_sugar, data = subset(white_wine, 
##     group == "Low"))
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -0.0042671 -0.0011050  0.0000907  0.0011534  0.0038391 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    9.923e-01  1.870e-04 5305.26   <2e-16 ***
## residual_sugar 4.291e-04  2.892e-05   14.84   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.001687 on 181 degrees of freedom
## Multiple R-squared:  0.5488, Adjusted R-squared:  0.5463 
## F-statistic: 220.1 on 1 and 181 DF,  p-value: < 2.2e-16
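The three per-group fits above can be produced in one pass by splitting the data by group and fitting the same formula to each piece. The sketch below shows the pattern on simulated toy data; the data frame `toy`, its values and the generating model are assumptions for illustration, not the wine data.

```r
# Toy data (hypothetical values) standing in for white_wine.
set.seed(1)
toy <- data.frame(
  residual_sugar = rep(c(1, 3, 5, 8, 12, 16), times = 2),
  group          = rep(c("Low", "High"), each = 6)
)
# Assumed generating model, used for the toy data only.
toy$density <- 0.990 + 0.0005 * toy$residual_sugar +
  ifelse(toy$group == "Low", 0.002, 0) + rnorm(nrow(toy), sd = 1e-4)

# Fit density ~ residual_sugar separately within each group.
fits <- lapply(split(toy, toy$group),
               function(d) lm(density ~ residual_sugar, data = d))

# Collect slope and R-squared per group into a small matrix.
result <- sapply(fits, function(f) {
  c(slope = unname(coef(f)["residual_sugar"]),
    r_squared = summary(f)$r.squared)
})
result
```

Collecting the slope and R-squared side by side makes it easier to compare groups than reading three separate `summary()` printouts.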

Final Summary

Percentage of alcohol by Wine Quality

As shown above, both the mean and the median percentage of alcohol are higher for high quality wines. While the average alcohol content in low and medium quality wines is around 10.2%, in high quality wines it rises to approximately 11.4%.

Fixed acidity is another chemical property related to density. In this scatter plot we removed the medium quality wine points to make the difference between high and low quality wines easier to see. We can observe that high quality wines tend to have lower density for a given fixed acidity. Fixed acidity explains about 63% of the variance in density (R^2 = 0.6308) in the high quality wines and about 54% (R^2 = 0.5369) in the low quality wines.

The chemical property most closely related to density is residual sugar. In this scatter plot, the medium quality wine points were once again removed to make the difference between high and low quality wines easier to see. We can observe that high quality wines tend to have lower density for a given level of residual sugar. Based on the per-group regressions above, residual sugar explains about 67% of the variance in density (R^2 = 0.6727) in high quality whites and about 55% (R^2 = 0.5488) in low quality whites.

4. Discussion

In sum, the highest quality wines tend to have lower density for a given level of residual sugar. However, this research was conducted only on white wine and covers only a few facets of wine quality. Many more questions remain to be explored, and the more data we collect, the better our analysis can be.

Reference

[^1]: Study by Steve Charters, MA (Oxon): Perceptions of Wine Quality http://ro.ecu.edu.au/cgi/viewcontent.cgi?article=1115&context=theses

[^2]: UCI Machine Learning Repository for Wine quality data https://archive.ics.uci.edu/ml/datasets/wine

Project For White Wine

Saurabh,Lekhana

August 12, 2018
