This assignment explores the univariate, bivariate, and multivariate relationships between variables using data analysis techniques in R. The observations and variables cover red wine samples only. The dataset is here, and this information document contains helpful descriptions and domain knowledge for the dataset.
Other helpful links for this work:
- Wiki article on acids in wine
- Types of variables
- Outlier function from r-bloggers
- Analysis on both White and Red Wine
- GGcorr documentation
- Legends GGplot cookbook
- Wiki to reversal paradox
- Package for reversal paradox
- Diamonds example project
- Git sample project 1
- Git sample project 2
## 'data.frame': 1599 obs. of 13 variables:
## $ X                   : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity       : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity    : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid         : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar      : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides           : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density             : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH                  : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates           : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol             : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality             : int 5 5 5 6 5 5 5 7 7 5 ...

X, the first variable, looks like a unique identifier. Note that no observations have a quality greater than 8. This limited spread is not favorable for analysis. Quality is an ordinal categorical variable. A new variable is created from quality for subsequent analysis. Read this for more on types of variables. From the information document: “The inputs include objective tests (e.g. PH values) and the output is based on sensory data (median of at least 3 evaluations made by wine experts). Each expert graded the wine quality between 0 (very bad) and 10 (very excellent)”.
Except for total.sulfur.dioxide and free.sulfur.dioxide, all other numeric variables are continuous. Note that total.sulfur.dioxide is the sum of free.sulfur.dioxide and bound forms; hence the two sulfur variables are related.
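The relationship between the two sulfur variables can be checked with a short sketch. It uses the first four wines from the str() output above; the name bound.sulfur.dioxide is hypothetical and is not a column in the dataset.

```r
# First four wines, values copied from the str() output above
free  <- c(11, 25, 15, 17)   # free.sulfur.dioxide
total <- c(34, 67, 54, 60)   # total.sulfur.dioxide

# bound.sulfur.dioxide is a hypothetical name, not a column in the dataset
bound.sulfur.dioxide <- total - free
print(bound.sulfur.dioxide)

# If total really is free plus bound, free can never exceed total
stopifnot(all(free <= total))
```

Because one variable contains the other, a positive correlation between them is expected by construction.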
volatile.acidity is acetic acid, which is different from tartaric acid (fixed.acidity) and citric.acid. Acetic acid gives wine a vinegar-like taste, while fixed acids do not evaporate easily. Citric acid is added to some wines for freshness or to increase acidity.
Outliers skew the mean. Running a version of this function will identify and plot outliers. The output shows how the distribution improves once outliers are removed, most visibly for variables with a high proportion of outliers or with extreme outlier values; see chlorides, residual sugar and sulphates for examples.
## chlorides outlier stats
## 112 Outliers identified
## 7.5 (%) propotion of outliers
## 0.2 mean of the outliers
## 0.09 mean without removing outliers
## 0.08 mean with removeing outliers
## residual.sugar outlier stats
## 155 Outliers identified
## 10.7 (%) propotion of outliers
## 5.88 mean of the outliers
## 2.54 mean without removing outliers
## 2.18 mean with removeing outliers
## sulphates outlier stats
## 59 Outliers identified
## 3.8 (%) propotion of outliers
## 1.23 mean of the outliers
## 0.66 mean without removing outliers
## 0.64 mean with removeing outliers
## total.sulfur.dioxide outlier stats
## 55 Outliers identified
## 3.6 (%) propotion of outliers
## 143.89 mean of the outliers
## 46.47 mean without removing outliers
## 43 mean with removeing outliers
## fixed.acidity outlier stats
## 49 Outliers identified
## 3.2 (%) propotion of outliers
## 13.29 mean of the outliers
## 8.32 mean without removing outliers
## 8.16 mean with removeing outliers
These distributions below are more like normal distributions, and less impacted by outliers.
## citric.acid outlier stats
## 1 Outliers identified
## 0.1 (%) propotion of outliers
## 1 mean of the outliers
## 0.27 mean without removing outliers
## 0.27 mean with removeing outliers
## pH outlier stats
## 35 Outliers identified
## 2.2 (%) propotion of outliers
## 3.42 mean of the outliers
## 3.31 mean without removing outliers
## 3.31 mean with removeing outliers
## density outlier stats
## 45 Outliers identified
## 2.9 (%) propotion of outliers
## 1 mean of the outliers
## 1 mean without removing outliers
## 1 mean with removeing outliers
## volatile.acidity outlier stats
## 19 Outliers identified
## 1.2 (%) propotion of outliers
## 1.13 mean of the outliers
## 0.53 mean without removing outliers
## 0.52 mean with removeing outliers
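Statistics like the ones above can be computed with a few lines of base R. This is a minimal sketch, not the r-bloggers function itself: the helper name outlier_stats and its return fields are made up for illustration, it assumes outliers are defined by the boxplot rule (points beyond 1.5 times the IQR from the hinges), and it runs on synthetic values rather than the wine data.

```r
# Hypothetical helper; the name and return fields are illustrative only
outlier_stats <- function(x) {
  # boxplot.stats() flags points beyond 1.5 * IQR from the hinges
  out <- boxplot.stats(x)$out
  is_out <- x %in% out
  list(
    n_outliers    = sum(is_out),
    pct_outliers  = round(100 * mean(is_out), 1),
    mean_outliers = mean(x[is_out]),
    mean_with     = mean(x),
    mean_without  = mean(x[!is_out])
  )
}

# Synthetic, right-skewed example: 95 typical values plus 5 extreme ones
x <- c(rep(seq(1, 3, length.out = 19), times = 5), rep(10, 5))
stats <- outlier_stats(x)
print(stats)
```

Comparing mean_with and mean_without shows how far a handful of extreme values can pull the mean.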
This summary excludes the unique identifier variable X:
## fixed.acidity   volatile.acidity  citric.acid     residual.sugar
## Min.   : 4.60   Min.   :0.1200    Min.   :0.000   Min.   : 0.900
## 1st Qu.: 7.10   1st Qu.:0.3900    1st Qu.:0.090   1st Qu.: 1.900
## Median : 7.90   Median :0.5200    Median :0.260   Median : 2.200
## Mean   : 8.32   Mean   :0.5278    Mean   :0.271   Mean   : 2.539
## 3rd Qu.: 9.20   3rd Qu.:0.6400    3rd Qu.:0.420   3rd Qu.: 2.600
## Max.   :15.90   Max.   :1.5800    Max.   :1.000   Max.   :15.500
## chlorides         free.sulfur.dioxide  total.sulfur.dioxide
## Min.   :0.01200   Min.   : 1.00        Min.   :  6.00
## 1st Qu.:0.07000   1st Qu.: 7.00        1st Qu.: 22.00
## Median :0.07900   Median :14.00        Median : 38.00
## Mean   :0.08747   Mean   :15.87        Mean   : 46.47
## 3rd Qu.:0.09000   3rd Qu.:21.00        3rd Qu.: 62.00
## Max.   :0.61100   Max.   :72.00        Max.   :289.00
## density          pH              sulphates        alcohol
## Min.   :0.9901   Min.   :2.740   Min.   :0.3300   Min.   : 8.40
## 1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50
## Median :0.9968   Median :3.310   Median :0.6200   Median :10.20
## Mean   :0.9967   Mean   :3.311   Mean   :0.6581   Mean   :10.42
## 3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10
## Max.   :1.0037   Max.   :4.010   Max.   :2.0000   Max.   :14.90
## quality
## Min.   :3.000
## 1st Qu.:5.000
## Median :6.000
## Mean   :5.636
## 3rd Qu.:6.000
## Max.   :8.000

The quality variable has a max of 8 and a min of 3, with a median of 6 and a mean of 5.6. Recall that quality is on a 0 to 10 scale; hence both end points, zero and ten, are missing, as are one, two and nine, as seen in the histogram. The citric acid min is 0. There exist 132 zero values, and it is the only variable with any zero values. Residual sugar, chlorides and the sulfur variables appear to have outliers. Fixed acidity, residual sugar and alcohol have similar max values but different means and medians. Density and pH appear to have normal distributions.
The feature of main interest is quality, an ordinal categorical variable stored as numbers. Observations fall into three groups (bad, average or good) with respect to this variable. Because the scale is ordinal, however, it is not possible to place a value on the gap between groups or to say that an average wine is three times better than a bad wine.
There exist 132 zero values in the citric acid variable. These zeros account for 8.3 percent of the observations. Citric acid is an inexpensive way to boost total acidity in wine. This wiki link says the use of citric acid for acidification is prohibited in the EU, though limited use is permitted for removing excess iron and copper from wine if potassium ferrocyanide is unavailable.
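The 8.3 percent figure follows directly from the counts reported in this document, as this small check shows. The variable names are illustrative only.

```r
n_zero <- 132    # zero-valued citric.acid observations, from the text
n_obs  <- 1599   # total observations, from the str() output

pct_zero <- round(100 * n_zero / n_obs, 1)
print(pct_zero)

# On the data frame itself the same figure would come from
# round(100 * mean(rw$citric.acid == 0), 1), assuming rw holds the data
```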
There may exist an association between residual.sugar and quality. It is unclear how the multiple acidity variables relate to each other and to quality. Citric.acid is the only variable with observations equal to zero. I expect the strongest positive correlation with quality to be alcohol, and the strongest negative to be volatile acidity.
Yes. Quality was changed to an ordered factor, then a new variable called rating was created to bucket wines as bad, average or good based on quality. In addition, the ttl.acidity variable sums fixed acidity, volatile acidity, and citric acid, as these are measured as separate acids. Volatile acidity, or acetic acid, at high levels can lead to an unpleasant vinegar taste. Therefore I expect at least a moderate negative correlation between volatile acidity and quality. On the other hand, citric acid in small quantities can add ‘freshness’ and flavor to wines, while fixed acids do not evaporate readily per the information document.
# change quality to ordered factor
rw$quality <- factor(rw$quality, ordered = TRUE)
# create total acidity variable
rw$ttl.acidity <- rw$citric.acid + rw$fixed.acidity + rw$volatile.acidity
# create rating variable wrt quality
rw$rating <- ifelse(rw$quality < 5, 'bad',
                    ifelse(rw$quality < 7, 'average', 'good'))
rw$rating <- ordered(rw$rating,
                     levels = c('bad', 'average', 'good'))
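As an alternative to the nested ifelse() calls, the same buckets can be sketched with cut(). This is only an illustration on a made-up vector of scores, not the code used for this report, and the break points are chosen to mirror the rule above (below 5 is bad, below 7 is average, otherwise good).

```r
# Illustrative quality scores, one per level observed in the data (3 to 8)
quality <- c(3, 4, 5, 6, 7, 8)

# Intervals are (0,4], (4,6], (6,10]: the same cut points as the ifelse() rule
rating <- cut(quality,
              breaks = c(0, 4, 6, 10),
              labels = c('bad', 'average', 'good'),
              ordered_result = TRUE)
print(rating)
```

cut() needs numeric input, so on the real data it would have to run before quality is converted to a factor.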
The summary for the rating variable is below, as rounded percentages. There are not many bad or good wines in the dataset, which limits model learning and training.
## bad average good
## 4 82 14
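The percentages can be reproduced from the per-rating counts that appear in the n column of the summary tables in this report. A minimal sketch, with an illustrative variable name:

```r
# Counts per rating, taken from the n column of the summary tables in this report
counts <- c(bad = 63, average = 1319, good = 217)

# Share of each rating as a whole-number percentage
pct <- round(100 * counts / sum(counts))
print(pct)
```

On the data frame, round(100 * prop.table(table(rw$rating))) would give the same shares, assuming rw holds the data with the rating column already added.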
No tidy adjustments were made to the data. The limited spread in the distribution of quality stands out. Other variables have long tails; see the sections above for additional commentary. Subsequent boxplots show outliers, while the rating feature helps classify observations in further analysis.
- Two variables, alcohol and volatile acidity, have moderate correlations with quality. There is a moderate positive correlation of 0.48 between alcohol and quality. From the plot, quality increases at a moderate rate with higher alcohol.
```
## # A tibble: 3 x 4
## rating alcohol_mean alcohol_median n
## <ord> <dbl> <dbl> <int>
## 1 bad 10.21587 10.0 63
## 2 average 10.25272 10.0 1319
## 3 good 11.51805 11.6 217
```
- Volatile acidity and quality have a moderate negative correlation of -0.39, which suggests red wine quality decreases as volatile acidity increases.
## # A tibble: 3 x 4
## rating volatile.acidity_mean volatile.acidity_median n
## <ord> <dbl> <dbl> <int>
## 1 bad 0.7242063 0.68 63
## 2 average 0.5385595 0.54 1319
## 3 good 0.4055300 0.37 217
This plot excludes the average wines to plot alcohol and volatile acidity as two differently colored clusters. The trend lines make it easy to see the relationship between alcohol and volatile acidity by rating.
- There exist weak positive correlations for both 1) quality and sulphates and 2) quality and citric acid. Quality trends in the same direction as both sulphates and citric acid at a weak rate.
## # A tibble: 3 x 4
## rating sulphates_mean sulphates_median n
## <ord> <dbl> <dbl> <int>
## 1 bad 0.5922222 0.56 63
## 2 average 0.6472631 0.61 1319
## 3 good 0.7434562 0.74 217
## # A tibble: 3 x 4
## rating citric.acid_mean citric.acid_median n
## <ord> <dbl> <dbl> <int>
## 1 bad 0.1736508 0.08 63
## 2 average 0.2582638 0.24 1319
## 3 good 0.3764977 0.40 217
- Citric acid and fixed acidity have a strong positive correlation of 0.67; citric acid has a weak positive correlation of 0.23 with quality while fixed acidity has a very weak positive correlation of 0.12.
## # A tibble: 3 x 4
## rating fixed.acidity_mean fixed.acidity_median n
## <ord> <dbl> <dbl> <int>
## 1 bad 7.871429 7.5 63
## 2 average 8.254284 7.8 1319
## 3 good 8.847005 8.7 217
- Volatile acidity and citric acid have a moderate negative correlation; in the plot, the volatile acidity values on the y axis are scaled with the square root function.
## # A tibble: 3 x 4
## rating volatile.acidity_mean volatile.acidity_median n
## <ord> <dbl> <dbl> <int>
## 1 bad 0.7242063 0.68 63
## 2 average 0.5385595 0.54 1319
## 3 good 0.4055300 0.37 217
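The per-rating tables above report a mean, a median and a count for each group. The tibble headers suggest they were produced with dplyr; the sketch below uses base R instead so that it runs without extra packages, and it works on a small made-up data frame rather than the wine data.

```r
# Small synthetic stand-in for the wine data; these values are made up
wine <- data.frame(
  rating  = factor(c('bad', 'bad', 'average', 'average', 'good', 'good'),
                   levels = c('bad', 'average', 'good'), ordered = TRUE),
  alcohol = c(9.5, 10.5, 10.0, 10.4, 11.2, 12.0)
)

# One row per rating with the mean, median and group size
by_rating <- data.frame(
  rating         = levels(wine$rating),
  alcohol_mean   = as.numeric(tapply(wine$alcohol, wine$rating, mean)),
  alcohol_median = as.numeric(tapply(wine$alcohol, wine$rating, median)),
  n              = as.integer(table(wine$rating))
)
print(by_rating)
```

tapply() returns groups in factor-level order, which is why the rows come out as bad, average, good.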
The ttl.acidity variable is the sum of citric, volatile and fixed acidity, so ignore any correlation coefficients between ttl.acidity and these three variables. Volatile acidity and citric acid have a moderate to strong negative correlation; volatile acidity has a moderate negative correlation with quality. Citric acid has a weak positive correlation with quality. Likewise, citric acid and fixed acidity also share a strong correlation.
Citric acid and sulphates have weak positive correlations with quality. Residual sugar correlates with only one variable, density.
Density and citric acid each correlate with five variables.
As expected, fixed.acidity and pH have a strong negative correlation; likewise, total.sulfur.dioxide and free.sulfur.dioxide have a strong positive correlation. A list of notable correlations is below.
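One way to pull notable pairs out of a correlation matrix is sketched here. It assumes a cutoff of 0.3 in absolute value, matching the threshold mentioned for the correlation matrix plot, and it runs on synthetic columns (a, b, c, d are made up) rather than the wine variables.

```r
set.seed(1)
n <- 200
a <- rnorm(n)
# Synthetic columns: b tracks a, c opposes a, d is unrelated noise
toy <- data.frame(
  a = a,
  b = a + rnorm(n, sd = 0.5),
  c = -a + rnorm(n, sd = 0.5),
  d = rnorm(n)
)

r <- cor(toy)   # Pearson correlation matrix
# Keep each pair once (upper triangle) when |r| exceeds the cutoff
idx <- which(upper.tri(r) & abs(r) > 0.3, arr.ind = TRUE)
notable <- data.frame(
  var1 = rownames(r)[idx[, "row"]],
  var2 = colnames(r)[idx[, "col"]],
  r    = round(r[idx], 2)
)
print(notable[order(-abs(notable$r)), ])
```

Applied to the wine data, the same steps would run on the numeric columns only, after dropping X and the factor variables.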
Yes, the Simpsons package clusters data into subsets to test whether the regression at the level of the group runs in the opposite direction to the regression at the level of the clusters. This package helps identify instances of Simpson's paradox.
Executing the Simpson function on citric acid and fixed acidity detects several clusters to regress upon. Only two clusters correlate in the same direction as the group. The overall trend for the subgroups reverses or disappears when the subgroups are combined.
Opposite Trend Lines for subgroup and overall group
This is also known as the reversal or amalgamation paradox. See the wiki page here for more examples. For the correlation between density and fixed acidity, the Simpsons function identifies three clusters, two of which show no evidence of Simpson’s paradox.
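The reversal can be illustrated without the Simpsons package. The sketch below is a toy example on synthetic data with made-up cluster centers, not a reproduction of the package's clustering: each cluster has a negative slope, while the pooled regression slopes upward.

```r
set.seed(42)
centers <- c(1, 4, 7)                      # cluster centers (made-up values)
offsets <- seq(-1, 1, length.out = 20)     # spread of x within each cluster

toy <- do.call(rbind, lapply(seq_along(centers), function(i) {
  x <- centers[i] + offsets
  # Within a cluster, y falls as x rises (slope -0.8) plus small noise
  y <- centers[i] - 0.8 * offsets + rnorm(length(offsets), sd = 0.1)
  data.frame(cluster = i, x = x, y = y)
}))

# The pooled slope is positive because the cluster centers rise together
overall_slope <- unname(coef(lm(y ~ x, data = toy))["x"])
cluster_slopes <- sapply(split(toy, toy$cluster),
                         function(d) unname(coef(lm(y ~ x, data = d))["x"]))
print(overall_slope)
print(cluster_slopes)
```

The opposite signs of the pooled slope and the per-cluster slopes are the signature of the paradox.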
- Citric acid and fixed acidity have a strong positive correlation of 0.67, while citric acid and volatile acidity have a moderate negative correlation of -0.55.
There is a moderate negative correlation of -0.5 between alcohol and density: higher alcohol content is associated with lower density. This makes sense since alcohol is less dense than water. The boxplot confirms wines rated good have higher alcohol content.
pH measures acidity. This multivariate plot shows that the negative correlation between pH and fixed acidity has little effect on rating.
Alcohol, sulphates and citric acid have the largest positive correlations with quality. Higher quality wines tend to be higher in alcohol, citric acid and sulphates. On the other hand, higher quality wines tend to be lower in volatile acidity.
The Simpsons package was applied to different variable pairs to identify lurking variables affecting the overall correlation coefficients. Residual sugar and pH had no meaningful correlation; this is consistent with the idea that wine quality is mostly about acidic profiles.
There is no meaningful correlation between residual sugar and alcohol; that and the positive correlation between pH and volatile acidity were unexpected.
pH measures acidity on a log scale, so stronger correlations between the acidity variables and pH make sense. A linear model can quantify how much of the pH variance the three acidity variables explain. The R-squared value shows that nearly 50% of the pH variance is explained by the acid variables, meaning other variable(s) contribute to the variance.
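# ggplot2 is needed for the boxplot below
library(ggplot2)
# regress pH on the log10 of each acid; wines with citric.acid equal to 0
# are dropped from the fit because log10(0) is -Inf
m <- lm(pH ~
          I(log10(fixed.acidity)) +
          I(log10(volatile.acidity)) +
          I(log10(citric.acid)),
        data = subset(rw, citric.acid > 0))
# predict pH for every wine and compute the relative prediction error
rw$pH.predictions <- predict(m, rw)
rw$pH.error <- (rw$pH.predictions - rw$pH) / rw$pH
# plot from rw (the original referenced an undefined df) and skip the
# non-finite errors that come from the zero citric.acid rows
ggplot(data = subset(rw, is.finite(pH.error)),
       aes(x = quality, y = pH.error)) +
  geom_boxplot()
summary(m)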
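For reference, the model fitted above can be written out as follows. The first line is the standard definition of pH; the second is the regression as specified in the lm() call, with coefficient symbols chosen here for illustration; the third is the usual definition of R-squared.

$$
\mathrm{pH} = -\log_{10}\,[\mathrm{H^+}]
$$

$$
\mathrm{pH}_i = \beta_0 + \beta_1 \log_{10}(\text{fixed.acidity}_i) + \beta_2 \log_{10}(\text{volatile.acidity}_i) + \beta_3 \log_{10}(\text{citric.acid}_i) + \varepsilon_i
$$

$$
R^2 = 1 - \frac{\sum_i (\mathrm{pH}_i - \widehat{\mathrm{pH}}_i)^2}{\sum_i (\mathrm{pH}_i - \overline{\mathrm{pH}})^2}
$$

Taking logs of the acid variables puts the predictors on the same log scale as pH itself.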
Description One
This boxplot confirms wine quality increases with alcohol content, though plenty of outliers exist at quality 5.
Description Two
The trend lines for bad and good wines run in a different direction than the trend line for average wines. The average subgroup shows a negative correlation between alcohol and volatile acidity. The trend lines make it easy to see the relationship between alcohol and volatile acidity by rating.
Correlation matrix for red wine data subset
The correlation matrix makes it easy to identify correlations greater than 0.3; the visual is clean and highlights notable correlations.
With this exploratory data analysis on the red wine dataset, I found the biggest challenge was sharing the right amount of information. Plots and visuals make it easier to see where to explore more. Internet research helped me overcome gaps in domain knowledge, though I can see how domain knowledge is very helpful during the EDA process.
Alcohol and volatile acidity have the largest correlations with quality. Citric acid and sulphates also have positive correlations. Sulphates, like fixed acidity, had several observations with high values but average wine ratings. This subset was missing any wines rated above 8 or below 3. Having a more widely spread dataset would improve the analysis; some of the challenges with the data included factoring the quality variable and domain knowledge. Wine is all about the acids, so understanding the relationships between acids and sulfur could be helpful. Additional inferential statistics and modeling could be done to quantify and confirm the analysis. After this project, I understand why wine data is a fun way to explore data analysis techniques.