Redwine Data - Exploratory Data Analytics

Libraries Used

library(ggplot2)
library(scales)
library(DT)
library(tidyr)
library(dplyr)
library(psych)
library(gridExtra)
library(GGally)
library(corrplot)

The dataset and its attributes:

The dataset is the red wine subset of the UCI Wine Quality data. It contains the following variables:

Input variables (based on physicochemical tests):

1 - fixed acidity

2 - volatile acidity

3 - citric acid

4 - residual sugar

5 - chlorides

6 - free sulfur dioxide

7 - total sulfur dioxide

8 - density

9 - pH

10 - sulphates

11 - alcohol

Output variable (based on sensory data):

12 - quality (score between 0 and 10)

1. Understand the dimensions and check if there are missing values

We start the exploratory data analysis by looking at the shape of the dataset and the data types, with the following findings:

There are 1599 observations and 12 variables.

Integer variable: The integer variable quality is the dependent (target) variable.
Numeric variables: The remaining 11 numeric variables are the independent (input) variables.

Based on the skew values, residual.sugar, chlorides, and sulphates are strongly skewed. We will look at the distribution of each variable to verify this finding.

Below is the structure of the 12 variables in the data set, as reported by str().

## 'data.frame':    1599 obs. of  12 variables:
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
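The source does not show how the shape, types, and skew values were obtained. The following is a minimal sketch in base R, assuming a data frame with the same column names. `wine_toy` holds made-up values for illustration only, and `skewness_simple` is a hypothetical helper that computes moment-based skewness; `psych::describe()` also reports a `skew` column, but whether the author used it is not stated.

```r
# Toy stand-in for the wine data; the values are made up for illustration only.
wine_toy <- data.frame(
  residual.sugar = c(1.9, 2.6, 2.3, 1.9, 1.8, 1.6, 2.0, 6.1, 15.5),
  pH             = c(3.51, 3.20, 3.26, 3.16, 3.51, 3.30, 3.39, 3.36, 3.35),
  quality        = c(5L, 5L, 5L, 6L, 5L, 5L, 7L, 7L, 5L)
)

dim(wine_toy)  # number of rows and columns
str(wine_toy)  # data type of each column

# Hypothetical helper: third central moment divided by the
# second central moment raised to the power 1.5.
skewness_simple <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / (mean((x - m)^2))^1.5
}

# Skewness of every numeric column (integers count as numeric here).
skews <- sapply(wine_toy[sapply(wine_toy, is.numeric)], skewness_simple)
round(skews, 2)
```

A value well above 0 indicates a long right tail, which is what the text reports for residual.sugar, chlorides, and sulphates.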

Checking for missing values

We then check whether any of the variables has missing values. There are 0 rows with NA in the data set, which means there are no missing values.

## [1] 0
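A count like the one above can be produced with base R. This is a sketch under the assumption that "rows with NA" means rows where at least one column is missing; `wine_toy` is a made-up data frame with two deliberately missing cells.

```r
# Toy data with two missing cells; values are made up for illustration only.
wine_toy <- data.frame(
  alcohol = c(9.4, 9.8, NA, 10.5),
  pH      = c(3.51, 3.20, 3.26, NA),
  quality = c(5L, 5L, 6L, 5L)
)

# Number of rows that contain at least one NA.
n_rows_with_na <- sum(!complete.cases(wine_toy))

# Number of NAs in each column, useful for locating the gaps.
na_per_column <- colSums(is.na(wine_toy))

n_rows_with_na
na_per_column
```

On the real data the first value would be 0, matching the output shown above.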

Checking for outliers

We also checked each independent variable for outliers. In most cases the outliers are in the upper range. In the table below, ol_low_prct and ol_upper_prct are proportions of the 1599 observations, not percentages.

variables	count	iqr_val	l_whisker	u_whisker	ol_low	ol_low_prct	ol_upper	ol_upper_prct
alcohol	1599	1.60	7.10	13.50	0	0.00	13	0.01
chlorides	1599	0.02	0.04	0.12	9	0.01	103	0.06
citric.acid	1599	0.33	-0.40	0.91	0	0.00	1	0.00
density	1599	0.00	0.99	1.00	21	0.01	24	0.02
fixed.acidity	1599	2.10	3.95	12.35	0	0.00	49	0.03
free.sulfur.dioxide	1599	14.00	-14.00	42.00	0	0.00	30	0.02
pH	1599	0.19	2.92	3.68	14	0.01	21	0.01
residual.sugar	1599	0.70	0.85	3.65	0	0.00	155	0.10
sulphates	1599	0.18	0.28	1.00	0	0.00	59	0.04
total.sulfur.dioxide	1599	40.00	-38.00	122.00	0	0.00	55	0.03
volatile.acidity	1599	0.25	0.02	1.02	0	0.00	19	0.01
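The whisker values in the table are consistent with the usual 1.5 × IQR rule (for alcohol, 9.5 − 1.5 × 1.6 = 7.10 and 11.1 + 1.5 × 1.6 = 13.50). The source does not show the code, so the following is only a sketch of how such a row could be computed. `outlier_summary` is a hypothetical helper whose column names mirror the table, and the `toy` vector is made up.

```r
# Hypothetical helper: IQR-based outlier counts for one numeric vector.
# k = 1.5 is the conventional multiplier for boxplot whiskers.
outlier_summary <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr_val <- q[2] - q[1]
  lower <- q[1] - k * iqr_val
  upper <- q[2] + k * iqr_val
  data.frame(
    count         = sum(!is.na(x)),
    iqr_val       = iqr_val,
    l_whisker     = lower,
    u_whisker     = upper,
    ol_low        = sum(x < lower, na.rm = TRUE),
    ol_low_prct   = mean(x < lower, na.rm = TRUE),   # proportion, not percent
    ol_upper      = sum(x > upper, na.rm = TRUE),
    ol_upper_prct = mean(x > upper, na.rm = TRUE)    # proportion, not percent
  )
}

# Made-up values with one obviously high observation.
toy <- c(9.4, 9.8, 9.8, 9.5, 10.0, 10.5, 9.4, 9.9, 10.2, 14.9)
res <- outlier_summary(toy)
res

# For a whole data frame of numeric columns, one row per variable:
# do.call(rbind, lapply(wine, outlier_summary))
```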

2. Understand the distributions of variables

Our logical next step is to check the distribution of our target variable, quality. Although quality is stored as an integer, it is an ordinal variable. As a result, if we draw scatter plots of quality against the other variables, the points will line up in 6 lines, either vertical or horizontal. We can verify this by drawing the scatter plots, but violin plots are more revealing.

Based on the plot below, we can also see that the red wine quality data is unbalanced: a large majority of wines fall into quality 5 and 6, less than 4% of the overall sample is below 5, and less than 14% is above 6.
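The class shares quoted above can be checked with a frequency table. This is a minimal sketch using base R; `quality_toy` is a made-up vector, so the printed shares will not match the real data.

```r
# Made-up quality scores for illustration only.
quality_toy <- c(3L, 5L, 5L, 5L, 6L, 6L, 6L, 5L, 7L, 6L)

counts <- table(quality_toy)   # number of wines per quality score
shares <- prop.table(counts)   # same, as proportions of the sample

# Shares of the low and high tails of the scale.
share_below_5 <- mean(quality_toy < 5)
share_above_6 <- mean(quality_toy > 6)

counts
round(shares, 3)
c(below_5 = share_below_5, above_6 = share_above_6)
```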

Define variables for visualization

Define some global variables

We plot histograms of all input (independent) variables, as shown below, and find the following:

Density and pH are almost normally distributed.
The rest are positively skewed.
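One way to draw a histogram per variable is to reshape the data to long form and facet by variable. The sketch below uses ggplot2, which the document loads, and base R's `stack()` for the reshaping. `wine_toy` is simulated (one roughly normal column and one right-skewed column), and the bin count and colours are arbitrary choices rather than the author's settings.

```r
library(ggplot2)

# Simulated stand-in data: pH is roughly normal, residual.sugar is right-skewed.
set.seed(7)
wine_toy <- data.frame(
  pH             = rnorm(200, mean = 3.3, sd = 0.15),
  residual.sugar = rlnorm(200, meanlog = 0.8, sdlog = 0.4)
)

# stack() returns a long data frame with columns `values` and `ind`
# (`ind` is a factor holding the original column name).
long <- stack(wine_toy)

# One histogram panel per variable, each with its own axis scales.
p <- ggplot(long, aes(x = values)) +
  geom_histogram(bins = 30, fill = "steelblue", colour = "white") +
  facet_wrap(~ ind, scales = "free") +
  labs(x = NULL, y = "count")

print(p)
```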

3. Understand the relationship between input variables and the target variable

We want to understand two areas of relationships:

The relationship between each independent variable and the target variable.
The relationship among independent variables.

As mentioned earlier, we expect that a scatter plot of any input variable against quality will show the points arranged in lines. Plotting the graphs confirms this. We also notice some outliers. For example, the residual.sugar vs. quality graph shows several wines with very high sugar values, and the total.sulfur.dioxide vs. quality graph shows 2 good wines (quality = 7) with extremely high total sulfur dioxide compared to the rest.

The graphs above still do not show how each variable is distributed within each quality level or where the quartiles fall, so we use violin and density graphs to get this information. The shapes of the violins make it clear that wines of different quality follow different distributions. The boxplot in the middle of each violin is also revealing. For example, quality increases as the median value of volatile.acidity decreases and as that of citric.acid increases, while the median values of residual.sugar (apart from the outliers) and chlorides differ little across quality levels. The density graphs lead to almost the same conclusions as the violins because a violin is itself a density plot.
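A violin with an embedded boxplot, as described above, can be sketched with ggplot2 as follows. The data are simulated so that volatile.acidity falls as quality rises, purely to make the pattern visible. The widths and colours are arbitrary, and this is not necessarily how the original figures were drawn.

```r
library(ggplot2)

# Simulated data: three quality levels with decreasing volatile.acidity.
set.seed(1)
wine_toy <- data.frame(
  quality = rep(c(5L, 6L, 7L), each = 30),
  volatile.acidity = c(rnorm(30, 0.60, 0.10),
                       rnorm(30, 0.50, 0.10),
                       rnorm(30, 0.40, 0.10))
)

# quality is converted to a factor so each level gets its own violin.
p <- ggplot(wine_toy, aes(x = factor(quality), y = volatile.acidity)) +
  geom_violin(fill = "grey85") +
  geom_boxplot(width = 0.15, outlier.size = 0.8) +
  labs(x = "quality", y = "volatile.acidity")

print(p)

# The middle line of each boxplot is the per-quality median:
medians <- aggregate(volatile.acidity ~ quality, data = wine_toy, FUN = median)
medians
```

Treating quality as a factor reflects the earlier observation that it is ordinal rather than continuous.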

We plot the graphs among variables, and by observing the below we have the following findings:

Some input variables have a more visible relationship with quality than others.

Volatile.acidity and citric.acid are negatively correlated with each other.
Quality shows no visible pattern with residual.sugar or chlorides, and only a weak one with the acids.
Higher quality appears to correspond with higher alcohol and sulphates.
There are outliers in free.sulfur.dioxide, total.sulfur.dioxide, sulphates, fixed.acidity, volatile.acidity, citric.acid, residual.sugar and chlorides.

4. Understand the relationship in input variables

We now understand the relationship between each input variable and the target variable fairly well. But what about the relationships among the input variables? We plot the correlations, and from both graphs we find the following:

Fixed.acidity has a strong positive correlation with citric.acid and density, and a strong negative correlation with pH.
Citric.acid has a moderate negative correlation with volatile.acidity and pH.
Alcohol is loosely negatively correlated with density and mildly positively correlated with quality.
There is not much relationship among the rest.
Given the relatively weak correlations among the various variables, it may be difficult to get a robust model for quality.
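The pairwise correlations discussed in this section can be computed with base R's `cor()`. The sketch below uses simulated data in which pH is constructed to fall as fixed.acidity rises; the coefficients it prints are therefore illustrative and are not results from the wine data. The choice of Pearson correlation is an assumption, since the source does not name the method.

```r
# Simulated stand-in data: pH depends negatively on fixed.acidity,
# alcohol is independent of both.
set.seed(42)
n <- 200
fixed.acidity <- rnorm(n, mean = 8.3, sd = 1.7)
wine_toy <- data.frame(
  fixed.acidity = fixed.acidity,
  pH            = 3.3 - 0.06 * (fixed.acidity - 8.3) + rnorm(n, 0, 0.1),
  alcohol       = rnorm(n, mean = 10.4, sd = 1.0)
)

# Pearson correlation matrix of all columns.
cor_mat <- cor(wine_toy, method = "pearson")
round(cor_mat, 2)

# The document loads corrplot; with it installed, the matrix can be drawn with:
# corrplot::corrplot(cor_mat, method = "circle")
```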