When modelling multivariate relationships, it is important to identify and account for adjustment and confounding variables, because ignoring them can distort the apparent relationship between two variables and lead to misleading results.

Dataset

I have used the Wine quality dataset obtained from Kaggle https://www.kaggle.com/rajyellow46/wine-quality for this task.

The dataset contains the chemical properties of red and white variants of the Portuguese “Vinho Verde” wine. These chemical properties play a vital role in determining the quality of the wine.

Load and View Data

library(readr)
library(tidyverse)
## -- Attaching packages ------------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 3.2.1     v purrr   0.3.2
## v tibble  2.1.3     v dplyr   0.8.3
## v tidyr   0.8.3     v stringr 1.4.0
## v ggplot2 3.2.1     v forcats 0.4.0
## -- Conflicts ---------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
winequalityN <- read_csv("winequalityN.csv")
## Parsed with column specification:
## cols(
##   type = col_character(),
##   `fixed acidity` = col_double(),
##   `volatile acidity` = col_double(),
##   `citric acid` = col_double(),
##   `residual sugar` = col_double(),
##   chlorides = col_double(),
##   `free sulfur dioxide` = col_double(),
##   `total sulfur dioxide` = col_double(),
##   density = col_double(),
##   pH = col_double(),
##   sulphates = col_double(),
##   alcohol = col_double(),
##   quality = col_double()
## )
str(winequalityN)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 6497 obs. of  13 variables:
##  $ type                : chr  "white" "white" "white" "white" ...
##  $ fixed acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free sulfur dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total sulfur dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : num  6 6 6 6 6 6 6 6 6 6 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   type = col_character(),
##   ..   `fixed acidity` = col_double(),
##   ..   `volatile acidity` = col_double(),
##   ..   `citric acid` = col_double(),
##   ..   `residual sugar` = col_double(),
##   ..   chlorides = col_double(),
##   ..   `free sulfur dioxide` = col_double(),
##   ..   `total sulfur dioxide` = col_double(),
##   ..   density = col_double(),
##   ..   pH = col_double(),
##   ..   sulphates = col_double(),
##   ..   alcohol = col_double(),
##   ..   quality = col_double()
##   .. )
summary(winequalityN)
##      type           fixed acidity    volatile acidity  citric acid    
##  Length:6497        Min.   : 3.800   Min.   :0.0800   Min.   :0.0000  
##  Class :character   1st Qu.: 6.400   1st Qu.:0.2300   1st Qu.:0.2500  
##  Mode  :character   Median : 7.000   Median :0.2900   Median :0.3100  
##                     Mean   : 7.217   Mean   :0.3397   Mean   :0.3187  
##                     3rd Qu.: 7.700   3rd Qu.:0.4000   3rd Qu.:0.3900  
##                     Max.   :15.900   Max.   :1.5800   Max.   :1.6600  
##                     NA's   :10       NA's   :8        NA's   :3       
##  residual sugar     chlorides       free sulfur dioxide
##  Min.   : 0.600   Min.   :0.00900   Min.   :  1.00     
##  1st Qu.: 1.800   1st Qu.:0.03800   1st Qu.: 17.00     
##  Median : 3.000   Median :0.04700   Median : 29.00     
##  Mean   : 5.444   Mean   :0.05604   Mean   : 30.53     
##  3rd Qu.: 8.100   3rd Qu.:0.06500   3rd Qu.: 41.00     
##  Max.   :65.800   Max.   :0.61100   Max.   :289.00     
##  NA's   :2        NA's   :2                            
##  total sulfur dioxide    density             pH          sulphates     
##  Min.   :  6.0        Min.   :0.9871   Min.   :2.720   Min.   :0.2200  
##  1st Qu.: 77.0        1st Qu.:0.9923   1st Qu.:3.110   1st Qu.:0.4300  
##  Median :118.0        Median :0.9949   Median :3.210   Median :0.5100  
##  Mean   :115.7        Mean   :0.9947   Mean   :3.218   Mean   :0.5312  
##  3rd Qu.:156.0        3rd Qu.:0.9970   3rd Qu.:3.320   3rd Qu.:0.6000  
##  Max.   :440.0        Max.   :1.0390   Max.   :4.010   Max.   :2.0000  
##                                        NA's   :9       NA's   :4       
##     alcohol         quality     
##  Min.   : 8.00   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.30   Median :6.000  
##  Mean   :10.49   Mean   :5.818  
##  3rd Qu.:11.30   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :9.000  
## 

Variable categorization

The quality variable has been converted from a numeric to a categorical data type using the following mapping:

Quality 3-4 : Poor , Quality 5-6 : Average , Quality 7-9 : Good

winequalityN$quality <- ifelse(winequalityN$quality <= 4, "Poor", ifelse(winequalityN$quality <= 6, "Average", "Good"))
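
Because ifelse() returns plain character strings, later legends and facets order the levels alphabetically (Average, Good, Poor) rather than by quality. A minimal sketch of an alternative using base R cut() is shown below on a toy vector of scores. The helper name to_quality_level is hypothetical and not part of the original analysis; it would have to be applied to the numeric column before the conversion above.

```r
# Hypothetical helper: bin numeric quality scores into ordered levels.
# Intervals are (-Inf, 4], (4, 6], (6, Inf), matching the mapping above.
to_quality_level <- function(score) {
  cut(score,
      breaks = c(-Inf, 4, 6, Inf),
      labels = c("Poor", "Average", "Good"),
      ordered_result = TRUE)
}

# Toy scores, made up for illustration (not taken from the wine data)
scores <- c(3, 4, 5, 6, 7, 9)
levels_out <- to_quality_level(scores)
print(levels_out)
```

An ordered factor keeps Poor < Average < Good in plot legends and facet panels.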

Data Analysis

1) Sulphates - pH

The plot below shows that sulphates and pH are positively correlated when all wines are pooled: as sulphates increase, pH also increases.

ggplot(winequalityN) +
  aes(x = sulphates, y = pH) +
  geom_point(size = 1L, colour = "#0c4c8a") +
  geom_smooth(method = "lm", color = "black", size = 2.5) +
  theme_minimal()
## Warning: Removed 13 rows containing non-finite values (stat_smooth).
## Warning: Removed 13 rows containing missing values (geom_point).
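
The 13 removed rows in the warnings above are consistent with the 9 missing pH values and 4 missing sulphates values reported by summary(). To put a number on the trend line, the Pearson correlation can be computed on the complete pairs only. The sketch below runs on a made-up toy data frame; pairwise_cor is a hypothetical helper name, and the toy values are not taken from the wine data.

```r
# Hypothetical helper: Pearson correlation using only rows where
# both columns are present (use = "complete.obs" drops NA pairs).
pairwise_cor <- function(data, x, y) {
  cor(data[[x]], data[[y]], use = "complete.obs")
}

# Toy data with missing values, made up for illustration
toy <- data.frame(
  sulphates = c(0.40, 0.45, 0.50, NA, 0.60),
  pH        = c(3.00, 3.10, 3.20, 3.30, NA)
)

r_pooled <- pairwise_cor(toy, "sulphates", "pH")
print(r_pooled)
```

On the real data the equivalent call would be pairwise_cor(winequalityN, "sulphates", "pH").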

Grouping by wine quality shows the same positive correlation at every quality level.

ggplot(winequalityN) +
  aes(x = sulphates, y = pH, colour = quality) +
  geom_point(size = 1L, alpha = 0.1) +
  geom_smooth(method = "lm", size = 2.5) +
  scale_color_hue() +
  theme_minimal()
## Warning: Removed 13 rows containing non-finite values (stat_smooth).
## Warning: Removed 13 rows containing missing values (geom_point).

However, on further grouping by wine type, white wine still shows a positive correlation between sulphates and pH across all quality levels, whereas red wine shows a negative relationship: pH decreases as sulphates increase at every quality level.

ggplot(winequalityN) +
  aes(x = sulphates, y = pH, colour = quality) +
  geom_point(size = 1L, alpha = 0.1) +
  geom_smooth(method = "lm", size = 2.5) +
  scale_color_hue() +
  theme_minimal() +
  facet_wrap(vars(type))
## Warning: Removed 13 rows containing non-finite values (stat_smooth).
## Warning: Removed 13 rows containing missing values (geom_point).

ggplot(winequalityN) +
  aes(x = sulphates, y = pH, colour = type) +
  geom_point(size = 1L, alpha = 0.1) +
  geom_smooth(method = "lm", size = 2.5) +
  scale_color_hue() +
  theme_minimal()
## Warning: Removed 13 rows containing non-finite values (stat_smooth).

## Warning: Removed 13 rows containing missing values (geom_point).

ggplot(winequalityN) +
  aes(x = sulphates, y = pH, colour = type) +
  geom_point(size = 1L, alpha = 0.1) +
  geom_smooth(method = "lm", size = 2.5) +
  scale_color_hue() +
  theme_minimal() +
  facet_wrap(vars(quality))
## Warning: Removed 13 rows containing non-finite values (stat_smooth).

## Warning: Removed 13 rows containing missing values (geom_point).
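
The sign change between red and white wine can also be checked numerically by computing the correlation separately within each group. The sketch below uses base R split() on a made-up toy data frame whose values were chosen to mimic the pattern in the plots; cor_by_group is a hypothetical helper name, and none of the numbers come from the wine data.

```r
# Hypothetical helper: Pearson correlation of x and y within each level
# of a grouping column, dropping incomplete pairs inside every group.
cor_by_group <- function(data, x, y, group) {
  sapply(split(data, data[[group]]), function(d) {
    cor(d[[x]], d[[y]], use = "complete.obs")
  })
}

# Toy data, made up for illustration: red trends down, white trends up
toy <- data.frame(
  type      = rep(c("red", "white"), each = 4),
  sulphates = c(0.5, 0.6, 0.7, 0.8, 0.3, 0.4, 0.5, 0.6),
  pH        = c(3.5, 3.4, 3.3, 3.2, 3.0, 3.1, 3.2, 3.3)
)

by_type <- cor_by_group(toy, "sulphates", "pH", "type")
print(by_type)
```

On the real data the equivalent call would be cor_by_group(winequalityN, "sulphates", "pH", "type").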

2) Residual Sugar - Alcohol

The plot below shows that residual sugar and alcohol are negatively correlated: as residual sugar increases, alcohol decreases.

ggplot(winequalityN) +
  aes(x = `residual sugar`, y = alcohol) +
  geom_point(size = 1L, colour = "#0c4c8a") +
  geom_smooth(method='lm', color = "black", size = 2.5) +
  theme_minimal()
## Warning: Removed 2 rows containing non-finite values (stat_smooth).
## Warning: Removed 2 rows containing missing values (geom_point).

After taking wine type and quality into account, we observe the following changes in the relationship.

  1. For all levels of wine quality, the relationship between residual sugar and alcohol remains negative for white wine, but for red wine it is positive: alcohol increases as residual sugar rises.
ggplot(winequalityN) +
  aes(x = `residual sugar`, y = alcohol, colour = type) +
  geom_point(size = 1L, alpha = 0.1) +
  geom_smooth(method='lm',  size = 2.5) +
  scale_color_hue() +
  theme_minimal()
## Warning: Removed 2 rows containing non-finite values (stat_smooth).
## Warning: Removed 2 rows containing missing values (geom_point).

ggplot(winequalityN) +
  aes(x = `residual sugar`, y = alcohol, colour = type) +
  geom_point(size = 1L, alpha = 0.1) +
  geom_smooth(method='lm',  size = 2.5) +
  scale_color_hue() +
  theme_minimal() +
  facet_wrap(vars(quality))
## Warning: Removed 2 rows containing non-finite values (stat_smooth).

## Warning: Removed 2 rows containing missing values (geom_point).

  2. When grouped by wine quality alone, the relationship between residual sugar and alcohol remains negative at every quality level.
ggplot(winequalityN) +
  aes(x = `residual sugar`, y = alcohol, colour = quality) +
  geom_point(size = 1L, alpha = 0.1) +
  geom_smooth(method='lm',  size = 2.5) + 
  scale_color_hue() +
  theme_minimal()
## Warning: Removed 2 rows containing non-finite values (stat_smooth).
## Warning: Removed 2 rows containing missing values (geom_point).

However, on further grouping by wine type, white wine still shows a negative relationship across all quality levels, while red wine shows a positive relationship between residual sugar and alcohol across all quality levels.

ggplot(winequalityN) +
  aes(x = `residual sugar`, y = alcohol, colour = quality) +
  geom_point(size = 1L, alpha = 0.1) +
  geom_smooth(method='lm',  size = 2.5) +
  scale_color_hue() +
  theme_minimal() +
  facet_wrap(vars(type))
## Warning: Removed 2 rows containing non-finite values (stat_smooth).
## Warning: Removed 2 rows containing missing values (geom_point).
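
One way to check this sign flip without relying on the plots is to fit a linear model with an interaction between the predictor and wine type, so that each type gets its own slope. The sketch below uses made-up toy values chosen to mimic the pattern described above, not results from the wine data. The column name residual_sugar is an assumption for the toy frame; the real column is `residual sugar` and needs backticks in a formula.

```r
# Toy data, made up for illustration: red rises, white falls
toy <- data.frame(
  type           = rep(c("red", "white"), each = 4),
  residual_sugar = c(1.5, 2.0, 2.5, 3.0, 1, 5, 10, 15),
  alcohol        = c(9.6, 9.9, 10.6, 10.9, 12, 11, 10, 9)
)

# Pooled model: a single slope for all wines
pooled <- lm(alcohol ~ residual_sugar, data = toy)
slope_pooled <- unname(coef(pooled)["residual_sugar"])

# Interaction model: "red" is the baseline level, so the main effect is the
# red slope and the interaction term is the white-minus-red difference
fit <- lm(alcohol ~ residual_sugar * type, data = toy)
b <- coef(fit)
slope_red   <- unname(b["residual_sugar"])
slope_white <- unname(b["residual_sugar"] + b["residual_sugar:typewhite"])

print(c(pooled = slope_pooled, red = slope_red, white = slope_white))
```

In the toy data the pooled slope is negative even though the red slope is positive, which is the same kind of reversal seen in the plots.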

3) Density - Total Sulfur Dioxide

The plot below shows that density and total sulfur dioxide are positively correlated: as density increases, total sulfur dioxide also increases.

ggplot(winequalityN) +
  aes(x = density, y = `total sulfur dioxide`) +
  geom_point(size = 1L, colour = "#0c4c8a") +
  geom_smooth(method = "lm", color = "black", size = 2.5) +
  theme_minimal()

Grouping by wine type shows that both red and white wines maintain a positive relationship between density and total sulfur dioxide.

ggplot(winequalityN) +
  aes(x = density, y = `total sulfur dioxide`, colour = type) +
  geom_point(size = 1L, alpha = 0.1) +
  geom_smooth(method = "lm", size = 2.5) +
  scale_color_hue() +
  theme_minimal()

Further grouping the wine types by quality level shows that white wine maintains a positive relationship between density and total sulfur dioxide across all quality levels. Red wine shows a positive correlation for the Poor and Average quality levels, whereas for the Good quality level the correlation is negative: total sulfur dioxide decreases as density increases.

ggplot(winequalityN) +
  aes(x = density, y = `total sulfur dioxide`, colour = quality) +
  geom_point(size = 1L, alpha = 0.1) +
  geom_smooth(method = "lm", size = 2.5) +
  scale_color_hue() +
  theme_minimal() +
  facet_wrap(vars(type))
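
When the data are split by two variables at once, some cells can become small, and a line fitted through few points is less stable. It can therefore help to report the number of complete observations next to each within-cell correlation. The sketch below does this with base R on a made-up toy data frame; cor_by_cells and the column name tsd are hypothetical, and the values are not taken from the wine data.

```r
# Hypothetical helper: for every combination of the grouping columns,
# return the number of complete pairs and their Pearson correlation.
cor_by_cells <- function(data, x, y, groups) {
  cells <- split(data, data[groups], drop = TRUE)
  rows <- lapply(names(cells), function(nm) {
    d  <- cells[[nm]]
    ok <- complete.cases(d[[x]], d[[y]])
    data.frame(cell = nm, n = sum(ok), r = cor(d[[x]][ok], d[[y]][ok]))
  })
  do.call(rbind, rows)
}

# Toy data, made up for illustration: only the red/Good cell trends down
toy <- data.frame(
  type    = rep(c("red", "white"), each = 6),
  quality = rep(rep(c("Poor", "Good"), each = 3), times = 2),
  density = c(0.995, 0.996, 0.997, 0.995, 0.996, 0.997,
              0.991, 0.993, 0.995, 0.990, 0.992, 0.994),
  tsd     = c(40, 50, 60, 60, 50, 40,
              100, 120, 140, 90, 110, 130)
)

result <- cor_by_cells(toy, "density", "tsd", c("type", "quality"))
print(result)
```

On the real data the equivalent call would be cor_by_cells(winequalityN, "density", "total sulfur dioxide", c("type", "quality")).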

Conclusion

In the above data analysis, wine quality level and wine type have been identified as confounding variables: accounting for them changes, and in some cases reverses, the observed correlation between chemical properties, so leaving them out biases the pooled result. Thus, when working on multivariate relationships, it is important to identify these variables and include them when modelling.