When modelling multivariate relationships, it is important to identify and account for adjustment and confounding variables, because ignoring them can distort the apparent relationship between variables and lead to misleading results.
I have used the Wine Quality dataset from Kaggle (https://www.kaggle.com/rajyellow46/wine-quality) for this task.
The dataset contains the chemical properties of red and white variants of the Portuguese “Vinho Verde” wine. These chemical properties play a vital role in determining the quality of the wine.
library(readr)
library(tidyverse)
## -- Attaching packages ------------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 3.2.1 v purrr 0.3.2
## v tibble 2.1.3 v dplyr 0.8.3
## v tidyr 0.8.3 v stringr 1.4.0
## v ggplot2 3.2.1 v forcats 0.4.0
## -- Conflicts ---------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
winequalityN <- read_csv("winequalityN.csv")
## Parsed with column specification:
## cols(
## type = col_character(),
## `fixed acidity` = col_double(),
## `volatile acidity` = col_double(),
## `citric acid` = col_double(),
## `residual sugar` = col_double(),
## chlorides = col_double(),
## `free sulfur dioxide` = col_double(),
## `total sulfur dioxide` = col_double(),
## density = col_double(),
## pH = col_double(),
## sulphates = col_double(),
## alcohol = col_double(),
## quality = col_double()
## )
str(winequalityN)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 6497 obs. of 13 variables:
## $ type : chr "white" "white" "white" "white" ...
## $ fixed acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free sulfur dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total sulfur dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : num 6 6 6 6 6 6 6 6 6 6 ...
## - attr(*, "spec")=
## .. cols(
## .. type = col_character(),
## .. `fixed acidity` = col_double(),
## .. `volatile acidity` = col_double(),
## .. `citric acid` = col_double(),
## .. `residual sugar` = col_double(),
## .. chlorides = col_double(),
## .. `free sulfur dioxide` = col_double(),
## .. `total sulfur dioxide` = col_double(),
## .. density = col_double(),
## .. pH = col_double(),
## .. sulphates = col_double(),
## .. alcohol = col_double(),
## .. quality = col_double()
## .. )
summary(winequalityN)
## type fixed acidity volatile acidity citric acid
## Length:6497 Min. : 3.800 Min. :0.0800 Min. :0.0000
## Class :character 1st Qu.: 6.400 1st Qu.:0.2300 1st Qu.:0.2500
## Mode :character Median : 7.000 Median :0.2900 Median :0.3100
## Mean : 7.217 Mean :0.3397 Mean :0.3187
## 3rd Qu.: 7.700 3rd Qu.:0.4000 3rd Qu.:0.3900
## Max. :15.900 Max. :1.5800 Max. :1.6600
## NA's :10 NA's :8 NA's :3
## residual sugar chlorides free sulfur dioxide
## Min. : 0.600 Min. :0.00900 Min. : 1.00
## 1st Qu.: 1.800 1st Qu.:0.03800 1st Qu.: 17.00
## Median : 3.000 Median :0.04700 Median : 29.00
## Mean : 5.444 Mean :0.05604 Mean : 30.53
## 3rd Qu.: 8.100 3rd Qu.:0.06500 3rd Qu.: 41.00
## Max. :65.800 Max. :0.61100 Max. :289.00
## NA's :2 NA's :2
## total sulfur dioxide density pH sulphates
## Min. : 6.0 Min. :0.9871 Min. :2.720 Min. :0.2200
## 1st Qu.: 77.0 1st Qu.:0.9923 1st Qu.:3.110 1st Qu.:0.4300
## Median :118.0 Median :0.9949 Median :3.210 Median :0.5100
## Mean :115.7 Mean :0.9947 Mean :3.218 Mean :0.5312
## 3rd Qu.:156.0 3rd Qu.:0.9970 3rd Qu.:3.320 3rd Qu.:0.6000
## Max. :440.0 Max. :1.0390 Max. :4.010 Max. :2.0000
## NA's :9 NA's :4
## alcohol quality
## Min. : 8.00 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.30 Median :6.000
## Mean :10.49 Mean :5.818
## 3rd Qu.:11.30 3rd Qu.:6.000
## Max. :14.90 Max. :9.000
##
The quality variable is converted from a numeric to a categorical data type using the following mapping:
Quality 3-4: Poor, Quality 5-6: Average, Quality 7-9: Good
winequalityN$quality <- ifelse(winequalityN$quality <= 4, "Poor", ifelse(winequalityN$quality <= 6, "Average", "Good"))
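As an alternative sketch of the same mapping, base R's `cut()` can bin the scores into an ordered factor, which keeps the levels in Poor, Average, Good order in legends and facets instead of alphabetical order. The vector `quality_score` below is a made-up stand-in for the numeric `winequalityN$quality` column; this is an illustration under that assumption, not the code used for the plots in this post.

```r
# Hypothetical scores standing in for the numeric winequalityN$quality column
quality_score <- c(3, 4, 5, 6, 7, 9)

# Intervals are right-closed by default: (2,4] Poor, (4,6] Average, (6,9] Good
quality_level <- cut(
  quality_score,
  breaks = c(2, 4, 6, 9),
  labels = c("Poor", "Average", "Good"),
  ordered_result = TRUE
)

# Count how many scores fall into each category
print(table(quality_level))
```

Unlike the nested `ifelse()`, any score outside the 3-9 range becomes `NA` here rather than being silently labelled "Poor" or "Good".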
The plot below suggests that sulphates and pH are positively correlated: as sulphates increase, pH also tends to increase.
ggplot(winequalityN) +
aes(x = sulphates, y = pH) +
geom_point(size = 1L, colour = "#0c4c8a") +
geom_smooth(method='lm', color = "black", size = 2.5) +
theme_minimal()
## Warning: Removed 13 rows containing non-finite values (stat_smooth).
## Warning: Removed 13 rows containing missing values (geom_point).
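The visual impression from a fitted line can also be checked numerically with `cor()`. The sketch below uses a small made-up data frame (`toy`, not the wine data) that has a missing value in each column, similar to the NA's reported by `summary()` above, and shows how `use = "complete.obs"` drops incomplete rows in the same way the plot does.

```r
# Made-up values for illustration only; these are not rows from winequalityN
toy <- data.frame(
  sulphates = c(0.45, 0.49, NA, 0.40, 0.62, 0.58),
  pH        = c(3.00, 3.30, 3.26, NA, 3.35, 3.31)
)

# Pearson correlation over rows where both variables are present
r <- cor(toy$sulphates, toy$pH, use = "complete.obs")

# Number of rows dropped because at least one value is missing
n_dropped <- sum(!complete.cases(toy))

print(r)
print(n_dropped)
```

On the real data the equivalent call would be `cor(winequalityN$sulphates, winequalityN$pH, use = "complete.obs")`.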
Grouping further by wine quality shows the same positive correlation at every quality level.
ggplot(winequalityN) +
aes(x = sulphates, y = pH, colour = quality) +
geom_point(size = 1L, alpha = 0.1) +
geom_smooth(method='lm', size = 2.5) +
scale_color_hue() +
theme_minimal()
## Warning: Removed 13 rows containing non-finite values (stat_smooth).
## Warning: Removed 13 rows containing missing values (geom_point).
However, on further grouping by wine type, white wine still shows a positive correlation between sulphates and pH across all quality levels, whereas red wine shows a negative relationship: pH decreases as sulphates increase at every quality level.
ggplot(winequalityN) +
aes(x = sulphates, y = pH, colour = quality) +
geom_point(size = 1L, alpha = 0.1) +
geom_smooth(method='lm', size = 2.5) +
scale_color_hue() +
theme_minimal() +
facet_wrap(vars(type))
## Warning: Removed 13 rows containing non-finite values (stat_smooth).
## Warning: Removed 13 rows containing missing values (geom_point).
ggplot(winequalityN) +
aes(x = sulphates, y = pH, colour = type) +
geom_point(size = 1L, alpha = 0.1) +
geom_smooth(method='lm', size = 2.5) +
scale_color_hue() +
theme_minimal()
## Warning: Removed 13 rows containing non-finite values (stat_smooth).
## Warning: Removed 13 rows containing missing values (geom_point).
ggplot(winequalityN) +
aes(x = sulphates, y = pH, colour = type) +
geom_point(size = 1L, alpha = 0.1) +
geom_smooth(method='lm', size = 2.5) +
scale_color_hue() +
theme_minimal() +
facet_wrap(vars(quality))
## Warning: Removed 13 rows containing non-finite values (stat_smooth).
## Warning: Removed 13 rows containing missing values (geom_point).
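The reversal described above, where the pooled trend is positive but one subgroup trends the other way, can be reproduced with a few lines of base R. The data frame `toy` below is made up so that the arithmetic is easy to follow (it is not the wine data), and the column names are chosen only to echo the plots.

```r
# Made-up data: red has a falling trend, white a rising one,
# and white sits at higher sulphates and pH than red
toy <- data.frame(
  type      = rep(c("red", "white"), each = 4),
  sulphates = c(1, 2, 3, 4, 5, 6, 7, 8),
  pH        = c(4, 3.5, 3, 2.5, 7, 7.5, 8, 8.5)
)

# Slope of a single regression line fitted to all rows together
pooled_slope <- coef(lm(pH ~ sulphates, data = toy))[["sulphates"]]

# Slope of a separate regression line fitted within each wine type
group_slopes <- sapply(
  split(toy, toy$type),
  function(d) coef(lm(pH ~ sulphates, data = d))[["sulphates"]]
)

print(pooled_slope)
print(group_slopes)
```

The pooled slope is positive because the gap between the two groups dominates the fit, even though the red subgroup on its own has a negative slope.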
The plot below suggests that residual sugar and alcohol are negatively correlated: as residual sugar increases, alcohol tends to decrease.
ggplot(winequalityN) +
aes(x = `residual sugar`, y = alcohol) +
geom_point(size = 1L, colour = "#0c4c8a") +
geom_smooth(method='lm', color = "black", size = 2.5) +
theme_minimal()
## Warning: Removed 2 rows containing non-finite values (stat_smooth).
## Warning: Removed 2 rows containing missing values (geom_point).
After taking wine type and quality into account, we observe the following changes in the relationship.
ggplot(winequalityN) +
aes(x = `residual sugar`, y = alcohol, colour = type) +
geom_point(size = 1L, alpha = 0.1) +
geom_smooth(method='lm', size = 2.5) +
scale_color_hue() +
theme_minimal()
## Warning: Removed 2 rows containing non-finite values (stat_smooth).
## Warning: Removed 2 rows containing missing values (geom_point).
ggplot(winequalityN) +
aes(x = `residual sugar`, y = alcohol, colour = type) +
geom_point(size = 1L, alpha = 0.1) +
geom_smooth(method='lm', size = 2.5) +
scale_color_hue() +
theme_minimal() +
facet_wrap(vars(quality))
## Warning: Removed 2 rows containing non-finite values (stat_smooth).
## Warning: Removed 2 rows containing missing values (geom_point).
ggplot(winequalityN) +
aes(x = `residual sugar`, y = alcohol, colour = quality) +
geom_point(size = 1L, alpha = 0.1) +
geom_smooth(method='lm', size = 2.5) +
scale_color_hue() +
theme_minimal()
## Warning: Removed 2 rows containing non-finite values (stat_smooth).
## Warning: Removed 2 rows containing missing values (geom_point).
However, on further grouping by wine type, white wine still shows a negative relationship across all levels of wine quality, while red wine shows a positive relationship between residual sugar and alcohol across all quality levels.
ggplot(winequalityN) +
aes(x = `residual sugar`, y = alcohol, colour = quality) +
geom_point(size = 1L, alpha = 0.1) +
geom_smooth(method='lm', size = 2.5) +
scale_color_hue() +
theme_minimal() +
facet_wrap(vars(type))
## Warning: Removed 2 rows containing non-finite values (stat_smooth).
## Warning: Removed 2 rows containing missing values (geom_point).
The plot below suggests that density and total sulfur dioxide are positively correlated: as density increases, total sulfur dioxide also tends to increase.
ggplot(winequalityN) +
aes(x = density, y = `total sulfur dioxide`) +
geom_point(size = 1L, colour = "#0c4c8a") +
geom_smooth(method='lm', color = "black", size = 2.5) +
theme_minimal()
Grouping by wine type shows that both red and white wines maintain a positive relationship between density and total sulfur dioxide.
ggplot(winequalityN) +
aes(x = density, y = `total sulfur dioxide`, colour = type) +
geom_point(size = 1L, alpha = 0.1) +
geom_smooth(method='lm', size = 2.5) +
scale_color_hue() +
theme_minimal()
Further grouping the wine types by quality level shows that white wine maintains a positive relationship between density and total sulfur dioxide across all quality levels. Red wine, however, shows a positive correlation for the Poor and Average quality levels, whereas for the Good quality level the correlation is negative: total sulfur dioxide decreases as density increases.
ggplot(winequalityN) +
aes(x = density, y = `total sulfur dioxide`, colour = quality) +
geom_point(size = 1L, alpha = 0.1) +
geom_smooth(method='lm', size = 2.5) +
scale_color_hue() +
theme_minimal() +
facet_wrap(vars(type))
In the analysis above, wine quality level and wine type have been identified as confounding variables: the correlation between chemical properties changes, and in some cases reverses direction, once they are taken into account, so ignoring them would bias the results. Thus, when working on multivariate relationships, it is important to identify and account for these variables when modelling.
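One possible way to account for a grouping variable such as wine type in a model, sketched here as an assumption rather than as the method used in this post, is to include it as an interaction term in `lm()`, so that each type gets its own slope. The data frame `toy` and its values are made up for illustration and only mimic the direction of the residual sugar and alcohol trends seen above.

```r
# Made-up data: alcohol rises with sugar for red and falls for white
toy <- data.frame(
  type    = rep(c("red", "white"), each = 4),
  sugar   = c(1, 2, 3, 4, 5, 6, 7, 8),
  alcohol = c(9, 9.5, 10, 10.5, 11, 10, 9, 8)
)

# sugar * type expands to sugar + type + sugar:type,
# so the slope of sugar is allowed to differ between wine types
fit <- lm(alcohol ~ sugar * type, data = toy)
coefs <- coef(fit)

# "sugar" is the slope for the reference level (red);
# "sugar:typewhite" is how much the white slope differs from it
red_slope   <- coefs[["sugar"]]
white_slope <- coefs[["sugar"]] + coefs[["sugar:typewhite"]]

print(red_slope)
print(white_slope)
```

A model without the interaction, `lm(alcohol ~ sugar, data = toy)`, would report a single slope for both types and hide the fact that the two trends point in opposite directions.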