Illustrating Confounding Variables through Wine

Jedo Enriquez

2020-05-31

Confounding variables (also known as lurking variables) are factors that influence both the independent variable and the dependent variable in an analysis. Failing to account for them can lead to incorrect conclusions about how the two are related.
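To make the definition concrete, here is a minimal sketch in base R using simulated data (none of these numbers come from the wine dataset, and the names `x`, `y` and `z` are purely illustrative). A binary variable `z` pushes `x` up and `y` down, so a regression that ignores `z` reports a negative slope even though `y` rises with `x` inside each group.

```r
# Simulated data only: none of these numbers come from the wine dataset.
set.seed(42)
n <- 200

# z is the confounder: it shifts x upward and y downward.
z <- rep(c(0, 1), each = n)
x <- rnorm(2 * n, mean = 2 + 6 * z, sd = 1)

# Within each group, y rises by 0.5 per unit of x.
y <- 5 + 0.5 * x - 7 * z + rnorm(2 * n, sd = 0.5)

# Ignoring z: the slope on x comes out negative.
naive_slope <- coef(lm(y ~ x))[["x"]]

# Adjusting for z: the slope on x is close to the simulated 0.5.
adjusted_slope <- coef(lm(y ~ x + z))[["x"]]

print(c(naive = naive_slope, adjusted = adjusted_slope))
```

The sign of the slope flips once `z` enters the model, which is the pattern to watch for in real data.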

For a basic illustration, we look at data on a commodity that certainly saw its share of consumption growth during the quarantine period: wine.

Importing our dataset

For this exercise we’re using the wine dataset from https://www.kaggle.com/rajyellow46/wine-quality, which contains information on Portuguese “Vinho Verde” wines, including measurements of physicochemical properties as well as a taste quality score for each wine.

library(tidyverse)
wine <- read_csv("winequality.csv")
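Before plotting, it may be worth checking the imported data for missing values, since the plots in this post warn about rows being removed. The helper below is a small sketch in base R; `count_missing` is a hypothetical name, not part of any package.

```r
# Count missing values per column; returns a named numeric vector.
count_missing <- function(df) {
  colSums(is.na(df))
}
```

Assuming the data is loaded as `wine`, calling `count_missing(wine)` would list how many values are missing in each column, and `table(wine$type)` would show how many rows there are of each wine type.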

The lower the sugar, the better the wine?

By plotting ‘wine quality’ against ‘residual sugar’ and fitting a straight line, we find that as the sweetness level goes up, the quality score goes down.

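# Linear trend of quality against residual sugar, all wines pooled together
ggplot(wine) +
  aes(x = `residual sugar`, y = quality) +
  # on ggplot2 >= 3.4.0, use linewidth = 1.5 instead of size
  geom_smooth(method = "lm", size = 1.5) +
  theme_minimal()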
#> `geom_smooth()` using formula 'y ~ x'
#> Warning: Removed 2 rows containing non-finite values (stat_smooth).
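The fitted line can also be summarised as a number. The sketch below fits the same kind of straight line with `lm()` and returns the slope with a 95% confidence interval. `pooled_slope` is a hypothetical helper, and the default column names are the ones used in the plot.

```r
# Fit quality ~ sugar on the pooled data; return the slope and its 95% CI.
pooled_slope <- function(df, x = "residual sugar", y = "quality") {
  # Copy the two columns under syntactic names to keep the formula simple.
  d <- data.frame(sugar = df[[x]], quality = df[[y]])
  fit <- lm(quality ~ sugar, data = d)  # rows with NA are dropped by default
  ci <- confint(fit)["sugar", ]
  c(slope = coef(fit)[["sugar"]], lower = ci[[1]], upper = ci[[2]])
}
```

Assuming the data is loaded as `wine`, `pooled_slope(wine)` would give the slope of the line in the plot. A negative value corresponds to the downward trend.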

Before we conclude that wine drinkers always prefer wine that’s completely devoid of sweetness, we first have to consider whether confounding variables are affecting our analysis.

An obvious candidate in our dataset is the type of wine, red or white, which could plausibly be related to both residual sugar and the quality score.

White Wine vs Red Wine

If we produce the same plot but split by the type of wine, we discover that our premature conclusion does not hold. The fitted line slopes upward for red wine, so quality scores rise with sweetness there; the downward trend appears only for white wine.

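# Same plot, with one fitted line per wine type
ggplot(wine) +
  aes(x = `residual sugar`, y = quality, colour = type) +
  # on ggplot2 >= 3.4.0, use linewidth = 1 instead of size
  geom_smooth(method = "lm", size = 1) +
  scale_color_hue() +
  theme_minimal()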
#> `geom_smooth()` using formula 'y ~ x'
#> Warning: Removed 2 rows containing non-finite values (stat_smooth).
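The split plot can be checked numerically by fitting one straight line per wine type. This is a minimal sketch in base R; `slope_by_type` is a hypothetical helper, and it assumes the grouping column is called `type` as in the plot.

```r
# Fit quality ~ sugar separately for each wine type; return one slope per type.
slope_by_type <- function(df, x = "residual sugar", y = "quality",
                          group = "type") {
  d <- data.frame(sugar = df[[x]], quality = df[[y]], type = df[[group]])
  sapply(split(d, d$type), function(g) {
    coef(lm(quality ~ sugar, data = g))[["sugar"]]
  })
}
```

Assuming the data is loaded as `wine`, `slope_by_type(wine)` would return a named vector with one slope per type. Slopes of opposite sign would match what the coloured lines show. A single model with an interaction term, such as `lm(quality ~ sugar * type)`, is an alternative way to get the same per-type slopes.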

You can develop this further by extending the analysis to the following variables, which could also be confounders in this example: