Simpson’s Paradox

Jeff Shamp

2020-08-31

Introduction

Simpson’s paradox is a phenomenon that arises in data analysis, which if not considered may lead to spurious conclusions or misleading predictions. The general idea of the paradox is that a data set in general may appear to trend in one direction (positive or negative) but trend in the opposite direction when divided by sub-grouping. This is problematic because looking at a too high level aggregation, especially if using a large data set, may led the novice data analyst to believe that the data has a negative (let’s say) association for all groups. The overall trend may not be true for all groups contained within the data set.

We will look at a few examples, starting with the Palmer Penguins from blog post #1.

We will omit the few observations with missing values.

Palmer Penguins

library(tidyverse)
library(palmerpenguins)
set.seed(9450)

penguin_df<- 
  palmerpenguins::penguins %>%
  na.omit()


DT::datatable(head(penguin_df))

Population Level Aggregation

If we look at a scatterplot of bill length and bill depth we can see the overall trend of length vs depth for the entire population.

penguin_df %>%
  ggplot(aes(x=bill_length_mm, y=bill_depth_mm)) +
  geom_point() + 
  labs(x="Length", y="Depth", title="Bill Depth as a function of Bill Length") +
  theme_classic()

If we were to naively regress this data for some kind of linear association we see the following.

lin_reg <- lm(bill_depth_mm ~ bill_length_mm, data=penguin_df)

penguin_df %>%
  ggplot(aes(x=bill_length_mm, y=bill_depth_mm)) +
  geom_point() +
  geom_abline(slope = lin_reg$coefficients[[2]],
              intercept = lin_reg$coefficients[[1]], 
              color="red") +
  labs(x="Length", y="Depth",
       title="Regression of Depth as a function of Length") +
  theme_classic()

Here we see a negative association between bill depth and bill length for the data set. Upon further analysis of the residuals we see that those residuals are reasonably normal and distributed about zero.

lin_reg %>%
  ggplot(aes(x=lin_reg$residuals)) +
  geom_histogram(color="black") + 
  labs(x="Residuals", y="Count", 
       title="Residual Histogram") +
  theme_classic()

Additionally, the linear regression summary suggests that we have reason to believe that small negative slope is non-zero (p-value << 0.05).

summary(lin_reg)
## 
## Call:
## lm(formula = bill_depth_mm ~ bill_length_mm, data = penguin_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.1548 -1.4291  0.0122  1.3994  4.5004 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    20.78665    0.85417  24.335  < 2e-16 ***
## bill_length_mm -0.08233    0.01927  -4.273 2.53e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.92 on 331 degrees of freedom
## Multiple R-squared:  0.05227,    Adjusted R-squared:  0.04941 
## F-statistic: 18.26 on 1 and 331 DF,  p-value: 2.528e-05

Species Level Aggregation

From this we might as well conclude that the longer the bill, the less deep it is. However, if you drill down from the population level to the species level we see the opposite result.

penguin_df %>%
  ggplot(aes(x=bill_length_mm, y=bill_depth_mm, 
             color=species)) +
  geom_point() + 
  labs(x="Length", y="Depth", title="Bill Depth as a function of Bill Length") +
  theme_classic()

Bill depth increases with bill length for each of the three species. Let’s again regress linearly the species data and plot.

chin<-
  penguin_df %>%
  filter(species == "Chinstrap")
adelie<-
  penguin_df %>%
  filter(species == "Adelie")
gentoo<-
  penguin_df %>%
  filter(species == "Gentoo")

lm_chin<- lm(data=chin, bill_depth_mm ~ bill_length_mm)
lm_adelie<- lm(data=adelie, bill_depth_mm ~ bill_length_mm)
lm_gentoo<- lm(data=gentoo, bill_depth_mm ~ bill_length_mm)

penguin_df %>%
  ggplot(aes(x=bill_length_mm, y=bill_depth_mm, 
             color=species)) +
  geom_point() +
  geom_abline(slope = lm_chin$coefficients[[2]],
              intercept = lm_chin$coefficients[[1]], 
              color="black") +
  geom_abline(slope = lm_adelie$coefficients[[2]],
              intercept = lm_adelie$coefficients[[1]], 
              color="black") +
  geom_abline(slope = lm_gentoo$coefficients[[2]],
              intercept = lm_gentoo$coefficients[[1]], 
              color="black") +
  labs(x="Length", y="Depth",
       title="Regression of Depth as a function of Length") +
  theme_classic()

Here we see the paradoxical effect. The population data had a negative association, whereas the species level data has the opposite. The species level regression have similarly believable summary statistics. An example is below.

lm_chin %>%
  ggplot(aes(x=lm_chin$residuals)) +
  geom_histogram(color="black") + 
  labs(x="Residuals", y="Count", 
       title="Residual Histogram Chinstrap Species") +
  theme_classic()

summary(lm_chin)
## 
## Call:
## lm(formula = bill_depth_mm ~ bill_length_mm, data = chin)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.65742 -0.46033 -0.01862  0.61473  1.69801 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     7.56914    1.55053   4.882 6.99e-06 ***
## bill_length_mm  0.22221    0.03168   7.015 1.53e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8659 on 66 degrees of freedom
## Multiple R-squared:  0.4271, Adjusted R-squared:  0.4184 
## F-statistic: 49.21 on 1 and 66 DF,  p-value: 1.526e-09

Conclusion

Defining a valid model is essential for extracting insight from data, and there is no substitute for rigorous data analysis. We see above that without carefully examining the data we may miss important relationships that exist within the data. Simpson’s paradox is an example of how, depending on context, one might reach opposite conclusions for the same predictors. Those opposite conclusions may also be reasonably justifiable using basic statistical tools, e.g. null test on slope, residual distribution, standard error.