Introduction
Simpson’s paradox is a phenomenon that arises in data analysis, which if not considered may lead to spurious conclusions or misleading predictions. The general idea of the paradox is that a data set in general may appear to trend in one direction (positive or negative) but trend in the opposite direction when divided by sub-grouping. This is problematic because looking at a too high level aggregation, especially if using a large data set, may led the novice data analyst to believe that the data has a negative (let’s say) association for all groups. The overall trend may not be true for all groups contained within the data set.
We will look at a few examples, starting with the Palmer Penguins from blog post #1.
We will omit the few observations with missing values.
Palmer Penguins
Population Level Aggregation
If we look at a scatterplot of bill length and bill depth we can see the overall trend of length vs depth for the entire population.
penguin_df %>%
ggplot(aes(x=bill_length_mm, y=bill_depth_mm)) +
geom_point() +
labs(x="Length", y="Depth", title="Bill Depth as a function of Bill Length") +
theme_classic()
If we were to naively regress this data for some kind of linear association we see the following.
lin_reg <- lm(bill_depth_mm ~ bill_length_mm, data=penguin_df)
penguin_df %>%
ggplot(aes(x=bill_length_mm, y=bill_depth_mm)) +
geom_point() +
geom_abline(slope = lin_reg$coefficients[[2]],
intercept = lin_reg$coefficients[[1]],
color="red") +
labs(x="Length", y="Depth",
title="Regression of Depth as a function of Length") +
theme_classic()
Here we see a negative association between bill depth and bill length for the data set. Upon further analysis of the residuals we see that those residuals are reasonably normal and distributed about zero.
lin_reg %>%
ggplot(aes(x=lin_reg$residuals)) +
geom_histogram(color="black") +
labs(x="Residuals", y="Count",
title="Residual Histogram") +
theme_classic()
Additionally, the linear regression summary suggests that we have reason to believe that small negative slope is non-zero (p-value << 0.05).
##
## Call:
## lm(formula = bill_depth_mm ~ bill_length_mm, data = penguin_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.1548 -1.4291 0.0122 1.3994 4.5004
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 20.78665 0.85417 24.335 < 2e-16 ***
## bill_length_mm -0.08233 0.01927 -4.273 2.53e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.92 on 331 degrees of freedom
## Multiple R-squared: 0.05227, Adjusted R-squared: 0.04941
## F-statistic: 18.26 on 1 and 331 DF, p-value: 2.528e-05
Species Level Aggregation
From this we might as well conclude that the longer the bill, the less deep it is. However, if you drill down from the population level to the species level we see the opposite result.
penguin_df %>%
ggplot(aes(x=bill_length_mm, y=bill_depth_mm,
color=species)) +
geom_point() +
labs(x="Length", y="Depth", title="Bill Depth as a function of Bill Length") +
theme_classic()
Bill depth increases with bill length for each of the three species. Let’s again regress linearly the species data and plot.
chin<-
penguin_df %>%
filter(species == "Chinstrap")
adelie<-
penguin_df %>%
filter(species == "Adelie")
gentoo<-
penguin_df %>%
filter(species == "Gentoo")
lm_chin<- lm(data=chin, bill_depth_mm ~ bill_length_mm)
lm_adelie<- lm(data=adelie, bill_depth_mm ~ bill_length_mm)
lm_gentoo<- lm(data=gentoo, bill_depth_mm ~ bill_length_mm)
penguin_df %>%
ggplot(aes(x=bill_length_mm, y=bill_depth_mm,
color=species)) +
geom_point() +
geom_abline(slope = lm_chin$coefficients[[2]],
intercept = lm_chin$coefficients[[1]],
color="black") +
geom_abline(slope = lm_adelie$coefficients[[2]],
intercept = lm_adelie$coefficients[[1]],
color="black") +
geom_abline(slope = lm_gentoo$coefficients[[2]],
intercept = lm_gentoo$coefficients[[1]],
color="black") +
labs(x="Length", y="Depth",
title="Regression of Depth as a function of Length") +
theme_classic()
Here we see the paradoxical effect. The population data had a negative association, whereas the species level data has the opposite. The species level regression have similarly believable summary statistics. An example is below.
lm_chin %>%
ggplot(aes(x=lm_chin$residuals)) +
geom_histogram(color="black") +
labs(x="Residuals", y="Count",
title="Residual Histogram Chinstrap Species") +
theme_classic()
##
## Call:
## lm(formula = bill_depth_mm ~ bill_length_mm, data = chin)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.65742 -0.46033 -0.01862 0.61473 1.69801
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.56914 1.55053 4.882 6.99e-06 ***
## bill_length_mm 0.22221 0.03168 7.015 1.53e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8659 on 66 degrees of freedom
## Multiple R-squared: 0.4271, Adjusted R-squared: 0.4184
## F-statistic: 49.21 on 1 and 66 DF, p-value: 1.526e-09
Conclusion
Defining a valid model is essential for extracting insight from data, and there is no substitute for rigorous data analysis. We see above that without carefully examining the data we may miss important relationships that exist within the data. Simpson’s paradox is an example of how, depending on context, one might reach opposite conclusions for the same predictors. Those opposite conclusions may also be reasonably justifiable using basic statistical tools, e.g. null test on slope, residual distribution, standard error.