Simpson's Paradox Explored Graphically

Rob Creel
August 12, 2017

A Data Set

Here is a data set, df.

str(df)
'data.frame':   120 obs. of  3 variables:
 $ x: num  -2.77 4.4 5.08 10.86 23.8 ...
 $ y: num  8.23 11.15 16.55 20.78 24.79 ...
 $ z: chr  "A" "B" "C" "D" ...

A scatterplot of y against x shows a positive overall trend.

plot of chunk plot1

An Apparent Contradiction

However, this data set contains a confounding variable, z, which leads to a potentially surprising pair of statistics.

lm(y ~ x, data = df)$coefficients[2] # Slope of overall regression.
       x 
0.795245 
as.numeric(by(data = df, INDICES=df$z, FUN=function(d)
    lm(d[, "y"] ~ d[, "x"])$coefficients[2])) # Slope of each factor's regression.
[1] -0.3642108 -0.4352077 -0.5795771 -0.5767912 -0.5500855 -0.5650512

The overall slope is positive, but the slope within each group defined by z is negative.

Showing the Factors with Colors and Regression Lines

plot of chunk colors

Decorated Static Plots Help a Bit

Showing the groups by color illustrates how groups within a data set can show trends that are reversed from the trend of the data set as a whole.
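
A decorated plot of this kind could be drawn roughly as follows. This is a minimal base-graphics sketch, not the code behind the slide: the demo data frame and the helper plot_by_group are hypothetical, and only the column names x, y, and z are taken from df.

```r
# Hypothetical demo data: three groups whose centers rise
# while the trend inside each group falls.
demo <- data.frame(
  x = c(1, 2, 3, 4, 5, 6, 7, 8, 9),
  y = c(3, 2.6, 2, 6, 5.4, 5, 9, 8.5, 8),
  z = rep(c("A", "B", "C"), each = 3),
  stringsAsFactors = FALSE
)

# Hypothetical helper: color points by group, add one fitted line per group
# plus the overall fit, and return the per-group slopes invisibly.
plot_by_group <- function(d) {
  groups <- sort(unique(d$z))
  cols <- setNames(seq_along(groups) + 1, groups)  # one palette color per group
  plot(d$x, d$y, col = cols[d$z], pch = 19, xlab = "x", ylab = "y")
  abline(lm(y ~ x, data = d), lty = 2)             # dashed line: overall fit
  slopes <- sapply(groups, function(g) {
    fit <- lm(y ~ x, data = d[d$z == g, ])
    abline(fit, col = cols[g])                     # solid line: this group's fit
    unname(coef(fit)[2])
  })
  legend("topleft", legend = groups, col = cols, pch = 19)
  invisible(slopes)
}

# Draw only in an interactive session.
if (interactive()) plot_by_group(demo)
```

The dashed line rises while each colored line falls, which is the reversal described above. Note that abline extends each group's line across the whole plot; clipping the lines to each group's x range is left out to keep the sketch short.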

The App Helps a Lot

The app lets the user explore various arrangements of data that demonstrate this apparent paradox. The user can specify whether the groups are large or small, how many groups there are, and how spread out the groups are.
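
The three controls can be mimicked outside the app with a small generator. The sketch below is an assumption about how such data might be produced, not the app's actual code: the function simulate_paradox, its parameter names, the group-center spacing, and the noise level are all hypothetical.

```r
# Hypothetical generator; parameter names are illustrative, not the app's inputs.
#   n_groups     how many groups there are (at most 26, one letter each)
#   n_per_group  whether the groups are large or small
#   spread       how spread out each group is along x
simulate_paradox <- function(n_groups = 6, n_per_group = 20, spread = 2,
                             within_slope = -0.5, seed = 1) {
  set.seed(seed)
  centers <- 5 * (seq_len(n_groups) - 1)  # group centers rise along y = x + 10
  parts <- lapply(seq_len(n_groups), function(k) {
    x <- centers[k] + rnorm(n_per_group, sd = spread)
    # Inside a group, y moves with slope within_slope around the group center.
    y <- 10 + centers[k] + within_slope * (x - centers[k]) +
      rnorm(n_per_group, sd = 0.5)
    data.frame(x = x, y = y, z = LETTERS[k], stringsAsFactors = FALSE)
  })
  do.call(rbind, parts)
}

# Slope of a simple regression of y on x.
slope <- function(d) unname(coef(lm(y ~ x, data = d))[2])

sim <- simulate_paradox()
overall_slope <- slope(sim)                        # positive with these defaults
group_slopes <- sapply(split(sim, sim$z), slope)   # one negative slope per group
```

With these defaults the group centers are far apart relative to the spread, so the overall slope is positive while every within-group slope is negative. Raising spread well beyond the spacing between centers lets the within-group trend dominate, and the overall slope can shrink or even turn negative, at which point the reversal disappears.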

Sample Configurations From the App

Sample A

  • lots of points
  • grouping not shown

plot of chunk sc1

Sample B

  • few points
  • groupings shown

plot of chunk sc2