Rob Creel
August 12, 2017
Here is a data set, df.
str(df)
'data.frame': 120 obs. of 3 variables:
$ x: num -2.77 4.4 5.08 10.86 23.8 ...
$ y: num 8.23 11.15 16.55 20.78 24.79 ...
$ z: chr "A" "B" "C" "D" ...
A plot of the data shows a positive trend between x and y.
However, the data set contains a confounding variable, z, which leads to a potentially surprising pair of statistics.
lm(y ~ x, data = df)$coefficients[2] # Slope of overall regression.
x
0.795245
as.numeric(by(data = df, INDICES = df$z, FUN = function(d)
  lm(y ~ x, data = d)$coefficients[2])) # Slope of the regression within each level of z.
[1] -0.3642108 -0.4352077 -0.5795771 -0.5767912 -0.5500855 -0.5650512
The overall slope is positive, but the slope within every level of z is negative.
Showing the groups by color illustrates how groups within a data set can show trends that are reversed from the trend of the data set as a whole.
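One way such a colored plot could be drawn in base R is sketched below. The helper name plot_by_group and the small demo data frame are hypothetical and are not taken from the app. The sketch assumes only a data frame with columns x, y, and z like the one shown above.

```r
# Hypothetical helper: scatter plot colored by group, with one fitted line
# per group (solid) and the overall fitted line (dashed).
plot_by_group <- function(data) {
  groups <- sort(unique(as.character(data$z)))
  cols <- setNames(rainbow(length(groups)), groups)

  plot(data$x, data$y, col = cols[as.character(data$z)], pch = 19,
       xlab = "x", ylab = "y")

  # Within-group regression lines, drawn over each group's own x range.
  for (g in groups) {
    d <- data[data$z == g, ]
    fit <- lm(y ~ x, data = d)
    xr <- range(d$x)
    lines(xr, predict(fit, newdata = data.frame(x = xr)),
          col = cols[[g]], lwd = 2)
  }

  # Overall regression line, ignoring the grouping.
  abline(lm(y ~ x, data = data), lty = 2, lwd = 2)
  legend("topleft", legend = groups, col = cols, pch = 19, title = "z")
  invisible(cols)
}

# Assumed demo data (not the original df): three groups whose centers rise
# with x while y falls with x inside each group.
set.seed(2)
demo <- do.call(rbind, lapply(1:3, function(i) {
  x <- 10 * i + rnorm(15, sd = 2)
  data.frame(x = x,
             y = 10 * i - 0.5 * (x - 10 * i) + rnorm(15, sd = 0.5),
             z = LETTERS[i],
             stringsAsFactors = FALSE)
}))
```

Calling plot_by_group(demo) draws the points, the per-group lines, and the dashed overall line on the current graphics device.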
The app lets the user explore various arrangements of data that demonstrate this apparent paradox. The user can specify whether the groups are large or small, how many groups there are, and how spread out the groups are.
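The app's own data-generating code is not shown here, so the following is only a minimal sketch of how the three controls might map onto a simulation. The function names simulate_groups and compare_slopes, the parameter names, the spacing of the group centers, and the within-group slope of -0.5 are all assumptions made for illustration.

```r
# Hypothetical generator: group centers rise along y = x, while points inside
# each group follow a negative slope around their center.
simulate_groups <- function(n_groups = 6, n_per_group = 20, spread = 2,
                            within_slope = -0.5, seed = NULL) {
  stopifnot(n_groups <= 26)  # group labels come from LETTERS
  if (!is.null(seed)) set.seed(seed)

  centers <- 5 * seq_len(n_groups)          # assumed spacing of 5 units
  cx <- rep(centers, each = n_per_group)
  x <- cx + rnorm(length(cx), sd = spread)  # spread controls group width
  y <- cx + within_slope * (x - cx) + rnorm(length(cx), sd = 0.5)

  data.frame(x = x, y = y,
             z = rep(LETTERS[seq_len(n_groups)], each = n_per_group),
             stringsAsFactors = FALSE)
}

# Overall slope versus the slope within each level of z.
compare_slopes <- function(data) {
  list(
    overall = unname(coef(lm(y ~ x, data = data))["x"]),
    by_group = sapply(split(data, data$z), function(d)
      unname(coef(lm(y ~ x, data = d))["x"]))
  )
}

sim <- simulate_groups(n_groups = 6, n_per_group = 20, spread = 2, seed = 1)
slopes <- compare_slopes(sim)
slopes$overall   # slope ignoring z
slopes$by_group  # one slope per level of z
```

In this sketch the reversal depends on the spread being small relative to the distance between group centers. As spread grows, within-group variation dominates and the overall slope drifts toward the within-group slope.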