This is an R Markdown document giving an example of Simpson’s Paradox. Looking first at correlation and second at the estimated slopes in regression lines.
The example considers book prices and the number of pages in books. The confounding variable to consider is type of book, either hardcover or paperback.
And some examples of using ggplot2 are included.
Here are some links to get started:
The data.
### synthetic data
# Consider book price (y) by number of pages (x)
z = c("hardcover","hardcover",
"hardcover","hardcover",
"paperback", "paperback","paperback",
"paperback")
x1 = c( 150, 225, 342, 185)
y1 = c( 27.43, 48.76, 50.25, 32.01 )
x2 = c( 475, 834, 1020, 790)
y2 = c( 10.00, 15.73, 20.00, 17.89 )
x = c(x1, x2)
y = c(y1, y2)
Plotting a scatterplot.
plot(x,y)
Compute the correlation coefficient \(r\) to see the effect of Simpson’s Paradox. The overall correlation is negative and the correlations, taking into consideration the confounding variable, type of book, the correlations change to positive, for both hardcover and paperback books.
# correlation
cor(y, x)
cor(y1, x1)
cor(y2, x2)
Now compute the slopes in the linear regressions to see the same effect.
Overall, the slope is negative.
# linear regression
lm(y ~ x)
For the hardcover books, the slope is positive.
# linear regression
lm(y1 ~ x1)
For the paperback books, the slope is positive.
# linear regression
lm(y2 ~ x2)
Now let’s visualize Simpon’s Paradox in the scatterplot. First by plotting the names of each point in the scatterplot.
plot(x,y, type="n")
text(x,y,z, cex=0.8)
Now plot the linear regression model ignoring the type of book.
# fitted linear regression
plot(x,y)
model0 <- lm(y ~ x)
abline(model0)
Now plot the linear regression models using the type of book.
# fitted linear regression
plot(x,y)
text(200,40,"hardcover")
model1 <- lm(y1 ~ x1)
abline(model1)
text(600,15,"paperback")
model2 <- lm(y2 ~ x2)
abline(model2)
Now using ggplot2. A very nice tutorial is available from Cookbook R.
library(ggplot2)
# put the data into a data.frame
# cond = type of book
# xvar = x
# yvar = y
dat <- data.frame(cond = z, xvar = x, yvar = y)
Plot the overall data.
ggplot(dat, aes(x=xvar, y=yvar)) + geom_point(shape=1) # Use hollow circles
ggplot(dat, aes(x=xvar, y=yvar)) +
geom_point(shape=1) + # Use hollow circles
geom_smooth(method=lm) # Add linear regression line
# (by default includes 95% confidence region)
ggplot(dat, aes(x=xvar, y=yvar)) +
geom_point(shape=1) + # Use hollow circles
geom_smooth(method=lm, # Add linear regression line
se=FALSE) # Don't add shaded confidence region
Now plot the regression lines for each type. See Simpson’s Paradox.
ggplot(dat, aes(x=xvar, y=yvar, color=cond)) + geom_point(shape=1)
# Same, but with different colors and add regression lines
ggplot(dat, aes(x=xvar, y=yvar, color=cond)) +
geom_point(shape=1) +
scale_colour_hue(l=50) + # Use a slightly darker palette than normal
geom_smooth(method=lm, # Add linear regression lines
se=FALSE) # Don't add shaded confidence region
Plot using different shapes.
# Set shape by cond
ggplot(dat, aes(x=xvar, y=yvar, shape=cond)) + geom_point()
# Same, but with different shapes
ggplot(dat, aes(x=xvar, y=yvar, shape=cond)) + geom_point() +
scale_shape_manual(values=c(1,2)) # Use a hollow circle and triangle
LS0tCnRpdGxlOiAiU2ltcHNvbidzIFBhcmFkb3ggLSBDb3JyZWxhdGlvbnMgYW5kIFNsb3BlcyIKYXV0aG9yOiAiUHJvZi4gRXJpYyBBLiBTdWVzcyIKZGF0ZTogIkZlYnJ1YXJ5IDI5LCAyMDIwIgpvdXRwdXQ6IGh0bWxfbm90ZWJvb2sKLS0tCgpUaGlzIGlzIGFuIFIgTWFya2Rvd24gZG9jdW1lbnQgZ2l2aW5nIGFuIGV4YW1wbGUgb2YgKipTaW1wc29uJ3MgUGFyYWRveCoqLiAgTG9va2luZyAKZmlyc3QgYXQgY29ycmVsYXRpb24gYW5kIHNlY29uZCBhdCB0aGUgZXN0aW1hdGVkIHNsb3BlcyBpbiByZWdyZXNzaW9uIGxpbmVzLgoKVGhlIGV4YW1wbGUgY29uc2lkZXJzICoqYm9vayBwcmljZXMqKiBhbmQgdGhlICoqbnVtYmVyIG9mIHBhZ2VzKiogaW4gYm9va3MuICBUaGUgKmNvbmZvdW5kaW5nIHZhcmlhYmxlKiB0byBjb25zaWRlciBpcyAqKnR5cGUgb2YgYm9vayoqLCBlaXRoZXIgaGFyZGNvdmVyIG9yIHBhcGVyYmFjay4KCkFuZCBzb21lIGV4YW1wbGVzIG9mIHVzaW5nICoqZ2dwbG90MioqIGFyZSBpbmNsdWRlZC4KCkhlcmUgYXJlIHNvbWUgbGlua3MgdG8gZ2V0IHN0YXJ0ZWQ6CgotIHdpa2lwZWRpYSBhYm91dCBbY29ycmVsYXRpb25dKGh0dHBzOi8vZW4ud2lraXBlZGlhLm9yZy93aWtpL0NvcnJlbGF0aW9uX2FuZF9kZXBlbmRlbmNlKSAKLSBhIHBvc3Qgb24gdGhlIFtBbmFseXNpcyB3aXRoIFByb2dyYW1taW5nXShodHRwOi8vYWxzdGF0ci5ibG9nc3BvdC5jb20vKSB3ZWJzaXRlIGFib3V0IHRoZSBbVGhlb3J5IG9mIExpbmVhciBMZWFzdCBTcXVhcmVzXShodHRwOi8vYWxzdGF0ci5ibG9nc3BvdC5jb20vMjAxNS8xMi9yLWFuZC1weXRob24tdGhlb3J5LW9mLWxpbmVhci1sZWFzdC5odG1sI21vcmUpLCB3aGljaCBnaXZlcyBhbiBpbnRyb2R1Y3Rpb24gdG8gdGhlIGxpbmVhciByZWdyZXNzaW9uIGFuZCBob3cgaXQgaXMgZml0dGVkIHRvIGRhdGEuICAKLSB3aWtpcGVkaWEgYWJvdXQgW1NpbXBzb24ncyBQYXJhZG94XShodHRwczovL2VuLndpa2lwZWRpYS5vcmcvd2lraS9TaW1wc29uJTI3c19wYXJhZG94KQotIFtDb29rYm9vayBmb3IgUl0oaHR0cDovL3d3dy5jb29rYm9vay1yLmNvbS8pCgpUaGUgZGF0YS4KCmBgYHtyfQojIyMgc3ludGhldGljIGRhdGEKCiMgQ29uc2lkZXIgYm9vayBwcmljZSAoeSkgYnkgbnVtYmVyIG9mIHBhZ2VzICh4KQoKeiA9IGMoImhhcmRjb3ZlciIsImhhcmRjb3ZlciIsCiAgICAgICJoYXJkY292ZXIiLCJoYXJkY292ZXIiLAogICAgICAicGFwZXJiYWNrIiwgInBhcGVyYmFjayIsInBhcGVyYmFjayIsIAogICAgICAicGFwZXJiYWNrIikKCngxID0gYyggMTUwLCAyMjUsIDM0MiwgMTg1KQp5MSA9IGMoIDI3LjQzLCA0OC43NiwgNTAuMjUsIDMyLjAxICkKCngyID0gYyggNDc1LCA4MzQsIDEwMjAsIDc5MCkKeTIgPSBjKCAxMC4wMCwgMTUuNzMsIDIwLjAwLCAxNy44OSApCgp4ID0gYyh4MSwgeDIpCnkgPSBjKHkxLCB5MikKYGBgCgpQbG90dGluZyBhIHNjYXR0ZXJwbG90LgpgYGB7cn0KcGxvdCh4LHkpCmBgYAoKQ29tcHV0ZSB0aGUgY29ycmVsYXRpb24gY29lZmZpY2llbnQgJHIkIHRvIHNlZSB0aGUgZWZmZWN0IG9mIFNpbXBzb24ncyBQYXJhZG94LiAgVGhlIG92ZXJhbGwgY29ycmVsYXRpb24gaXMgKm5lZ2F0aXZlKiBhbmQgdGhlIGNvcnJlbGF0aW9ucywgdGFraW5nIGludG8gY29uc2lkZXJhdGlvbiB0aGUgY29uZm91bmRpbmcgdmFyaWFibGUsICoqdHlwZSBvZiBib29rKiosIHRoZSBjb3JyZWxhdGlvbnMgY2hhbmdlIHRvIHBvc2l0aXZlLCBmb3IgYm90aCAqKmhhcmRjb3ZlcioqIGFuZCAqKnBhcGVyYmFjayoqIGJvb2tzLgoKYGBge3J9CiMgY29ycmVsYXRpb24KCmNvcih5LCB4KQoKY29yKHkxLCB4MSkKY29yKHkyLCB4MikKYGBgCgpOb3cgY29tcHV0ZSB0aGUgc2xvcGVzIGluIHRoZSBsaW5lYXIgcmVncmVzc2lvbnMgdG8gc2VlIHRoZSBzYW1lIGVmZmVjdC4KCk92ZXJhbGwsIHRoZSBzbG9wZSBpcyBuZWdhdGl2ZS4KYGBge3J9CiMgbGluZWFyIHJlZ3Jlc3Npb24KCmxtKHkgfiB4KQpgYGAKCkZvciB0aGUgKipoYXJkY292ZXIqKiBib29rcywgdGhlIHNsb3BlIGlzIHBvc2l0aXZlLgpgYGB7cn0KIyBsaW5lYXIgcmVncmVzc2lvbgoKbG0oeTEgfiB4MSkKYGBgCgpGb3IgdGhlICoqcGFwZXJiYWNrKiogYm9va3MsIHRoZSBzbG9wZSBpcyBwb3NpdGl2ZS4KYGBge3J9CiMgbGluZWFyIHJlZ3Jlc3Npb24KCmxtKHkyIH4geDIpCmBgYAoKTm93IGxldCdzIHZpc3VhbGl6ZSBTaW1wb24ncyBQYXJhZG94IGluIHRoZSBzY2F0dGVycGxvdC4gIEZpcnN0IGJ5IHBsb3R0aW5nIHRoZSBuYW1lcyBvZiBlYWNoIHBvaW50IGluIHRoZSBzY2F0dGVycGxvdC4KCmBgYHtyfQpwbG90KHgseSwgdHlwZT0ibiIpCnRleHQoeCx5LHosIGNleD0wLjgpCmBgYAoKTm93IHBsb3QgdGhlIGxpbmVhciByZWdyZXNzaW9uIG1vZGVsIGlnbm9yaW5nIHRoZSAqKnR5cGUgb2YgYm9vayoqLgoKYGBge3J9CiMgZml0dGVkIGxpbmVhciByZWdyZXNzaW9uCgpwbG90KHgseSkKCm1vZGVsMCA8LSBsbSh5IH4geCkKYWJsaW5lKG1vZGVsMCkKYGBgCgpOb3cgcGxvdCB0aGUgbGluZWFyIHJlZ3Jlc3Npb24gbW9kZWxzIHVzaW5nIHRoZSAqKnR5cGUgb2YgYm9vayoqLgoKYGBge3J9CiMgZml0dGVkIGxpbmVhciByZWdyZXNzaW9uCgpwbG90KHgseSkKCnRleHQoMjAwLDQwLCJoYXJkY292ZXIiKQptb2RlbDEgPC0gbG0oeTEgfiB4MSkKYWJsaW5lKG1vZGVsMSkKCnRleHQoNjAwLDE1LCJwYXBlcmJhY2siKQptb2RlbDIgPC0gbG0oeTIgfiB4MikKYWJsaW5lKG1vZGVsMikKYGBgCgpOb3cgdXNpbmcgKipnZ3Bsb3QyKiouICBBIHZlcnkgbmljZSB0dXRvcmlhbCBpcyBhdmFpbGFibGUgZnJvbSBbQ29va2Jvb2sgUl0oaHR0cDovL3d3dy5jb29rYm9vay1yLmNvbS9HcmFwaHMvU2NhdHRlcnBsb3RzXyhnZ3Bsb3QyKS8pLgoKYGBge3J9CmxpYnJhcnkoZ2dwbG90MikKCiMgcHV0IHRoZSBkYXRhIGludG8gYSBkYXRhLmZyYW1lCiMgICBjb25kID0gdHlwZSBvZiBib29rCiMgICB4dmFyID0geAojICAgeXZhciA9IHkKCmRhdCA8LSBkYXRhLmZyYW1lKGNvbmQgPSB6LCB4dmFyID0geCwgeXZhciA9IHkpCmBgYAoKUGxvdCB0aGUgb3ZlcmFsbCBkYXRhLgoKYGBge3J9CmdncGxvdChkYXQsIGFlcyh4PXh2YXIsIHk9eXZhcikpICsgZ2VvbV9wb2ludChzaGFwZT0xKSAgICAgICMgVXNlIGhvbGxvdyBjaXJjbGVzCgpnZ3Bsb3QoZGF0LCBhZXMoeD14dmFyLCB5PXl2YXIpKSArCiAgICBnZW9tX3BvaW50KHNoYXBlPTEpICsgICAgIyBVc2UgaG9sbG93IGNpcmNsZXMKICAgIGdlb21fc21vb3RoKG1ldGhvZD1sbSkgICAjIEFkZCBsaW5lYXIgcmVncmVzc2lvbiBsaW5lIAogICAgICAgICAgICAgICAgICAgICAgICAgICAgICMgIChieSBkZWZhdWx0IGluY2x1ZGVzIDk1JSBjb25maWRlbmNlIHJlZ2lvbikKYGBgCmBgYHtyfQpnZ3Bsb3QoZGF0LCBhZXMoeD14dmFyLCB5PXl2YXIpKSArCiAgICBnZW9tX3BvaW50KHNoYXBlPTEpICsgICAgIyBVc2UgaG9sbG93IGNpcmNsZXMKICAgIGdlb21fc21vb3RoKG1ldGhvZD1sbSwgICAjIEFkZCBsaW5lYXIgcmVncmVzc2lvbiBsaW5lCiAgICAgICAgICAgICAgICBzZT1GQUxTRSkgICAgIyBEb24ndCBhZGQgc2hhZGVkIGNvbmZpZGVuY2UgcmVnaW9uCmBgYAoKTm93IHBsb3QgdGhlIHJlZ3Jlc3Npb24gbGluZXMgZm9yIGVhY2ggdHlwZS4gIFNlZSBTaW1wc29uJ3MgUGFyYWRveC4KYGBge3J9CgpnZ3Bsb3QoZGF0LCBhZXMoeD14dmFyLCB5PXl2YXIsIGNvbG9yPWNvbmQpKSArIGdlb21fcG9pbnQoc2hhcGU9MSkKCiMgU2FtZSwgYnV0IHdpdGggZGlmZmVyZW50IGNvbG9ycyBhbmQgYWRkIHJlZ3Jlc3Npb24gbGluZXMKZ2dwbG90KGRhdCwgYWVzKHg9eHZhciwgeT15dmFyLCBjb2xvcj1jb25kKSkgKwogICAgZ2VvbV9wb2ludChzaGFwZT0xKSArCiAgICBzY2FsZV9jb2xvdXJfaHVlKGw9NTApICsgIyBVc2UgYSBzbGlnaHRseSBkYXJrZXIgcGFsZXR0ZSB0aGFuIG5vcm1hbAogICAgZ2VvbV9zbW9vdGgobWV0aG9kPWxtLCAgICMgQWRkIGxpbmVhciByZWdyZXNzaW9uIGxpbmVzCiAgICAgICAgICAgICAgICBzZT1GQUxTRSkgICAgIyBEb24ndCBhZGQgc2hhZGVkIGNvbmZpZGVuY2UgcmVnaW9uCmBgYAoKUGxvdCB1c2luZyBkaWZmZXJlbnQgc2hhcGVzLgoKYGBge3J9CiMgU2V0IHNoYXBlIGJ5IGNvbmQKZ2dwbG90KGRhdCwgYWVzKHg9eHZhciwgeT15dmFyLCBzaGFwZT1jb25kKSkgKyBnZW9tX3BvaW50KCkKCiMgU2FtZSwgYnV0IHdpdGggZGlmZmVyZW50IHNoYXBlcwpnZ3Bsb3QoZGF0LCBhZXMoeD14dmFyLCB5PXl2YXIsIHNoYXBlPWNvbmQpKSArIGdlb21fcG9pbnQoKSArCiAgICBzY2FsZV9zaGFwZV9tYW51YWwodmFsdWVzPWMoMSwyKSkgICMgVXNlIGEgaG9sbG93IGNpcmNsZSBhbmQgdHJpYW5nbGUKYGBgCg==