Uri Simonsohn recently wrote a blog post in which he argues that quadratic regressions can’t be trusted for uncovering U-shaped or inverted U-shaped relationships. For example, the following scatterplot was generated using a log function, yet the quadratic regression fitted on it finds evidence for an inverted U-shaped relationship.
# Data generated using the example from Uri's code from
# https://osf.io/unb62/
set.seed(9)
n = 500
x = sort(runif(n))^1.5
x2 = x * x
y = log(x) + rnorm(n, sd = 0.1)

library(ggplot2)
library(dplyr)

# Carry out quadratic regression
lm1 = lm(y ~ x + x2, data = data.frame(y, x, x2))

data.frame(x, y, y_fit1 = fitted(lm1)) %>% ggplot(aes(x, y)) +
  geom_point() +
  geom_line(aes(y = y_fit1), color = "blue") +
  labs(title = "Quadratic Regression Fitted on Log Data",
       caption = "Data generated using the code from https://osf.io/unb62/") +
  theme_bw()

DT::datatable(broom::tidy(lm1) %>% mutate(across(-term, ~ round(.x, 2))))
In this case, quadratic regression suggests that there is an inverted U-shaped relationship because we find a significant positive coefficient on \(x\) and a significant negative coefficient on \(x^2\). However, the data generating process (dgp) was clearly logarithmic.
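As a quick sanity check (a sketch of my own, not part of Uri’s code): for a fitted quadratic \(b_0 + b_1 x + b_2 x^2\), the turning point sits at \(x = -b_1/(2b_2)\). If that value lands inside the observed range of x, the fit implies an internal peak even though the true dgp is monotone.

b = coef(lm1)
# Vertex of the fitted parabola; if this lies within range(x),
# the quadratic fit "sees" an inverted U
-b["x"] / (2 * b["x2"])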
Simonsohn’s solution to this problem is to fit a separate line for lower values of x and another for higher values of x. This sounds awfully similar to piecewise regression (a.k.a. segmented regression). In his paper on SSRN, Simonsohn argues that piecewise regression is inferior to the “interrupted” regression he recommends. On page 24 of the paper he compares the two and shows that piecewise regression leads to incorrect inference.
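For readers unfamiliar with the distinction, here is a minimal sketch of the two specifications when the breakpoint is treated as known (c0 = 0.5 is a hypothetical value chosen purely for illustration). As I understand it, piecewise regression forces the two lines to join at the breakpoint, while an interrupted regression fits a separate intercept and slope on each side, so the lines may jump:

c0 = 0.5  # hypothetical breakpoint, for illustration only

# Piecewise (segmented) regression: one continuous line whose slope changes at c0
pw = lm(y ~ x + pmax(x - c0, 0))

# "Interrupted" regression: separate intercept and slope on each side of c0,
# so the two fitted lines need not meet
ir = lm(y ~ x * I(x >= c0))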
I am reproducing the diagram below:
I found the figure odd when I first looked at it, because piecewise regression also estimates the cut point where the two lines join, and yet the cut point in Simonsohn’s example sits too far to the right.
I went over his R code from the link given under the figure. To my surprise, I found that he had hardcoded the cut point at x = 0.75! I think this makes the comparison unfair: by manually setting the cut point, he is not letting the method estimate the optimal cut point.
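In model terms, hardcoding the cut amounts to something like the following (a sketch of the idea, not Uri’s actual code):

# The kink is pinned at x = 0.75 instead of being estimated from the data
lm_fixed = lm(y ~ x + pmax(x - 0.75, 0))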
To present a fair comparison, I used the segmented package in R to let the algorithm choose the optimal cut point. If we still get a biased curve like the one in Simonsohn’s figure, then there is a potential problem with piecewise regression. Otherwise, I don’t see any reason to move to a different method.
library(segmented)
lm2 = lm(y ~ x, data = data.frame(x, y))
seg2 <- segmented(lm2, seg.Z = ~x)
data.frame(x, y, yfit = fitted(seg2)) %>% ggplot(aes(x, y)) +
  geom_point() +
  geom_line(aes(x, yfit), color = "red", size = 1) +
  theme_bw()

When we let the model pick the cut point, it chose a point well to the left of Simonsohn’s hardcoded one! And we clearly don’t get an inverted U-shaped relationship. I would love to see an example where a piecewise model indeed gives biased results and Simonsohn’s “Robin Hood procedure” leads to better inference.
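For completeness, the estimated breakpoint can be read straight off the fitted segmented object:

seg2$psi       # initial value, estimate, and standard error of the cut point
confint(seg2)  # confidence interval for the estimated breakpoint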