The p-curve paper uses p-values to detect signs of ‘p-hacking’ and the ‘file drawer’ problem. Helpfully, its authors provide an app where you can apply the method to a set of p-values (entered as Z scores).
Unfortunately, the approach is flawed. It discounts studies with very significant p-values just because they are mixed in with random studies, and it throws out good studies when they are mixed with p-hacked ones. The following examples illustrate both problems.
# Setup
# Helper functions
format_for_p_curve_dot_com <- function(x) cat(paste0("Z = ", paste0(x, collapse = "\nZ = "))) # print Z scores one per line, in the format the p-curve app expects
p <- function(x) abs(qnorm(x)) # convert a one-sided p-value into its Z score
# Some distributions
hacked <- p(rep(0.0249, 50)) # p is just significant (~0.0249)
random_10 <- p(seq(0.001, 0.025, 0.0025)) # p is evenly distributed over significant values (as would happen by chance)
powered <- p(rep(0.001, 50)) # p is 1 in one thousand
high_powered <- p(rep(1 / 1e6, 50)) # p is 1 in one million (an extremely strong result)
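As a quick sanity check of the p-to-Z helper (assuming the one-sided convention above), converting the Z scores back with pnorm should recover the original p-values:
# sanity check: converting the Z scores back should recover the p-values
# (expect roughly 0.0249, 0.001, and 1e-06)
pnorm(-p(c(0.0249, 0.001, 1e-6)))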
Say we have ten randomly distributed p-values, plus one that is literally one in a million. We should keep the one-in-a-million study, but p-curve does not. Try adding the following values at http://www.p-curve.com/app4/:
format_for_p_curve_dot_com(c(random_10, high_powered[1]))
## Z = 3.09023230616781
## Z = 2.69684426087813
## Z = 2.51214432793046
## Z = 2.38670773449225
## Z = 2.29036787785527
## Z = 2.21151780918668
## Z = 2.14441062091184
## Z = 2.08576406509235
## Z = 2.03352014925305
## Z = 1.98630020412943
## Z = 4.7534243088229
and you will see that p-curve discounts the one-in-a-million study along with the rest.
Now say our list of studies is half p-hacked and half very significant (one in a thousand). We should find both evidence of hacking and meaningful results. p-curve correctly reports that the studies have evidential value (in the continuous test), but it also claims the studies are inadequate and have very low power. (A rough sketch of what the continuous test computes appears after the output below.)
format_for_p_curve_dot_com(c(hacked[1:10], powered[1:10]))
## Z = 1.96167786906379
## Z = 1.96167786906379
## Z = 1.96167786906379
## Z = 1.96167786906379
## Z = 1.96167786906379
## Z = 1.96167786906379
## Z = 1.96167786906379
## Z = 1.96167786906379
## Z = 1.96167786906379
## Z = 1.96167786906379
## Z = 3.09023230616781
## Z = 3.09023230616781
## Z = 3.09023230616781
## Z = 3.09023230616781
## Z = 3.09023230616781
## Z = 3.09023230616781
## Z = 3.09023230616781
## Z = 3.09023230616781
## Z = 3.09023230616781
## Z = 3.09023230616781
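For reference, here is a minimal sketch of the continuous right-skew test, assuming the simplest form described in the p-curve paper: each significant p-value is rescaled to a pp-value p/0.05 (uniform under the null) and the pp-values are combined with Stouffer’s method. The real app does more (power estimation, the 33%-power test, the half p-curve), so treat this as a toy, not the app’s exact computation; the function name below is mine.
# Toy version of the continuous right-skew test (an assumption about the
# simplest published form, not the app's exact code)
stouffer_right_skew <- function(p_values) {
  sig <- p_values[p_values < 0.05]           # p-curve only looks at significant results
  pp  <- sig / 0.05                          # under the null, pp ~ Uniform(0, 1)
  z   <- sum(qnorm(pp)) / sqrt(length(sig))  # very negative z = right skew = evidential value
  c(z = z, p = pnorm(z))
}
# should come out strongly right-skewed, consistent with the evidential-value result noted above
stouffer_right_skew(c(rep(0.0249, 10), rep(0.001, 10)))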
The p-curve approach is good at detecting oddities in the p-value distribution, but it is not good at finding good studies mixed in with bad ones. A better approach to finding good studies is simply to demand a more stringent p-value as a reader of scientific articles. By insisting on p-values below 0.01, for instance, we throw out 80% of false-positive studies (their p-values are uniform over the significant range, so only 0.01/0.05 = 20% survive) and even more p-hacked studies. As the p-curve paper points out, this still retains about half of true positives for tests that have 50% power, and even more for higher-powered tests.
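Those numbers are easy to check under a simple one-sided z-test assumption (frac_kept is a hypothetical helper of mine, not from the p-curve paper):
# Share of true positives (p < 0.05) that also clear p < 0.01,
# for a one-sided z-test with the given power at alpha = 0.05
frac_kept <- function(power) {
  delta <- qnorm(0.95) + qnorm(power)           # effect size giving that power
  p_below_01 <- 1 - pnorm(qnorm(0.99) - delta)  # P(p < 0.01) for a true effect
  p_below_01 / power                            # P(p < 0.05) equals the power
}
frac_kept(0.5)   # about 0.5: half of true positives survive the 0.01 cutoff
frac_kept(0.8)   # about 0.7: higher power keeps even more
0.01 / 0.05      # false positives are uniform, so only 20% survive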