Setup

Three of these functions are based on papers discussed in two blogposts by Daniel Lakens, and one adapts code from a MOOC of his:

https://daniellakens.blogspot.com/2014/05/prior-probabilities-and-replicating.html

https://daniellakens.blogspot.com/2014/05/the-probability-of-p-values-as-function.html

library(pacman); p_load(pwr, ggplot2, shiny)

ZREG <- function(B1, B2, SEB1, SEB2) { # z-test for the difference between two independent regression coefficients
  Z = (B1-B2)/sqrt((SEB1)^2 + (SEB2)^2)
  return(Z)}
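
For reference, ZREG computes the usual large-sample z-statistic for the difference between two independently estimated regression coefficients,

\[
z = \frac{b_1 - b_2}{\sqrt{SE_{b_1}^2 + SE_{b_2}^2}},
\]

which is then referred to the standard normal distribution.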

NomoP <- function(p, q, tails = 2, rnd = 3){ # Minimum posterior probability of the null for a p-value p and a prior probability q that the null is true
  if(p < (1 / exp(1))){BF = -exp(1) * p * log(p)} else {BF = 1} # Sellke-Bayarri-Berger lower bound on the Bayes factor in favor of the null
  MPP = (1 + ((BF * q)/(1 - q))^-1)^-1 # posterior probability of the null implied by that bound
  cat(paste0("The minimum posterior probability given a p-value of ", p, " and a prior probability of ", q, " is ", round(MPP, rnd), " (BF = ", round(BF, rnd), "). \n"))}

#NomoP(p = .04, q = .5)   #Test 1
#NomoP(p = .009, q = .75) #Test 2
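
For reference, NomoP implements the calibration from Sellke, Bayarri and Berger (2001): for \(p < 1/e\), the Bayes factor in favor of the null is bounded below by \(-e\,p\ln p\), and plugging that bound into Bayes’ rule with a prior probability \(q\) that the null is true gives the minimum posterior probability of the null,

\[
\Pr(H_0 \mid p) \geq \left(1 + \left(\frac{(-e\,p\ln p)\,q}{1 - q}\right)^{-1}\right)^{-1}.
\]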

PPlot <- function(b = .8, sims = 10000, bm = 0, M = .5, SD = 1, a = .05, bars = 100, zoom = F, rnd = 0, seed = 1){
  set.seed(seed)
  require(pwr); require(ggplot2)
  n <- round(pwr.t.test(d = (M - bm)/SD, sig.level = a, power = b, type = "one.sample", alternative = "two.sided")$n, 0)
  p <- numeric(sims)
  for(i in 1:sims){
    x <- rnorm(n = n, mean = M, sd = SD)
    z <- t.test(x, mu = bm, conf.level = 1 - a)
    p[i] <- z$p.value}
  x <- sum(p < a)/sims # empirical power: proportion of simulated p-values below alpha
  power = pwr.t.test(d = (M - bm)/SD, n = n, sig.level = a, type = "one.sample", alternative = "two.sided")$power
  x_max <- 1; y_max <- sims
  if(zoom == T){x_max <- .05} # zoom in on p-values below .05
  graph <- ggplot(as.data.frame(p), aes(p)) + 
    geom_histogram(binwidth = 1/bars, fill = "steelblue", col = "black", alpha = .5, boundary = 0) +
    xlab("p-Values") + ylab("Frequencies of p-Values") + 
    coord_cartesian(xlim = c(0, x_max), ylim = c(0, y_max)) + theme_minimal(base_size = 18) + 
    ggtitle(paste0("p-Value Distribution for ", sims, " one-sample two-sided t-tests with ", round(power * 100, rnd), "% power.")) +
    theme(plot.title = element_text(hjust = .5), legend.title = element_blank(), legend.position = "top")
  cat(paste0("The empirical power was ", toString(100 * x), "%, and the theoretical power was ", round(b * 100, rnd), "%. \n")); graph}

#PPlot(b = .8)           # Test 1
#PPlot(b = .8, zoom = T) # Test 2

SBBCal <- function(p = .05, rnd = 3){ # Sellke-Bayarri-Berger calibrations; not used in the analyses below
  Frequentist = (1 + (-exp(1) * p * log(p))^-1)^-1
  Bayesian = -exp(1) * p * log(p)
  cat(paste0("The frequentist calibration of a p-value of ", p, " is ", round(Frequentist, rnd), " and the Bayesian calibration is ", round(Bayesian, rnd), "\n"))}

#SBBCal(c(.2, .1, .05, .01, .005, .001)) #Test values

Rationale

It’s not fun when someone’s job market paper makes the rounds, looks neat, and turns out to be nonsense. Unfortunately, that is the case with a recent paper arguing that anti-discrimination laws reduce the number of mental health days taken by sexual minorities, and in particular by sexual minority men (Mann, 2022). There are a few problems with this paper.

Firstly, the sample sizes are misrepresented: the standard errors are clustered, so the effective sample size is far smaller than the nominal number of observations (see Senn, 2019).

Secondly, many findings are marginal, falling above the typical \(\alpha = .05\) threshold used for statistical significance. Low power cannot explain this away: if a study is genuinely well powered and the null hypothesis is false, very few results should land in this borderline range (see Andre, 2022).

Third and finally, subgroup differences (interactions) are described and discussed, but they are never formally tested; when tested, none of them is statistically significant (see Gelman & Stern, 2006).

Analyses

Power

Should we expect an honest, well-powered analysis to produce p-values just under .05? No, and we can quantify the likelihood of getting a certain number of results in that range for a given level of power. Say we have 50%, 60%, 70%, 80%, or 90% power.
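
As an analytic complement to the simulations below, the probability that a single test lands in the marginal window between .01 and .05 when the alternative is true is simply its power at \(\alpha = .05\) minus its power at \(\alpha = .01\). The helper below (marginal_prob, added here for illustration and not part of the original setup) computes this for the same one-sample design that PPlot simulates:

marginal_prob <- function(b, M = .5, SD = 1, bm = 0){ # P(.01 < p < .05 | H1) for a design with nominal power b
  n <- round(pwr.t.test(d = (M - bm)/SD, sig.level = .05, power = b, type = "one.sample", alternative = "two.sided")$n, 0)
  p05 <- pwr.t.test(d = (M - bm)/SD, n = n, sig.level = .05, type = "one.sample", alternative = "two.sided")$power
  p01 <- pwr.t.test(d = (M - bm)/SD, n = n, sig.level = .01, type = "one.sample", alternative = "two.sided")$power
  p05 - p01} # power at .05 minus power at .01
#sapply(c(.5, .6, .7, .8, .9), marginal_prob) #Test values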

PPlot(b = .5, zoom = T)
## The empirical power was 48.08%, and the theoretical power was 50%.

PPlot(b = .6, zoom = T)
## The empirical power was 60.71%, and the theoretical power was 60%.

PPlot(b = .7, zoom = T)
## The empirical power was 70.62%, and the theoretical power was 70%.

PPlot(b = .8, zoom = T)
## The empirical power was 79.54%, and the theoretical power was 80%.

PPlot(b = .9, zoom = T)
## The empirical power was 90.25%, and the theoretical power was 90%.

With 50% power, the majority of significant results are not just below .05 but below .01, and this becomes even more pronounced as power increases. With a decent level of power, few significant results will fall in the interval between .05 and .01. If we also place a prior probability on the null hypothesis being true, we can assess the evidential value of a given p-value. Let’s check all the p-values between .05 and .01 against prior probabilities for the null of 50%, 60%, 70%, 80%, and 90%.
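
As an optional shortcut, the same grid of minimum posterior probabilities can be tabulated at once; MPP below is simply a vectorized one-liner re-deriving the quantity NomoP prints, added here for convenience rather than being part of the original setup:

#MPP <- function(p, q) (1 + ((-exp(1) * p * log(p)) * q/(1 - q))^-1)^-1 #Vectorized core of NomoP (valid for p < 1/e)
#round(outer(c(.05, .04, .03, .02, .01), c(.5, .6, .7, .8, .9), MPP), 3) #Rows = p-values, columns = priors on the null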

cat("50% Prior \n")
## 50% Prior
NomoP(.05, .5); NomoP(.04, .5); NomoP(.03, .5); NomoP(.02, .5); NomoP(.01, .5)
## The minimum posterior probability given a p-value of 0.05 and a prior probability of 0.5 is 0.289 (BF = 0.407).
## The minimum posterior probability given a p-value of 0.04 and a prior probability of 0.5 is 0.259 (BF = 0.35).
## The minimum posterior probability given a p-value of 0.03 and a prior probability of 0.5 is 0.222 (BF = 0.286).
## The minimum posterior probability given a p-value of 0.02 and a prior probability of 0.5 is 0.175 (BF = 0.213).
## The minimum posterior probability given a p-value of 0.01 and a prior probability of 0.5 is 0.111 (BF = 0.125).
cat("60% Prior \n")
## 60% Prior
NomoP(.05, .6); NomoP(.04, .6); NomoP(.03, .6); NomoP(.02, .6); NomoP(.01, .6)
## The minimum posterior probability given a p-value of 0.05 and a prior probability of 0.6 is 0.379 (BF = 0.407).
## The minimum posterior probability given a p-value of 0.04 and a prior probability of 0.6 is 0.344 (BF = 0.35).
## The minimum posterior probability given a p-value of 0.03 and a prior probability of 0.6 is 0.3 (BF = 0.286).
## The minimum posterior probability given a p-value of 0.02 and a prior probability of 0.6 is 0.242 (BF = 0.213).
## The minimum posterior probability given a p-value of 0.01 and a prior probability of 0.6 is 0.158 (BF = 0.125).
cat("70% Prior \n")
## 70% Prior
NomoP(.05, .7); NomoP(.04, .7); NomoP(.03, .7); NomoP(.02, .7); NomoP(.01, .7)
## The minimum posterior probability given a p-value of 0.05 and a prior probability of 0.7 is 0.487 (BF = 0.407).
## The minimum posterior probability given a p-value of 0.04 and a prior probability of 0.7 is 0.45 (BF = 0.35).
## The minimum posterior probability given a p-value of 0.03 and a prior probability of 0.7 is 0.4 (BF = 0.286).
## The minimum posterior probability given a p-value of 0.02 and a prior probability of 0.7 is 0.332 (BF = 0.213).
## The minimum posterior probability given a p-value of 0.01 and a prior probability of 0.7 is 0.226 (BF = 0.125).
cat("80% Prior \n")
## 80% Prior
NomoP(.05, .8); NomoP(.04, .8); NomoP(.03, .8); NomoP(.02, .8); NomoP(.01, .8)
## The minimum posterior probability given a p-value of 0.05 and a prior probability of 0.8 is 0.62 (BF = 0.407).
## The minimum posterior probability given a p-value of 0.04 and a prior probability of 0.8 is 0.583 (BF = 0.35).
## The minimum posterior probability given a p-value of 0.03 and a prior probability of 0.8 is 0.534 (BF = 0.286).
## The minimum posterior probability given a p-value of 0.02 and a prior probability of 0.8 is 0.46 (BF = 0.213).
## The minimum posterior probability given a p-value of 0.01 and a prior probability of 0.8 is 0.334 (BF = 0.125).
cat("90% Prior \n")
## 90% Prior
NomoP(.05, .9); NomoP(.04, .9); NomoP(.03, .9); NomoP(.02, .9); NomoP(.01, .9)
## The minimum posterior probability given a p-value of 0.05 and a prior probability of 0.9 is 0.786 (BF = 0.407).
## The minimum posterior probability given a p-value of 0.04 and a prior probability of 0.9 is 0.759 (BF = 0.35).
## The minimum posterior probability given a p-value of 0.03 and a prior probability of 0.9 is 0.72 (BF = 0.286).
## The minimum posterior probability given a p-value of 0.02 and a prior probability of 0.9 is 0.657 (BF = 0.213).
## The minimum posterior probability given a p-value of 0.01 and a prior probability of 0.9 is 0.53 (BF = 0.125).

Under this calibration, a p-value of .04 corresponds to a Bayes factor of at most roughly 2.9 against the null, so it can barely move a prior: starting from a 90% prior probability that the null is true, the minimum posterior probability of the null is still about 76%, and even at 50/50 prior odds it remains at least 26%. This is similar to the observation that, with a sample size of a million, a p-value of .04 is much stronger evidence for the null than against it, thanks to Lindley’s paradox (Wagenmakers & Ly, 2022; see also Sellke, Bayarri & Berger, 2001; Robert, 2014). A simple demonstration appeared in the plots above, where it was noted that, among significant results, a p-value just under the selected \(\alpha\) is improbable.
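
To put a rough number on that large-sample point, one can use the BIC approximation to the Bayes factor, \(BF_{01} \approx \sqrt{n}\,e^{-z^2/2}\); this approximation is assumed here purely for illustration and is not the calibration NomoP uses:

#z <- qnorm(1 - .04/2)   #z-score corresponding to a two-sided p of .04
#sqrt(1e6) * exp(-z^2/2) #Approximate BF in favor of the null with n = 1,000,000; roughly 120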

To these two observations we can add a third: a run of multiple barely-significant p-values in a supposedly well-powered study can be treated like a series of rigged coin flips, where each flip’s probability comes from its posterior probability under some prior or from the a priori power of the test (see the sketch below). If we treat the prior as 50/50 for simplicity, the arithmetic is trivial: a cluster of p-values just under .05 means either that the data, on balance, favor the null or that the author p-hacked. Either way, p-values sitting right at the edge of \(\alpha\) are not trustworthy.
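
A minimal sketch of that rigged-coin-flips logic, assuming some per-test probability q of a marginal p-value when the effect is real (for instance, a value from marginal_prob above); MarginalRun and the example numbers are illustrative and not taken from the paper:

MarginalRun <- function(k, m, q){ # P(k or more of m independent tests yield a marginal p-value)
  pbinom(k - 1, size = m, prob = q, lower.tail = FALSE)}
#MarginalRun(k = 3, m = 6, q = .25) #Test: three or more marginal results out of six tests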

What about this study?

The p-values reported by Mann (2022) are dubious. Tables 4 through 8 are the relevant ones. Recomputed from the reported coefficients and standard errors, they include p-values of

cat("Table 1 - Row-wise \n")
## Table 1 - Row-wise
cat("Row 1 \n")
## Row 1
pnorm(-.440/.165)*2 #Reported as p between .05 and .01; x2 because two-sided
## [1] 0.007660761
pnorm(-.376/.167)*2 #Reported as p between .05 and .01
## [1] 0.02435408
pnorm(-.474/.210)*2 #Reported as p between .05 and .01
## [1] 0.02399915
pnorm(-.459/.164)*2 #Reported as p below .01
## [1] 0.0051296
cat("Row 2 \n")
## Row 2
pnorm(-.123/.085)*2 #Reported as p above .1
## [1] 0.1478804
pnorm(-.098/.084)*2 #Reported as p above .1
## [1] 0.243345
pnorm(-.046/.064)*2 #Reported as p above .1
## [1] 0.472295
pnorm(-.042/.063)*2 #Reported as p above .1
## [1] 0.5049851
cat("Row 3 \n")
## Row 3
pnorm(-.352/.124)*2 #Reported as p below .01
## [1] 0.004529635
pnorm(-.267/.118)*2 #Reported as p between .05 and .01
## [1] 0.02365346
pnorm(-.154/.152)*2 #Reported as p above .1
## [1] 0.3109847
pnorm(-.081/.262)*2 #Reported as p above .1
## [1] 0.7571996
cat("Row 4 \n")
## Row 4
pnorm(-.085/.113)*2 #Reported as p above .1
## [1] 0.4519233
pnorm(-.057/.112)*2 #Reported as p above .1
## [1] 0.6108023
pnorm(.115/.081, lower.tail = F)*2 #Reported as p above .1
## [1] 0.1556796
pnorm(-.045/.094)*2 #Reported as p above .1
## [1] 0.6321354

Table 4, row 1, cell 1 is misreported, and unfavorably to Mann (2022). Why this happened, we do not know; it may reflect a reporting error that, once corrected, would yield the p-value as reported, or the reported p-value itself may need to change. Three of the six significant results are in the dubious range. The important thing here, however, is not the p-values of the individual tests but those of the interactions. Mann has taken a significant result in one subgroup alongside a non-significant result in another to mean that the subgroups differ. Those differences, however, are not themselves statistically significant. Contrasting men, women, LGBT, and non-LGBT individuals should have been done only after significant interactions across these categories were observed. For the sake of simplicity, let’s start with the most extreme contrast: LGBT men versus heterosexual women.

ZREG(-.440, -.085, .165, .113) #Row 1, cell 1 vs row 4, cell 1
## [1] -1.775133
ZREG(-.374, -.057, .167, .112) #Row 1, cell 2 vs row 4, cell 2
## [1] -1.57649
ZREG(-.474, .115, .210, .081)  #Row 1, cell 3 vs row 4, cell 3
## [1] -2.616848
ZREG(-.459, -.045, .164, .094) #Row 1, cell 4 vs row 4, cell 4
## [1] -2.190139
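
To compare these contrasts with conventional significance thresholds, the z-values printed above can be converted to two-sided p-values; the figures in the comment are approximate and derived solely from those z-values:

#2 * pnorm(-abs(c(-1.775, -1.576, -2.617, -2.190))) #Approximately .076, .115, .009, .029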

There are some differences. But what about the intrasex contrasts? In order, for LGBT vs heterosexual men and then LGBT vs heterosexual women:

cat("Males \n")
## Males
ZREG(-.440, -.123, .165, .085)
## [1] -1.707909
ZREG(-.374, -.098, .167, .084)
## [1] -1.476443
ZREG(-.474, -.046, .210, .064)
## [1] -1.949567
ZREG(-.459, -.042, .164, .063)
## [1] -2.373575
cat("Females \n")
## Females
ZREG(-.352, -.085, .124, .113)
## [1] -1.591515
ZREG(-.267, -.057, .118, .112)
## [1] -1.290799
ZREG(-.154, -.115, .152, .081)
## [1] -0.2264344
ZREG(-.081, -.045, .262, .094)
## [1] -0.1293325

For the contrast actually discussed in the paper (i.e., sexual minority men versus sexual minority women), the z-values for the same cell-wise comparisons are

ZREG(-.440, -.352, .165, .124)
## [1] -0.4263564
ZREG(-.374, -.267, .167, .118)
## [1] -0.5232727
ZREG(-.474, -.154, .210, .152)
## [1] -1.23439
ZREG(-.459, -.081, .164, .262)
## [1] -1.222923

And nothing: the differences between LGBT men and LGBT women were not significant. The p-values in Table 5 were

cat("1 - 2
3 - 4 \n")
## 1 - 2
## 3 - 4
pnorm(-.098/.055)*2 #Reported as p between .10 and .05
## [1] 0.07477889
pnorm(-.108/.059)*2 #Reported as p between .10 and .05
## [1] 0.06717394
pnorm(-.026/.030)*2 #Reported as p above .05
## [1] 0.3861247
pnorm(-.051/.037)*2 #Reported as p above .05
## [1] 0.1680865
cat("Sex interactions, 1 - 3 vs 2 - 4 \n")
## Sex interactions, 1 - 3 vs 2 - 4
ZREG(-.098, -.026, .055, .030)
## [1] -1.149245
ZREG(-.108, -.051, .059, .037)
## [1] -0.8184723

Non-significant p-values and no significant interactions. Next is Table 6, with outcomes ordered from “Any” to “30 days”, men first and then women, followed by their interaction tests:

cat("Males \n")
## Males
pnorm(-.01/.009)*2  #Reported as p above .05
## [1] 0.2665205
pnorm(-.024/.008)*2 #Reported as p below .01
## [1] 0.002699796
pnorm(-.023/.007)*2 #Reported as p below .01
## [1] 0.001017241
pnorm(-.016/.006)*2 #Reported as p between .05 and .01
## [1] 0.007660761
pnorm(-.008/.006)*2 #Reported as p above .1
## [1] 0.1824224
pnorm(-.007/.006)*2 #Reported as p above .1
## [1] 0.243345
cat("Females \n")
## Females
pnorm(.018/.016, lower.tail = F)*2 #Reported as p above .1
## [1] 0.260589
pnorm(-.006/.013)*2 #Reported as p above .1
## [1] 0.6444123
pnorm(-.004/.012)*2 #Reported as p above .1
## [1] 0.7388827
pnorm(-.004/.007)*2 #Reported as p above .1 x3
## [1] 0.5677092
cat("Sex interactions - significant results only \n")
## Sex interactions - significant results only
ZREG(-.024, -.006, .008, .013)
## [1] -1.179219
ZREG(-.023, -.004, .007, .012)
## [1] -1.36765
ZREG(-.016, -.004, .006, .007)
## [1] -1.301583

In addition to showing no significant interactions, this table is even less impressive because it clearly needs a correction for multiple comparisons (indeed, all of the tables need to be corrected for multiple comparisons together), and none has been applied. With a Bonferroni correction, one p-value becomes borderline, another becomes non-significant, and one remains fine.

pnorm(-.024/.008)*24 #12 comparisons, two-sided = 24. I know this is supposed to be applied to the alpha threshold.
## [1] 0.03239755
pnorm(-.023/.007)*24
## [1] 0.0122069
pnorm(-.016/.006)*24
## [1] 0.09192913
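
Equivalently, base R’s p.adjust applies the same Bonferroni adjustment to the two-sided p-values directly (here with n = 12 comparisons), avoiding the hand multiplication and capping adjusted values at 1; it should reproduce the three values above:

#p.adjust(2 * pnorm(c(-.024/.008, -.023/.007, -.016/.006)), method = "bonferroni", n = 12)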

Now Table 7, p-values then interactions:

pnorm(-.313/.158)*2
## [1] 0.04758985
pnorm(-.148/.091)*2
## [1] 0.1038702
ZREG(-.313, -.148, .158, .091)
## [1] -0.9049421

And finally, Table 8, significant results only, ignoring the wild bootstrap estimates, since those clearly do not differ by sex:

pnorm(-.474/.076)*2 #Reported as p between .1 and .05
## [1] 4.464922e-10
pnorm(-.358/.116)*2 #Reported as p between .05 and .01
## [1] 0.002027277
pnorm(-.493/.163)*2 #Reported as p below .01
## [1] 0.002490116
pnorm(-.544/.160)*2 #Reported as p below .01
## [1] 0.0006738585
ZREG(-.474, -.154, .076, .466)
## [1] -0.677741
ZREG(-.358, -.179, .166, .230)
## [1] -0.6310646
ZREG(-.493, -.063, .163, .262)
## [1] -1.393543
ZREG(-.544, -.207, .160, .264)
## [1] -1.091673

No significant interactions.

Discussion

The conclusions of Mann (2022) were drawn too hastily. The results are not robust, and there is no reason to think the interactions discussed in the paper are present at all. If the improperly presented sample sizes, which sometimes exceeded a million, were taken seriously, the paper would have no standing whatsoever, because p-values near the border of significance in huge samples favor the null over the alternative hypothesis. And given that the reporting errors were often too large to be simple rounding errors, something is clearly amiss beyond the methodological failure to test interactions before drawing inferences about them.

References

Mann, S. (2022). Anti-Discrimination Laws and Mental Health: Evidence from Sexual Minorities [Job market paper]. Vanderbilt University. https://drive.google.com/file/d/1FFmZPpORqNu5vMC0bpQJghmxHuQRyyXF/view (archived at https://web.archive.org/web/20221025222929/https://drive.google.com/file/d/1FFmZPpORqNu5vMC0bpQJghmxHuQRyyXF/view)

Senn, S. (2019, March 10). To infinity and beyond: how big are your data, really?. Error Statistics Philosophy. https://errorstatistics.com/2019/03/09/s-senn-to-infinity-and-beyond-how-big-are-your-data-really-guest-post/

Andre, Q. (2022, August 18). Large P-Values Cannot be Explained by Power Analysis. Quentin Andre. https://quentinandre.net/post/large-pvalues-and-power-analysis/

Gelman, A., & Stern, H. (2006). The Difference Between “Significant” and “Not Significant” is not Itself Statistically Significant. The American Statistician, 60(4), 328–331. https://doi.org/10.1198/000313006X152649

Wagenmakers, E.-J., & Ly, A. (2022). History and nature of the Jeffreys–Lindley paradox. Archive for History of Exact Sciences. https://doi.org/10.1007/s00407-022-00298-3

Sellke, T., Bayarri, M. J., & Berger, J. O. (2001). Calibration of p Values for Testing Precise Null Hypotheses. The American Statistician, 55(1), 62–71. https://doi.org/10.1198/000313001300339950

Robert, C. P. (2014). On the Jeffreys-Lindley Paradox. Philosophy of Science, 81(2), 216–232. https://doi.org/10.1086/675729