Setup

Three of these functions are based on papers discussed in two blogposts by Daniel Lakens, and one adapts code from a MOOC of his:

https://daniellakens.blogspot.com/2014/05/prior-probabilities-and-replicating.html

https://daniellakens.blogspot.com/2014/05/the-probability-of-p-values-as-function.html

library(pacman); p_load(pwr, ggplot2, shiny)

ZREG <- function(B1, B2, SEB1, SEB2) { # z-test for the difference between two independent regression coefficients
  Z = (B1-B2)/sqrt((SEB1)^2 + (SEB2)^2)
  return(Z)}
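
For reference, ZREG computes the usual large-sample z-statistic for the difference between two independently estimated regression coefficients,

\[
z = \frac{b_1 - b_2}{\sqrt{SE_{b_1}^2 + SE_{b_2}^2}},
\]

which is then referred to the standard normal distribution.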

NomoP <- function(p, q, tails = 2, rnd = 3){ # Minimum posterior probability of the null for a p-value p and a prior probability q that the null is true
  if(p < (1 / exp(1))){BF = -exp(1) * p * log(p)} else {BF = 1} # Sellke-Bayarri-Berger lower bound on the Bayes factor in favor of the null
  MPP = (1 + ((BF * q)/(1 - q))^-1)^-1 # posterior probability of the null implied by that bound
  cat(paste0("The minimum posterior probability given a p-value of ", p, " and a prior probability of ", q, " is ", round(MPP, rnd), " (BF = ", round(BF, rnd), "). \n"))}

#NomoP(p = .04, q = .5)   #Test 1
#NomoP(p = .009, q = .75) #Test 2
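
For reference, NomoP implements the calibration from Sellke, Bayarri and Berger (2001): for \(p < 1/e\), the Bayes factor in favor of the null is bounded below by \(-e\,p\ln p\), and plugging that bound into Bayes’ rule with a prior probability \(q\) that the null is true gives the minimum posterior probability of the null,

\[
\Pr(H_0 \mid p) \geq \left(1 + \left(\frac{(-e\,p\ln p)\,q}{1 - q}\right)^{-1}\right)^{-1}.
\]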

PPlot <- function(b = .8, sims = 10000, bm = 0, M = .5, SD = 1, a = .05, bars = 100, zoom = F, rnd = 0, seed = 1){
  set.seed(seed)
  require(pwr); require(ggplot2)
  n <- round(pwr.t.test(d = (M - bm)/SD, sig.level = a, power = b, type = "one.sample", alternative = "two.sided")$n, 0)
  p <- numeric(sims)
  for(i in 1:sims){
    x <- rnorm(n = n, mean = M, sd = SD)
    z <- t.test(x, mu = bm, conf.level = 1 - a)
    p[i] <- z$p.value}
  x <- sum(p < a)/sims # empirical power: proportion of simulated p-values below alpha
  power = pwr.t.test(d = (M - bm)/SD, n = n, sig.level = a, type = "one.sample", alternative = "two.sided")$power
  x_max <- 1; y_max <- sims
  if(zoom == T){x_max <- .05} # zoom in on p-values below .05
  graph <- ggplot(as.data.frame(p), aes(p)) + 
    geom_histogram(binwidth = 1/bars, fill = "steelblue", col = "black", alpha = .5, boundary = 0) +
    xlab("p-Values") + ylab("Frequencies of p-Values") + 
    coord_cartesian(xlim = c(0, x_max), ylim = c(0, y_max)) + theme_minimal(base_size = 18) + 
    ggtitle(paste0("p-Value Distribution for ", sims, " one-sample two-sided t-tests with ", round(power * 100, rnd), "% power.")) +
    theme(plot.title = element_text(hjust = .5), legend.title = element_blank(), legend.position = "top")
  cat(paste0("The empirical power was ", toString(100 * x), "%, and the theoretical power was ", round(b * 100, rnd), "%. \n")); graph}

#PPlot(b = .8)           # Test 1
#PPlot(b = .8, zoom = T) # Test 2

SBBCal <- function(p = .05, rnd = 3){ # Sellke-Bayarri-Berger calibrations; not used in the analyses below
  Frequentist = (1 + (-exp(1) * p * log(p))^-1)^-1
  Bayesian = -exp(1) * p * log(p)
  cat(paste0("The frequentist calibration of a p-value of ", p, " is ", round(Frequentist, rnd), " and the Bayesian calibration is ", round(Bayesian, rnd), "\n"))}

#SBBCal(c(.2, .1, .05, .01, .005, .001)) #Test values

Rationale

It’s not fun when someone’s job market paper makes the rounds, looks neat, and turns out to be nonsense. Unfortunately, that is the case with a recent paper arguing that anti-discrimination laws reduce the number of mental health days taken by sexual minorities, and in particular by sexual minority men (Mann, 2022). There are a few problems with this paper.

Firstly, the sample sizes are misrepresented: the standard errors are clustered, so the effective sample size is far smaller than the nominal number of observations (see Senn, 2019).

Secondly, many findings are marginal, falling above the typical \(\alpha = .05\) threshold used for statistical significance. Low power cannot explain this away: if a study is genuinely well powered and the null hypothesis is false, very few results should land in this borderline range (see Andre, 2022).

Third and finally, subgroup differences (interactions) are described and discussed, but they are never formally tested; when tested, none of them is statistically significant (see Gelman & Stern, 2006).

Analyses

Power

Should we expect an honest, well-powered analysis to produce p-values just under .05? No, and we can quantify the likelihood of getting a certain number of results in that range for a given level of power. Say we have 50%, 60%, 70%, 80%, or 90% power.
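
As an analytic complement to the simulations below, the probability that a single test lands in the marginal window between .01 and .05 when the alternative is true is simply its power at \(\alpha = .05\) minus its power at \(\alpha = .01\). The helper below (marginal_prob, added here for illustration and not part of the original setup) computes this for the same one-sample design that PPlot simulates:

marginal_prob <- function(b, M = .5, SD = 1, bm = 0){ # P(.01 < p < .05 | H1) for a design with nominal power b
  n <- round(pwr.t.test(d = (M - bm)/SD, sig.level = .05, power = b, type = "one.sample", alternative = "two.sided")$n, 0)
  p05 <- pwr.t.test(d = (M - bm)/SD, n = n, sig.level = .05, type = "one.sample", alternative = "two.sided")$power
  p01 <- pwr.t.test(d = (M - bm)/SD, n = n, sig.level = .01, type = "one.sample", alternative = "two.sided")$power
  p05 - p01} # power at .05 minus power at .01
#sapply(c(.5, .6, .7, .8, .9), marginal_prob) #Test values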

PPlot(b = .5, zoom = T)
## The empirical power was 48.08%, and the theoretical power was 50%.

PPlot(b = .6, zoom = T)
## The empirical power was 60.71%, and the theoretical power was 60%.

PPlot(b = .7, zoom = T)
## The empirical power was 70.62%, and the theoretical power was 70%.

PPlot(b = .8, zoom = T)
## The empirical power was 79.54%, and the theoretical power was 80%.

PPlot(b = .9, zoom = T)
## The empirical power was 90.25%, and the theoretical power was 90%.

With 50% power, the majority of significant results are not just below .05 but below .01, and this becomes even more pronounced as power increases. With a decent level of power, few significant results will fall in the interval between .05 and .01. If we also place a prior probability on the null hypothesis being true, we can assess the evidential value of a given p-value. Let’s check all the p-values between .05 and .01 against prior probabilities for the null of 50%, 60%, 70%, 80%, and 90%.
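
As an optional shortcut, the same grid of minimum posterior probabilities can be tabulated at once; MPP below is simply a vectorized one-liner re-deriving the quantity NomoP prints, added here for convenience rather than being part of the original setup:

#MPP <- function(p, q) (1 + ((-exp(1) * p * log(p)) * q/(1 - q))^-1)^-1 #Vectorized core of NomoP (valid for p < 1/e)
#round(outer(c(.05, .04, .03, .02, .01), c(.5, .6, .7, .8, .9), MPP), 3) #Rows = p-values, columns = priors on the null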

cat("50% Prior \n")
## 50% Prior
NomoP(.05, .5); NomoP(.04, .5); NomoP(.03, .5); NomoP(.02, .5); NomoP(.01, .5)
## The minimum posterior probability given a p-value of 0.05 and a prior probability of 0.5 is 0.289 (BF = 0.407).
## The minimum posterior probability given a p-value of 0.04 and a prior probability of 0.5 is 0.259 (BF = 0.35).
## The minimum posterior probability given a p-value of 0.03 and a prior probability of 0.5 is 0.222 (BF = 0.286).
## The minimum posterior probability given a p-value of 0.02 and a prior probability of 0.5 is 0.175 (BF = 0.213).
## The minimum posterior probability given a p-value of 0.01 and a prior probability of 0.5 is 0.111 (BF = 0.125).
cat("60% Prior \n")
## 60% Prior
NomoP(.05, .6); NomoP(.04, .6); NomoP(.03, .6); NomoP(.02, .6); NomoP(.01, .6)
## The minimum posterior probability given a p-value of 0.05 and a prior probability of 0.6 is 0.379 (BF = 0.407).
## The minimum posterior probability given a p-value of 0.04 and a prior probability of 0.6 is 0.344 (BF = 0.35).
## The minimum posterior probability given a p-value of 0.03 and a prior probability of 0.6 is 0.3 (BF = 0.286).
## The minimum posterior probability given a p-value of 0.02 and a prior probability of 0.6 is 0.242 (BF = 0.213).
## The minimum posterior probability given a p-value of 0.01 and a prior probability of 0.6 is 0.158 (BF = 0.125).
cat("70% Prior \n")
## 70% Prior
NomoP(.05, .7); NomoP(.04, .7); NomoP(.03, .7); NomoP(.02, .7); NomoP(.01, .7)
## The minimum posterior probability given a p-value of 0.05 and a prior probability of 0.7 is 0.487 (BF = 0.407).
## The minimum posterior probability given a p-value of 0.04 and a prior probability of 0.7 is 0.45 (BF = 0.35).
## The minimum posterior probability given a p-value of 0.03 and a prior probability of 0.7 is 0.4 (BF = 0.286).
## The minimum posterior probability given a p-value of 0.02 and a prior probability of 0.7 is 0.332 (BF = 0.213).
## The minimum posterior probability given a p-value of 0.01 and a prior probability of 0.7 is 0.226 (BF = 0.125).
cat("80% Prior \n")
## 80% Prior
NomoP(.05, .8); NomoP(.04, .8); NomoP(.03, .8); NomoP(.02, .8); NomoP(.01, .8)
## The minimum posterior probability given a p-value of 0.05 and a prior probability of 0.8 is 0.62 (BF = 0.407).
## The minimum posterior probability given a p-value of 0.04 and a prior probability of 0.8 is 0.583 (BF = 0.35).
## The minimum posterior probability given a p-value of 0.03 and a prior probability of 0.8 is 0.534 (BF = 0.286).
## The minimum posterior probability given a p-value of 0.02 and a prior probability of 0.8 is 0.46 (BF = 0.213).
## The minimum posterior probability given a p-value of 0.01 and a prior probability of 0.8 is 0.334 (BF = 0.125).
cat("90% Prior \n")
## 90% Prior
NomoP(.05, .9); NomoP(.04, .9); NomoP(.03, .9); NomoP(.02, .9); NomoP(.01, .9)
## The minimum posterior probability given a p-value of 0.05 and a prior probability of 0.9 is 0.786 (BF = 0.407).
## The minimum posterior probability given a p-value of 0.04 and a prior probability of 0.9 is 0.759 (BF = 0.35).
## The minimum posterior probability given a p-value of 0.03 and a prior probability of 0.9 is 0.72 (BF = 0.286).
## The minimum posterior probability given a p-value of 0.02 and a prior probability of 0.9 is 0.657 (BF = 0.213).
## The minimum posterior probability given a p-value of 0.01 and a prior probability of 0.9 is 0.53 (BF = 0.125).

Under this calibration, a p-value of .04 corresponds to a Bayes factor of at most roughly 2.9 against the null, so it can barely move a prior: starting from a 90% prior probability that the null is true, the minimum posterior probability of the null is still about 76%, and even at 50/50 prior odds it remains at least 26%. This is similar to the observation that, with a sample size of a million, a p-value of .04 is much stronger evidence for the null than against it, thanks to Lindley’s paradox (Wagenmakers & Ly, 2022; see also Sellke, Bayarri & Berger, 2001; Robert, 2014). A simple demonstration appeared in the plots above, where it was noted that, among significant results, a p-value just under the selected \(\alpha\) is improbable.
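
To put a rough number on that large-sample point, one can use the BIC approximation to the Bayes factor, \(BF_{01} \approx \sqrt{n}\,e^{-z^2/2}\); this approximation is assumed here purely for illustration and is not the calibration NomoP uses:

#z <- qnorm(1 - .04/2)   #z-score corresponding to a two-sided p of .04
#sqrt(1e6) * exp(-z^2/2) #Approximate BF in favor of the null with n = 1,000,000; roughly 120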

To these two observations we can add a third: a run of multiple barely-significant p-values in a supposedly well-powered study can be treated like a series of rigged coin flips, where each flip’s probability comes from its posterior probability under some prior or from the a priori power of the test (see the sketch below). If we treat the prior as 50/50 for simplicity, the arithmetic is trivial: a cluster of p-values just under .05 means either that the data, on balance, favor the null or that the author p-hacked. Either way, p-values sitting right at the edge of \(\alpha\) are not trustworthy.
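
A minimal sketch of that rigged-coin-flips logic, assuming some per-test probability q of a marginal p-value when the effect is real (for instance, a value from marginal_prob above); MarginalRun and the example numbers are illustrative and not taken from the paper:

MarginalRun <- function(k, m, q){ # P(k or more of m independent tests yield a marginal p-value)
  pbinom(k - 1, size = m, prob = q, lower.tail = FALSE)}
#MarginalRun(k = 3, m = 6, q = .25) #Test: three or more marginal results out of six tests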

What about this study?

The p-values reported by Mann (2022) are dubious. Tables 4 through 8 are the relevant ones. Recomputed from the reported coefficients and standard errors, they include p-values of

cat("Table 1 - Row-wise \n")
## Table 1 - Row-wise
cat("Row 1 \n")
## Row 1
pnorm(-.440/.165)*2 #Reported as p between .05 and .01; x2 because two-sided
## [1] 0.007660761
pnorm(-.376/.167)*2 #Reported as p between .05 and .01
## [1] 0.02435408
pnorm(-.474/.210)*2 #Reported as p between .05 and .01
## [1] 0.02399915
pnorm(-.459/.164)*2 #Reported as p below .01
## [1] 0.0051296
cat("Row 2 \n")
## Row 2
pnorm(-.123/.085)*2 #Reported as p above .1
## [1] 0.1478804
pnorm(-.098/.084)*2 #Reported as p above .1
## [1] 0.243345
pnorm(-.046/.064)*2 #Reported as p above .1
## [1] 0.472295
pnorm(-.042/.063)*2 #Reported as p above .1
## [1] 0.5049851
cat("Row 3 \n")
## Row 3
pnorm(-.352/.124)*2 #Reported as p below .01
## [1] 0.004529635
pnorm(-.267/.118)*2 #Reported as p between .05 and .01
## [1] 0.02365346
pnorm(-.154/.152)*2 #Reported as p above .1
## [1] 0.3109847
pnorm(-.081/.262)*2 #Reported as p above .1
## [1] 0.7571996
cat("Row 4 \n")
## Row 4
pnorm(-.085/.113)*2 #Reported as p above .1
## [1] 0.4519233
pnorm(-.057/.112)*2 #Reported as p above .1
## [1] 0.6108023
pnorm(.115/.081, lower.tail = F)*2 #Reported as p above .1
## [1] 0.1556796
pnorm(-.045/.094)*2 #Reported as p above .1
## [1] 0.6321354

Table 4, row 1, cell 1 is misreported, and unfavorably to Mann (2022). Why this happened, we do not know; it may reflect a reporting error that, once corrected, would yield the p-value as reported, or the reported p-value itself may need to change. Three of the six significant results are in the dubious range. The important thing here, however, is not the p-values of the individual tests but those of the interactions. Mann has taken a significant result in one subgroup alongside a non-significant result in another to mean that the subgroups differ. Those differences, however, are not themselves statistically significant. Contrasting men, women, LGBT, and non-LGBT individuals should have been done only after significant interactions across these categories were observed. For the sake of simplicity, let’s start with the most extreme contrast: LGBT men versus heterosexual women.

ZREG(-.440, -.085, .165, .113) #Row 1, cell 1 vs row 4, cell 1
## [1] -1.775133
ZREG(-.374, -.057, .167, .112) #Row 1, cell 2 vs row 4, cell 2
## [1] -1.57649
ZREG(-.474, .115, .210, .081)  #Row 1, cell 3 vs row 4, cell 3
## [1] -2.616848
ZREG(-.459, -.045, .164, .094) #Row 1, cell 4 vs row 4, cell 4
## [1] -2.190139
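
To compare these contrasts with conventional significance thresholds, the z-values printed above can be converted to two-sided p-values; the figures in the comment are approximate and derived solely from those z-values:

#2 * pnorm(-abs(c(-1.775, -1.576, -2.617, -2.190))) #Approximately .076, .115, .009, .029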

There are some differences. But what about the intrasex contrasts? In order, for LGBT vs heterosexual men and then LGBT vs heterosexual women:

cat("Males \n")
## Males
ZREG(-.440, -.123, .165, .085)
## [1] -1.707909
ZREG(-.374, -.098, .167, .084)
## [1] -1.476443
ZREG(-.474, -.046, .210, .064)
## [1] -1.949567
ZREG(-.459, -.042, .164, .063)
## [1] -2.373575
cat("Females \n")
## Females
ZREG(-.352, -.085, .124, .113)
## [1] -1.591515
ZREG(-.267, -.057, .118, .112)
## [1] -1.290799
ZREG(-.154, -.115, .152, .081)
## [1] -0.2264344
ZREG(-.081, -.045, .262, .094)
## [1] -0.1293325

For the contrast actually discussed in the paper (i.e., sexual minority men versus sexual minority women), the z-values for the same cell-wise comparisons are

ZREG(-.440, -.352, .165, .124)
## [1] -0.4263564
ZREG(-.374, -.267, .167, .118)
## [1] -0.5232727
ZREG(-.474, -.154, .210, .152)
## [1] -1.23439
ZREG(-.459, -.081, .164, .262)
## [1] -1.222923

And nothing: the differences between LGBT men and LGBT women were not significant. The p-values in Table 5 were

cat("1 - 2
3 - 4 \n")
## 1 - 2
## 3 - 4
pnorm(-.098/.055)*2 #Reported as p between .10 and .05
## [1] 0.07477889
pnorm(-.108/.059)*2 #Reported as p between .10 and .05
## [1] 0.06717394
pnorm(-.026/.030)*2 #Reported as p above .05
## [1] 0.3861247
pnorm(-.051/.037)*2 #Reported as p above .05
## [1] 0.1680865
cat("Sex interactions, 1 - 3 vs 2 - 4 \n")
## Sex interactions, 1 - 3 vs 2 - 4
ZREG(-.098, -.026, .055, .030)
## [1] -1.149245
ZREG(-.108, -.051, .059, .037)
## [1] -0.8184723

Non-significant p-values and no significant interactions. Next is Table 6, with outcomes ordered from “Any” to “30 days”, men first and then women, followed by their interaction tests:

cat("Males \n")
## Males
pnorm(-.01/.009)*2  #Reported as p above .05
## [1] 0.2665205
pnorm(-.024/.008)*2 #Reported as p below .01
## [1] 0.002699796
pnorm(-.023/.007)*2 #Reported as p below .01
## [1] 0.001017241
pnorm(-.016/.006)*2 #Reported as p between .05 and .01
## [1] 0.007660761
pnorm(-.008/.006)*2 #Reported as p above .1
## [1] 0.1824224
pnorm(-.007/.006)*2 #Reported as p above .1
## [1] 0.243345
cat("Females \n")
## Females
pnorm(.018/.016, lower.tail = F)*2 #Reported as p above .1
## [1] 0.260589
pnorm(-.006/.013)*2 #Reported as p above .1
## [1] 0.6444123
pnorm(-.004/.012)*2 #Reported as p above .1
## [1] 0.7388827
pnorm(-.004/.007)*2 #Reported as p above .1 x3
## [1] 0.5677092
cat("Sex interactions - significant results only \n")
## Sex interactions - significant results only
ZREG(-.024, -.006, .008, .013)
## [1] -1.179219
ZREG(-.023, -.004, .007, .012)
## [1] -1.36765
ZREG(-.016, -.004, .006, .007)
## [1] -1.301583

In addition to showing no significant interactions, this table is even less impressive because it clearly needs a correction for multiple comparisons (indeed, all of the tables need to be corrected for multiple comparisons together), and none has been applied. With a Bonferroni correction, one p-value becomes borderline, another becomes non-significant, and one remains fine.

pnorm(-.024/.008)*24 #12 comparisons, two-sided = 24. I know this is supposed to be applied to the alpha threshold.
## [1] 0.03239755
pnorm(-.023/.007)*24
## [1] 0.0122069
pnorm(-.016/.006)*24
## [1] 0.09192913
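
Equivalently, base R’s p.adjust applies the same Bonferroni adjustment to the two-sided p-values directly (here with n = 12 comparisons), avoiding the hand multiplication and capping adjusted values at 1; it should reproduce the three values above:

#p.adjust(2 * pnorm(c(-.024/.008, -.023/.007, -.016/.006)), method = "bonferroni", n = 12)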

Now Table 7, p-values then interactions:

pnorm(-.313/.158)*2
## [1] 0.04758985
pnorm(-.148/.091)*2
## [1] 0.1038702
ZREG(-.313, -.148, .158, .091)
## [1] -0.9049421

And finally, Table 8, significant results only, ignoring the wild bootstrap estimates, since those clearly do not differ by sex:

pnorm(-.474/.076)*2 #Reported as p between .1 and .05
## [1] 4.464922e-10
pnorm(-.358/.116)*2 #Reported as p between .05 and .01
## [1] 0.002027277
pnorm(-.493/.163)*2 #Reported as p below .01
## [1] 0.002490116
pnorm(-.544/.160)*2 #Reported as p below .01
## [1] 0.0006738585
ZREG(-.474, -.154, .076, .466)
## [1] -0.677741
ZREG(-.358, -.179, .166, .230)
## [1] -0.6310646
ZREG(-.493, -.063, .163, .262)
## [1] -1.393543
ZREG(-.544, -.207, .160, .264)
## [1] -1.091673

No significant interactions.

Discussion

The conclusions of Mann (2022) were drawn too hastily. The results are not robust, and there is no reason to think the interactions discussed in the paper are present at all. If the improperly presented sample sizes, which sometimes exceeded a million, were taken seriously, the paper would have no standing whatsoever, because p-values near the border of significance in huge samples favor the null over the alternative hypothesis. And given that the reporting errors were often too large to be simple rounding errors, something is clearly amiss beyond the methodological failure to test interactions before drawing inferences about them.

References

Mann, S. (2022). Anti-Discrimination Laws and Mental Health: Evidence from Sexual Minorities [Job market paper]. Vanderbilt University. https://drive.google.com/file/d/1FFmZPpORqNu5vMC0bpQJghmxHuQRyyXF/view (archived at https://web.archive.org/web/20221025222929/https://drive.google.com/file/d/1FFmZPpORqNu5vMC0bpQJghmxHuQRyyXF/view)

Senn, S. (2019, March 10). To infinity and beyond: how big are your data, really?. Error Statistics Philosophy. https://errorstatistics.com/2019/03/09/s-senn-to-infinity-and-beyond-how-big-are-your-data-really-guest-post/

Andre, Q. (2022, August 18). Large P-Values Cannot be Explained by Power Analysis. Quentin Andre. https://quentinandre.net/post/large-pvalues-and-power-analysis/

Gelman, A., & Stern, H. (2006). The Difference Between “Significant” and “Not Significant” is not Itself Statistically Significant. The American Statistician, 60(4), 328–331. https://doi.org/10.1198/000313006X152649

Wagenmakers, E.-J., & Ly, A. (2022). History and nature of the Jeffreys–Lindley paradox. Archive for History of Exact Sciences. https://doi.org/10.1007/s00407-022-00298-3

Sellke, T., Bayarri, M. J., & Berger, J. O. (2001). Calibration of p Values for Testing Precise Null Hypotheses. The American Statistician, 55(1), 62–71. https://doi.org/10.1198/000313001300339950

Robert, C. P. (2014). On the Jeffreys-Lindley Paradox. Philosophy of Science, 81(2), 216–232. https://doi.org/10.1086/675729