Setup

PDerive <- function(ES, SE, Tail = 2){
  Z = abs(ES/SE)
  P = pnorm(Z, lower.tail = F) * Tail
  return(P)}

CorrelatedBonferroni <- function(p, m, r, cutoff = .05){
  msharp = (m + 1) - (1 + (m - 1) * r)
  AdjustedP = p * msharp
  AdjustedCutoff = cutoff / msharp
  return(cat("The adjusted P-value is", AdjustedP, "and the adjusted cutoff is", AdjustedCutoff, "\n"))}

Rationale

The Bonferroni correction is the “strong” control for the familywise-error rate. It is “strong” because it controls the probability of at least one false-positive. However, it is often too strong, and it may make it hard to achieve sufficient power. Therefore, people tend to use weak controls, or controls for the false-discovery rate. Controlling the false-discovery rate involves adjusting for the expected percentage of discoveries that are false rather than the likelihood of at least one false-positive.

Often enough, outcomes are correlated. The Bonferroni correction assumes all outcomes are uncorrelated. If there are true hypotheses among a set of studies, correlated outcomes should generate closer p-values, such that if one of them is significant, the other one should be more significant as well due to the variance they share. Moreover, using the Bonferroni correction militates against our power much more than it should if outcomes are correlated. Importantly, knowing that outcomes are correlated can also guide choosing the number of tests to correct for.

Since it is unlikely to be the case that endpoints are uncorrelated, Shi, Pavey & Carter (2012) proposed a simple amendment to account for (intraclass) correlations among endpoints. Where the typical Bonferroni-corrected \(\alpha\) value is computed as \(\frac{\alpha}{m}\), where \(m\) is the number of endpoints, the correlation-adjusted Bonferroni \(\alpha\) value is equal to

\[\frac{\alpha}{(m + 1) - [1 + (g - 1) \times \text{ICC}]}\]

When tests are uncorrelated, this correction equals the Bonferroni, but when tests are correlated, it is more lenient and power is accordingly greater. Perfectly correlated tests are treated as the same test, since they effectively are, but because that is so rare, tests are still corrected for multiple comparisons, as they ought to be.

I applied this testing method to Bodenhom’s (2007) test of single parenthood and orphaning effects on childhood outcomes. Bodenhom had 47 significant estimates (i.e., p < .05), but many more if counting p < .1, in a sample with data for 6,133 White, and 4,561 Black children. In his first two tables, he conducted 27 tests for Blacks and Whites, for a total of 108 tests, and in the last table, 18 tests each, for a grand total of 144 tests. Importantly, these were obviously dependent tests. In the first two cases, they were just the same models done with different measures of household resources. In the first, the measure was home ownership and occupational status, and in the second, total household wealth. These were also correlated with the third, because that had labor force participation as its outcome instead of class attendance, as in the former two, and labor force participation is correlated with occupational status, household wealth, and home ownership. Because the data is unavailable, the correlations of the outcomes are unavailable, so the significant estimates were correlation-adjusted Bonferroni corrected for correlations of .1, .2, .3, .4, and .5, in addition to the simple Bonferroni correction, for both the number of significant tests, and the total number in the tables.

Analysis

SigEstimates <- data.frame(
  "ES" = c(-.10, #Whites, School Attendance, Single Motherhood, Resource Control is Home Ownership and Occupational Status
           -.16, -.25, -.24, -.22, -.3, -.28, -.31, #Whites, School Attendance, Orphaned
           -.08, -.08, -.07, -.04, #Blacks, School Attendance, Single Motherhood
           -.13, -.12, -.09, -.19, -.18, -.16, -.07, -.06, -.06, #Blacks, School Attendance, Orphaned
           
           -.10, -.09, #Whites, School Attendance, Single Motherhood, Resource Control is Total Household Wealth
           -.16, -.16, -.25, -.25, -.22, -.3, -.29, -.31, #Whites, School Attendance, Orphaned
           -.08, -.09, -.08, -.04, #Blacks, School Attendance, Single Motherhood
           -.13, -.13, -.09, -.19, -.18, -.16, -.07, -.06, -.06, #Blacks, School Attendance, Orphaned
           
           .11, .11, #Whites, Labor Force Participation, Single Motherhood, Marginal Probit Effects
           .17), #Blacks, Labor Force Participation, Orphaned
  "SE" = c(.044, 
           .078, .053, .054, .059, .033, .037, .031,
           .034, .034, .036, .020, 
           .035, .036, .039, .037, .038, .039, .021, .022, .019,
           
           .045, .045,
           .078, .079, .054, .054, .059, .033, .037, .030,
           .034, .033, .035, .020,
           .035, .036, .039, .037, .038, .039, .021, .021, .019,
           
           .078, .080,
           .096))

SigEstimates$Ps <- PDerive(SigEstimates$ES, SigEstimates$SE); round(SigEstimates$Ps, 5)
##  [1] 0.02304 0.04024 0.00000 0.00001 0.00019 0.00000 0.00000 0.00000 0.01863
## [10] 0.01863 0.05184 0.04550 0.00020 0.00086 0.02102 0.00000 0.00000 0.00004
## [19] 0.00086 0.00639 0.00159 0.02627 0.04550 0.04024 0.04283 0.00000 0.00000
## [28] 0.00019 0.00000 0.00000 0.00000 0.01863 0.00639 0.02227 0.04550 0.00020
## [37] 0.00030 0.02102 0.00000 0.00000 0.00004 0.00086 0.00427 0.00159 0.15846
## [46] 0.16913 0.07659
sum(SigEstimates$Ps >= 0); sum(SigEstimates$Ps > .05); sum(SigEstimates$Ps * sum(SigEstimates$Ps >= 0) > .05); sum(SigEstimates$Ps * 144 > .05)
## [1] 47
## [1] 4
## [1] 23
## [1] 26
sum(SigEstimates$Ps > .05)/sum(SigEstimates$Ps >= 0)
## [1] 0.08510638
sum(SigEstimates$Ps * sum(SigEstimates$Ps >= 0) > .05)/sum(SigEstimates$Ps >= 0)
## [1] 0.4893617
sum(sum(SigEstimates$Ps * 144 > .05)/sum(SigEstimates$Ps >= 0))
## [1] 0.5531915

At baseline, there are three manifestly either incorrectly reported estimates, SEs, or significance estimates, and one that could have plausibly resulted from rounding. It’s good to always include a note if your rounded coefficients end up providing incorrect p-values. Perhaps consider providing more digits to make that clear. At baseline, 9% of significant estimates were not actually significant. The naive Bonferroni correction makes 49% nonsignificant, and correcting for all of the tests provided in Bodenhom’s tables, 55% became nonsignificant.

CorrelatedBonferroniJustP <- function(p, m, r){
  msharp = (m + 1) - (1 + (m - 1) * r)
  AdjustedP = p * msharp
  return(AdjustedP)}

sum(CorrelatedBonferroniJustP(SigEstimates$Ps, rep(47, 47), rep(.1, 47)) > .05)
## [1] 23
sum(CorrelatedBonferroniJustP(SigEstimates$Ps, rep(47, 47), rep(.2, 47)) > .05)
## [1] 23
sum(CorrelatedBonferroniJustP(SigEstimates$Ps, rep(47, 47), rep(.3, 47)) > .05)
## [1] 23
sum(CorrelatedBonferroniJustP(SigEstimates$Ps, rep(47, 47), rep(.4, 47)) > .05)
## [1] 21
sum(CorrelatedBonferroniJustP(SigEstimates$Ps, rep(47, 47), rep(.5, 47)) > .05)
## [1] 21
sum(CorrelatedBonferroniJustP(SigEstimates$Ps, rep(144, 47), rep(.1, 47)) > .05)
## [1] 26
sum(CorrelatedBonferroniJustP(SigEstimates$Ps, rep(144, 47), rep(.2, 47)) > .05)
## [1] 26
sum(CorrelatedBonferroniJustP(SigEstimates$Ps, rep(144, 47), rep(.3, 47)) > .05)
## [1] 26
sum(CorrelatedBonferroniJustP(SigEstimates$Ps, rep(144, 47), rep(.4, 47)) > .05)
## [1] 26
sum(CorrelatedBonferroniJustP(SigEstimates$Ps, rep(144, 47), rep(.5, 47)) > .05)
## [1] 26

The results were are very comparable for the naive Bonferroni and the correlation-corrected Bonferroni. The differences will tend to be greater for smaller numbers of tests, and as outcome correlations increase.

References

Shi, Q., Pavey, E. S., & Carter, R. E. (2012). Bonferroni-based correction factor for multiple, correlated endpoints. Pharmaceutical Statistics, 11(4), 300–309. https://doi.org/10.1002/pst.1514

Bodenhom, H. (2007). Single Parenthood and Childhood Outcomes in the Mid-Nineteenth-Century Urban South. The Journal of Interdisciplinary History, 38(1), 33–64.

sessionInfo()
## R version 4.1.2 (2021-11-01)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19044)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## loaded via a namespace (and not attached):
##  [1] digest_0.6.28   R6_2.5.1        jsonlite_1.7.2  magrittr_2.0.1 
##  [5] evaluate_0.14   rlang_0.4.12    stringi_1.7.5   jquerylib_0.1.4
##  [9] bslib_0.3.1     rmarkdown_2.11  tools_4.1.2     stringr_1.4.0  
## [13] xfun_0.27       yaml_2.2.1      fastmap_1.1.0   compiler_4.1.2 
## [17] htmltools_0.5.2 knitr_1.36      sass_0.4.0

Unrelated pro-tip: If you want a CI that lets you reject \(>\alpha\) on either side for a threshold of .05, use a 90% CI.