The Problem: Multiple Comparisons
When you run one statistical test (like a t-test) and declare significance at p < 0.05, you're accepting a 5% chance of a false positive (a "Type I Error"). This is the chance you'll say "there is an effect" when, in reality, there is none.
The problem starts when you run more than one test. If you run 20 tests on data where there is no real effect, your chance of getting at least one false positive isn't 5%. It's 64%.
The Problem Compounded
The probability of getting at least one false positive across a set of tests is called the Family-Wise Error Rate (FWER). At a per-test threshold α of 0.05, it grows rapidly with the number of tests, n:
| # of Tests (n) | Chance of At Least One False Positive |
|---|---|
| 1 | 5.0% |
| 2 | 9.8% |
| 5 | 22.6% |
| 10 | 40.1% |
| 20 | 64.2% |
| 50 | 92.3% |
The formula for the FWER is:
FWER = 1 - (1 - α)ⁿ
For our example (α=0.05, n=20):
FWER = 1 - (1 - 0.05)²⁰ = 1 - (0.95)²⁰ ≈ 0.6415
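You can check these numbers with a quick Python sketch of the same formula:

```python
# Family-wise error rate: P(at least one false positive) = 1 - (1 - alpha)^n
alpha = 0.05

for n in [1, 2, 5, 10, 20, 50]:
    fwer = 1 - (1 - alpha) ** n
    print(f"{n:>3} tests: FWER = {fwer:.1%}")
```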
A Striking Example: The Dead Salmon
In neuroscience, fMRI studies scan the brain for active regions, comparing thousands of tiny "voxels" (3D pixels) before and after a task. This creates a massive multiple comparisons problem.
A famous study demonstrated this by placing a dead Atlantic salmon in an fMRI scanner and showing it "photographs depicting human individuals in social situations." The analysis found a "statistically significant" cluster of brain activity (p = 0.001) in the dead fish's brain. This was a vivid false positive, a direct result of performing thousands of tests without correction. While most modern studies apply corrections, this example shows how easy it is to find "significant" results in pure noise.
Type I & Type II Errors
In any statistical test, you are comparing reality to your test result. This creates four possible outcomes.
Type I Error (False Positive)
Reality: No effect exists.
Your Test: Says there IS an effect.
This is "finding" something that isn't real. This is what the p-value threshold α (alpha) is supposed to control. Multiple comparisons dramatically inflate the chance of this error.
Type II Error (False Negative)
Reality: An effect DOES exist.
Your Test: Says there is NO effect.
This is "missing" a real finding. The probability of this error is called β (beta). The power of a study is 1 - β (the chance of correctly finding a real effect).
Adjusted P-Values vs. Significance Levels
So, how do we fix this? There are two main ways to think about corrections.
1. Adjust the Significance Level (α)
This is the classic Bonferroni approach. Instead of using 0.05 for every test, you use a much stricter threshold.
- Original α: 0.05
- Number of Tests (m): 20
- New Adjusted α: 0.05 / 20 = 0.0025
You then compare all your *original p-values* to this new, tiny threshold. Only p-values less than 0.0025 are "significant."
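As a quick sketch (the raw p-values here are made up for illustration):

```python
# Bonferroni: compare each raw p-value to alpha / m
alpha, m = 0.05, 20
p_values = [0.001, 0.003, 0.020, 0.047, 0.300]  # hypothetical raw p-values

threshold = alpha / m  # 0.0025
for p in p_values:
    verdict = "significant" if p < threshold else "not significant"
    print(f"p = {p:.3f} -> {verdict} at adjusted alpha = {threshold}")
```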
2. Adjust the P-Values
This is the more modern approach. You leave the significance level at 0.05, but you mathematically inflate all of your p-values to account for the number of tests.
- Original p-value: 0.003
- Number of Tests (m): 20
- Adjusted p-value (Bonferroni): 0.003 * 20 = 0.06
Your original 0.003 *looked* significant, but the adjusted p-value of 0.06 is greater than 0.05, so it is *not* significant. This is the approach most software uses.
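Here are the same made-up p-values again, this time adjusting the p-values rather than the threshold:

```python
# Bonferroni-adjusted p-values: multiply by m (capped at 1), then compare to 0.05
alpha, m = 0.05, 20
p_values = [0.001, 0.003, 0.020, 0.047, 0.300]  # same hypothetical raw p-values

for p in p_values:
    p_adj = min(p * m, 1.0)
    verdict = "significant" if p_adj < alpha else "not significant"
    print(f"raw p = {p:.3f} -> adjusted p = {p_adj:.3f} ({verdict})")
```

For Bonferroni, the two approaches always reach exactly the same decisions; adjusting the p-values is simply easier to report alongside the conventional 0.05 cutoff.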
The Base Rate Fallacy (Why P < 0.05 Isn't Enough)
A "significant" p-value does not tell you the chance that your hypothesis is true. This is a common and dangerous misunderstanding known as the base rate fallacy.
The p-value is calculated *assuming the null hypothesis is true* (i.e., that there is no effect). It tells you the probability of seeing your data (or more extreme data) by chance *in that scenario*. It does *not* tell you the probability that the effect is real. That probability depends entirely on the "base rate" — how many of the hypotheses you're testing are actually true to begin with.
Example 1: The Cancer Drug Trials
Suppose you are testing 100 potential cancer medications. You don't know it, but only 10 of these drugs actually work (a 10% base rate). You run experiments with a statistical power of 0.8 (meaning you'll correctly detect 80% of the working drugs) and look for p < 0.05.
Of the 10 "True" Drugs:
- You will correctly detect 8 of them (80% power). These are True Positives.
- You will miss 2 of them. These are False Negatives.
Of the 90 "False" Drugs:
- You will get a "significant" result for 5% of them (your p < 0.05 threshold). 90 * 0.05 = 4.5, so let's say 5 drugs. These are False Positives.
- You will correctly find no effect for the other 85. These are True Negatives.
At the end of your study, you have 13 "significant" results (8 + 5). What is the chance that any *one* of these significant results is a truly effective drug?
8 True Positives / (8 True Positives + 5 False Positives) = 8 / 13 ≈ 62%
Your "False Discovery Rate" is 5 / 13, or 38%!
Even with a p-value of 0.05, your chance of any given "significant" finding being a fluke is 38%. This is because the base rate of effective drugs was so low (10%), giving many opportunities for false positives.
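The whole calculation fits in a few lines of Python. The `discovery_breakdown` helper below is just an illustrative name; note that using the expected 4.5 false positives directly (instead of rounding up to 5) gives 64% and 36%, essentially the same picture.

```python
# Expected outcomes when testing many hypotheses with a given base rate of true effects
def discovery_breakdown(n_tests, base_rate, power, alpha):
    """Return the expected (true_positives, false_positives) at threshold alpha."""
    n_true = n_tests * base_rate
    n_false = n_tests - n_true
    tp = n_true * power     # real effects we detect
    fp = n_false * alpha    # null effects that cross the threshold by chance
    return tp, fp

tp, fp = discovery_breakdown(n_tests=100, base_rate=0.10, power=0.80, alpha=0.05)
print(f"True positives = {tp:.1f}, false positives = {fp:.1f}")
print(f"P(real | significant) = {tp / (tp + fp):.0%}")   # ~64%
print(f"False discovery rate  = {fp / (tp + fp):.0%}")   # ~36%
```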
Example 2: Medical Testing (Mammograms)
This fallacy is famous in medical diagnostics. Suppose:
- The base rate of breast cancer for women getting mammograms is 0.8% (1 in 125).
- The test correctly detects 90% of cancers (90% power / "sensitivity").
- The test gives a false positive for 7% of women *without* cancer (7% false positive rate).
If 1,000 women are screened, what happens?
Of the 8 women *with* cancer:
- The test finds 7 (90% of 8). These are True Positives.
- The test misses 1. This is a False Negative.
Of the 992 women *without* cancer:
- The test gives a false positive for 7% of them (0.07 * 992 ≈ 70). These are False Positives.
- The test correctly clears 922. These are True Negatives.
In total, 77 women get a positive result (7 + 70). If you get a positive mammogram, what is the chance you actually have breast cancer?
7 True Positives / (7 True Positives + 70 False Positives) = 7 / 77 ≈ 9%
Despite the test seeming "accurate," a positive result has a 91% chance of being a false positive.
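The same arithmetic, written out directly with the mammogram numbers, recovers the roughly 9% figure:

```python
# Base rate arithmetic for the mammogram example
n_women, base_rate, sensitivity, fp_rate = 1000, 0.008, 0.90, 0.07

with_cancer = n_women * base_rate               # 8 women
true_pos  = with_cancer * sensitivity           # ~7 detected
false_pos = (n_women - with_cancer) * fp_rate   # ~70 false alarms

print(f"P(cancer | positive test) = {true_pos / (true_pos + false_pos):.0%}")  # ~9%
```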
Example 3: Survey Data
This also applies to surveys. A 1992 survey estimated 2.5 million annual gun self-defense uses. But other data showed only 430,000 home burglaries where the resident was home and awake. The survey *must* be a massive overestimate.
Why? The base rate. If 99.9% of people have *not* used a gun in self-defense, then even if just 1% of that large group misunderstands the question, wants to sound "manly," or is joking, those false positives (1% of 99.9% of respondents) will vastly outnumber the *true* positives, which can only come from the tiny 0.1% who actually did.
Measures Accounting for Type I Errors
Because simple p-values are misleading, statisticians have developed two main ways to measure and control the error rate across a *family* of tests.
Family-Wise Error Rate (FWER)
This is the probability of getting at least one Type I Error (false positive) in your entire set of tests.
Goal: To keep the FWER at or below 5%.
Methods that control this: Bonferroni, Holm, Hochberg, Hommel.
Best for: Situations where *any* false positive is highly problematic (e.g., final approval of a new drug).
False Discovery Rate (FDR)
This is the expected proportion of your "significant" results that are actually false positives.
Goal: To keep the FDR at or below 5% (i.e., you accept that *of all your "discoveries,"* 5% might be flukes).
Methods that control this: Benjamini-Hochberg (BH), Benjamini-Yekutieli (BY).
Best for: Exploratory research where you are testing many things (e.g., genetics, fMRI) and are willing to accept a few false positives in order to find more true effects.
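To make the FDR idea concrete, here is a minimal, hand-rolled sketch of the Benjamini-Hochberg step-up rule; the p-values are invented for illustration.

```python
# A minimal sketch of the Benjamini-Hochberg (BH) step-up procedure
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans: True where the hypothesis is rejected at FDR level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])   # indices, smallest p first
    # Find the largest rank k with p_(k) <= (k / m) * q ...
    threshold_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            threshold_rank = rank
    # ... then reject the hypotheses with ranks 1..k.
    reject = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= threshold_rank:
            reject[idx] = True
    return reject

p_vals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]
print(benjamini_hochberg(p_vals, q=0.05))  # only the two smallest survive here
```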
Disjunctive vs. Marginal Power (A Key Consideration)
When designing a trial with multiple outcomes, it's not just the error rate that matters, but also the *statistical power*. A 2019 study by Vickerstaff et al. highlights a critical distinction between two types of power.
1. Disjunctive Power
This is the probability of detecting at least one true intervention effect across *all* of the primary outcomes.
Clinical Goal: This is appropriate when your trial is considered a success if the intervention shows a significant effect on *any* of the important outcomes.
2. Marginal Power
This is the probability of finding a true intervention effect on a *single, specific, nominated outcome* (calculated separately for each one).
Clinical Goal: This is often the default calculation. A common (and resource-intensive) approach is to calculate the sample size needed to achieve 80% or 90% power for *every single outcome* individually.
The Impact on Sample Size
The key takeaway from Vickerstaff et al. (2019) is that the sample size required to achieve 90% *marginal* power for all outcomes is often much larger than the size needed to achieve 90% *disjunctive* power. If the clinical objective is just to find *one* significant effect, using marginal power for all outcomes can "overestimate the number of participants required."
| Effect Sizes (Outcome 1, 2) | Correlation (ρ) | Sample Size per Group for 90% Disjunctive Power | Sample Size per Group for 90% Marginal Power (Max) |
|---|---|---|---|
| (0.2, 0.2) - Both Small | 0.2 | 402 | 622 |
| (0.2, 0.2) - Both Small | 0.8 | 522 | 622 |
| (0.3, 0.5) - Mixed | 0.2 | 96 | 278 |
| (0.5, 0.5) - Both Large | 0.2 | 65 | 101 |
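The gap is easy to see by simulation. Below is a rough Monte Carlo sketch under assumed conditions (two correlated, normally distributed outcomes, a two-arm trial, a Bonferroni-adjusted per-outcome threshold of 0.05 / 2, and an arbitrary 400 participants per arm); it illustrates the difference between the two kinds of power rather than reproducing the paper's exact figures, and assumes numpy and scipy are available.

```python
# Monte Carlo sketch: disjunctive vs. marginal power for two correlated outcomes
import numpy as np
from scipy import stats

def simulate_power(n_per_arm, effects=(0.2, 0.2), rho=0.2,
                   alpha=0.025, n_sims=5_000, seed=0):
    rng = np.random.default_rng(seed)
    cov = [[1.0, rho], [rho, 1.0]]
    any_hit = 0      # at least one outcome significant (disjunctive)
    first_hit = 0    # outcome 1 significant on its own (marginal)
    for _ in range(n_sims):
        control = rng.multivariate_normal([0.0, 0.0], cov, size=n_per_arm)
        treated = rng.multivariate_normal(effects, cov, size=n_per_arm)
        p = [stats.ttest_ind(treated[:, j], control[:, j]).pvalue for j in (0, 1)]
        any_hit += (p[0] < alpha) or (p[1] < alpha)
        first_hit += p[0] < alpha
    return any_hit / n_sims, first_hit / n_sims

disjunctive, marginal = simulate_power(n_per_arm=400)
# At the same sample size, disjunctive power comes out noticeably higher than
# marginal power, which is why powering for "at least one effect" needs fewer
# participants than powering every outcome individually.
print(f"disjunctive ~ {disjunctive:.2f}, marginal ~ {marginal:.2f}")
```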
Adjustment Methods: The Algorithms
Here are the most common methods, from simplest to most complex.
Visual Comparison of Adjustment Methods
This plot shows the real-world impact of different adjustment methods. We start with a sample of 20 raw p-values (sorted on the X-axis) and plot their adjusted p-values (on the Y-axis) for each method. The diagonal dotted line represents the raw p-value (no adjustment), and the horizontal dotted line is the standard 0.05 significance level.
Key Takeaways from the Plot:
- No Adjustment (Diagonal Line): Every raw p-value below 0.05 (the horizontal line) would be declared significant, with no protection against the multiple comparisons problem.
- Bonferroni (Red): This is the most conservative method. It shifts the p-values up the most, resulting in the fewest "significant" findings.
- Holm (Blue) & Hochberg (Green): These FWER-controlling methods are "step" procedures. They are less conservative than Bonferroni but still stricter than FDR methods.
- Benjamini-Hochberg (BH, Purple): This FDR-controlling method is much less conservative. It allows more p-values to remain below the 0.05 line, "discovering" more potential effects while controlling the *proportion* of false discoveries.
- Benjamini-Yekutieli (BY, Orange): A stricter version of FDR control that remains valid under arbitrary dependence between the tests. It is more conservative than BH but typically still more powerful than the FWER methods.
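If you want to generate this kind of comparison yourself, one option is the `multipletests` function from the statsmodels package (assuming it is installed); the raw p-values here are randomly generated for illustration.

```python
# Compare how many of 20 illustrative p-values survive each adjustment method
import numpy as np
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(42)
raw_p = np.sort(rng.uniform(0, 0.25, size=20))   # 20 made-up raw p-values

for method in ["bonferroni", "holm", "simes-hochberg", "hommel", "fdr_bh", "fdr_by"]:
    reject, adjusted, _, _ = multipletests(raw_p, alpha=0.05, method=method)
    print(f"{method:>14}: {reject.sum()} of {len(raw_p)} significant at 0.05")
```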
Conclusion: Practical Recommendations
Choosing a method involves a trade-off between power, simplicity, and the stage of your research (design vs. analysis). Based on the findings of Vickerstaff et al. (2019) for clinical trials, here are practical recommendations:
For *Sample Size Calculation* (Design Phase)
- Use the Bonferroni correction.
- Why? It is simple to apply (just adjust α = 0.05 / m), reliably controls the FWER, and can be calculated *before* the trial, unlike data-driven methods.
- Critically consider if your goal is Disjunctive Power (at least one effect). If so, you may need a significantly smaller sample size than if you power for all outcomes *marginally*.
For *Statistical Analysis* (After Data Collection)
- Use the Hommel method (if distributional assumptions are met).
- Why? It provides slightly more statistical power than Bonferroni or Holm while still robustly controlling the FWER.
- If assumptions are in doubt or implementation is complex, the Holm method is an excellent and powerful alternative.