If the p-value is less than \(\alpha\), the test result is considered unusual, or significant.
\[Figure\ 1.\ \]
Explanation:
Suppose we’ve tested \(m\) null hypotheses, \(m_0\) of which are actually true and \(m-m_0\) of which are actually false. Out of the \(m\) tests, \(R\) have been declared significant, that is, their associated p-values were less than \(\alpha\), and \(m-R\) were declared nonsignificant (boring results).
\(V\) – number of Type I errors (rejecting a true null hypothesis)
Another name for a Type I error is False Positive, since it is falsely claiming a significant (positive) result.
\(T\) – number of Type II errors (failing to reject a false null hypothesis)
Another name for a Type II error is False Negative, since it is falsely claiming a nonsignificant (negative) result.
\(\frac{V}{R}\) – proportion of false discoveries among the results declared significant
\(R\) is the total number of results declared significant. Since \(V\) is a random variable (i.e., unknown until we do an experiment), we call the expected value of the ratio, \(E\left(\frac{V}{R}\right)\), the False Discovery Rate (FDR).
\(\frac{V}{m_0}\) – proportion of false positives among the true null hypotheses
From the chart, \(m_0\) represents the number of true \(H_0\)’s, and \(m_0\) is unknown. \(V\) is the number of those falsely declared significant. Since \(V\) is a random variable (i.e., unknown until we do an experiment), we call the expected value of the ratio, \(E\left(\frac{V}{m_0}\right)\), the False Positive Rate.
Family-Wise Error Rate (FWER): the probability of at least one false positive, \(Pr(V \geq 1)\).
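For intuition about why this matters as \(m\) grows (assuming, as an added simplification here, that the \(m_0\) true-null tests are independent), the chance of at least one false positive rises quickly with the number of tests:
\[Pr(V \geq 1) = 1 - Pr(\text{no false positives}) = 1 - (1-\alpha)^{m_0}\]
For example, with \(m_0 = 20\) and \(\alpha = 0.05\) this is \(1 - 0.95^{20} \approx 0.64\), so even 20 tests make at least one false positive more likely than not.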
If we call all tests with \(p < \alpha\) significant, then our false positive rate will be at most \(\alpha\) on average.
e.g., if we perform 10,000 tests in which all null hypotheses are true and \(\alpha = 0.05\), we expect about \(10{,}000 \times 0.05 = 500\) false positives.
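As a rough sanity check of that expectation, here is a minimal R sketch (the seed and the use of uniform p-values to represent true nulls are illustrative assumptions, not part of the original example):
set.seed(42)                # illustrative seed
m <- 10000
alpha <- 0.05
pNull <- runif(m)           # under a true null hypothesis, p-values are uniform on [0, 1]
sum(pNull < alpha)          # roughly 500 tests come out "significant" by chance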
Ways to control false positives (rejecting a true hypothesis):
One way to control so many false positives is to control the FWER with the Bonferroni correction, the oldest multiple testing correction.
Control the FWER at level \(\alpha\) so that \(Pr(V \geq 1) < \alpha\). One way to do this is to reduce the threshold to \(\alpha_{fwer} = \frac{\alpha}{m}\); we then call a test result significant only if its p-value \(< \alpha_{fwer}\).
Drawback: it is very conservative, so too many truly significant results may fail to reach significance.
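A minimal sketch of the Bonferroni cutoff in R, assuming a vector of p-values called pvals (the name is illustrative):
m <- length(pvals)
alpha <- 0.05
alpha_fwer <- alpha / m          # Bonferroni-corrected threshold
which(pvals < alpha_fwer)        # indices of tests declared significant while controlling FWER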
Another way to limit the false positive rate is to control the False Discovery Rate, \(E(V/R)\).
Set FDR = \(\alpha\) and use the Benjamini–Hochberg (BH) procedure: calculate the p-values as usual and order them from smallest to largest, \(p_1, p_2, ..., p_m\). Call significant any result with \(p_i \leq \frac{\alpha \, i}{m}\); that is, each p-value is compared to a threshold that depends on its rank. This is equivalent to finding the largest \(k\) such that \(p_k \leq \frac{k \, \alpha}{m}\) (for a given \(\alpha\)) and then rejecting the null hypotheses for all \(i = 1, ..., k\).
Like the Bonferroni correction, this is easy to calculate, and it is much less conservative. Drawback: it might let more false positives through, and it may behave strangely if the tests aren’t independent.
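A minimal sketch of the BH step-up rule, using the same illustrative pvals vector as above; in practice p.adjust(pvals, method = "BH") implements the same procedure:
m <- length(pvals)
alpha <- 0.05
ord <- order(pvals)                          # p-values from smallest to largest
ok <- pvals[ord] <= (1:m) * alpha / m        # compare the i-th smallest p-value to i*alpha/m
k <- if (any(ok)) max(which(ok)) else 0      # largest rank k that passes its threshold
signifTests <- ord[seq_len(k)]               # reject H_0 for the k smallest p-values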
\[Figure\ 2.\ \]
Explanation:
It shows the p-values for 10 tests performed at the \(\alpha = 0.2\) level and three cutoff lines. The p-values are shown in order from left to right along the x-axis. The red line is the threshold for no correction (p-values are compared to \(\alpha = 0.2\)), the blue line is the Bonferroni threshold, \(\alpha/m = 0.2/10 = 0.02\), and the gray line shows the BH correction. Note that it is not horizontal but has a positive slope, as we expect.
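Figure 2 could be reproduced with a sketch along these lines (p10, a vector of the 10 p-values, is an assumed name; the actual values are not given in the text):
alpha <- 0.2; m <- 10
plot(1:m, sort(p10), xlab = "rank", ylab = "p-value", pch = 19)
abline(h = alpha, col = "red")               # no-correction threshold
abline(h = alpha / m, col = "blue")          # Bonferroni threshold, 0.02
lines(1:m, (1:m) * alpha / m, col = "gray")  # BH threshold rises with rank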
Another approach is to adjust the p-values themselves. They are then no longer classical p-values, but they can be compared directly to the original \(\alpha\).
Suppose the p-values are \(p_1, ... , p_m\). With the Bonferroni method you would adjust these by setting \(p'_i = \min(m \cdot p_i, 1)\) for each p-value. Then if you call all \(p'_i < \alpha\) significant, you will control the FWER.
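In R this adjustment can be sketched as follows (pvals is again an illustrative name; p.adjust(pvals, method = "bonferroni") applies the same rule):
m <- length(pvals)
pAdj <- pmin(m * pvals, 1)      # Bonferroni-adjusted p-values, capped at 1
sum(pAdj < 0.05)                # count of results still significant after adjustment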
Example: pValues is a vector of 1000 p-values from linear regressions of Y on X in which there is no true relationship between X and Y.
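A minimal sketch of how such a vector might be simulated (the seed and sample size are assumptions; the original lesson supplies pValues ready-made):
set.seed(1010093)                  # illustrative seed
pValues <- rep(NA, 1000)
for (i in 1:1000) {
  x <- rnorm(20); y <- rnorm(20)   # no true relationship between x and y
  pValues[i] <- summary(lm(y ~ x))$coefficients[2, 4]   # p-value for the slope
}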
sum(pValues < 0.05)
> 51
# returns the number of p-values less than 0.05; since there are no true relationships, these 51 are all false positives
sum(p.adjust(pValues, method = "bonferroni") < 0.05)
> 0
# the Bonferroni correction eliminated all the false positives that had passed the uncorrected alpha test
sum(p.adjust(pValues, method = "BH") < 0.05)
> 0
# the BH correction also eliminates all of them
Suppose we have created pValues2, a vector of 1000 p-values: the first 500 come from tests with no true relationship between X and Y, and the last 500 come from tests with a genuine relationship. The vector trueStatus records which is which, taking the value “zero” for the truly null tests and “not zero” for the others. The logical vector pValues2 < 0.05 has two outcomes, and each entry of trueStatus has one of two possible values; the function table aligns the two arguments and counts how many of each combination (TRUE, “zero”), (TRUE, “not zero”), (FALSE, “zero”), and (FALSE, “not zero”) appear. In the “zero” column, the truly random tests, 24 results were flagged as significant.
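A minimal sketch of how pValues2 and trueStatus could be built, continuing the simulation style above (the effect size and other details are assumptions):
pValues2 <- rep(NA, 1000)
for (i in 1:1000) {
  x <- rnorm(20)
  # first 500 tests: no relationship; last 500: y genuinely depends on x
  y <- if (i <= 500) rnorm(20) else rnorm(20, mean = 2 * x)
  pValues2[i] <- summary(lm(y ~ x))$coefficients[2, 4]
}
trueStatus <- rep(c("zero", "not zero"), each = 500)   # the true slope status for each test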
table(pValues2 < 0.05, trueStatus)
\[Figure\ 3.\ \]
\[Figure\ 4.\ \]
Proportion of false positives: 24/500 = 0.048, which is below 0.05, as we expected.
table(p.adjust(pValues2, method = "bonferroni") < 0.05, trueStatus)
\[Figure\ 5.\ \]
Since the Bonferroni correction method is more conservative than simply comparing p-values to alpha, all the truly random tests are correctly identified in the zero column. In other words, we have no false positives. However, the threshold has been adjusted so much that 23 of the truly significant results have been misidentified in the not zero column.
\[Figure\ 7.\ \]
Here’s a plot of the two sets of adjusted p-values, Bonferroni on the left and BH on the right. The x-axis shows the original p-values. For the Bonferroni correction (adjusting by multiplying by 1000, the number of tests), only a few of the adjusted values are below 1. For the BH correction, the adjusted values are only slightly larger than the original values. Usually a Bonferroni or BH correction is good enough to eliminate false positives.
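The two panels can be reproduced with a sketch like this (assuming the plotted values are the adjusted versions of pValues2; layout details are also assumptions):
par(mfrow = c(1, 2))
plot(pValues2, p.adjust(pValues2, method = "bonferroni"), pch = 19,
     xlab = "original p-values", ylab = "Bonferroni-adjusted p-values")
plot(pValues2, p.adjust(pValues2, method = "BH"), pch = 19,
     xlab = "original p-values", ylab = "BH-adjusted p-values")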