Multiple Testing

Multiple testing is when you use data to test several hypotheses, for example if we set \(\alpha\) =.5 we can only test 20 hypotheses before one of the outcomes is an error and even if a p value is significantly low, it could be false. In these cases error measures, like the the Family Wise Error Rate, \(FWER\), come into play, especially for big data analysis, where for example if we run 10 000 test at \(\alpha\) = .05, we would get 500 false positives!

To avoid this many false positive we can use multiple test corrections, like the Bonferroni correction, by reducing \(\alpha\) dramatically, however here too many results fail, so the False Discovery Rate, \(FDR\) is simply limited in most disciplines with the Benjamini-Hochberg method, \(BH\) this is much like the \(k\) evaluation.

Both these methods adjust the threshold \(\alpha\), an equivalent corrective approach is to adjust the \(p\)-values, then call all \(p'_i\) < \(\alpha\) significant.

Status Quo vs Discoveries

\(m_0\) are actually true and
\(m\) - \(m_0\) are actually false
\(R\) have been declared significant, these are discoveries
A False Positive falsely claims a significant positive result, it is a Type I error, denoted \(V\).
A False Negative falsely claims a non-significant negative result, it is a Type II error, denoted \(T\).
So, the False Discovery Rate, \(FDR\) is the expected value of the Type I errors over the discoveries: \(E(V/R)\)
and the False Positive Rate, \(FPR\), will be the expected value of the Type I errors over the true events: \(E(V/m_0)\)

\(FDR\) = \(E(V/R)\)

\(FPR\) = \(E(V/m_0)\)

The Family Wise Error Rate, \(FWER\)

The probability of at least one false positive, \(Pr\), that is \(V\) >= 1, is the Family Wise Error Rate, \(FWER\).

Bonferroni correction

Is a correction test which dramatically reduces \(\alpha\) to avoid many false positives, in big data analysis. We set \(\alpha_F\) = \(\alpha/m\) and call a test result significant if its \(p\)-value < \(\alpha_F\). The drawback is that too many results would fail if this correction is applied.

Here you could adjust these by setting \(p'_i\) = max(\(m\) * \(p_i\), 1) for each \(p\)-value, then if you would call all \(p'_i\) < \(\alpha\) significant you will control the \(FWER\).

Limit the \(FDR\)

In the Benjamini-Hochberg method, \(BH\) a \(p\)-values is compared to a value that depends on its ranking, \(p_i\) <= (\(\alpha\)*\(i\))/\(m\).

This is equivalent to finding the largest \(k\) such that \(p_k\) <= (\(k\) * \(\alpha\))/\(m\), for a given alpha and then rejecting all the null hypotheses for \(i\) = 1,…,\(k\).

The following chart shows the \(p\)-values for 10 tests performed at the \(\alpha\)=.2 level and three cutoff lines. The \(p\)-values are shown in order from left to right along the x-axis.

JHU chart

The red line is the threshold for No Corrections (p-values are compared to alpha=.2),
the blue line is the Bonferroni threshold, alpha=.2/10 = .02, and
the gray line shows the Benjamini-Hochberg correction.

The \(BH\) correction which limits the \(FWER\) is proportional to the ranking of the values. Both corrections, Bonferroni and BH methods adjust the threshold \(\alpha\), an equivalent corrective approach is to adjust the p-values, so they’re not classical p-values anymore, but they can be compared directly to the original alpha.

Worked Examples:

no relation

## create an array of p-values

set.seed(1010093)
pValues <- rep(NA,1000)
for(i in 1:1000){
  y <- rnorm(20)
  x <- rnorm(20)
  pValues[i] <- summary(lm(y ~ x))$coeff[2,4]
}
#inspect array
head(pValues)

## [1] 0.5334915 0.2765785 0.8380943 0.6721730 0.8122037 0.4078675

#count p < .05
sum(pValues < 0.05)

## [1] 51

There should be close to 50 false positives, we can control this with the R function p.adjust()

sum(p.adjust(pValues,method="bonferroni") < 0.05)

## [1] 0

The above method should be more radical than the next

sum(p.adjust(pValues,method="BH") < 0.05)

## [1] 0

but as there was no relation to begin will all false positive were eliminated.

Half of the is paired

## create an array of p-values

set.seed(1010093)
pValues2 <- rep(NA,1000)
for(i in 1:1000){
  x <- rnorm(20)
  # First 500 beta=0, last 500 beta=2
  if(i <= 500){y <- rnorm(20)}else{ y <- rnorm(20,mean=2*x)}
  pValues2[i] <- summary(lm(y ~ x))$coeff[2,4]
}
trueStatus <- rep(c("zero","not zero"),each=500)

#inspect array
tail(trueStatus)

## [1] "not zero" "not zero" "not zero" "not zero" "not zero" "not zero"

We can count & tabulate the each combination (TRUE,“zero”), (TRUE,“not zero”), (FALSE,“zero”), and (FALSE,“not zero”)

table(pValues2 < 0.05, trueStatus)

##        trueStatus
##         not zero zero
##   FALSE        0  476
##   TRUE       500   24

This is the result without any correction, there should be about 5% false positives, this should be reduced if we apply a correction.

table(p.adjust(pValues2,method="bonferroni") < 0.05, trueStatus)

##        trueStatus
##         not zero zero
##   FALSE       23  500
##   TRUE       477    0

And here is a much less stringent control.

table(p.adjust(pValues2,method="BH") < 0.05, trueStatus)

##        trueStatus
##         not zero zero
##   FALSE        0  487
##   TRUE       500   13

Here is a plot from the SWIRL repo

plot.new()
par(mfrow=c(1,2))
plot(pValues2,p.adjust(pValues2,method="bonferroni"),pch=2)
plot(pValues2,p.adjust(pValues2,method="BH"),pch=2)

We see that with the Bonferroni control only a few of the adjusted values are below 1. For the BH, the adjusted values are slightly larger than the original values.

Documenting file creation

It’s useful to record some information about how your file was created.

File creation date: 2021-01-04
R version 4.0.2 (2020-06-22)
R version (short form): 4.0.2
mosaic package version: 1.8.2
Additional session information

sessionInfo()  # could use devtools::session_info() if you prefer that

## R version 4.0.2 (2020-06-22)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19041)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United Kingdom.1252 
## [2] LC_CTYPE=English_United Kingdom.1252   
## [3] LC_MONETARY=English_United Kingdom.1252
## [4] LC_NUMERIC=C                           
## [5] LC_TIME=English_United Kingdom.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] reshape2_1.4.4    knitr_1.30        mosaic_1.8.2      ggridges_0.5.2   
##  [5] mosaicData_0.20.1 ggformula_0.9.4   ggstance_0.3.4    dplyr_1.0.2      
##  [9] Matrix_1.2-18     ggplot2_3.3.2     lattice_0.20-41  
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_1.1.0  xfun_0.18         purrr_0.3.4       splines_4.0.2    
##  [5] colorspace_1.4-1  vctrs_0.3.4       generics_0.0.2    htmltools_0.5.0  
##  [9] yaml_2.2.1        rlang_0.4.8       pillar_1.4.6      glue_1.4.2       
## [13] withr_2.3.0       tweenr_1.0.1      lifecycle_0.2.0   plyr_1.8.6       
## [17] mosaicCore_0.8.0  stringr_1.4.0     munsell_0.5.0     gtable_0.3.0     
## [21] htmlwidgets_1.5.2 evaluate_0.14     crosstalk_1.1.0.1 broom_0.7.1      
## [25] Rcpp_1.0.5        readr_1.4.0.9000  scales_1.1.1      backports_1.1.10 
## [29] leaflet_2.0.3     farver_2.0.3      gridExtra_2.3     ggforce_0.3.2    
## [33] hms_0.5.3         digest_0.6.25     stringi_1.5.3     ggrepel_0.8.2    
## [37] polyclip_1.10-0   grid_4.0.2        tools_4.0.2       magrittr_1.5     
## [41] tibble_3.0.4      ggdendro_0.1.22   crayon_1.3.4      tidyr_1.1.2      
## [45] pkgconfig_2.0.3   MASS_7.3-51.6     ellipsis_0.3.1    rmarkdown_2.6    
## [49] R6_2.4.1          compiler_4.0.2

RECAP: hypothesis testing

A Type I error is rejecting a true hypothesis or “convicting an innocent person”
A Type II error is failing to reject a false hypothesis or “acquitting a guilty person”
The null hypothesis, \(H_0\), represents the status_quo and is assumed to be true.
The test statistic is \(\mu_a\), usually the mean of the alternative.
The \(\alpha\) levels are set before the test is conducted, often at 5%.
The \(p\) value is the probability under the null hypothesis of obtaining evidence as or more extreme than the test statistic.
if a p-value is found to be less than alpha (say 0.05), then the test result is considered statistically significant, i.e., surprising and unusual, and the the status quo is rejected.

Destination formats

File can be knit to HTML, PDF, or Word.

Documenting file creation

File creation date: 2021-01-04
R version 4.0.2 (2020-06-22)
R version (short form): 4.0.2
mosaic package version: 1.8.2
Additional session information

Multiple Testing

Linda Angulo Lopez

04/01/2021

Ask the right questions:

Multiple Testing

Status Quo vs Discoveries

The Family Wise Error Rate, \(FWER\)

Bonferroni correction

Limit the \(FDR\)

Worked Examples:

There should be close to 50 false positives, we can control this with the R function p.adjust()

We can count & tabulate the each combination (TRUE,“zero”), (TRUE,“not zero”), (FALSE,“zero”), and (FALSE,“not zero”)

Documenting file creation

Destination formats

Documenting file creation