class: middle background-image: url(data:image/png;base64,#LTU_logo.jpg) background-position: top left background-size: 30% # STM1001 Lecture # `\(p\)`-value adjustments for multiple hypothesis testing ## Science/Health and Data Science streams ### La Trobe University --- # Welcome! ### In this lecture we will cover issues with and solutions for `\(p\)`-value calculations made when multiple hypothesis tests are conducted simultaneously. -- We will focus on how to conduct .teal_style[p-value adjustments] in R/jamovi, using some biological .teal_style[big data] in our demonstrations. -- * By the end of this lecture you will: -- * understand why `\(p\)`-values must be adjusted when multiple hypothesis tests are conducted simultaneously -- * be aware of the main `\(p\)`-value adjustment methods available -- * be able to distinguish between Family-wise Error Rate and False Discovery Rate `\(p\)`-value corrections -- We will practice `\(p\)`-value adjustments in Computer Lab 8B. --- # Overview Over the following slides, we will cover: -- * Type I error recap -- * The impact of multiple simultaneous hypothesis tests on `\(p\)`-values -- * The Family-wise Error Rate (FWER) -- * Bonferroni Correction -- * The False Discovery Rate (FDR) -- * Example application to biological data -- Throughout this lecture, we will include notes on conducting `\(p\)`-value adjustments in jamovi and R. -- *It is recommended that you cover this lecture material before you start Computer Lab 8B.* --- # 1. Type I error recap By now, you should be familiar with the concept of a hypothesis test. -- You should also feel comfortable conducting a statistical test that involves computing a .teal_style[p-value] to compare a null hypothesis `\((H_0)\)` against an alternative hypothesis `\((H_1)\)`. -- When we conduct a hypothesis test, we can compare the `\(p\)`-value to our specified .teal_style[level of significance] `\(\alpha\)` (typically 0.05), to decide what to do: -- * We reject `\(H_0\)` if the `\(p\)`-value is less than `\(\alpha\)` -- * We do not reject `\(H_0\)` otherwise --- # 1. Type I error recap Whenever we conduct a hypothesis test, there is a chance that we may reach an incorrect conclusion. -- A .red_style[Type I error] occurs when we decide to reject `\(H_0\)`, when in reality `\(H_0\)` was correct. -- * A Type I error is known as a .red_style[false positive] (i.e. we have falsely identified something as being significant, when it is non-significant) -- The level of significance we choose denotes our accepted rate of Type I error. This is why we always choose a small `\(\alpha\)` value. -- * For example, if `\(\alpha = 0.05\)`, this means that we are ok with making a Type I error 5% of the time -- * I.e. `\(\alpha\)` denotes the accepted probability of a Type I error occurring. -- *Note that we cannot set `\(\alpha = 0\)`, as this would make it impossible for us to obtain any significant results (correctly or falsely).* --- # 1. Type I error recap If we are simply conducting one hypothesis test, then the `\(\alpha\)` value we choose will correctly reflect our risk of Type I error. -- However, in many situations, we may want to conduct multiple hypothesis tests simultaneously. -- * For example, in analyses of biological data, it is common to simultaneously test thousands of features against a null hypothesis, with the expectation that a number of features will be statistically significant. --
<i class="fas fa-exclamation-triangle faa-pulse animated " style=" color:red;"></i>
Unfortunately, **when we conduct multiple hypothesis tests simultaneously, the probability of obtaining a Type I error (i.e. a false positive) increases**, meaning that the `\(\alpha\)` value originally chosen no longer reflects our true risk of a false positive. -- Let's look at a short example. --- # 2. The impact of multiple simultaneous hypothesis tests on `\(p\)`-values Let's start by setting `\(\alpha = 0.05\)` for a single test. -- This means that we have a `\(1-\alpha = 0.95\)` probability of **not** observing a Type I error for our test. -- However, suppose that instead of just the one test, we carry out two tests simultaneously. -- Now, assuming the two tests are independent, the probability of **not** observing a Type I error is `\(1-\alpha\)` for the first test, **times** `\(1-\alpha\)` for the second test - i.e. `\((1-0.05)^2 = 0.9025\)`. -- * So our probability of observing (at least) one Type I error is now `\(1-0.9025 = 0.0975\)`, which is almost double our specified `\(\alpha\)` value! --- # 2. The impact of multiple simultaneous hypothesis tests on `\(p\)`-values It gets worse as we increase the number of tests: -- Suppose we carry out ten tests simultaneously, using `\(\alpha = 0.05\)`. -- * Our probability of observing (at least) one Type I error will now be approximately 0.4 (i.e. `\(1 - (1-0.05)^{10}\)`). -- * So here we have a 40% chance of observing one or more Type I errors (false positives), even though we set our accepted rate of Type I error at `\(\alpha = 0.05\)`. -- Clearly, something must be done. --- # 3. The Family-wise Error Rate Fortunately, statistical methods exist for adjusting `\(p\)`-values obtained from performing multiple tests. -- The .teal_style[family-wise error rate (FWER)] denotes the probability of making **one or more** Type I errors when conducting multiple tests simultaneously. * Here *family* refers to the set of hypothesis tests conducted (loosely speaking). -- If we let `\(V\)` denote the number of false positives obtained in our testing procedure, we can calculate the FWER as: `$$FWER = \text{Pr}(V \geq 1) = 1 - \text{Pr}(V = 0)$$` -- If we know that we will be conducting multiple hypothesis tests simultaneously, we can focus on controlling the FWER, so that any significant results we obtain are more likely to be genuine. --- # 3. The Family-wise Error Rate There are several statistical methods we can use to adjust `\(p\)`-values and thereby control the FWER at a reasonable level (e.g. `\(0.05\)`). These include: -- * .teal_style[Bonferroni Correction] -- * .teal_style[The Holm Procedure] -- * .teal_style[The Hochberg Procedure] -- * .teal_style[The Hommel Procedure] -- *You will practice using all these methods in Computer Lab 8B.* -- Let's look at Bonferroni correction. --- class: middle ## Bonferroni Correction Suppose we conduct `\(m\)` hypothesis tests simultaneously. -- To control the FWER at `\(\leq \alpha\)`, the .teal_style[Bonferroni Correction] multiplies each `\(p\)`-value we compute by `\(m\)` (with any resultant values over 1 scaled back to 1). -- The Bonferroni-corrected `\(p\)`-values can then be compared to our specified level of significance `\(\alpha\)`, in the usual manner. -- For example, if `\(\alpha = 0.05\)` and `\(m=2\)`, the Bonferroni correction process is equivalent to comparing each original `\(p\)`-value to `$$\frac{\alpha}{m} = \frac{0.05}{2} = 0.025 \text{ (rather than } 0.05)$$` --- class: middle ## Bonferroni Correction Equivalently, we could instead adjust the `\(p\)`-values, and compare them to the original `\(\alpha\)` value specified, as the short sketch below illustrates.
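-- To make this concrete, here is a minimal R sketch (using `\(\alpha = 0.05\)` and two hypothetical `\(p\)`-values chosen purely for illustration) that first reproduces the error-inflation numbers from the previous slides, and then shows that the two ways of applying the Bonferroni correction lead to the same decisions:

``` r
alpha <- 0.05

# Part 1: probability of at least one Type I error across
# m = 1, 2 and 10 independent tests
1 - (1 - alpha)^c(1, 2, 10)
# [1] 0.0500000 0.0975000 0.4012631

# Part 2: two equivalent ways to apply the Bonferroni correction,
# using two hypothetical p-values and m = 2
p <- c(0.01, 0.03)
m <- length(p)

p < alpha / m           # compare raw p-values to alpha / m
# [1]  TRUE FALSE
pmin(p * m, 1) < alpha  # compare adjusted p-values to alpha
# [1]  TRUE FALSE
```

Note that `pmin(p * m, 1)` performs the same calculation as `p.adjust(p, method = "bonferroni")`, which we will use on the coming slides.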
-- * Suppose for example we have initial `\(p\)`-values of `\(0.04\)` and `\(0.07\)`. Using `\(\alpha = 0.05\)` and `\(m=2\)`, the adjusted `\(p\)`-values would be equal to `$$\text{initial } p\text{-value} \times m, \text{ i.e. } 0.08 \text{ and } 0.14,$$` and we would compare these adjusted `\(p\)`-values to the original `\(\alpha = 0.05\)`. --- class: top layout: true ## Bonferroni Correction Example Let's look at a more detailed example. Suppose we have conducted 10 hypothesis tests simultaneously, and obtained the following *initial* `\(p\)`-values: `$$0.0003,0.0085,0.001,0.0001,0.045,0.62,0.009,0.18,0.92,0.02$$` --- -- We could store these values in R/jamovi using the following code: ``` r p_values <- c(0.0003, 0.0085, 0.001, 0.0001, 0.045, 0.62, 0.009, 0.18, 0.92, 0.02) ``` -- If we simply used these `\(p\)`-values, we would find that 7 are significant at the 5% level of significance (i.e. for `\(\alpha = 0.05\)`). --- We can perform .teal_style[Bonferroni Correction] on these `\(p\)`-values, to obtain *adjusted* `\(p\)`-values, using the following code: ``` r p.adjust(p_values, method = "bonferroni") ``` ``` ## [1] 0.003 0.085 0.010 0.001 0.450 1.000 0.090 1.000 1.000 0.200 ``` -- We find that only 3 of the `\(p\)`-values are significant when controlling the FWER at `\(\leq 0.05\)`. -- *In Computer Lab 8B you will find that performing other FWER corrections, e.g. the Holm, Hochberg and Hommel procedures, requires only minimal changes to the code above.* --- layout: false class: middle ## Bonferroni Recap To recap: the Bonferroni correction has multiplied each of our `\(m=10\)` initial `\(p\)`-values by `\(m\)`, such that e.g.: -- * `\(0.0085\)` (initial)
<i class="fas fa-arrow-right faa-pulse animated " style=" color:blue;"></i>
`\(0.085\)` (adjusted) -- * `\(0.18\)` (initial)
<i class="fas fa-arrow-right faa-pulse animated " style=" color:blue;"></i>
`\(1\)` (adjusted, scaled from 1.8 to 1, since `\(p\)`-values `\(\in [0,1]\)`) -- We then compare the adjusted `\(p\)`-values to our initial `\(\alpha\)` value, to determine statistical significance. -- *Note that as `\(m\)` increases, it becomes more difficult to obtain statistically significant Bonferroni-adjusted `\(p\)`-values. In other words, the Bonferroni correction can end up being overly conservative when `\(m\)` is large.* --- # 4. The False Discovery Rate A more flexible and more powerful alternative to controlling the FWER is to control the .teal_style[False Discovery Rate (FDR)] when conducting multiple hypothesis tests simultaneously. -- The FDR can be defined as the *expected proportion of errors* (i.e. false discoveries) *among the rejected hypotheses*. -- * Controlling the FDR helps to reduce false positives, but also helps minimise false negatives. -- When dealing with Big Data, particularly in biological studies, using the FDR is generally preferred over more conservative adjustments such as Bonferroni correction. -- We can think of controlling the FDR as a middle ground between simply controlling the per-test Type I error rate and the stricter approach of controlling the FWER. --- # 4. The False Discovery Rate Recall that * `\(m\)` denotes the number of hypothesis tests conducted simultaneously * `\(V\)` denotes the number of false positives obtained in our testing procedure -- Suppose we also define `\(R\)` as the total number of rejections (both correct and incorrect) from a set of `\(m\)` hypotheses. -- Then mathematically we can write the FDR as: `$$FDR = E\left(\frac{V}{R}\right),$$` i.e. the `\(FDR\)` is equal to the expected value of `\(V\)` divided by `\(R\)` (where we take `\(V/R = 0\)` if `\(R = 0\)`, i.e. if no hypotheses are rejected). --- class: middle # 4. The False Discovery Rate So now our focus has shifted slightly. -- Instead of being concerned with the probability of making at least one false discovery, we accept that we will make some false discoveries, and focus on .teal_style[controlling the false discovery rate]. --- ## Extension: FDR Calculation Process The FDR calculation process is a little more complicated than the Bonferroni correction process. -- Don't worry, the software will do the necessary calculations for us! -- If you would like to learn about the details of the FDR calculation process, see Storey & Tibshirani (2003). --- ## FDR Correction Example Recall that for the set of 10 `\(p\)`-values we assessed earlier: * There were 7 initial `\(p\)`-values which were statistically significant * There were 3 adjusted `\(p\)`-values which were statistically significant following Bonferroni correction -- We can conduct FDR correction on these `\(p\)`-values using the following code: -- ``` r p.adjust(p_values, method = "fdr") ``` ``` ## [1] 0.001500000 0.018000000 0.003333333 0.001000000 ## [5] 0.064285714 0.688888889 0.018000000 0.225000000 ## [9] 0.920000000 0.033333333 ``` -- We observe that there are 6 adjusted `\(p\)`-values which are statistically significant (i.e. < `\(\alpha\)`) following FDR correction. -- The FDR method has been less stringent than the FWER methods, but has still reduced the number of `\(p\)`-values considered to be statistically significant. --- # 5. SARS-CoV-2 Study In Computer Lab 8B, we will practice how to apply FWER and FDR corrections to `\(p\)`-values, using big data from a recent study (Lee et al. 2022) on different strains of SARS-CoV-2 (covid).
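-- The workflow for the lab follows the same pattern as our ten-test example, just applied to a much longer vector of `\(p\)`-values. Below is a minimal preview sketch using placeholder data - 9,000 simulated null `\(p\)`-values stored in a hypothetical vector `gene_pvalues` (the lab will use the real study data) - which also shows how many false positives can slip through when no correction is applied:

``` r
# Placeholder: 9,000 p-values simulated under the null hypothesis;
# the lab data will contain the study's real gene-level p-values
gene_pvalues <- runif(9000)

sum(gene_pvalues < 0.05)                                   # raw: roughly 450 false positives
sum(p.adjust(gene_pvalues, method = "bonferroni") < 0.05)  # FWER control: usually 0
sum(p.adjust(gene_pvalues, method = "fdr") < 0.05)         # FDR control: usually 0
```

With the real study data, where many genes genuinely differ between the strains, the three counts will instead differ in the way described on the following slides.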
-- It will be helpful to take a look at the .teal_style[Foundational Biology for Analyses of Biological Data] supplement material prior to the lab, in order to familiarise yourself with (or refresh your memory of) biological terms such as: -- * DNA -- * Genes -- * Gene Expression -- The SARS-CoV-2 study assessed changes in genes (units of genetic information) of people following infection with one of two strains of covid: the **VIC** and **B.1.1.7** strains. -- These strains have different replication and transmission characteristics, and the researchers were interested in whether the different strains affected people's genes differently. --- # 5. SARS-CoV-2 Study Differences in the observed impact of the two strains on people's genes were assessed for each individual gene. -- In other words, multiple hypothesis tests were conducted simultaneously, one for each gene, to check which genes were .teal_style[statistically significantly differently affected] by the two covid strains. -- This means `\(p\)`-values were computed for each gene assessed (over 9,000). -- The table below shows some output from the analysis, including some initial `\(p\)`-values: <table class=" lightable-paper lightable-hover" style='font-family: "Arial Narrow", arial, helvetica, sans-serif; width: auto !important; margin-left: auto; margin-right: auto;'> <thead> <tr> <th style="text-align:left;"> gene_name </th> <th style="text-align:right;"> pvalue.Alpha_8h_vs_VIC_8h </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> MTND2P28 </td> <td style="text-align:right;"> 0.2222321 </td> </tr> <tr> <td style="text-align:left;"> MTATP6P1 </td> <td style="text-align:right;"> 0.1415036 </td> </tr> <tr> <td style="text-align:left;"> LINC01128 </td> <td style="text-align:right;"> 0.7729915 </td> </tr> <tr> <td style="text-align:left;"> SAMD11 </td> <td style="text-align:right;"> 0.0000052 </td> </tr> <tr> <td style="text-align:left;"> NOC2L </td> <td style="text-align:right;"> 0.8024931 </td> </tr> </tbody> </table> --- # 5. SARS-CoV-2 Study We will see in Computer Lab 8B that the choice of `\(p\)`-value adjustment we make will lead to very different numbers of `\(p\)`-values being identified as statistically significant in this covid data set. -- * Note e.g. that the `SAMD11` gene shown on the previous slide, with an initial `\(p\)`-value of `\(0.0000052\)`, **is not** found to be significantly differently affected between the two strains following .teal_style[Bonferroni correction], but **is** following .teal_style[FDR correction]. -- In the context of this SARS-CoV-2 study, correctly identifying the genes which are most statistically significantly differently affected between the strains could help researchers to: -- * better understand the biological impacts of the different strains at different time points post infection -- * develop medication targeted to each strain, which could be more efficacious than a generic medication --- class: center # Final Comments Please note that for a small number of tests (e.g. post-hoc tests conducted following a one-way ANOVA), more conservative corrections which control the FWER are the preferred choice. -- When dealing with Big Data, less conservative corrections such as FDR control are generally preferred. --- # End That concludes our lecture on `\(p\)`-value adjustments for multiple hypothesis testing. -- What to do next: * Before Computer Lab 8B, make sure you are up-to-date with the current assessments.
* Take a look over the .teal_style[Foundational Biology for Analyses of Biological Data] supplement. * If you have any `\(p\)`-value adjustment questions, we can resolve them in the computer labs. --- background-image: url(data:image/png;base64,#computerlab.jpg) background-position: bottom background-size: 75% class: center # See you in the computer labs! --- # References * Benjamini, Y., and Y. Hochberg. 1995. Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. *J. R. Statist. Soc. B* 57 (1): 289–300. * Lee, J. Y., P. A. C. Wing, D. S. Gala, M. Noerenberg, A. I. Järvelin, J. Titlow, X. Zhuang, et al. 2022. Absolute Quantitation of Individual SARS-CoV-2 RNA Molecules Provides a New Paradigm for Infection Dynamics and Variant Differences. *Elife* 11: e74153. [https://doi.org/10.7554/eLife.74153](https://doi.org/10.7554/eLife.74153). * Posit team. 2023. *RStudio: Integrated Development Environment for R*. Posit Software, PBC, Boston, MA. [http://www.posit.co/](http://www.posit.co/). * R Core Team. 2021. *R: A Language and Environment for Statistical Computing*. Vienna, Austria: R Foundation for Statistical Computing. [https://www.R-project.org/](https://www.R-project.org/). * Storey, J. D., and R. Tibshirani. 2003. Statistical Significance for Genomewide Studies. *PNAS* 100 (16): 9440–45. * The jamovi project. 2022. *Jamovi [Computer Software]*. [https://www.jamovi.org](https://www.jamovi.org). --- class: middle <font color = "grey"> These notes have been prepared by Amanda Shaker and Rupert Kuveke. The copyright for the material in these notes resides with the authors named above, with the Department of Mathematical and Physical Sciences and with La Trobe University. Copyright in this work is vested in La Trobe University including all La Trobe University branding and naming. Unless otherwise stated, material within this work is licensed under a Creative Commons Attribution-Non Commercial-Non Derivatives License <a href = "https://creativecommons.org/licenses/by-nc-nd/4.0/" target="_blank"> BY-NC-ND. </a> </font>