Welcome to the eighth computer lab for the Science, Health and Data Science streams for STM1001.
In this computer lab we will look at methods for adjusting \(p\)-values obtained when conducting multiple inferences.
This computer lab introduces some biology terminology. Therefore, it is recommended that you read over section 2 of the Foundational Biology for Analyses of Biological Data supplement. The material in this supplement provides all the background information you will need.
By the end of this lab, you should understand how to adjust \(p\)-values to counteract the effects of conducting multiple tests simultaneously. Let’s get started!
1 Introduction
🏡 By now, you should be familiar with the concept of a hypothesis test. You should also feel comfortable conducting a statistical test that involves computing a \(p\)-value to compare a null hypothesis against an alternative hypothesis.
Note: A simple example would be an independent samples \(t\)-test to assess the difference in the mean value of a variable between two groups.
In many situations however, we may be interested in performing a hypothesis test on multiple features simultaneously. For example, in analyses of biological data, it is now common for thousands (or hundreds of thousands) of features to be simultaneously tested against a null hypothesis (where the expectation is that a number of features will be statistically significant).
Unfortunately, when we conduct multiple tests, the probability of obtaining a false positive (aka Type I error) will increase, meaning that the original significance level of \(\alpha\) designated for a single test does not hold.
Type I error probability example
🏡 As a simple example, suppose that we set \(\alpha = 0.05\) for a single test. Therefore, the probability of not observing a Type I error for this test is \(1-\alpha = 0.95\).
So far, so good.
However, if we use this level of significance and carry out 10 tests, then the probability of not observing a Type I error in any of these 10 tests becomes \((1 - \alpha)^{10} \approx 0.6\).
In other words, by carrying out multiple testing, our probability of observing at least one Type I error has increased from \(5\%\) to \(40\%\)! As the number of tests increases, this error rate will continue to rise.
2 Adjusting \(p\)-values
💻 Fortunately, there are several statistical methods we can use to adjust \(p\)-values obtained from performing multiple tests.
We have already come across one such adjustment in the Topic 7 Readings: the Tukey adjustment, which was used to carry out post-hoc tests after carrying out a One-way ANOVA.
In this computer lab, we will learn about other types of adjustments which may be more appropriate in different settings.
Most of these adjustments can be applied using the built-in R function p.adjust, which can be accessed within both RStudio and jamovi. Let’s take a look.
2.1
💻
If you are in the Data Science stream, open RStudio and open a new script file.
If you are in the Science/Health stream, open jamovi and open the Rj editor module (refer to Computer Lab 4 for details).
The following video demonstrates how to carry out \(p\)-value adjustments using R code in jamovi. If you are using RStudio rather than jamovi, you can of course use the same R code shown in the video.
If you are using jamovi, you will need to use the Rj Editor - see the video inComputer Lab 4if you need a refresher on how to install this.
2.2
💻 Suppose that we have compared gene expression levels between two groups of individuals (cases and controls) for 10 key genes of interest. As a result, we have obtained a set of 10 \(p\)-values (one for each gene). These \(p\)-values are presented below:
💻 Recall that we say a gene is differentially expressed if it exhibits a large change in expression between two groups (e.g. between cases and controls).
If we assessed the \(p\)-values above, how many genes would we find to have statistically significant differential expression between the cases and controls, based on an \(\alpha\) value of \(0.05\)?
🎧 Online students
💬 Enter your answer next to the question on the shared google doc.
2.4
💻 A simple way to adjust our initial \(p\)-values is to multiply each \(p\)-value by the overall number of \(p\)-values we have (with any resultant values over 1 scaled back to 1). This is known as the Bonferroni correction method.
Run the code below to conduct this adjustment:
p.adjust(base_p_values, method = "bonferroni")
2.5
💻 Assessing the adjusted \(p\)-values, how many of the genes remain statistically significantly differentially expressed, at the \(\alpha = 0.05\) level of significance (i.e., for an \(\alpha = 0.05\) threshold)?
🎧 Online students
💬 Enter your answer next to the question on the shared google doc.
2.6
💻 As you will have noticed, the Bonferroni correction is quite conservative, and in practice can often lead to rejecting true positives (especially when dealing with larger numbers of tests).
This is because the adjustment method focuses on controlling the probability of making one or more Type I errors when conducting multiple tests. This probability is known as the family-wise error rate (FWER).
The p.adjust function contains options for several other methods which focus on controlling the FWER.
These methods include:
holm
hochberg
hommel
Obtain adjusted \(p\)-values using these three methods, just as you did above for the Bonferroni correction approach used in 2.4. Compare and contrast your results, at the 5% level of significance (i.e., for an \(\alpha = 0.05\) threshold).
Hint: Use the code below as a guide:
# Here we use the `holm` method to compute the number of significant results
sum(p.adjust(base_p_values, method = "holm") < 0.05)
# the < 0.05 part checks if each stored value is less than 0.05.
# If this is true, a value of TRUE is recorded, otherwise a FALSE is recorded
# TRUE is read as 1, while FALSE is read as 0.
# Hence, by taking the sum of these results,
# we obtain the number of statistically significant p-values
🎧 Online students
💬 Volunteer to share your screen and explain your answers to this question.
2.7
💻 Repeat 2.6 using an 1% level of significance (i.e. using a \(\alpha = 0.01\) threshold for determining statistical significance).
Hint: Use the code below as a guide:
# Here we use the `holm` method to compute the number of significant results
sum(p.adjust(base_p_values, method = "holm") < 0.01)
# the < 0.01 part checks if each stored value is less than 0.01.
# If this is true, a value of TRUE is recorded, otherwise a FALSE is recorded
# TRUE is read as 1, while FALSE is read as 0.
# Hence, by taking the sum of these results,
# we obtain the number of statistically significant p-values
3 False Discovery Rate
💻 A more flexible and more powerful alternative to controlling the FWER is to control the False Discovery Rate (FDR) when conducting multiple tests. When dealing with Big Data, using the FDR is generally preferred over more conservative adjustments such as Bonferonni correction. FDR control is also typically favoured over FWER control in biological studies involving Big Data1.
The FDR can be defined as the expected proportion of errors (i.e. false discoveries) among the rejected hypotheses2. Controlling the FDR helps to reduce false positives, but also helps minimise false negatives.
3.1
💻 We can apply FDR correction to \(p\)-values using the R function p.adjust. To do so, specify the method as method = "fdr".
Try this now, on our base_p_values vector of \(p\)-values.
3.2
💻 Compare your FDR-corrected results from 3.1 to your Bonferroni-corrected results in 2.4. What do you notice?
🎧 Online students
💬 Volunteer to share your screen and explain your answers to this question.
4 SARS-CoV-2 Study \(p\)-values
🏡 To conclude this computer lab, let’s take a look at a real-world application of \(p\)-value adjustment, in the context of biological data.
For this example, we will consider a subset of SARS-CoV-2 (Coronavirus) gene expression profiling data, from a study by Lee et al. (2022). Information on the study can be found here and the full data set is publicly available on the NCBI Gene Expression Omnibus website3.
The study considered the VIC and B.1.1.7 (Alpha) strains of SARS-CoV-2, which have different replication and transmission characteristics. Samples were taken of human cells at several time points following infection with either of these two strains. Gene expression differences were then assessed, and \(p\)-values were computed for each gene, with a small \(p\)-value suggesting the gene was statistically significantly differentially expressed at that time point, between the two strains.
4.1
🏡 The data we will assess is stored in the file SARS_data_subset.csv which is available on the LMS.
The table below provides an example of some recorded values for the different variables (the column names) in this data set. Don’t worry about all these names just yet - we will discuss them as we work through this question.
gene_name
gene_biotype
log2FoldChange.Alpha_8h_vs_VIC_8h
pvalue.Alpha_8h_vs_VIC_8h
MTND2P28
unprocessed_pseudogene
0.2355462
0.2222321
LINC01128
lncRNA
0.0289815
0.7729915
NOC2L
protein_coding
-0.0154409
0.8024931
PLEKHN1
protein_coding
0.0922164
0.4848935
4.2
💻 For this question:
If you are using RStudio, please only follow the instructions in the RStudio Users Only section.
If you are using jamovi, please only follow the instructions in the Jamovi Users Only section.
Once these instructions have been completed, you should be able to complete the rest of this lab (questions 4.3 onwards) in either RStudio or jamovi, using the code provided.
RStudio Users Only
Download the file SARS_data_subset.csv from LMS now.
Save the SARS_data_subset.csv file in your working directory, then read the SARS_data_subset.csv file into RStudio, and store it in an object called sars_data.
Hint: Recall that you can use the function read.csv to load a .csv file into RStudio.
jamovi Users Only
Download the file SARS_data_subset.csv from LMS now.
Save the SARS_data_subset.csv file to your personal device, then import the file into jamovi, following the same process used in earlier weeks (see e.g. Section 3.2 of Computer Lab 1).
Run the following code in the Rj Editor:
sars_data <- data
Note: While nothing appears to happen when you run this code in the Rj editor, don’t worry, the data will load in the background.
4.3
💻 The \(p\)-values stored in the sars_data.csv file were computed for gene expression comparisons between the VIC and Alpha strains at 8 hours post-infection.
Run the code below to store the initial \(p\)-values we will be using:
# Store base p-values
covid_base_p_values <- sars_data$pvalue.Alpha_8h_vs_VIC_8h
4.4
💻 Run the code below to determine the number of genes that are considered statistically significantly differentially expressed between the two strains at the 8 hour post-infection mark, for an \(\alpha = 0.05\) level of significance.
# Number of significant p-values at 0.05 level of significance
sum(covid_base_p_values < .05)
Note 1: The number that appears in the RStudio console (or jamovi Results section) is the number of genes considered to be statistically significant.
Note 2: The default level of significance typically used is 0.05, but other values can also be used, such as \(\alpha = 0.01\) or even sometimes \(\alpha = 0.10\). For example, if we wanted to determine the number of \(p\)-values that were statistically significant at the \(\alpha = 0.10\) level of significance, we could use the code in the code chunk below:
# Number of significant p-values at 0.10 level of significance
sum(covid_base_p_values < .10)
🎧 Online students
💬 Enter your answer next to the question on the shared google doc.
4.5
💻 Use the p.adjust function to perform Bonferroni correction on the \(p\)-values stored in the covid_base_p_values object, just like you did in 2.4.
Store your results in a new object called covid_bonferroni_p_values.
Hint: Check the code chunk below if you are unsure how to proceed.
💻 Using your covid_bonferroni_p_values and covid_fdr_p_values objects from 4.5 and 4.6, determine the numbers of genes at the 8 hour post-infection mark that, respectively, are statistically significantly differentially expressed:
at the 0.05 level of significance (Bonferroni), and
for an FDR level of \(0.05\) (FDR).
Compare these results with your initial results in 4.4. What do you observe?
🎧 Online students
💬 Volunteer to share your screen and explain your answers to this question.
4.8 Volcano Plots
💻 In addition to the statistical significance of the gene expression differences, we are also interested in biological significance.
Notice in the sars_data file that there is a column labelled log2FoldChange.Alpha_8h_vs_VIC_8h. The values recorded in this column denote the log ratio of the change in gene expression status between the two strains, for each gene. If there is no difference in the gene expression status, the ratio would be 1, e.g.: \[\frac{4.2 \text{ for VIC strain}}{4.2 \text{ for Alpha strain}} = 1\text{, with }\log_2(1) = 0.\]
So log2FoldChange.Alpha_8h_vs_VIC_8h values close to or equal to 0 suggest there is no biologically significant change in gene expression between the two strains. In contrast, values far from 0 suggest there is a biologically significant change.
In gene expression analyses, a volcano plot can be used to highlight genes which are statistically and biologically significant (these genes are probably the most important to consider in further detail).
A volcano plot is a bit like a scatter plot. In a volcano plot, each point represents a gene.
A point close to 0 on the x-axis (i.e. here a log2FoldChange.Alpha_8h_vs_VIC_8h value close to 0) is not considered biologically significant. On the other hand a point far from 0 on the x-axis suggests that there is a biologically significant change in the expression status of that gene. You can think of large x-axis values for a volcano plot being a bit like large effect size values.
Points high on the y-axis have small \(p\)-values and are therefore considered statistically significant, whereas points close to 0 on the y-axis have large \(p\)-values and are therefore not statistically significant.
4.9
💻 Run the code below to create two volcano plots for our SARS-CoV-2 gene expression profiling data.
plot(sars_data$log2FoldChange.Alpha_8h_vs_VIC_8h,
-log10(covid_base_p_values),
main = "Volcano Plot for SARS-CoV-2
VIC and Alpha strains
8h post-infection, using initial p-values",
xlab = "log FC", ylab = "-log10(pval)",
cex = 0.5, col = as.numeric(covid_base_p_values <= 0.05) + 1)
plot(sars_data$log2FoldChange.Alpha_8h_vs_VIC_8h,
-log10(covid_fdr_p_values),
main = "Volcano Plot for SARS-CoV-2
VIC and Alpha strains
8h post-infection, using FDR-corrected p-values",
xlab = "log FC", ylab = "-log10(pval)",
cex = 0.5, col = as.numeric(covid_fdr_p_values <= 0.05) + 1)
4.10
💻 In the first plot, genes with small \(p\)-values are coloured red.
In the second plot, genes with small FDR-corrected \(p\)-values are coloured red.
All other genes are coloured black.
What differences do you observe between the two plots in 4.9? Do you think the FDR correction has been beneficial?
🎧 Online students
💬 Volunteer to share your screen and explain your answers to this question.
Great job, that’s everything for today.
If you have any questions about current assessments, please ask for help.
References
Benjamini, Y., and Y. Hochberg. 1995. “Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing.”J. R. Statist. Soc. B 57 (1): 289–300.
Lee, J. Y., P. A. C. Wing, D. S. Gala, M. Noerenberg, A. I. Järvelin, J. Titlow, X. Zhuang, et al. 2022. “Absolute Quantitation of Individual SARS-CoV-2 RNA Molecules Provides a New Paradigm for Infection Dynamics and Variant Differences.”Elife 11: e74153. https://doi.org/10.7554/eLife.74153.
Storey, J. D., and R. Tibshirani. 2003. “Statistical Significance for Genomewide Studies.”PNAS 100 (16): 9440–45.
These notes have been prepared by Rupert Kuveke and Amanda Shaker. The copyright for the material in these notes resides with the authors named above, with the Department of Mathematical and Physical Sciences and with La Trobe University. Copyright in this work is vested in La Trobe University including all La Trobe University branding and naming. Unless otherwise stated, material within this work is licensed under a Creative Commons Attribution-Non Commercial-Non Derivatives License
BY-NC-ND.