Science/Health Science/Data Science Module
Topic 8B: Big Data III (Gene Expression Analysis)
Welcome to the eighth computer lab for the Science, Health Science and Data Science modules.
In this computer lab we will conduct a simplified gene expression analysis.
This computer lab is designed to run alongside the content in the Introduction to Bioinformatics in R supplement. You are not required to have a background in biology or genomics in order to complete this lab. However, some terms from these disciplines are unavoidable when conducting bioinformatics analyses - the material in this supplement provides all the background information you will need.
By the end of this lab, you should be able to conduct some of the main statistical steps in a gene expression analysis. Let’s get started!
Preparations
Before we proceed, please make sure you have read the content in the Introduction to Bioinformatics in R supplement - if you haven’t, these lab questions may not make sense. It may also be helpful to keep this content open in a separate tab while you work through the lab material.
Thale Cress Gene Count Data
In this lab, we will assess gene count data from the thale cress plant (Arabidopsis thaliana).
The analysis steps we cover will be equally applicable to gene expression data for other organisms, e.g. humans.
An overview of the thale cress data is provided in Section 3 of the Introduction to Bioinformatics in R supplement.
In summary, we have some RNA-Seq gene expression data, which has been recorded at different times over the germination and post-germination period of thale cress seeds.
This data contains recordings for two time points, denoted X24hSL
and X48hSL
.
These are, respectively, the time points for thale cress seeds 24 hours and 48 hours after exposure to sunlight, following a stratification process (whereby the seeds are encouraged to germinate).
For each time point, we have recordings for three replicates - hence the _1
’s, _2
’s and _3
’s appended to the X24hSL
and X48hSL
column names.
We have actually carried out some filtering and pre-processing on the original thale cress gene count data set, to save you time (and possibly a headache), and to ensure that our focus in this lab can be on the final gene expression analysis and \(p\)-value assessments.
The pre-prepared thale cress data is stored in two files on LMS:
thale_cress_gene_count_data.RDS
contains filtered data on the gene counts for our different replicates
thale_cress_gene_exp_data.RDS
contains pre-processed gene count data, ready for our gene expression analysis
Pleased download these files now, and store them in your working directory.
These files have the extension .RDS
, which you might not have previously encountered. To read .RDS
files into R, we need to use the readRDS()
function. The code below shows how one would read in a .RDS
file called example.RDS
.
Note: The file must be stored in your current working directory in order for this code to work.
example_data <- readRDS("example.RDS")
Using this code as a guide:
- Store the
thale_cress_gene_count_data.RDS
data in an object called thale_cress_gene_counts
- Store the
thale_cress_gene_exp_data.RDS
data in an object called thale_cress_data
Gene Data Overview
Before we proceed with our gene expression analysis, let’s spend a couple of minutes familiarising ourselves with our two data sets.
Run the following R code to obtain an overview of the gene count data stored in the thale_cress_gene_counts
object:
head(thale_cress_gene_counts$counts)
This should produce the table of gene counts for the first few genes, identical to the table provided in Section 3 of the Introduction to Bioinformatics in R supplement
The thale_cress_gene_counts
object is actually a type of list
, and various information can be extracted from the object using the $
symbol - e.g. thale_cress_gene_counts$counts
. However, the counts data is the most relevant for us - the other information relates to pre-processing steps, which we don’t have to worry about.
Run the following R code to obtain a summary of the gene samples stored in the thale_cress_data
object:
thale_cress_data$samples
Here, the lib.size
column tells us the library sizes for each replicate (i.e. the total number of read counts for all genes in each replicate).
The norm.factors
column tells us the normalisation values for each replicate.
Originally, all the values in this column were 1
’s, meaning that each library was treated as equal, despite the library sizes being different.
As part of our pre-processing, we carried out a normalisation process, like the standardisation process covered in section 4.2 of Topic 3. As a result, the norm.factors
are now all different to 1, with some replicates having smaller weights (norm.factors < 1
) and some replicates having larger weights (norm.factors > 1
).
Does any one replicate have very different qualities to the others?
Identifying differentially expressed genes
Using our pre-processed data, we can try to identify the specific genes that are differentially expressed (DE), between the time points X24hSL
and X48hSL
.
As part of our pre-processing, we computed some \(p\)-values for each gene, using a test known as a Likelihood Ratio Test
(which is a little bit like a \(t\)-test).
If a gene has a small \(p\)-value, then this means it is likely to be differentially expressed between the time points X24hSL
and X48hSL
.
Run the R code below to see a table of some results for the first few genes:
head(thale_cress_data$table)
Don’t worry about the first few columns - just focus on the PValue
column, which contains the computed \(p\)-values.
Based on their \(p\)-values, what can you conclude about:
- Gene 1 (
AT1G01010
),
- Gene 3 (
AT1G01030
), and
- Gene 5 (
AT1G01050
)?
Since there are so many genes, and a \(p\)-value has been computed for each gene, we need to adjust our \(p\)-values to take into account the fact that multiple simultaneous tests have been performed. In other words, we need to use the p.adjust
function that we introduced in Computer Lab 7B!
The p.adjust
function has actually been built in to the topTags
function from the edgeR
package, via the argument adjust.method
.
Take a look at the partially completed R code below:
topTags(thale_cress_data, adjust.method = "...", n = ...)$table
Here, the ...
in adjust.method = "..."
can take the same variety of method specifications used in the p.adjust
function.
The n = ...
argument allows us to specify the maximum number of genes to include in our output (genes with the smallest \(p\)-values are selected first). The default setting is 10, which is a little low for us.
Let’s try a few options now.
Using the code in 3.2 as a base, specify that Bonferroni correction should be performed on the \(p\)-values in the thale_cress_data
object, and that results for the top 20,000 genes should be included.
Store your results in the object thale_cress_bonferroni
.
Hint: You might like to run the command ?p.adjust
and check the help file, if you can’t remember the argument options for the p.adjust
function.
As an alternative, specify that false discovery rate correction should be performed on the \(p\)-values in the thale_cress_data
object, and that results for the top 20,000 genes should be included.
Store your results in the object thale_cress_fdr
.
Since we are assessing so many genes (the top 20,000), it might be helpful to filter out genes with higher \(p\)-values.
The argument p.value
in the topTags
function can be used to specify a \(p\)-value cut-off - any genes with \(p\)-values above the specified value are removed from the data set.
Try updating your thale_cress_bonferroni
and thale_cress_fdr
objects, by re-running your topTags
code with a cut-off value of 0.01 specified via the p.value
argument.
Call your new objects thale_cress_bonferroni_filtered
and thale_cress_fdr_filtered
respectively.
Use the dim()
function to determine the overall number of genes remaining in your updated thale_cress_bonferroni_filtered
and thale_cress_fdr_filtered
objects.
Compare the results - which p.adjust
method do you prefer?
Volcano Plots
To visualise the gene information stored in our objects, we can use a new type of plot called a Volcano Plot
.
A volcano plot is a bit like a scatter plot, but with a few key differences.
For our data:
- The plot is centred at 0 on the x-axis.
- Each point represents a gene.
- A point close to
x=0
represents a gene that is probably not differentially expressed.
- Points far from
x=0
represent genes that are probably differentially expressed.
- Points far from
y=0
represent genes with small p-values.
Therefore, points that are far from both x=0
and y=0
most likely represent statistically significantly differentially expressed genes.
To create a volcano plot in R, we can just use the plot
function - the key is in the data we provide to this function. Run the R code below and take a look at the output.
plot(thale_cress_data$table$logFC,
-log10(thale_cress_data$table$PValue),
pch = 20, cex = 0.5, col = "blue",
main = "Volcano Plot for all Thale Cress Genes
for the timepoints X24hSL and X48hSL",
ylab="-log10(p-value)", xlab="logFC")
Note: Here the logFC
x-axis label refers to the log of the ratio of the gene count measurements at the two time points (the log-fold-change). A value close to 0 means that the measurement hasn’t changed much in the 24 hour period, while a larger value means there has been a larger change (either positive or negative).
Using the code shown above in 4.1 as a base, create volcano plots for your thale_cress_bonferroni_filtered
and thale_cress_fdr_filtered
objects.
Compare your volcano plots with each other, and with the volcano plot shown in 4.1.
Do you think that the \(p\)-value adjustments you performed were beneficial in identifying the genes which were most likely to be differentially expressed between the time points X24hSL
and X48hSL
?
Looking at your two volcano plots, has your preference between the two p.adjust
methods you used changed? Why/why not?
Great work, that’s everything for today!
That was a lot to cover in conjunction with the supplement material, but hopefully this lab has given you a taste of the world of Bioinformatics, and the integral role of statistics within this world.
By conducting this gene expression analysis, we have been able to identify genes that are statistically significantly differentially expressed between two seed germination time points for the thale cress plant.
While this might not seem immediately ground-breaking, it’s important to note that we could use the same procedure to identify, for example, genes that are statistically significantly differentially expressed between a healthy human genome, and the genome of a human with a certain illness or disease. This has immediate practical implications, and could for example lead to the development of drugs which target those genes in an effort to reduce the suffering of individuals with that condition (or possibly even cure the condition).
References
Bioconductor.org. 2021.
“Bioconductor: Open Source Software for Bioinformatics.” 2021.
https://bioconductor.org/.
Chen, Y., D. J. McCarthy, M. Ritchie, M. Robinson, and G. K. Smyth. 2018. edgeR: Differential Expression Analysis of Digital Gene Expression Data User’s Guide.
McCarthy, D. J., Y. Chen, and G. K. Smyth. 2012. “Differential Expression Analysis of Multifactor RNA-Seq Experiments with Respect to Biological Variation.” Nucleic Acids Research 40 (10): 4288–97.
Robinson, M. D., D. J. McCarthy, and G. K. Smyth. 2010. “edgeR: A Bioconductor Package for Differential Expression Analysis of Digital Gene Expression Data.” Bioinformatics 26 (1): 139–40.
These notes have been prepared by Rupert Kuveke. The copyright for the material in these notes resides with the author named above, with the Department of Mathematics and Statistics and with La Trobe University. Copyright in this work is vested in La Trobe University including all La Trobe University branding and naming. Unless otherwise stated, material within this work is licensed under a Creative Commons Attribution-Non Commercial-Non Derivatives License
BY-NC-ND.
---
title: "STM1001: Computer Lab 8B"
output:
  bookdown::html_document2: 
    toc: true
    toc_float: true
    code_download: true
    theme: readable
    code_folding: show
bibliography: STM1001_DS_CL_references.bib 
link-citations: yes
---

<style>
#TOC {
  background: url("https://www.latrobe.edu.au/_media/la-trobe-api/v5/img/logo.svg");
  background-size: contain;
  padding-top: 80px !important;
  background-repeat: no-repeat;
}
</style>

### Science/Health Science/Data Science Module {-}

### Topic 8B: Big Data III (Gene Expression Analysis) {-}

<br>

Welcome to the eighth computer lab for the Science, Health Science and Data Science modules.
In this computer lab we will conduct a simplified gene expression analysis.

This computer lab is designed to run alongside the content in the [Introduction to Bioinformatics in R supplement](https://bookdown.org/rehk/stm1001_dsm_introduction_to_bioinformatics_in_r/). You are not required to have a background in biology or genomics in order to complete this lab. However, some terms from these disciplines are unavoidable when conducting bioinformatics analyses - the material in this supplement provides all the background information you will need. 

By the end of this lab, you should be able to conduct some of the main statistical steps in a gene expression analysis. Let's get started!

# Preparations {#prep}

Before we proceed, please make sure you have read the content in the [Introduction to Bioinformatics in R supplement](https://bookdown.org/rehk/stm1001_dsm_introduction_to_bioinformatics_in_r/) - if you haven't, these lab questions may not make sense. It may also be helpful to keep this content open in a separate tab while you work through the lab material.
 
## Bioinformatics Packages 
 
In previous computer labs, we have downloaded R packages from CRAN, using the command `install.packages(...)`.
However, most Bioinformatics-specific R packages are not hosted on CRAN, but rather are hosted on the Bioconductor website [@Bioconductor].

In this computer lab, we will need to use one such package, `edgeR` ^[see @edgeRbase, @edgeR, and @edgeRpackage]. To install this package, we will need to follow a slightly different approach to usual.

To begin, we shall install the `BiocManager` package (using the normal installation process), which we will then use to install the Bioconductor `edgeR` package. Run the code below to install these packages, and to load the `edgeR` package.

```{r eval = F, include = T}
install.packages("BiocManager")
BiocManager::install(version = "3.14")
BiocManager::install("edgeR")

library(edgeR)
```

```{r eval = T, include = F}
library(edgeR)
```

*Note: Some additional dependency packages, namely `limma`, `locfit` and `Rcpp` may be installed as part of the `edgeR` installation process. The entire installation should only take a few minutes. If you are asked to update existing packages, with the line `Update all/some/none? [a/s/n]: ` in the command console, type `n` and hit enter.*

## Thale Cress Gene Count Data

In this lab, we will assess gene count data from the thale cress plant (*Arabidopsis thaliana*).
The analysis steps we cover will be equally applicable to gene expression data for other organisms, e.g. humans. 

An overview of the thale cress data is provided in [Section 3 of the Introduction to Bioinformatics in R supplement](https://bookdown.org/rehk/stm1001_dsm_introduction_to_bioinformatics_in_r/thale-cress-gene-expression.html).

In summary, we have some RNA-Seq gene expression data, which has been recorded at different times over the germination and post-germination period of thale cress seeds.

This data contains recordings for two time points, denoted `X24hSL` and `X48hSL`. 

These are, respectively, the time points for thale cress seeds 24 hours and 48 hours after exposure to sunlight, following a stratification process (whereby the seeds are encouraged to germinate).

For each time point, we have recordings for three replicates - hence the `_1`'s, `_2`'s and `_3`'s appended to the `X24hSL` and `X48hSL` column names.

### 

We have actually carried out some filtering and pre-processing on the original thale cress gene count data set, to save you time (and possibly a headache), and to ensure that our focus in this lab can be on the final gene expression analysis and $p$-value assessments.  

The pre-prepared thale cress data is stored in two files on LMS: 

* `thale_cress_gene_count_data.RDS` contains filtered data on the gene counts for our different replicates
* `thale_cress_gene_exp_data.RDS` contains pre-processed gene count data, ready for our gene expression analysis <!-- $^{\dagger}$-->

Pleased download these files now, and store them in your working directory.

<!-- $^{\dagger}$ *If you are interested in the pre-processing steps we have conducted, some optional reading is provided in [Section 4 of the Introduction to Bioinformatics in R supplement](https://bookdown.org/rehk/stm1001_dsm_introduction_to_bioinformatics_in_r/thale-cress-gene-expression.html)* -->

### 

These files have the extension `.RDS`, which you might not have previously encountered. To read `.RDS` files into R, we need to use the `readRDS()` function. The code below shows how one would read in a `.RDS` file called `example.RDS`.

*Note: The file must be stored in your current working directory in order for this code to work.*

``` {r, echo = T, eval = F}
example_data <- readRDS("example.RDS")
```

Using this code as a guide:

* Store the `thale_cress_gene_count_data.RDS` data in an object called `thale_cress_gene_counts`
* Store the `thale_cress_gene_exp_data.RDS` data in an object called `thale_cress_data`

``` {r, eval = T, include = F}
thale_cress_gene_counts <- readRDS("data/thale_cress_gene_count_data.RDS")
thale_cress_data <- readRDS("data/thale_cress_gene_exp_data.RDS")
```


# Gene Data Overview {#summary}

Before we proceed with our gene expression analysis, let's spend a couple of minutes familiarising ourselves with our two data sets.

##

Run the following R code to obtain an overview of the gene count data stored in the `thale_cress_gene_counts` object:

``` {r, echo = T, eval = F}
head(thale_cress_gene_counts$counts)
```

This should produce the table of gene counts for the first few genes, identical to the table provided in [Section 3 of the Introduction to Bioinformatics in R supplement](https://bookdown.org/rehk/stm1001_dsm_introduction_to_bioinformatics_in_r/thale-cress-gene-expression.html)

###

The `thale_cress_gene_counts` object is actually a type of `list`, and various information can be extracted from the object using the `$` symbol - e.g. `thale_cress_gene_counts$counts`. However, the counts data is the most relevant for us - the other information relates to pre-processing steps, which we don't have to worry about.

##

Run the following R code to obtain a summary of the gene samples stored in the `thale_cress_data` object:

``` {r, echo = T, eval = F}
thale_cress_data$samples
```

Here, the `lib.size` column tells us the library sizes for each replicate (i.e. the total number of read counts for all genes in each replicate). 
The `norm.factors` column tells us the normalisation values for each replicate. 

Originally, all the values in this column were `1`'s, meaning that each library was treated as equal, despite the library sizes being different.

As part of our pre-processing, we carried out a normalisation process, like the standardisation process covered in [section 4.2 of Topic 3](https://bookdown.org/a_shaker/STM1001_Topic_3/4-2-standardisation.html). As a result, the `norm.factors` are now all different to 1, with some replicates having smaller weights (`norm.factors < 1`) and some replicates having larger weights (`norm.factors > 1`).

###

Does any one replicate have very different qualities to the others? 

#	Identifying differentially expressed genes {#degenes}

Using our pre-processed data, we can try to identify the specific genes that are differentially expressed (DE), between the time points `X24hSL` and `X48hSL`.

## 

As part of our pre-processing, we computed some $p$-values for each gene, using a test known as a `Likelihood Ratio Test` (which is a little bit like a $t$-test).

If a gene has a small $p$-value, then this means it is likely to be differentially expressed between the time points `X24hSL` and `X48hSL`.

Run the R code below to see a table of some results for the first few genes:

``` {r, echo = T, eval = F}
head(thale_cress_data$table)
```

Don't worry about the first few columns - just focus on the `PValue` column, which contains the computed $p$-values.

Based on their $p$-values, what can you conclude about: 

* Gene 1 (`AT1G01010`), 
* Gene 3 (`AT1G01030`), and
* Gene 5 (`AT1G01050`)?

## {#padjust}

Since there are so many genes, and a $p$-value has been computed for each gene, we need to adjust our $p$-values to take into account the fact that multiple simultaneous tests have been performed. In other words, we need to use the `p.adjust` function that we introduced in [Computer Lab 7B](https://rpubs.com/LTU_STM1001/SMDSMCL7)!

The `p.adjust` function has actually been built in to the `topTags` function from the `edgeR` package, via the argument `adjust.method`.

Take a look at the partially completed R code below:

``` {r, echo = T, eval = F}
topTags(thale_cress_data, adjust.method = "...", n = ...)$table
```

Here, the  `...` in `adjust.method = "..."` can take the same variety of method specifications used in the `p.adjust` function.

The `n = ...` argument allows us to specify the maximum number of genes to include in our output (genes with the smallest $p$-values are selected first). The default setting is 10, which is a little low for us.

Let's try a few options now.

###

Using the code in \@ref(padjust) as a base, specify that Bonferroni correction should be performed on the $p$-values in the `thale_cress_data` object, and that results for the top 20,000 genes should be included.

Store your results in the object `thale_cress_bonferroni`.

*Hint: You might like to run the command `?p.adjust` and check the help file, if you can't remember the argument options for the `p.adjust` function.*

###

As an alternative, specify that false discovery rate correction should be performed on the $p$-values in the `thale_cress_data` object, and that results for the top 20,000 genes should be included.

Store your results in the object `thale_cress_fdr`.

## {#filter}

Since we are assessing so many genes (the top 20,000), it might be helpful to filter out genes with higher $p$-values. 

The argument `p.value` in the `topTags` function can be used to specify a $p$-value cut-off - any genes with $p$-values above the specified value are removed from the data set.

Try updating your `thale_cress_bonferroni` and `thale_cress_fdr` objects, by re-running your `topTags` code with a cut-off value of 0.01 specified via the `p.value` argument.

Call your new objects  `thale_cress_bonferroni_filtered` and `thale_cress_fdr_filtered` respectively.

###

Use the `dim()` function to determine the overall number of genes remaining in your updated `thale_cress_bonferroni_filtered` and `thale_cress_fdr_filtered` objects.

Compare the results - which `p.adjust` method do you prefer?

# Volcano Plots

To visualise the gene information stored in our objects, we can use a new type of plot called a `Volcano Plot`.
A volcano plot is a bit like a scatter plot, but with a few key differences. 

For our data:

* The plot is centred at 0 on the x-axis.
* Each point represents a gene.
* A point close to `x=0` represents a gene that is probably not differentially expressed.
* Points far from `x=0` represent genes that are probably differentially expressed.
* Points far from `y=0` represent genes with small p-values.

Therefore, points that are far from both `x=0` and `y=0` most likely represent statistically significantly differentially expressed genes.

## {#volcanoplot}

To create a volcano plot in R, we can just use the `plot` function - the key is in the data we provide to this function. Run the R code below and take a look at the output.

``` {r, eval = F, echo = T}
plot(thale_cress_data$table$logFC, 
     -log10(thale_cress_data$table$PValue), 
     pch = 20, cex = 0.5, col = "blue",
     main = "Volcano Plot for all Thale Cress Genes
     for the timepoints X24hSL and X48hSL",
     ylab="-log10(p-value)", xlab="logFC")
```

*Note: Here the `logFC` x-axis label refers to the log of the ratio of the gene count measurements at the two time points (the log-fold-change). A value close to 0 means that the measurement hasn't changed much in the 24 hour period, while a larger value means there has been a larger change (either positive or negative).*

##

Using the code shown above in \@ref(volcanoplot) as a base, create volcano plots for your `thale_cress_bonferroni_filtered` and `thale_cress_fdr_filtered` objects.

##

Compare your volcano plots with each other, and with the volcano plot shown in \@ref(volcanoplot).

Do you think that the $p$-value adjustments you performed were beneficial in identifying the genes which were most likely to be differentially expressed between the time points `X24hSL` and `X48hSL`?

Looking at your two volcano plots, has your preference between the two `p.adjust` methods you used changed? Why/why not?

<br>

#### Great work, that's everything for today! #### {-}

That was a lot to cover in conjunction with the supplement material, but hopefully this lab has given you a taste of the world of Bioinformatics, and the integral role of statistics within this world.

By conducting this gene expression analysis, we have been able to identify genes that are statistically significantly differentially expressed between two seed germination time points for the thale cress plant. 

While this might not seem immediately ground-breaking, it's important to note that we could use the same procedure to identify, for example, genes that are statistically significantly differentially expressed between a healthy human genome, and the genome of a human with a certain illness or disease. This has immediate practical implications, and could for example lead to the development of drugs which target those genes in an effort to reduce the suffering of individuals with that condition (or possibly even cure the condition).



<br>

# References {- #Ref}
<div id="refs"></div>

<br>

<font color = "grey">
These notes have been prepared by Rupert Kuveke. The copyright for the material in these notes resides with the author named above, with the Department of Mathematics and Statistics and with La Trobe University. Copyright in this work is vested in La Trobe University including all La Trobe University branding and naming. Unless otherwise stated, material within this work is licensed under a Creative Commons Attribution-Non Commercial-Non Derivatives License 
<a href = "https://creativecommons.org/licenses/by-nc-nd/4.0/CC" target="_blank"> BY-NC-ND. </a>
</font>