Example R code solutions for the Week 8 Science/Health Science/Data Science Computer Lab, which uses thale cress gene count data collected by Narsai et al. (2017), and functions from the Bioconductor (Bioconductor.org 2021) edgeR
R package (see Robinson, McCarthy, and Smyth (2010), McCarthy, Chen, and Smyth (2012), and Chen et al. (2018)), are presented below.
This computer lab is designed to run alongside the content in the Introduction to Bioinformatics in R supplement. It might be helpful to have this material open as you look through these solutions.
The following R code should have been run:
install.packages("BiocManager")
BiocManager::install(version = "3.14")
BiocManager::install("edgeR")
library(edgeR)
No answer required.
The files
thale_cress_gene_count_data.RDS
andthale_cress_gene_exp_data.RDS
should have been downloaded from the LMS, and stored in your working directory.
Example R code is provided below:
thale_cress_gene_counts <- readRDS("thale_cress_gene_count_data.RDS")
thale_cress_data <- readRDS("thale_cress_gene_exp_data.RDS")
Example R code is provided below:
head(thale_cress_gene_counts$counts)
## X24hSL_1 X24hSL_2 X24hSL_3 X48hSL_1 X48hSL_2 X48hSL_3
## AT1G01010 282 136 315 646 622 610
## AT1G01020 1199 830 1341 768 769 888
## AT1G01030 264 79 267 266 218 333
## AT1G01040 1594 416 905 1640 1497 1893
## AT1G01050 4650 2976 4684 5350 5385 6000
## AT1G01060 8464 3007 8813 5066 5098 5923
No answer required.
thale_cress_data$samples
## group lib.size norm.factors
## X24hSL_1 X24hSL 68573580 1.0061746
## X24hSL_2 X24hSL 35899251 0.9320893
## X24hSL_3 X24hSL 68625532 0.9933386
## X48hSL_1 X48hSL 67286822 1.0320906
## X48hSL_2 X48hSL 66044497 1.0141724
## X48hSL_3 X48hSL 75747390 1.0255154
We observe that the second replicate for the X24hSL
timepoint, namely X24hSL_2
, has a much smaller library size than the other replicates. We can see that this has been taken into account during the normalisation process, as its corresponding norm.factor
is less than 1 (at 0.93, it is the smallest of the 6), signifying less weight is attached to this replicates’ results compared to the other replicates.
head(thale_cress_data$table)
## logFC logCPM LR PValue
## AT1G01010 1.04459593 2.723652 48.8911978 2.705622e-12
## AT1G01020 -0.86052305 3.994456 28.3069659 1.035228e-07
## AT1G01030 0.15519953 1.864742 0.4722892 4.919359e-01
## AT1G01040 0.52441104 4.317075 4.8488284 2.766492e-02
## AT1G01050 0.06154996 6.259886 0.1681752 6.817387e-01
## AT1G01060 -0.60201039 6.563097 13.3110537 2.638461e-04
Based on the output above, we conclude that:
AT1G01010
) has a very low \(p\)-value, and therefore is likely to be differentially expressed between the two time points.AT1G01030
) and 5 (AT1G01050
) have large \(p\)-values, and therefore are unlikely to be differentially expressed between the two time points.No answer required.
Example R code is provided below:
thale_cress_bonferroni <- topTags(thale_cress_data,
adjust.method = "bonferroni",
n = 20000)$table
Example R code is provided below:
thale_cress_fdr <- topTags(thale_cress_data,
adjust.method = "fdr",
n = 20000)$table
Example R code is provided below:
thale_cress_bonferroni_filtered <- topTags(thale_cress_data,
adjust.method = "bonferroni",
n = 20000, p.value = 0.01)$table
thale_cress_fdr_filtered <- topTags(thale_cress_data,
adjust.method = "fdr",
n = 20000, p.value = 0.01)$table
dim(thale_cress_bonferroni_filtered)
## [1] 6689 5
dim(thale_cress_fdr_filtered)
## [1] 12063 5
Using a p.value
cut-off of 0.01 and Bonferroni correction, we obtain 6689 significant genes.
In contrast, using a p.value
cut-off of 0.01 and false discovery rate correction, we obtain 12063 significant genes - almost double the Bonferroni correction results!
We can see that the FDR approach has been more lenient. As we learnt in Computer Lab 7B, controlling the false discovery rate is more powerful than controlling the family-wise error rate (via the Bonferroni correction approach). Therefore, we would conclude that in general, the FDR approach is preferable. If we wanted a smaller number of genes, we could simply specify n to be smaller than 12063, and this would help us identify the genes with the smallest \(p\)-values.
The R code below should have been run:
plot(thale_cress_data$table$logFC,
-log10(thale_cress_data$table$PValue),
pch = 20, cex = 0.5, col = "blue",
main = "Volcano Plot for all Thale Cress Genes
for the timepoints X24hSL and X48hSL",
ylab="-log10(p-value)", xlab="logFC")
Example R code is provided below:
plot(thale_cress_bonferroni_filtered$logFC,
-log10(thale_cress_bonferroni_filtered$PValue),
pch = 20, cex = 0.5, col = "blue",
main = "Volcano Plot for filtered Thale Cress Genes
for the timepoints X24hSL and X48hSL,
after Bonferroni correction",
ylab="-log10(p-value)", xlab="logFC")
plot(thale_cress_fdr_filtered$logFC,
-log10(thale_cress_fdr_filtered$PValue),
pch = 20, cex = 0.5, col = "blue",
main = "Volcano Plot for filtered Thale Cress Genes
for the timepoints X24hSL and X48hSL,
after FDR correction",
ylab="-log10(p-value)", xlab="logFC")
We observe that the volcano plots using the adjusted \(p\)-values have less observations around the point \((0, 0)\). Such observations would be for genes that exhibited small log-fold-changes between the time points - these genes would therefore be unlikely to be differentially expressed between the two time points.
As a result, we can conclude that the \(p\)-value adjustments we performed were beneficial in identifying the genes most likely to be differentially expressed between the time points X24hSL
and X48hSL
.
There are some differences between the Bonferroni-corrected and FDR-corrected volcano plots. The Bonferroni-corrected volcano plot has less observations, particularly for values around x=0
. However, there are also noticeably fewer points in the region x=2
to x=5
, in comparison to the FDR-corrected volcano plot.
The FDR-corrected volcano plot also exhibits slightly more of a curve away from the point x=0
on the x-axis when looking at values close to y=0
on the y-axis - it seems that this approach is better at filtering out genes that have larger \(p\)-values despite having medium log-fold-change values.
Overall these results would not change our preference between the Bonferroni or FDR approach. As mentioned in 3.3.1 above, if we thought the FDR approach resulted in too few genes being filtered out, we could simply reduce the n
value.
These notes have been prepared by Rupert Kuveke. The copyright for the material in these notes resides with the author named above, with the Department of Mathematics and Statistics and with La Trobe University. Copyright in this work is vested in La Trobe University including all La Trobe University branding and naming. Unless otherwise stated, material within this work is licensed under a Creative Commons Attribution-Non Commercial-Non Derivatives License BY-NC-ND.