This document can be reproduced by using the R code and the data (for illustrating purposes and for the excersises) that are available here: https://github.com/isglobal-brge/short_course_omic_R
Test whether known biological functions or processes are over-represented (= enriched) in an experimentally-derived gene list, e.g. a list of differentially expressed (DE) genes. See Goeman and Buehlmann, 2007 for a critical review.
A convenient way to find DE gene sets is by detecting the so-called functional enrichment in two steps. First, we search for DE genes. Second, we verify whether these DE genes belong to the gene set in a proportion that exceeds any expectation of finding that number of genes in that gene set by chance alone. A straightforward way to assess this hypothesis consists of applying a hypergeometric test, which in fact corresponds to the one-tailed Fisher’s exact test, following the so-called urn model.
A hypergeometric test assesses whether a number of successes in a sequence of draws follows a hypergeometric distribution. The hypergeometric distribution is a discrete probability distribution that describes the number of successes in a sequence of draws from a finite population without replacement, just as the binomialdistribution describes the number of successes for draws with replacement. In our context this involves the following quantities:
DE | non-DE | total | |
---|---|---|---|
Insiside Gene Set | k | m-k | m |
Outside Gene Set | n-k | N+k-n-m | N-m |
total | n | N-m | N |
where:
Given these quantities and a random variable \(X\) representing the possible outcomes of a hypergeometric process, the probability of getting exactly \(k\) genes inside a particular gene set is given by
\[\textrm{Pr}(X=k)=\frac{{m\choose k} {N-m\choose n-k}}{{N\choose n}}\] Therefore, the probability of getting \(k\) or more genes inside a particular gene set, as a hypergeometric random process, is
\[\textrm{Pr}(X >= k)=\sum_{x=k}^n \frac{{m\choose x} {N-m\choose n-x}}{{N\choose n}}\] This probability is the quantity being used as surrogate for the enrichment of the functional gene set with our list of DE genes, where the lower the value the less likely that we observe those \(k\) genes inside the given gene set by pure chance and thus the more enriched that this gene set is enriched with our DE genes.
Example: Transcriptomic study, in which 12,671 genes have been tested for differential expression between two sample conditions and 529 genes were found DE. Among the DE genes, 28 are annotated to a specific functional gene set (or pathway), which contains in total 170 genes. This setup corresponds to a 2x2 contingency table,
deTable <- matrix(c(28, 142, 501, 12000),
nrow = 2,
dimnames = list(DE=c("yes","no"),
GeneSet=c("in","out")))
deTable
GeneSet
DE in out
yes 28 501
no 142 12000
where the overlap of 28 genes can be assessed based on the hypergeometric distribution. This corresponds to a one-sided version of Fisher’s exact test, yielding here a significant enrichment.
fisher.test(deTable, alternative = "greater")
Fisher's Exact Test for Count Data
data: deTable
p-value = 4.088e-10
alternative hypothesis: true odds ratio is greater than 1
95 percent confidence interval:
3.226736 Inf
sample estimates:
odds ratio
4.721744
This basic principle is at the foundation of major public and commercial enrichment tools such as DAVID and Pathway Studio.
Although gene set enrichment methods have been primarily developed and applied on transcriptomic data, they have recently been modified, extended and applied also in other fields of genomic and biomedical research. This includes novel approaches for functional enrichment analysis of proteomic and metabolomic data as well as genomic regions and disease phenotypes, Lavallee and Yates, 2016, Chagoyen et al., 2016, McLean et al., 2010, Ried et al., 2012.
EXERCISE: Library tweeDEseqCountData
contains data corresponding to an RNA-seq experiment described in Pickrell et al. (2010). Data correspond to lymphoblastoid cell lines about 69 non-related Nigerian individuals. This information as well as phenotypic data is available as an object of class eSet
called pickrell.eset
that can be loaded after typing:
data(pickrell)
Use limma-voom method to detect those genes that are differentially expressed (DE) between males and females (variale gender
).
Select those genes that are DE at 5% FDR with a minimum fold-change of 1.5
Create a gene set of sex-especific genes by
geneUniverse <- featureNames(pickrell.eset)
geneSex <- unique(intersect(geneUniverse,
c(msYgenes, XiEgenes)))
and test whether the list of DE genes is enriched in that gene set.
This methodology can be extended to the analysis of multiple gene sets available in different databases such as GO or KEGG among others. Next sections illustrate how to perform such analyses using different BioC packages.
Gene sets, pathways & regulatory networks
Gene sets are simple lists of usually functionally related genes without further specification of relationships between genes.
Pathways can be interpreted as specific gene sets, typically representing a group of genes that work together in a biological process. Pathways are commonly divided in metabolic and signaling pathways. Metabolic pathways such as glycolysis represent biochemical substrate conversions by specific enzymes. Signaling pathways such as the MAPK signaling pathway describe signal transduction cascades from receptor proteins to transcription factors, resulting in activation or inhibition of specific target genes.
Gene regulatory networks describe the interplay and effects of regulatory factors (such as transcription factors and microRNAs) on the expression of their target genes.
Resources
GO, KEGG and MSigDB annotations are most frequently used for the enrichment analysis of functional gene sets. Despite an increasing number of gene set and pathway databases, they are typically the first choice due to their long-standing curation and availability for a wide range of species.
GO: The Gene Ontology (GO) consists of three major sub-ontologies that classify gene products according to molecular function (MF), biological process (BP) and cellular component (CC). Each ontology consists of GO terms that define MFs, BPs or CCs to which specific genes are annotated. The terms are organized in a directed acyclic graph, where edges between the terms represent relationships of different types. They relate the terms according to a parent-child scheme, i.e. parent terms denote more general entities, whereas child terms represent more specific entities.
KEGG: The Kyoto Encyclopedia of Genes and Genomes (KEGG) is a collection of manually drawn pathway maps representing molecular interaction and reaction networks. These pathways cover a wide range of biochemical processes that can be divided in 7 broad categories: metabolism, genetic and environmental information processing, cellular processes, organismal systems, human diseases, and drug development. Metabolism and drug development pathways differ from pathways of the other 5 categories by illustrating reactions between chemical compounds. Pathways of the other 5 categories illustrate molecular interactions between genes and gene products.
MSigDB: Molecular Signatures Database (MSigDB) are divided into 8 major collections, and several sub-collections. These include:
H: hallmark gene sets: Hallmark gene sets summarize and represent specific well-defined biological states or processes and display coherent expression.
C1: positional gene sets: Gene sets corresponding to each human chromosome and each cytogenetic band that has at least one gene.
C2: curated gene sets: Gene sets curated from various sources such as online pathway databases, the biomedical literature, and knowledge of domain experts. It contains 5 sub-collections.
C3 motif gene sets: Gene sets representing potential targets of regulation by transcription factors or microRNAs. It has 2 sub-collections.
C4 computational gene sets: Computational gene sets defined by mining large collections of cancer-oriented microarray data. It has 4 sub-collections.
C5 GO gene sets: Gene sets that contain genes annotated by the same GO term. The C5 collection is divided into three sub-collections based on GO ontologies: BP, CC, and MF.
C6 oncogenic signatures: Gene sets that represent signatures of cellular pathways which are often dis-regulated in cancer.
C7 inmunologic signatures: Gene sets that represent cell states and perturbations within the immune system.
Statistical approaches
The two predominantly used enrichment methods are:
To better distinguish from the specific method, some authors use the term gene set analysis to denote the general strategy. However, there is also a specific method from Efron and Tibshirani, 2007 of this name.
Goeman and Buehlmann further raise several critical issues concerning the 2x2 ORA:
With regard to these statistical concerns, GSEA is considered superior:
However, the simplicity and general applicability of ORA is unmet by subsequent methods improving on these issues. For instance, GSEA requires the expression data as input, which is not available for gene lists derived from other experiment types. On the other hand, the involved sample permutation procedure has been proven inaccurate and time-consuming Efron and Tibshirani, 2007, Phipson and Smyth, 2010, Larson and Owen, 2015.
Khatri et al., 2012 have taken a slightly different approach by classifying methods along the timeline of development into three generations:
Although topology-based (also: network-based) methods appear to be most realistic, their straightforward application can be impaired by features that are not-detectable on the transcriptional level (such as protein-protein interactions) and insufficient network knowledge Geistlinger et al., 2013, Bayerlova et al., 2015.
Given the individual benefits and limitations of existing methods, cautious interpretation of results is required to derive valid conclusions. Whereas no single method is best suited for all application scenarios, applying multiple methods can be beneficial. This has been shown to filter out spurious hits of individual methods, thereby reducing the outcome to gene sets accumulating evidence from different methods Geistlinger et al., 2016, Alhamdoosh et al., 2017.
clusterProfiler
clusterProfiler performs statistical analysis and visualization of functional profiles for genes and gene clusters. It includes GO, KEEG and a general function to perform ORA for any database such as MSigDB or DisGeNET among others.
Let us illustrate how to perform ORA (enrichment analysis) using RNAseq data. Our interest is to compare gene expression between cell lines that have been treated with dexamethasone or not. Data are available in the airway
Bioconductor package.
library(airway)
data(airway)
table(airway$dex)
trt untrt
4 4
First, we need to perform DE analysis.
library(edgeR)
dge <- DGEList(assay(airway), group = airway$dex)
dge <- calcNormFactors(dge)
mm <- model.matrix( ~ group, data=dge$samples)
keep.exprs <- filterByExpr(dge)
dge.filt <- dge[keep.exprs,]
dim(dge.filt)
[1] 15956 8
mm <- model.matrix( ~ group, data=dge.filt$samples)
v <- voom(dge.filt, design = mm, plot = TRUE)
fit <- lmFit(v, mm)
fit <- eBayes(fit)
topTable(fit)
logFC AveExpr t P.Value adj.P.Val B
ENSG00000152583 -4.567978 4.165462 -18.90224 1.815671e-08 0.0001863661 9.540147
ENSG00000125148 -2.192581 7.022630 -16.58569 5.618684e-08 0.0001863661 9.086536
ENSG00000148175 -1.434051 8.854588 -16.78864 5.059831e-08 0.0001863661 9.076459
ENSG00000134686 -1.374679 6.837469 -16.51144 5.839999e-08 0.0001863661 9.064591
ENSG00000179094 -3.178899 4.418980 -16.81211 4.999274e-08 0.0001863661 9.035454
ENSG00000120129 -2.944349 6.643013 -15.76333 8.700236e-08 0.0002313683 8.677905
ENSG00000183044 -1.142278 4.430619 -15.16997 1.208955e-07 0.0002346028 8.351504
ENSG00000178695 2.536550 6.450283 15.00364 1.328615e-07 0.0002346028 8.279820
ENSG00000189221 -3.328277 5.949117 -14.58487 1.692384e-07 0.0002346028 8.041290
ENSG00000196517 2.221216 3.024889 15.19662 1.190918e-07 0.0002346028 8.009218
We call DE genes with a minimum 2-fold change of expression at a maximum FDR of 5%:
tt <- topTable(fit, n=Inf)
mask <- tt$adj.P.Val < 0.05 &
abs(tt$logFC) > log2(2)
deGenes <- rownames(tt[mask, ])
head(deGenes)
[1] "ENSG00000152583" "ENSG00000125148" "ENSG00000148175" "ENSG00000134686" "ENSG00000179094" "ENSG00000120129"
length(deGenes)
[1] 637
The gene universe is obtained by:
geneUniverse <- rownames(tt)
length(geneUniverse)
[1] 15956
In order to asses functional enrichment, both DE gene list and gene universe must be annotated in Entrez IDs:
library(org.Hs.eg.db)
deGenes <- unlist(mget(deGenes, envir=org.Hs.egENSEMBL2EG,
ifnotfound = NA))
geneUniverse <- unlist(mget(geneUniverse, envir=org.Hs.egENSEMBL2EG,
ifnotfound = NA))
The GO enrichment analysis using clusterProfiler
is performed by
library(clusterProfiler)
ans.go <- enrichGO(gene = deGenes, ont = "BP",
OrgDb ="org.Hs.eg.db",
universe = geneUniverse,
readable=TRUE,
pvalueCutoff = 0.05)
tab.go <- as.data.frame(ans.go)
tab.go<- subset(tab.go, Count>5)
tab.go[1:5, 1:6]
ID Description GeneRatio BgRatio pvalue p.adjust
GO:0001525 GO:0001525 angiogenesis 42/524 376/11666 4.205065e-08 0.0001502143
GO:0006935 GO:0006935 chemotaxis 43/524 401/11666 9.400249e-08 0.0001502143
GO:0042330 GO:0042330 taxis 43/524 403/11666 1.083276e-07 0.0001502143
GO:0048514 GO:0048514 blood vessel morphogenesis 44/524 439/11666 4.549754e-07 0.0004731744
GO:0008015 GO:0008015 blood circulation 35/524 317/11666 7.711526e-07 0.0006415990
And the KEEG enrichment with:
ans.kegg <- enrichKEGG(gene = deGenes,
organism = 'hsa',
universe = geneUniverse,
pvalueCutoff = 0.05)
tab.kegg <- as.data.frame(ans.kegg)
tab.kegg<- subset(tab.kegg, Count>5)
tab.kegg[1:5, 1:6]
ID Description GeneRatio BgRatio pvalue p.adjust
hsa04060 hsa04060 Cytokine-cytokine receptor interaction 22/267 119/5368 5.975576e-08 1.709015e-05
hsa04213 hsa04213 Longevity regulating pathway - multiple species 10/267 55/5368 3.135226e-04 2.555243e-02
hsa04068 hsa04068 FoxO signaling pathway 15/267 111/5368 3.470615e-04 2.555243e-02
hsa04360 hsa04360 Axon guidance 18/267 150/5368 4.098462e-04 2.555243e-02
hsa04933 hsa04933 AGE-RAGE signaling pathway in diabetic complications 13/267 90/5368 4.467208e-04 2.555243e-02
For external databases such as DisGeNET or MSigDB collections, databases must be imported into R. This is how DisGeNET is imported into R
gda <- read.delim("data/curated_gene_disease_associations.tsv.gz")
disease2gene <- gda[, c("diseaseId", "geneId")]
disease2name <- gda[, c("diseaseId", "diseaseName")]
Then, the enrichment is perform using enricher
function:
ans.dis <- enricher(deGenes, TERM2GENE=disease2gene,
TERM2NAME=disease2name)
tab.dis <- as.data.frame(ans.dis)
tab.dis<- subset(tab.dis, Count>5)
tab.dis[,1:6]
ID Description GeneRatio BgRatio pvalue p.adjust
C0014175 C0014175 Endometriosis 24/314 156/8948 8.239341e-10 1.145268e-06
C0151744 C0151744 Myocardial Ischemia 21/314 171/8948 5.162746e-07 3.588108e-04
C3495559 C3495559 Juvenile arthritis 15/314 130/8948 4.735006e-05 2.193886e-02
C0162820 C0162820 Dermatitis, Allergic Contact 10/314 66/8948 9.031442e-05 3.138426e-02
This is the R code for transcription factors which is one of the sub-collection of C3 available at MSigDb for additional gene set collections.
c3.tf <- read.gmt("c:/Juan/CREAL/HELIX/pathways/GSEA/c3.tft.v6.2.entrez.gmt")
ans.tf <- enricher(deGenes, TERM2GENE=c3.tf)
tab.tf <- as.data.frame(ans.tf)
tab.tf<- subset(tab.tf, Count>5)
tab.tf[1:5,1:5]
ID Description GeneRatio BgRatio pvalue
RP58_01 RP58_01 RP58_01 21/425 207/12774 5.542957e-06
YATTNATC_UNKNOWN YATTNATC_UNKNOWN YATTNATC_UNKNOWN 29/425 377/12774 2.414375e-05
TGGNNNNNNKCCAR_UNKNOWN TGGNNNNNNKCCAR_UNKNOWN TGGNNNNNNKCCAR_UNKNOWN 31/425 424/12774 3.395064e-05
CEBP_Q2 CEBP_Q2 CEBP_Q2 21/425 234/12774 3.605911e-05
E12_Q6 E12_Q6 E12_Q6 22/425 262/12774 6.480659e-05
All analyses performed with clusterProfiler
can be visualize by different plots:
library(enrichplot)
p1 <- barplot(ans.dis, showCategory=10)
p1
p2 <- dotplot(ans.kegg, showCategory=20) + ggtitle("KEGG")
p3 <- dotplot(ans.dis, showCategory=20) + ggtitle("Disease")
plot_grid(p2, p3, nrow=2)
p4 <- upsetplot(ans.dis)
p4
p5 <- emapplot(ans.kegg)
p5
cowplot::plot_grid(p1, p3, p5, ncol=2, labels=LETTERS[1:3])
EXERCISE: The file data_exercises/GSE18123.Rdata
contains data corresponing to the GEO accession number GSE18123. This dataset was used to test whether peripheral blood gene expression profiles could be used as a molecular diagnostic tool for distinguishing children with Autistic Spectrum Disorder (ASD) from controls by performing peripheral blood gene expression profiling on 170 patients with ASD and 115 controls collected from Boston area hospitals.
Determine those genes that are differentially expressed between CONTROLs and AUTISM (variable group
).
Perform an enrichment analysis of GO, KEGG and DisGeNET databases.
sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19042)
Matrix products: default
locale:
[1] LC_COLLATE=Spanish_Spain.1252 LC_CTYPE=Spanish_Spain.1252 LC_MONETARY=Spanish_Spain.1252
[4] LC_NUMERIC=C LC_TIME=Spanish_Spain.1252
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets methods base
other attached packages:
[1] enrichplot_1.8.1 clusterProfiler_3.16.1 org.Hs.eg.db_3.11.4 AnnotationDbi_1.50.3
[5] edgeR_3.30.3 limma_3.44.3 airway_1.8.0 SummarizedExperiment_1.18.2
[9] DelayedArray_0.14.1 matrixStats_0.57.0 Biobase_2.48.0 GenomicRanges_1.40.0
[13] GenomeInfoDb_1.24.2 IRanges_2.22.2 S4Vectors_0.26.1 BiocGenerics_0.34.0
[17] BiocStyle_2.16.0
loaded via a namespace (and not attached):
[1] fgsea_1.14.0 colorspace_2.0-0 ellipsis_0.3.1 ggridges_0.5.2 qvalue_2.20.0
[6] XVector_0.28.0 farver_2.1.0 urltools_1.7.3 graphlayouts_0.7.1 ggrepel_0.8.2
[11] bit64_4.0.4 scatterpie_0.1.5 fansi_0.4.2 xml2_1.3.2 codetools_0.2-16
[16] splines_4.0.2 GOSemSim_2.14.2 knitr_1.29 polyclip_1.10-0 jsonlite_1.7.2
[21] GO.db_3.11.4 ggforce_0.3.2 BiocManager_1.30.10 compiler_4.0.2 httr_1.4.2
[26] rvcheck_0.1.8 assertthat_0.2.1 Matrix_1.2-18 tweenr_1.0.1 htmltools_0.5.1.1
[31] prettyunits_1.1.1 tools_4.0.2 igraph_1.2.5 gtable_0.3.0 glue_1.4.2
[36] GenomeInfoDbData_1.2.3 reshape2_1.4.4 DO.db_2.9 dplyr_1.0.5 fastmatch_1.1-0
[41] Rcpp_1.0.6 vctrs_0.3.7 ggraph_2.0.4 xfun_0.16 stringr_1.4.0
[46] lifecycle_1.0.0 DOSE_3.14.0 europepmc_0.4 zlibbioc_1.34.0 MASS_7.3-52
[51] scales_1.1.1 tidygraph_1.2.0 hms_0.5.3 ggupset_0.3.0 RColorBrewer_1.1-2
[56] yaml_2.2.1 memoise_1.1.0 gridExtra_2.3 ggplot2_3.3.3 downloader_0.4
[61] triebeard_0.3.0 stringi_1.4.6 RSQLite_2.2.0 BiocParallel_1.22.0 rlang_0.4.10
[66] pkgconfig_2.0.3 bitops_1.0-6 evaluate_0.14 lattice_0.20-41 purrr_0.3.4
[71] labeling_0.4.2 cowplot_1.1.0 bit_4.0.4 tidyselect_1.1.0 plyr_1.8.6
[76] magrittr_2.0.1 bookdown_0.20 R6_2.4.1 magick_2.4.0 generics_0.0.2
[81] DBI_1.1.0 pillar_1.6.0 RCurl_1.98-1.2 tibble_3.1.0 crayon_1.4.1
[86] utf8_1.2.1 rmarkdown_2.7 viridis_0.5.1 progress_1.2.2 locfit_1.5-9.4
[91] grid_4.0.2 data.table_1.13.0 blob_1.2.1 digest_0.6.27 tidyr_1.1.2
[96] gridGraphics_0.5-0 munsell_0.5.0 viridisLite_0.4.0 ggplotify_0.0.5