1 Gettig started

This document can be reproduced by using the R code and the data (for illustrating purposes and for the excersises) that are available here: https://github.com/isglobal-brge/short_course_omic_R

2 Introduction

Test whether known biological functions or processes are over-represented (= enriched) in an experimentally-derived gene list, e.g. a list of differentially expressed (DE) genes. See Goeman and Buehlmann, 2007 for a critical review.

A convenient way to find DE gene sets is by detecting the so-called functional enrichment in two steps. First, we search for DE genes. Second, we verify whether these DE genes belong to the gene set in a proportion that exceeds any expectation of finding that number of genes in that gene set by chance alone. A straightforward way to assess this hypothesis consists of applying a hypergeometric test, which in fact corresponds to the one-tailed Fisher’s exact test, following the so-called urn model.

A hypergeometric test assesses whether a number of successes in a sequence of draws follows a hypergeometric distribution. The hypergeometric distribution is a discrete probability distribution that describes the number of successes in a sequence of draws from a finite population without replacement, just as the binomialdistribution describes the number of successes for draws with replacement. In our context this involves the following quantities:

	DE	non-DE	total
Insiside Gene Set	k	m-k	m
Outside Gene Set	n-k	N+k-n-m	N-m
total	n	N-m	N

where:

\(N\) is the total number of genes considered (gene universe).
\(n\) is the number of DE genes.
\(m\) is the number of genes in the gene set.
\(k\) is the number of DE genes in the gene set.

Given these quantities and a random variable \(X\) representing the possible outcomes of a hypergeometric process, the probability of getting exactly \(k\) genes inside a particular gene set is given by

\[\textrm{Pr}(X=k)=\frac{{m\choose k} {N-m\choose n-k}}{{N\choose n}}\] Therefore, the probability of getting \(k\) or more genes inside a particular gene set, as a hypergeometric random process, is

\[\textrm{Pr}(X >= k)=\sum_{x=k}^n \frac{{m\choose x} {N-m\choose n-x}}{{N\choose n}}\] This probability is the quantity being used as surrogate for the enrichment of the functional gene set with our list of DE genes, where the lower the value the less likely that we observe those \(k\) genes inside the given gene set by pure chance and thus the more enriched that this gene set is enriched with our DE genes.

Example: Transcriptomic study, in which 12,671 genes have been tested for differential expression between two sample conditions and 529 genes were found DE. Among the DE genes, 28 are annotated to a specific functional gene set (or pathway), which contains in total 170 genes. This setup corresponds to a 2x2 contingency table,

deTable <-  matrix(c(28, 142, 501, 12000),
            nrow = 2,
            dimnames = list(DE=c("yes","no"),
                            GeneSet=c("in","out")))
              
deTable

     GeneSet
DE     in   out
  yes  28   501
  no  142 12000

where the overlap of 28 genes can be assessed based on the hypergeometric distribution. This corresponds to a one-sided version of Fisher’s exact test, yielding here a significant enrichment.

fisher.test(deTable, alternative = "greater")


    Fisher's Exact Test for Count Data

data:  deTable
p-value = 4.088e-10
alternative hypothesis: true odds ratio is greater than 1
95 percent confidence interval:
 3.226736      Inf
sample estimates:
odds ratio 
  4.721744

This basic principle is at the foundation of major public and commercial enrichment tools such as DAVID and Pathway Studio.

Although gene set enrichment methods have been primarily developed and applied on transcriptomic data, they have recently been modified, extended and applied also in other fields of genomic and biomedical research. This includes novel approaches for functional enrichment analysis of proteomic and metabolomic data as well as genomic regions and disease phenotypes, Lavallee and Yates, 2016, Chagoyen et al., 2016, McLean et al., 2010, Ried et al., 2012.

EXERCISE: Library tweeDEseqCountData contains data corresponding to an RNA-seq experiment described in Pickrell et al. (2010). Data correspond to lymphoblastoid cell lines about 69 non-related Nigerian individuals. This information as well as phenotypic data is available as an object of class eSet called pickrell.eset that can be loaded after typing:

data(pickrell)

Use limma-voom method to detect those genes that are differentially expressed (DE) between males and females (variale gender).
Select those genes that are DE at 5% FDR with a minimum fold-change of 1.5
Create a gene set of sex-especific genes by

geneUniverse <- featureNames(pickrell.eset)
geneSex <- unique(intersect(geneUniverse,
                  c(msYgenes, XiEgenes)))

and test whether the list of DE genes is enriched in that gene set.

3 Gene expression-based enrichment analysis

This methodology can be extended to the analysis of multiple gene sets available in different databases such as GO or KEGG among others. Next sections illustrate how to perform such analyses using different BioC packages.

3.1 A primer on terminology, existing methods & statistical theory

Gene sets, pathways & regulatory networks

Gene sets are simple lists of usually functionally related genes without further specification of relationships between genes.

Pathways can be interpreted as specific gene sets, typically representing a group of genes that work together in a biological process. Pathways are commonly divided in metabolic and signaling pathways. Metabolic pathways such as glycolysis represent biochemical substrate conversions by specific enzymes. Signaling pathways such as the MAPK signaling pathway describe signal transduction cascades from receptor proteins to transcription factors, resulting in activation or inhibition of specific target genes.

Gene regulatory networks describe the interplay and effects of regulatory factors (such as transcription factors and microRNAs) on the expression of their target genes.

Resources

GO, KEGG and MSigDB annotations are most frequently used for the enrichment analysis of functional gene sets. Despite an increasing number of gene set and pathway databases, they are typically the first choice due to their long-standing curation and availability for a wide range of species.

GO: The Gene Ontology (GO) consists of three major sub-ontologies that classify gene products according to molecular function (MF), biological process (BP) and cellular component (CC). Each ontology consists of GO terms that define MFs, BPs or CCs to which specific genes are annotated. The terms are organized in a directed acyclic graph, where edges between the terms represent relationships of different types. They relate the terms according to a parent-child scheme, i.e. parent terms denote more general entities, whereas child terms represent more specific entities.

KEGG: The Kyoto Encyclopedia of Genes and Genomes (KEGG) is a collection of manually drawn pathway maps representing molecular interaction and reaction networks. These pathways cover a wide range of biochemical processes that can be divided in 7 broad categories: metabolism, genetic and environmental information processing, cellular processes, organismal systems, human diseases, and drug development. Metabolism and drug development pathways differ from pathways of the other 5 categories by illustrating reactions between chemical compounds. Pathways of the other 5 categories illustrate molecular interactions between genes and gene products.

MSigDB: Molecular Signatures Database (MSigDB) are divided into 8 major collections, and several sub-collections. These include:

H: hallmark gene sets: Hallmark gene sets summarize and represent specific well-defined biological states or processes and display coherent expression.
C1: positional gene sets: Gene sets corresponding to each human chromosome and each cytogenetic band that has at least one gene.
C2: curated gene sets: Gene sets curated from various sources such as online pathway databases, the biomedical literature, and knowledge of domain experts. It contains 5 sub-collections.
C3 motif gene sets: Gene sets representing potential targets of regulation by transcription factors or microRNAs. It has 2 sub-collections.
C4 computational gene sets: Computational gene sets defined by mining large collections of cancer-oriented microarray data. It has 4 sub-collections.
C5 GO gene sets: Gene sets that contain genes annotated by the same GO term. The C5 collection is divided into three sub-collections based on GO ontologies: BP, CC, and MF.
C6 oncogenic signatures: Gene sets that represent signatures of cellular pathways which are often dis-regulated in cancer.
C7 inmunologic signatures: Gene sets that represent cell states and perturbations within the immune system.

Statistical approaches

The two predominantly used enrichment methods are:

Overrepresentation analysis (ORA), testing whether a gene set contains disproportional many genes of significant expression change, based on the procedure outlined in the first section.
Gene set enrichment analysis (GSEA), testing whether genes of a gene set accumulate at the top or bottom of the full gene vector ordered by direction and magnitude of expression change Subramanian et al., 2005. However, the term gene set enrichment analysis nowadays subsumes a general strategy implemented by a wide range of methods Huang et al., 2009. Those methods have in common the same goal, although approach and statistical model can vary substantially Goeman and Buehlmann, 2007, Khatri et al., 2012.

To better distinguish from the specific method, some authors use the term gene set analysis to denote the general strategy. However, there is also a specific method from Efron and Tibshirani, 2007 of this name.

Goeman and Buehlmann further raise several critical issues concerning the 2x2 ORA:

rather arbitrary classification of genes in DE / not DE
based on gene sampling, although sampling of subjects is appropriate
unrealistic independence assumption between genes, resulting in highly anti-conservative p-values

With regard to these statistical concerns, GSEA is considered superior:

takes all measured genes into account
subject sampling via permutation of class labels
the incorporated permutation procedure implicitly accounts for correlations between genes

However, the simplicity and general applicability of ORA is unmet by subsequent methods improving on these issues. For instance, GSEA requires the expression data as input, which is not available for gene lists derived from other experiment types. On the other hand, the involved sample permutation procedure has been proven inaccurate and time-consuming Efron and Tibshirani, 2007, Phipson and Smyth, 2010, Larson and Owen, 2015.

Khatri et al., 2012 have taken a slightly different approach by classifying methods along the timeline of development into three generations:

Generation: ORA methods based on the 2x2 contingency table test,
Generation: functional class scoring (FCS) methods such as GSEA, which compute gene set (= functional class) scores by summarizing per-gene DE statistics,
Generation: topology-based methods, explicitly taking into account interactions between genes as defined in signaling pathways and gene regulatory networks (Geistlinger et al., 2011 for an example).

Although topology-based (also: network-based) methods appear to be most realistic, their straightforward application can be impaired by features that are not-detectable on the transcriptional level (such as protein-protein interactions) and insufficient network knowledge Geistlinger et al., 2013, Bayerlova et al., 2015.

Given the individual benefits and limitations of existing methods, cautious interpretation of results is required to derive valid conclusions. Whereas no single method is best suited for all application scenarios, applying multiple methods can be beneficial. This has been shown to filter out spurious hits of individual methods, thereby reducing the outcome to gene sets accumulating evidence from different methods Geistlinger et al., 2016, Alhamdoosh et al., 2017.

3.2 Analysis using `clusterProfiler`

clusterProfiler performs statistical analysis and visualization of functional profiles for genes and gene clusters. It includes GO, KEEG and a general function to perform ORA for any database such as MSigDB or DisGeNET among others.

Let us illustrate how to perform ORA (enrichment analysis) using RNAseq data. Our interest is to compare gene expression between cell lines that have been treated with dexamethasone or not. Data are available in the airway Bioconductor package.

library(airway)
data(airway)
table(airway$dex)


  trt untrt 
    4     4

First, we need to perform DE analysis.

library(edgeR)
dge <- DGEList(assay(airway), group = airway$dex)
dge <- calcNormFactors(dge)
mm <- model.matrix( ~ group, data=dge$samples)

keep.exprs <- filterByExpr(dge)
dge.filt <- dge[keep.exprs,]
dim(dge.filt)

[1] 15956     8

mm <- model.matrix( ~ group, data=dge.filt$samples)
v <- voom(dge.filt, design = mm, plot = TRUE)

fit <- lmFit(v, mm)
fit <- eBayes(fit)
topTable(fit)

                    logFC  AveExpr         t      P.Value    adj.P.Val        B
ENSG00000152583 -4.567978 4.165462 -18.90224 1.815671e-08 0.0001863661 9.540147
ENSG00000125148 -2.192581 7.022630 -16.58569 5.618684e-08 0.0001863661 9.086536
ENSG00000148175 -1.434051 8.854588 -16.78864 5.059831e-08 0.0001863661 9.076459
ENSG00000134686 -1.374679 6.837469 -16.51144 5.839999e-08 0.0001863661 9.064591
ENSG00000179094 -3.178899 4.418980 -16.81211 4.999274e-08 0.0001863661 9.035454
ENSG00000120129 -2.944349 6.643013 -15.76333 8.700236e-08 0.0002313683 8.677905
ENSG00000183044 -1.142278 4.430619 -15.16997 1.208955e-07 0.0002346028 8.351504
ENSG00000178695  2.536550 6.450283  15.00364 1.328615e-07 0.0002346028 8.279820
ENSG00000189221 -3.328277 5.949117 -14.58487 1.692384e-07 0.0002346028 8.041290
ENSG00000196517  2.221216 3.024889  15.19662 1.190918e-07 0.0002346028 8.009218

We call DE genes with a minimum 2-fold change of expression at a maximum FDR of 5%:

tt <- topTable(fit, n=Inf)
mask <- tt$adj.P.Val < 0.05 &
        abs(tt$logFC) > log2(2)
deGenes <- rownames(tt[mask, ])
head(deGenes)

[1] "ENSG00000152583" "ENSG00000125148" "ENSG00000148175" "ENSG00000134686" "ENSG00000179094" "ENSG00000120129"

length(deGenes)

[1] 637

The gene universe is obtained by:

geneUniverse <- rownames(tt)
length(geneUniverse)

[1] 15956

In order to asses functional enrichment, both DE gene list and gene universe must be annotated in Entrez IDs:

library(org.Hs.eg.db)
deGenes <- unlist(mget(deGenes, envir=org.Hs.egENSEMBL2EG,
                       ifnotfound = NA))

geneUniverse <- unlist(mget(geneUniverse, envir=org.Hs.egENSEMBL2EG,
                       ifnotfound = NA))

The GO enrichment analysis using clusterProfiler is performed by

library(clusterProfiler)
ans.go <- enrichGO(gene = deGenes, ont = "BP",
                   OrgDb ="org.Hs.eg.db",
                   universe = geneUniverse,
                   readable=TRUE,
                   pvalueCutoff = 0.05)
tab.go <- as.data.frame(ans.go)
tab.go<- subset(tab.go, Count>5)
tab.go[1:5, 1:6]

                   ID                Description GeneRatio   BgRatio       pvalue     p.adjust
GO:0001525 GO:0001525               angiogenesis    42/524 376/11666 4.205065e-08 0.0001502143
GO:0006935 GO:0006935                 chemotaxis    43/524 401/11666 9.400249e-08 0.0001502143
GO:0042330 GO:0042330                      taxis    43/524 403/11666 1.083276e-07 0.0001502143
GO:0048514 GO:0048514 blood vessel morphogenesis    44/524 439/11666 4.549754e-07 0.0004731744
GO:0008015 GO:0008015          blood circulation    35/524 317/11666 7.711526e-07 0.0006415990

And the KEEG enrichment with:

ans.kegg <- enrichKEGG(gene = deGenes,
                       organism = 'hsa',
                       universe = geneUniverse,
                       pvalueCutoff = 0.05)
tab.kegg <- as.data.frame(ans.kegg)
tab.kegg<- subset(tab.kegg, Count>5)
tab.kegg[1:5, 1:6]

               ID                                          Description GeneRatio  BgRatio       pvalue     p.adjust
hsa04060 hsa04060               Cytokine-cytokine receptor interaction    22/267 119/5368 5.975576e-08 1.709015e-05
hsa04213 hsa04213      Longevity regulating pathway - multiple species    10/267  55/5368 3.135226e-04 2.555243e-02
hsa04068 hsa04068                               FoxO signaling pathway    15/267 111/5368 3.470615e-04 2.555243e-02
hsa04360 hsa04360                                        Axon guidance    18/267 150/5368 4.098462e-04 2.555243e-02
hsa04933 hsa04933 AGE-RAGE signaling pathway in diabetic complications    13/267  90/5368 4.467208e-04 2.555243e-02

For external databases such as DisGeNET or MSigDB collections, databases must be imported into R. This is how DisGeNET is imported into R

gda <- read.delim("data/curated_gene_disease_associations.tsv.gz")
disease2gene <- gda[, c("diseaseId", "geneId")]
disease2name <- gda[, c("diseaseId", "diseaseName")]

Then, the enrichment is perform using enricher function:

ans.dis <- enricher(deGenes, TERM2GENE=disease2gene,
                    TERM2NAME=disease2name)
tab.dis <- as.data.frame(ans.dis)
tab.dis<- subset(tab.dis, Count>5)
tab.dis[,1:6]

               ID                  Description GeneRatio  BgRatio       pvalue     p.adjust
C0014175 C0014175                Endometriosis    24/314 156/8948 8.239341e-10 1.145268e-06
C0151744 C0151744          Myocardial Ischemia    21/314 171/8948 5.162746e-07 3.588108e-04
C3495559 C3495559           Juvenile arthritis    15/314 130/8948 4.735006e-05 2.193886e-02
C0162820 C0162820 Dermatitis, Allergic Contact    10/314  66/8948 9.031442e-05 3.138426e-02

This is the R code for transcription factors which is one of the sub-collection of C3 available at MSigDb for additional gene set collections.

c3.tf <- read.gmt("c:/Juan/CREAL/HELIX/pathways/GSEA/c3.tft.v6.2.entrez.gmt")

ans.tf <- enricher(deGenes, TERM2GENE=c3.tf)
tab.tf <- as.data.frame(ans.tf)
tab.tf<- subset(tab.tf, Count>5)
tab.tf[1:5,1:5]

                                           ID            Description GeneRatio   BgRatio       pvalue
RP58_01                               RP58_01                RP58_01    21/425 207/12774 5.542957e-06
YATTNATC_UNKNOWN             YATTNATC_UNKNOWN       YATTNATC_UNKNOWN    29/425 377/12774 2.414375e-05
TGGNNNNNNKCCAR_UNKNOWN TGGNNNNNNKCCAR_UNKNOWN TGGNNNNNNKCCAR_UNKNOWN    31/425 424/12774 3.395064e-05
CEBP_Q2                               CEBP_Q2                CEBP_Q2    21/425 234/12774 3.605911e-05
E12_Q6                                 E12_Q6                 E12_Q6    22/425 262/12774 6.480659e-05

3.3 Visualization

All analyses performed with clusterProfiler can be visualize by different plots:

library(enrichplot)
p1 <- barplot(ans.dis, showCategory=10)
p1

p2 <- dotplot(ans.kegg, showCategory=20) + ggtitle("KEGG")
p3 <- dotplot(ans.dis, showCategory=20) + ggtitle("Disease")
plot_grid(p2, p3, nrow=2)

p4 <- upsetplot(ans.dis)
p4

p5 <- emapplot(ans.kegg)
p5

cowplot::plot_grid(p1, p3, p5, ncol=2, labels=LETTERS[1:3])

EXERCISE: The file data_exercises/GSE18123.Rdata contains data corresponing to the GEO accession number GSE18123. This dataset was used to test whether peripheral blood gene expression profiles could be used as a molecular diagnostic tool for distinguishing children with Autistic Spectrum Disorder (ASD) from controls by performing peripheral blood gene expression profiling on 170 patients with ASD and 115 controls collected from Boston area hospitals.

Determine those genes that are differentially expressed between CONTROLs and AUTISM (variable group).
Perform an enrichment analysis of GO, KEGG and DisGeNET databases.

sessionInfo()

R version 4.0.2 (2020-06-22)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19042)

Matrix products: default

locale:
[1] LC_COLLATE=Spanish_Spain.1252  LC_CTYPE=Spanish_Spain.1252    LC_MONETARY=Spanish_Spain.1252
[4] LC_NUMERIC=C                   LC_TIME=Spanish_Spain.1252    

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] enrichplot_1.8.1            clusterProfiler_3.16.1      org.Hs.eg.db_3.11.4         AnnotationDbi_1.50.3       
 [5] edgeR_3.30.3                limma_3.44.3                airway_1.8.0                SummarizedExperiment_1.18.2
 [9] DelayedArray_0.14.1         matrixStats_0.57.0          Biobase_2.48.0              GenomicRanges_1.40.0       
[13] GenomeInfoDb_1.24.2         IRanges_2.22.2              S4Vectors_0.26.1            BiocGenerics_0.34.0        
[17] BiocStyle_2.16.0           

loaded via a namespace (and not attached):
 [1] fgsea_1.14.0           colorspace_2.0-0       ellipsis_0.3.1         ggridges_0.5.2         qvalue_2.20.0         
 [6] XVector_0.28.0         farver_2.1.0           urltools_1.7.3         graphlayouts_0.7.1     ggrepel_0.8.2         
[11] bit64_4.0.4            scatterpie_0.1.5       fansi_0.4.2            xml2_1.3.2             codetools_0.2-16      
[16] splines_4.0.2          GOSemSim_2.14.2        knitr_1.29             polyclip_1.10-0        jsonlite_1.7.2        
[21] GO.db_3.11.4           ggforce_0.3.2          BiocManager_1.30.10    compiler_4.0.2         httr_1.4.2            
[26] rvcheck_0.1.8          assertthat_0.2.1       Matrix_1.2-18          tweenr_1.0.1           htmltools_0.5.1.1     
[31] prettyunits_1.1.1      tools_4.0.2            igraph_1.2.5           gtable_0.3.0           glue_1.4.2            
[36] GenomeInfoDbData_1.2.3 reshape2_1.4.4         DO.db_2.9              dplyr_1.0.5            fastmatch_1.1-0       
[41] Rcpp_1.0.6             vctrs_0.3.7            ggraph_2.0.4           xfun_0.16              stringr_1.4.0         
[46] lifecycle_1.0.0        DOSE_3.14.0            europepmc_0.4          zlibbioc_1.34.0        MASS_7.3-52           
[51] scales_1.1.1           tidygraph_1.2.0        hms_0.5.3              ggupset_0.3.0          RColorBrewer_1.1-2    
[56] yaml_2.2.1             memoise_1.1.0          gridExtra_2.3          ggplot2_3.3.3          downloader_0.4        
[61] triebeard_0.3.0        stringi_1.4.6          RSQLite_2.2.0          BiocParallel_1.22.0    rlang_0.4.10          
[66] pkgconfig_2.0.3        bitops_1.0-6           evaluate_0.14          lattice_0.20-41        purrr_0.3.4           
[71] labeling_0.4.2         cowplot_1.1.0          bit_4.0.4              tidyselect_1.1.0       plyr_1.8.6            
[76] magrittr_2.0.1         bookdown_0.20          R6_2.4.1               magick_2.4.0           generics_0.0.2        
[81] DBI_1.1.0              pillar_1.6.0           RCurl_1.98-1.2         tibble_3.1.0           crayon_1.4.1          
[86] utf8_1.2.1             rmarkdown_2.7          viridis_0.5.1          progress_1.2.2         locfit_1.5-9.4        
[91] grid_4.0.2             data.table_1.13.0      blob_1.2.1             digest_0.6.27          tidyr_1.1.2           
[96] gridGraphics_0.5-0     munsell_0.5.0          viridisLite_0.4.0      ggplotify_0.0.5

Enrichment Analysis

May, 2021

Abstract

Contents

1 Gettig started

2 Introduction

3 Gene expression-based enrichment analysis

3.1 A primer on terminology, existing methods & statistical theory

3.2 Analysis using `clusterProfiler`

3.3 Visualization

Enrichment Analysis

May, 2021

Abstract

Contents

1 Gettig started

2 Introduction

3 Gene expression-based enrichment analysis

3.1 A primer on terminology, existing methods & statistical theory

3.2 Analysis using clusterProfiler

3.3 Visualization

3.2 Analysis using `clusterProfiler`