Overview

  • Why enrichment analysis?
  • What is enrichment analysis?
  • Gene ontology and pathways
  • Gene ontology and pathways enrichment
  • Multiple testing correction
  • Genomic regions enrichment
  • Tools and references

Why enrichment analysis?

  • The human genome contains ~20,000-25,000 genes
  • Each gene has multiple functions
  • If 1,000 genes change in an experimental condition, it can be difficult to understand what they do collectively

Birds of a feather flock together

  • Genes with similar expression patterns share similar functions
  • Similar (common) functions characterize a group of genes

Why enrichment analysis?

  • High-level understanding of the biology behind gene expression – Interpretation!
  • Translating changes of hundreds/thousands of differentially expressed genes into a few biological processes (reducing dimensionality)

What is enrichment analysis?

  • Enrichment analysis - summarizing common functions associated with a group of objects

What is enrichment analysis? – statistical definition

Enrichment analysis – detecting whether a group of objects has certain properties more (or less) frequently than expected by chance

Classification of genes

Gene set - a priori classification of genes into biologically relevant groups (sets)

  • Members of the same biochemical pathways
  • Genes annotated with the same molecular function
  • Transcripts expressed in the same cellular compartments
  • Co-regulated/co-expressed genes
  • Genes located on the same cytogenetic band

Annotation databases and ontologies

  • An annotation database annotates genes with functions or properties - sets of genes with shared functions
  • Structured prior knowledge about genes

Gene ontology

  • An ontology is a formal (hierarchical) representation of concepts and the relationships between them.

  • The objective of GO is to provide controlled vocabularies of terms for the description of gene products.

  • These terms are to be used as attributes of gene products, facilitating uniform queries across them.

Gene ontology hierarchy

  • Terms are related within a hierarchy using "is-a", "part-of" and other connectors

Gene ontology structure

Gene ontology describes multiple levels of detail of gene function.

  • Molecular Function - the tasks performed by individual gene products; examples are transcription factor and DNA helicase
  • Biological Process - broad biological goals, such as mitosis or purine metabolism, that are accomplished by ordered assemblies of molecular functions
  • Cellular Component - subcellular structures, locations, and macromolecular complexes; examples include nucleus, telomere, and origin recognition complex

Gene ontology database

Gene ontologies are not created equal

Gene ontologies for model organisms

MSigDb - Molecular Signatures Database

http://software.broadinstitute.org/gsea/msigdb/

  • H, hallmark gene sets are coherently expressed signatures derived by aggregating many MSigDB gene sets to represent well-defined biological states or processes.
  • C1, positional gene sets for each human chromosome and cytogenetic band.
  • C2, curated gene sets from online pathway databases, publications in PubMed, and knowledge of domain experts.
  • C3, motif gene sets based on conserved cis-regulatory motifs from a comparative analysis of the human, mouse, rat, and dog genomes.
  • C4, computational gene sets defined by mining large collections of cancer-oriented microarray data.
  • C5, GO gene sets consist of genes annotated by the same GO terms.
  • C6, oncogenic signatures defined directly from microarray gene expression data from cancer gene perturbations.
  • C7, immunologic signatures defined directly from microarray gene expression data from immunologic studies.

Pathways

  • An ordered series of molecular events that leads to the creation of a new molecular product, or to a change in a cellular state or process.
  • Genes often participate in multiple pathways – think about genes having multiple functions

http://biochemical-pathways.com/#/map/1

KEGG pathway database

  • KEGG (Kyoto Encyclopedia of Genes and Genomes) is a curated database: a collection of biological information compiled from published material.
  • Includes information on genes, proteins, metabolic pathways, molecular interactions, and biochemical reactions associated with specific organisms
  • Provides a relationship (map) for how these components are organized in a cellular structure or reaction pathway.

http://www.genome.jp/kegg/

KEGG pathway diagram

Reactome

  • Curated human pathways encompassing metabolism, signaling, and other biological processes.
  • Every pathway is traceable to primary literature.

http://www.reactome.org/

Reactome pathway diagram

Other pathway databases

Genes to networks

Enrichment analysis

Null hypothesis

  • Self-contained \(H_0\): genes in the gene set do not have any association with the phenotype

  • Problem: restrictive, uses information only from the gene set

Enrichment analysis

Null hypothesis

  • Competitive \(H_0\): genes in the gene set have the same level of association with a given phenotype as genes in the complement gene set

  • Problem: wrong assumption of independent gene sampling

Approach 1

Overrepresentation analysis, Hypergeometric test

  • \(m\) is the total number of genes
  • \(j\) is the number of genes in the functional group
  • \(n\) is the number of differentially expressed genes
  • \(k\) is the number of differentially expressed genes in the group

                  Diff. exp. genes   Not diff. exp. genes   Total
In gene set       k                  j-k                    j
Not in gene set   n-k                m-n-j+k                m-j
Total             n                  m-n                    m

What is the probability of having \(k\) or more genes from the group in the selected \(n\) genes?

\[P = \sum_{i=k}^n{ \frac{{m-j \choose n-i}{j \choose i}}{{m \choose n}} }\]
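
The sum above can be evaluated directly. The following is a minimal Python sketch using only the standard library; the function name `hypergeom_upper_tail` and the example counts are illustrative assumptions, not values from the slides. The upper limit is taken as min(n, j), which gives the same result as summing to n because the extra terms are zero.

```python
from math import comb


def hypergeom_upper_tail(m: int, j: int, n: int, k: int) -> float:
    """P(X >= k): chance of seeing k or more set genes among n selected genes.

    m: total number of genes, j: genes in the functional group,
    n: differentially expressed genes, k: differentially expressed genes in the group.
    """
    upper = min(n, j)  # terms with i > j are zero, so stop there
    favourable = sum(comb(j, i) * comb(m - j, n - i) for i in range(k, upper + 1))
    return favourable / comb(m, n)


# Hypothetical counts, chosen only for illustration.
m, j, n, k = 20000, 100, 500, 10
p_value = hypergeom_upper_tail(m, j, n, k)
print(f"P(X >= {k}) = {p_value:.3g}")
```

With these made-up counts only 2.5 genes from the group are expected among the selected genes, so observing 10 gives a small p-value.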

Approach 1

Overrepresentation analysis (ORA)

  1. Find a set of differentially expressed genes (DEGs)
  2. Are DEGs more common among genes in a set than among genes not in the set?
  • Fisher's exact test: stats::fisher.test()
  • Conditional hypergeometric test, to account for the directed hierarchy of GO: GOstats::hyperGTest()

Example: https://github.com/mdozmorov/MDmisc/blob/master/R/gene_enrichment.R

Approach 1

Problems

  • Results significantly affected by the selected threshold

  • Many genes with moderate but meaningful expression changes are discarded

  • Wrong assumption that genes are independent

Approach 2

Functional Class Scoring (FCS)

  • Gene set enrichment analysis (GSEA). Mootha et al., 2003; modified by Subramanian et al., 2005.

  • Main rationale – functionally related genes often display a coordinated expression to accomplish their roles in the cells

  • Aims to identify gene sets with "subtle but coordinated" expression changes that would be missed by DEGs threshold selection

Approach 2

  1. Sort genes by log fold change
  2. Calculate running sum - increment when gene in a set, decrement when not
  3. Maximum of the running sum is the enrichment score - larger means genes in a set are toward the top of the sorted list
  4. Permute subject labels to calculate significance p-value
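
Steps 1-3 can be sketched in a few lines of Python. This is a simplified, unweighted running sum (each hit adds 1/hits, each miss subtracts 1/misses), not the exact weighting used in the published method; the function name `enrichment_score`, the gene names, and the fold changes are hypothetical. Step 4 (permuting subject labels) is not shown.

```python
def enrichment_score(ranked_genes, gene_set):
    """Maximum of an unweighted running sum over a ranked gene list."""
    in_set = [gene in gene_set for gene in ranked_genes]
    n_hits = sum(in_set)
    n_miss = len(ranked_genes) - n_hits
    if n_hits == 0 or n_miss == 0:
        raise ValueError("gene set must contain some, but not all, ranked genes")
    running, best = 0.0, 0.0
    for hit in in_set:
        # Increment when the gene is in the set, decrement when it is not.
        running += 1.0 / n_hits if hit else -1.0 / n_miss
        best = max(best, running)
    return best


# Hypothetical log fold changes for eight genes.
log_fc = {"A": 2.1, "B": 1.8, "C": 1.2, "D": 0.4,
          "E": -0.3, "F": -1.0, "G": -1.5, "H": -2.2}
ranked = sorted(log_fc, key=log_fc.get, reverse=True)  # step 1: sort by log fold change

top_score = enrichment_score(ranked, {"A", "B", "C"})     # set at the top of the list
bottom_score = enrichment_score(ranked, {"F", "G", "H"})  # set at the bottom of the list
print(top_score, bottom_score)
```

A set concentrated at the top of the ranking scores close to 1, while a set at the bottom scores close to 0 under this definition.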

Other approaches

Linear model-based

  • CAMERA (Wu and Smyth 2012)
  • Correlation-Adjusted MEan RAnk gene set test
  • Estimating the variance inflation factor associated with inter-gene correlation, and incorporating this into parametric or rank-based test procedures
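
As a sketch of why inter-gene correlation matters, assume the \(s\) gene-level statistics in a set each have variance \(\sigma^2\) and average pairwise correlation \(\bar{\rho}\) (symbols introduced here for illustration only). The variance of their mean is then inflated relative to the independent case:

```latex
\[
\mathrm{Var}(\bar{z}) = \frac{\sigma^2}{s}\left[1 + (s-1)\bar{\rho}\right],
\qquad
\mathrm{VIF} = 1 + (s-1)\bar{\rho}
\]
```

For example, with \(s = 100\) genes and \(\bar{\rho} = 0.05\), the inflation factor is \(1 + 99 \times 0.05 = 5.95\), so even a weak average correlation makes a test that assumes independence strongly anti-conservative.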

Other approaches

Linear model-based

  • ROAST (Wu et al. 2010)
  • Under the null hypothesis (and assuming a linear model) the residuals are independent and identically distributed \(N(0,\sigma_g^2)\).
  • We can rotate the residual vector for each gene in a gene set, such that gene-gene expression correlations are preserved.

Other approaches

Impact analysis - incorporates topology of the pathway.

  • Gene's fold change
  • Classical enrichment statistics
  • The topology of the signaling pathway

Multiple testing problem

  • With thousands of pathways to test for enrichment we’re not testing one hypothesis, but many hypotheses – one for each pathway

  • Analysis of 2,000 pathways at the commonly accepted significance level \(\alpha=0.05\) is expected to identify 100 enriched pathways simply by chance, even if none is truly enriched

  • If the probability of making an error in one test is 0.05, the probability of making at least one error in ten independent tests is \[1-(1-0.05)^{10}=0.40126\]
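
The two numbers above can be reproduced with a short Python sketch. It assumes the tests are independent; the function names are illustrative.

```python
def familywise_error(alpha: float, n_tests: int) -> float:
    """Probability of at least one false positive across n independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


def expected_false_positives(alpha: float, n_tests: int) -> float:
    """Expected number of false positives when every null hypothesis is true."""
    return alpha * n_tests


for n_tests in (1, 10, 100):
    print(n_tests, round(familywise_error(0.05, n_tests), 5))

print(expected_false_positives(0.05, 2000))
```

The probability of at least one error grows quickly with the number of tests, which is why an uncorrected threshold is not usable for thousands of pathways.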

Error rates

False Discovery rate (FDR)

\[E \left[ \frac{False \; Discoveries}{Total \; Discoveries} \right]\]

Family wise error rate (FWER)

\[Pr(Number \; of \; False \; positives \ge 1)\]

Expected number of false positives

\[E[Number \; of \; False \; positives]\]

Interpretation

Suppose 550 out of 10,000 genes are significant at \(\alpha = 0.05\)

P-value < 0.05

  • Expect \(0.05*10,000=500\) false positives

False Discovery Rate < 0.05

  • Expect \(0.05*550=27.5\) false positives

Family Wise Error Rate < 0.05

  • The probability of at least 1 false positive is \(\le 0.05\)

Permutation based methods

Permutation based adjusted p-values

  • Under the \(H_0\), the joint distribution of the test statistics can be estimated by permutation
  1. For each permutation \(b=1, ..., B\), permute the gene labels
  2. Select a random gene set
  3. Compute the enrichment test statistic \(t_{b}\)
  4. The permutation distribution of the test statistic \(T\) for the hypothesis \(H_j\) is given by the empirical distribution of \(t_1, ..., t_B\)

Permutation based methods

  • For two-sided alternative hypotheses, the permutation p-value for hypothesis \(H_j\) is

\[p = \frac{\sum_{b=1}^B{I(\vert{t_b}\vert \ge \vert{t}\vert)}}{B}\]

where \(I(\cdot)\) is the indicator function, equal to 1 if the condition in parentheses is true and 0 otherwise, and \(t\) is the observed test statistic.
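
A minimal Python sketch of this p-value follows. The test statistic used here (mean score inside the set minus mean score outside it), the simulated scores, and all names are assumptions made for illustration; random gene sets of the same size stand in for the permutation step.

```python
import random


def permutation_p_value(observed: float, null_stats) -> float:
    """Two-sided permutation p-value: fraction of |t_b| at least as large as |t|."""
    return sum(abs(t_b) >= abs(observed) for t_b in null_stats) / len(null_stats)


def mean_difference(scores: dict, members: set) -> float:
    """Illustrative statistic: mean score in the set minus mean score outside it."""
    inside = [s for gene, s in scores.items() if gene in members]
    outside = [s for gene, s in scores.items() if gene not in members]
    return sum(inside) / len(inside) - sum(outside) / len(outside)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
scores = {f"g{i}": rng.gauss(0.0, 1.0) for i in range(200)}  # simulated gene scores
gene_set = {f"g{i}" for i in range(10)}
for gene in gene_set:
    scores[gene] += 3.0  # simulate a gene set with a strong shift

observed = mean_difference(scores, gene_set)
genes = list(scores)
B = 1000
null_stats = [
    mean_difference(scores, set(rng.sample(genes, len(gene_set))))
    for _ in range(B)
]
p_value = permutation_p_value(observed, null_stats)
print(f"observed = {observed:.2f}, permutation p = {p_value:.3f}")
```

Note that the smallest nonzero p-value this estimator can return is 1/B, so B limits the resolution of the result.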

Multiple Hypothesis Testing

Bonferroni procedure controls Family Wise Error Rate (FWER)

  • Testing \(g\) null hypotheses
  • Reject any \(H_i\) with \(p_i \le \alpha / g\)
  • Example: \(0.05/10,000 = 0.000005\)

  • Controls the FWER to be \(\le \alpha\), and close to \(\alpha\) when all null hypotheses are true.
  • As the number of hypotheses increases, the average power for an individual hypothesis decreases
  • Very conservative; no attempt to incorporate dependence between tests
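
The Bonferroni rule is simple enough to sketch in Python; the function names and the example p-values below are hypothetical.

```python
def bonferroni_threshold(alpha: float, g: int) -> float:
    """Per-test significance threshold when testing g hypotheses."""
    return alpha / g


def bonferroni_reject(p_values, alpha: float = 0.05):
    """Reject H_i whenever p_i <= alpha / g."""
    threshold = bonferroni_threshold(alpha, len(p_values))
    return [p <= threshold for p in p_values]


# Hypothetical p-values for four pathways.
p_values = [0.000001, 0.00004, 0.03, 0.2]
decisions = bonferroni_reject(p_values)
print(decisions)
print(bonferroni_threshold(0.05, 10000))
```

With four tests the threshold is 0.0125, so the pathway with p = 0.03 is not rejected even though it passes the uncorrected 0.05 cutoff.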

False discovery rates

  • It may be more appropriate to emphasize the proportion of false positives among the differentially expressed genes.

  • The expectation of this proportion is the false discovery rate (FDR) (Benjamini & Hochberg, 1995)
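
One common way to control the FDR is the Benjamini-Hochberg step-up procedure. The sketch below computes BH-adjusted p-values in plain Python; the function name and the example p-values are illustrative assumptions.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    g = len(p_values)
    order = sorted(range(g), key=lambda i: p_values[i])  # indices by ascending p
    adjusted = [0.0] * g
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone adjusted values.
    for rank in range(g, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * g / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values for four pathways.
p_values = [0.01, 0.04, 0.03, 0.005]
adjusted = benjamini_hochberg(p_values)
print([round(a, 4) for a in adjusted])
```

Calling a pathway significant when its adjusted value is at most 0.05 corresponds to controlling the FDR at 5% under the usual assumptions of the procedure.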

Q-value

  • q-value is defined as the minimum FDR that can be attained when calling a "feature" significant (i.e., expected proportion of false positives incurred when calling that feature significant)

  • The estimated q-value is a function of the p-value for that test and the distribution of the entire set of p-values from the family of tests being considered (Storey and Tibshirani, PNAS, 2003)

  • Thus, in enrichment analysis, a q-value of 0.013 for pathway X means that about 1.3% of the pathways with p-values at least as small as that of pathway X are expected to be false positives

Gene enrichment vs. genome enrichment

  • Gene set enrichment analysis - summarizing many genes of interest, such as differentially expressed genes, with a few common gene annotations (molecular functions, canonical pathways)

  • Epigenomic enrichment analysis - summarizing many genomic regions of interest, such as disease-associated genomic variants, with a few common genome annotations (chromatin states, transcription factor binding sites)

Genomic regions

  • Gene/exon boundaries, promoters
  • Single Nucleotide Polymorphisms (SNPs)
  • Transcription Factor Binding Sites (TFBS)
  • Differentially methylated regions
  • CpG islands

Each genomic region has coordinates (unique IDs):

Chromosome, Start, End
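
Because regions are defined by coordinates, testing whether a region of interest falls in an annotation reduces to an interval overlap check. The Python sketch below assumes 0-based, half-open coordinates (the BED convention); the `Region` type, the function names, and the example coordinates are hypothetical.

```python
from typing import NamedTuple


class Region(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive (assumed convention)
    end: int    # exclusive


def overlaps(a: Region, b: Region) -> bool:
    """True if two regions share at least one base on the same chromosome."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end


def count_overlapping(regions, annotations) -> int:
    """Number of regions that overlap at least one annotation."""
    return sum(any(overlaps(r, a) for a in annotations) for r in regions)


# Hypothetical SNPs (1 bp regions) and annotated regulatory regions.
snps = [Region("chr1", 150, 151), Region("chr1", 500, 501), Region("chr2", 150, 151)]
marks = [Region("chr1", 100, 200), Region("chr2", 1000, 2000)]
n_overlapping = count_overlapping(snps, marks)
print(n_overlapping, "of", len(snps), "SNPs overlap an annotation")
```

This pairwise scan is quadratic and only meant to show the logic; sorted intervals or an interval tree would be the usual choice for genome-scale data.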

Annotations of genomic regions

  • Epigenomic (regulatory) regions - genomic regions annotated as carrying functional and/or regulatory potential

  • DNaseI hypersensitive sites
  • Histone modification marks
  • Transcription Factor Binding Sites
  • DNA methylation
  • Enhancers

Genome annotation consortia

Why "genomic region enrichment analysis"?

Enrichment = functional impact

  • Hypothesis: SNPs in epigenomic regions may disrupt regulation
  • More significant enrichment = more SNPs in epigenomic regions = more regulation is disrupted (SNP burden)

Statistics of epigenomic enrichments

  • 6 out of 7 disease-associated SNPs overlap with epigenomic marks
  • How likely is this to be observed by chance? (Chi-square test/Binomial test/Permutation test)
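
As a sketch of the binomial option, suppose a randomly placed SNP overlaps an epigenomic mark with probability 0.3. This background rate is a made-up value for illustration; in practice it would be estimated from the data, for example from the fraction of background SNPs that overlap the marks. The function name is also illustrative.

```python
from math import comb


def binomial_upper_tail(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(k, n + 1))


background_rate = 0.3  # hypothetical chance that a random SNP overlaps a mark
p_value = binomial_upper_tail(6, 7, background_rate)
print(f"P(6 or more of 7 SNPs overlap by chance) = {p_value:.4f}")
```

Under this assumed background rate, seeing 6 of 7 SNPs in marked regions would be unlikely by chance; the conclusion depends directly on how the background rate is chosen.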

Gene set enrichment analysis

Web

Gene set enrichment analysis

DIY

Gene annotation databases

Genomic regions enrichment analysis

Learn more

Thank you