- Why enrichment analysis?
- What is enrichment analysis?
- Gene ontology and pathways
- GENE ontology and pathways enrichment
- Multiple testing correction
- GENOMIC REGIONS enrichment
- Tools and references
People with similar genetic patterns are likely friends
Christakis NA, Fowler JH. "Friendship and natural selection." PNAS 2014 https://www.ncbi.nlm.nih.gov/pubmed/25024208
Enrichment analysis – detecting whether a group of objects has certain properties more (or less) frequently than expected by chance
Gene set - a priori classification of genes into biologically relevant groups (sets)
An ontology is a formal (hierarchical) representation of concepts and the relationships between them.
The objective of GO is to provide controlled vocabularies of terms for the description of gene products.
These terms are to be used as attributes of gene products, facilitating uniform queries across them.
Gene ontology describes multiple levels of detail of gene function.
https://www.ebi.ac.uk/QuickGO/
Different levels of evidence:
http://software.broadinstitute.org/gsea/msigdb/
Self-contained \(H_0\): genes in the gene set do not have any association with the phenotype
Problem: restrictive, uses information only from the gene set
Competitive \(H_0\): genes in the gene set have the same level of association with a given phenotype as genes in the complement gene set
Problem: wrong assumption of independent gene sampling
Overrepresentation analysis, Hypergeometric test
| | Diff. exp. genes | Not diff. exp. genes | Total |
|---|---|---|---|
| In gene set | k | j-k | j |
| Not in gene set | n-k | m-n-j+k | m-j |
| Total | n | m-n | m |
What is the probability of having \(k\) or more genes from the group in the selected \(n\) genes?
\[P = \sum_{i=k}^n{ \frac{{m-j \choose n-i}{j \choose i}}{{m \choose n}} }\]
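The sum above can be evaluated directly. The following is a minimal sketch in Python using only the standard library; the function name and the example numbers are illustrative assumptions, not part of the original slides. The symbols \(k\), \(n\), \(j\), \(m\) match the table.

```python
from math import comb


def hypergeom_upper_tail(k: int, n: int, j: int, m: int) -> float:
    """P(X >= k) for the hypergeometric distribution.

    m: total number of genes
    j: genes in the gene set
    n: selected (differentially expressed) genes
    k: selected genes that are also in the gene set
    """
    upper = min(n, j)  # terms with i > j are zero, so the sum can stop here
    numerator = sum(comb(j, i) * comb(m - j, n - i) for i in range(k, upper + 1))
    return numerator / comb(m, n)


# Hypothetical numbers, for illustration only
p = hypergeom_upper_tail(k=10, n=100, j=50, m=10000)
print(f"P(X >= 10) = {p:.3e}")
```

This upper-tail probability corresponds to a one-sided test on the 2x2 table, which is what a one-sided Fisher's exact test computes.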
Overrepresentation analysis (ORA)
stats::fisher.test()
GOstats::hyperGTest()
Example: https://github.com/mdozmorov/MDmisc/blob/master/R/gene_enrichment.R
Problems
Results significantly affected by the selected threshold
Many genes with moderate but meaningful expression changes are discarded
Wrong assumption that genes are independent
Functional Class Scoring (FCS)
Gene set enrichment analysis (GSEA). Mootha et al., 2003; modified by Subramanian et al., 2005.
Main rationale – functionally related genes often display coordinated expression to accomplish their roles in the cell
Aims to identify gene sets with "subtle but coordinated" expression changes that would be missed by threshold-based selection of differentially expressed genes (DEGs)
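To illustrate the idea of scoring a whole ranked gene list instead of thresholding it, here is a simplified sketch of a running-sum enrichment score. It is the unweighted, Kolmogorov-Smirnov-like variant; the published method of Subramanian et al. additionally weights hits by the ranking statistic and assesses significance by permutation, neither of which is shown. The function name and the toy gene names are assumptions made for illustration.

```python
def enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum enrichment score (simplified sketch).

    ranked_genes: genes ordered from most up- to most down-regulated
    gene_set: collection of genes in the set of interest
    Returns the running-sum value with the largest absolute deviation from zero.
    """
    in_set = [gene in gene_set for gene in ranked_genes]
    n_hits = sum(in_set)
    n_miss = len(ranked_genes) - n_hits
    if n_hits == 0 or n_miss == 0:
        raise ValueError("gene set must contain some, but not all, ranked genes")

    running, best = 0.0, 0.0
    for hit in in_set:
        # Step up at gene-set members, step down otherwise; both walks end at 0
        running += 1 / n_hits if hit else -1 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


# Toy ranked list: the set {"a", "b"} sits entirely at the top
print(enrichment_score(["a", "b", "c", "d", "e", "f"], {"a", "b"}))
```

A score near +1 or -1 means the set members cluster at one end of the ranking; a score near 0 means they are spread throughout it.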
Linear model-based
Impact analysis - incorporates topology of the pathway.
Sorin Draghici et al., “A Systems Biology Approach for Pathway Level Analysis,” Genome Research. 2007. https://www.ncbi.nlm.nih.gov/pubmed/17785539
Adi Laurentiu Tarca et al., “A Novel Signaling Pathway Impact Analysis,” Bioinformatics. 2009
With thousands of pathways to test for enrichment we’re not testing one hypothesis, but many hypotheses – one for each pathway
Analysis of 2,000 pathways at the commonly accepted significance level \(\alpha=0.05\) is expected to identify about 100 enriched pathways simply by chance, even if no pathway is truly enriched
If the probability of making an error in one test is 0.05, the probability of making at least one error in ten independent tests is \[1-(1-0.05)^{10}=0.40126\]
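Both numbers are easy to reproduce. A minimal sketch, assuming independent tests in which every null hypothesis is true; the function names are illustrative.

```python
def expected_chance_hits(n_tests: int, alpha: float) -> float:
    """Expected number of tests passing the threshold when all nulls are true."""
    return n_tests * alpha


def prob_at_least_one_error(n_tests: int, alpha: float) -> float:
    """P(at least one false positive) across independent tests."""
    return 1 - (1 - alpha) ** n_tests


print(f"{expected_chance_hits(2000, 0.05):.0f} pathways expected by chance")
print(f"{prob_at_least_one_error(10, 0.05):.5f}")  # 0.40126
```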
False Discovery rate (FDR)
\[E \left[ \frac{False \; Discoveries}{Total \; Discoveries} \right]\]
Family wise error rate (FWER)
\[Pr(Number \; of \; False \; positives \ge 1)\]
Expected number of false positives
\[E[Number \; of \; False \; positives]\]
Suppose 550 out of 10,000 genes are significant at \(\alpha = 0.05\)
P-value < 0.05
False Discovery Rate < 0.05
Family Wise Error Rate < 0.05
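The three criteria imply very different numbers of false positives for this example. The sketch below uses the usual back-of-the-envelope reading, which is an assumption on my part: the p-value bound treats all 10,000 genes as null (worst case), and the FDR bound treats the 550 genes as the list called significant.

```python
def false_positive_budget(n_genes: int, n_significant: int, alpha: float = 0.05) -> dict:
    """Rough expected false-positive counts under each error criterion."""
    return {
        # P-value < alpha: each null gene passes with probability alpha
        "p_value": alpha * n_genes,
        # FDR < alpha: about a fraction alpha of the called genes are false
        "fdr": alpha * n_significant,
    }


budget = false_positive_budget(n_genes=10_000, n_significant=550)
print(f"P-value < 0.05: about {budget['p_value']:.0f} false positives expected")
print(f"FDR < 0.05:     about {budget['fdr']:.1f} false positives expected")
# FWER < 0.05 is a statement about probability, not a count:
# P(at least one false positive among the called genes) <= 0.05
```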
Permutation based adjusted p-values
\[p = \frac{\sum_{b=1}^B{I(\vert{t_b}\vert \ge \vert{t}\vert)}}{B}\]
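where \(I(\cdot)\) is the indicator function, equal to 1 if the condition in parentheses is true and 0 otherwise, \(t\) is the observed t-statistic, \(t_b\) is the t-statistic computed on the \(b\)-th permutation of the sample labels, and \(B\) is the number of permutations.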
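The formula translates almost line by line into code. A minimal sketch for a single two-group comparison, assuming a Welch t-statistic and random label shuffling; the function names, the default number of permutations, and the seed are illustrative choices.

```python
import random
from math import sqrt
from statistics import mean, variance


def t_statistic(x, y):
    """Welch two-sample t-statistic."""
    return (mean(x) - mean(y)) / sqrt(variance(x) / len(x) + variance(y) / len(y))


def permutation_p_value(x, y, n_perm=1000, seed=0):
    """Fraction of label permutations with |t_b| >= |t| (the formula above)."""
    rng = random.Random(seed)
    t_obs = t_statistic(x, y)
    pooled = list(x) + list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # reassign group labels at random
        t_b = t_statistic(pooled[:len(x)], pooled[len(x):])
        if abs(t_b) >= abs(t_obs):
            hits += 1
    return hits / n_perm


# Toy data: two clearly separated groups
print(permutation_p_value([1, 2, 3, 4, 5], [11, 12, 13, 14, 15]))
```

The smallest nonzero p-value this can report is \(1/B\), so \(B\) limits the resolution of the result.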
Bonferroni procedure controls Family Wise Error Rate (FWER)
Example: \(0.05/10,000 = 0.000005\)
Very conservative; no attempt to incorporate dependence between tests
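A minimal sketch of the Bonferroni procedure, showing both the per-test threshold and the equivalent adjusted p-values. The function name and the toy p-values are illustrative.

```python
def bonferroni(p_values, alpha=0.05):
    """Return (per-test threshold, adjusted p-values, significance calls)."""
    m = len(p_values)
    threshold = alpha / m
    adjusted = [min(1.0, p * m) for p in p_values]  # capped at 1
    significant = [p <= threshold for p in p_values]
    return threshold, adjusted, significant


threshold, adjusted, significant = bonferroni([0.01, 0.04, 0.001])
print(threshold, adjusted, significant)
```

Comparing a raw p-value to \(\alpha/m\) and comparing the adjusted p-value to \(\alpha\) give the same calls.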
It may be more appropriate to emphasize the proportion of false positives among the differentially expressed genes.
The expectation of this proportion is the false discovery rate (FDR) (Benjamini & Hochberg, 1995)
q-value is defined as the minimum FDR that can be attained when calling a "feature" significant (i.e., expected proportion of false positives incurred when calling that feature significant)
The estimated q-value is a function of the p-value for that test and the distribution of the entire set of p-values from the family of tests being considered (Storey and Tibshirani, PNAS, 2003)
Thus, in the enrichment analysis, if a pathway X has a q-value of 0.013, it means that 1.3% of pathways with p-values at least as small as that of pathway X are expected to be false positives
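As a concrete illustration, the sketch below computes Benjamini-Hochberg adjusted p-values. These coincide with q-value estimates only under the assumption that the proportion of true nulls is 1; the Storey and Tibshirani method additionally estimates that proportion from the p-value distribution, which is not done here. The function name and toy p-values are illustrative.

```python
def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

Calling every pathway with an adjusted value below 0.05 significant controls the FDR at 5% under the assumptions of the Benjamini-Hochberg procedure.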
Martin Krzywinski & Naomi Altman "Points of significance: Comparing samples—part II" Nature Methods 2014 http://www.nature.com/nmeth/journal/v11/n4/full/nmeth.2900.html
Each genomic region has coordinates (unique IDs): Chromosome, Start, End
Epigenomic (regulatory) regions - genomic regions annotated as carrying functional and/or regulatory potential
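Because regions are identified by chromosome, start, and end, the basic operation behind region-level enrichment is an overlap test between regions of interest and annotated regions. A minimal sketch, assuming 0-based half-open coordinates (BED-style); the class, the function names, and the example coordinates are hypothetical.

```python
from typing import NamedTuple


class Region(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive (assumed BED-style convention)
    end: int    # exclusive


def overlaps(a: Region, b: Region) -> bool:
    """True if the two regions share at least one base."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end


def count_overlapping(query, annotation) -> int:
    """Number of query regions overlapping at least one annotated region."""
    return sum(any(overlaps(q, r) for r in annotation) for q in query)


# Hypothetical coordinates, for illustration only
query = [Region("chr1", 100, 200), Region("chr1", 500, 600), Region("chr2", 100, 200)]
annotation = [Region("chr1", 150, 300), Region("chr2", 300, 400)]
print(count_overlapping(query, annotation))  # 1
```

The resulting count, together with the totals, can fill a 2x2 table analogous to the one used for gene sets. The nested loop is quadratic; real region sets are typically handled with interval trees or sorted sweeps.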
…
Enrichment = functional impact
goana, camera, roast, romer
Questions?
This presentation on GitHub:
Mikhail Dozmorov, Ph.D.
Assistant professor, Department of Biostatistics, VCU