Introduction to the theory and practice of RNA sequencing (RNA-seq) analysis
- Goals for RNA sequencing
- RNA-seq technology
- Experimental design
- RNA-seq analysis workflow
- Gene expression analysis
- Functional interpretation analysis
- Alternative splicing
Introduction to the theory and practice of RNA sequencing (RNA-seq) analysis
And many more, http://journals.plos.org/ploscompbiol/article/file?type=supplementary&id=info:doi/10.1371/journal.pcbi.1004393.s003
Commercially available
Experimental
Agilent 2100 bioanalyzer. RIN - RNA integrity number (should be >7)
Unstranded: Random hexamer priming to reverse-transcribe mRNA
Stranded: dUTP method - incorporating UTP nucleotides during the second cDNA synthesis, followed by digestion of the strand containing dUTP
Replication. It allows the experimenter to obtain an estimate of the experimental error
Randomization. It requires the experimenter to use a random choice of every factor that is not of interest but might influence the outcome of the experiment. Such factors are called nuisance factors
Blocking. Creating homogeneous blocks of data in which a nuisance factor is kept constant while the factor of interest is allowed to vary. Used to increase the accuracy with which the influence of the various factors is assessed in a given experiment
Block what you can, randomize what you cannot
In RNA-seq, we have multiple levels of randomness:
Auer, P.,RW Doerge. "Statistical Design and Analysis of RNA Sequencing Data." Genetics, 2010 http://www.genetics.org/content/185/2/405.long
Statistical power to detect differential expression varies with effect size, sequencing depth and number of replicates
Replicates per group | 3 | 5 | 10 |
---|---|---|---|
Effect size (fold change) | |||
1.25 | 17 % | 25 % | 44 % |
1.5 | 43 % | 64 % | 91 % |
2 | 87 % | 98 % | 100 % |
Statistical power to detect differential expression varies with effect size, sequencing depth and number of replicates
Replicates per group | 3 | 5 | 10 |
---|---|---|---|
Sequencing depth (millions of reads) | |||
3 | 19 % | 29 % | 52 % |
10 | 33 % | 51 % | 80 % |
15 | 38 % | 57 % | 85 % |
Scotty - Power Analysis for RNA Seq Experiments, http://scotty.genetics.utah.edu/
powerSampleSizeCalculator - R scripts for power analysis and sample size estimation for RNA-Seq differential expression, http://www2.hawaii.edu/~lgarmire/RNASeqPowerCalculator.htm
RnaSeqSampleSize - R package and a Shiny app for RNA sequencing data sample size estimation, https://cqs.mc.vanderbilt.edu/shiny/RNAseqPS/
Busby MA, Stewart C, Miller CA, Grzeda KR, Marth GT. "Scotty: a web tool for designing RNA-Seq experiments to measure differential gene expression". Bioinformatics 2013 https://www.ncbi.nlm.nih.gov/pubmed/23314327
Travers C. et.al. "Power analysis and sample size estimation for RNA-Seq differential expression" RNA 2014 https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4201821/
Guo et.al. "RNAseqPS: A Web Tool for Estimating Sample Size and Power for RNAseq Experiment" Cancer Informatics 2014 http://insights.sagepub.com/rnaseqps-a-web-tool-for-estimating-sample-size-and-power-for-rnaseq-ex-article-a4433
Svensson, V. et.al. "Power Analysis of Single-Cell RNA-Sequencing Experiments." Nature Methods 2017 http://www.nature.com/nmeth/journal/v14/n4/pdf/nmeth.4220.pdf
FASTA: text-based representation of nucleotide sequence. http://zhanglab.ccmb.med.umich.edu/FASTA/
FASTQ: sequence and quality info
Generally, do not remove duplicates from RNA-seq data
MarkDuplicates
command, https://broadinstitute.github.io/picard/command-line-overview.html#MarkDuplicatesAlign to the reference genome is the most common for transcript quantification
subjunc
for junction detection, and featureCounts
for extracting read counts per gene from aligned SAM/BAM files, http://subread.sourceforge.net/Timeline and extensive comparison of aligners: https://www.ebi.ac.uk/~nf/hts_mappers/
– SAM stands for Sequence Alignment/Map format. The SAM format consists of two sections:
Header section
Alignment section
SAM format specification https://samtools.github.io/hts-specs/SAMv1.pdf
Other BAM compression strategies are a subject of research. See ‘CRAM’ format for example, http://www.internationalgenome.org/faq/what-are-cram-files/
@HD The header line VN: format version SO: Sorting order of alignments @SQ Reference sequence dictionary SN: reference sequence name LN: reference sequence length SP: species @RG Read group ID: read group identifier CN: name of sequencing center SM: sample name @PG Program PN: program name VN: program version
SAM/BAM files
samtools
, view, sort, index, QC, stats on SAM/BAM files, and more, https://github.com/samtools/samtoolssambamba
, view, sort, index, merge, stats, mark duplicates. fast laternative to samtools
, https://lomereiter.github.io/sambamba/index.htmlpicard
, QC, validation, duplicates removal and many more utility tools, https://broadinstitute.github.io/picard/BED files
bedtools
, universal tools for manipulating genomic regions, https://bedtools.readthedocs.io/en/latest/bedops
, complementary to bedtools
, providing additional functionality and speedup, https://bedtools.readthedocs.io/en/latest/Integrative Genomics Viewer (IGV), http://software.broadinstitute.org/software/igv/
Features
Automation of specific tasks using command-line interface
Tutorial: https://github.com/griffithlab/rnaseq_tutorial/wiki/IGV-Tutorial
HTSeq (htseq-count), http://www-huber.embl.de/HTSeq/doc/count.html
htseq-count --mode intersec=on-strict --stranded no --minaqual 1 --type exon --ida_r transcript_id accepted_hits.sam chr22.gff > transcript_read_counts_table.tsv
htseq-count
: http://seqanswers.com/forums/showthread.php?t=18068featureCounts, http://bioinf.wehi.edu.au/featureCounts/
featureCounts -t exon -g gene_id -a annotation.gtf -o counts.txt library1.bam library2.bam library3.bam
\[RPKM\ (or\ FPKM)=(10^9*C)/(N*L)\]
\(L\) - number of base pairs in the gene/transcript/exon/etc.
RSEM: RNA-Seq by Expectation-Maximization, https://www.ncbi.nlm.nih.gov/pubmed/21816040
https://haroldpimentel.wordpress.com/2014/05/08/what-the-fpkm-a-review-rna-seq-expression-units/
FPKM is calculated as
TPM reverses the order - length first, library size second
Batch effects are widespread in high-throughput biology. They are artifacts not related to the biological variation of scientific interests.
For instance, two experiments on the same technical replicates processed on two different days might present different results due to factors such as room temperature or the two technicians who did the two experiments.
Batch effects can substantially confound the downstream analysis, especially meta-analysis across studies.
ComBat - Location-scale method
The core idea of ComBat was that the observed measurement \(Y_{ijg}\) for the expression value of gene \(g\) for sample \(j\) from batch \(i\) can be expressed as
\[Y_{ijg}=\alpha_g+X\beta_g+\gamma_{ig}+\delta_{ig}\epsilon_{ijg}\]
where \(X\) consists of covariates of scientific interests, while \(\gamma_{ig}\) and \(\delta_{ig}\) characterize the additive and multiplicative batch effects of batch \(i\) for gene \(g\).
After obtaining the estimators from the above linear regression, the raw data \(Y_{ijg}\) can be adjusted to \(Y_{ijg}^*\):
\[Y_{ijg}^*=\frac{Y_{ijg}-\hat{\alpha_g}-X\hat{\beta_g}-\hat{\gamma_{ig}}}{\hat{\delta_{ig}}}+\hat{\alpha_g}+X\hat{\beta_g}\]
For real application, an empirical Bayes method was applied for parameter estimation.
When batches were unknown, the surrogate variable analysis (SVA) was developed.
The main idea was to separate the effects caused by covariates of our primary interests from the artifacts not modeled.
Now the raw expression value \(Y_{jg}\) of gene \(g\) in sample \(j\) can be formulated as:
\[Y_{jg}=\alpha_g+X\beta_g+\sum_{k=1}^K{\lambda_{kg}\eta_{kj}}+\epsilon_{jg}\]
where \(\eta_{kj}\)s represent the unmodeled factors and are called as “surrogate variables”.
Once again, the basic idea was to estimate \(\eta_{kj}\)s and adjust them accordingly.
An iterative algorithm based on singular value decomposition (SVD) was derived to iterate between estimating the main effects \(\hat{\alpha_g}+X\hat{\beta_g}\) given the estimation of surrogate variables and estimating surrogate variables from the residuals \(r_{jg}=Y_{jg}-\hat{\alpha_g}-X\hat{\beta_g}\)
sva
package in BioconductorComBat
function for removing effects of known batches.edata
: a matrix for raw expression valuesbatch
: a vector named for batch numbers.modcombat = model.matrix(~1, data=as.factor(batch)) combat_edata = ComBat(dat=edata, batch=batch, mod=modcombat, par.prior=TRUE, prior.plot=FALSE)
https://bioconductor.org/packages/release/bioc/html/sva.html
For sequencing data, svaseq
, the generalized version of SVA, suggested applying a moderated log transformation to the count data or fragments per kilobase of exon per million fragments mapped (FPKM) first to account for the nature of discrete distributions
Instead of a direct transformation on the raw counts or FPKM, remove unwanted variation (RUV) adopted a generalized linear model for \(Y_{jg}\)
A Bioconductor package with a GUI (shiny app).
DESeq2
- https://bioconductor.org/packages/release/bioc/html/DESeq.html, https://genomebiology.biomedcentral.com/articles/10.1186/s13059-014-0550-8edgeR
- https://bioconductor.org/packages/release/bioc/html/edgeR.html, https://academic.oup.com/bioinformatics/article-lookup/doi/10.1093/bioinformatics/btp616
limma
- Linear Models for Microarray Data, https://bioconductor.org/packages/release/bioc/html/limma.htmlvoom
- variance modeling at the observational level transformation. Uses the variance of genes to create weights for use in linear models. https://genomebiology.biomedcentral.com/articles/10.1186/gb-2014-15-2-r29
voom
transformation, the RNA-seq data can be analyzed using limma
.The expected value of \(k\) would be \(k_e=(n/m)*j\).
If \(k > k_e\), functional category is said to be enriched, with a ratio of enrichment \(r=k/k_e\)
Diff. exp. genes | Not Diff. exp. genes | Total | |
---|---|---|---|
In gene set | k | j-k | j |
Not in gene set | n-k | m-n-j+k | m-j |
Total | n | m-n | m |
What is the probability of having \(k\) or more genes from the category in the selected \(n\) genes?
\[P = \sum_{i=k}^n{\frac{\binom{m-j}{n-i}\binom{j}{i}}{{m \choose n}}}\]
\(k < (n/m)*j\) - underrepresentation. Probability of \(k\) or less genes from the category in the selected \(n\) genes?
\[P = \sum_{i=0}^k{\frac{\binom{m-j}{n-i}\binom{j}{i}}{{m \choose n}}}\]
stats::fisher.test()
GOstats::hyperGTest()
Example: https://github.com/mdozmorov/MDmisc/blob/master/R/gene_enrichment.R
The outcome of the overrepresentation test depends on the significance threshold used to declare genes differentially expressed.
Functional categories in which many genes exhibit small changes may go undetected.
Genes are not independent, so a key assumption of the Fisher’s exact tests is violated.
Gene set analysis (GSA). Mootha et al., 2003; modified by Subramanian, et al. "Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles." PNAS 2005 http://www.pnas.org/content/102/43/15545.abstract
Main rationale – functionally related genes often display a coordinated expression to accomplish their roles in the cells
Aims to identify gene sets with "subtle but coordinated" expression changes that would be missed by DEGs threshold selection
The null hypothesis is that the rank ordering of the genes in a given comparison is random with regard to the case-control assignment.
The alternative hypothesis is that the rank ordering of genes sharing functional/pathway membership is associated with the case-control assignment.
Linear model-based
Linear model-based
List of other genome analysis plaforms - https://docs.google.com/spreadsheets/d/1o8iYwYUy0V7IECmu21Und3XALwQihioj23WGv-w0itk/pubhtml
Alamancos, G. et.al. "Methods to Study Splicing from High-Throughput RNA Sequencing Data." Spliceosomal Pre-mRNA Splicing: Methods and Protocols, 2014 https://www.ncbi.nlm.nih.gov/pubmed/24549677
Best approach to predict novel and alternative splicing events from RNA-seq data
Alternative splicing detection
Identifying genes that express different isoforms in cancer vs normal RNA-seq data
Visualization of alternative splicing events using RNA-seq data
Griffith, M. et.al. "Informatics for RNA Sequencing: A Web Resource for Analysis on the Cloud." PLoS Computational Biology 2015 - RNA-seq overview and extensive supplementary material http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1004393. The complete practical RNA-seq tutorial https://github.com/griffithlab/rnaseq_tutorial
Conesa, A. et.al. "A Survey of Best Practices for RNA-Seq Data Analysis."" Genome Biology 2016. http://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0881-8
Law, C. et.al. "RNA-Seq Analysis Is Easy as 1-2-3 with Limma, Glimma and edgeR." F1000Research 2016. - Latest Rsubread-limma plus pipeline https://f1000research.com/articles/5-1408/v2. The complete R code for RNA-seq analysis tutorial https://www.bioconductor.org/help/workflows/RNAseq123/
Questions?
This presentation on GitHub:
Mikhail Dozmorov, Ph.D.
Assistant professor, Department of Biostatistics, VCU