Objectives

Introduction to the theory and practice of RNA sequencing (RNA-seq) analysis

  • Goals for RNA sequencing
  • RNA-seq technology
  • Experimental design
  • RNA-seq analysis workflow
  • Gene expression analysis
  • Functional interpretation analysis
  • Alternative splicing

Goals for RNA sequencing

Gene expression

Why RNA-sequencing?

Advantage of RNA-Seq over Microarray

  • Much richer information beyond quantitation
    • Boundary of gene transcripts: both 5' and 3' end, to nucleotide level
    • Alternative exon usage, novel splicing junction detection
    • SNP/indel discovery in transcripts: both coding and UTRs
    • Allele specific expression: critical in imprinting, cancer

 

  • Not relying on gene annotation by mapping to the whole genome
    • No longer biased by probe design
    • Novel gene and exon discovery enabled

Advantage of RNA-Seq over Microarray

  • Better performance at quantitation
    • Unlimited dynamic range: by increasing depth as needed
    • Higher specificity and accuracy: digital counts of transcript copies, very low background noise
    • Higher sensitivity: more transcripts and more differential genes detected

 

  • Re-analysis easily done by computation, as gene annotation keeps evolving
  • De novo assembly possible, not relying on reference genome sequence
  • Comparable cost, continuing to drop

RNA-seq technology

Sequencing technologies

Commercially available

  • Illumina/Solexa - short reads, sequencing-by-synthesis
  • Life Technologies Ion Torrent/Proton - short reads, Ion Semiconductor sequencing
  • Pacific Biosciences - long reads, Single Molecule Real Time sequencing

Experimental

  • Nanopore sequencing - continuous sequencing (very long reads), fluctuations of the ionic current from nucleotides passing through the nanopore

Sequencing technologies

  • Specifications depend on the library preparation kits, Single or Dual flow cells, High-Output or Rapid-Run modes

Sequencing technologies

Illumina sequencing workflow

Overview of RNA sequencing technology

RNA-Seq Limitations

  • Quantitation influenced by many confounding factors
    • "Sequenceability" - varying across genomic regions, local GC content and structure related
    • Varying length of gene transcripts and exons
    • Bias in read ends due to reverse transcription, subtle but consistent
    • Varying extent of PCR amplification artifacts
    • Effect of RNA degradation in the real world
    • Computational bias in aligning reads to genome due to aligners

RNA-Seq Limitations

  • SNP discovery in RNA-seq is more challenging than in DNA
    • Varying levels of coverage depth
    • False discovery around splicing junctions due to incorrect mapping
  • De novo assembly of transcripts without genome sequence: computationally intensive but possible, technical improvements will help
    • longer read length
    • lower error rate
    • more uniform nucleotide coverage of transcripts - more equalized transcript abundance

Library preparation

  • RNA isolation
    • \(0.1-1\mu g\) original total RNA
  • Ribosomal RNA (rRNA) depletion
    • rRNAs constitute over 90 % of total RNA in the cell, leaving the 1–2 % comprising messenger RNA (mRNA) that we are normally interested in.
    • Enriches for mRNA + long noncoding RNA.
    • Hybridization to bead-bound rRNA probes

Library preparation

RNA quality

Agilent 2100 bioanalyzer. RIN - RNA integrity number (should be >7)

Library preparation steps

  • Fragmentation, to recover short reads across full length of long genes
  • Size selection, suitable for RNA sequencing. 300-500bp - mRNA, 20-150bp - small/miRNA
  • Amplification, typically by PCR. Up to \(0.5-10ng\) of RNA
  • Library normalization/Exome capture

Unstranded vs. Strand-specific library

Library preparation steps

Sequencing length/depth

  • Longer reads improve mappability and transcript quantification
  • More transcripts will be detected and their quantification will be more precise as the sample is sequenced to a deeper level
  • Up to 100 million reads is needed to precisely quantify low expressed transcripts. In reality, 20-30 million reads is OK for human genome.

Overview of RNA sequencing technology

Experimental design

Experimental design

  • Replication. It allows the experimenter to obtain an estimate of the experimental error

  • Randomization. It requires the experimenter to use a random choice of every factor that is not of interest but might influence the outcome of the experiment. Such factors are called nuisance factors

  • Blocking. Creating homogeneous blocks of data in which a nuisance factor is kept constant while the factor of interest is allowed to vary. Used to increase the accuracy with which the influence of the various factors is assessed in a given experiment

  • Block what you can, randomize what you cannot

Experimental design

In RNA-seq, we have multiple levels of randomness:

  • Biological variability in samples
  • Stochasticity of RNA content
  • Randomness of fragments being sequenced
  • Technical variability

 

Auer, P.,RW Doerge. "Statistical Design and Analysis of RNA Sequencing Data." Genetics, 2010 http://www.genetics.org/content/185/2/405.long

Experimental design: Multiplexing balances technical variability

Number of replicates

Number of replicates

Power calculations

Busby MA, Stewart C, Miller CA, Grzeda KR, Marth GT. "Scotty: a web tool for designing RNA-Seq experiments to measure differential gene expression". Bioinformatics 2013 https://www.ncbi.nlm.nih.gov/pubmed/23314327

Travers C. et.al. "Power analysis and sample size estimation for RNA-Seq differential expression" RNA 2014 https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4201821/

Guo et.al. "RNAseqPS: A Web Tool for Estimating Sample Size and Power for RNAseq Experiment" Cancer Informatics 2014 http://insights.sagepub.com/rnaseqps-a-web-tool-for-estimating-sample-size-and-power-for-rnaseq-ex-article-a4433

Svensson, V. et.al. "Power Analysis of Single-Cell RNA-Sequencing Experiments." Nature Methods 2017 http://www.nature.com/nmeth/journal/v14/n4/pdf/nmeth.4220.pdf

RNA-seq analysis workflow

FASTA/FASTQ format

Quality control

Quality of base calling

  • Phred quality score is widely used to characterize the quality of base calling
  • Phred quality score = \(-10*log_{10}(P)\), where P is probability that base-calling is wrong
  • Phred score of 30 means there is 1/1000 chance that the base-calling is wrong
  • The quality of the bases tend to drop at the end of the read, a pattern observed in sequencing-by-synthesis techniques

Adapter trimming

Alignment

  • RNA-seq aligners face an additional problem, not encountered in DNA-only alignment: many RNA-seq reads will span introns
  • The average human intron length is >6,000 bp (some are >1 Mbp in length)
  • In a typical human RNA-seq experiment using 100-bp reads, >35% of the reads will span multiple exons - align over splice junctions
  • Aligners must be splice-aware, especially when aligning longer (>50bp) reads

Duplicates removal

  • Duplicates may correspond to biased PCR amplification of particular fragments
  • For highly expressed, short genes, duplicates are expected even if there is no amplification bias
  • Removing them may reduce the dynamic range of expression estimates

Generally, do not remove duplicates from RNA-seq data

Alignment strategies

Align to the reference genome is the most common for transcript quantification

Alignment - Mapping RNA-seq reads to the genome

  • BWA: general purpose algorithms tuned for different tasks, http://bio-bwa.sourceforge.net/
  • STAR: fast and accurate aligner, https://github.com/alexdobin/STAR
  • HISAT: (hierarchical indexing for spliced alignment of transcripts) uses two types of indexes for alignment: a global, whole-genome index and tens of thousands of small local indexes. Can detect novel splice sites, transcription initiation and termination sites. A part of the new "Tuxedo suite", including StringTie and Ballgown, http://ccb.jhu.edu/software/hisat2/index.shtml.
  • subread: a fast and accurate aligner, R and command line. The whole package includes subjunc for junction detection, and featureCounts for extracting read counts per gene from aligned SAM/BAM files, http://subread.sourceforge.net/

Timeline and extensive comparison of aligners: https://www.ebi.ac.uk/~nf/hts_mappers/

SAM format of aligned data

– SAM stands for Sequence Alignment/Map format. The SAM format consists of two sections:

Header section

  • Used to describe source of data, reference sequence, method of alignment, etc.

Alignment section

  • Used to describe the read, quality of the read, and nature alignment of the read to a region of the genome

SAM format specification https://samtools.github.io/hts-specs/SAMv1.pdf

BAM file format of aligned data

  • BAM is the binary version of a SAM file. Smaller, but not easily readable.
  • Compressed using lossless BGZF format
  • Other BAM compression strategies are a subject of research. See ‘CRAM’ format for example, http://www.internationalgenome.org/faq/what-are-cram-files/

  • BAM files are usually indexed. An index is stored alongside the BAM file with a ".bai" extension
  • Indexing aims to achieve fast retrieval of alignments overlapping a specified region without going through the whole alignments.
  • BAM must be sorted before indexing. Depending on the downstream tools, sort by
    • Name
    • Coordinate

SAM/BAM header section

  • Used to describe source of data, reference sequence, method of alignment, etc.
  • Each section begins with character ‘@’ followed by a two-letter record type code. These are followed by two-letter tags and values
@HD The header line
VN: format version
SO: Sorting order of alignments
@SQ Reference sequence dictionary
SN: reference sequence name
LN: reference sequence length
SP: species
@RG Read group
ID: read group identifier
CN: name of sequencing center
SM: sample name
@PG Program
PN: program name
VN: program version

SAM/BAM alignment section

Using SAM flags to filter subsets of reads

  • 12 bitwise flags describing the alignment
  • These flags are stored as a binary string of length 11
  • Value of ‘1’ indicates the flag is set. e.g. 00100000000
  • All combinations can be represented as a number from 1 to 2048 (i.e. \(2^{11}-1\)). This number is used in the BAM/SAM file. You can specify "required" or "filter" flags in samtools view using the '-f' and '-F' options, respectively
  • https://broadinstitute.github.io/picard/explain-flags.html

Source: https://samtools.github.io/hts-specs/SAMv1.pdf

CIGAR string

  • The CIGAR string is a sequence of base lengths and associated "operations" that are used to indicate which bases align to the reference (either a match or mismatch), are deleted, are inserted, represent introns, etc.
    • e.g. 81M859N19M
    • Read as: A 100 bp read consists of: 81 bases of alignment to reference, 859 bases skipped (an intron), 19 bases of alignment

Source: https://samtools.github.io/hts-specs/SAMv1.pdf

Browser Extensible Data (BED) format

  • When working with BAM files, it is very common to want to examine reads aligned to a focused subset of the reference genome, e.g. the exons of a gene
  • Focus on location - genomic coordinates
  • Basic BED format (plain text, tab separated):
  • Chromosome name, start position, end position
  • Coordinates in BED format are 0 based

https://genome.ucsc.edu/FAQ/FAQformat#format1

GFF/GTF file format

Tools to work with SAM/BAM/BED files

SAM/BAM files

BED files

Visualization

IGV

Features

  • Explore large genomic datasets with an intuitive, easy-to-use interface.
  • Integrate multiple data types with clinical and other sample information.
  • View data from multiple sources:
    • local, remote, and "cloud-based".
    • Intelligent remote file handling - no need to download the whole dataset
  • Automation of specific tasks using command-line interface

  • Tutorial: https://github.com/griffithlab/rnaseq_tutorial/wiki/IGV-Tutorial

Gene expression analysis

Expression estimation for known genes and transcripts

HTSeq (htseq-count), http://www-huber.embl.de/HTSeq/doc/count.html

featureCounts, http://bioinf.wehi.edu.au/featureCounts/

  • Summarize multiple datasets at the same time:
  • featureCounts -t exon -g gene_id -a annotation.gtf -o counts.txt library1.bam library2.bam library3.bam

Expression estimation for known genes and transcripts

  • Counts of reads: The relative expression of a transcript is proportional to the number of cDNA fragmets that originate from it ~ number of aligned reads. Disadvantages: longer gene produce more reads, library depth (total counts) influence counts of individual transcripts
  • Counts per million: counts scaled by the library depth in million units. \(CPM=C * 10^6 / N\)
  • RPKM: Reads Per Kilobase of transcript per Million mapped reads.
  • FPKM: Fragments Per Kilobase of transcript per Million mapped reads.

Expression estimation for known genes and transcripts

TPM: Transcript per Kilobase Million

Scripts for RNA-seq data analysis

Batch effects

  • Batch effects are widespread in high-throughput biology. They are artifacts not related to the biological variation of scientific interests.

  • For instance, two experiments on the same technical replicates processed on two different days might present different results due to factors such as room temperature or the two technicians who did the two experiments.

  • Batch effects can substantially confound the downstream analysis, especially meta-analysis across studies.

Batch sources

ComBat

ComBat - Location-scale method

The core idea of ComBat was that the observed measurement \(Y_{ijg}\) for the expression value of gene \(g\) for sample \(j\) from batch \(i\) can be expressed as

\[Y_{ijg}=\alpha_g+X\beta_g+\gamma_{ig}+\delta_{ig}\epsilon_{ijg}\]

where \(X\) consists of covariates of scientific interests, while \(\gamma_{ig}\) and \(\delta_{ig}\) characterize the additive and multiplicative batch effects of batch \(i\) for gene \(g\).

https://www.bu.edu/jlab/wp-assets/ComBat/Abstract.html

ComBat

After obtaining the estimators from the above linear regression, the raw data \(Y_{ijg}\) can be adjusted to \(Y_{ijg}^*\):

\[Y_{ijg}^*=\frac{Y_{ijg}-\hat{\alpha_g}-X\hat{\beta_g}-\hat{\gamma_{ig}}}{\hat{\delta_{ig}}}+\hat{\alpha_g}+X\hat{\beta_g}\]

For real application, an empirical Bayes method was applied for parameter estimation.

https://www.bu.edu/jlab/wp-assets/ComBat/Abstract.html

SVA

When batches were unknown, the surrogate variable analysis (SVA) was developed.

The main idea was to separate the effects caused by covariates of our primary interests from the artifacts not modeled.

Now the raw expression value \(Y_{jg}\) of gene \(g\) in sample \(j\) can be formulated as:

\[Y_{jg}=\alpha_g+X\beta_g+\sum_{k=1}^K{\lambda_{kg}\eta_{kj}}+\epsilon_{jg}\]

where \(\eta_{kj}\)s represent the unmodeled factors and are called as “surrogate variables”.

SVA

Once again, the basic idea was to estimate \(\eta_{kj}\)s and adjust them accordingly.

An iterative algorithm based on singular value decomposition (SVD) was derived to iterate between estimating the main effects \(\hat{\alpha_g}+X\hat{\beta_g}\) given the estimation of surrogate variables and estimating surrogate variables from the residuals \(r_{jg}=Y_{jg}-\hat{\alpha_g}-X\hat{\beta_g}\)

sva package in Bioconductor

  • Contains ComBat function for removing effects of known batches.
  • Assume we have:
    • edata: a matrix for raw expression values
    • batch: a vector named for batch numbers.
modcombat = model.matrix(~1, data=as.factor(batch)) 

combat_edata = ComBat(dat=edata, batch=batch, mod=modcombat, par.prior=TRUE, prior.plot=FALSE)

https://bioconductor.org/packages/release/bioc/html/sva.html

SVASEQ

For sequencing data, svaseq, the generalized version of SVA, suggested applying a moderated log transformation to the count data or fragments per kilobase of exon per million fragments mapped (FPKM) first to account for the nature of discrete distributions

Instead of a direct transformation on the raw counts or FPKM, remove unwanted variation (RUV) adopted a generalized linear model for \(Y_{jg}\)

BatchQC - Batch Effects Quality Control

Differential expression analysis

  • Many tools for differential expression analysis design their statistics around raw read counts
  • Poisson distribution (single-parameter model, mean = variance)?
  • Genes with larger average expression (counts) have on average larger variance across samples.
  • Negative Binomial is a good approximation, https://www.ncbi.nlm.nih.gov/pubmed/23975260
  • Variability is modeled by the dispersion parameter

Differential expression analysis

Differential expression analysis

 

Functional interpretation analysis

Interpretation

Enrichment analysis, Hypergeometric test

  • \(m\) is the total number of genes
  • \(j\) is the number of genes are in the functional category
  • \(n\) is the number of differentially expressed genes
  • \(k\) is the number of differentially expressed genes in the category

Enrichment analysis, Hypergeometric test

  • \(m\) is the total number of genes
  • \(j\) is the number of genes are in the functional category
  • \(n\) is the number of differentially expressed genes
  • \(k\) is the number of differentially expressed genes in the category

The expected value of \(k\) would be \(k_e=(n/m)*j\).

If \(k > k_e\), functional category is said to be enriched, with a ratio of enrichment \(r=k/k_e\)

Enrichment analysis, Hypergeometric test

  • \(m\) is the total number of genes
  • \(j\) is the number of genes are in the functional category
  • \(n\) is the number of differentially expressed genes
  • \(k\) is the number of differentially expressed genes in the category
Diff. exp. genes Not Diff. exp. genes Total
In gene set k j-k j
Not in gene set n-k m-n-j+k m-j
Total n m-n m

Enrichment analysis, Hypergeometric test

  • \(m\) is the total number of genes
  • \(j\) is the number of genes are in the functional category
  • \(n\) is the number of differentially expressed genes
  • \(k\) is the number of differentially expressed genes in the category

What is the probability of having \(k\) or more genes from the category in the selected \(n\) genes?

\[P = \sum_{i=k}^n{\frac{\binom{m-j}{n-i}\binom{j}{i}}{{m \choose n}}}\]

Enrichment analysis, Hypergeometric test

  • \(m\) is the total number of genes
  • \(j\) is the number of genes are in the functional category
  • \(n\) is the number of differentially expressed genes
  • \(k\) is the number of differentially expressed genes in the category

\(k < (n/m)*j\) - underrepresentation. Probability of \(k\) or less genes from the category in the selected \(n\) genes?

\[P = \sum_{i=0}^k{\frac{\binom{m-j}{n-i}\binom{j}{i}}{{m \choose n}}}\]

Enrichment analysis, Hypergeometric test

Problems with Fisher's exact test

  • The outcome of the overrepresentation test depends on the significance threshold used to declare genes differentially expressed.

  • Functional categories in which many genes exhibit small changes may go undetected.

  • Genes are not independent, so a key assumption of the Fisher’s exact tests is violated.

Many GO enrichment tools

Enrichment analysis, Functional Class Scoring (FCS)

  • Gene set analysis (GSA). Mootha et al., 2003; modified by Subramanian, et al. "Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles." PNAS 2005 http://www.pnas.org/content/102/43/15545.abstract

  • Main rationale – functionally related genes often display a coordinated expression to accomplish their roles in the cells

  • Aims to identify gene sets with "subtle but coordinated" expression changes that would be missed by DEGs threshold selection

GSEA: Gene set enrichment analysis

  • The null hypothesis is that the rank ordering of the genes in a given comparison is random with regard to the case-control assignment.

  • The alternative hypothesis is that the rank ordering of genes sharing functional/pathway membership is associated with the case-control assignment.

GSEA: Gene set enrichment analysis

  1. Sort genes by log fold change
  2. Calculate running sum - increment when gene in a set, decrement when not
  3. Maximum of the runnig sum is the enrichment score - larger means genes in a set are toward top of the sorted list
  4. Permute subject labels to calculate significance p-value

Other approaches

Linear model-based

  • CAMERA (Wu and Smyth 2012)
  • Correlation-Adjusted MEan RAnk gene set test
  • Estimating the variance inflation factor associated with inter-gene correlation, and incorporating this into parametric or rank-based test procedures

Other approaches

Linear model-based

  • ROAST (Wu et.al. 2010)
  • Under the null hypothesis (and assuming a linear model) the residuals are independent and identically distributed \(N(0,\sigma_g^2)\).
  • We can rotate the residual vector for each gene in a gene set, such that gene-gene expression correlations are preserved.

Genome analysis platform: Galaxy

Alternative splicing

Alternative splicing

Alternative splicing

References

References

Thank you