1 Background
2 RNA-Seq Analysis using RcwlPipelines
- 2.1 Example: RNA-Seq alignment

1 Background

1.1 DNA (Deoxyribo-nucleic acid)

DNA is a double-stranded, helical molecule composed of four different kinds of nucleotides (Figure 1), each of which contains a phosphate group, a sugar molecule, and a nitrogenous base. Carbon atoms in the sugar ring of a nucleotide are numbered from 1’ to 5’. For a single-stranded DNA (ssDNA, and also for an RNA strand), the end that has a free hydroxyl (or phosphate) group on a 5’ carbon is called the “5 prime” end, whereas the other end, which has one on a 3’ carbon, is called the “3 prime” end. DNA carries the genetic information for the development and functioning of every organism.

Figure 1: Adenine nucleotide.

Because there are four naturally occurring nitrogenous bases, there are four different types of nucleotides: adenine (A), thymine (T), guanine (G), and cytosine (C). Within a double-stranded DNA, the nitrogenous bases on one strand pair with complementary bases along the other strand; in particular, A always pairs with T, and C always pairs with G. During DNA replication, the two strands in the double helix separate. This allows an enzyme called DNA polymerase to access each strand individually, creating a pair of replicated DNAs, each of which contains one “old” strand and one “new” strand of DNA.
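The pairing rule above (A with T, C with G) can be sketched in a few lines of base R. The function name `reverse_complement` is a hypothetical helper introduced for illustration; it is not part of any package used in this document. Because the two strands run in opposite directions, the complement is reversed so that the result is also written 5’ to 3’.

```r
# Hypothetical helper: return the strand that pairs with `dna`,
# written 5' to 3' (i.e., the reverse complement).
reverse_complement <- function(dna) {
  bases <- strsplit(toupper(dna), "")[[1]]
  stopifnot(all(bases %in% c("A", "C", "G", "T")))
  # chartr() swaps each base for its pairing partner: A<->T, C<->G
  complement <- chartr("ACGT", "TGCA", bases)
  # The partner strand is anti-parallel, so reverse it before joining
  paste(rev(complement), collapse = "")
}

reverse_complement("ATGGCC")
```

Applying the function twice returns the original sequence, which is a quick sanity check on the pairing table.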

1.2 RNA (Ribo-nucleic acid)

Like DNA, RNA is assembled as a chain of nucleotides, but it is usually found in nature as a single strand folded onto itself, rather than a paired double strand. Three of the four nitrogenous bases that make up RNA, namely adenine (A), cytosine (C), and guanine (G), are also found in DNA. In RNA, however, a base called uracil (U) replaces thymine (T) as the complementary nucleotide to adenine. Furthermore, RNA contains ribose sugar molecules, which are slightly different from the deoxyribose molecules found in DNA. As its name suggests, deoxyribose has one fewer oxygen atom than ribose.

RNA is the main carrier of the genetic information that is responsible for an organism’s phenotype. The process of copying a segment of DNA sequence into a protein-encoding messenger RNA (mRNA) is known as transcription, in which an enzyme called RNA polymerase reads a DNA sequence and produces a complementary, anti-parallel RNA strand called a primary transcript. The mRNA, in turn, serves as a template for protein synthesis, a process known as translation. There are also so-called non-coding RNAs (ncRNAs), which can be transcribed from their own genes (RNA genes) or derived from mRNA introns. The most prominent examples of non-coding RNAs are transfer RNA (tRNA) and ribosomal RNA (rRNA), both of which are involved in the process of translation.
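As a minimal sketch of transcription, the base R function below builds the RNA strand that is complementary and anti-parallel to a DNA template strand, with uracil taking the place of thymine. The name `transcribe` and the toy template are assumptions made for illustration only.

```r
# Hypothetical helper: given a DNA template strand written 5' to 3',
# return the complementary, anti-parallel RNA written 5' to 3'.
transcribe <- function(template) {
  bases <- strsplit(toupper(template), "")[[1]]
  stopifnot(all(bases %in% c("A", "C", "G", "T")))
  # Pair each template base with its RNA partner: A->U, C->G, G->C, T->A
  rna <- chartr("ACGT", "UGCA", bases)
  # Reverse so that the transcript reads 5' to 3'
  paste(rev(rna), collapse = "")
}

transcribe("TTACAT")
```

For the toy template `TTACAT`, the transcript starts with the start codon AUG and ends with the stop codon UAA.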

For most eukaryotic genes (and some prokaryotic ones), the initial RNA that is transcribed from a gene’s DNA template is processed before it becomes a mature messenger RNA (mRNA) that directs the synthesis of protein. One of the steps in this processing, called RNA splicing, involves the removal or “splicing out” of certain sequences referred to as intervening sequences, or introns. The final mRNA thus consists of the remaining sequences, called exons, which are connected to one another through the splicing process. Then, a sequence of adenine nucleotides called a poly-A tail is added to the 3’ end of the mRNA molecule, which signals to the cell that the mRNA molecule is ready to leave the nucleus and enter the cytoplasm.
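Splicing and polyadenylation can be illustrated with a toy example in base R. The function `splice`, its arguments, and the short sequences are all hypothetical; exon coordinates are supplied by hand here, whereas a real cell recognizes splice sites through sequence signals.

```r
# Hypothetical helper: keep only the exons of a primary transcript,
# join them, and append a poly-A tail to the 3' end.
splice <- function(pre_mrna, exon_start, exon_end, poly_a_length = 10) {
  stopifnot(length(exon_start) == length(exon_end),
            all(exon_start <= exon_end),
            all(exon_end <= nchar(pre_mrna)))
  # substring() is vectorized over start/end, giving one string per exon
  exons <- substring(pre_mrna, exon_start, exon_end)
  paste0(paste(exons, collapse = ""), strrep("A", poly_a_length))
}

# Toy primary transcript: exon 1, one intron, exon 2
exon1  <- "AUGGCC"
intron <- "GUAAGUUUCAG"
exon2  <- "GAUUAA"
pre_mrna <- paste0(exon1, intron, exon2)

# Exon coordinates derived from the piece lengths
starts <- c(1, nchar(exon1) + nchar(intron) + 1)
ends   <- c(nchar(exon1), nchar(pre_mrna))

mature <- splice(pre_mrna, starts, ends, poly_a_length = 5)
mature
```

The mature transcript contains the two exons joined directly, followed by five adenines; the intron sequence no longer appears.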

The transcriptome is the set of all RNA transcripts, both coding and non-coding. The study of the transcriptome is called transcriptomics, which bridges the gap between the genetic code stored in DNA and its expression in the organism. RNA-Seq and microarrays are two techniques used in transcriptomics.

1.3 cDNA (complementary DNA)

All nucleated cells in an individual organism share the same genetic material; what allows these cells to perform separate roles is the specific set of genes that are turned on or off in each cell. Genes involved in tissue-specific or developmental processes have traditionally been studied by making libraries of all genes expressed during the development of an organ. Complementary DNA (cDNA) libraries give a snapshot of actively expressed genes. By definition, cDNA is double-stranded DNA that can be copied from mRNA using an enzyme called reverse transcriptase. The poly-A tail in mRNA distinguishes it from other RNA transcripts and can therefore be used as a primer site for reverse transcription.
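The role of the poly-A tail as a primer site can be sketched as follows. Both `has_poly_a` and `reverse_transcribe` are hypothetical base R helpers written for this illustration; the sketch produces only the first cDNA strand and ignores second-strand synthesis.

```r
# Hypothetical helper: does the RNA end in a run of at least
# `min_length` adenines?
has_poly_a <- function(rna, min_length = 10) {
  grepl(paste0("A{", min_length, ",}$"), toupper(rna))
}

# Hypothetical helper: first-strand cDNA, written 5' to 3'.
# Each RNA base is paired with its DNA partner (A->T, C->G, G->C, U->A)
# and the result is reversed because the strands are anti-parallel.
reverse_transcribe <- function(mrna) {
  bases <- strsplit(toupper(mrna), "")[[1]]
  stopifnot(all(bases %in% c("A", "C", "G", "U")))
  paste(rev(chartr("ACGU", "TGCA", bases)), collapse = "")
}

mrna <- "AUGGCCAAAAA"
if (has_poly_a(mrna, min_length = 5)) {
  cdna <- reverse_transcribe(mrna)
}
cdna
```

The resulting cDNA begins with a run of thymines, which corresponds to the oligo-dT primer annealed to the poly-A tail.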

cDNA is more convenient to work with than mRNA when analyzing a coding sequence because RNA is very easily degraded by omnipresent ribonucleases (RNases). This is the main reason why cDNA is sequenced rather than mRNA. Likewise, investigators conducting DNA microarray experiments often convert the mRNA into cDNA in order to produce their probes.

To make a library of transcribed sequences, scientists isolate all the RNA from their cells of interest and use a single-stranded primer complementary to the unique poly-A tail. Because they are produced from transcribed mRNA found in the nucleus, cDNA libraries contain primarily the protein-encoding regions of the genome. Once a cDNA has been at least partially sequenced, unique polymerase chain reaction (PCR) primer pairs that identify short stretches of each cDNA can be designed. These regions, called expressed sequence tags or ESTs, can then be used to produce probes to determine the presence or absence of similar transcripts in other tissues. The identification of ESTs has proceeded rapidly, with approximately 52 million ESTs now available in public databases (e.g., GenBank). Moreover, current methods allow expressed RNAs to be made into cDNA or cRNA and sequenced en masse using pyrosequencing, which promises to accelerate the rate at which new EST data is added to these databases.

Exercise

Read the article “Mapping and quantifying mammalian transcriptomes by RNA-Seq” by Ali Mortazavi et al. [Nature Methods, volume 5, 621–628 (2008)].

2 RNA-Seq Analysis using RcwlPipelines

RNA-Seq is most often used for analyzing differential gene expression (DGE), that is, taking the normalised read count data and performing statistical analysis to discover quantitative changes in expression levels among different experimental groups, such as healthy and diseased cells. There are various R packages for DGE, e.g., edgeR and DESeq2, which are based on negative binomial (NB) distributions; maSigPro, which is based on regression analysis; and baySeq and EBSeq, which are Bayesian approaches based on a negative binomial model. It is important to consider the experimental design when choosing analysis software. While some of the differential expression tools can only perform pair-wise comparisons, others such as edgeR and DESeq2 can perform multiple comparisons. An example of DGE using edgeR is given here.
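To make “normalised read count” concrete, here is a minimal base R sketch of counts-per-million (CPM) scaling on a made-up count matrix. The function name `counts_per_million` and the numbers are assumptions for illustration; packages such as edgeR and DESeq2 apply more elaborate normalisation methods than this simple library-size scaling.

```r
# Toy count matrix (made-up numbers): rows are genes, columns are samples.
# The "diseased" library was sequenced three times deeper than "healthy".
counts <- matrix(
  c(10, 20, 70,
    30, 60, 210),
  nrow = 3,
  dimnames = list(c("geneA", "geneB", "geneC"), c("healthy", "diseased"))
)

# Hypothetical helper: divide each column by its library size
# and rescale to one million reads.
counts_per_million <- function(counts) {
  sweep(counts, 2, colSums(counts), "/") * 1e6
}

cpm_values <- counts_per_million(counts)
cpm_values
```

Although the raw counts differ threefold between the two samples, the CPM values are identical, showing that the difference came from sequencing depth rather than from expression.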

The workflow of RNA sequencing (RNA-Seq) begins with RNA extraction in a wet lab, followed by mRNA enrichment or ribosomal RNA depletion, cDNA synthesis and preparation of an adaptor-ligated sequencing library. The library is then sequenced to a read depth of 10–30 million reads per sample on a high-throughput platform (usually Illumina). The final steps are computational: aligning and/or assembling the sequencing reads to a transcriptome, quantifying reads that overlap transcripts, filtering and normalizing between samples, and statistical modelling of significant changes in the expression levels of individual genes and/or transcripts between sample groups.
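The step of quantifying reads that overlap transcripts can be illustrated with a toy interval comparison in base R. The function `count_overlaps` and the coordinates are hypothetical; production tools such as featureCounts also account for chromosomes, strand, and ambiguous or multi-mapping reads, none of which is modelled here.

```r
# Hypothetical helper: for each gene interval, count the reads whose
# aligned interval overlaps it by at least one base (closed intervals).
count_overlaps <- function(read_start, read_end, gene_start, gene_end) {
  vapply(seq_along(gene_start), function(i) {
    sum(read_start <= gene_end[i] & read_end >= gene_start[i])
  }, integer(1))
}

# Toy annotation: two genes on the same (implicit) chromosome
gene_start <- c(geneA = 100, geneB = 300)
gene_end   <- c(geneA = 200, geneB = 400)

# Toy alignments: five reads given as start/end positions
read_start <- c(90, 150, 250, 390, 201)
read_end   <- c(140, 199, 290, 440, 299)

gene_counts <- count_overlaps(read_start, read_end, gene_start, gene_end)
names(gene_counts) <- names(gene_start)
gene_counts
```

Two reads overlap geneA and one overlaps geneB; the remaining two reads fall between the genes and are not counted.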

2.1 Example: RNA-Seq alignment

Here is a flowchart for a typical RNA-Seq pipeline pl_rnaseq_Sf:

library(RcwlPipelines)

rnaseq_Sf <- cwlLoad("pl_rnaseq_Sf")
## fastqc loaded
## STAR loaded
## sortBam loaded
## samtools_index loaded
## samtools_flagstat loaded
## featureCounts loaded
## gtfToGenePred loaded
## genePredToBed loaded
## read_distribution loaded
## geneBody_coverage loaded
## gCoverage loaded
## STAR loaded

plotCWL(rnaseq_Sf)

This pipeline consists of 10 steps and involves the following external software, in addition to SAMtools: FastQC for quality control, STAR for sequence alignment, featureCounts for read-count summarization, and RSeQC for additional quality control.

outputs(rnaseq_Sf)

outputs:
out_fastqc:
  type: File[]
  outputSource: fastqc/QCfile
out_BAM:
  type: File
  outputSource: samtools_index/idx
out_Log:
  type: File
  outputSource: STAR/outLog
out_Count:
  type: File
  outputSource: STAR/outCount
out_stat:
  type: File
  outputSource: samtools_flagstat/flagstat
out_count:
  type: File
  outputSource: featureCounts/Count
out_distribution:
  type: File
  outputSource: r_distribution/distOut
out_gCovP:
  type: File
  outputSource: gCoverage/gCovPDF
out_gCovT:
  type: File
  outputSource: gCoverage/gCovTXT

RNA-Seq Data Analysis

Mitsuko Korobkin

2022-08-04