This tutorial outlines the basic steps involved in the process of analyzing gene expression data using RNA sequencing of messanger RNA (mRNA). This is a simply 1:1 analyses between a control and an experimental group (i.e. disease vs no-disease/healthy) with a total of three replicates per condition. This pipeline followes the general recommendations of the H3ABioNet SOP for RNAseq analyses

H3ABioNet

1. Introduction

RNA Seq (RNA sequencing), also called whole transcriptome shotgun sequencing (WTSS), uses next-generation sequencing (NGS) to reveal the presence and quantity of RNA in a biological sample at a given moment. The basic steps involved in the sequencing of mRNA is outlined in Figure 1 below.

The frist step is to select/isolate cells or tissue populations from which RNA is to be extracted. This in itself can be no trivial task and great care needs to be given to experimental design to reduce biases and batch effects downstream. Next all RNA (also called total RNA sometimes) is extracted from the cells or tissue sample. To mitigate batch effects it’s important that all laboratory procedures, for both the control and the experimental group(s), are done on the same day, in the same lab and performed by the same person. The next major step would be to isolate the spesific set of RNAs. If you are intrested in the messengar RNA, which would be the case in most studies, you would want to remove ribosomial RNA (rRNA) and novel classes of RNA (ncRNA) such as micro RNA (miRNA). The next step is to make DNA constructs of the RNA sequences through cDNA synthesis. cDNA synthesis is followed by library preperation and PCR amplification and NGS sequencing which will produce either single-end or paired-end reads.

Figure 1: A basic representation of the major steps and considerations in the sequencing of RNA. Adopted from Kukurba et al., 2015

Please note that it is possible to use long read MinIon sequences as well. However, for this tutorial we will focus on Illumina short read paired-end data.

The basic steps in the bioinformatic workflow for RNA analyses is outlined in Figure 2. The first step, though not shown in the figure, is to quility check the sequence reads, remove short reads, low quility reads and residual adaptor sequences. The next step is to align sequence reads into transcripts. This can be done either through reference based mapping or de novo assembly. As we will mostly work with RNA sequencing from humans or well characterized organisims (e.g. mouse, yeast, fruit flies) we will focus on reference based approaches here. Most mappers today require a reference genome (in FASTA format) and a gene annotation format file (GTF/GFF) for said genome.

Next we can quantify the abundance of reads that was mapped to each gene. This is often refere to as count generation which is nothing more than a big table containing the names of each gene or transcript and how many reads were mapped/assigned to each gene. Next would be to compare read counts between different samples and to identify possible outliers or batch effects. Batch effects are usually caused by obtaining or processing the samples in batches and can obscure detection of expression differences if not adjusted for statistically. The next step would be to remove low count genes and to normalize. The final step would be to perform the statistical analyses for differential gene expression.

Figure 2: A basic representation of the major steps involved in the processing of RNA Sequence data. Adopted from Kukurba et al., 2015

Please note that in this excersise we will focus on differential gene expression. RNA sequencing can be utalized to answer many other questions. For example, it can be used to assess the expression of a particular allele, identify new transcripts/genes, or splice variance.

2. Software

The processing of raw sequence reads, read mapping and count generation was performed on the KRISP2 server. Downstream count quility checks and differential gene expression was performed in a local R instance. The following software is needed to run this pipeline:

parallel
fastqc
Trimmomatic
multiqc
HISat2
kallisto
subreads (R/Bioconductor)
HTSeq (python)
tximport (R/Bioconductor)
DeSeq or edgeR (R/Bioconductor)

3. Pipeline

The pipeline can be broken down into five main steps: (1) quility checking of raw sequence reads, (2) read mapping to the reference genome or reference transcript annotation file, (3) count quantification, (4) quility check of counts, and (5) statstical analyses for differential gene expression.

3.1 Quility Check

Software: fastqc

Input: raw fastqc files (e.g. .fastq or .fq). Can also accept .gz compressed files.

First, on the command line navigate to the directory where the raw sequence files are located. Next we going to make a directory called pre.fastqc where we will write the output that fastqc produces to. Then we will run fastqc on the raw sequence.

$ mkdir pre.fastqc

$ fastqc -o pre.fastqc -f fastq *.fastq

Output: html report for each sequence

Next we will remove the read with low quility scores per base, trim off residual adaptor sequences, and remove short reads. This is done in trimmomatic.

Software: Trimmomatic

Input: raw fastqc files (e.g. .fastq or .fq). Can also accept .gz compressed files.

With the command below we tell trimmomatic that its paired-end (PE) data. We spesify the forward and reverse read and write paired and unpaired reads to folder. We spesify what adpators to look for (in this case NexteraPE). We use a sliding window approach that will scan each read in a 4bp window and exclude the read if the phred score drops below 25. Also any read shorter than 20bp is excluded.

$ parallel 'java -jar /PATH.TO.TRIMMOMATIC/trimmomatic-0.36.jar PE \
{}1.fastq {}2.fastq {}_pair_F.fastq {}_unpair_F.fastq {}_pair_R.fastq {}_unpair_R.fastq \
ILLUMINACLIP:/PATH.TO.TRIMMOMATIC/adapters/NexteraPE-PE.fa:2:30:10 \ 
SLIDINGWINDOW:4:25 MINLEN:20' ::: $(ls *_*.fastq | rev | cut -c 8- | rev | uniq)

Output: paired and unpaired forward and reverse reads of “good” quility

Its always useful to rerun fastqc on the paired sequence reads. To do this I first make a new directory called unpaired.reads and then move all unpaired reads that was produced by trimmomatic to this new directory. This cleans up the working directory a bit. Next make a new directory called post.fastqc to which fastqc will write the output of its analysis of the paired reads. Next we can actually run fastqc on the paired reads.

$ mkdir unpaired.reads

$ mv *_unpair_* unpaired.reads

$ mkdir post.fastqc

$ fastqc -o post.fastqc -f fastq *.fastq

3.2 Read Mapping

Now we have good quality reads which we can map/align to the reference. You have two options here. You can directly map the reads to a reference or you can perform pseudoalignments. The first option, and the one we will be using in this excersise, requires a reference genome in FASTA format and a gene annotation file or GTF/GFF. Please note that the same versions should be used. If you use hg38 as the reference then the annotation file should also be from hg38. For hg38 these files can be downloaded here:

Software: HISat2

First we need to index the reference with HISat2. Assuming HISat2 and all its programs are in path you can run.

$ hisat2-build hg38.fa hg38_index

Output: This should produce eight indexes. For example, hg38_index.1.ht2

HISat2 can import a pre-processed set of known splice juntions to aid the mapping of your data. It requires its own “special” format for these files, rather than the standard GTF file format. This can be done with a python script that comes with HISat2.

$ hisat2_extract_splice_sites.py hg38.gtf > hg38_splice_sites.txt

Now we are ready to run HISat2 on our reads.

Input: cleaned paired end reads in fastq format

hisat2 -x hg38_index --known-splicesites-infile hg38_splice_sites.txt \
-p 2 -1 <forward PE read> -2 <reverse PE read> | samtools \
view -bS - > <samplename.bam>

Output: A BAM file for each of the samples.

3.3 Count Quantification

Now that the reads have been mapped to the reference we need to quantify the number of reads mapped to each gene. For this we will need the genome annotation file (GTF). This will be done with the featureCounts program of Subreads in either base R, RStudio or Bioconductor. Open RStudio and a new project. First set the working directory to where the BAM files are located.

setwd("/PATH.TO.DIR.WITH.BAM.FILES")

Next make sure you have all the software and packages installed. Now we can run featureCounts:

/Applications/biotools/subread-1.6.3-source/bin/featureCounts \
-p -a gencode.v29.chr_patch_hapl_scaff.annotation.gtf \
-t exon -g gene_id -o sample42.txt sample42.hisat2.bam

This will create a .txt.summary file containing the counts. We are using the -g and -t option here for gene level and exon level respectifly. -p is used for paired-end data.

RNA Seq Analysis:

by Eduan Wilkinson

25 March, 2019