RNAseq workflow

0.1 RNAseq workflow via GUI

The mRNAseq workflow can be run using 4SeqGUI graphical interface (linux/MAC):

mRNAseq workflow

Sample quantification is made of these steps:

Creating a genome index for STAR (see end of this paragraph)
Running removing sequencing adapters
Mapping reads to the reference genome
Quantify gene and transcript expression level
Annotating genes.

All the parameters can be setup using 4SeqGUI

0.1.1 Creating a STAR index file for mRNAseq:

The index can be easily created using the graphical interface:

An index can be created for any of the genomes present in the ENSEMBL database

A detailed description of the parameters is given below.

0.1.1.1 Creating a STAR index file by line command

rsemstarIndex(group="docker",genome.folder="/data/scratch/hg38star",
ensembl.urlgenome="ftp://ftp.ensembl.org/pub/release-87/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.toplevel.fa.gz",
ensembl.urlgtf="ftp://ftp.ensembl.org/pub/release-87/gtf/homo_sapiens/Homo_sapiens.GRCh38.87.gtf.gz")

In brief, rsemstarIndex uses ENSEMBL genomic data. User has to provide the URL (ensembl.urlgenome) for the file XXXXX_dna.toplevel.fa.gz related to the organism of interest, the URL (ensembl.urlgtf) for the annotation GTF XXX.gtf.gz and the path to the folder where the index will be generated (genome.folder). The parameter threads indicate the number of cores dedicated to this task.

Precompiled index folders are available:

0.1.2 Quantifying genes/isoforms:

Tutorial experiment downloadable here:

+ Three replicates for two experimental conditions, 

+ single-end mode sequencing, 

+ 1 million reads for each sample.

Experiment description:

+ 4T1 mouse cell line grown in standard DMEM medium (e) is compared with the same cells grown in low attachment medium (p)

The following data are available for download:

Fastq files for 3 samples grown in standard DMEM medium (e) and for 3 samples grown in low attachment medium (p) to be used to calculate samples counts.
- This data set allows running all the steps required to detect differentially expressed genes. The first step is the quantification of genes and isoforms via mapping reads to the reference genome via STAR and using this mapping information to quantify genes and transcripts using RSEM, i.e. this section.
- The RSEM counts table, generated for all samples, are combined in a unique table, see From samples to experiment section. This table is used for differential expression genes detection. For visualization purposes log2 FPKM and log2 TPM tables are also generated, see From samples to experiment section. More info on the FPKM and TPM are available here.
- Subsequently the overall characteristics of the dataset can be explore via PCA, Visualizing experiment data with PCA section.
- It is also possible to evaluate which is the statistical power of the experiment, i.e. identifying the fraction of genes/transcripts that can be identify giving the statistical structure of the experiment, or identify the optimal number of samples required to detect differentially expressed genes. More info in Evaluating sample size and experiment power section.
- Differential expression can be then evaluate using the DESeq2 module, Differential expression analysis with DESeq2.
Furthermore, also intermediate results are provided:
- The counts tables generated by STAR+RSEM analysis to be used to assemble counts, TPM and FPKM experiment tables
- The experiment tables that can be used to study samples organization via PCA or the statistical characteristics of the experiment, i.e. the statistical power and the optimal sample size of the experiment
- The above data set can be used also to detect differentially expressed genes and isoforms using the DESeq2 module.

Gene, Isoform counting

A detailed description of the parameters is given below.

0.1.2.1 Sample quantification by line command

The sample quantification can be also executed using R and it is completely embedded in a single function:

#Tutorial experiment: three replicates, sequenced in single-end mode,
# 1 million reads are available for each sample.
#Experiment description: 4T1 mouse cell line grown in standard DMEM medium (e) 
#is compared with the same cells grown in low attachment medium (p)
system("wget http://130.192.119.59/public/test.mrnaCounts_full.zip")
unzip("test.mrnaCounts_full.zip")
setwd("./raw_fastq)
#example of script that has to be run in each of the folders containing a fastq file
library(docker4seq)
rnaseqCounts(group="docker",fastq.folder=getwd(), scratch.folder=getwd(),
adapter5="AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT",
adapter3="AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT",
seq.type="se", threads=8,  min.length=40,
genome.folder="/data/scratch/mm10star", strandness="none", save.bam=FALSE,
org="mm10", annotation.type="gtfENSEMBL")

User needs to create the fastq.folder, where the fastq.gz file(s) for the sample under analysis are located. The scratch.folder is the location where temporary data are created. The results will be then saved in the fastq.folder.

User needs to provide also the sequence of the sequencing adapters, adapter5 and adapter3 parameters. In case Illumina platform the adapters sequences can be easily recovered here.

seq.type indicates if single-end (se) or pair-end (pe) data are provided, threads indicates the max number of cores used by skewer and STAR, all the other steps are done on a single core.

The min.length refers to the minimal length that a reads should have after adapters trimming. Since today the average read length for a RNAseq experiment is 50 or 75 nts would be better to bring to 40 nts the min.length parameter to increase the precision in assigning the correct position on the genome.

The genome.folder parameter refers to the location of the genomic index generated by STAR using the docker4seq function rsemstarIndex, see above paragraph.

strandness, is a parameter referring to the kit used for the library prep. If the kit does not provide strand information it is set to “none”, if provides strand information is set to “forward” for Illumina stranded kit and it set to “reverse” for Illumina ACCESS kit. save.bam set to TRUE indicates that genomic bam file and transcriotomic bam files are also saved at the end of the analysis. annotation.type refers to the type of available gene-level annotation. At the present time is only available ENSEMBL annotation defined by the gtf downloaded during the creation of the indexed genome files, see paragraph at the endCreating a STAR index file for mRNAseq*.

0.1.3 Sample quantification output files

The mRNAseq workflow produces the following output files:

+ XXXXX-trimmed.log, containing the information related to the adapters trimming
+ gtf_annotated_genes.results, the output of RSEM gene quantification with gene-level annotation
+ Log.final.out, the statistics of the genome mapping generated by STAR  
+ rsem.info, summary of the parameters used in the run
+ genes.results, the output of RSEM gene quantification
+ isoforms.results, the output of RSEM isoform quantification
+ run.info, some statistics on the run
+ skewerd_xxxxxxxxxxxx.log, log of the skewer docker container
+ stard.yyyyyyyyyyyy.log, log of the star docker container

The first column in gtf_annotated_genes.results is the ensembl gene id, the second is the biotype, the 3rd is the annotation source, the 4th contains the set of transcripts included in the ensembl gene id. Then there is the length of the gene, the lenght of the gene to which is subtracted the average length of the sequenced fragments, the expected counts are the counts to be used for differential expression analysis. TPM and FPM are normalized gene quantities to be used only for visualization purposes.

0.1.3.1 Why choosing STAR+RSEM for transcripts quantification?

Recently Zhang and coworkers (BMC Genomics 2017) compared, at transcript level, alignment-dependent tools (Salmon_aln, eXpress, RSEM and TIGAR2) and aligner-free methods (Salmon, Kallisto Sailfish). In their paper, STAR was used as mapping tool for all alignment-dependent tools. In terms of isoform quantification, the authors indicated that there is strong concordance among quantification results from RSEM, Salmon, Salmon_aln, Kallisto and Sailfish (R2 > 0.89), suggesting that the impact of mappers on isoform quantification is small. Furthermore, the paper of Teng and coworkers (Genome Biology 2016) reported that, in term of gene-level quantification, differences between alignment-dependent tools and aligner-free methods are shrinking with respect to transcripts level analysis. On the basis of the above papers it seems that from the quantification point of view the difference between alignment free and alignment-dependent tools is very limited. However, aligner-free methods have low memory requirements and we have added Salmon in the development version of docker4seq in github. We are planning to introduce Salmon in the stable version of docker4seq in the first quarter of 2018. Salmon implementation will allow to increase the sample throughput, by running multiple samples. Currently samples are run serially because of the high RAM requirement of STAR.

0.1.4 From samples to experiment

The RSEM output is sample specific, thus it is necessary to assemble the single sample in an experiment table including in the header of the column both the covariates and the batch, if any. The header sample name is separated by the covariate with an underscore, e.g. mysample1_Cov1, mysample2_Cov2.

counts table with covariates

In case also a batch is present also this is added to the sample name through a further underscore, e.g. mysample1_Cov1_batch1, mysample2_Cov_batch2.

counts table with covariates and batch

The addition of the covariates to the various samples can be done using the 4seqGUI using the button: From samples to experiment.

This function is particularly useful to reorganize samples in different subset of experiments or combine them in a unique one. This function also provides the ability to manipulate the experiment covariates and the batch effects that a user might be interested to add.

0.1.4.1 From samples to experiments by line command

#test example
system("wget http://130.192.119.59/public/test.samples2experiment.zip")
unzip("test.samples2experiment.zip")
setwd("test.samples2experiment")
library(docker4seq)
sample2experiment(sample.folders=c("./e1g","./e2g","./e3g",
"./p1g", "./p2g", "./p3g"),
covariates=c("Cov.1","Cov.1","Cov.1","Cov.2","Cov.2","Cov.2"),
bio.type="protein_coding", output.prefix=".")

User needs to provide the paths of the samples, sample.folder parameter, a vector of the covariates, covariates, and the biotype(s) of interest, bio.type parameter. The parameter output.prefix refers to the path where the output will be created, as default this is the current R working folder.

0.1.4.2 From samples to experiments output files

The samples to experiments produces the following output files:

+ _counts.txt: gene-level raw counts table for differential expression analysis
+ _isoforms_counts.txt: isoform-level raw counts table for differential expression analysis
+ _isoforms_log2TPM.txt: isoform-level log2TPM for visualization purposes
+ _log2TPM.txt: gene-level log2TPM for visualization purposes
+ _isoforms_log2FPKM.txt: isoform-level log2FPKM for visualization purposes
+ _log2FPKM.txt: gene-level log2FPKM for visualization purposes
+ XXXXX.Rout: logs of the execution

0.1.5 Visualizing experiment data with PCA

PCA is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. This transformation is defined in such a way that the first principal component accounts for as much of the variability in the data as possible, and each succeeding component in turn has the highest variance possible under the constraint that it is orthogonal to the preceding components. 4SeqGUI provides an interface to the generation experiment samples PCA

$PCA$

The plot is saved in pca.pdf in the selected folder.

0.1.5.1 PCA by line command

#test example
system("wget http://130.192.119.59/public/test.analysis.zip")
unzip("test.analysis.zip")
setwd("test.analysis")
library(docker4seq)
pca(experiment.table="_log2FPKM.txt", type="FPKM", legend.position="topleft", covariatesInNames=FALSE, principal.components=c(1,2), pdf = TRUE, output.folder=getwd())

User needs to provide the paths of experiment table, experiment.table parameter, i.e. the file generated using the samples2experiment function. The type parameter indicates if FPKM, TPM or counts are used for the PCA generation. The parameter legend.position defines where to locate the covariates legend. The parameter covariatesInNames indicates if the header of the experiment table contains or not covariate information. The parameter principal.components indicates which principal components should be plotted. output.folder indicates where to save the pca.pdf file.

pca.pdf

The values in parentesis on x and y axes are the amount of variance explained by each principal component.

0.1.6 Evaluating sample size and experiment power

Sample size estimation is an important issue in the design of RNA sequencing experiments. Furthermore, experiment power provides an indication of which is the fraction of differentially expressed genes that can be detected given a specific number of samples and differential expression detection thresholds. RnaSeqSampleSize Bioconductor package provides the possibility to calculate, from a pilot experiment, the statistical power and to define the optimal sample size. We have implemented wrapper functions to RnaSeqSampleSize sample size and experiment power estimation.

4SeqGUI provides an interface to sample size estimation and to statistical power estimation.

sample size estimation

stat power estimation

0.1.6.1 Sample size estimation by line command

#test example
system("wget 130.192.119.59/public/test.analysis.zip")
unzip("test.analysis.zip")
setwd("test.analysis")
library(docker4seq)
sampleSize(group="docker", filename="_counts.txt", power=0.80, FDR=0.1, genes4dispersion=200, log2fold.change=1)

The requested parameters are the path to the counts experiment table generated by samples2experiment function. The param power indicates the expected fraction of differentially expressed gene, e.g 0.80. FDR and log2fold.change are the two thresholds used to define the set of differentially expressed genes of interest.

The output file is sample_size_evaluation.txt is saved in the R working folder, below an example of the file content:

sample_size_evaluation.txt

0.1.6.2 Experiment statistical power estimation by line command

#test example
system("wget 130.192.119.59/public/test.analysis.zip")
unzip("test.analysis.zip")
setwd("test.analysis")
library(docker4seq)
experimentPower(group="docker", filename="_counts.txt",replicatesXgroup=7, FDR=0.1, genes4dispersion=200, log2fold.change=1)

The requested parameters are the path to the counts experiment table generated by samples2experiment function. The param replicatesXgroup indicates the number of sample associated with each of the two covariates. FDR and log2fold.change are the two thresholds used to define the set of differentially expressed genes of interest. genes4dispersion indicates the number of genes used in the estimation of read counts and dispersion distribution.

The output file is power_estimation.txt is saved in the R working folder, below an example of the file content:

power_estimation.txt

0.1.7 Differential expression analysis with DESeq2

A basic task in the analysis of count data from RNA-seq is the detection of differentially expressed genes. 4SeqGUI provides an interface to DESeq2 to simplify differential expression analysis.

The output files are:

DEfull.txt containing the full set of results generated by DESeq2

DEfull.txt

DEfiltered_log2fc_X_fdr_Y.Y.txt containing the set of differentially expressed genes passing the indicated thresholds

genes4david.txt a file containing only the gene symbols to be used as input for DAVID or ENRICHR

log2normalized_counts.txt, log2 library size normalized counts, calculated by DESeq2, that can be used for visualization purposes.

0.1.7.1 DESeq2 by line command

#test example
system("wget 130.192.119.59/public/test.analysis.zip")
unzip("test.analysis.zip")
setwd("test.analysis")
library(docker4seq)
wrapperDeseq2(output.folder=getwd(), group="docker", 
      experiment.table="_counts.txt", log2fc=1, fdr=0.1, 
      ref.covar="Cov.1", type="gene", batch=FALSE)

User has to provide experiment table, experiment.table param, i.e. the counts table generated with samples2experiment function, the thresholds for the differential expression analysis, log2fc and fdr params, the reference covariate, ref.covar param, i.e. the covariate that is used as reference for differential expression detection, the type param, which refers to the type of experiment table in use: gene, isoform, mirna, batch parameter that indicates, if it is set to TRUE that the header of the experiment table also contains the extra information for the batch effect (see above).

0.1.7.2 Why DESeq2 was selected as differential expression tool?

Love and coworker (Genome Biol. 2014; 15, 550), showed that DESeq2 had comparable sensitivity to edgeR and voom. We introduced in our workflow DESeq2 because it has some specific features which increase the strength of the differential expression analysis, features that are not available in other tools. One of these features is, the Empirical Bayes shrinkage for fold-change estimation, which shrinks log fold change estimates toward zero. This feature reduces the noise due to low expressed genes, since shrinkage is stronger when the available information for a gene is low, which may be because the read counts are low, dispersion is high or there are few degrees of freedom. The other peculiar feature of DESeq2 is the identification of counts outliers, which might represent a source of false positives. Specifically, DESeq2 flags genes characterized by the presence of counts outliers, which are estimated with the standard outlier diagnostic Cook’s distance Love and coworker (Genome Biol. 2014; 15, 550).