An Introduction to docker4seq package and 4SeqGUI

Raffaele A Calogero

Introduction

The docker4seq package was developed to facilitate the use of computing demanding applications in the field of NGS data analysis.

The docker4seq package uses docker containers that embed demanding computing tasks (e.g. short reads mapping) into isolated containers.

This approach provides multiple advantages:

Requirements

The minimal hardware requirements are a 4 cores 64 bits linux computer, 32 Gb RAM, one SSD 250GB, with a folder with read/write permission for any users (chmod 777), and docker installed.

Setup

docker4seq and its graphical interface (optional) 4SeqGUI can fit ideally in the NUC6I7KYK, Intel mini-computer equipped with Kingston Technology HyperX Impact 32GB Kit (2x16GB), 2133MHz DDR4 CL13 260-Pin SODIMM and Samsung 850 EVO - 250GB - M.2 SATA III Internal SSD.

MANDATORY: The first time docker4seq is installed the downloadContainers function needs to be executed to download, in the local repository, the docker images that are needed by docker4seq.

library(docker4seq)
downloadContainers(group="docker")

Dockers containers

At the present time all functions requiring some sort of calculation are embedded in the following docker images:

docker container nomenclature

In case of updates required to solve bugs, which do not affect the calculation docker.io/rcaloger/XXXXX.YYYY.ZZ the fiels ZZ will be updated.

In case of updates which affect the calculation, e.g. new release of Bioconductor libraries, the field YYYY will be updated. Previous versions will be maintained to guarantee the reproducibility of any previous analisys.

Reproducibility

The file containers.txt, which indicates the Docker images available in the local release of docker4seq is saved within any folder generated with docker4seq functions.

In case, user would like to download a set of dockers images different from those provided as part of the package, then these images must be specified in a file with the following format docker.repository/user/docker.name, which has to be passed to downloadContainers function:

downloadContainers(group="docker", containers.file="my_containers.txt")
#an example of the my_containers.txt file content
docker.io/rcaloger/bwa.2017.01
docker.io/rcaloger/chipseq.2017.01
docker.io/rcaloger/r340.2017.01

Available workflows

At the present time are available the following workflows:

The most expensive computing steps of the analyses are embedded in the following docker4seq functions: rnaseqCounts, mirnaCounts, chipseqCounts. These functions are also the only having RAM and computing power requirements not usually available in consumer computers. Hereafter it is shown the time required to run the above three functions increasing the number of sequenced reads.

testSeqbox

In docker4seq is now present the function testSeqbox, which allows to evaluate if the software required for docker4seq functionalities is properly installed. Results of the tests are saved in testSeqBox.out file.

rnaseqCounts performances

Counts generation from fastq files is the most time consuming step in RNAseq data analysis and it is usually calculated using high-end servers. We compare the behabiour of rnaseqCounts on SeqBox and on a high-end server:

+ SeqBox: NUC6I7KYK CPU i7-6770HQ 3.5 GHz (1 core, 8 threads), 32 Gb RAM, HD 250 Gb SSD
+ SGI UV200 server: CPU E5-4650 v2 2.40GHz (8 cores, 160 threads), 1 Tb RAM, RAID 6, 100 Tb SATA

We run respectively 26, 52, 78, and 105 million reads using different number of threads, values shown in parenthesis in Figure 1. It is notable that SeqBox, mapping in 5 hours more than 100 milion reads, it is able to handle in 20 hours the throughput of the Illumina benchtop sequencer NextSeq 500, which produces up to 400 milion reads in a run of 30 hours.

rnaseqCounts overall performance

rnaseqCounts overall performance

mirnaCounts performances

We run resepctively 3, 6, 12, and 24 miRNA samples in parallel using mirnaCounts, with different number of threads, values shown in parenthesis in Figure 2.

mirnaCounts overall performance

mirnaCounts overall performance

chipseqCounts performances

We run respectively 37, 70, 111, and 149 million reads using different number of threads, values shown in parenthesis in Figure 3.

chipseqCounts overall performance

chipseqCounts overall performance

From the point of view of parallelization the rnaseqCounts is the one that embeds the most computing demanding tools: i) mapping with STAR and ii) quantifying transcripts with RSEM. Both these tools were design to take advantage of multiple cores hardware architecture and they also require massive I/O. On the basis of the results shown in Figure 1 parallelization does not improve very much the overall performances, even if it can mitigate the gap w.r.t. SeqBox due to the poor I/O performance of the SATA disk array. On the other side the presence of a SSD with very high I/O performance can remedy the limited amount of cores of SeqBox.

In the case of mirnaCounts and chipseqCounts the parallelization is very little and it is only available for the reads mapping procedure. Moreover, both functions have a massive I/O. The reduced parallelization of these two analyses combined with the higher I/O throughput of the SSD with respect to the SATA array makes SeqBox extremely effective even with very high number of reads to be processed, Figure 2 and 3.

Test sets

A folder including a set do dataset to test each of the workflows available in docker4seq/4SeqGUI can be found here

RNAseq workflow: Howto

Demultiplexing

demultiplexing function is used to convert in fastq bcl files generated by an Illumina sequencer. The function requires that in the bcl folder the SampleSheet.csv is present. An example fo SampleSheet for a pair-end run is present in docker4seq examples folder. The function require that the full path to the bcl file folder is provided, data.folder, the scratch folder where temporary analysis is run and the number of cores that will be used by the program.

demultiplexing(group="docker",
      data.folder="/home/calogero/Documents/data/lollini/3a_run/170712_NB501050_0097_AH3FGNBGX3",
      scratch.folder="/data/scratch", threads=24)

The function will return, in the folder containing the bcl files folder, e.g. /home/calogero/Documents/data/lollini/3a_run/, the fastq files generated by the analysis.

This function is not implemented in 4SeqGUI because this step is generally done by a core lab. Thus only a limited group of users require the use of this function.

The mRNAseq workflow, that can be handled using 4SeqGUI graphical interface (linux/MAC), starts from the availability of fastq files.

mRNAseq workflow

mRNAseq workflow

Sample quantification is made of these steps:

All the parameters can be setup using 4SeqGUI

Creating a STAR index file for mRNAseq

The index can be easily created using the graphical interface:

Creating a STAR genome index

Creating a STAR genome index

A detailed description of the parameters is given below.

Creating a STAR index file by line command

rsemstarIndex(group="docker",genome.folder="/data/scratch/hg38star",
ensembl.urlgenome="ftp://ftp.ensembl.org/pub/release-87/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.toplevel.fa.gz",
ensembl.urlgtf="ftp://ftp.ensembl.org/pub/release-87/gtf/homo_sapiens/Homo_sapiens.GRCh38.87.gtf.gz")

In brief, rsemstarIndex uses ENSEMBL genomic data. User has to provide the URL (ensembl.urlgenome) for the file XXXXX_dna.toplevel.fa.gz related to the organism of interest, the URL (ensembl.urlgtf) for the annotation GTF XXX.gtf.gz and the path to the folder where the index will be generated (genome.folder). The parameter threads indicate the number of cores dedicated to this task.

Precompiled index folders are available:

Quantifying genes/isoforms

Gene, Isoform counting

Gene, Isoform counting

A detailed description of the parameters is given below.

Sample quantification by line command

The sample quantification can be also executed using R and it is completely embedded in a single function:

#test example
system("wget http://130.192.119.59/public/test.mrnaCounts.zip")
unzip("test.mrnaCounts.zip")
setwd("./test.mrnaCounts")
library(docker4seq)
rnaseqCounts(group="docker",fastq.folder=getwd(), scratch.folder=getwd(),
adapter5="AGATCGGAAGAGCACACGTCTGAACTCCAGTCA",
adapter3="AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT",
seq.type="se", threads=8,  min.length=40,
genome.folder="/data/scratch/mm10star", strandness="none", save.bam=FALSE,
org="mm10", annotation.type="gtfENSEMBL")

User needs to create the fastq.folder, where the fastq.gz file(s) for the sample under analysis are located. The scratch.folder is the location where temporary data are created. The results will be then saved in the fastq.folder.

User needs to provide also the sequence of the sequencing adapters, adapter5 and adapter3 parameters. In case Illumina platform the adapters sequences can be easily recovered here.

seq.type indicates if single-end (se) or pair-end (pe) data are provided, threads indicates the max number of cores used by skewer and STAR, all the other steps are done on a single core.

The min.length refers to the minimal length that a reads should have after adapters trimming. Since today the average read length for a RNAseq experiment is 50 or 75 nts would be better to bring to 40 nts the min.length parameter to increase the precision in assigning the correct position on the genome.

The genome.folder parameter refers to the location of the genomic index generated by STAR using the docker4seq function rsemstarIndex, see above paragraph.

strandness, is a parameter referring to the kit used for the library prep. If the kit does not provide strand information it is set to “none”, if provides strand information is set to “forward” for Illumina stranded kit and it set to “reverse” for Illumina ACCESS kit. save.bam set to TRUE indicates that genomic bam file and transcriotomic bam files are also saved at the end of the analysis. annotation.type refers to the type of available gene-level annotation. At the present time is only available ENSEMBL annotation defined by the gtf downloaded during the creation of the indexed genome files, see paragraph at the endCreating a STAR index file for mRNAseq*.

Salmon, reference free alignment

Recently Zhang and coworkers (BMC Genomics 2017, 18,583) compared, at transcript level, alignment-dependent tools (Salmon_aln, eXpress, RSEM and TIGAR2) and aligner-free methods (Salmon, Kallisto Sailfish). In their paper, STAR was used as mapping tool for all alignment-dependent tools. In terms of isoform quantification, the authors indicated that there is strong concordance among quantification results from RSEM, Salmon, Salmon_aln, Kallisto and Sailfish (R2 > 0.89), suggesting that the impact of mappers on isoform quantification is small. Furthermore, the paper of Teng and coworkers (Genome Biology 2016, 17,74) reported that,in term of gene-level quantification, differences between alignment-dependent tools and aligner-free methods are shrinking with respect to transcripts level analysis. On the basis of the above papers it seems that from the quantification point of view the difference between alignment free and alignment-dependent tools is very limited. However, aligner-free methods have low memory requirements and we added Salmon in docker4seq.

Creating a Salmon quasi-reference file for mRNAseq

The quasi-reference can be created using cDNA fasta files available at ENSEMBL. The corresponding genomics GTF is required for the gene level annotation.