You may visit docs.flowr.space for more details.
If you face any issues, please feel free to raise an issue on GitHub.
Requirements:
#install.packages("params", repos = "http://cran.rstudio.com")
## for the latest stable version (updated every few days):
install.packages("flowr", repos = c(CRAN= "http://cran.rstudio.com", DRAT="http://sahilseth.github.io/drat"))
After installation, run setup(); this will copy flowr’s helper script to ~/bin. Please make sure that this folder is in your $PATH variable.
library(flowr)
setup()
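Since setup() places the helper script in ~/bin, it may help to confirm from the terminal that this folder is on the search path. The following is a minimal bash sketch; the function name path_contains is purely illustrative and not part of flowr.

```shell
# path_contains DIR [PATHLIST]: succeed if DIR is one of the colon-separated
# entries in PATHLIST (defaults to the current $PATH).
path_contains() {
  local dir="${1%/}" list="${2:-$PATH}"
  case ":$list:" in
    *":$dir:"*|*":$dir/:"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Add ~/bin for the current session only if it is missing.
if ! path_contains "$HOME/bin"; then
  export PATH="$HOME/bin:$PATH"
fi
```

To make the change permanent, the export line can be added to a shell startup file such as ~/.bashrc.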
Then we need to test whether we are able to submit jobs to the cluster properly.
## run a test on the local platform
run(x='sleep_pipe', platform='local', execute=FALSE)
## run a test on the HPCC platform (torque, sge, moab, slurm, lsf)
run(x='sleep_pipe', platform='torque', execute=TRUE)
NOTE: If the test is not successful, please refer to the advanced configuration page for details on how to resolve the issues.
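If you are unsure which platform value applies to your cluster, one rough way to guess is to look for the scheduler's submission command. The bash sketch below assumes the usual command names (sbatch for slurm, bsub for lsf, msub for moab, qsub for torque and sge); detect_platform is a hypothetical helper, not a flowr function, and your system admin remains the authoritative source.

```shell
# detect_platform: print a guess at the scheduler, based on which submission
# command is found on $PATH. The command-to-platform mapping is an assumption
# of this sketch.
detect_platform() {
  if command -v sbatch >/dev/null 2>&1; then
    echo slurm
  elif command -v bsub >/dev/null 2>&1; then
    echo lsf
  elif command -v msub >/dev/null 2>&1; then
    echo moab
  elif command -v qsub >/dev/null 2>&1; then
    echo "torque or sge"   # both schedulers provide qsub
  else
    echo local
  fi
}

detect_platform
```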
Next, we will download a pipeline which processes multiple fastq files of a sample into a single aligned and merged BAM file.
cd ~/flowr/pipelines
base=https://raw.githubusercontent.com/sahilseth/flowr/devel/inst/pipelines
wget $base/fastq_bam_bwa.R
wget $base/fastq_bam_bwa.conf
wget $base/fastq_bam_bwa.def
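A failed download can leave a missing or empty file behind, so it may be worth checking that all three pipeline files arrived. This is a small bash sketch; missing_pipeline_files is an illustrative name, and the directory is the one used in this tutorial.

```shell
# missing_pipeline_files DIR: print each expected pipeline file that is absent
# or empty in DIR (one per line); prints nothing when all three are present.
missing_pipeline_files() {
  local dir="$1" ext
  for ext in R conf def; do
    [ -s "$dir/fastq_bam_bwa.$ext" ] || echo "fastq_bam_bwa.$ext"
  done
}

missing_pipeline_files "$HOME/flowr/pipelines"
```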
One can download the reference genome including indexes of various alignment tools from Illumina’s iGenomes website.
You may skip this step, if you already have the genome fasta and related files.
mkdir -p ~/flowr/genomes; cd ~/flowr/genomes
url=ussd-ftp.illumina.com/Homo_sapiens/NCBI/build37.2/Homo_sapiens_NCBI_build37.2.tar.gz
ftp ftp://igenome:G3nom3s4u@$url
tar -zxvf Homo_sapiens_NCBI_build37.2.tar.gz
A typical NGS pipeline consists of many steps, each with several parameters. You can modify fastq_bam_bwa.conf, specifying paths to various tools and their default options (samtools, bwa, picard and reference genome indexes).
Note: All parameters of this pipeline are conveniently specified in a tab-delimited format in the fastq_bam_bwa.conf file.
## customize parameters, including paths to samtools, bwa, reference genomes etc.
vi fastq_bam_bwa.conf
You may skip this step if you already have raw reads for a sample, in fastq format.
mkdir -p ~/flowr/genomes; cd ~/flowr/genomes
## for testing purposes one may download example fastq files:
wget http://omixon-download.s3.amazonaws.com/target_brca_example.zip
unzip target_brca_example.zip
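Before aligning, a quick sanity check is to confirm that both mate files contain the same number of reads. The bash sketch below assumes uncompressed FASTQ with the usual four lines per record; count_reads and pairs_match are illustrative helper names.

```shell
# count_reads FILE: print the number of records in an uncompressed FASTQ file,
# assuming four lines per record.
count_reads() {
  local lines
  lines=$(wc -l < "$1")
  echo $(( lines / 4 ))
}

# pairs_match R1 R2: succeed when both mate files hold the same number of reads.
pairs_match() {
  [ "$(count_reads "$1")" -eq "$(count_reads "$2")" ]
}

# Example with the files used later in this tutorial:
#   d=~/flowr/genomes/target_brca_example
#   pairs_match $d/brca.example.illumina.0.1.fastq \
#               $d/brca.example.illumina.0.2.fastq && echo "pairs look consistent"
```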
Next, we need to customize the resource requirements based on the computing platform. You may refer to the flow definition format for more details.
## customize the resource requirements in flowdef:
## - need to change: queue, platform
## - may change: walltime, memory, CPUs etc.
vi fastq_bam_bwa.def
## read and check flowdef (shell)
flowr as.flowdef x=fastq_bam_bwa.def
## OR from R
as.flowdef(x='fastq_bam_bwa.def')
Read and check flowdef
A flow definition with default values has already been supplied; briefly:
- merging may have multiple subprocesses (each of which can run in parallel). Thus, we spread (scatter) them across the cluster.
- aln steps of bwa may be run in parallel, and the subsequent sampe would wait for both. Specifically, in case of multiple fastq files, the \(i^{th}\) sampe step would wait for the \(i^{th}\) aln1 and aln2 steps.
- aln can use multiple cores; we provide it 12 cores, and 1 core each for the rest of the steps.
- Some systems prefer walltime as hh:mm:ss and others prefer hh:mm; you may need to check with your system admin.
- We use the medium queue, since it usually exists; please change as needed.

Tip: Once we define the flow definition correctly, we may not need to change it any further (a one-time effort).
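If a flowdef written for one cluster has to be moved to another that expects the other walltime format, the values can be converted mechanically. This is a minimal bash sketch with two hypothetical helpers; note that dropping the seconds field truncates rather than rounds.

```shell
# to_hhmm WALLTIME: convert hh:mm:ss to hh:mm by dropping the seconds field;
# a value already in hh:mm form is printed unchanged.
to_hhmm() {
  case "$1" in
    *:*:*) echo "${1%:*}" ;;
    *:*)   echo "$1" ;;
    *)     echo "unrecognized walltime: $1" >&2; return 1 ;;
  esac
}

# to_hhmmss WALLTIME: the reverse direction, appending zero seconds.
to_hhmmss() {
  case "$1" in
    *:*:*) echo "$1" ;;
    *:*)   echo "$1:00" ;;
    *)     echo "unrecognized walltime: $1" >&2; return 1 ;;
  esac
}
```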
Note: This assumes that the pipeline, along with its .def and .conf files, is available in ~/flowr/pipelines.
Also, the .conf file should have all the correct paths, and the .def file should have the resource requirements specified correctly.
## get input fastqs
fqs1=~/flowr/genomes/target_brca_example/brca.example.illumina.0.1.fastq
fqs2=~/flowr/genomes/target_brca_example/brca.example.illumina.0.2.fastq
## submit to the cluster
flowr run x=fastq_bam_bwa fqs1=$fqs1 fqs2=$fqs2 samplename=samp execute=TRUE
## change the platform specified in flowdef
flowr run x=fastq_bam_bwa fqs1=$fqs1 fqs2=$fqs2 samplename=samp execute=TRUE platform=slurm
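A mistyped fastq path is easier to catch before submission than after. The bash sketch below checks that both input files are readable and then prints, without executing, the same flowr run command shown above; build_run_cmd is a hypothetical wrapper, not part of flowr.

```shell
# build_run_cmd FQ1 FQ2 SAMPLE: check that both FASTQ files are readable, then
# print (not execute) the flowr command used in this tutorial.
build_run_cmd() {
  local f
  for f in "$1" "$2"; do
    [ -r "$f" ] || { echo "cannot read: $f" >&2; return 1; }
  done
  echo "flowr run x=fastq_bam_bwa fqs1=$1 fqs2=$2 samplename=$3 execute=TRUE"
}

# Example:
#   build_run_cmd "$fqs1" "$fqs2" samp
```

The printed line can be reviewed and then pasted into the terminal once it looks right.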
OR from R using:
library(flowr)
fqpath = "~/flowr/genomes/target_brca_example"
## demonstrating that multiple fqs can be used here...
fobj = run(x = "fastq_bam_bwa", samplename = "samp1", execute = TRUE,
  fqs1 = rep(paste0(fqpath, "/brca.example.illumina.0.1.fastq"), 2),
  fqs2 = rep(paste0(fqpath, "/brca.example.illumina.0.2.fastq"), 2))
Refer to the help pages for more details on the run function.
The run function performs several steps, finally submitting the commands to the cluster. It may be useful to go through these steps to understand the details.
1. Get user inputs
Using the name of the pipeline, run searches for it in several places, including ~/flowr/pipelines.
library(flowr)
setwd("~/flowr/pipelines")
source("fastq_bam_bwa.R")
## this may throw a warning if paths do not exist
## if you have used modules instead of full paths please ignore the warnings
load_opts("fastq_bam_bwa.conf")
## Get example input
## these can be a vector of multiple paired-end files
## OR multiple single-end files
fqs1 = "~/flowr/genomes/target_brca_example/brca.example.illumina.0.1.fastq"
fqs2 = "~/flowr/genomes/target_brca_example/brca.example.illumina.0.2.fastq"
samp = "samplename"
## optionally specify the center, lane, platform etc.
set_opts(rg_center = "the_institute", rg_lane = "1")
## **Note:** load_opts checks if variables ending with
## _exe, _path, _dir etc. exist or not.
## make sure they are all correct.
## Ignore the warnings, if instead of specifying full path to a tool
## you are using the module command.
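The same kind of existence check can be approximated from the terminal before starting R. The bash sketch below assumes each row of the .conf file is "name<TAB>value" and that lines starting with # are comments; both are assumptions about the file layout, and check_conf_paths is a hypothetical helper rather than a flowr command.

```shell
# check_conf_paths FILE: for each "name<TAB>value" row whose name ends in
# _exe, _path or _dir, report whether the value exists on disk.
check_conf_paths() {
  local name value rest state
  while IFS=$'\t' read -r name value rest || [ -n "$name" ]; do
    case "$name" in
      '#'*) continue ;;                    # skip comment lines
      *_exe|*_path|*_dir) ;;               # only path-like parameters are checked
      *) continue ;;
    esac
    case "$value" in
      "~/"*) value="$HOME/${value#??}" ;;  # expand a leading ~/
    esac
    if [ -e "$value" ]; then state=ok; else state=missing; fi
    printf '%s\t%s\t%s\n' "$state" "$name" "$value"
  done < "$1"
}

# Example:
#   check_conf_paths ~/flowr/pipelines/fastq_bam_bwa.conf
```

Rows reported as missing are expected, and can be ignored, for tools that are provided through the module command instead of a full path.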
Refer to the help pages of fetch_pipes for more details.
2. Read flow definition
def = as.flowdef("fastq_bam_bwa.def")
## def seems to be a file, reading it...
##
## checking if required columns are present...
## checking if resources columns are present...
## checking if dependency column has valid names...
## checking if submission column has valid names...
## checking for missing rows in def...
## checking for extra rows in def...
## checking submission and dependency types...
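The first of these checks, that the expected columns are present, can be mimicked from the shell when editing a flowdef by hand. This bash sketch only inspects the header row of a tab-delimited file; missing_columns is a hypothetical helper, and the column names in the usage comment (jobname, sub_type, prev_jobs, dep_type) are assumptions that should be compared against the flow definition documentation.

```shell
# missing_columns FILE COL...: print each COL that is absent from the
# tab-delimited header (first line) of FILE.
missing_columns() {
  local file="$1" header col
  shift
  header=$(head -n 1 "$file")
  for col in "$@"; do
    case $'\t'"$header"$'\t' in
      *$'\t'"$col"$'\t'*) ;;     # column found, nothing to report
      *) echo "$col" ;;
    esac
  done
}

# Example (column names are assumptions, see above):
#   missing_columns fastq_bam_bwa.def jobname sub_type prev_jobs dep_type
```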
The plot would work only if you have X11 etc enabled, i.e. if you logged into the cluster using ssh -X (or ssh -Y).
Optionally, one can edit all config files on their own machine, debug and sort issues; when done, upload them to the cluster and submit.
plot_flow(def) ## on a cluster, only works if graphics X11 is enabled. ssh -X
3. Create a table with all commands to run
We use the function fastq_bam_bwa to create a flow mat.
## run the module and create a flow mat, with all the commands
out = fastq_bam_bwa(fqs1, fqs2, samplename = samp)
## optionally, write this to a file (a simple tab delimited table)
write_sheet(out$flowmat, "fastq_bam_bwa.tsv")
4. Executing on the computing cluster
Now we can submit this to the cluster using:
fobj2 = to_flow(x='~/flowr/pipelines/fastq_bam_bwa.tsv',
def='~/flowr/pipelines/fastq_bam_bwa.def',
name = "fastq_bam_bwa",
execute=TRUE)
OR from the terminal using:
flowmat=~/flowr/pipelines/fastq_bam_bwa.tsv
flowdef=~/flowr/pipelines/fastq_bam_bwa.def
flowr to_flow x=$flowmat def=$flowdef name=fastq_bam_bwa execute=TRUE
Tip: This example shows a single sample, but you may have as many samples as you like in the flowmat. In case of multiple samples, the samplename column is used to group commands, and each set is submitted as an individual flow.
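To preview how a multi-sample flowmat would be grouped, the commands can be tallied per sample and step. The bash sketch below assumes the flowmat is a tab-delimited file with a header row containing columns named samplename and jobname; count_by_sample is a hypothetical helper.

```shell
# count_by_sample FLOWMAT: print "samplename<TAB>jobname<TAB>n" for every
# sample/step combination in a tab-delimited flowmat with a header row.
count_by_sample() {
  awk -F'\t' '
    NR == 1 {                          # locate columns by their header name
      for (i = 1; i <= NF; i++) col[$i] = i
      next
    }
    { n[$(col["samplename"]) "\t" $(col["jobname"])]++ }
    END { for (k in n) print k "\t" n[k] }
  ' "$1" | LC_ALL=C sort
}

# Example:
#   count_by_sample ~/flowr/pipelines/fastq_bam_bwa.tsv
```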
Several other functions may be used after submission.
Checking the status:
## from R:
status(x="~/flowr/runs/fastq_bam_bwa*")
## OR from terminal using:
flowr status x=~/flowr/runs/fastq_bam_bwa*
| | total| started| completed| exit_status|status |
|:---------|-----:|-------:|---------:|-----------:|:----------|
|001.aln1 | 1| 1| 0| 0|processing |
|002.aln2 | 1| 1| 0| 0|processing |
|003.sampe | 1| 0| 0| 0|pending |
|004.fixrg | 1| 0| 0| 0|pending |
|005.merge | 1| 0| 0| 0|pending |
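For scripting, a table in this layout can be reduced to the list of steps that have not finished yet. The bash sketch below assumes the status output has been captured as text with the same column order as shown above (step, total, started, completed, ...); incomplete_steps is a hypothetical helper.

```shell
# incomplete_steps: read a pipe-delimited status table on stdin and print
# every step whose "completed" count is below its "total".
incomplete_steps() {
  awk -F'|' '
    NR <= 2 { next }                   # skip the header and separator rows
    {
      step = $2; total = $3; completed = $5
      gsub(/ /, "", step)              # trim padding around the step name
      if (completed + 0 < total + 0) print step
    }
  '
}

# Example:
#   flowr status x=~/flowr/runs/fastq_bam_bwa* | incomplete_steps
```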
You may also kill or rerun the flow.
flowr kill x=~/flowr/runs/fastq_bam_bwa*
flowr rerun x=~/flowr/runs/<full path of the flow> start_from=fixrg
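When several runs share the same prefix, a wildcard matches all of them, while rerun needs one specific flow. The bash sketch below picks the most recently modified run folder for a given prefix; latest_run is a hypothetical helper, and the prefix must be passed with $HOME rather than a quoted ~ so that it expands.

```shell
# latest_run PREFIX: print the most recently modified directory whose path
# starts with PREFIX; returns non-zero if there is none.
latest_run() {
  local d newest=""
  for d in "$1"*/; do
    [ -d "$d" ] || continue
    if [ -z "$newest" ] || [ "$d" -nt "$newest" ]; then
      newest="$d"
    fi
  done
  [ -n "$newest" ] && echo "${newest%/}"
}

# Example:
#   flowr status x="$(latest_run "$HOME/flowr/runs/fastq_bam_bwa")"
```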
Please use the respective help pages for more details.