1 Installation

1.1 Prerequisites

Before installing the RcwlPipeline package of R, we need:

  1. Programming Environment
  2. Python3
  3. pip3
  4. nodejs
  5. Docker
  6. R and Rstudio
  7. Bioconductor
  8. Anaconda and conda-forge
  9. cwltool

Caution: in the conda-forge webpage, it is instructed to set the channel_priority parameter be strict, but this will raise errors during the cwltool installation. So, set it be flexible instead, as follows:

conda config --set channel_priority flexible

Also, you may want to remove the auto_activate_base option in the conda configuration file. Otherwise, (base) will appear on the left of a terminal prompt each time you open a console.

conda config --set auto_activate_base False

1.2 How to Install RcwlPipeline

1.2.1 To a personal PC/Mac

Inside R, type:

BiocManager::install(c("Rcwl", "RcwlPipelines"))

Alternatively, using conda in a regular console:

conda install -c bioconda bioconductor-rcwlpipelines

If the installation went successful, then we should be able to load it inside R by typing

library(RcwlPipelines)

Make sure the above command doesn’t return any error messages. If the installation using BiocManager::install fails because of some missing packages which RcwlPiplines depend on, such as rlang, then it may help to reset the system environment of R as follows.

Sys.setenv(R_INSTALL_STAGED=FALSE)

Then proceed with the source installation, e.g.,

install.packages("rlang",type="source")

1.2.2 To a project space in the CCR cluster

For a general guideline for installing an R package in the CCR cluster, see their knowlegde base article.

Create an R-library directory and a temporary directory under your /projects directory; for example,

mkdir /projects/rpci/songliu/mkorobkin/rlibs
mkdir /projects/rpci/songliu/mkorobkin/tmp

Then, load R module by typing:

module load R/[version]

To begin, you need to specify the temporary directory to use during the installation, as follows.

Sys.setenv(TMPDIR="/projects/rpci/songliu/mkorobkin/tmp")

First, install the Bioconductor package in the personal R-library as follows.

install.packages("BiocManager", lib="/projects/rpci/songliu/mkorobkin/rlibs")

Next, install the RcwlPipeline package using BiocManager as follows.

BiocManager::install(c("Rcwl","RcwlPipelines"),lib="/projects/rpci/songliu/mkorobkin/rlibs")

Once you successfully install the RcwlPipelines, it can be loaded by typing:

library(RcwlPipelines, lib.loc="/projects/rpci/songliu/mkorobkin/rlibs")

1.2.3 Install R under one’s home directory in the CCR cluster

wget https://cran.r-project.org/src/base/R-4/R-4.2.1.tar.gz  # pick the version you like
tar -xvzf R-4.2.1.tar.gz
cd R-4.2.1
./configure --prefix=/projects/rpci/songliu/mkorobkin/R  # Use ones own /projects directory
make && make install

Then, proceed inside this private version of R as usual, i.e.,

dir.create("/projects/rpci/songliu/mkorobkin/tmp/")
Sys.setenv(TMPDIR="/projects/rpci/songliu/mkorobkin/tmp") 
install.packages("BiocManager")
BiocManager::install(c("Rcwl","RcwlPipelines"))

Notice that you need not to designate a personalized library (e.g. /rlibs) unless using R in a public module.

Note: If your installation halts because of the dependency: RcppTOML, then download its older version and compile it outside R as follows.

wget http://lib.stat.cmu.edu/R/CRAN/src/contrib/Archive/RcppTOML/RcppTOML_0.1.3.tar.gz
R CMD INSTALL RcppTOML_0.1.3.tar.gz

Then, proceed in R using force = TRUE option:

BiocManager::install(c("Rcwl","RcwlPipelines"), force = TRUE)

Also note: If your installation fails with a message:

ERROR: 'configure' exists but is not executable -- see the 'R Installation and Administration Manual'

then you may need to set the TMPDIR environment variable which R will use as the compilation directory:

mkdir ~/tmp
export TMPDIR=~/tmp

Then, proceed in R using force = TRUE option.

2 Common Workflow Language (CWL)

In order to use RcwlPipeline, we need to know its language: CWL. The official userguide http://www.commonwl.org/user_guide/ gives an introduction. (Note that we need the cwltool software in order to execute their example programs.)

In order to check if cwltool has been installed, type

cwltool --version

If this command returns an error: cwltool: command not found, despite the fact that you have installed it previously, then you may first have to open a python virtual environment as follows.

python3 -m venv env
source env/bin/activate

Once you do, type cwltool --version again and see. After you’re done working on cwltool, you can close the virtual environment by typing:

deactivate

2.1 Example: Hello World

Consider the echo command in Bash:

echo "Hello World"
## Hello World

It accepts an input string and prints it as is to a screen output. In CWL, this process can be programmed under a class: CommandLineTool as follows.

cwlVersion: v1.0
class: CommandLineTool
baseCommand: echo
inputs:
  message: # this is not a keyword but a user-defined name for an input parameter
    type: string
    inputBinding: # specifies the input parameters following the base command
      position: 0 # 1st field of an input
outputs:
 hello_output: # this is also a user-defined name for an output
   type: stdout
stdout: hello-out.txt # output file name (user defined)

This program (hello.cwl) can be executed using cwltool as

cwltool hello.cwl --message 'Hello World'

which outputs the given message to a file (hello-out.txt):

cat hello-out.txt
## Hello World

Notice that the input string (“Hello World”) had to be preceded by a user-defined input ID: --message. This is how you pass input parameters in CWL. The syntax to program a CommandLineTool is given in https://www.commonwl.org/v1.2/CommandLineTool.html.

2.2 Example: Word-count workflow

Suppose you are given a task to count the number of a word ‘one’ in the following file (mock.txt).

one
two
one
three

Here is a solution using shell scripts:

grep one mock.txt > out.txt
wc -l out.txt > out2.txt
cat out2.txt
##        2 out.txt

In order to express this workflow using CWL, we may first create two files: grep.cwl and countline.cwl as follows, which define the shell commands grep and wc -l, respectively.

cwlVersion: v1.0
class: CommandLineTool
baseCommand: grep # Linux shell command
inputs:
  pattern: # user-defined input parameter name
    type: string
    inputBinding:
      position: 0  # 1st field of a screen input, e.g., one
  file_to_search: # user-defined input parameter name
    type: File
    inputBinding:
      position: 1 # 2nd field of a screen input, e.g., mock.txt
outputs: 
  results: # user-defined id for stdout
    type: stdout
stdout: grepout.txt # user-defined output file name
cwlVersion: v1.0
class: CommandLineTool
baseCommand: [wc, -l]  # Two fields in a command are grouped inside []
inputs:
  gfile: # user-defined input name
    type: File
    inputBinding:
      position: 0 # 1st field of a screen input, e.g., out.txt
outputs:
  counts: # user-defined ID for stdout
    type: stdout
stdout: wcout.txt # user-defined output file name

Then, the workflow can be written as follows (workflow-ex1.cwl).

cwlVersion: v1.0
class: Workflow
inputs:
  grep_pattern: # user-defined input name
    type: string
  target_file: # user-defined input name
    type: File
outputs:
  counts: # user-defined output name
    type: File
    outputSource: countline/counts # refers to countline and its output [counts] in steps below
steps:
  grep: # step 1 
    run: grep.cwl
    in:
      pattern: grep_pattern # pattern is a name used in grep.cwl
      file_to_search: target_file # file_to_search is a name used in grep.cwl
    out: [results] # from grep.cwl
  countline: # step 2
    run: countline.cwl
    in:
      gfile: grep/results # refers to grep and its output [results] in steps
                          # gfile is a name used in countline.cwl
    out: [counts] # from countline.cwl

This workflow can be executed by using cwltool as

cwltool workflow-ex1.cwl --target_file mock.txt --grep_pattern one

The results are written in wcout.txt:

cat wcout.txt
##        2 /private/tmp/docker_tmpltp9yrf5/stg71d8078a-e294-47c1-956e-6cde7acc160f/grepout.txt

For the syntax of programming a workflow, see https://www.commonwl.org/v1.2/Workflow.html. Note that the out parameter name in each step needs to be enclosed in square brackets [ ], although I don’t know the reason why. One annoying thing in their syntax is that there is no way knowing where the output file will be by looking at the Workflow program; you need to dig deeper into the CommandLineTools program to find it out. Also, to keep on re-naming the input/output parameters in both the CommandLineTools program and the Workflow program seems redundant – Maybe this is in case a workflow calls for the same CommandLineTools multiple times in a parallell processor.

3 RcwlPipeline Tutorial

The official userguide for RcwlPipline is: https://rcwl.org/RcwlBook. To begin, load the package:

library(RcwlPipelines)

And run an update of the package as follows.

Sys.setenv(TMPDIR="/projects/rpci/[groupname]/[username]/tmp") 
cwlUpdate()

3.1 Example: Hello World

In order to program the echo command using the RcwlPipline package and execute it with R, we first define an input parameter as follows.

input1<-InputParam(id="message")

Notice that we did not have to specify its type. Second, we construct a cwlProcess object called Recho for the shell commmand echo by

Recho<-cwlProcess(baseCommand="echo", inputs=InputParamList(input1))

There is only one input in the echo command, so InputParamList above has only one element. Now we can print the definition of a CML commandLineTool: Recho by typing

Recho
## class: cwlProcess 
##  cwlClass: CommandLineTool 
##  cwlVersion: v1.0 
##  baseCommand: echo 
## inputs:
##   message (string):  
## outputs:
## output:
##   type: stdout

We see that this is what we would have written in the Recho.cwl file if we were using cwltools rather than R. As a matter of fact, there is a utility function called writeCWL which will convert the cwlProcess object into general CWL objects and produce a .cwl file and a .yml file which specifies input parameters.

Now, if we assign a value to the input parameter as

Recho$message<-"Hello World"

Then we can execute the Recho command as follows.

result<-runCWL(Recho,outdir=tempdir())
## INFO Final process status is success

To see the output, we type

readLines(result$output)
## [1] "Hello World"

3.2 Example: Word-count workflow

## Define "grep" command
i1<-InputParam(id="pattern", type="string", prefix="-e")
i2<-InputParam(id="file_to_search", type="File")
o1<-OutputParam(id="results", type="File", glob="grepout.txt")
Rgrep<-cwlProcess(baseCommand="grep",
                  inputs=InputParamList(i1,i2),
                  outputs=OutputParamList(o1),
                  stdout="grepout.txt")
Rgrep
## class: cwlProcess 
##  cwlClass: CommandLineTool 
##  cwlVersion: v1.0 
##  baseCommand: grep 
## inputs:
##   pattern (string): -e 
##   file_to_search (File):  
## outputs:
## results:
##   type: File
##   outputBinding:
##     glob: grepout.txt
## stdout: grepout.txt
## Define "wc -l" command
i1<-InputParam(id="gfile",type="File")
o1<-OutputParam(id="counts",type="stdout")
LineCount<-cwlProcess(baseCommand="wc",
                      arguments=list("-l"),
                      inputs=InputParamList(i1),
                      outputs=OutputParamList(o1),
                      stdout="out.txt")
LineCount
## class: cwlProcess 
##  cwlClass: CommandLineTool 
##  cwlVersion: v1.0 
##  baseCommand: wc 
## arguments: -l 
## inputs:
##   gfile (File):  
## outputs:
## counts:
##   type: stdout
## stdout: out.txt
## Define a workflow
i1 <- InputParam(id = "word", type = "string")
i2 <- InputParam(id = "infile", type = "File")
s1 <- cwlStep(id = "FindWord", run = Rgrep,
              In = list(file_to_search = "infile", pattern = "word"))
s2 <- cwlStep(id = "NumWord", run = LineCount,
              In = list(gfile = "FindWord/results"))
o1 <- OutputParam(id = "outfile", type = "File",
                  outputSource = "NumWord/counts")
cwl <- cwlWorkflow(inputs = InputParamList(i1,i2),
                   outputs = OutputParamList(o1))
cwl <- cwl + s1 + s2
cwl
## class: cwlWorkflow 
##  cwlClass: Workflow 
##  cwlVersion: v1.0 
## inputs:
##   word (string):  
##   infile (File):  
## outputs:
## outfile:
##   type: File
##   outputSource: NumWord/counts
## steps:
## FindWord:
##   run: FindWord.cwl
##   in:
##     file_to_search: infile
##     pattern: word
##   out:
##   - results
## NumWord:
##   run: NumWord.cwl
##   in:
##     gfile: FindWord/results
##   out:
##   - counts
## Run the workflow
cwl$infile<-file.path(getwd(),"mock.txt")
cwl$word<-"one"
r3<-runCWL(cwl, outdir = getwd())
## INFO Final process status is success
readLines(r3$output)
## [1] "       2 /private/tmp/docker_tmp1f__8iyb/stg393006f3-1153-474d-aedb-8901bbc2ec97/grepout.txt"
## Draw a flowchart
plotCWL(cwl)

3.3 Example: Sequence alignment

Mutation identification in a tumor cell begins with the alignment of its DNA sequence to a reference genome, followed by the generation of somatic variant calls. The most commonly used software for the sequence alignment are SAMtools and BWA. Their commandline tools have already been defined inside the RcwlPipelines package, so you only need to load them using the cwlLoad function when writing a workflow progam.

library(RcwlPipelines)
## Check and see if all the necessary commandLineTools exist
tls <- cwlSearch("bwa|sam2bam|sortBam|samtools_index|markdup",
                 type = "tool")
tls
## cwlHub with 6 records
## cache path:  ~/Library/Caches/Rcwl 
## # last modified date:  2021-11-22 
## # cwlSearch() to query scripts
## # cwlLoad('title') to load the script
## # additional mcols(): rid, rpath, Type, Container, mtime, ...
## 
##            title             Command              
##   BFC65  | tl_bwa_index      bwa index            
##   BFC66  | tl_bwa            bwa mem              
##   BFC122 | tl_markdup        picard MarkDuplicates
##   BFC153 | tl_sam2bam        samtools view        
##   BFC157 | tl_samtools_index samtools index       
##   BFC167 | tl_sortBam        samtools sort
## Load the commandLineTools
bwa <- cwlLoad("tl_bwa")
bwa_index <- cwlLoad("tl_bwa_index")
markdup <- cwlLoad("tl_markdup")
sam2bam <- cwlLoad("tl_sam2bam")
samtools_index <- cwlLoad("tl_samtools_index")
sortBam <- cwlLoad("tl_sortBam")
## Define input parameters
p1 <- InputParam(id = "threads", type = "int")
p2 <- InputParam(id = "RG", type = "string")
p3 <- InputParam(id = "Ref", type = "string")
p4 <- InputParam(id = "FQ1", type = "File")
p5 <- InputParam(id = "FQ2", type = "File?")
## Define steps of a workflow -----------------------
## bwa
s1 <- cwlStep(id = "bwa", run = bwa,
              In = list(threads = "threads",
                        RG = "RG",
                        Ref = "Ref",
                        FQ1 = "FQ1",
                        FQ2 = "FQ2"))
## sam to bam
s2 <- cwlStep(id = "sam2bam", run = sam2bam,
              In = list(sam = "bwa/sam"))
## sort bam
s3 <- cwlStep(id = "sortBam", run = sortBam,
              In = list(bam = "sam2bam/bam"))
## mark duplicates
s4 <- cwlStep(id = "markdup", run = markdup,
              In = list(ibam = "sortBam/sbam",
                        obam = list(
                          valueFrom="$(inputs.ibam.nameroot).mdup.bam"),
                        matrix = list(
                          valueFrom="$(inputs.ibam.nameroot).markdup.txt")))
## index bam
s5 <- cwlStep(id = "idxBam", run = samtools_index,
              In = list(bam = "markdup/mBam"))
## --------------------------------------------------
## Requirements
req1 <- requireStepInputExpression()
req2 <- requireJS()
## Define outputs
o1 <- OutputParam(id = "Bam", type = "File", outputSource = "markdup/mBam")
o2 <- OutputParam(id = "Idx", type = "File", outputSource = "idxBam/idx")
## Define cwlWorkflow
Align <- cwlWorkflow(requirements = list(req1, req2),
                     inputs = InputParamList(p1, p2, p3, p4, p5),
                     outputs = OutputParamList(o1, o2))
## build pipeline
Align <- Align + s1 + s2 + s3 + s4 + s5
Align
## class: cwlWorkflow 
##  cwlClass: Workflow 
##  cwlVersion: v1.0 
## requirements:
## - class: StepInputExpressionRequirement
## - class: InlineJavascriptRequirement
## inputs:
##   threads (int):  
##   RG (string):  
##   Ref (string):  
##   FQ1 (File):  
##   FQ2 (File?):  
## outputs:
## Bam:
##   type: File
##   outputSource: markdup/mBam
## Idx:
##   type: File
##   outputSource: idxBam/idx
## steps:
## bwa:
##   run: bwa.cwl
##   in:
##     threads: threads
##     RG: RG
##     Ref: Ref
##     FQ1: FQ1
##     FQ2: FQ2
##   out:
##   - sam
## sam2bam:
##   run: sam2bam.cwl
##   in:
##     sam: bwa/sam
##   out:
##   - bam
## sortBam:
##   run: sortBam.cwl
##   in:
##     bam: sam2bam/bam
##   out:
##   - sbam
## markdup:
##   run: markdup.cwl
##   in:
##     ibam: sortBam/sbam
##     obam:
##       valueFrom: $(inputs.ibam.nameroot).mdup.bam
##     matrix:
##       valueFrom: $(inputs.ibam.nameroot).markdup.txt
##   out:
##   - mBam
##   - Mat
## idxBam:
##   run: idxBam.cwl
##   in:
##     bam: markdup/mBam
##   out:
##   - idx
## Draw a flow chart
plotCWL(Align)