Before installing the RcwlPipeline package of R, we need:
Caution: in the conda-forge
webpage, it is instructed to set the channel_priority
parameter be strict, but this will raise errors during the
cwltool installation. So, set it be flexible instead, as
follows:
conda config --set channel_priority flexible
Also, you may want to remove the auto_activate_base
option in the conda configuration file. Otherwise, (base)
will appear on the left of a terminal prompt each time you open a
console.
conda config --set auto_activate_base False
Inside R, type:
BiocManager::install(c("Rcwl", "RcwlPipelines"))
Alternatively, using conda in a regular console:
conda install -c bioconda bioconductor-rcwlpipelines
If the installation went successful, then we should be able to load it inside R by typing
library(RcwlPipelines)
Make sure the above command doesn’t return any error messages. If the
installation using BiocManager::install fails because of
some missing packages which RcwlPiplines depend on, such as
rlang, then it may help to reset the system environment of
R as follows.
Sys.setenv(R_INSTALL_STAGED=FALSE)
Then proceed with the source installation, e.g.,
install.packages("rlang",type="source")
For a general guideline for installing an R package in the CCR cluster, see their knowlegde base article.
Create an R-library directory and a temporary directory under your /projects directory; for example,
mkdir /projects/rpci/songliu/mkorobkin/rlibs
mkdir /projects/rpci/songliu/mkorobkin/tmp
Then, load R module by typing:
module load R/[version]
To begin, you need to specify the temporary directory to use during the installation, as follows.
Sys.setenv(TMPDIR="/projects/rpci/songliu/mkorobkin/tmp")
First, install the Bioconductor package in the personal R-library as follows.
install.packages("BiocManager", lib="/projects/rpci/songliu/mkorobkin/rlibs")
Next, install the RcwlPipeline package using
BiocManager as follows.
BiocManager::install(c("Rcwl","RcwlPipelines"),lib="/projects/rpci/songliu/mkorobkin/rlibs")
Once you successfully install the RcwlPipelines, it can be loaded by typing:
library(RcwlPipelines, lib.loc="/projects/rpci/songliu/mkorobkin/rlibs")
wget https://cran.r-project.org/src/base/R-4/R-4.2.1.tar.gz # pick the version you like
tar -xvzf R-4.2.1.tar.gz
cd R-4.2.1
./configure --prefix=/projects/rpci/songliu/mkorobkin/R # Use ones own /projects directory
make && make install
Then, proceed inside this private version of R as usual, i.e.,
dir.create("/projects/rpci/songliu/mkorobkin/tmp/")
Sys.setenv(TMPDIR="/projects/rpci/songliu/mkorobkin/tmp")
install.packages("BiocManager")
BiocManager::install(c("Rcwl","RcwlPipelines"))
Notice that you need not to designate a personalized library (e.g. /rlibs) unless using R in a public module.
Note: If your installation halts because of the
dependency: RcppTOML, then download its older version and
compile it outside R as follows.
wget http://lib.stat.cmu.edu/R/CRAN/src/contrib/Archive/RcppTOML/RcppTOML_0.1.3.tar.gz
R CMD INSTALL RcppTOML_0.1.3.tar.gz
Then, proceed in R using force = TRUE option:
BiocManager::install(c("Rcwl","RcwlPipelines"), force = TRUE)
Also note: If your installation fails with a message:
ERROR: 'configure' exists but is not executable -- see the 'R Installation and Administration Manual'
then you may need to set the TMPDIR environment variable which R will use as the compilation directory:
mkdir ~/tmp
export TMPDIR=~/tmp
Then, proceed in R using force = TRUE option.
In order to use RcwlPipeline, we need to know its language: CWL. The official userguide http://www.commonwl.org/user_guide/ gives an introduction. (Note that we need the cwltool software in order to execute their example programs.)
In order to check if cwltool has been installed, type
cwltool --version
If this command returns an error:
cwltool: command not found, despite the fact that you have
installed it previously, then you may first have to open a python
virtual environment as follows.
python3 -m venv env
source env/bin/activate
Once you do, type cwltool --version again and see. After
you’re done working on cwltool, you can close the virtual
environment by typing:
deactivate
Consider the echo command in Bash:
echo "Hello World"
## Hello World
It accepts an input string and prints it as is to a screen output. In
CWL, this process can be programmed under a class:
CommandLineTool as follows.
cwlVersion: v1.0
class: CommandLineTool
baseCommand: echo
inputs:
message: # this is not a keyword but a user-defined name for an input parameter
type: string
inputBinding: # specifies the input parameters following the base command
position: 0 # 1st field of an input
outputs:
hello_output: # this is also a user-defined name for an output
type: stdout
stdout: hello-out.txt # output file name (user defined)
This program (hello.cwl) can be executed using cwltool as
cwltool hello.cwl --message 'Hello World'
which outputs the given message to a file (hello-out.txt):
cat hello-out.txt
## Hello World
Notice that the input string (“Hello World”) had to be preceded by a
user-defined input ID: --message. This is how you pass
input parameters in CWL. The syntax to program a
CommandLineTool is given in https://www.commonwl.org/v1.2/CommandLineTool.html.
Suppose you are given a task to count the number of a word ‘one’ in the following file (mock.txt).
one
two
one
three
Here is a solution using shell scripts:
grep one mock.txt > out.txt
wc -l out.txt > out2.txt
cat out2.txt
## 2 out.txt
In order to express this workflow using CWL, we may first create two
files: grep.cwl and countline.cwl as
follows, which define the shell commands grep and
wc -l, respectively.
cwlVersion: v1.0
class: CommandLineTool
baseCommand: grep # Linux shell command
inputs:
pattern: # user-defined input parameter name
type: string
inputBinding:
position: 0 # 1st field of a screen input, e.g., one
file_to_search: # user-defined input parameter name
type: File
inputBinding:
position: 1 # 2nd field of a screen input, e.g., mock.txt
outputs:
results: # user-defined id for stdout
type: stdout
stdout: grepout.txt # user-defined output file name
cwlVersion: v1.0
class: CommandLineTool
baseCommand: [wc, -l] # Two fields in a command are grouped inside []
inputs:
gfile: # user-defined input name
type: File
inputBinding:
position: 0 # 1st field of a screen input, e.g., out.txt
outputs:
counts: # user-defined ID for stdout
type: stdout
stdout: wcout.txt # user-defined output file name
Then, the workflow can be written as follows (workflow-ex1.cwl).
cwlVersion: v1.0
class: Workflow
inputs:
grep_pattern: # user-defined input name
type: string
target_file: # user-defined input name
type: File
outputs:
counts: # user-defined output name
type: File
outputSource: countline/counts # refers to countline and its output [counts] in steps below
steps:
grep: # step 1
run: grep.cwl
in:
pattern: grep_pattern # pattern is a name used in grep.cwl
file_to_search: target_file # file_to_search is a name used in grep.cwl
out: [results] # from grep.cwl
countline: # step 2
run: countline.cwl
in:
gfile: grep/results # refers to grep and its output [results] in steps
# gfile is a name used in countline.cwl
out: [counts] # from countline.cwl
This workflow can be executed by using cwltool as
cwltool workflow-ex1.cwl --target_file mock.txt --grep_pattern one
The results are written in wcout.txt:
cat wcout.txt
## 2 /private/tmp/docker_tmpltp9yrf5/stg71d8078a-e294-47c1-956e-6cde7acc160f/grepout.txt
For the syntax of programming a workflow, see https://www.commonwl.org/v1.2/Workflow.html. Note that
the out parameter name in each step needs to be enclosed in
square brackets [ ], although I don’t know the reason why. One annoying
thing in their syntax is that there is no way knowing where the output
file will be by looking at the Workflow program; you need
to dig deeper into the CommandLineTools program to find it out. Also, to
keep on re-naming the input/output parameters in both the
CommandLineTools program and the Workflow program seems
redundant – Maybe this is in case a workflow calls for the same
CommandLineTools multiple times in a parallell processor.
The official userguide for RcwlPipline is: https://rcwl.org/RcwlBook. To begin, load the package:
library(RcwlPipelines)
And run an update of the package as follows.
Sys.setenv(TMPDIR="/projects/rpci/[groupname]/[username]/tmp")
cwlUpdate()
In order to program the echo command using the RcwlPipline package and execute it with R, we first define an input parameter as follows.
input1<-InputParam(id="message")
Notice that we did not have to specify its type. Second, we construct
a cwlProcess object called Recho for the shell
commmand echo by
Recho<-cwlProcess(baseCommand="echo", inputs=InputParamList(input1))
There is only one input in the echo command, so
InputParamList above has only one element. Now we can print
the definition of a CML commandLineTool: Recho by
typing
Recho
## class: cwlProcess
## cwlClass: CommandLineTool
## cwlVersion: v1.0
## baseCommand: echo
## inputs:
## message (string):
## outputs:
## output:
## type: stdout
We see that this is what we would have written in the
Recho.cwl file if we were using
cwltools rather than R. As a matter of fact, there is a
utility function called writeCWL which will convert the
cwlProcess object into general CWL objects and produce a
.cwl file and a .yml file which specifies
input parameters.
Now, if we assign a value to the input parameter as
Recho$message<-"Hello World"
Then we can execute the Recho command as follows.
result<-runCWL(Recho,outdir=tempdir())
## [1;30mINFO[0m Final process status is success
To see the output, we type
readLines(result$output)
## [1] "Hello World"
## Define "grep" command
i1<-InputParam(id="pattern", type="string", prefix="-e")
i2<-InputParam(id="file_to_search", type="File")
o1<-OutputParam(id="results", type="File", glob="grepout.txt")
Rgrep<-cwlProcess(baseCommand="grep",
inputs=InputParamList(i1,i2),
outputs=OutputParamList(o1),
stdout="grepout.txt")
Rgrep
## class: cwlProcess
## cwlClass: CommandLineTool
## cwlVersion: v1.0
## baseCommand: grep
## inputs:
## pattern (string): -e
## file_to_search (File):
## outputs:
## results:
## type: File
## outputBinding:
## glob: grepout.txt
## stdout: grepout.txt
## Define "wc -l" command
i1<-InputParam(id="gfile",type="File")
o1<-OutputParam(id="counts",type="stdout")
LineCount<-cwlProcess(baseCommand="wc",
arguments=list("-l"),
inputs=InputParamList(i1),
outputs=OutputParamList(o1),
stdout="out.txt")
LineCount
## class: cwlProcess
## cwlClass: CommandLineTool
## cwlVersion: v1.0
## baseCommand: wc
## arguments: -l
## inputs:
## gfile (File):
## outputs:
## counts:
## type: stdout
## stdout: out.txt
## Define a workflow
i1 <- InputParam(id = "word", type = "string")
i2 <- InputParam(id = "infile", type = "File")
s1 <- cwlStep(id = "FindWord", run = Rgrep,
In = list(file_to_search = "infile", pattern = "word"))
s2 <- cwlStep(id = "NumWord", run = LineCount,
In = list(gfile = "FindWord/results"))
o1 <- OutputParam(id = "outfile", type = "File",
outputSource = "NumWord/counts")
cwl <- cwlWorkflow(inputs = InputParamList(i1,i2),
outputs = OutputParamList(o1))
cwl <- cwl + s1 + s2
cwl
## class: cwlWorkflow
## cwlClass: Workflow
## cwlVersion: v1.0
## inputs:
## word (string):
## infile (File):
## outputs:
## outfile:
## type: File
## outputSource: NumWord/counts
## steps:
## FindWord:
## run: FindWord.cwl
## in:
## file_to_search: infile
## pattern: word
## out:
## - results
## NumWord:
## run: NumWord.cwl
## in:
## gfile: FindWord/results
## out:
## - counts
## Run the workflow
cwl$infile<-file.path(getwd(),"mock.txt")
cwl$word<-"one"
r3<-runCWL(cwl, outdir = getwd())
## [1;30mINFO[0m Final process status is success
readLines(r3$output)
## [1] " 2 /private/tmp/docker_tmp1f__8iyb/stg393006f3-1153-474d-aedb-8901bbc2ec97/grepout.txt"
## Draw a flowchart
plotCWL(cwl)
Mutation identification in a tumor cell begins with the alignment of
its DNA sequence to a reference genome, followed by the generation of somatic
variant calls. The most commonly used software for the sequence
alignment are SAMtools and BWA. Their commandline tools
have already been defined inside the RcwlPipelines
package, so you only need to load them using the cwlLoad
function when writing a workflow progam.
library(RcwlPipelines)
## Check and see if all the necessary commandLineTools exist
tls <- cwlSearch("bwa|sam2bam|sortBam|samtools_index|markdup",
type = "tool")
tls
## cwlHub with 6 records
## cache path: ~/Library/Caches/Rcwl
## # last modified date: 2021-11-22
## # cwlSearch() to query scripts
## # cwlLoad('title') to load the script
## # additional mcols(): rid, rpath, Type, Container, mtime, ...
##
## title Command
## BFC65 | tl_bwa_index bwa index
## BFC66 | tl_bwa bwa mem
## BFC122 | tl_markdup picard MarkDuplicates
## BFC153 | tl_sam2bam samtools view
## BFC157 | tl_samtools_index samtools index
## BFC167 | tl_sortBam samtools sort
## Load the commandLineTools
bwa <- cwlLoad("tl_bwa")
bwa_index <- cwlLoad("tl_bwa_index")
markdup <- cwlLoad("tl_markdup")
sam2bam <- cwlLoad("tl_sam2bam")
samtools_index <- cwlLoad("tl_samtools_index")
sortBam <- cwlLoad("tl_sortBam")
## Define input parameters
p1 <- InputParam(id = "threads", type = "int")
p2 <- InputParam(id = "RG", type = "string")
p3 <- InputParam(id = "Ref", type = "string")
p4 <- InputParam(id = "FQ1", type = "File")
p5 <- InputParam(id = "FQ2", type = "File?")
## Define steps of a workflow -----------------------
## bwa
s1 <- cwlStep(id = "bwa", run = bwa,
In = list(threads = "threads",
RG = "RG",
Ref = "Ref",
FQ1 = "FQ1",
FQ2 = "FQ2"))
## sam to bam
s2 <- cwlStep(id = "sam2bam", run = sam2bam,
In = list(sam = "bwa/sam"))
## sort bam
s3 <- cwlStep(id = "sortBam", run = sortBam,
In = list(bam = "sam2bam/bam"))
## mark duplicates
s4 <- cwlStep(id = "markdup", run = markdup,
In = list(ibam = "sortBam/sbam",
obam = list(
valueFrom="$(inputs.ibam.nameroot).mdup.bam"),
matrix = list(
valueFrom="$(inputs.ibam.nameroot).markdup.txt")))
## index bam
s5 <- cwlStep(id = "idxBam", run = samtools_index,
In = list(bam = "markdup/mBam"))
## --------------------------------------------------
## Requirements
req1 <- requireStepInputExpression()
req2 <- requireJS()
## Define outputs
o1 <- OutputParam(id = "Bam", type = "File", outputSource = "markdup/mBam")
o2 <- OutputParam(id = "Idx", type = "File", outputSource = "idxBam/idx")
## Define cwlWorkflow
Align <- cwlWorkflow(requirements = list(req1, req2),
inputs = InputParamList(p1, p2, p3, p4, p5),
outputs = OutputParamList(o1, o2))
## build pipeline
Align <- Align + s1 + s2 + s3 + s4 + s5
Align
## class: cwlWorkflow
## cwlClass: Workflow
## cwlVersion: v1.0
## requirements:
## - class: StepInputExpressionRequirement
## - class: InlineJavascriptRequirement
## inputs:
## threads (int):
## RG (string):
## Ref (string):
## FQ1 (File):
## FQ2 (File?):
## outputs:
## Bam:
## type: File
## outputSource: markdup/mBam
## Idx:
## type: File
## outputSource: idxBam/idx
## steps:
## bwa:
## run: bwa.cwl
## in:
## threads: threads
## RG: RG
## Ref: Ref
## FQ1: FQ1
## FQ2: FQ2
## out:
## - sam
## sam2bam:
## run: sam2bam.cwl
## in:
## sam: bwa/sam
## out:
## - bam
## sortBam:
## run: sortBam.cwl
## in:
## bam: sam2bam/bam
## out:
## - sbam
## markdup:
## run: markdup.cwl
## in:
## ibam: sortBam/sbam
## obam:
## valueFrom: $(inputs.ibam.nameroot).mdup.bam
## matrix:
## valueFrom: $(inputs.ibam.nameroot).markdup.txt
## out:
## - mBam
## - Mat
## idxBam:
## run: idxBam.cwl
## in:
## bam: markdup/mBam
## out:
## - idx
## Draw a flow chart
plotCWL(Align)