1 NGS Data and Clinical Oncology

Next-generation sequencing (NGS) is a massively parallel sequencing technology that offers ultra-high throughput, scalability, and speed. NGS allows scientists to sequence millions of base pairs simultaneously, so analyses can be performed at a much lower cost and a much faster rate than with older technologies such as the Sanger method. NGS has been adopted in clinical oncology to advance personalized cancer treatment.

In addition to enabling personalized medicine, NGS holds promise for a more comprehensive understanding of oncogenic pathways, which may ultimately lead to prevention strategies and cures. Cancer is caused by mutations in genes that control the way our cells function, especially how they grow and divide. Mutations can result from errors in DNA replication during cell division, exposure to mutagens, or viral infection. Somatic mutations (i.e., mutations acquired after birth) are found in the tumor itself and are the most common cause of cancer. A germline mutation, in contrast, is hereditary and present in every cell from birth; such inherited mutations account for 5% to 10% of all cancers. By targeting the activity of a specific oncogene to which some cancer cells are addicted, researchers believe they may be able to effectively inhibit cancer cell proliferation, despite the presence of other mutations and chromosomal aberrations.

The most commonly used software for variant calling is the Genome Analysis Toolkit (GATK). Its HaplotypeCaller and Mutect2 tools are designed to call germline and somatic variants, respectively.
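
To make the distinction concrete, here is a minimal sketch of how the two callers are invoked on the command line (file and sample names are placeholders; the pipelines used later in this document wrap such calls for you):

# germline calling on a single sample
gatk HaplotypeCaller -R ref.fa -I sample.bam -O germline.vcf.gz
# somatic calling on a tumor-normal pair; -normal gives the normal sample name
gatk Mutect2 -R ref.fa -I tumor.bam -I normal.bam -normal normal_sample -O somatic.vcf.gz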

Exercise

Read the article: Current practices and guidelines for clinical next-generation sequencing oncology testing by S. P. Strom [Cancer Biology & Medicine 13(1), 3–11 (2016)].

2 Distributed Resource Management Systems

The software that manages parallel processing of memory-intensive jobs submitted by multiple users in a high-performance computing (HPC) environment is called a Distributed Resource Management System (DRMS). The most commonly used DRMSs are SGE, PBS/TORQUE, and SLURM. In particular, CCR uses SLURM; instructions for submitting a batch script to the CCR cluster can be found in the CCR documentation.
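
As a minimal sketch (resource values are placeholders; the CCR-specific options appear in the full template in Section 5.1), a SLURM batch script looks like this:

#!/bin/bash
#SBATCH --job-name=hello
#SBATCH --time=00:10:00
#SBATCH --ntasks=1
#SBATCH --mem=1G
echo "Hello from job $SLURM_JOB_ID"

Submit the script with sbatch hello.sh and monitor it with squeue -u $USER.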

Figure 1: High-performance computing (HPC) with a cluster of computers.

3 Singularity Containers

Singularity is a container framework designed to run scientific applications on HPC resources. Similar to Docker, Singularity allows users to install and use software within self-contained, portable, and reproducible environments.

A container is a virtual environment that “contains” a set of applications together with all of their dependencies, libraries, other binaries, and the configuration files needed to run them, bundled into one package. By containerizing the application platform and its dependencies, differences in OS distributions and underlying infrastructure are abstracted away, so the application can run on any platform.

An image is the bundle from which a container is built; it may sometimes be referred to as a disk image or container image. Singularity writes images in the Singularity Image Format (SIF), so they are stored in SIF files (with a .sif filename extension). Singularity images can be pulled from Singularity Hub, a registry for container images.
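
For example, an image can be pulled from a registry and used as follows (the registry URI and the command run inside the container are placeholders):

# pull an image into the named .sif file
singularity pull mytool.sif shub://someuser/somecontainer
# run a command inside the container
singularity exec mytool.sif cat /etc/os-release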

The official Singularity user guide is available online.

4 Data Processing using RcwlPipelines

To begin, you always need to load the RcwlPipelines library.

library(RcwlPipelines)

Then specify temporary and cache directories (to avoid exceeding your disk quota with container files); e.g.,

Sys.setenv(SINGULARITY_TMPDIR="/projects/rpci/[groupname]/[username]/tmp")
Sys.setenv(SINGULARITY_CACHEDIR="/projects/rpci/[groupname]/[username]/singularity_cache")

Optional: you may need to run cwlUpdate() to fetch or update the pipeline scripts.

cwlUpdate()
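
Once the recipes are cached, you can look them up by keyword; a minimal sketch (assuming the cwlSearch() helper from RcwlPipelines):

tls <- cwlSearch(c("bwa", "mem"))  # search cached tool/pipeline recipes
tls                                # a table of matching entries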

4.1 Example: Sequence alignment

First, download test data into your project directory.

mkdir /projects/rpci/songliu/mkorobkin/data
cd /projects/rpci/songliu/mkorobkin/data
cp -r /projects/rpci/shared/references/others/somatic-snv-test-data/ .

Load all the required modules for sequence alignment.

module load R
module load samtools
module load bwa

Open R, and load the bwaDup pipeline from the RcwlPipelines library.

cwlInstall("pl_bwaDup")
inputs(bwaDup)
## plotCWL(bwaDup)

Prepare inputs.

# locate the paired FASTQ files in the test data
fqs <- list.files("data/somatic-snv-test-data", ".fq.gz", full.names = TRUE)
fq1 <- grep("R1", fqs, value = TRUE)   # forward reads
fq2 <- grep("R2", fqs, value = TRUE)   # reverse reads
ids <- sub("_.*", "", basename(fq1))   # sample IDs from the file names

# build the read-group (@RG) strings for BWA
RGs <- paste("@RG",
             paste0("ID:", ids),
             paste0("DT:", Sys.Date()),
             paste0("PL:", "Illumina"),
             "CN:RPCCC",
             paste0("SM:", ids), sep = "\\t")

fq1L <- tapply(fq1, ids, as.list, simplify = FALSE)  # group FASTQs by sample
fq2L <- tapply(fq2, ids, as.list, simplify = FALSE)
RGL <- tapply(RGs, ids, as.list, simplify = FALSE)
FID <- paste(names(RGL), "WES-ILU.bam", sep="_")     # output BAM file names

Run alignment pipeline.

ref <- "/projects/rpci/shared/reference/current/hs37d5.fa"  # Reference gene
for(i in 1:length(FID)){
    bwaDup$RG <- RGL[[i]]
    bwaDup$FQ1s <- fq1L[[i]]
    bwaDup$FQ2s <- fq2L[[i]]
    bwaDup$outBam <- FID[i]
    bwaDup$Ref <- ref
    bwaDup$threads <- 4
    runCWL(bwaDup, 
           outdir = "output/BAM", 
           showLog = TRUE, 
           docker = "singularity")
}

Check results. Alignments are listed in the “output/BAM” folder.

list.files("output/BAM")
  • “.bam” files are the alignments.
  • “.bam.bai” contains index information, i.e., genome locations.
  • “.flagstat.txt” are the alignment summary.
  • “.sif” files provide images for a Singularity container.
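
To spot-check an alignment from the shell, samtools can summarize and display BAM records; for example, using one of the files produced above:

samtools flagstat output/BAM/tumor_WES-ILU.bam          # mapping/duplication summary
samtools view output/BAM/tumor_WES-ILU.bam | head -n 3  # first few alignment records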

The BAM format specification can be found at: https://samtools.github.io/hts-specs/SAMv1.pdf

4.2 Example: Variant calling

Load the Mutect2PL pipeline from the RcwlPipelines library.

cwlInstall("pl_Mutect2PL")

Prepare inputs.

ref <- "/projects/rpci/shared/reference/current/hs37d5.fa"  # Reference gene
reg <- data.frame(chr=21, start=10400000, end=10500000)
write.table(reg, 
            "output/region.bed", 
            row.names=FALSE, 
            col.names=FALSE, 
            quote=FALSE, 
            sep="\t")
Mutect2PL$tbam <- "output/BAM/tumor_WES-ILU.bam"    # tumor BAM
Mutect2PL$nbam <- "output/BAM/normal_WES-ILU.bam"   # normal BAM
Mutect2PL$Ref <- ref
Mutect2PL$normal <- "normal"                        # normal sample name
Mutect2PL$tumor <- "tumor"                          # tumor sample name
Mutect2PL$interval <- "output/region.bed"           # restrict calling to the test region
Mutect2PL$gresource <- "/projects/rpci/shared/references/others/af-only-gnomad.raw.sites.b37.vcf"  # germline resource
Mutect2PL$comvcf <- "/projects/rpci/shared/references/others/small_exac_common_3_b37.vcf"  # common variants for contamination estimation
Mutect2PL$pon <- "/projects/rpci/shared/references/others/Mutect2-exome-panel.vcf"  # panel of normals
Mutect2PL$threads <- 4L

Run the pipeline.

runCWL(Mutect2PL, 
       outdir = "output/variants", 
       showLog = TRUE, 
       docker = "singularity")

Check the results.

list.files("output/variants")

The final somatic variants can be found in the file “normal.tumor.filtered.PASS.vcf”. The variants can be viewed against the tumor BAM file using samtools tview or the IGV viewer. NOTE: In order to use the IGV viewer on your local computer to browse BAM files on a remote server, you need to mount the remote directory using sshfs.

sshfs [user@]hostname:[directory] mountpoint
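
For example (the host name and directory here are hypothetical):

mkdir -p ~/ccr
sshfs user@cluster.example.edu:/projects/rpci/groupname ~/ccr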

To unmount the remote directory, type:

umount mountpoint

The VCF format specification can be found at: https://samtools.github.io/hts-specs/VCFv4.2.pdf
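
As a quick server-side alternative to IGV, samtools tview can display the alignments around a called position; a sketch using the files from this example (substitute the position of a PASS variant from your VCF):

samtools tview -p 21:10400000 output/BAM/tumor_WES-ILU.bam \
    /projects/rpci/shared/reference/current/hs37d5.fa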

Exercise

Watch a few of the video tutorials on the IGV viewer. Then inspect the variants found in your VCF files using the IGV viewer and sshfs.

5 BiocParallel Library

BiocParallel is an R library for performing parallel computations, including those managed by a DRMS. It offers:

  1. unified interface: BatchtoolsParam instances define the method of parallel evaluation (multi-core, snow cluster, etc.) and computing resources (number of workers, error handling, cleanup, etc.).
  2. parallel iteration over lists, files, and vectorized operations: bplapply, bpmapply, and bpvec provide parallel list iteration and vectorized operations; bpiterate iterates through files, distributing chunks to parallel workers (see the sketch after this list).
  3. cluster scheduling: When the parallel environment is managed by a cluster scheduler through functions in the batchtools package, job management and result retrieval are considerably simplified.
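
As a minimal sketch of this unified interface, here is a parallel lapply on a local multi-core backend (MulticoreParam) rather than a cluster backend:

library(BiocParallel)
param <- MulticoreParam(workers = 4)          # evaluate on 4 local cores
res <- bplapply(1:8, sqrt, BPPARAM = param)   # parallel version of lapply()
unlist(res)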

5.1 Example: Submitting a SLURM job

Create a file containing a template for the job submission to Slurm (slurm.tmpl):

#!/bin/bash
#SBATCH --job-name=<%= resources$jobname %>
#SBATCH --output=<%= log.file %>
#SBATCH --error=<%= log.file %>
#SBATCH --time=<%= ceiling(resources$walltime / 60) %>
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=<%= resources$ncpus %>
#SBATCH --mem=<%= resources$memory %>
#SBATCH --clusters=faculty
#SBATCH --partition=rpci
#SBATCH --qos=rpci
#SBATCH --constraint=CPU-Gold-6230
#SBATCH --threads-per-core=1

<%
# relative paths are not handled well by Slurm
log.file = fs::path_expand(log.file)
-%>

## Export the value of the DEBUGME environment variable to the worker
export DEBUGME=<%= Sys.getenv("DEBUGME") %>

<%= sprintf("export OMP_NUM_THREADS=%i", resources$omp.threads) -%>
<%= sprintf("export OPENBLAS_NUM_THREADS=%i", resources$blas.threads) -%>
<%= sprintf("export MKL_NUM_THREADS=%i", resources$blas.threads) -%>

echo "$SLURM_JOB_ID"
date
Rscript -e 'batchtools::doJobCollection("<%= uri %>")'
date

Here the <% ... %> placeholders are replaced by the values of elements of the resources argument of the BatchtoolsParam constructor, using the brew library.
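
To see the substitution mechanism in isolation, here is a tiny sketch with the brew package (the resource values are made up):

library(brew)
resources <- list(ncpus = 4, memory = 8000)
# placeholders in the text are filled from the local environment
brew(text = "#SBATCH --cpus-per-task=<%= resources$ncpus %>")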

Next, write an Rscript (align.R):

library(RcwlPipelines)
library(BiocParallel)  # provides BatchtoolsParam (built on the batchtools package)
Sys.setenv(SINGULARITY_TMPDIR="/projects/rpci/songliu/mkorobkin/tmp")
Sys.setenv(SINGULARITY_CACHEDIR="/projects/rpci/songliu/mkorobkin/singularity_cache")

# Load pipeline
cwlInstall("pl_bwaDup")
inputs(bwaDup)

# Prepare inputs
fqs <- list.files("data/RS-03586070", ".fastq.gz", full.names = TRUE)
fq1 <- fqs[grep("R1", fqs)]
fq2 <- fqs[grep("R2", fqs)]
fqs.ids <- do.call(rbind, lapply(strsplit(basename(fq1), split = "[_.]"), function(x)x[1:2]))

RGs <- paste("@RG",
             paste0("ID:", fqs.ids[,1], ".", fqs.ids[2]),
             paste0("LB:", fqs.ids[,1], ".", fqs.ids[,2]),
             paste0("DT:", Sys.Date()),
             paste0("PL:", "Illumina"),
             "CN:RPCCC",
             paste0("SM:", fqs.ids[,1]), sep = "\\t")

fq1L <- tapply(fq1, fqs.ids[1], as.list, simplify = FALSE)
fq2L <- tapply(fq2, fqs.ids[1], as.list, simplify = FALSE)
RGL <- tapply(RGs, fqs.ids[1], as.list, simplify = FALSE)
FID <- paste(fqs.ids[1], "WES-ILU.bam", sep="_")
idBam <- tapply(FID, fqs.ids[,1], function(x)as.list(unique(x)))

idx <- TRUE  # select all samples; use an index vector to rerun a subset
inputList <- list(RG = RGL[idx],
                  outBam = idBam[idx],
                  FQ1s = fq1L[idx],
                  FQ2s = fq2L[idx])

paramList <- list(threads = 16,
                  Ref ="/projects/rpci/shared/reference/human_g1k_v37/human_g1k_v37.fa")

# Submit a job using runCWLBatch() and BatchtoolsParam() functions
r1 <- runCWLBatch(bwaDup, outdir = "output/BAM", inputList, paramList,
                  BPPARAM = BatchtoolsParam(workers = lengths(inputList)[1],
                                            cluster = "slurm",
                                            template = "~/slurm.tmpl",
                                            resources = list(ncpus = 16,
                                                             jobname = "align",
                                                             walltime = 60*60*72,
                                                             memory = 128000),
                                            log = TRUE, logdir = ".", progressbar = TRUE),
                  stderr = "", docker = "singularity")

Then, submit the job by typing:

R CMD BATCH --no-save --no-restore align.R align.out &

NOTE: R CMD BATCH writes all console output, including messages, warnings, and print() output, to align.out. In contrast, the command Rscript align.R > align.out would write only standard output (e.g., print() results) to the file, while messages and warnings would still go to the terminal via standard error. To achieve the same effect as R CMD BATCH, you must instead redirect both streams:

R --no-restore --file=align.R > align.out 2>&1 &

Alternatively, you may define a new function, Rscript2, in your .bashrc as follows:

function Rscript2() { R --no-restore --file="$1"; }
export -f Rscript2

Then, it suffices to type:

Rscript2 align.R > align.out 2>&1 &

which writes all console output into the log file.

5.2 Example: Variant calling

First, install the VariantCombiner package from GitHub:

BiocManager::install("hubentu/VariantCombiner")

Then, you can submit the following R script (snv.R) to Slurm:

library(RcwlPipelines)
library(BiocParallel)
Sys.setenv(SINGULARITY_TMPDIR="/projects/rpci/songliu/mkorobkin/tmp")
Sys.setenv(SINGULARITY_CACHEDIR="/projects/rpci/songliu/mkorobkin/singularity_cache")
Sys.setenv(DEBUGME = "BiocParallel")  # enable debug messages from BiocParallel

# Load pipeline
cwlInstall("pl_SomaticCaller4")
inputs(SomaticCaller4)

# Prepare inputs
ref <- "/projects/rpci/shared/reference/human_g1k_v37/human_g1k_v37.fa"
tbam <- list("output/BAM/RS-03586071/RS-03586071_WES-ILU.bam")  ## tumor
nbam <- list("output/BAM/RS-03586069/RS-03586069_WES-ILU.bam")  ## normal
names(tbam) <- "RS-03586071_Bca120518-organoid"
tids <- list("RS-03586071")
nids <- list("RS-03586069")

# Path to the capture-region .bed file
regions <- "data/S31285117_Covered_nochr_clean.bed"

idx <- TRUE  # select all samples
inputList <- list(tbam = tbam[idx],
                  nbam = nbam[idx],
                  tumor = tids[idx],
                  normal = nids[idx])
paramList <- list(Ref = ref,
                  interval = regions,
                  dbsnp = "/projects/rpci/songliu/qhu/annotation/GATK_bundle/b37/dbsnp_138.b37.vcf.gz",
                  gresource = "/projects/rpci/songliu/qhu/annotation/Mutect2/af-only-gnomad.raw.sites.b37.vcf",
                  comvcf = "/projects/rpci/songliu/qhu/annotation/Mutect2/GetPileupSummaries/small_exac_common_3_b37.vcf",
                  pon = "/projects/rpci/songliu/qhu/annotation/Mutect2/Mutect2-exome-panel.vcf",
                  threads = 16)

# Submit a job using runCWLBatch() and BatchtoolsParam() functions
runCWLBatch(SomaticCaller4, outdir = "output/variants", inputList, paramList,
            BPPARAM = BatchtoolsParam(workers = lengths(inputList)[1],
                                      cluster = "slurm",
                                      template = "~/slurm.tmpl",
                                      resources = list(ncpus = 16,
                                                       jobname = "snv",
                                                       walltime = 60*60*24,
                                                       memory = 32000),
                                      log = TRUE, logdir = ".",
                                      progressbar = TRUE,
                                      saveregistry = FALSE),
            stderr = "", docker="singularity")

list.files("output/variants")