Next-generation sequencing (NGS) is a massively parallel sequencing technology that offers ultra-high throughput, scalability, and speed. NGS allows scientists to sequence millions of base pairs simultaneously, enabling analyses to be performed at a much lower cost and a much faster rate than older technologies such as Sanger sequencing. NGS has been adopted in clinical oncology to advance personalized treatment of cancer.
In addition to enabling the development of personalized medicine, NGS holds promise for a more comprehensive understanding of oncogenic pathways, with a view toward prevention and, ultimately, a cure. Cancer is caused by mutations in genes that control the way our cells function, especially how they grow and divide. Mutations can result from errors in DNA replication during cell division, exposure to mutagens, or a viral infection. Somatic mutations (i.e., mutations acquired after birth) are found in the tumor cells themselves and are the most common cause of cancer. A germline mutation, in contrast, is hereditary and present from birth in every cell. These inherited mutations account for 5% to 10% of all cancers. By targeting the activity of a specific oncogene to which some cancer cells are addicted, researchers believe they may be able to effectively inhibit cancer cell proliferation, despite the presence of other mutations and chromosomal aberrations.
The most commonly used software for variant calling is the Genome Analysis Toolkit (GATK). Its HaplotypeCaller and Mutect2 tools are designed to call germline and somatic variants, respectively.
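Both callers are available as prebuilt Rcwl recipes in the RcwlPipelines package used throughout this tutorial. As a quick orientation (a sketch; the exact recipe names returned depend on your local recipe cache), you can search for them by keyword:
library(RcwlPipelines)
# cwlUpdate()                # sync the recipe repository first on a fresh install
cwlSearch("Mutect2")         # e.g. returns pl_Mutect2PL among the matches
cwlSearch("HaplotypeCaller")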
Read the article: Current practices and guidelines for clinical next-generation sequencing oncology testing by S. P. Strom [Cancer Biology & Medicine, 13(1), 3–11 (2016)].
The software that manages parallel processing of memory-intensive jobs submitted by multiple users in a high-performance computing (HPC) environment is called a Distributed Resource Management System (DRMS). The most commonly used DRMSs are SGE, PBS/TORQUE, and Slurm. In particular, CCR uses Slurm; instructions on submitting a batch script to the CCR cluster can be found here.
Figure 1: High-performance computing (HPC) with a cluster of computers.
Singularity is a container platform designed to run scientific applications on HPC resources. Similar to Docker, Singularity allows users to install and use software within self-contained, portable, and reproducible environments.
A container is a virtual environment that “contains” a set of applications plus all of their dependencies, libraries, other binaries, and the configuration files needed to run them, bundled into one package. By containerizing the application and its dependencies, differences in OS distributions and underlying infrastructure are abstracted away, so the application runs the same way on any platform.
An image is a read-only template used to create containers; it may sometimes be referred to as a disk image or container image. Singularity writes images in the Singularity Image Format (SIF), and they are stored in SIF files (with a .sif filename extension). Singularity images can be pulled from Singularity Hub, a registry for container images.
The official Singularity user guide can be found here.
To begin, you always need to load the RcwlPipelines library:
library(RcwlPipelines)
Then specify temporary and cache directories (to avoid exceeding your disk quota); e.g.,
Sys.setenv(SINGULARITY_TMPDIR="/projects/rpci/[groupname]/[username]/tmp")
Sys.setenv(SINGULARITY_CACHEDIR="/projects/rpci/[groupname]/[username]/singularity_cache")
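If these directories do not exist yet, they can be created from R as well (a small convenience step; the bracketed path components above are placeholders for your own group and user names):
dir.create(Sys.getenv("SINGULARITY_TMPDIR"), recursive = TRUE, showWarnings = FALSE)
dir.create(Sys.getenv("SINGULARITY_CACHEDIR"), recursive = TRUE, showWarnings = FALSE)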
Optional: you may need to run cwlUpdate() to access the latest pipeline scripts:
cwlUpdate()
First, download test data into your project directory.
mkdir /projects/rpci/songliu/mkorobkin/data
cd /projects/rpci/songliu/mkorobkin/data
cp -r /projects/rpci/shared/references/others/somatic-snv-test-data/ .
Load all the required modules for sequence alignment.
module load R
module load samtools
module load bwa
Open R, and load the bwaDup pipeline from the
RcwlPipelines library.
cwlInstall("pl_bwaDup")
inputs(bwaDup)
## plotCWL(bwaDup)
Prepare inputs.
fqs <- list.files("data/somatic-snv-test-data", ".fq.gz", full.names = TRUE)
fq1 <- grep("R1", fqs, value = TRUE)
fq2 <- grep("R2", fqs, value = TRUE)
ids <- sub("_.*", "", basename(fq1))
RGs <- paste("@RG",
paste0("ID:", ids),
paste0("DT:", Sys.Date()),
paste0("PL:", "Illumina"),
"CN:RPCCC",
paste0("SM:", ids), sep = "\\t")
fq1L <- tapply(fq1, ids, as.list, simplify = FALSE)
fq2L <- tapply(fq2, ids, as.list, simplify = FALSE)
RGL <- tapply(RGs, ids, as.list, simplify = FALSE)
FID <- paste(names(RGL), "WES-ILU.bam", sep="_")
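Before launching the aligner, it is worth a quick sanity check that the inputs grouped as intended; for example:
str(fq1L)   # one list of R1 FASTQ files per sample ID
str(RGL)    # one read-group string per sample ID
FID         # one output BAM name per sample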
Run the alignment pipeline.
ref <- "/projects/rpci/shared/reference/current/hs37d5.fa" # Reference genome
for (i in seq_along(FID)) {
bwaDup$RG <- RGL[[i]]
bwaDup$FQ1s <- fq1L[[i]]
bwaDup$FQ2s <- fq2L[[i]]
bwaDup$outBam <- FID[i]
bwaDup$Ref <- ref
bwaDup$threads <- 4
runCWL(bwaDup,
outdir = "output/BAM",
showLog = TRUE,
docker = "singularity")
}
Check results. Alignments are listed in the “output/BAM” folder.
list.files("output/BAM")
The BAM format specification can be found at: https://samtools.github.io/hts-specs/SAMv1.pdf
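The aligned BAMs can also be inspected without leaving R. Below is a minimal sketch using the Bioconductor Rsamtools package (assumed to be installed):
library(Rsamtools)
bam <- list.files("output/BAM", "\\.bam$", full.names = TRUE)[1]
scanBamHeader(bam)   # check the @RG and @SQ header lines
# indexBam(bam)      # create a .bai index first if one is not already present
idxstatsBam(bam)     # mapped/unmapped read counts per contig (requires the index)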
Load the Mutect2PL pipeline from the RcwlPipelines
library.
cwlInstall("pl_Mutect2PL")
Prepare inputs.
ref <- "/projects/rpci/shared/reference/current/hs37d5.fa" # Reference genome
reg <- data.frame(chr=21, start=10400000, end=10500000)
write.table(reg,
"output/region.bed",
row.names=FALSE,
col.names=FALSE,
quote=FALSE,
sep="\t")
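A quick check that the interval file was written as expected:
readLines("output/region.bed")   # a single tab-separated line: 21  10400000  10500000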
Mutect2PL$tbam <- "output/BAM/tumor_WES-ILU.bam"
Mutect2PL$nbam <- "output/BAM/normal_WES-ILU.bam"
Mutect2PL$Ref <- ref
Mutect2PL$normal <- "normal"
Mutect2PL$tumor <- "tumor"
Mutect2PL$interval <- "output/region.bed"
Mutect2PL$gresource <- "/projects/rpci/shared/references/others/af-only-gnomad.raw.sites.b37.vcf"
Mutect2PL$comvcf <- "/projects/rpci/shared/references/others/small_exac_common_3_b37.vcf"
Mutect2PL$pon <- "/projects/rpci/shared/references/others/Mutect2-exome-panel.vcf"
Mutect2PL$threads <- 4L
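As with bwaDup, you can list the pipeline's input slots to confirm that everything required has been set:
inputs(Mutect2PL)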
Run the pipeline.
runCWL(Mutect2PL,
outdir = "output/variants",
showLog = TRUE,
docker = "singularity")
Check the results.
list.files("output/variants")
The final somatic variants can be found in the file
“normal.tumor.filtered.PASS.vcf”. The variants can be viewed in the
tumor BAM file using samtools tview or the IGV viewer. NOTE: In order
to use the IGV viewer on your local computer to browse BAM files on a
remote server, you need to mount the remote directory using sshfs:
sshfs [user@]hostname:[directory] mountpoint
To unmount the remote directory, type:
umount mountpoint
The VCF format specification can be found here: https://samtools.github.io/hts-specs/VCFv4.2.pdf
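Beyond visual inspection, the filtered calls can be loaded into R for downstream analysis. A minimal sketch using the Bioconductor VariantAnnotation package (assumed to be installed; the genome tag simply labels the object):
library(VariantAnnotation)
vcf <- readVcf("output/variants/normal.tumor.filtered.PASS.vcf", genome = "hs37d5")
rowRanges(vcf)   # positions, REF/ALT alleles, and FILTER status of each somatic call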
Watch a few of the IGV video tutorials. Then use the IGV viewer, together with sshfs, to examine the variants recorded in your VCF files.
BiocParallel is an R library for performing parallel computations managed by a DRMS. It offers:
- BatchtoolsParam instances, which define the method of parallel evaluation (multi-core, snow cluster, etc.) and computing resources (number of workers, error handling, cleanup, etc.).
- bplapply, bpmapply, and bpvec, which provide parallel list iteration and vectorized operations.
- bpiterate, which iterates through files, distributing chunks to parallel workers.
- Integration with the batchtools package, through which job management and result retrieval are considerably simplified.
Create a file containing a template for job submission to Slurm (slurm.tmpl):
#!/bin/bash
#SBATCH --job-name=<%= resources$jobname %>
#SBATCH --output=<%= log.file %>
#SBATCH --error=<%= log.file %>
#SBATCH --time=<%= ceiling(resources$walltime / 60) %>
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=<%= resources$ncpus %>
#SBATCH --mem=<%= resources$memory %>
#SBATCH --clusters=faculty
#SBATCH --partition=rpci
#SBATCH --qos=rpci
#SBATCH --constraint=CPU-Gold-6230
#SBATCH --threads-per-core=1
<%
# relative paths are not handled well by Slurm
log.file = fs::path_expand(log.file)
-%>
## Export value of DEBUGME environment var to the worker
export DEBUGME=<%= Sys.getenv("DEBUGME") %>
<%= sprintf("export OMP_NUM_THREADS=%i", resources$omp.threads) -%>
<%= sprintf("export OPENBLAS_NUM_THREADS=%i", resources$blas.threads) -%>
<%= sprintf("export MKL_NUM_THREADS=%i", resources$blas.threads) -%>
echo "$SLURM_JOB_ID"
date
Rscript -e 'batchtools::doJobCollection("<%= uri %>")'
date
Here the <% ... %> placeholders are replaced by the values of the
elements of the resources argument of the BatchtoolsParam
constructor, using the brew package.
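Before committing a multi-hour alignment to this machinery, it can help to smoke-test the template with a trivial computation. The sketch below (resource values are illustrative) submits four tiny Slurm jobs and collects their results:
library(BiocParallel)
param <- BatchtoolsParam(workers = 4,
                         cluster = "slurm",
                         template = "~/slurm.tmpl",
                         resources = list(ncpus = 1,
                                          jobname = "smoketest",
                                          walltime = 600,   # seconds; the template converts to minutes
                                          memory = 1000))
bplapply(1:4, sqrt, BPPARAM = param)   # each sqrt() call runs in its own Slurm job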
Next, write an R script (align.R):
library(RcwlPipelines)
library(BiocParallel) # provides BatchtoolsParam, built on the batchtools package
Sys.setenv(SINGULARITY_TMPDIR="/projects/rpci/songliu/mkorobkin/tmp")
Sys.setenv(SINGULARITY_CACHEDIR="/projects/rpci/songliu/mkorobkin/singularity_cache")
# Load pipeline
cwlInstall("pl_bwaDup")
inputs(bwaDup)
# Prepare inputs
fqs <- list.files("data/RS-03586070", ".fastq.gz", full.names = TRUE)
fq1 <- fqs[grep("R1", fqs)]
fq2 <- fqs[grep("R2", fqs)]
fqs.ids <- do.call(rbind, lapply(strsplit(basename(fq1), split = "[_.]"), function(x)x[1:2]))
RGs <- paste("@RG",
paste0("ID:", fqs.ids[,1], ".", fqs.ids[2]),
paste0("LB:", fqs.ids[,1], ".", fqs.ids[,2]),
paste0("DT:", Sys.Date()),
paste0("PL:", "Illumina"),
"CN:RPCCC",
paste0("SM:", fqs.ids[,1]), sep = "\\t")
fq1L <- tapply(fq1, fqs.ids[,1], as.list, simplify = FALSE)
fq2L <- tapply(fq2, fqs.ids[,1], as.list, simplify = FALSE)
RGL <- tapply(RGs, fqs.ids[,1], as.list, simplify = FALSE)
FID <- paste(fqs.ids[,1], "WES-ILU.bam", sep="_")
idBam <- tapply(FID, fqs.ids[,1], function(x)as.list(unique(x)))
idx <- TRUE   # keep all samples; replace with a logical or character index to subset
inputList <- list(RG = RGL[idx],
outBam = idBam[idx],
FQ1s = fq1L[idx],
FQ2s = fq2L[idx])
paramList <- list(threads = 16,
Ref ="/projects/rpci/shared/reference/human_g1k_v37/human_g1k_v37.fa")
# Submit a job using runCWLBatch() and BatchtoolsParam() functions
r1 <- runCWLBatch(bwaDup, outdir = "output/BAM", inputList, paramList,
BPPARAM = BatchtoolsParam(workers = lengths(inputList)[1],
cluster = "slurm",
template = "~/slurm.tmpl",
resources = list(ncpus = 16,
jobname = "align",
walltime = 60*60*72,
memory = 128000),
log = TRUE, logdir = ".", progressbar = TRUE),
stderr = "", docker="singularity")
Then, submit the job by typing:
R CMD BATCH --no-save --no-restore align.R align.out &
NOTE: R CMD BATCH writes all console output, including messages,
warnings, and print() output, to align.out. In contrast, the command
Rscript align.R > align.out would write only standard output (e.g.,
print() results) to the text file; messages and warnings go to stderr,
i.e., the terminal, and the commands themselves are not echoed. To
achieve the same effect as the command above, redirect both output
streams and use R's --file interface, which echoes commands as they run:
R --no-restore --file=align.R > align.out 2>&1 &
Alternatively, you may define a helper function Rscript2 in your
.bashrc as follows:
function Rscript2() { R --no-restore --file="$1"; }
export -f Rscript2
Then it suffices to type:
Rscript2 align.R > align.out 2>&1 &
which will write all of the console output into the log file.
First, install the VariantCombiner package:
BiocManager::install("hubentu/VariantCombiner")
Then, you can submit the following R script (snv.R) to Slurm:
library(RcwlPipelines)
library(BiocParallel)
Sys.setenv(SINGULARITY_TMPDIR="/projects/rpci/songliu/mkorobkin/tmp")
Sys.setenv(SINGULARITY_CACHEDIR="/projects/rpci/songliu/mkorobkin/singularity_cache")
Sys.setenv(DEBUGME = "BiocParallel")
# Load pipeline
cwlInstall("pl_SomaticCaller4")
inputs(SomaticCaller4)
# Prepare inputs
ref <- "/projects/rpci/shared/reference/human_g1k_v37/human_g1k_v37.fa"
tbam <- list("output/BAM/RS-03586071/RS-03586071_WES-ILU.bam") ## tumor
nbam <- list("output/BAM/RS-03586069/RS-03586069_WES-ILU.bam") ## normal
names(tbam) <- "RS-03586071_Bca120518-organoid"
tids <- list("RS-03586071")
nids <- list("RS-03586069")
# Target regions (.bed) file
regions <- "data/S31285117_Covered_nochr_clean.bed"
idx <- TRUE   # keep all samples
inputList <- list(tbam = tbam[idx],
nbam = nbam[idx],
tumor = tids[idx],
normal = nids[idx])
paramList <- list(Ref = ref,
interval = regions,
dbsnp = "/projects/rpci/songliu/qhu/annotation/GATK_bundle/b37/dbsnp_138.b37.vcf.gz",
gresource = "/projects/rpci/songliu/qhu/annotation/Mutect2/af-only-gnomad.raw.sites.b37.vcf",
comvcf = "/projects/rpci/songliu/qhu/annotation/Mutect2/GetPileupSummaries/small_exac_common_3_b37.vcf",
pon = "/projects/rpci/songliu/qhu/annotation/Mutect2/Mutect2-exome-panel.vcf",
threads = 16)
# Submit a job using runCWLBatch() and BatchtoolsParam() functions
runCWLBatch(SomaticCaller4, outdir = "output/variants", inputList, paramList,
BPPARAM = BatchtoolsParam(workers = lengths(inputList)[1],
cluster = "slurm",
template = "~/slurm.tmpl",
resources = list(ncpus = 16,
jobname = "snv",
walltime = 60*60*24,
memory = 32000),
log = TRUE, logdir = ".",
progressbar = TRUE,
saveregistry = FALSE),
stderr = "", docker="singularity")
list.files("output/variants")