The SNV-calling pipeline is processed using R and an R library called Rcwlpipelines. The official userguide for RcwlPipline is found in here. For a general guideline for installing an R package in the CCR cluster, see their knowlegde base article.
Create an R-library directory and a temporary directory under your /projects directory; for example,
mkdir /projects/rpci/songliu/[username]/rlibs
mkdir /projects/rpci/songliu/[username]/tmp
Here, replace [username] with your own (e.g.,
mkorobkin).
Then, open the R environment by typing:
module use /projects/rpci/shared/modulefiles
R
To begin, you first need to specify the temporary directory to use during the installation, as follows.
Sys.setenv(TMPDIR="/projects/rpci/songliu/[username]/tmp")
Next, install the Bioconductor package in the personal R-library as follows.
install.packages("BiocManager", lib="/projects/rpci/songliu/[username]/rlibs")
Then, install the RcwlPipeline package using
BiocManager as follows.
BiocManager::install(c("Rcwl","RcwlPipelines"),lib="/projects/rpci/songliu/[username]/rlibs")
Once you successfully install the RcwlPipelines, it can be loaded by typing:
library(RcwlPipelines, lib.loc="/projects/rpci/songliu/[username]/rlibs")
Personally, I have been using my own version of R installed in my home directory (and keep using the latest version). To this end, one can do the following:
cd /projects/rpci/songliu/[username]/Software # or anywhere under your /project directory
wget https://cran.r-project.org/src/base/R-4/R-4.5.2.tar.gz # pick the version you like
tar -xvzf R-4.5.2.tar.gz
cd R-4.5.2
./configure --prefix=/projects/rpci/songliu/[username]/R --enable-R-shlib
make && make install
Then, proceed inside this private version of R as usual, i.e.,
dir.create("/projects/rpci/songliu/[username]/tmp/")
Sys.setenv(TMPDIR="/projects/rpci/songliu/[username]/tmp")
install.packages("BiocManager")
BiocManager::install(c("Rcwl","RcwlPipelines"))
library(RcwlPipelines, lib.loc="/projects/rpci/songliu/[username]/rlibs")
Notice that you need not to designate a personalized library
(e.g. /rlibs) unless using R in a public module.
Note: If your installation halts because of the
dependency: RcppTOML, then download its older version and
compile it outside R as follows.
wget http://lib.stat.cmu.edu/R/CRAN/src/contrib/Archive/RcppTOML/RcppTOML_0.1.3.tar.gz
R CMD INSTALL RcppTOML_0.1.3.tar.gz
Then, proceed in R using force = TRUE option:
BiocManager::install(c("Rcwl","RcwlPipelines"), force = TRUE)
Also note: If your installation fails with a message:
ERROR: 'configure' exists but is not executable -- see the 'R Installation and Administration Manual'
then you may need to set the TMPDIR environment variable which R will use as the compilation directory:
mkdir ~/tmp
export TMPDIR=~/tmp
Then, proceed in R using BiocManager::install with
force = TRUE option.
Use globas online:
Once you come to File Manager screen, choose RPCI
(hpcapp01) and UB (vortex) clusters as source
and destination host, as shown bellow.
Press start to copy the directories containing fasta files from RPCI host to your /project home directory in UB host. Note that each fasta file is a few GB in memory, and you must make sure your destination has enough space.
Some of Qiang’s R scripts we use for data processing are written with
an assumption that the parent directory of input and output files (e.g.,
.fastq.gz and .bam files, respectively) is named after their publication
ID consisting of a patient/mouse ID (e.g., BCMm#) followed by sample ID
(e.g.. B#). However, the original data file and their parent directories
in the RPCI host are usually not named using pubIDs but their the “RS”
IDs (RS-#). Therefore, it is recommended to download them first in a
generic $JOBDIR/data directory, and then create a directory
for each patient/mouse ID containg soft links to the files in the
$JOBDIR/data directory. For instance,
JOBDIR=/projects/rpci/songliu/mkorobkin/projects/RQ026467-Huss
$JOBDIR/data/ ## Contains origintal fastq files
RS-04394159/
RS-04394159_RS-03976924-Bladder_RS-04388153_S4_R1_001.fastq.gz
RS-04394159_RS-03976924-Bladder_RS-04388153_S4_R2_001.fastq.gz
RS-04394163/
RS-04394163_RS-03976925-Bladder_RS-04388157_S8_R1_001.fastq.gz
RS-04394163_RS-03976925-Bladder_RS-04388157_S8_R2_001.fastq.gz
...
$JOBDIR/BCMm1001/
data/ ## Contains soft links to actual data directories
BCMm1001-B1/ -> $JOBDIR/data/RS-04394159
BCMm1001-B2/ -> $JOBDIR/data/RS-04394163
...
output/ ## Will be created automatically once outputs are generated
BAM/
BCMm1001-B1/
BCMm1001-B1_RS-04394159_RS-04388153_RS-03976924.bam
BCMm1001-B1_RS-04394159_RS-04388153_RS-03976924.bam.bai
BCMm1001-B2/
BCMm1001-B2_RS-04394163_RS-04388157_RS-03976925.bam
BCMm1001-B2_RS-04394163_RS-04388157_RS-03976925.bam.bai
...
...
Suppose you created a file under each
$JOBDIR/patient(mouse)ID/ directory containing look-up
table for RS-ID and publication ID, e.g.,
$ head $JOBDIR/BCMm100*/sampleID.txt
==> /projects/rpci/songliu/mkorobkin/projects/RQ026467-Huss/BCMm1001/sampleID.txt <==
RS-04394159 BCMm1001-B1
RS-04394163 BCMm1001-B2
==> /projects/rpci/songliu/mkorobkin/projects/RQ026467-Huss/BCMm1002/sampleID.txt <==
RS-04394160 BCMm1002-B1
RS-04394164 BCMm1002-B2
RS-04394167 BCMm1002-N
==> /projects/rpci/songliu/mkorobkin/projects/RQ026467-Huss/BCMm1003/sampleID.txt <==
RS-04394161 BCMm1003-B2-bbn12
RS-04394165 BCMm1003-B1-bbn12
RS-04394168 BCMm1003-N
==> /projects/rpci/songliu/mkorobkin/projects/RQ026467-Huss/BCMm1004/sampleID.txt <==
RS-04394162 BCMm1004-B2-bbn12
RS-04394166 BCMm1004-B1-bbn12
RS-04394169 BCMm1004-N
Then the following script would create desired directory structure:
#!/bin/bash
# prep.sh
PREFIX=/projects/rpci/songliu
PID=RQ026467-Huss
JOBDIR=$PREFIX/[username]/projects/$PID ## Your project directory
#DATADIR=$JOBDIR/data ## Original data directory
DATADIR=$PREFIX/mkorobkin/projects/$PID/data ## for testing
PLIST="BCMm1001 BCMm1002 BCMm1003 BCMm1004"
for pid in $PLIST; do
echo $pid:
#ifile=$JOBDIR/$pid/sampleID.txt
ifile=$PREFIX/mkorobkin/projects/$PID/$pid/sampleID.txt ## for testing
OUTDIR=$JOBDIR/$pid/data ## New data directory
mkdir -p $OUTDIR
SLIST=`awk '{print $1}' $ifile`
for rsid in $SLIST; do
sid=`awk -v rsid=$rsid '$1~rsid {print $2}' $ifile`
echo $DATADIR/$rsid $OUTDIR/$sid ## source and target directories
ln -s -T $DATADIR/$rsid $OUTDIR/$sid
done
done
Normally, the sequencing team at RPCI does the trimming of fastq files before sending them to us, but if you are processing fastq files with untrimmed adaptor sequences, modify and use trim.sh:
chmod +x trim.sh
./trim.sh [patient(mouse)ID]
A few general comments about running R scripts in Slurm: To begin, you always need to load the RcwlPipeline library.
library(RcwlPipelines)
And specify temporary and cashe directories (to avoid exceeding container quota; e.g.,
Sys.setenv(SINGULARITY_TMPDIR="/projects/rpci/[groupname]/[username]/tmp")
Sys.setenv(SINGULARITY_CACHEDIR="/projects/rpci/[groupname]/[username]/singularity_cache")
Sys.setenv(SINGULARITY_LOCALCACHEDIR="projects/rpci/[groupname]/[username]/tmp/runtime")
Sys.setenv(DEBUGME = "BiocParallel")
Optional: you may need to run
cwlUpdate() for accessing the latest pipeline scripts.
cwlUpdate()
In addition, it is advisable to add such a custum function as follows
in ~/.bashrc in order to monitor the batch processing your
log files:
# For Slurm batch jobs
function Rscript2() { R --no-restore --file="$1"; }
export -f Rscript2
For target-sequensing alignment, modify and use: align.sh which does not mark or remove duplicates in order to find very low-frequency mutations.
chmod +x align.sh
./align.sh [patient(mouse)ID]
On the other hand, for whole-exome-sequensing (WES) alignment, modify and use: alignDup.sh instead.
chmod +x alignDup.sh
./alignDup.sh [patient(mouse)ID]
Note: In both cases, you need to select different reference genomes depending on what kind of samples are sequenced, i.e.,
## Reference genome
#rfile=/projects/rpci/shared/reference/human_g1k_v37/human_g1k_v37.fa ## human
rfile=/projects/rpci/shared/references/GRCm38/Mus_musculus.GRCm38.dna.toplevel.fa ## mouse
Also note: You need to properly assign the filenames for output BAM files, according to the convention specified in this file. For example, we may have:
> # Prepare inputs
> fqs <- list.files("/projects/rpci/songliu/mkorobkin/projects/RQ026467-Huss/BCMm1003/data", ".fastq.gz", full.names = TRUE, recursive = TRUE)
> fq1 <- fqs[grep("R1", fqs)]
> fq2 <- fqs[grep("R2", fqs)]
> fqs.ids <- do.call(rbind, lapply(strsplit(basename(fq1), split = "[_.]"), function(x)x[1:6]))
>
> fqs.ids
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] "RS-04394165" "RS-04183872-Bladder" "RS-04388159" "S10" "R1" "001"
[2,] "RS-04394161" "RS-04183873-Bladder" "RS-04388155" "S6" "R1" "001"
[3,] "RS-04394168" "RS-04183874-Tail" "RS-04388162" "S13" "R1" "001"
>
> fqs.ids[,2] <- gsub("-[A-Z][a-z]*$","",fqs.ids[,2])
>
> fqs.ids
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] "RS-04394165" "RS-04183872" "RS-04388159" "S10" "R1" "001" ## "-Bladder" or "-Tail" removed
[2,] "RS-04394161" "RS-04183873" "RS-04388155" "S6" "R1" "001" ##
[3,] "RS-04394168" "RS-04183874" "RS-04388162" "S13" "R1" "001" ##
>
> sids <- basename(dirname(fq1))
>
> FID <- paste0(sids, "_", fqs.ids[,1], "_", fqs.ids[,3], "_", fqs.ids[,2], ".bam")
> FID
[1] "BCMm1003-B1-bbn12_RS-04394165_RS-04388159_RS-04183872.bam" ## [PubID]_[RSIDs].bam
[2] "BCMm1003-B2-bbn12_RS-04394161_RS-04388155_RS-04183873.bam" ##
[3] "BCMm1003-N_RS-04394168_RS-04388162_RS-04183874.bam" ##
If successful, you will have output files under
$JOBDIR/[patient(mouse)ID]/ourput/BAM/ directories.
Otherwise, descend to a patient’s (or a mouse’s) src
directory and re-run the align.R script. That is,
cd $JOBDIR/[patient(mouse)ID]/src
Rscript2 align.R &
Once BAM and BAM-index files are produced by align.sh or
alignDup.sh, you can run a coverage analysis using coverage2.sh. You may need
to adjust the coverage threshold, depending on the depths of coverage
achieved in sequencing.
#coverageThreshold <- "10000,20000,50000,100000" ## no spaces in between
coverageThreshold <- "3000,5000,8000,10000" ##
#coverageThreshold <- "1,10,20,30" ##
Note: coverage2.sh script uses the mosdepth software, and thus you need to load necessary modules BEFORE calling an R scripts as follows.
module use /projects/rpci/shared/modulefiles
module load gcc mosdepth
Then proceed as usual, i.e.,
chmod +x coverage2.sh
./coverage2.sh [patient(mouse)ID]
Also note: You need to have a target region file in
BED format and assign it to the rgfile parameter in
coverage2.sh. For instance,
PREFIX=/projects/rpci/songliu/mkorobkin
PID=RQ026467-Huss
. . .
rgfile=$PREFIX/projects/$PID/PGD943.coveredRegion_clean.bed
When successful, the coverage2.sh script produces
outputs under $JOBDIR/[patient(mouse)ID]/output/coverage2/
directory.
For target-sequencing only, you also need to modify and run flagstat.sh. (The WES alignment pipeline includes the flagstat calculation, so there is no need to do this step.).
chmod +x flagstat.sh
./flagstat.sh
This will produce an output file (.flagstat.txt) under
the $JOBDIR/[patient(mouse)ID]/output/BAM/ directory.
After processing all samples in the project, obtain a comprehensive summary file using summary2.sh. First, you need to load all the following required modules.
module use /projects/rpci/shared/modulefiles
module load gcc mosdepth
module load openmpi
module load samtools
Then, proceed as usual, i.e.,
chmod +x summary2.sh
./summary2.sh
The results will be written in the
$JOBDIR/[projectID]_reads_summary2.csv.
For human samples, modify and use: snv4.sh. This script calls
for the SomaticCaller4 pipeline in the RcwlPipelines
library, and processes BAM files from a pair of tumor-normal samples for
somatic mutations using 4 different SNV callers (Mutect2PL, MuSE,
mantaStrelka and VarDict) in series. This is the most time-consuming
step, and make sure to allocate enough computing time.
The R script in snv4.sh requires an external parameter
file (pfile) which contains the information of normal and tumor sample
pairs:
OUTDIR=$JOBDIR/output
pfile=$OUTDIR/TNpair.txt
. . .
df <- data.frame(read.table("$pfile"))
tids <- as.list(df\$V1)
tbam <- as.list(df\$V2)
nids <- as.list(df\$V3)
nbam <- as.list(df\$V4)
names(tbam) <- as.list(df\$V1)
Here, tids are a set of publication IDs of tumor
samples, tbam are their BAM files (with their complete
paths), nids are corresponding normal samples used as a
germline, and nbam are their BAM files, respectively. You
need to modify the snv4.sh script and make sure to provide
correct tumor-normal pairs in the $OUTDIR/TNpair.txt file
for the SNV calling.
chmod +x snv4.sh
./snv4.sh [patientID]
For mouse samples, modify and use: snv4_mouse.sh instead of
snv4.sh.
chmod +x snv4_mouse.sh
./snv4_mouse.sh [mouseID]
When successful, outputs from all 4 SNV callers are written in VCF
format under the $JOBDIR/[patient(mouse)ID]/output/variants
directory. If processing aborts for some reason, decend to
$JOBDIR/[patient(mouse)ID]/src directory and re-run the
snv4.R or snv_mouse.R script. You may want to
run such script as follows to see if all the SNV callers are processed
properly or not.
#!/usr/bin/bash
# check.sh
if [[ $# -eq 0 ]] ; then
echo "Usage: ./check.sh <patientID>"
exit 1
fi
pid=$1
PID=RQ026467-Huss
PREFIX=/projects/rpci/songliu/mkorobkin/projects
JOBDIR=$PREFIX/$PID
OUTDIR=$JOBDIR/$pid/output
SNVDIR=$OUTDIR/variants
SLIST=`ls $SNVDIR`
for tid in $SLIST; do
#[ -d "./data/$tid" ] && echo "Directory $tid exists." || echo "Error: Directory $tid does not exist."
if [ ! "$(ls -A $SNVDIR/$tid)" ]; then
echo $tid empty
elif ! compgen -G "$SNVDIR/$tid/*_VarDict.vcf" > /dev/null; then
echo $tid: vardict undone
elif ! compgen -G "$SNVDIR/$tid/*_MuSE.vcf" > /dev/null; then
echo $tid: muse undone
elif ! compgen -G "$SNVDIR/$tid/*filtered.PASS.vcf" > /dev/null; then
echo $tid: mutect2 undone
elif [ ! -f "$SNVDIR/$tid/somatic.snvs.vcf.gz" ]; then
echo $tid: strelka2 undone
elif ! compgen -G "$SNVDIR/$tid/*strelka2_mutect2_muse_vardict.vcf" > /dev/null; then
echo $tid: Outputs from different callers need to be combined: Run combine.sh
else
echo $tid
fi
done
Occasinally, VCF output files from different SNV callers for a sample
fail to combine. If this happens, then modify and run combine.sh to mannually
produce
[pubID-tumor]_[pubID-normal]_strelka2_mutect2_muse_vardict.vcf
file.
chmod +x combine.sh
./combine.sh [patient(mouse)ID]
Once snv4.sh or snv4_mouse.sh produces
outputs from all 4 SNV callers, we can run Variant Effect Predictor (VEP), which
predicts the functional effects of genomic variants. To this end, modify
and use: vep4.sh.
NOTE: Make sure to select the correct reference genome for human or mouse samples:
rfile=/projects/rpci/shared/reference/human_g1k_v37/human_g1k_v37.fa ## human
#rfile=/projects/rpci/shared/references/GRCm38/Mus_musculus.GRCm38.dna.toplevel.fa ## mouse
Then proceed as usual, i.e.,
chmod +x vep4.sh
./vep4.sh [human(mouse)ID]
which should produce
[pubID-tumor]_[pubID-normal]_strelka2_mutect2_muse_vardict.vep.vcf
file in the $JOBID/[human(mouse)ID]/output/variants/[pubID]
directories.
Also NOTE: VEP annotation does not consume too much memories, and so it is recommended to use single CPU from the scavenger partitions in UB cluster.
Finally, convert VCF files to “.sj” files using: vcf2sj4.sh.
chmod +x vcf2sj4.sh
./vcf2sj4.sh [human(mouse)ID]
To run an additional SNV calling using bambino, modify and use: snv5.sh. Make sure to select correct reference genome for human or for mouse, e.g.,
rfile=/projects/rpci/shared/reference/human_g1k_v37/human_g1k_v37.fa ## human
#rfile=/projects/rpci/shared/references/GRCm38/Mus_musculus.GRCm38.dna.toplevel.fa ## mouse
Output files will be in the
$JOBDIR/[patient(mouse)ID]/output/variants2/ directory.
Then, convert them to “.sj” format by copying both reformat_bambino2SJ.pl and
reformat.sh into the same
directory and run:
chmod +x reformat.sh
./reformat.sh [human(mouse)ID]
After you finish processing all samples from all patients/mice in one project, to create a look-up table of their BAM files and corresponding SNV files, modify and use bvlist.sh:
chmod +x bvlist.sh
./bvlist.sh