1 Prerequisite

1.1 Install RcwlPipeline

The SNV-calling pipeline is processed using R and an R library called Rcwlpipelines. The official userguide for RcwlPipline is found in here. For a general guideline for installing an R package in the CCR cluster, see their knowlegde base article.

1.1.2 Using R under one’s home directory in the CCR cluster

Personally, I have been using my own version of R installed in my home directory (and keep using the latest version). To this end, one can do the following:

cd /projects/rpci/songliu/[username]/Software # or anywhere under your /project directory
wget https://cran.r-project.org/src/base/R-4/R-4.5.2.tar.gz  # pick the version you like
tar -xvzf R-4.5.2.tar.gz
cd R-4.5.2
./configure --prefix=/projects/rpci/songliu/[username]/R --enable-R-shlib 
make && make install

Then, proceed inside this private version of R as usual, i.e.,

dir.create("/projects/rpci/songliu/[username]/tmp/")
Sys.setenv(TMPDIR="/projects/rpci/songliu/[username]/tmp") 
install.packages("BiocManager")
BiocManager::install(c("Rcwl","RcwlPipelines"))
library(RcwlPipelines, lib.loc="/projects/rpci/songliu/[username]/rlibs")

Notice that you need not to designate a personalized library (e.g. /rlibs) unless using R in a public module.

Note: If your installation halts because of the dependency: RcppTOML, then download its older version and compile it outside R as follows.

wget http://lib.stat.cmu.edu/R/CRAN/src/contrib/Archive/RcppTOML/RcppTOML_0.1.3.tar.gz
R CMD INSTALL RcppTOML_0.1.3.tar.gz

Then, proceed in R using force = TRUE option:

BiocManager::install(c("Rcwl","RcwlPipelines"), force = TRUE)

Also note: If your installation fails with a message:

ERROR: 'configure' exists but is not executable -- see the 'R Installation and Administration Manual'

then you may need to set the TMPDIR environment variable which R will use as the compilation directory:

mkdir ~/tmp
export TMPDIR=~/tmp

Then, proceed in R using BiocManager::install with force = TRUE option.

1.2 Copying fasta file

Use globas online:

Once you come to File Manager screen, choose RPCI (hpcapp01) and UB (vortex) clusters as source and destination host, as shown bellow.

Press start to copy the directories containing fasta files from RPCI host to your /project home directory in UB host. Note that each fasta file is a few GB in memory, and you must make sure your destination has enough space.

2 Organizing files

Some of Qiang’s R scripts we use for data processing are written with an assumption that the parent directory of input and output files (e.g., .fastq.gz and .bam files, respectively) is named after their publication ID consisting of a patient/mouse ID (e.g., BCMm#) followed by sample ID (e.g.. B#). However, the original data file and their parent directories in the RPCI host are usually not named using pubIDs but their the “RS” IDs (RS-#). Therefore, it is recommended to download them first in a generic $JOBDIR/data directory, and then create a directory for each patient/mouse ID containg soft links to the files in the $JOBDIR/data directory. For instance,

JOBDIR=/projects/rpci/songliu/mkorobkin/projects/RQ026467-Huss
$JOBDIR/data/ ## Contains origintal fastq files
    RS-04394159/
        RS-04394159_RS-03976924-Bladder_RS-04388153_S4_R1_001.fastq.gz  
        RS-04394159_RS-03976924-Bladder_RS-04388153_S4_R2_001.fastq.gz
    RS-04394163/
        RS-04394163_RS-03976925-Bladder_RS-04388157_S8_R1_001.fastq.gz  
        RS-04394163_RS-03976925-Bladder_RS-04388157_S8_R2_001.fastq.gz
    ...
$JOBDIR/BCMm1001/
    data/ ## Contains soft links to actual data directories
        BCMm1001-B1/ -> $JOBDIR/data/RS-04394159
        BCMm1001-B2/ -> $JOBDIR/data/RS-04394163
        ...
    output/ ## Will be created automatically once outputs are generated
        BAM/
            BCMm1001-B1/
                 BCMm1001-B1_RS-04394159_RS-04388153_RS-03976924.bam  
                 BCMm1001-B1_RS-04394159_RS-04388153_RS-03976924.bam.bai  
            BCMm1001-B2/
                 BCMm1001-B2_RS-04394163_RS-04388157_RS-03976925.bam  
                 BCMm1001-B2_RS-04394163_RS-04388157_RS-03976925.bam.bai  
            ...
        ...

Suppose you created a file under each $JOBDIR/patient(mouse)ID/ directory containing look-up table for RS-ID and publication ID, e.g.,

$ head $JOBDIR/BCMm100*/sampleID.txt
==> /projects/rpci/songliu/mkorobkin/projects/RQ026467-Huss/BCMm1001/sampleID.txt <==
RS-04394159 BCMm1001-B1
RS-04394163 BCMm1001-B2

==> /projects/rpci/songliu/mkorobkin/projects/RQ026467-Huss/BCMm1002/sampleID.txt <==
RS-04394160 BCMm1002-B1
RS-04394164 BCMm1002-B2
RS-04394167 BCMm1002-N

==> /projects/rpci/songliu/mkorobkin/projects/RQ026467-Huss/BCMm1003/sampleID.txt <==
RS-04394161 BCMm1003-B2-bbn12
RS-04394165 BCMm1003-B1-bbn12
RS-04394168 BCMm1003-N

==> /projects/rpci/songliu/mkorobkin/projects/RQ026467-Huss/BCMm1004/sampleID.txt <==
RS-04394162 BCMm1004-B2-bbn12
RS-04394166 BCMm1004-B1-bbn12
RS-04394169 BCMm1004-N

Then the following script would create desired directory structure:

#!/bin/bash
# prep.sh

PREFIX=/projects/rpci/songliu
PID=RQ026467-Huss
JOBDIR=$PREFIX/[username]/projects/$PID      ## Your project directory
#DATADIR=$JOBDIR/data                        ## Original data directory
DATADIR=$PREFIX/mkorobkin/projects/$PID/data ## for testing

PLIST="BCMm1001 BCMm1002 BCMm1003 BCMm1004"

for pid in $PLIST; do

  echo $pid:
  #ifile=$JOBDIR/$pid/sampleID.txt
  ifile=$PREFIX/mkorobkin/projects/$PID/$pid/sampleID.txt ## for testing

  OUTDIR=$JOBDIR/$pid/data ## New data directory
  mkdir -p $OUTDIR

  SLIST=`awk '{print $1}' $ifile`
  for rsid in $SLIST; do
    sid=`awk -v rsid=$rsid '$1~rsid {print $2}' $ifile`
    echo $DATADIR/$rsid $OUTDIR/$sid ## source and target directories 
    ln -s -T $DATADIR/$rsid $OUTDIR/$sid
  done

done

3 (Optional) Trim adaptor sequence

Normally, the sequencing team at RPCI does the trimming of fastq files before sending them to us, but if you are processing fastq files with untrimmed adaptor sequences, modify and use trim.sh:

chmod +x trim.sh
./trim.sh [patient(mouse)ID]

A few general comments about running R scripts in Slurm: To begin, you always need to load the RcwlPipeline library.

library(RcwlPipelines)

And specify temporary and cashe directories (to avoid exceeding container quota; e.g.,

Sys.setenv(SINGULARITY_TMPDIR="/projects/rpci/[groupname]/[username]/tmp")
Sys.setenv(SINGULARITY_CACHEDIR="/projects/rpci/[groupname]/[username]/singularity_cache")
Sys.setenv(SINGULARITY_LOCALCACHEDIR="projects/rpci/[groupname]/[username]/tmp/runtime")
Sys.setenv(DEBUGME = "BiocParallel")

Optional: you may need to run cwlUpdate() for accessing the latest pipeline scripts.

cwlUpdate()

In addition, it is advisable to add such a custum function as follows in ~/.bashrc in order to monitor the batch processing your log files:

# For Slurm batch jobs
function Rscript2() { R --no-restore --file="$1"; }
export -f Rscript2

4 Sequence Alignment

For target-sequensing alignment, modify and use: align.sh which does not mark or remove duplicates in order to find very low-frequency mutations.

chmod +x align.sh
./align.sh [patient(mouse)ID]

On the other hand, for whole-exome-sequensing (WES) alignment, modify and use: alignDup.sh instead.

chmod +x alignDup.sh
./alignDup.sh [patient(mouse)ID]

Note: In both cases, you need to select different reference genomes depending on what kind of samples are sequenced, i.e.,

## Reference genome
#rfile=/projects/rpci/shared/reference/human_g1k_v37/human_g1k_v37.fa  ## human
rfile=/projects/rpci/shared/references/GRCm38/Mus_musculus.GRCm38.dna.toplevel.fa   ## mouse

Also note: You need to properly assign the filenames for output BAM files, according to the convention specified in this file. For example, we may have:

> # Prepare inputs
> fqs <- list.files("/projects/rpci/songliu/mkorobkin/projects/RQ026467-Huss/BCMm1003/data", ".fastq.gz", full.names = TRUE, recursive = TRUE)
> fq1 <- fqs[grep("R1", fqs)]
> fq2 <- fqs[grep("R2", fqs)]
> fqs.ids <- do.call(rbind, lapply(strsplit(basename(fq1), split = "[_.]"), function(x)x[1:6]))
> 
> fqs.ids
     [,1]          [,2]                  [,3]          [,4]  [,5] [,6] 
[1,] "RS-04394165" "RS-04183872-Bladder" "RS-04388159" "S10" "R1" "001"
[2,] "RS-04394161" "RS-04183873-Bladder" "RS-04388155" "S6"  "R1" "001"
[3,] "RS-04394168" "RS-04183874-Tail"    "RS-04388162" "S13" "R1" "001"
> 
> fqs.ids[,2] <- gsub("-[A-Z][a-z]*$","",fqs.ids[,2])
> 
> fqs.ids
     [,1]          [,2]          [,3]          [,4]  [,5] [,6] 
[1,] "RS-04394165" "RS-04183872" "RS-04388159" "S10" "R1" "001"   ## "-Bladder" or "-Tail" removed
[2,] "RS-04394161" "RS-04183873" "RS-04388155" "S6"  "R1" "001"   ##
[3,] "RS-04394168" "RS-04183874" "RS-04388162" "S13" "R1" "001"   ##
> 
> sids <- basename(dirname(fq1))
> 
> FID <- paste0(sids, "_", fqs.ids[,1], "_", fqs.ids[,3], "_", fqs.ids[,2], ".bam")  
> FID
[1] "BCMm1003-B1-bbn12_RS-04394165_RS-04388159_RS-04183872.bam"  ## [PubID]_[RSIDs].bam
[2] "BCMm1003-B2-bbn12_RS-04394161_RS-04388155_RS-04183873.bam"  ## 
[3] "BCMm1003-N_RS-04394168_RS-04388162_RS-04183874.bam"         ##

If successful, you will have output files under $JOBDIR/[patient(mouse)ID]/ourput/BAM/ directories. Otherwise, descend to a patient’s (or a mouse’s) src directory and re-run the align.R script. That is,

cd $JOBDIR/[patient(mouse)ID]/src
Rscript2 align.R &

5 (Optional) Coverage Analysis

Once BAM and BAM-index files are produced by align.sh or alignDup.sh, you can run a coverage analysis using coverage2.sh. You may need to adjust the coverage threshold, depending on the depths of coverage achieved in sequencing.

#coverageThreshold <- "10000,20000,50000,100000"  ## no spaces in between
coverageThreshold <- "3000,5000,8000,10000"       ## 
#coverageThreshold <- "1,10,20,30"                ## 

Note: coverage2.sh script uses the mosdepth software, and thus you need to load necessary modules BEFORE calling an R scripts as follows.

module use /projects/rpci/shared/modulefiles
module load gcc mosdepth

Then proceed as usual, i.e.,

chmod +x coverage2.sh
./coverage2.sh [patient(mouse)ID]

Also note: You need to have a target region file in BED format and assign it to the rgfile parameter in coverage2.sh. For instance,

PREFIX=/projects/rpci/songliu/mkorobkin
PID=RQ026467-Huss
. . .
rgfile=$PREFIX/projects/$PID/PGD943.coveredRegion_clean.bed

When successful, the coverage2.sh script produces outputs under $JOBDIR/[patient(mouse)ID]/output/coverage2/ directory.

5.1 Flag Statistics

For target-sequencing only, you also need to modify and run flagstat.sh. (The WES alignment pipeline includes the flagstat calculation, so there is no need to do this step.).

chmod +x flagstat.sh
./flagstat.sh

This will produce an output file (.flagstat.txt) under the $JOBDIR/[patient(mouse)ID]/output/BAM/ directory.

5.2 Summary

After processing all samples in the project, obtain a comprehensive summary file using summary2.sh. First, you need to load all the following required modules.

module use /projects/rpci/shared/modulefiles
module load gcc mosdepth
module load openmpi
module load samtools 

Then, proceed as usual, i.e.,

chmod +x summary2.sh
./summary2.sh

The results will be written in the $JOBDIR/[projectID]_reads_summary2.csv.

6 SNV calling

6.1 Human samples

For human samples, modify and use: snv4.sh. This script calls for the SomaticCaller4 pipeline in the RcwlPipelines library, and processes BAM files from a pair of tumor-normal samples for somatic mutations using 4 different SNV callers (Mutect2PL, MuSE, mantaStrelka and VarDict) in series. This is the most time-consuming step, and make sure to allocate enough computing time.

The R script in snv4.sh requires an external parameter file (pfile) which contains the information of normal and tumor sample pairs:

OUTDIR=$JOBDIR/output
pfile=$OUTDIR/TNpair.txt
. . .
df <- data.frame(read.table("$pfile"))
tids <- as.list(df\$V1)
tbam <- as.list(df\$V2)  
nids <- as.list(df\$V3)
nbam <- as.list(df\$V4)  
names(tbam) <- as.list(df\$V1)

Here, tids are a set of publication IDs of tumor samples, tbam are their BAM files (with their complete paths), nids are corresponding normal samples used as a germline, and nbam are their BAM files, respectively. You need to modify the snv4.sh script and make sure to provide correct tumor-normal pairs in the $OUTDIR/TNpair.txt file for the SNV calling.

chmod +x snv4.sh
./snv4.sh [patientID]

6.2 Mouse samples

For mouse samples, modify and use: snv4_mouse.sh instead of snv4.sh.

chmod +x snv4_mouse.sh
./snv4_mouse.sh [mouseID]

6.3 Combine VCF files

When successful, outputs from all 4 SNV callers are written in VCF format under the $JOBDIR/[patient(mouse)ID]/output/variants directory. If processing aborts for some reason, decend to $JOBDIR/[patient(mouse)ID]/src directory and re-run the snv4.R or snv_mouse.R script. You may want to run such script as follows to see if all the SNV callers are processed properly or not.

#!/usr/bin/bash
# check.sh

if [[ $# -eq 0 ]] ; then
    echo "Usage: ./check.sh <patientID>"
    exit 1
fi
pid=$1

PID=RQ026467-Huss
PREFIX=/projects/rpci/songliu/mkorobkin/projects
JOBDIR=$PREFIX/$PID
OUTDIR=$JOBDIR/$pid/output

SNVDIR=$OUTDIR/variants
SLIST=`ls $SNVDIR`

for tid in $SLIST; do
#[ -d "./data/$tid" ] && echo "Directory $tid exists." || echo "Error: Directory $tid does not exist."

if [ ! "$(ls -A $SNVDIR/$tid)" ]; then
  echo $tid empty
elif ! compgen -G "$SNVDIR/$tid/*_VarDict.vcf" > /dev/null; then
  echo $tid: vardict undone 
elif ! compgen -G "$SNVDIR/$tid/*_MuSE.vcf" > /dev/null; then
  echo $tid: muse undone 
elif ! compgen -G "$SNVDIR/$tid/*filtered.PASS.vcf" > /dev/null; then
  echo $tid: mutect2 undone 
elif [ ! -f "$SNVDIR/$tid/somatic.snvs.vcf.gz" ]; then
 echo $tid: strelka2 undone
elif ! compgen -G "$SNVDIR/$tid/*strelka2_mutect2_muse_vardict.vcf" > /dev/null; then
  echo $tid: Outputs from different callers need to be combined: Run combine.sh  
else
  echo $tid
fi
done

Occasinally, VCF output files from different SNV callers for a sample fail to combine. If this happens, then modify and run combine.sh to mannually produce [pubID-tumor]_[pubID-normal]_strelka2_mutect2_muse_vardict.vcf file.

chmod +x combine.sh
./combine.sh [patient(mouse)ID]

7 Variance Annotation

Once snv4.sh or snv4_mouse.sh produces outputs from all 4 SNV callers, we can run Variant Effect Predictor (VEP), which predicts the functional effects of genomic variants. To this end, modify and use: vep4.sh.

NOTE: Make sure to select the correct reference genome for human or mouse samples:

rfile=/projects/rpci/shared/reference/human_g1k_v37/human_g1k_v37.fa  ## human
#rfile=/projects/rpci/shared/references/GRCm38/Mus_musculus.GRCm38.dna.toplevel.fa  ## mouse

Then proceed as usual, i.e.,

chmod +x vep4.sh
./vep4.sh [human(mouse)ID]

which should produce [pubID-tumor]_[pubID-normal]_strelka2_mutect2_muse_vardict.vep.vcf file in the $JOBID/[human(mouse)ID]/output/variants/[pubID] directories.

Also NOTE: VEP annotation does not consume too much memories, and so it is recommended to use single CPU from the scavenger partitions in UB cluster.

8 VCF to SJ conversion

Finally, convert VCF files to “.sj” files using: vcf2sj4.sh.

chmod +x vcf2sj4.sh
./vcf2sj4.sh [human(mouse)ID]

9 (Optional) Bambino SNV caller

To run an additional SNV calling using bambino, modify and use: snv5.sh. Make sure to select correct reference genome for human or for mouse, e.g.,

rfile=/projects/rpci/shared/reference/human_g1k_v37/human_g1k_v37.fa  ## human
#rfile=/projects/rpci/shared/references/GRCm38/Mus_musculus.GRCm38.dna.toplevel.fa  ## mouse

Output files will be in the $JOBDIR/[patient(mouse)ID]/output/variants2/ directory. Then, convert them to “.sj” format by copying both reformat_bambino2SJ.pl and reformat.sh into the same directory and run:

chmod +x reformat.sh
./reformat.sh [human(mouse)ID]

10 BV Lists

After you finish processing all samples from all patients/mice in one project, to create a look-up table of their BAM files and corresponding SNV files, modify and use bvlist.sh:

chmod +x bvlist.sh
./bvlist.sh