SMuRF v2.0

by Huang Weitai

1st May 2020


Introduction

SMuRF R package predicts a consensus set of somatic mutation calls using RandomForest machine learning. SMuRF generates a set of point mutations and insertions/deletions (indels) trained on the latest community-curated tumor whole genome sequencing data (Alioto et. al., 2015, Nat. Comms.). Our method is fast and accurate and analyses both whole-genome and whole-exome sequencing data from different cancer types.

For more information see our Bioinformatics paper: https://doi.org/10.1093/bioinformatics/btz018

Citation
Huang, W., Guo, YA., Muthukumar, K., Baruah, P., Chang, M., and AJ., Skanderup, SMuRF: Portable and accurate ensemble prediction of somatic mutations. Bioinformatics, 2019: p. 3157-3159


Input from bcbio-nextgen pipeline

Before running SMuRF, you require output data from the bcbio-nextgen pipeline that generates the VCF output for the variant callers: MuTect2, FreeBayes, VarDict, VarScan and the latest Strelka2. An additional caller Strelka2, has been added to SMuRF 2.0 and the information is documented on our wiki page.

SMuRF v1.6.4 is still available here SMuRF v1.6.4 readme

Note that your vcf.gz files need to be tab-indexed (.tbi files required) for retrieving gene annotations in SMuRF. We would recommend the bcbio-nextgen pipeline for a better user experience.

SMuRF requires the VCF output from each caller (.vcf.gz) to be placed in the same directory and files tagged with the caller (eg. sample1-mutect.vcf.gz, sample1-freebayes.vcf.gz, sample1-vardict.vcf.gz, sample1-varscan.vcf.gz)


Input directly from VCF Callers (optional)

For Users not running bcbio-nextgen pipeline: Alternatively, install and execute the individual callers.

Refer to the installation and instructions for each caller:
- VarDict
- VarScan
- MuTect2
- FreeBayes
- Strelka2


Test Dataset

In this vignette, we utilise a partial output dataset derived from the chronic lymphocytic leukemia (CLL) data downloaded from the European Genome-phenome Archive (EGA) under the accession number EGAS00001001539. This test dataset is provided in the SMuRF package.


Requirements

R 3.3 & 3.4 : bioconductor::VariantAnnotation

R >=3.5 : BiocManager::VariantAnnotation

h2o package : If h2o package takes a long time to install, try manually installing from their AWS page.


Installation


1. The latest version of the package is updated on Github https://github.com/skandlab/SMuRF


2. You can install the current SMuRF directly from Github via the following R commands:

#devtools is required
install.packages("devtools")
library(devtools)
install_github("skandlab/SMuRF", subdir="smurf")

(Alternative option) SMuRF installation via downloading of the package from Github:

#Clone or download package from Github https://github.com/skandlab/SMuRF/tree/master/smurf and save to your local directory
install.packages("my/current/directory/smurf", repos = NULL, type = "source")

SMuRF concurrently predicts single somatic nucleotide variants (SNV) as well as small insertions and deletions (indels) and saves time by parsing the VCF files once.

Missing packages will be installed the first time you run SMuRF.

library("smurf") #load SMuRF package

smurf() #check version and parameters

# "SMuRFv2.0 (30th April 2020)"
smurf(directory=NULL, mode=NULL, nthreads = -1,
                 annotation=F, output.dir=NULL,  parse.dir=NULL,
                 snv.cutoff = 'default', indel.cutoff = 'default',
                 build=NULL, change.build=F, t.label=NULL, re.tabIndex=F,
                 check.packages=T)

myresults = smurf(mydir, 'combined', build='hg19') #save output into 'myresults' variable


Functions

Arguments Description
directory Choose directory where the Variant Caller Format(VCF) files are located
output.dir Path to output directory (if saving files as .txt)
parse.dir Specify if changing SMuRF default cutoffs. Path to the location of existing snv-parse.txt and indel-parse.txt files generated by SMuRF
mode Choose “snv”, “indel” or “combined” (snv+indel). “combined” provides a separate list of SNVs and indels.
annotation TRUE or FALSE (default). Provide gene annotations for each variant call.
nthreads=-1 Number of cores used for RandomForest prediction. Default (-1) for maximum number of cores. For 32-bit Windows, only 1 core is allowed (nthreads=1).
t.label (Optional) Provide the sample name for your tumour sample to ease the identification of the normal and tumour sample names in your vcf
build Specify you genome build: build=‘hg19’ or build=“hg38”
change.build For conversion of your genomic coordinates
snv.cutoff Default SNV model cutoff, unless a number between 0 to 1 is stated
indel.cutoff Default indel model cutoff, unless a number between 0 to 1 is stated
re.tabIndex TRUE or FALSE (default). Set to TRUE to create tab-indexed (.tbi) files for each vcf
check.packages=T Developer mode

For more information on the parameters see R documentation:

help(smurf)


Examples:

library("smurf") #load SMuRF package

 myresults = smurf(directory="/path/to/directory..",
                   mode="snv",
                   output.dir="/path/to/output", build='hg19')
 
 #Alter snv or indel model score cutoff (see below)
 myresults = smurf(directory="/path/to/directory..",
                   mode="combined",
                   nthreads = 1, #for Windows OS
                   snv.cutoff = 0.2, indel.cutoff = 'default',
                   parse.dir="/path/to/parse/files/", build='hg19')
 
 #Include gene annotations for coding regions in output
 myresults = smurf(directory="/path/to/directory..",
                   mode="combined", 
                   annotation=T, build='hg19')
 
 #Change hg38 to hg19 coordinates in gene annotation output
 myresults = smurf(directory="/path/to/directory..",
                   mode="combined", 
                   annotation=T, 
                   build='hg38', change.build=T)
                   
 #Specify tumor sample name eg."SampleA-001-T"
 myresults = smurf(directory="/path/to/directory..",
                   mode="combined", 
                   annotation=T, 
                   build='hg38', t.label = '-T')
                                       
 #Specify directories manually
 dir.list = list(mutect='/path/to/mutect.vcf.gz',
                 freebayes='/path/to/freebayes.vcf.gz',
                 vardict='/path/to/vardict.vcf.gz',
                 varscan='/path/to/varscan.vcf.gz')
 myresults = smurf(directory=dir.list, 
                   mode="combined")


Troubleshooting: Adjusting snv and indel model score cutoffs manually


SMuRF is fine-tuned to achieve higher sensitivity. Re-adjust the stringency of the prediction with a specific cut-off value.
Use parameters ‘snv.cutoff’ or ‘indel.cutoff’ to adjust the thresholds.
To re-adjust the cut-off value of an existing SMuRF output, simply provide the parse.dir to the snv-parse and indel-parse files for immediate re-processing.

#start with a specific cutoff
myresults = smurf(directory = paste0(find.package("smurf"), "/data"),
                  mode="combined", #change cutoffs
                  snv.cutoff=0.2, indel.cutoff=0.1, #specify new cutoffs
                  output.dir = 'C:/Users/admin/myresults') 

#modify cutoff from existing SMuRF parse files
myresults = smurf(directory = paste0(find.package("smurf"), "/data"),
                  mode="combined",
                  snv.cutoff=0.2, indel.cutoff=0.1,
                  parse.dir = 'C:/Users/admin/myresults', #SMuRF output path to existing parse.txt files
                  output.dir = 'C:/Users/admin/myresults2') 

#Plot histogram
hist(as.numeric(myresults$smurf_indel$predicted_indel[,'SMuRF_score']), main = 'Re-adjusted predicted indels', xlab = 'SMuRF_score', col = 'grey50')


Output format

Output files saved include:

  1. Variant statistics (stats)

  2. Predicted reads (predicted)

  3. Parsed-raw file (parse)

  4. Predicted reads with annotations (annotated)* #for smurf’s “cdsannotation” function only

  5. Time taken (time)


#Main predicted file (SNV & indel)

myresults$smurf_snv$predicted_snv

myresults$smurf_indel$predicted_indel
Column Description
Chr Chromosome number
START_POS_REF/END_POS_REF Start and End nucleotide position of the somatic mutation
REF/ALT Consensus Ref and Alt nucleotide changes of the highest likelihood
REF_MFVdVs/ALT_MFVdVs Reference and Alternative nucleotide changes from each caller; Mutect2 (M), Freebayes (F), Vardict (Vd), Varscan (Vs) and lastly the newly added Strelka2
FILTER Passed (TRUE) or Reject (FALSE) [boolean] mutation calls from the individual callers
Sample_Name Sample name is extracted based on your labeled samples in the vcf files
Alt_Allele_Freq Mean Variant allele frequency calculated from the tumor reads of the callers
Depth ref/alt N/T Mean read depth from the N/T sample for ref/alt alleles
SMuRF_score SMuRF confidence score of the predicted mutation


myresults$smurf_indel$stats_indel

#             Passed_Calls
# Strelka2             466
# Mutect2              232
# FreeBayes            306
# VarDict              483
# VarScan             1273
# Atleast1            2431
# Atleast2             278
# Atleast3              43
# Atleast4               7
# All5                   1
# SMuRF_INDEL           88

myresults$smurf_snv$stats_snv

#           Passed_Calls
# Strelka2          1362
# Mutect2           1539
# FreeBayes          216
# VarDict            239
# VarScan           1734
# Atleast1          4017
# Atleast2           928
# Atleast3            60
# Atleast4            48
# All5                37
# SMuRF_SNV         1043


We added gene annotations using SnpEff (from bcbio) and SMuRF extracts the coding annotations from the canonical transcripts with the highest impact. Take note that your vcf.gz files should be tab-indexed (.tbi files required).

myresults = smurf(mydir, "cdsannotation") #runs SMuRF for SNV and indels + generate annotations

myresults$smurf_snv_annotation$annotated[order(myresults$smurf_snv_annotation$annotated$REGION)[1:2],]
#    Chr START_POS_REF END_POS_REF REF ALT   REF_MFVdVs   ALT_MFVdVs FILTER_Mutect2 FILTER_Freebayes FILTER_Vardict
# 52   1      77806132    77806132   G   A    G/G/G/G/G    A/A/A/A/A           TRUE             TRUE           TRUE
# 81   1     170961432   170961432   C   T C/NA/NA/NA/C T/NA/NA/NA/T           TRUE            FALSE          FALSE
#    FILTER_Varscan FILTER_Strelka2     Sample_Name Alt_Allele_Freq N_refDepth N_altDepth T_refDepth T_altDepth Allele
# 52           TRUE            TRUE icgc_cll_tumour           0.500         14          0         15         15      A
# 81          FALSE            TRUE icgc_cll_tumour           0.467         33          0         16         14      T
#          Annotation   Impact Gene_name         Gene_ID Feature_Type      Feature_ID Transcript_BioType  Rank    HGVS.c
# 52 missense_variant MODERATE       AK5 ENSG00000154027   transcript ENST00000354567     protein_coding  6/14  c.770G>A
# 81 missense_variant MODERATE     MROH9 ENSG00000117501   transcript ENST00000367759     protein_coding 12/22 c.1156C>T
#         HGVS.p  cDNA.pos   CDS.pos  AA.pos Distance REGION SMuRF_score
# 52 p.Arg257His 1033/3248  770/1689 257/562        .    CDS   0.9083840
# 81 p.Arg386Cys 1310/3165 1156/2586 386/861        .    CDS   0.8107475


Time taken for your run:

myresults$time.taken

<!-- Time difference of 20.52405 secs -->


The raw parsed output:

myresults$smurf_indel$parse_indel

myresults$smurf_snv$parse_snv


Proceed to save the output in your desired formats State “output.dir” to generate .txt files in your desired directory Example:

myresults = smurf(directory = paste0(find.package("smurf"), "/data"),
                           mode ="combined", nthreads = 1,
                           annotation=T, build='hg19',
                           output.dir = 'C:/Users/admin/myresults')


Running on multiple samples

Iterate over multiple samples by providing the list of directories of where your sample files are located.

#Example

project.dir = 'path/to/my/dir'
samples = c('sample_A', 'sample_B', 'sample_C') #sample dir where vcf files are located

for(i in 1:length(samples)) {
 smurf(directory=paste0(project.dir, '/', samples[i]),
        mode="combined", build='hg19',
        output.dir = paste0('C:/Users/admin/myresults-',sample[i]))
 } 


For errors and bugs, please report on our Github page.