by Huang Weitai
1st May 2020
SMuRF R package predicts a consensus set of somatic mutation calls using RandomForest machine learning. SMuRF generates a set of point mutations and insertions/deletions (indels) trained on the latest community-curated tumor whole genome sequencing data (Alioto et. al., 2015, Nat. Comms.). Our method is fast and accurate and analyses both whole-genome and whole-exome sequencing data from different cancer types.
For more information see our Bioinformatics paper: https://doi.org/10.1093/bioinformatics/btz018
Citation Huang, W., Guo, YA., Muthukumar, K., Baruah, P., Chang, M., and AJ., Skanderup, SMuRF: Portable and accurate ensemble prediction of somatic mutations. Bioinformatics, 2019: p. 3157-3159
Input from bcbio-nextgen pipeline Input directly from VCF Callers (optional) Test Dataset Requirements Installation Functions -Get SNV and indel predictions -Annotate genes -Subset region-of-interest Troubleshooting cutoffs Output format Running on multiple samples
Before running SMuRF, you require output data from the bcbio-nextgen pipeline that generates the VCF output for the variant callers: MuTect2, FreeBayes, VarDict, VarScan and the latest Strelka2. An additional caller Strelka2, has been added to SMuRF 2.0 and the information is documented on our wiki page.
SMuRF v1.6.4 is still available here SMuRF v1.6.4 readme
Note that your vcf.gz files need to be tab-indexed (.tbi files required) for retrieving gene annotations in SMuRF. We would recommend the bcbio-nextgen pipeline for a better user experience.
SMuRF requires the VCF output from each caller (.vcf.gz) to be placed in the same directory and files tagged with the caller (eg. sample1-mutect.vcf.gz, sample1-freebayes.vcf.gz, sample1-vardict.vcf.gz, sample1-varscan.vcf.gz)
For Users not running bcbio-nextgen pipeline: Alternatively, install and execute the individual callers.
Refer to the installation and instructions for each caller:
- VarDict
- VarScan
- MuTect2
- FreeBayes
- Strelka2
In this vignette, we utilise a partial output dataset derived from the chronic lymphocytic leukemia (CLL) data downloaded from the European Genome-phenome Archive (EGA) under the accession number EGAS00001001539. This test dataset is provided in the SMuRF package.
R 3.3 & 3.4 : bioconductor::VariantAnnotation
R >=3.5 : BiocManager::VariantAnnotation
h2o package : If h2o package takes a long time to install, try manually installing from their AWS page.
1. The latest version of the package is updated on Github https://github.com/skandlab/SMuRF
2. You can install the current SMuRF directly from Github via the following R commands:
#devtools is required
install.packages("devtools")
library(devtools)
install_github("skandlab/SMuRF", subdir="smurf")
(Alternative option) SMuRF installation via downloading of the package from Github:
#Clone or download package from Github https://github.com/skandlab/SMuRF/tree/master/smurf and save to your local directory
install.packages("my/current/directory/smurf", repos = NULL, type = "source")
SMuRF concurrently predicts single somatic nucleotide variants (SNV) as well as small insertions and deletions (indels) and saves time by parsing the VCF files once.
Missing packages will be installed the first time you run SMuRF.
library("smurf") #load SMuRF package
smurf() #check version and parameters
# "SMuRFv2.0 (30th April 2020)"
smurf(directory=NULL, mode=NULL, nthreads = -1,
annotation=F, output.dir=NULL, parse.dir=NULL,
snv.cutoff = 'default', indel.cutoff = 'default',
build=NULL, change.build=F, t.label=NULL, re.tabIndex=F,
check.packages=T)
myresults = smurf(mydir, 'combined', build='hg19') #save output into 'myresults' variable
| Arguments | Description |
|---|---|
| directory | Choose directory where the Variant Caller Format(VCF) files are located |
| output.dir | Path to output directory (if saving files as .txt) |
| parse.dir | Specify if changing SMuRF default cutoffs. Path to the location of existing snv-parse.txt and indel-parse.txt files generated by SMuRF |
| mode | Choose “snv”, “indel” or “combined” (snv+indel). “combined” provides a separate list of SNVs and indels. |
| annotation | TRUE or FALSE (default). Provide gene annotations for each variant call. |
| nthreads=-1 | Number of cores used for RandomForest prediction. Default (-1) for maximum number of cores. For 32-bit Windows, only 1 core is allowed (nthreads=1). |
| t.label | (Optional) Provide the sample name for your tumour sample to ease the identification of the normal and tumour sample names in your vcf |
| build | Specify you genome build: build=‘hg19’ or build=“hg38” |
| change.build | For conversion of your genomic coordinates |
| snv.cutoff | Default SNV model cutoff, unless a number between 0 to 1 is stated |
| indel.cutoff | Default indel model cutoff, unless a number between 0 to 1 is stated |
| re.tabIndex | TRUE or FALSE (default). Set to TRUE to create tab-indexed (.tbi) files for each vcf |
| check.packages=T | Developer mode |
For more information on the parameters see R documentation:
help(smurf)
Examples:
library("smurf") #load SMuRF package
myresults = smurf(directory="/path/to/directory..",
mode="snv",
output.dir="/path/to/output", build='hg19')
#Alter snv or indel model score cutoff (see below)
myresults = smurf(directory="/path/to/directory..",
mode="combined",
nthreads = 1, #for Windows OS
snv.cutoff = 0.2, indel.cutoff = 'default',
parse.dir="/path/to/parse/files/", build='hg19')
#Include gene annotations for coding regions in output
myresults = smurf(directory="/path/to/directory..",
mode="combined",
annotation=T, build='hg19')
#Change hg38 to hg19 coordinates in gene annotation output
myresults = smurf(directory="/path/to/directory..",
mode="combined",
annotation=T,
build='hg38', change.build=T)
#Specify tumor sample name eg."SampleA-001-T"
myresults = smurf(directory="/path/to/directory..",
mode="combined",
annotation=T,
build='hg38', t.label = '-T')
#Specify directories manually
dir.list = list(mutect='/path/to/mutect.vcf.gz',
freebayes='/path/to/freebayes.vcf.gz',
vardict='/path/to/vardict.vcf.gz',
varscan='/path/to/varscan.vcf.gz')
myresults = smurf(directory=dir.list,
mode="combined")
SMuRF is fine-tuned to achieve higher sensitivity. Re-adjust the stringency of the prediction with a specific cut-off value. Use parameters ‘snv.cutoff’ or ‘indel.cutoff’ to adjust the thresholds. To re-adjust the cut-off value of an existing SMuRF output, simply provide the parse.dir to the snv-parse and indel-parse files for immediate re-processing.
#start with a specific cutoff
myresults = smurf(directory = paste0(find.package("smurf"), "/data"),
mode="combined", #change cutoffs
snv.cutoff=0.2, indel.cutoff=0.1, #specify new cutoffs
output.dir = 'C:/Users/admin/myresults')
#modify cutoff from existing SMuRF parse files
myresults = smurf(directory = paste0(find.package("smurf"), "/data"),
mode="combined",
snv.cutoff=0.2, indel.cutoff=0.1,
parse.dir = 'C:/Users/admin/myresults', #SMuRF output path to existing parse.txt files
output.dir = 'C:/Users/admin/myresults2')
#Plot histogram
hist(as.numeric(myresults$smurf_indel$predicted_indel[,'SMuRF_score']), main = 'Re-adjusted predicted indels', xlab = 'SMuRF_score', col = 'grey50')
Output files saved include:
Variant statistics (stats)
Predicted reads (predicted)
Parsed-raw file (parse)
Predicted reads with annotations (annotated)* #for smurf’s “cdsannotation” function only
Time taken (time)
#Main predicted file (SNV & indel)
myresults$smurf_snv$predicted_snv
myresults$smurf_indel$predicted_indel
| Column | Description |
|---|---|
| Chr | Chromosome number |
| START_POS_REF/END_POS_REF | Start and End nucleotide position of the somatic mutation |
| REF/ALT | Consensus Ref and Alt nucleotide changes of the highest likelihood |
| REF_MFVdVs/ALT_MFVdVs | Reference and Alternative nucleotide changes from each caller; Mutect2 (M), Freebayes (F), Vardict (Vd), Varscan (Vs) and lastly the newly added Strelka2 |
| FILTER | Passed (TRUE) or Reject (FALSE) [boolean] mutation calls from the individual callers |
| Sample_Name | Sample name is extracted based on your labeled samples in the vcf files |
| Alt_Allele_Freq | Mean Variant allele frequency calculated from the tumor reads of the callers |
| Depth ref/alt N/T | Mean read depth from the N/T sample for ref/alt alleles |
| SMuRF_score | SMuRF confidence score of the predicted mutation |
myresults$smurf_indel$stats_indel
# Passed_Calls
# Strelka2 466
# Mutect2 232
# FreeBayes 306
# VarDict 483
# VarScan 1273
# Atleast1 2431
# Atleast2 278
# Atleast3 43
# Atleast4 7
# All5 1
# SMuRF_INDEL 88
myresults$smurf_snv$stats_snv
# Passed_Calls
# Strelka2 1362
# Mutect2 1539
# FreeBayes 216
# VarDict 239
# VarScan 1734
# Atleast1 4017
# Atleast2 928
# Atleast3 60
# Atleast4 48
# All5 37
# SMuRF_SNV 1043
We added gene annotations using SnpEff (from bcbio) and SMuRF extracts the coding annotations from the canonical transcripts with the highest impact. Take note that your vcf.gz files should be tab-indexed (.tbi files required).
myresults = smurf(mydir, "cdsannotation") #runs SMuRF for SNV and indels + generate annotations
myresults$smurf_snv_annotation$annotated[order(myresults$smurf_snv_annotation$annotated$REGION)[1:2],]
# Chr START_POS_REF END_POS_REF REF ALT REF_MFVdVs ALT_MFVdVs FILTER_Mutect2 FILTER_Freebayes FILTER_Vardict
# 52 1 77806132 77806132 G A G/G/G/G/G A/A/A/A/A TRUE TRUE TRUE
# 81 1 170961432 170961432 C T C/NA/NA/NA/C T/NA/NA/NA/T TRUE FALSE FALSE
# FILTER_Varscan FILTER_Strelka2 Sample_Name Alt_Allele_Freq N_refDepth N_altDepth T_refDepth T_altDepth Allele
# 52 TRUE TRUE icgc_cll_tumour 0.500 14 0 15 15 A
# 81 FALSE TRUE icgc_cll_tumour 0.467 33 0 16 14 T
# Annotation Impact Gene_name Gene_ID Feature_Type Feature_ID Transcript_BioType Rank HGVS.c
# 52 missense_variant MODERATE AK5 ENSG00000154027 transcript ENST00000354567 protein_coding 6/14 c.770G>A
# 81 missense_variant MODERATE MROH9 ENSG00000117501 transcript ENST00000367759 protein_coding 12/22 c.1156C>T
# HGVS.p cDNA.pos CDS.pos AA.pos Distance REGION SMuRF_score
# 52 p.Arg257His 1033/3248 770/1689 257/562 . CDS 0.9083840
# 81 p.Arg386Cys 1310/3165 1156/2586 386/861 . CDS 0.8107475
Time taken for your run:
myresults$time.taken
<!-- Time difference of 20.52405 secs -->
The raw parsed output:
myresults$smurf_indel$parse_indel
myresults$smurf_snv$parse_snv
Proceed to save the output in your desired formats State “output.dir” to generate .txt files in your desired directory Example:
myresults = smurf(directory = paste0(find.package("smurf"), "/data"),
mode ="combined", nthreads = 1,
annotation=T, build='hg19',
output.dir = 'C:/Users/admin/myresults')
Iterate over multiple samples by providing the list of directories of where your sample files are located.
#Example
project.dir = 'path/to/my/dir'
samples = c('sample_A', 'sample_B', 'sample_C') #sample dir where vcf files are located
for(i in 1:length(samples)) {
smurf(directory=paste0(project.dir, '/', samples[i]),
mode="combined", build='hg19',
output.dir = paste0('C:/Users/admin/myresults-',sample[i]))
}
For errors and bugs, please report on our Github page.