SMuRF vignette

by Huang Weitai

9th June 2017

SMuRF is an R package for predicting a consensus set of somatic mutation calls using a random forest classification model. To run its functions, you will need output from the bcbio-nextgen cancer variant calling pipeline (http://bcbio-nextgen.readthedocs.io/en/latest/contents/pipelines.html#cancer-variant-calling), specifically the VCF output of the MuTect2, FreeBayes, VarDict and VarScan callers.

In this vignette, we will be using a partial output dataset (https://github.com/skandlab/SMuRF/tree/master/test) derived from the chronic lymphocytic leukemia data downloaded from the European Genome-phenome Archive (EGA) under the accession number EGAS00001001539.

Requirements for package:

Dependencies: https://github.com/skandlab/SMuRF/wiki/Running-SMuRF

Installation instructions:

1. The latest version of the package is available on GitHub: https://github.com/skandlab/SMuRF
2. You can install the current version of SMuRF directly from GitHub with the following R commands:

#devtools is required
install.packages("devtools")
library(devtools)
install_github("skandlab/SMuRF", subdir="smurf")

(Alternative option) Install SMuRF from a local copy of the package downloaded from GitHub:

#Clone or download package from Github https://github.com/skandlab/SMuRF/tree/master/smurf
install.packages("my/current/directory/smurf", repos = NULL, type = "source")

Download the test files into your designated directory: https://github.com/skandlab/SMuRF/tree/master/test

download.file('https://github.com/skandlab/SMuRF/raw/master/test/varscan.vcf.gz','varscan.vcf.gz')
download.file('https://github.com/skandlab/SMuRF/raw/master/test/vardict.vcf.gz','vardict.vcf.gz')
download.file('https://github.com/skandlab/SMuRF/raw/master/test/mutect2.vcf.gz','mutect2.vcf.gz')
download.file('https://github.com/skandlab/SMuRF/raw/master/test/freebayes.vcf.gz','freebayes.vcf.gz')

Before using the package's functions, set the directory that contains your sample's VCF files:

mydir <- getwd() #get current directory
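
Before running SMuRF, it can help to confirm that the four caller VCFs are actually present in 'mydir'. The following is a minimal sketch, not part of the SMuRF package: find_missing_vcfs is a hypothetical helper, and it assumes the files carry the same names as the test files downloaded above (the vignette does not state which file names SMuRF itself requires).

```r
# Hypothetical helper (not part of SMuRF): list the expected VCF files
# that are absent from a sample directory.
# Assumption: files are named like the test files downloaded above.
find_missing_vcfs <- function(dir,
                              expected = c("mutect2.vcf.gz", "freebayes.vcf.gz",
                                           "vardict.vcf.gz", "varscan.vcf.gz")) {
  expected[!file.exists(file.path(dir, expected))]
}

mydir <- getwd() # same directory as set above
missing_vcfs <- find_missing_vcfs(mydir)

if (length(missing_vcfs) > 0) {
  message("Missing from ", mydir, ": ", paste(missing_vcfs, collapse = ", "))
} else {
  message("All four caller VCFs found in ", mydir)
}
```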

SMuRF predicts somatic single nucleotide variants (SNVs) as well as small insertions and deletions (indels). In this example we will predict both SNVs and indels, so we will use the "combined" mode (the other options are "snv" and "indel", which predict one type only).

library("smurf")
myresults <- smurf(mydir, "combined") #save output into 'myresults' variable
#smurf(mydir, "snv") 
#smurf(mydir, "indel") 
#this will run SMuRF and generate predictions based on input files in 'mydir'

The first time you run SMuRF, it may install required packages.

The saved output includes variant statistics (stats) and the predicted mutation calls (predicted):

# myresults <- smurf(mydir, "combined") 

myresults$smurf_indel$stats_indel

#            Passed_Calls
#Mutect2              383
#FreeBayes             85
#VarDict              166
#VarScan              158
#Atleast1             765
#Atleast2              25
#Atleast3               2
#All4                   0
#SMuRF_INDEL            1

myresults$smurf_indel$predicted_indel

#Chr START_POS_REF END_POS_REF REF_MFVdVs    ALT_MFVdVs SMuRF_score
#1 "1" "17820432"    "17820433"  "AT/AT/AT/AT" "A/A/A/A"  "0.571875" 

myresults$smurf_snv$stats_snv

#            Passed_Calls
#Mutect2             5004
#FreeBayes            115
#VarDict              124
#VarScan              501
#Atleast1            5684
#Atleast2              42
#Atleast3              16
#All4                   2
#SMuRF_SNV             12

myresults$smurf_snv$predicted_snv

#Chr START_POS_REF END_POS_REF REF_MFVdVs ALT_MFVdVs TRUTH_confidence
#1  "1"     "12207135"    "12207135"  "G/G/G/G"  "A/A/A/A"  "0.9941938"     
#2  "1"     "14955425"    "14955425"  "C/C/C/C"  "A/A/A/A"  "0.9025974"     
#3  "1"     "18525077"    "18525077"  "G/G/G/G"  "A/A/A/A"  "0.9972492"     
#4  "1"     "21459584"    "21459584"  "T/T/T/T"  "A/A/A/A"  "0.9919866"     
#5  "1"     " 2180985"    " 2180985"  "A/A/A/A"  "G/G/G/G"  "0.9992510"     
#6  "1"     "22385823"    "22385823"  "A/A/A/A"  "G/G/G/G"  "0.6207352"     
#7  "1"     "25322776"    "25322776"  "C/C/NA/C" "T/T/NA/T" "0.8995001"     
#8  "1"     "28317030"    "28317030"  "G/NA/G/G" "T/NA/T/T" "0.8962453"     
#9  "1"     " 5035185"    " 5035185"  "C/C/C/C"  "T/T/T/T"  "0.9321060"     
#10 "1"     " 8881322"    " 8881322"  "G/G/G/G"  "A/A/A/A"  "0.9116895"     
#11 "1"     " 8929624"    " 8929624"  "A/A/A/A"  "G/G/G/G"  "0.8765643"     
#12 "1"     " 9478609"    " 9478609"  "A/A/A/A"  "G/G/G/G"  "0.5984848"

Output file description/legend

Column Name	Description
Chr	Chromosome
START_POS_REF/END_POS_REF	Start and end nucleotide positions of the somatic mutation
REF_MFVdVs/ALT_MFVdVs	Reference and alternative alleles reported by each caller: MuTect2 (M), FreeBayes (F), VarDict (Vd), VarScan (Vs)
SMuRF_score	SMuRF confidence score of the predicted mutation
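
The predicted calls can be filtered by their confidence score. The sketch below assumes the predicted table is a character matrix, as the quoted values in the printed output suggest, and rebuilds three of the SNV rows shown above so that it runs on its own. The score column is looked up under either name that appears in this vignette (SMuRF_score or TRUTH_confidence), and the 0.9 cutoff is an arbitrary illustration, not a recommended threshold.

```r
# Three rows copied from the SNV output above, rebuilt as a character matrix.
# With a real run you would start from myresults$smurf_snv$predicted_snv.
predicted_snv <- matrix(
  c("1", "12207135", "12207135", "G/G/G/G", "A/A/A/A", "0.9941938",
    "1", "22385823", "22385823", "A/A/A/A", "G/G/G/G", "0.6207352",
    "1", " 2180985", " 2180985", "A/A/A/A", "G/G/G/G", "0.9992510"),
  ncol = 6, byrow = TRUE,
  dimnames = list(NULL, c("Chr", "START_POS_REF", "END_POS_REF",
                          "REF_MFVdVs", "ALT_MFVdVs", "TRUTH_confidence"))
)

# Find the score column under whichever name is present.
score_col <- intersect(c("SMuRF_score", "TRUTH_confidence"),
                       colnames(predicted_snv))[1]

# Convert to a data frame with numeric positions and scores.
snv_df <- as.data.frame(predicted_snv, stringsAsFactors = FALSE)
snv_df$START_POS_REF <- as.integer(trimws(snv_df$START_POS_REF))
snv_df$END_POS_REF <- as.integer(trimws(snv_df$END_POS_REF))
snv_df[[score_col]] <- as.numeric(snv_df[[score_col]])

# Keep calls at or above the illustrative cutoff, sorted by position.
high_conf <- snv_df[snv_df[[score_col]] >= 0.9, ]
high_conf <- high_conf[order(high_conf$START_POS_REF), ]
print(high_conf)
```

Converting the positions to integers also restores numeric ordering; in the character matrix, padded values such as " 2180985" sort as text.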

You may also retrieve the time taken for your run.

myresults$time.taken

#Time difference of 20.52405 secs

You can check the parsed output used for the prediction:

myresults$smurf_indel$parse_indel

myresults$smurf_snv$parse_snv

You may compare the output generated from the test samples in this section with the expected results provided in the results folder: https://github.com/skandlab/SMuRF/tree/master/test/results

Running on multiple samples

SMuRF can run somatic mutation predictions on multiple matched tumor-normal samples if you provide a list of the directories where your sample files are located.

#Example

sample_directories <- list("my/dir/sample_A", "my/dir/sample_B", "my/dir/sample_C")

myresults <- list()

for (i in seq_along(sample_directories))
 {
 # [[i]] extracts the directory path itself; [i] would return a one-element list
 myresults[[i]] <- smurf(sample_directories[[i]], "combined")
 }
 
#myresults[[1]]$time.taken
#Time difference of 9.973997 secs

#myresults[[2]]$time.taken
#Time difference of 11.1712 secs

#myresults[[3]]$time.taken
#Time difference of 15.18325 secs
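
With many samples, one failing directory would stop the loop above and discard the remaining runs. A possible variation, sketched here under assumptions, wraps each run in tryCatch and names the results by sample directory. run_samples is a hypothetical helper, not part of SMuRF; it takes the function to run as an argument, so the smurf call appears only in the commented usage line.

```r
# Hypothetical helper (not part of SMuRF): apply run_fn to each sample
# directory, recording NULL for any sample that raises an error.
run_samples <- function(sample_dirs, run_fn) {
  sample_dirs <- unlist(sample_dirs) # accept a list or a character vector
  results <- lapply(sample_dirs, function(d) {
    tryCatch(
      run_fn(d),
      error = function(e) {
        message("Sample failed: ", d, " (", conditionMessage(e), ")")
        NULL
      }
    )
  })
  names(results) <- basename(sample_dirs) # e.g. "sample_A"
  results
}

# Usage with SMuRF (requires the package and real sample directories):
# myresults <- run_samples(sample_directories, function(d) smurf(d, "combined"))
# myresults$sample_A$time.taken
```

Naming the results by directory assumes the final path components are unique across samples; if they are not, index the list by position instead.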

Please report errors and bugs on our GitHub page.