by Huang Weitai
9th June 2017
SMuRF is an R package for predicting a consensus set of somatic mutation calls using a random forest classification model. To run it, you need output from the bcbio-nextgen cancer variant calling pipeline (http://bcbio-nextgen.readthedocs.io/en/latest/contents/pipelines.html#cancer-variant-calling) that includes the VCF files produced by MuTect2, FreeBayes, VarDict and VarScan.
In this vignette, we will be using a partial output dataset (https://github.com/skandlab/SMuRF/tree/master/test) derived from the chronic lymphocytic leukemia data downloaded from the European Genome-phenome Archive (EGA) under the accession number EGAS00001001539.
Requirements for package:
Dependencies: https://github.com/skandlab/SMuRF/wiki/Running-SMuRF
Installation instructions:
1. The latest version of the package is available on GitHub: https://github.com/skandlab/SMuRF
2. You can install the current version of SMuRF directly from GitHub with the following R commands:
#devtools is required
install.packages("devtools")
library(devtools)
install_github("skandlab/SMuRF", subdir="smurf")
(Alternative option) Install SMuRF from a local copy of the package downloaded from GitHub:
#Clone or download the package from GitHub https://github.com/skandlab/SMuRF/tree/master/smurf
install.packages("my/current/directory/smurf", repos = NULL, type = "source")
Download the test files into your designated directory: https://github.com/skandlab/SMuRF/tree/master/test
download.file('https://github.com/skandlab/SMuRF/raw/master/test/varscan.vcf.gz','varscan.vcf.gz')
download.file('https://github.com/skandlab/SMuRF/raw/master/test/vardict.vcf.gz','vardict.vcf.gz')
download.file('https://github.com/skandlab/SMuRF/raw/master/test/mutect2.vcf.gz','mutect2.vcf.gz')
download.file('https://github.com/skandlab/SMuRF/raw/master/test/freebayes.vcf.gz','freebayes.vcf.gz')
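The four downloads above share one pattern, so they can also be written as a loop. The following is a minimal sketch, assuming the four file names in the test folder; the helper names `smurf_test_urls` and `download_smurf_test_files` are hypothetical and not part of SMuRF.

```r
# Hypothetical helpers (not part of the SMuRF package).
smurf_test_urls <- function(callers = c("varscan", "vardict", "mutect2", "freebayes")) {
  base_url <- "https://github.com/skandlab/SMuRF/raw/master/test"
  files <- paste0(callers, ".vcf.gz")
  # Named character vector: names are local file names, values are URLs
  setNames(paste0(base_url, "/", files), files)
}

download_smurf_test_files <- function(dest_dir = getwd()) {
  urls <- smurf_test_urls()
  for (f in names(urls)) {
    # mode = "wb" keeps the gzipped files intact on Windows
    download.file(urls[[f]], file.path(dest_dir, f), mode = "wb")
  }
  # Return the local paths invisibly
  invisible(file.path(dest_dir, names(urls)))
}
```

Calling `download_smurf_test_files()` fetches all four files into the current directory.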
Before using the package's functions, set the directory that contains the VCF files for your sample:
mydir <- getwd() #get current directory
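Before running SMuRF, it can help to confirm that the directory actually holds the four caller outputs. This is a small sketch that assumes the file names used by the test dataset; whether SMuRF accepts other file names is not covered here, and `missing_vcfs` is a hypothetical helper.

```r
# Hypothetical helper: report which of the four caller VCFs are absent from `dir`.
# Assumes the file names used by the test dataset.
missing_vcfs <- function(dir) {
  expected <- c("mutect2.vcf.gz", "freebayes.vcf.gz", "vardict.vcf.gz", "varscan.vcf.gz")
  expected[!file.exists(file.path(dir, expected))]
}

mydir <- getwd()
absent <- missing_vcfs(mydir)
if (length(absent) > 0) {
  message("Missing from ", mydir, ": ", paste(absent, collapse = ", "))
}
```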
SMuRF predicts somatic single nucleotide variants (SNVs) as well as small insertions and deletions (indels). In this example we predict both SNVs and indels, so we use the "combined" mode (the other options, "snv" and "indel", predict only one variant type).
library("smurf")
myresults <- smurf(mydir, "combined") #save output into 'myresults' variable
#smurf(mydir, "snv")
#smurf(mydir, "indel")
#this will run SMuRF and generate predictions based on input files in 'mydir'
The first time you run SMuRF, required packages may be installed.
The saved output includes the variant statistics (stats) and the predicted mutation calls (predicted):
# myresults <- smurf(mydir, "combined")
myresults$smurf_indel$stats_indel
# Passed_Calls
#Mutect2 383
#FreeBayes 85
#VarDict 166
#VarScan 158
#Atleast1 765
#Atleast2 25
#Atleast3 2
#All4 0
#SMuRF_INDEL 1
myresults$smurf_indel$predicted_indel
#Chr START_POS_REF END_POS_REF REF_MFVdVs ALT_MFVdVs SMuRF_score
#1 "1" "17820432" "17820433" "AT/AT/AT/AT" "A/A/A/A" "0.571875"
myresults$smurf_snv$stats_snv
# Passed_Calls
#Mutect2 5004
#FreeBayes 115
#VarDict 124
#VarScan 501
#Atleast1 5684
#Atleast2 42
#Atleast3 16
#All4 2
#SMuRF_SNV 12
myresults$smurf_snv$predicted_snv
#Chr START_POS_REF END_POS_REF REF_MFVdVs ALT_MFVdVs TRUTH_confidence
#1 "1" "12207135" "12207135" "G/G/G/G" "A/A/A/A" "0.9941938"
#2 "1" "14955425" "14955425" "C/C/C/C" "A/A/A/A" "0.9025974"
#3 "1" "18525077" "18525077" "G/G/G/G" "A/A/A/A" "0.9972492"
#4 "1" "21459584" "21459584" "T/T/T/T" "A/A/A/A" "0.9919866"
#5 "1" " 2180985" " 2180985" "A/A/A/A" "G/G/G/G" "0.9992510"
#6 "1" "22385823" "22385823" "A/A/A/A" "G/G/G/G" "0.6207352"
#7 "1" "25322776" "25322776" "C/C/NA/C" "T/T/NA/T" "0.8995001"
#8 "1" "28317030" "28317030" "G/NA/G/G" "T/NA/T/T" "0.8962453"
#9 "1" " 5035185" " 5035185" "C/C/C/C" "T/T/T/T" "0.9321060"
#10 "1" " 8881322" " 8881322" "G/G/G/G" "A/A/A/A" "0.9116895"
#11 "1" " 8929624" " 8929624" "A/A/A/A" "G/G/G/G" "0.8765643"
#12 "1" " 9478609" " 9478609" "A/A/A/A" "G/G/G/G" "0.5984848"
Output file description/legend
| Column Name | Description |
|---|---|
| Chr | Chromosome number |
| START_POS_REF/END_POS_REF | Start and End nucleotide position of the somatic mutation |
| REF_MFVdVs/ALT_MFVdVs | Reference and alternative alleles reported by each caller: MuTect2 (M), FreeBayes (F), VarDict (Vd), VarScan (Vs) |
| SMuRF_score | SMuRF confidence score of the predicted mutation (labelled TRUTH_confidence in the SNV output shown above) |
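The predictions print as quoted strings, which suggests they are held in a character matrix. Under that assumption, the sketch below converts a few rows copied from the SNV output above into a data frame with numeric columns, counts how many callers reported each variant, and filters on the score. The 0.9 threshold is an arbitrary illustration, and treating "NA" as "no call from that caller" is an assumption.

```r
# Three rows copied from the SNV output shown above, rebuilt as a character
# matrix (an assumption about how SMuRF stores its predictions).
predicted <- matrix(
  c("1", "12207135", "12207135", "G/G/G/G",  "A/A/A/A",  "0.9941938",
    "1", "22385823", "22385823", "A/A/A/A",  "G/G/G/G",  "0.6207352",
    "1", "25322776", "25322776", "C/C/NA/C", "T/T/NA/T", "0.8995001"),
  ncol = 6, byrow = TRUE,
  dimnames = list(NULL, c("Chr", "START_POS_REF", "END_POS_REF",
                          "REF_MFVdVs", "ALT_MFVdVs", "TRUTH_confidence"))
)

# Convert to a data frame and give positions and scores numeric types
calls <- as.data.frame(predicted, stringsAsFactors = FALSE)
calls$START_POS_REF <- as.integer(calls$START_POS_REF)
calls$END_POS_REF <- as.integer(calls$END_POS_REF)
calls$TRUTH_confidence <- as.numeric(calls$TRUTH_confidence)

# Split the per-caller alleles; order is Mutect2, FreeBayes, VarDict, VarScan
alt_by_caller <- do.call(rbind, strsplit(calls$ALT_MFVdVs, "/", fixed = TRUE))
colnames(alt_by_caller) <- c("Mutect2", "FreeBayes", "VarDict", "VarScan")

# Number of callers reporting each variant ("NA" assumed to mark a missing call)
calls$n_callers <- rowSums(alt_by_caller != "NA")

# Keep calls at or above an arbitrary illustrative threshold
high_conf <- calls[calls$TRUTH_confidence >= 0.9, ]
```

For real results, `predicted` would be replaced by `myresults$smurf_snv$predicted_snv`, with the score column name adjusted to match the output.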
You may also retrieve the time taken for your run.
myresults$time.taken
#Time difference of 20.52405 secs
You can check the parsed output used for the prediction:
myresults$smurf_indel$parse_indel
myresults$smurf_snv$parse_snv
You can compare the output generated from the test samples in this section against the expected results provided in the results folder: https://github.com/skandlab/SMuRF/tree/master/test/results.
Running on multiple samples
To predict somatic mutations for multiple matched tumor-normal samples, provide a list of the directories where each sample's files are located and run SMuRF on each one.
#Example
sample_directories <- list("my/dir/sample_A", "my/dir/sample_B", "my/dir/sample_C")
myresults <- list()
for (i in seq_along(sample_directories))
{
# [[i]] extracts the path string; [i] would pass a one-element list
myresults[[i]] <- smurf(sample_directories[[i]], "combined")
}
#myresults[[1]]$time.taken
#Time difference of 9.973997 secs
#myresults[[2]]$time.taken
#Time difference of 11.1712 secs
#myresults[[3]]$time.taken
#Time difference of 15.18325 secs
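When processing many samples, it can be convenient to name each result after its sample and to keep going if one sample fails. The wrapper below is a sketch of that idea; `run_samples` is a hypothetical helper, and it takes the function to run as an argument so it can be tried without SMuRF input data.

```r
# Hypothetical wrapper: apply `runner` to each sample directory, naming results
# by directory and capturing errors so one failing sample does not stop the rest.
run_samples <- function(sample_dirs, runner) {
  results <- lapply(sample_dirs, function(d) {
    tryCatch(runner(d), error = function(e) e)
  })
  names(results) <- basename(sample_dirs)
  results
}

# With SMuRF installed and real sample directories:
# dirs <- c("my/dir/sample_A", "my/dir/sample_B", "my/dir/sample_C")
# myresults <- run_samples(dirs, function(d) smurf(d, "combined"))
# failed <- names(Filter(function(r) inherits(r, "error"), myresults))
```

Results can then be accessed by name, for example `myresults$sample_A$time.taken`.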
Please report errors and bugs on our GitHub page.