SMuRF vignette

by Huang Weitai

11th Jan 2018


SMuRF is an R package that contains functions for the prediction of a consensus set of somatic mutation calls based on a random forest machine learning classification model. To run these functions, you will need output data from the bcbio-nextgen pipeline http://bcbio-nextgen.readthedocs.io/en/latest/contents/pipelines.html#cancer-variant-calling containing the VCF output for algorithms MuTect2, FreeBayes, VarDict and VarScan.

In this vignette, we will be using a partial output dataset (https://github.com/skandlab/SMuRF/tree/master/test) derived from the chronic lymphocytic leukemia data downloaded from the European Genome-phenome Archive (EGA) under the accession number EGAS00001001539.

Requirements for package:

Dependencies: https://github.com/skandlab/SMuRF/wiki/Running-SMuRF


Installation instructions:


1. The latest version of the package is updated on Github https://github.com/skandlab/SMuRF
2. You can install the current SMuRF directly from Github via the following R command:

#devtools is required
install.packages("devtools")
library(devtools)
install_github("skandlab/SMuRF", subdir="smurf")

(Alternative option) SMuRF installation via downloading of the package from Github:

#Clone or download package from Github https://github.com/skandlab/SMuRF/tree/master/smurf
install.packages("my/current/directory/smurf", repos = NULL, type = "source")

Download the test files into your designated directory: https://github.com/skandlab/SMuRF/tree/master/test

download.file('https://github.com/skandlab/SMuRF/raw/master/test/varscan.vcf.gz','varscan.vcf.gz')
download.file('https://github.com/skandlab/SMuRF/raw/master/test/vardict.vcf.gz','vardict.vcf.gz')
download.file('https://github.com/skandlab/SMuRF/raw/master/test/mutect2.vcf.gz','mutect2.vcf.gz')
download.file('https://github.com/skandlab/SMuRF/raw/master/test/freebayes.vcf.gz','freebayes.vcf.gz')


Before we start using the package’s functions, set your designated file directory containing your sample

mydir <- getwd() #get current directory

SMuRF allows for the prediction of single somatic nucleotide variants (SNV) as well as small insertions and deletions (indels). In this example we will be predicting both SNVs and INDELs and thus we will be using the “combined” import functionality (other options are “snv” and “indel” only).

library("smurf")
myresults <- smurf(mydir, "combined") #save output into 'myresults' variable

#this will run SMuRF and generate predictions based on input files in 'mydir'

The first time you run SMuRF, required packages may be installed.

Output files saved includes variant statistics (stats) and the predicted reads (predicted)

# myresults <- smurf(mydir, "combined") 

myresults$smurf_indel$stats_indel

<!--             Passed_Calls -->
<!-- Mutect2             1524 -->
<!-- FreeBayes            349 -->
<!-- VarDict              531 -->
<!-- VarScan             1450 -->
<!-- Atleast1            3657 -->
<!-- Atleast2             166 -->
<!-- Atleast3              27 -->
<!-- All4                   4 -->
<!-- SMuRF_INDEL            5 -->

myresults$smurf_indel$predicted_indel

<!-- Chr START_POS_REF END_POS_REF REF ALT  REF_MFVdVs ALT_MFVdVs Sample_Name -->
<!-- 27     1      17820432    17820433  AT   A AT/AT/AT/AT    A/A/A/A    icgc_cll -->
<!-- 53     1      81654021    81654022  CA   C CA/CA/CA/CA    C/C/C/C    icgc_cll -->
<!-- 61     1      91134042    91134043  CT   C CT/CT/CT/CT    C/C/C/C    icgc_cll -->
<!-- 1628   1      32639063    32639064  TA   T TA/TA/NA/TA   T/T/NA/T    icgc_cll -->
<!-- 1749   1      71747307    71747308  AT   A NA/AT/AT/AT   NA/A/A/A    icgc_cll -->
<!--      FILTER_Mutect2 FILTER_Freebayes FILTER_Vardict FILTER_Varscan -->
<!-- 27             TRUE            FALSE           TRUE           TRUE -->
<!-- 53             TRUE             TRUE           TRUE           TRUE -->
<!-- 61             TRUE             TRUE           TRUE           TRUE -->
<!-- 1628          FALSE             TRUE          FALSE           TRUE -->
<!-- 1749          FALSE             TRUE          FALSE           TRUE -->
<!--      Alt_Allele_Freq N_refDepth N_altDepth T_refDepth T_altDepth SMuRF_score -->
<!-- 27            0.5650         24          1         13         13   0.6233728 -->
<!-- 53            0.4640         26          0         15         13   0.9799784 -->
<!-- 61            0.4850         31          0         18         16   0.9608392 -->
<!-- 1628          0.3810         20          0         13          7   0.9294872 -->
<!-- 1749          0.4615         74          4         57         51   0.4692025 -->

myresults$smurf_snv$stats_snv

<!--           Passed_Calls -->
<!-- Mutect2           4892 -->
<!-- FreeBayes          284 -->
<!-- VarDict            310 -->
<!-- VarScan           5533 -->
<!-- Atleast1         10748 -->
<!-- Atleast2           170 -->
<!-- Atleast3            59 -->
<!-- All4                42 -->
<!-- SMuRF_SNV          204 -->

head(myresults$smurf_snv$predicted_snv)

<!--    Chr START_POS_REF END_POS_REF REF ALT REF_MFVdVs ALT_MFVdVs Sample_Name -->
<!-- 16   1       2180985     2180985   A   G    A/A/A/A    G/G/G/G    icgc_cll -->
<!-- 31   1       5035185     5035185   C   T    C/C/C/C    T/T/T/T    icgc_cll -->
<!-- 37   1       8881322     8881322   G   A    G/G/G/G    A/A/A/A    icgc_cll -->
<!-- 38   1       8929624     8929624   A   G    A/A/A/A    G/G/G/G    icgc_cll -->
<!-- 39   1       9196716     9196716   C   T   C/NA/C/C   T/NA/T/T    icgc_cll -->
<!-- 40   1       9478609     9478609   A   G    A/A/A/A    G/G/G/G    icgc_cll -->
<!--    FILTER_Mutect2 FILTER_Freebayes FILTER_Vardict FILTER_Varscan -->
<!-- 16           TRUE             TRUE           TRUE           TRUE -->
<!-- 31           TRUE             TRUE           TRUE           TRUE -->
<!-- 37           TRUE             TRUE           TRUE           TRUE -->
<!-- 38           TRUE             TRUE           TRUE           TRUE -->
<!-- 39           TRUE            FALSE           TRUE           TRUE -->
<!-- 40           TRUE             TRUE           TRUE           TRUE -->
<!--    Alt_Allele_Freq N_refDepth N_altDepth T_refDepth T_altDepth SMuRF_score -->
<!-- 16           0.512         90          0         56         53   0.9950507 -->
<!-- 31           0.309         82          1         41         17   0.9942729 -->
<!-- 37           0.348         40          0         16          8   0.9692672 -->
<!-- 38           0.375         58          1         12          7   0.9240989 -->
<!-- 39           0.200         35          0         17          3   0.5476332 -->
<!-- 40           0.393         37          0         20         16   0.5114379 -->

Output file description/legend

Column Name Description
Chr Chromosome number
START_POS_REF/END_POS_REF Start and End nucleotide position of the somatic mutation
REF/ALT Consensus Ref and Alt nucleotide changes of the highest likelihood
REF_MFVdVs/ALT_MFVdVs Reference and Alternative nucleotide changes from each caller; Mutect2 (M), Freebayes (F), Vardict (Vd), Varscan (Vs)
Sample_Name Sample name is extracted based on your labeled samples in the vcf files
FILTER Passed (TRUE) or Reject (FALSE) [boolean] mutation calls from the individual callers
Alt_Allele_Freq Mean Variant allele frequency calculated from the tumor reads of the callers
Depth ref/alt N/T Mean read depth from the N/T sample for ref/alt alleles
SMuRF_score SMuRF confidence score of the predicted mutation


You may also retrieve the time taken for your run.

myresults$time.taken

<!-- Time difference of 20.52405 secs -->

You can check the parsed output used for the prediction:

myresults$smurf_indel$parse_indel

myresults$smurf_snv$parse_snv

Proceed to save the output in your desired formats Example:

a<- myresults$smurf_indel$stats_indel
write.table(a , file = "indel-stats.txt", sep = "\t", quote = FALSE, row.names = TRUE, na = ".")

a<- myresults$smurf_indel$predicted_indel
write.table(a , file = "indel-predicted.txt", sep = "\t", quote = FALSE, row.names = FALSE, na = ".")

a<- myresults$smurf_indel$parse_indel
write.table(a , file = "indel-parse.txt", sep = "\t", quote = FALSE, row.names = FALSE, na = ".")

a<- myresults$smurf_snv$stats_snv
write.table(a , file = "snv-stats.txt", sep = "\t", quote = FALSE, row.names = TRUE, na = ".")

a<- myresults$smurf_snv$predicted_snv
write.table(a , file = "snv-predicted.txt", sep = "\t", quote = FALSE, row.names = FALSE, na = ".")

a<- myresults$smurf_snv$parse_snv
write.table(a , file = "snv-parse.txt", sep = "\t", quote = FALSE, row.names = FALSE, na = ".")

a<- myresults$smurf_snv_annotation$annotated
if(!is.null(a)){
  write.table(a , file = "snv-annotated.txt", sep = "\t", quote = FALSE, row.names = FALSE, na = ".")
}

a<- myresults$smurf_indel_annotation$annotated
if(!is.null(a)){
  write.table(a , file = "indel-annotated.txt", sep = "\t", quote = FALSE, row.names = FALSE, na = ".")
}

a<- myresults$time.taken
write(a, file = "time.txt")

As an additional function, we have added gene annotations using SnpEff (from bcbio) and SMuRF extracts the coding annotations from the canonical transcripts with the highest impact for your convenience:

myresults <- smurf(mydir, "cdsannotation") #runs SMuRF for SNV and indels + generate annotations

You may check the output files generated by the test samples in this section to the expected results we provided located in the results folder https://github.com/skandlab/SMuRF/tree/master/test/results.


Running on multiple samples

Use our R package to efficiently do somatic mutation predictions on multiple matched tumor-normal samples by providing the list of directories of where your sample files are located.

#Example

sample_directories <- list("my/dir/sample_A", "my/dir/sample_B", "my/dir/sample_C")

myresults <- list()

for(i in 1:length(sample_directories))
 {
 myresults[[i]] <- smurf(sample_directories[i], "combined")
 } 
 
#myresults[[1]]$time.taken
#Time difference of 9.973997 secs

#myresults[[2]]$time.taken
#Time difference of 11.1712 secs

#myresults[[3]]$time.taken
#Time difference of 15.18325 secs

For errors and bugs, please report on our Github page.