maftools : QuickLook

Author

Qinqin Zha

Published

January 26, 2025

Generating MAF files

For VCF files or simple tabular files, easy option is to use vcf2maf utility which will annotate VCFs, prioritize transcripts, and generates an MAF. Recent updates to gatk has also enabled
If you’re using ANNOVAR for variant annotations, maftools has a handy function annovarToMaf for converting tabular annovar outputs to MAF.

MAF field requirements

MAF files contain many fields ranging from chromosome names to cosmic annotations. However most of the analysis in maftools uses following fields.

Mandatory fields: Hugo_Symbol, Chromosome, Start_Position, End_Position, Reference_Allele, Tumor_Seq_Allele2, Variant_Classification, Variant_Type and Tumor_Sample_Barcode.

Installation

if (!require("BiocManager"))
    install.packages("BiocManager")
BiocManager::install("maftools")

Reading and summarizing maf files

Required input files

an MAF file - can be gz compressed. Required.
an optional but recommended clinical data associated with each sample/Tumor_Sample_Barcode in MAF.
an optional copy number data if available. Can be GISTIC output or a custom table containing sample names, gene names and copy-number status (Amp or Del).

Reading MAF files.

read.maf function reads MAF files, summarizes it in various ways and stores it as an MAF object. Even though MAF file is alone enough, it is recommended to provide annotations associated with samples in MAF. One can also integrate copy number data if available.

library(maftools)

#path to TCGA LAML MAF file
laml.maf = system.file('extdata', 'tcga_laml.maf.gz', package = 'maftools') 
#clinical information containing survival information and histology. This is optional
laml.clin = system.file('extdata', 'tcga_laml_annot.tsv', package = 'maftools') 

laml = read.maf(maf = laml.maf, clinicalData = laml.clin)

-Reading
-Validating
-Silent variants: 475 
-Summarizing
-Processing clinical data
-Finished in 0.220s elapsed (0.190s cpu)

#Wtihout Clinical data 
brca <- system.file("extdata", "brca.maf.gz", package = "maftools")
brca = read.maf(maf = brca, verbose = FALSE)

maf_data_view <- laml@data
maf_gene.summary <- laml@gene.summary

Collect MAF data from TCGA

https://portal.gdc.cancer.gov/

library(TCGAbiolinks)
GDCprojects <- getGDCprojects()
summary <- TCGAbiolinks:::getProjectSummary("TCGA-CHOL")

query.maf.hg19 <- GDCquery(project = "TCGA-CHOL", 
                           data.category = "Simple Nucleotide Variation", 
                           
                           access = "open"
                           )

--------------------------------------

o GDCquery: Searching in GDC database

--------------------------------------

Genome of reference: hg38

--------------------------------------------

oo Accessing GDC. This might take a while...

--------------------------------------------

ooo Project: TCGA-CHOL

--------------------

oo Filtering results

--------------------

ooo By access

----------------

oo Checking data

----------------

ooo Checking if there are duplicated cases

ooo Checking if there are results for the query

-------------------

o Preparing output

-------------------

                           #legacy = TRUE

GDCdownload(query.maf.hg19)

Downloading data for project TCGA-CHOL

Of the 51 files for download 51 already exist.

All samples have been already downloaded

maf_TCGA_data <- GDCprepare(query.maf.hg19)

maf_TCGA <- read.maf(maf_TCGA_data,isTCGA = TRUE)

-Validating
-Silent variants: 927 
-Summarizing
--Possible FLAGS among top ten genes:
  MUC16
  MUC5B
-Processing clinical data
--Missing clinical data
-Finished in 0.170s elapsed (0.170s cpu)

MAF object

Summarized MAF file is stored as an MAF object. MAF object contains main maf file, summarized data and any associated sample annotations.

There are accessor methods to access the useful slots from MAF object.

#Typing laml shows basic summary of MAF file.
laml

An object of class  MAF 
                   ID          summary  Mean Median
               <char>           <char> <num>  <num>
 1:        NCBI_Build               37    NA     NA
 2:            Center genome.wustl.edu    NA     NA
 3:           Samples              193    NA     NA
 4:            nGenes             1241    NA     NA
 5:   Frame_Shift_Del               52 0.269      0
 6:   Frame_Shift_Ins               91 0.472      0
 7:      In_Frame_Del               10 0.052      0
 8:      In_Frame_Ins               42 0.218      0
 9: Missense_Mutation             1342 6.953      7
10: Nonsense_Mutation              103 0.534      0
11:       Splice_Site               92 0.477      0
12:             total             1732 8.974      9

#Shows sample summry.
getSampleSummary(laml)
#Shows gene summary.
getGeneSummary(laml)
#shows clinical data associated with samples
getClinicalData(laml)
#Shows all fields in MAF
getFields(laml)
#Writes maf summary to an output file with basename laml.
write.mafSummary(maf = laml, basename = 'laml')

Visualization

Plotting MAF summary.

We can use plotmafSummary to plot the summary of the maf file, which displays number of variants in each sample as a stacked barplot and variant types as a boxplot summarized by Variant_Classification.

plotmafSummary(maf = laml, rmOutlier = TRUE, addStat = 'median', dashboard = TRUE, titvRaw = FALSE)

mafbarplot(maf = laml)

Use mafbarplot for a minimal barplot of mutated genes.

Oncoplots

Drawing oncoplots

Better representation of maf file can be shown as oncoplots, also known as waterfall plots.

#oncoplot for top ten mutated genes.
oncoplot(maf = laml, top = 10)

NOTE: Variants annotated as Multi_Hit are those genes which are mutated more than once in the same sample.

For more details on customisation see the Customizing oncoplots vignette.

Transition and Transversions.

titv function classifies SNPs into Transitions and Transversions and returns a list of summarized tables in various ways. Summarized data can also be visualized as a boxplot showing overall distribution of six different conversions and as a stacked barplot showing fraction of conversions in each sample.

laml.titv = titv(maf = laml, plot = FALSE, useSyn = TRUE)

#plot titv summary
plotTiTv(res = laml.titv)

Lollipop plots for amino acid changes

lollipopPlot function requires us to have amino acid changes information in the maf file. However MAF files have no clear guidelines on naming the field for amino acid changes, with different studies having different field (or column) names for amino acid changes. By default, lollipopPlot looks for column AAChange, and if its not found in the MAF file, it prints all available fields with a warning message. For below example, MAF file contains amino acid changes under a field/column name ‘Protein_Change’. We will manually specify this using argument AACol.

By default lollipopPlot uses the longest isoform of the gene.

#lollipop plot for DNMT3A, which is one of the most frequent mutated gene in Leukemia.
lollipopPlot(
  maf = laml,
  gene = 'DNMT3A',
  AACol = 'Protein_Change',
  showMutationRate = TRUE,
  labelPos = 882
)

3 transcripts available. Use arguments refSeqID or proteinID to manually specify tx name.

     HGNC refseq.ID protein.ID aa.length
   <char>    <char>     <char>     <num>
1: DNMT3A NM_022552  NP_072046       912
2: DNMT3A NM_153759  NP_715640       723
3: DNMT3A NM_175629  NP_783328       912

Using longer transcript NM_022552 for now.

Removed 3 mutations for which AA position was not available

General protein domains can be drawn with the function plotProtein

plotProtein(gene = "TP53", refSeqID = "NM_000546")

Rainfall plots

Cancer genomes, especially solid tumors are characterized by genomic loci with localized hyper-mutations 5. Such hyper mutated genomic regions can be visualized by plotting inter variant distance on a linear genomic scale. These plots generally called rainfall plots and we can draw such plots using rainfallPlot. If detectChangePoints is set to TRUE, rainfall plot also highlights regions where potential changes in inter-event distances are located.

brca <- system.file("extdata", "brca.maf.gz", package = "maftools")
brca = read.maf(maf = brca, verbose = FALSE)

rainfallPlot(maf = brca, detectChangePoints = TRUE, pointSize = 0.8)

rainfallPlot(maf = laml, detectChangePoints = TRUE, pointSize = 0.8)

“Kataegis” are defined as those genomic segments containing six or more consecutive mutations with an average inter-mutation distance of less than or equal to 1,00 bp 5.

Compare mutation load against TCGA cohorts

tcgaCompare uses mutation load from TCGA MC3 for comparing muttaion burden against 33 TCGA cohorts. Plot generated is similar to the one described in Alexandrov et al 5.

name list of cohort = https://www.cancer.gov/research/key-initiatives/ras/ras-central/blog/2017/kras.pdf

brca.mutload = tcgaCompare(maf = brca, cohortName = 'Example-BRCA', logscale = TRUE, capture_size = 50)

Warning in FUN(X[[i]], ...): Removed 0 samples with zero mutations.

Plotting VAF

This function plots Variant Allele Frequencies as a boxplot which quickly helps to estimate clonal status of top mutated genes (clonal genes usually have mean allele frequency around ~50% assuming pure sample)

https://www.biostars.org/p/361558/

plotVaf(maf = laml, vafCol = 'i_TumorVAF_WU')

Processing copy-number data

Reading and summarizing gistic output files.

We can summarize output files generated by GISTIC programme. As mentioned earlier, we need four files that are generated by GISTIC, i.e, all_lesions.conf_XX.txt, amp_genes.conf_XX.txt, del_genes.conf_XX.txt and scores.gistic, where XX is the confidence level. See GISTIC documentation for details.

https://www.genepattern.org/doc/GISTIC_2.0/2.0.23/GISTICDocumentation_standalone.htm

Genomic Identification of Significant Targets in Cancer, version 2.0

We describe methods with enhanced power and specificity to identify genes targeted by somatic copy-number alterations (SCNAs) that drive cancer growth

A common approach to identifying drivers is to study large collections of cancer samples, on the notion that regions containing driver events should be altered at higher frequencies than regions containing only passengers [4,6,7,9-14]. For example, we developed an algorithm, GISTIC (Genomic Identification of Significant Targets in Cancer) [15], that identifies likely driver SCNAs by evaluating the frequency and amplitude of observed events. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3218867/

all.lesions <- system.file("extdata", "all_lesions.conf_99.txt", package = "maftools")
amp.genes <- system.file("extdata", "amp_genes.conf_99.txt", package = "maftools")
del.genes <- system.file("extdata", "del_genes.conf_99.txt", package = "maftools")
scores.gis <- system.file("extdata", "scores.gistic", package = "maftools")

laml.gistic = readGistic(gisticAllLesionsFile = all.lesions, gisticAmpGenesFile = amp.genes, gisticDelGenesFile = del.genes, gisticScoresFile = scores.gis, isTCGA = TRUE)

-Processing Gistic files..
--Processing amp_genes.conf_99.txt
--Processing del_genes.conf_99.txt
--Processing scores.gistic
--Summarizing by samples

#GISTIC object
laml.gistic

An object of class  GISTIC 
          ID summary
      <char>   <num>
1:   Samples     191
2:    nGenes    2622
3: cytoBands      16
4:       Amp     388
5:       Del   26481
6:     total   26869

Similar to MAF objects, there are methods available to access slots of GISTIC object - getSampleSummary, getGeneSummary and getCytoBandSummary. Summarized results can be written to output files using function write.GisticSummary.

Visualizing gistic results.

There are three types of plots available to visualize gistic results.

genome plot

Red=amplification blue =deletion

gisticChromPlot(gistic = laml.gistic, markBands = "all")

Bubble plot

gisticBubblePlot(gistic = laml.gistic)

oncoplot

This is similar to oncoplots except for copy number data. One can again sort the matrix according to annotations, if any. Below plot is the gistic results for LAML, sorted according to FAB classification. Plot shows that 7q deletions are virtually absent in M4 subtype where as it is widespread in other subtypes.

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3612855/

The World Health Organization (WHO) classifies acute myeloid leukemia (AML) via genetic, immunophenotypic, biological, and clinical features. Still, “AML, not otherwise specified (NOS)” is further subdivided based on morphologic criteria similar to those of the French-American-British (FAB) classification. FAB M0 was independently associated with significantly lower likelihood of achieving complete remission and inferior relapse-free and overall survival as compared with FAB M1, M2, M4, M5, and M6, with inconclusive data regarding M7.

gisticOncoPlot(gistic = laml.gistic, clinicalData = getClinicalData(x = laml), clinicalFeatures = 'FAB_classification', sortByAnnotation = TRUE, top = 10)

Visualizing CBS segments

https://cran.r-project.org/web/packages/PSCBS/vignettes/CBS.pdf

The Circular Binary Segmentation (CBS) method partitions a genome into segments of constant total copy numbers (TCNs) based on DNA microarray data.

https://jeremy9959.net/Blog/cbs-fixed/

Circular Binary Segmentation is an algorithm for finding changepoints in sequential data, and in particular for identifying changes in copy number from CGH or other types of genomic data. The algorithm is described in the 2004 paper Circular Binary segmentation for the analysis of array-based DNA copy number data. It is implemented in the R package DNAcopy, which is widely used in tools for the analysis of copy number in genomics. For example, both the gingko and cnvkit packages ultimately refer the segmentation of the count data that they generate from sequences to to DNAcopy, and thus to the CBS algorithm.

tcga.ab.009.seg <- system.file("extdata", "TCGA.AB.3009.hg19.seg.txt", package = "maftools")
plotCBSsegments(cbsFile = tcga.ab.009.seg)

No 'tsb' specified, plot head 1 sample. Set tsb='ALL' to plot all samples.

NULL

Analysis

Somatic Interactions

Mutually exclusive or co-occurring set of genes can be detected using somaticInteractions function, which performs pair-wise Fisher’s Exact test to detect such significant pair of genes.

https://medlineplus.gov/genetics/gene/asxl1/ https://medlineplus.gov/genetics/gene/runx1/ https://medlineplus.gov/genetics/gene/flt3/ https://medlineplus.gov/genetics/gene/tp53/

The ASXL1 gene provides instructions for making a protein that is involved in a process known as chromatin remodeling. Chromatin is the complex of DNA and proteins that packages DNA into chromosomes. The structure of chromatin can be changed (remodeled) to alter how tightly DNA is packaged. When DNA is tightly packed, gene activity (expression) is lower than when DNA is loosely packed.

The RUNX1 gene provides instructions for making a protein called runt-related transcription factor 1 (RUNX1). Like other transcription factors, the RUNX1 protein attaches (binds) to specific regions of DNA and helps control the activity of particular genes.

The FLT3 gene provides instructions for making a protein called fms-like tyrosine kinase 3 (FLT3), which is part of a family of proteins called receptor tyrosine kinases (RTKs). Receptor tyrosine kinases transmit signals from the cell surface into the cell through a process called signal transduction.

The TP53 gene provides instructions for making a protein called tumor protein p53 (or p53). This protein acts as a tumor suppressor, which means that it regulates cell division by keeping cells from growing and dividing (proliferating) too fast or in an uncontrolled way.

#exclusive/co-occurance event analysis on top 10 mutated genes. 
somaticInteractions(maf = laml, top = 25, pvalue = c(0.05, 0.1))

      gene1  gene2       pValue oddsRatio    00    01    11    10        pAdj
     <char> <char>        <num>     <num> <int> <int> <int> <int>       <num>
  1:  ASXL1  RUNX1 0.0001541586 55.215541   176    12     4     1 0.003568486
  2:   IDH2  RUNX1 0.0002809928  9.590877   164     9     7    13 0.006055880
  3:   IDH2  ASXL1 0.0004030636 41.077327   172     1     4    16 0.008126283
  4:   FLT3   NPM1 0.0009929836  3.763161   125    16    17    35 0.018664260
  5:   SMC3 DNMT3A 0.0010451985 20.177713   144    42     6     1 0.018664260
 ---                                                                         
296:  PLCE1  ASXL1 1.0000000000  0.000000   184     5     0     4 1.000000000
297:  RAD21  FAM5C 1.0000000000  0.000000   183     5     0     5 1.000000000
298:  PLCE1  FAM5C 1.0000000000  0.000000   184     5     0     4 1.000000000
299:  PLCE1  RAD21 1.0000000000  0.000000   184     5     0     4 1.000000000
300:   EZH2  PLCE1 1.0000000000  0.000000   186     4     0     3 1.000000000
                  Event         pair event_ratio
                 <char>       <char>      <char>
  1:       Co_Occurence ASXL1, RUNX1        4/13
  2:       Co_Occurence  IDH2, RUNX1        7/22
  3:       Co_Occurence  ASXL1, IDH2        4/17
  4:       Co_Occurence   FLT3, NPM1       17/51
  5:       Co_Occurence DNMT3A, SMC3        6/43
 ---                                            
296: Mutually_Exclusive ASXL1, PLCE1         0/9
297: Mutually_Exclusive FAM5C, RAD21        0/10
298: Mutually_Exclusive FAM5C, PLCE1         0/9
299: Mutually_Exclusive PLCE1, RAD21         0/9
300: Mutually_Exclusive  EZH2, PLCE1         0/7

Detecting cancer driver genes based on positional clustering

maftools has a function oncodrive which identifies cancer genes (driver) from a given MAF. oncodrive is a based on algorithm oncodriveCLUST which was originally implemented in Python. Concept is based on the fact that most of the variants in cancer causing genes are enriched at few specific loci (aka hot-spots). This method takes advantage of such positions to identify cancer genes. If you use this function, please cite OncodriveCLUST article 7.

laml.sig = oncodrive(maf = laml, AACol = 'Protein_Change', minMut = 5, pvalMethod = 'zscore')

Warning in oncodrive(maf = laml, AACol = "Protein_Change", minMut = 5,
pvalMethod = "zscore"): Oncodrive has been superseeded by OncodriveCLUSTL. See
http://bg.upf.edu/group/projects/oncodrive-clust.php

head(laml.sig)

   Hugo_Symbol Frame_Shift_Del Frame_Shift_Ins In_Frame_Del In_Frame_Ins
        <char>           <int>           <int>        <int>        <int>
1:        IDH1               0               0            0            0
2:        IDH2               0               0            0            0
3:        NPM1               0              33            0            0
4:        NRAS               0               0            0            0
5:       U2AF1               0               0            0            0
6:         KIT               1               1            0            1
   Missense_Mutation Nonsense_Mutation Splice_Site total MutatedSamples
               <int>             <int>       <int> <num>          <int>
1:                18                 0           0    18             18
2:                20                 0           0    20             20
3:                 1                 0           0    34             33
4:                15                 0           0    15             15
5:                 8                 0           0     8              8
6:                 7                 0           0    10              8
   AlteredSamples clusters muts_in_clusters clusterScores protLen   zscore
            <int>    <int>            <int>         <num>   <int>    <num>
1:             18        1               18     1.0000000     414 5.546154
2:             20        2               20     1.0000000     452 5.546154
3:             33        2               32     0.9411765     294 5.093665
4:             15        2               15     0.9218951     189 4.945347
5:              8        1                7     0.8750000     240 4.584615
6:              8        2                9     0.8500000     976 4.392308
           pval          fdr fract_muts_in_clusters
          <num>        <num>                  <num>
1: 1.460110e-08 1.022077e-07              1.0000000
2: 1.460110e-08 1.022077e-07              1.0000000
3: 1.756034e-07 8.194826e-07              0.9411765
4: 3.800413e-07 1.330144e-06              1.0000000
5: 2.274114e-06 6.367520e-06              0.8750000
6: 5.607691e-06 1.308461e-05              0.9000000

We can plot the results using plotOncodrive.

plotOncodrive(res = laml.sig, fdrCutOff = 0.1, useFraction = TRUE, labelSize = 0.5)

plotOncodrive plots the results as scatter plot with size of the points proportional to the number of clusters found in the gene. X-axis shows number of mutations (or fraction of mutations) observed in these clusters. In the above example, IDH1 has a single cluster and all of the 18 mutations are accumulated within that cluster, giving it a cluster score of one. For details on oncodrive algorithm, please refer to OncodriveCLUST article 7.

Adding and summarizing pfam domains

maftools comes with the function pfamDomains, which adds pfam domain information to the amino acid changes. pfamDomain also summarizes amino acid changes according to the domains that are affected. This serves the purpose of knowing what domain in given cancer cohort, is most frequently affected. This function is inspired from Pfam annotation module from MuSic tool 8.

laml.pfam = pfamDomains(maf = laml, AACol = 'Protein_Change', top = 10)

Warning in pfamDomains(maf = laml, AACol = "Protein_Change", top = 10): Removed
50 mutations for which AA position was not available

#Protein summary (Printing first 7 columns for display convenience)
laml.pfam$proteinSummary[,1:7, with = FALSE]

        HGNC AAPos Variant_Classification     N total  fraction   DomainLabel
      <char> <num>                 <fctr> <int> <num>     <num>        <char>
   1: DNMT3A   882      Missense_Mutation    27    54 0.5000000 AdoMet_MTases
   2:   IDH1   132      Missense_Mutation    18    18 1.0000000      PTZ00435
   3:   IDH2   140      Missense_Mutation    17    20 0.8500000      PTZ00435
   4:   FLT3   835      Missense_Mutation    14    52 0.2692308      PKc_like
   5:   FLT3   599           In_Frame_Ins    10    52 0.1923077      PKc_like
  ---                                                                        
1512: ZNF646   875      Missense_Mutation     1     1 1.0000000          <NA>
1513: ZNF687   554      Missense_Mutation     1     2 0.5000000          <NA>
1514: ZNF687   363      Missense_Mutation     1     2 0.5000000          <NA>
1515: ZNF75D     5      Missense_Mutation     1     1 1.0000000          <NA>
1516: ZNF827   427        Frame_Shift_Del     1     1 1.0000000          <NA>

#Domain summary (Printing first 3 columns for display convenience)
laml.pfam$domainSummary[,1:3, with = FALSE]

       DomainLabel nMuts nGenes
            <char> <int>  <int>
  1:      PKc_like    55      5
  2:      PTZ00435    38      2
  3: AdoMet_MTases    33      1
  4:         7tm_1    24     24
  5:       COG5048    17     17
 ---                           
499:    ribokinase     1      1
500:   rim_protein     1      1
501: sigpep_I_bact     1      1
502:           trp     1      1
503:        zf-BED     1      1

Survival analysis

Survival analysis is an essential part of cohort based sequencing projects. Function mafSurvive performs survival analysis and draws kaplan meier curve by grouping samples based on mutation status of user defined gene(s) or manually provided samples those make up a group. This function requires input data to contain Tumor_Sample_Barcode (make sure they match to those in MAF file), binary event (1/0) and time to event.

Our annotation data already contains survival information and in case you have survival data stored in a separate table provide them via argument clinicalData

Mutation in any given genes

#Survival analysis based on grouping of DNMT3A mutation status
mafSurvival(maf = laml, genes = 'TP53', time = 'days_to_last_followup', Status = 'Overall_Survival_Status', isTCGA = TRUE)

Looking for clinical data in annoatation slot of MAF..

Number of mutated samples for given genes:

TP53 
  15

Removed 11 samples with NA's

Median survival..

    Group medianTime     N
   <char>      <num> <int>
1: Mutant      182.5    14
2:     WT      366.0   168

Predict genesets associated with survival

Identify set of genes which results in poor survival

#Using top 20 mutated genes to identify a set of genes (of size 2) to predict poor prognostic groups
prog_geneset = survGroup(maf = laml, top = 20, geneSetSize = 2, time = "days_to_last_followup", Status = "Overall_Survival_Status", verbose = FALSE)

Removed 11 samples with NA's

print(prog_geneset)

    Gene_combination P_value    hr    WT Mutant
              <char>   <num> <num> <int>  <int>
 1:      FLT3_DNMT3A 0.00104 2.510   164     18
 2:      DNMT3A_SMC3 0.04880 2.220   176      6
 3:      DNMT3A_NPM1 0.07190 1.720   166     16
 4:      DNMT3A_TET2 0.19600 1.780   176      6
 5:        FLT3_TET2 0.20700 1.860   177      5
 6:        NPM1_IDH1 0.21900 0.495   176      6
 7:      DNMT3A_IDH1 0.29300 1.500   173      9
 8:       IDH2_RUNX1 0.31800 1.580   176      6
 9:        FLT3_NPM1 0.53600 1.210   165     17
10:      DNMT3A_IDH2 0.68000 0.747   178      4
11:      DNMT3A_NRAS 0.99200 0.986   178      4

Above results show a combination (N = 2) of genes which are associated with poor survival (P < 0.05). We can draw KM curve for above results with the function mafSurvGroup

mafSurvGroup(maf = laml, geneSet = c("DNMT3A", "FLT3"), time = "days_to_last_followup", Status = "Overall_Survival_Status")

Looking for clinical data in annoatation slot of MAF..

Removed 11 samples with NA's

Median survival..

    Group medianTime     N
   <char>      <num> <int>
1: Mutant      242.5    18
2:     WT      379.5   164

Comparing two cohorts (MAFs)

Cancers differ from each other in terms of their mutation pattern. We can compare two different cohorts to detect such differentially mutated genes. For example, recent article by Madan et. al 9, have shown that patients with relapsed APL (Acute Promyelocytic Leukemia) tends to have mutations in PML and RARA genes, which were absent during primary stage of the disease. This difference between two cohorts (in this case primary and relapse APL) can be detected using function mafComapre, which performs fisher test on all genes between two cohorts to detect differentially mutated genes.

#Primary APL MAF
primary.apl = system.file("extdata", "APL_primary.maf.gz", package = "maftools")
primary.apl = read.maf(maf = primary.apl)
#Relapse APL MAF
relapse.apl = system.file("extdata", "APL_relapse.maf.gz", package = "maftools")
relapse.apl = read.maf(maf = relapse.apl)

#Considering only genes which are mutated in at-least in 5 samples in one of the cohort to avoid bias due to genes mutated in single sample.
pt.vs.rt <- mafCompare(m1 = primary.apl, m2 = relapse.apl, m1Name = 'Primary', m2Name = 'Relapse', minMut = 5)
print(pt.vs.rt)

$results
   Hugo_Symbol Primary Relapse         pval         or       ci.up      ci.low
        <char>   <num>   <int>        <num>      <num>       <num>       <num>
1:         PML       1      11 1.529935e-05 0.03537381   0.2552937 0.000806034
2:        RARA       0       7 2.574810e-04 0.00000000   0.3006159 0.000000000
3:       RUNX1       1       5 1.310500e-02 0.08740567   0.8076265 0.001813280
4:        FLT3      26       4 1.812779e-02 3.56086275  14.7701728 1.149009169
5:      ARID1B       5       8 2.758396e-02 0.26480490   0.9698686 0.064804160
6:         WT1      20      14 2.229087e-01 0.60619329   1.4223101 0.263440988
7:        KRAS       6       1 4.334067e-01 2.88486293 135.5393108 0.337679367
8:        NRAS      15       4 4.353567e-01 1.85209500   8.0373994 0.553883512
9:      ARID1A       7       4 7.457274e-01 0.80869223   3.9297309 0.195710173
        adjPval
          <num>
1: 0.0001376942
2: 0.0011586643
3: 0.0393149868
4: 0.0407875250
5: 0.0496511201
6: 0.3343630535
7: 0.4897762916
8: 0.4897762916
9: 0.7457273717

$SampleSummary
    Cohort SampleSize
    <char>      <num>
1: Primary        124
2: Relapse         58

Forest plots

Above results show two genes PML and RARA which are highly mutated in Relapse APL compared to Primary APL. We can visualize these results as a forestplot.

forestPlot(mafCompareRes = pt.vs.rt, pVal = 0.1)

Co-onco plots

Another alternative way of displaying above results is by plotting two oncoplots side by side. coOncoplot function takes two maf objects and plots them side by side for better comparison.

genes = c("PML", "RARA", "RUNX1", "ARID1B", "FLT3")
coOncoplot(m1 = primary.apl, m2 = relapse.apl, m1Name = 'PrimaryAPL', m2Name = 'RelapseAPL', genes = genes, removeNonMutated = TRUE)

Co-bar plots

coBarplot(m1 = primary.apl, m2 = relapse.apl, m1Name = "Primary", m2Name = "Relapse")

Lollipop plot-2

Along with plots showing cohort wise differences, its also possible to show gene wise differences with lollipopPlot2 function.

lollipopPlot2(m1 = primary.apl, m2 = relapse.apl, gene = "PML", AACol1 = "amino_acid_change", AACol2 = "amino_acid_change", m1_name = "Primary", m2_name = "Relapse")

Clinical enrichment analysis

clinicalEnrichment is another function which takes any clinical feature associated with the samples and performs enrichment analysis. It performs various groupwise and pairwise comparisions to identify enriched mutations for every category within a clincila feature. Below is an example to identify mutations associated with FAB_classification.

fab.ce = clinicalEnrichment(maf = laml, clinicalFeature = 'FAB_classification')

Sample size per factor in FAB_classification:


M0 M1 M2 M3 M4 M5 M6 M7 
19 44 44 21 39 19  3  3 
     Hugo_Symbol Feature_1 Feature_2 n_mutated_Feature1 n_mutated_Feature2
          <char>    <char>    <char>             <char>             <char>
  1:        FLT3        M1        M0           15 of 44            4 of 19
  2:        FLT3        M2        M0            8 of 44            4 of 19
  3:        FLT3        M3        M0            6 of 21            4 of 19
  4:        FLT3        M4        M0           13 of 39            4 of 19
  5:        FLT3        M5        M0            6 of 19            4 of 19
 ---                                                                      
617:         TTN      <NA>      <NA>               <NA>               <NA>
618:         TTN      <NA>      <NA>               <NA>               <NA>
619:         TTN      <NA>      <NA>               <NA>               <NA>
620:         TTN      <NA>      <NA>               <NA>               <NA>
621:         TTN      <NA>      <NA>               <NA>               <NA>
       fdr Analysis Group1 Group2 n_mutated_group1 n_mutated_group2   p_value
     <num>   <char> <char> <char>           <char>           <char>     <num>
  1:     1 Pairwise   <NA>   <NA>             <NA>             <NA>        NA
  2:     1 Pairwise   <NA>   <NA>             <NA>             <NA>        NA
  3:     1 Pairwise   <NA>   <NA>             <NA>             <NA>        NA
  4:     1 Pairwise   <NA>   <NA>             <NA>             <NA>        NA
  5:     1 Pairwise   <NA>   <NA>             <NA>             <NA>        NA
 ---                                                                         
617:    NA    Group     M3   Rest          1 of 21         5 of 172 0.5038411
618:    NA    Group     M4   Rest          2 of 39         4 of 154 0.3497841
619:    NA    Group     M5   Rest          1 of 19         5 of 174 0.4676584
620:    NA    Group     M6   Rest           0 of 3         6 of 190 1.0000000
621:    NA    Group     M7   Rest           0 of 3         6 of 190 1.0000000
           OR     OR_low  OR_high
        <num>      <num>    <num>
  1:       NA         NA       NA
  2:       NA         NA       NA
  3:       NA         NA       NA
  4:       NA         NA       NA
  5:       NA         NA       NA
 ---                             
617: 1.664663 0.03364588 16.00879
618: 2.018253 0.17625631 14.70452
619: 1.869952 0.03763487 18.10979
620: 0.000000 0.00000000 85.42683
621: 0.000000 0.00000000 85.42683

#Results are returned as a list. Significant associations p-value < 0.05
fab.ce$groupwise_comparision[p_value < 0.05]

   Hugo_Symbol Group1 Group2 n_mutated_group1 n_mutated_group2      p_value
        <char> <char> <char>           <char>           <char>        <num>
1:        IDH1     M1   Rest         11 of 44         7 of 149 0.0002597371
2:        TP53     M7   Rest           3 of 3        12 of 190 0.0003857187
3:      DNMT3A     M5   Rest         10 of 19        38 of 174 0.0089427384
4:       CEBPA     M2   Rest          7 of 44         6 of 149 0.0117352110
5:       RUNX1     M0   Rest          5 of 19        11 of 174 0.0117436825
6:        NPM1     M5   Rest          7 of 19        26 of 174 0.0248582372
7:        NPM1     M3   Rest          0 of 21        33 of 172 0.0278630823
8:      DNMT3A     M3   Rest          1 of 21        47 of 172 0.0294005111
         OR      OR_low    OR_high       fdr
      <num>       <num>      <num>     <num>
1: 6.670592 2.173829026 21.9607250 0.0308575
2:      Inf 5.355415451        Inf 0.0308575
3: 3.941207 1.333635173 11.8455979 0.3757978
4: 4.463237 1.204699322 17.1341278 0.3757978
5: 5.216902 1.243812880 19.4051505 0.3757978
6: 3.293201 1.001404899 10.1210509 0.5880102
7: 0.000000 0.000000000  0.8651972 0.5880102
8: 0.133827 0.003146708  0.8848897 0.5880102

Above results shows IDH1 mutations are enriched in M1 subtype of leukemia compared to rest of the cohort. Similarly DNMT3A is in M5, RUNX1 is in M0, and so on. These are well known results and this function effectively recaptures them. One can use any sort of clincial feature to perform such an analysis. There is also a small function - plotEnrichmentResults which can be used to plot these results.

plotEnrichmentResults(enrich_res = fab.ce, pVal = 0.05, geneFontSize = 0.5, annoFontSize = 1)

Drug-Gene Interactions

drugInteractions function checks for drug–gene interactions and gene druggability information compiled from Drug Gene Interaction database.

dgi = drugInteractions(maf = laml, fontSize = 0.75)

dgi.brca = drugInteractions(maf = brca, fontSize = 0.75)

Above plot shows potential druggable gene categories along with upto top 5 genes involved in them. One can also extract information on drug-gene interactions. For example below is the results for known/reported drugs to interact with DNMT3A.

dnmt3a.dgi = drugInteractions(genes = "DNMT3A", drugs = TRUE)

Number of claimed drugs for given genes:
     Gene     N
   <char> <int>
1: DNMT3A     7

#Printing selected columns.
dnmt3a.dgi[,.(Gene, interaction_types, drug_name, drug_claim_name)]

     Gene interaction_types    drug_name drug_claim_name
   <char>            <char>       <char>          <char>
1: DNMT3A                                            N/A
2: DNMT3A                   DAUNORUBICIN    Daunorubicin
3: DNMT3A                     DECITABINE      Decitabine
4: DNMT3A                     IDARUBICIN      IDARUBICIN
5: DNMT3A                     DECITABINE      DECITABINE
6: DNMT3A         inhibitor   DECITABINE   CHEMBL1201129
7: DNMT3A         inhibitor  AZACITIDINE      CHEMBL1489

Oncogenic Signaling Pathways

OncogenicPathways function checks for enrichment of known Oncogenic Signaling Pathways in TCGA cohorts

Its also possible to visualize complete pathway.

Tumor suppressor genes are in red, and oncogenes are in blue font.

Tumor heterogeneity and MATH scores

Heterogeneity in tumor samples

Tumors are generally heterogeneous i.e, consist of multiple clones. This heterogeneity can be inferred by clustering variant allele frequencies. inferHeterogeneity function uses vaf information to cluster variants (using mclust), to infer clonality. By default, inferHeterogeneity function looks for column t_vaf containing vaf information. However, if the field name is different from t_vaf, we can manually specify it using argument vafCol. For example, in this case study vaf is stored under the field name i_TumorVAF_WU.

#Heterogeneity in sample TCGA.AB.2972
library("mclust")

Package 'mclust' version 6.1.1
Type 'citation("mclust")' for citing this R package in publications.

tcga.ab.2972.het = inferHeterogeneity(maf = laml, tsb = 'TCGA-AB-2972', vafCol = 'i_TumorVAF_WU')

Processing TCGA-AB-2972..

print(tcga.ab.2972.het$clusterMeans)

   Tumor_Sample_Barcode cluster   meanVaf
                 <fctr>  <char>     <num>
1:         TCGA-AB-2972       2 0.4496571
2:         TCGA-AB-2972       1 0.2454750
3:         TCGA-AB-2972 outlier 0.3695000

#Visualizing results
plotClusters(clusters = tcga.ab.2972.het)

Above figure shows clear separation of two clones clustered at mean variant allele frequencies of ~45% (major clone) and another minor clone at variant allele frequency of ~25%.

Although clustering of variant allele frequencies gives us a fair idea on heterogeneity, it is also possible to measure the extent of heterogeneity in terms of a numerical value. MATH score (mentioned as a subtitle in above plot) is a simple quantitative measure of intra-tumor heterogeneity, which calculates the width of the vaf distribution. Higher MATH scores are found to be associated with poor outcome. MATH score can also be used a proxy variable for survival analysis 11.

Ignoring variants in copy number altered regions

We can use copy number information to ignore variants located on copy-number altered regions. Copy number alterations results in abnormally high/low variant allele frequencies, which tends to affect clustering. Removing such variants improves clustering and density estimation while retaining biologically meaningful results. Copy number information can be provided as a segmented file generated from segmentation programmes, such as Circular Binary Segmentation from “DNACopy” Bioconductor package 6.

seg = system.file('extdata', 'TCGA.AB.3009.hg19.seg.txt', package = 'maftools')
tcga.ab.3009.het = inferHeterogeneity(maf = laml, tsb = 'TCGA-AB-3009', segFile = seg, vafCol = 'i_TumorVAF_WU')

Processing TCGA-AB-3009..

Removed 1 variants with no copy number data.

   Hugo_Symbol Chromosome Start_Position End_Position Tumor_Sample_Barcode
        <char>     <char>          <num>        <num>               <fctr>
1:        PHF6         23      133551224    133551224         TCGA-AB-3009
       t_vaf Segment_Start Segment_End Segment_Mean    CN
       <num>         <int>       <int>        <num> <num>
1: 0.9349112            NA          NA           NA    NA

Copy number altered variants:

   Hugo_Symbol Chromosome Start_Position End_Position Tumor_Sample_Barcode
        <char>     <char>          <num>        <num>               <fctr>
1:     NFKBIL2          8      145668658    145668658         TCGA-AB-3009
2:         NF1         17       29562981     29562981         TCGA-AB-3009
3:       SUZ12         17       30293198     30293198         TCGA-AB-3009
       t_vaf Segment_Start Segment_End Segment_Mean       CN    cluster
       <num>         <int>       <int>        <num>    <num>     <char>
1: 0.4415584     145232496   145760746       0.3976 2.634629 CN_altered
2: 0.8419000      29054355    30363868      -0.9157 1.060173 CN_altered
3: 0.8958333      29054355    30363868      -0.9157 1.060173 CN_altered

#Visualizing results. Highlighting those variants on copynumber altered variants.
plotClusters(clusters = tcga.ab.3009.het, genes = 'CN_altered', showCNvars = TRUE)

Above figure shows two genes NF1 and SUZ12 with high VAF’s, which is due to copy number alterations (deletion). Those two genes are ignored from analysis.