if (!require("BiocManager"))
install.packages("BiocManager")
BiocManager::install("maftools")maftools : QuickLook
Generating MAF files
For VCF files or simple tabular files, easy option is to use vcf2maf utility which will annotate VCFs, prioritize transcripts, and generates an MAF. Recent updates to gatk has also enabled
If you’re using ANNOVAR for variant annotations, maftools has a handy function
annovarToMaffor converting tabular annovar outputs to MAF.
MAF field requirements
MAF files contain many fields ranging from chromosome names to cosmic annotations. However most of the analysis in maftools uses following fields.
- Mandatory fields: Hugo_Symbol, Chromosome, Start_Position, End_Position, Reference_Allele, Tumor_Seq_Allele2, Variant_Classification, Variant_Type and Tumor_Sample_Barcode.
Installation
Reading and summarizing maf files
Required input files
- an MAF file - can be gz compressed. Required.
- an optional but recommended clinical data associated with each sample/Tumor_Sample_Barcode in MAF.
- an optional copy number data if available. Can be GISTIC output or a custom table containing sample names, gene names and copy-number status (
AmporDel).
Reading MAF files.
read.maf function reads MAF files, summarizes it in various ways and stores it as an MAF object. Even though MAF file is alone enough, it is recommended to provide annotations associated with samples in MAF. One can also integrate copy number data if available.
library(maftools)#path to TCGA LAML MAF file
laml.maf = system.file('extdata', 'tcga_laml.maf.gz', package = 'maftools')
#clinical information containing survival information and histology. This is optional
laml.clin = system.file('extdata', 'tcga_laml_annot.tsv', package = 'maftools')
laml = read.maf(maf = laml.maf, clinicalData = laml.clin)-Reading
-Validating
-Silent variants: 475
-Summarizing
-Processing clinical data
-Finished in 0.220s elapsed (0.190s cpu)
#Wtihout Clinical data
brca <- system.file("extdata", "brca.maf.gz", package = "maftools")
brca = read.maf(maf = brca, verbose = FALSE)
maf_data_view <- laml@data
maf_gene.summary <- laml@gene.summaryCollect MAF data from TCGA
https://portal.gdc.cancer.gov/
library(TCGAbiolinks)
GDCprojects <- getGDCprojects()
summary <- TCGAbiolinks:::getProjectSummary("TCGA-CHOL")
query.maf.hg19 <- GDCquery(project = "TCGA-CHOL",
data.category = "Simple Nucleotide Variation",
access = "open"
)--------------------------------------
o GDCquery: Searching in GDC database
--------------------------------------
Genome of reference: hg38
--------------------------------------------
oo Accessing GDC. This might take a while...
--------------------------------------------
ooo Project: TCGA-CHOL
--------------------
oo Filtering results
--------------------
ooo By access
----------------
oo Checking data
----------------
ooo Checking if there are duplicated cases
ooo Checking if there are results for the query
-------------------
o Preparing output
-------------------
#legacy = TRUE
GDCdownload(query.maf.hg19)Downloading data for project TCGA-CHOL
Of the 51 files for download 51 already exist.
All samples have been already downloaded
maf_TCGA_data <- GDCprepare(query.maf.hg19)
maf_TCGA <- read.maf(maf_TCGA_data,isTCGA = TRUE)-Validating
-Silent variants: 927
-Summarizing
--Possible FLAGS among top ten genes:
MUC16
MUC5B
-Processing clinical data
--Missing clinical data
-Finished in 0.170s elapsed (0.170s cpu)
MAF object
Summarized MAF file is stored as an MAF object. MAF object contains main maf file, summarized data and any associated sample annotations.
There are accessor methods to access the useful slots from MAF object.
#Typing laml shows basic summary of MAF file.
lamlAn object of class MAF
ID summary Mean Median
<char> <char> <num> <num>
1: NCBI_Build 37 NA NA
2: Center genome.wustl.edu NA NA
3: Samples 193 NA NA
4: nGenes 1241 NA NA
5: Frame_Shift_Del 52 0.269 0
6: Frame_Shift_Ins 91 0.472 0
7: In_Frame_Del 10 0.052 0
8: In_Frame_Ins 42 0.218 0
9: Missense_Mutation 1342 6.953 7
10: Nonsense_Mutation 103 0.534 0
11: Splice_Site 92 0.477 0
12: total 1732 8.974 9
#Shows sample summry.
getSampleSummary(laml)
#Shows gene summary.
getGeneSummary(laml)
#shows clinical data associated with samples
getClinicalData(laml)
#Shows all fields in MAF
getFields(laml)
#Writes maf summary to an output file with basename laml.
write.mafSummary(maf = laml, basename = 'laml')Visualization
Plotting MAF summary.
We can use plotmafSummary to plot the summary of the maf file, which displays number of variants in each sample as a stacked barplot and variant types as a boxplot summarized by Variant_Classification.
plotmafSummary(maf = laml, rmOutlier = TRUE, addStat = 'median', dashboard = TRUE, titvRaw = FALSE)mafbarplot(maf = laml)Use mafbarplot for a minimal barplot of mutated genes.
Oncoplots
Drawing oncoplots
Better representation of maf file can be shown as oncoplots, also known as waterfall plots.
#oncoplot for top ten mutated genes.
oncoplot(maf = laml, top = 10)NOTE: Variants annotated as Multi_Hit are those genes which are mutated more than once in the same sample.
For more details on customisation see the Customizing oncoplots vignette.
Transition and Transversions.
titv function classifies SNPs into Transitions and Transversions and returns a list of summarized tables in various ways. Summarized data can also be visualized as a boxplot showing overall distribution of six different conversions and as a stacked barplot showing fraction of conversions in each sample.
laml.titv = titv(maf = laml, plot = FALSE, useSyn = TRUE)
#plot titv summary
plotTiTv(res = laml.titv)Lollipop plots for amino acid changes
lollipopPlot function requires us to have amino acid changes information in the maf file. However MAF files have no clear guidelines on naming the field for amino acid changes, with different studies having different field (or column) names for amino acid changes. By default, lollipopPlot looks for column AAChange, and if its not found in the MAF file, it prints all available fields with a warning message. For below example, MAF file contains amino acid changes under a field/column name ‘Protein_Change’. We will manually specify this using argument AACol.
By default lollipopPlot uses the longest isoform of the gene.
#lollipop plot for DNMT3A, which is one of the most frequent mutated gene in Leukemia.
lollipopPlot(
maf = laml,
gene = 'DNMT3A',
AACol = 'Protein_Change',
showMutationRate = TRUE,
labelPos = 882
)3 transcripts available. Use arguments refSeqID or proteinID to manually specify tx name.
HGNC refseq.ID protein.ID aa.length
<char> <char> <char> <num>
1: DNMT3A NM_022552 NP_072046 912
2: DNMT3A NM_153759 NP_715640 723
3: DNMT3A NM_175629 NP_783328 912
Using longer transcript NM_022552 for now.
Removed 3 mutations for which AA position was not available
General protein domains can be drawn with the function plotProtein
plotProtein(gene = "TP53", refSeqID = "NM_000546")Rainfall plots
Cancer genomes, especially solid tumors are characterized by genomic loci with localized hyper-mutations 5. Such hyper mutated genomic regions can be visualized by plotting inter variant distance on a linear genomic scale. These plots generally called rainfall plots and we can draw such plots using rainfallPlot. If detectChangePoints is set to TRUE, rainfall plot also highlights regions where potential changes in inter-event distances are located.
brca <- system.file("extdata", "brca.maf.gz", package = "maftools")
brca = read.maf(maf = brca, verbose = FALSE)
rainfallPlot(maf = brca, detectChangePoints = TRUE, pointSize = 0.8)rainfallPlot(maf = laml, detectChangePoints = TRUE, pointSize = 0.8)“Kataegis” are defined as those genomic segments containing six or more consecutive mutations with an average inter-mutation distance of less than or equal to 1,00 bp 5.
Compare mutation load against TCGA cohorts
tcgaCompare uses mutation load from TCGA MC3 for comparing muttaion burden against 33 TCGA cohorts. Plot generated is similar to the one described in Alexandrov et al 5.
name list of cohort = https://www.cancer.gov/research/key-initiatives/ras/ras-central/blog/2017/kras.pdf
brca.mutload = tcgaCompare(maf = brca, cohortName = 'Example-BRCA', logscale = TRUE, capture_size = 50)Warning in FUN(X[[i]], ...): Removed 0 samples with zero mutations.
Plotting VAF
This function plots Variant Allele Frequencies as a boxplot which quickly helps to estimate clonal status of top mutated genes (clonal genes usually have mean allele frequency around ~50% assuming pure sample)
https://www.biostars.org/p/361558/
plotVaf(maf = laml, vafCol = 'i_TumorVAF_WU')Processing copy-number data
Reading and summarizing gistic output files.
We can summarize output files generated by GISTIC programme. As mentioned earlier, we need four files that are generated by GISTIC, i.e, all_lesions.conf_XX.txt, amp_genes.conf_XX.txt, del_genes.conf_XX.txt and scores.gistic, where XX is the confidence level. See GISTIC documentation for details.
https://www.genepattern.org/doc/GISTIC_2.0/2.0.23/GISTICDocumentation_standalone.htm
Genomic Identification of Significant Targets in Cancer, version 2.0
We describe methods with enhanced power and specificity to identify genes targeted by somatic copy-number alterations (SCNAs) that drive cancer growth
A common approach to identifying drivers is to study large collections of cancer samples, on the notion that regions containing driver events should be altered at higher frequencies than regions containing only passengers [4,6,7,9-14]. For example, we developed an algorithm, GISTIC (Genomic Identification of Significant Targets in Cancer) [15], that identifies likely driver SCNAs by evaluating the frequency and amplitude of observed events. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3218867/
all.lesions <- system.file("extdata", "all_lesions.conf_99.txt", package = "maftools")
amp.genes <- system.file("extdata", "amp_genes.conf_99.txt", package = "maftools")
del.genes <- system.file("extdata", "del_genes.conf_99.txt", package = "maftools")
scores.gis <- system.file("extdata", "scores.gistic", package = "maftools")
laml.gistic = readGistic(gisticAllLesionsFile = all.lesions, gisticAmpGenesFile = amp.genes, gisticDelGenesFile = del.genes, gisticScoresFile = scores.gis, isTCGA = TRUE)-Processing Gistic files..
--Processing amp_genes.conf_99.txt
--Processing del_genes.conf_99.txt
--Processing scores.gistic
--Summarizing by samples
#GISTIC object
laml.gisticAn object of class GISTIC
ID summary
<char> <num>
1: Samples 191
2: nGenes 2622
3: cytoBands 16
4: Amp 388
5: Del 26481
6: total 26869
Similar to MAF objects, there are methods available to access slots of GISTIC object - getSampleSummary, getGeneSummary and getCytoBandSummary. Summarized results can be written to output files using function write.GisticSummary.
Visualizing gistic results.
There are three types of plots available to visualize gistic results.
genome plot
Red=amplification blue =deletion
gisticChromPlot(gistic = laml.gistic, markBands = "all")Bubble plot
gisticBubblePlot(gistic = laml.gistic)oncoplot
This is similar to oncoplots except for copy number data. One can again sort the matrix according to annotations, if any. Below plot is the gistic results for LAML, sorted according to FAB classification. Plot shows that 7q deletions are virtually absent in M4 subtype where as it is widespread in other subtypes.
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3612855/
The World Health Organization (WHO) classifies acute myeloid leukemia (AML) via genetic, immunophenotypic, biological, and clinical features. Still, “AML, not otherwise specified (NOS)” is further subdivided based on morphologic criteria similar to those of the French-American-British (FAB) classification. FAB M0 was independently associated with significantly lower likelihood of achieving complete remission and inferior relapse-free and overall survival as compared with FAB M1, M2, M4, M5, and M6, with inconclusive data regarding M7.
gisticOncoPlot(gistic = laml.gistic, clinicalData = getClinicalData(x = laml), clinicalFeatures = 'FAB_classification', sortByAnnotation = TRUE, top = 10)Visualizing CBS segments
https://cran.r-project.org/web/packages/PSCBS/vignettes/CBS.pdf
The Circular Binary Segmentation (CBS) method partitions a genome into segments of constant total copy numbers (TCNs) based on DNA microarray data.
https://jeremy9959.net/Blog/cbs-fixed/
Circular Binary Segmentation is an algorithm for finding changepoints in sequential data, and in particular for identifying changes in copy number from CGH or other types of genomic data. The algorithm is described in the 2004 paper Circular Binary segmentation for the analysis of array-based DNA copy number data. It is implemented in the R package DNAcopy, which is widely used in tools for the analysis of copy number in genomics. For example, both the gingko and cnvkit packages ultimately refer the segmentation of the count data that they generate from sequences to to DNAcopy, and thus to the CBS algorithm.
tcga.ab.009.seg <- system.file("extdata", "TCGA.AB.3009.hg19.seg.txt", package = "maftools")
plotCBSsegments(cbsFile = tcga.ab.009.seg)No 'tsb' specified, plot head 1 sample. Set tsb='ALL' to plot all samples.
NULL
Analysis
Somatic Interactions
Mutually exclusive or co-occurring set of genes can be detected using somaticInteractions function, which performs pair-wise Fisher’s Exact test to detect such significant pair of genes.
https://medlineplus.gov/genetics/gene/asxl1/ https://medlineplus.gov/genetics/gene/runx1/ https://medlineplus.gov/genetics/gene/flt3/ https://medlineplus.gov/genetics/gene/tp53/
The ASXL1 gene provides instructions for making a protein that is involved in a process known as chromatin remodeling. Chromatin is the complex of DNA and proteins that packages DNA into chromosomes. The structure of chromatin can be changed (remodeled) to alter how tightly DNA is packaged. When DNA is tightly packed, gene activity (expression) is lower than when DNA is loosely packed.
The RUNX1 gene provides instructions for making a protein called runt-related transcription factor 1 (RUNX1). Like other transcription factors, the RUNX1 protein attaches (binds) to specific regions of DNA and helps control the activity of particular genes.
The FLT3 gene provides instructions for making a protein called fms-like tyrosine kinase 3 (FLT3), which is part of a family of proteins called receptor tyrosine kinases (RTKs). Receptor tyrosine kinases transmit signals from the cell surface into the cell through a process called signal transduction.
The TP53 gene provides instructions for making a protein called tumor protein p53 (or p53). This protein acts as a tumor suppressor, which means that it regulates cell division by keeping cells from growing and dividing (proliferating) too fast or in an uncontrolled way.
#exclusive/co-occurance event analysis on top 10 mutated genes.
somaticInteractions(maf = laml, top = 25, pvalue = c(0.05, 0.1)) gene1 gene2 pValue oddsRatio 00 01 11 10 pAdj
<char> <char> <num> <num> <int> <int> <int> <int> <num>
1: ASXL1 RUNX1 0.0001541586 55.215541 176 12 4 1 0.003568486
2: IDH2 RUNX1 0.0002809928 9.590877 164 9 7 13 0.006055880
3: IDH2 ASXL1 0.0004030636 41.077327 172 1 4 16 0.008126283
4: FLT3 NPM1 0.0009929836 3.763161 125 16 17 35 0.018664260
5: SMC3 DNMT3A 0.0010451985 20.177713 144 42 6 1 0.018664260
---
296: PLCE1 ASXL1 1.0000000000 0.000000 184 5 0 4 1.000000000
297: RAD21 FAM5C 1.0000000000 0.000000 183 5 0 5 1.000000000
298: PLCE1 FAM5C 1.0000000000 0.000000 184 5 0 4 1.000000000
299: PLCE1 RAD21 1.0000000000 0.000000 184 5 0 4 1.000000000
300: EZH2 PLCE1 1.0000000000 0.000000 186 4 0 3 1.000000000
Event pair event_ratio
<char> <char> <char>
1: Co_Occurence ASXL1, RUNX1 4/13
2: Co_Occurence IDH2, RUNX1 7/22
3: Co_Occurence ASXL1, IDH2 4/17
4: Co_Occurence FLT3, NPM1 17/51
5: Co_Occurence DNMT3A, SMC3 6/43
---
296: Mutually_Exclusive ASXL1, PLCE1 0/9
297: Mutually_Exclusive FAM5C, RAD21 0/10
298: Mutually_Exclusive FAM5C, PLCE1 0/9
299: Mutually_Exclusive PLCE1, RAD21 0/9
300: Mutually_Exclusive EZH2, PLCE1 0/7
Detecting cancer driver genes based on positional clustering
maftools has a function oncodrive which identifies cancer genes (driver) from a given MAF. oncodrive is a based on algorithm oncodriveCLUST which was originally implemented in Python. Concept is based on the fact that most of the variants in cancer causing genes are enriched at few specific loci (aka hot-spots). This method takes advantage of such positions to identify cancer genes. If you use this function, please cite OncodriveCLUST article 7.
laml.sig = oncodrive(maf = laml, AACol = 'Protein_Change', minMut = 5, pvalMethod = 'zscore')Warning in oncodrive(maf = laml, AACol = "Protein_Change", minMut = 5,
pvalMethod = "zscore"): Oncodrive has been superseeded by OncodriveCLUSTL. See
http://bg.upf.edu/group/projects/oncodrive-clust.php
head(laml.sig) Hugo_Symbol Frame_Shift_Del Frame_Shift_Ins In_Frame_Del In_Frame_Ins
<char> <int> <int> <int> <int>
1: IDH1 0 0 0 0
2: IDH2 0 0 0 0
3: NPM1 0 33 0 0
4: NRAS 0 0 0 0
5: U2AF1 0 0 0 0
6: KIT 1 1 0 1
Missense_Mutation Nonsense_Mutation Splice_Site total MutatedSamples
<int> <int> <int> <num> <int>
1: 18 0 0 18 18
2: 20 0 0 20 20
3: 1 0 0 34 33
4: 15 0 0 15 15
5: 8 0 0 8 8
6: 7 0 0 10 8
AlteredSamples clusters muts_in_clusters clusterScores protLen zscore
<int> <int> <int> <num> <int> <num>
1: 18 1 18 1.0000000 414 5.546154
2: 20 2 20 1.0000000 452 5.546154
3: 33 2 32 0.9411765 294 5.093665
4: 15 2 15 0.9218951 189 4.945347
5: 8 1 7 0.8750000 240 4.584615
6: 8 2 9 0.8500000 976 4.392308
pval fdr fract_muts_in_clusters
<num> <num> <num>
1: 1.460110e-08 1.022077e-07 1.0000000
2: 1.460110e-08 1.022077e-07 1.0000000
3: 1.756034e-07 8.194826e-07 0.9411765
4: 3.800413e-07 1.330144e-06 1.0000000
5: 2.274114e-06 6.367520e-06 0.8750000
6: 5.607691e-06 1.308461e-05 0.9000000
We can plot the results using plotOncodrive.
plotOncodrive(res = laml.sig, fdrCutOff = 0.1, useFraction = TRUE, labelSize = 0.5)plotOncodrive plots the results as scatter plot with size of the points proportional to the number of clusters found in the gene. X-axis shows number of mutations (or fraction of mutations) observed in these clusters. In the above example, IDH1 has a single cluster and all of the 18 mutations are accumulated within that cluster, giving it a cluster score of one. For details on oncodrive algorithm, please refer to OncodriveCLUST article 7.
Adding and summarizing pfam domains
maftools comes with the function pfamDomains, which adds pfam domain information to the amino acid changes. pfamDomain also summarizes amino acid changes according to the domains that are affected. This serves the purpose of knowing what domain in given cancer cohort, is most frequently affected. This function is inspired from Pfam annotation module from MuSic tool 8.
laml.pfam = pfamDomains(maf = laml, AACol = 'Protein_Change', top = 10)Warning in pfamDomains(maf = laml, AACol = "Protein_Change", top = 10): Removed
50 mutations for which AA position was not available
#Protein summary (Printing first 7 columns for display convenience)
laml.pfam$proteinSummary[,1:7, with = FALSE] HGNC AAPos Variant_Classification N total fraction DomainLabel
<char> <num> <fctr> <int> <num> <num> <char>
1: DNMT3A 882 Missense_Mutation 27 54 0.5000000 AdoMet_MTases
2: IDH1 132 Missense_Mutation 18 18 1.0000000 PTZ00435
3: IDH2 140 Missense_Mutation 17 20 0.8500000 PTZ00435
4: FLT3 835 Missense_Mutation 14 52 0.2692308 PKc_like
5: FLT3 599 In_Frame_Ins 10 52 0.1923077 PKc_like
---
1512: ZNF646 875 Missense_Mutation 1 1 1.0000000 <NA>
1513: ZNF687 554 Missense_Mutation 1 2 0.5000000 <NA>
1514: ZNF687 363 Missense_Mutation 1 2 0.5000000 <NA>
1515: ZNF75D 5 Missense_Mutation 1 1 1.0000000 <NA>
1516: ZNF827 427 Frame_Shift_Del 1 1 1.0000000 <NA>
#Domain summary (Printing first 3 columns for display convenience)
laml.pfam$domainSummary[,1:3, with = FALSE] DomainLabel nMuts nGenes
<char> <int> <int>
1: PKc_like 55 5
2: PTZ00435 38 2
3: AdoMet_MTases 33 1
4: 7tm_1 24 24
5: COG5048 17 17
---
499: ribokinase 1 1
500: rim_protein 1 1
501: sigpep_I_bact 1 1
502: trp 1 1
503: zf-BED 1 1
Survival analysis
Survival analysis is an essential part of cohort based sequencing projects. Function mafSurvive performs survival analysis and draws kaplan meier curve by grouping samples based on mutation status of user defined gene(s) or manually provided samples those make up a group. This function requires input data to contain Tumor_Sample_Barcode (make sure they match to those in MAF file), binary event (1/0) and time to event.
Our annotation data already contains survival information and in case you have survival data stored in a separate table provide them via argument clinicalData
Mutation in any given genes
#Survival analysis based on grouping of DNMT3A mutation status
mafSurvival(maf = laml, genes = 'TP53', time = 'days_to_last_followup', Status = 'Overall_Survival_Status', isTCGA = TRUE)Looking for clinical data in annoatation slot of MAF..
Number of mutated samples for given genes:
TP53
15
Removed 11 samples with NA's
Median survival..
Group medianTime N
<char> <num> <int>
1: Mutant 182.5 14
2: WT 366.0 168
Predict genesets associated with survival
Identify set of genes which results in poor survival
#Using top 20 mutated genes to identify a set of genes (of size 2) to predict poor prognostic groups
prog_geneset = survGroup(maf = laml, top = 20, geneSetSize = 2, time = "days_to_last_followup", Status = "Overall_Survival_Status", verbose = FALSE)Removed 11 samples with NA's
print(prog_geneset) Gene_combination P_value hr WT Mutant
<char> <num> <num> <int> <int>
1: FLT3_DNMT3A 0.00104 2.510 164 18
2: DNMT3A_SMC3 0.04880 2.220 176 6
3: DNMT3A_NPM1 0.07190 1.720 166 16
4: DNMT3A_TET2 0.19600 1.780 176 6
5: FLT3_TET2 0.20700 1.860 177 5
6: NPM1_IDH1 0.21900 0.495 176 6
7: DNMT3A_IDH1 0.29300 1.500 173 9
8: IDH2_RUNX1 0.31800 1.580 176 6
9: FLT3_NPM1 0.53600 1.210 165 17
10: DNMT3A_IDH2 0.68000 0.747 178 4
11: DNMT3A_NRAS 0.99200 0.986 178 4
Above results show a combination (N = 2) of genes which are associated with poor survival (P < 0.05). We can draw KM curve for above results with the function mafSurvGroup
mafSurvGroup(maf = laml, geneSet = c("DNMT3A", "FLT3"), time = "days_to_last_followup", Status = "Overall_Survival_Status")Looking for clinical data in annoatation slot of MAF..
Removed 11 samples with NA's
Median survival..
Group medianTime N
<char> <num> <int>
1: Mutant 242.5 18
2: WT 379.5 164
Comparing two cohorts (MAFs)
Cancers differ from each other in terms of their mutation pattern. We can compare two different cohorts to detect such differentially mutated genes. For example, recent article by Madan et. al 9, have shown that patients with relapsed APL (Acute Promyelocytic Leukemia) tends to have mutations in PML and RARA genes, which were absent during primary stage of the disease. This difference between two cohorts (in this case primary and relapse APL) can be detected using function mafComapre, which performs fisher test on all genes between two cohorts to detect differentially mutated genes.
#Primary APL MAF
primary.apl = system.file("extdata", "APL_primary.maf.gz", package = "maftools")
primary.apl = read.maf(maf = primary.apl)
#Relapse APL MAF
relapse.apl = system.file("extdata", "APL_relapse.maf.gz", package = "maftools")
relapse.apl = read.maf(maf = relapse.apl)#Considering only genes which are mutated in at-least in 5 samples in one of the cohort to avoid bias due to genes mutated in single sample.
pt.vs.rt <- mafCompare(m1 = primary.apl, m2 = relapse.apl, m1Name = 'Primary', m2Name = 'Relapse', minMut = 5)
print(pt.vs.rt)$results
Hugo_Symbol Primary Relapse pval or ci.up ci.low
<char> <num> <int> <num> <num> <num> <num>
1: PML 1 11 1.529935e-05 0.03537381 0.2552937 0.000806034
2: RARA 0 7 2.574810e-04 0.00000000 0.3006159 0.000000000
3: RUNX1 1 5 1.310500e-02 0.08740567 0.8076265 0.001813280
4: FLT3 26 4 1.812779e-02 3.56086275 14.7701728 1.149009169
5: ARID1B 5 8 2.758396e-02 0.26480490 0.9698686 0.064804160
6: WT1 20 14 2.229087e-01 0.60619329 1.4223101 0.263440988
7: KRAS 6 1 4.334067e-01 2.88486293 135.5393108 0.337679367
8: NRAS 15 4 4.353567e-01 1.85209500 8.0373994 0.553883512
9: ARID1A 7 4 7.457274e-01 0.80869223 3.9297309 0.195710173
adjPval
<num>
1: 0.0001376942
2: 0.0011586643
3: 0.0393149868
4: 0.0407875250
5: 0.0496511201
6: 0.3343630535
7: 0.4897762916
8: 0.4897762916
9: 0.7457273717
$SampleSummary
Cohort SampleSize
<char> <num>
1: Primary 124
2: Relapse 58
Forest plots
Above results show two genes PML and RARA which are highly mutated in Relapse APL compared to Primary APL. We can visualize these results as a forestplot.
forestPlot(mafCompareRes = pt.vs.rt, pVal = 0.1)Co-onco plots
Another alternative way of displaying above results is by plotting two oncoplots side by side. coOncoplot function takes two maf objects and plots them side by side for better comparison.
genes = c("PML", "RARA", "RUNX1", "ARID1B", "FLT3")
coOncoplot(m1 = primary.apl, m2 = relapse.apl, m1Name = 'PrimaryAPL', m2Name = 'RelapseAPL', genes = genes, removeNonMutated = TRUE)Co-bar plots
coBarplot(m1 = primary.apl, m2 = relapse.apl, m1Name = "Primary", m2Name = "Relapse")Lollipop plot-2
Along with plots showing cohort wise differences, its also possible to show gene wise differences with lollipopPlot2 function.
lollipopPlot2(m1 = primary.apl, m2 = relapse.apl, gene = "PML", AACol1 = "amino_acid_change", AACol2 = "amino_acid_change", m1_name = "Primary", m2_name = "Relapse")Clinical enrichment analysis
clinicalEnrichment is another function which takes any clinical feature associated with the samples and performs enrichment analysis. It performs various groupwise and pairwise comparisions to identify enriched mutations for every category within a clincila feature. Below is an example to identify mutations associated with FAB_classification.
fab.ce = clinicalEnrichment(maf = laml, clinicalFeature = 'FAB_classification')Sample size per factor in FAB_classification:
M0 M1 M2 M3 M4 M5 M6 M7
19 44 44 21 39 19 3 3
Hugo_Symbol Feature_1 Feature_2 n_mutated_Feature1 n_mutated_Feature2
<char> <char> <char> <char> <char>
1: FLT3 M1 M0 15 of 44 4 of 19
2: FLT3 M2 M0 8 of 44 4 of 19
3: FLT3 M3 M0 6 of 21 4 of 19
4: FLT3 M4 M0 13 of 39 4 of 19
5: FLT3 M5 M0 6 of 19 4 of 19
---
617: TTN <NA> <NA> <NA> <NA>
618: TTN <NA> <NA> <NA> <NA>
619: TTN <NA> <NA> <NA> <NA>
620: TTN <NA> <NA> <NA> <NA>
621: TTN <NA> <NA> <NA> <NA>
fdr Analysis Group1 Group2 n_mutated_group1 n_mutated_group2 p_value
<num> <char> <char> <char> <char> <char> <num>
1: 1 Pairwise <NA> <NA> <NA> <NA> NA
2: 1 Pairwise <NA> <NA> <NA> <NA> NA
3: 1 Pairwise <NA> <NA> <NA> <NA> NA
4: 1 Pairwise <NA> <NA> <NA> <NA> NA
5: 1 Pairwise <NA> <NA> <NA> <NA> NA
---
617: NA Group M3 Rest 1 of 21 5 of 172 0.5038411
618: NA Group M4 Rest 2 of 39 4 of 154 0.3497841
619: NA Group M5 Rest 1 of 19 5 of 174 0.4676584
620: NA Group M6 Rest 0 of 3 6 of 190 1.0000000
621: NA Group M7 Rest 0 of 3 6 of 190 1.0000000
OR OR_low OR_high
<num> <num> <num>
1: NA NA NA
2: NA NA NA
3: NA NA NA
4: NA NA NA
5: NA NA NA
---
617: 1.664663 0.03364588 16.00879
618: 2.018253 0.17625631 14.70452
619: 1.869952 0.03763487 18.10979
620: 0.000000 0.00000000 85.42683
621: 0.000000 0.00000000 85.42683
#Results are returned as a list. Significant associations p-value < 0.05
fab.ce$groupwise_comparision[p_value < 0.05] Hugo_Symbol Group1 Group2 n_mutated_group1 n_mutated_group2 p_value
<char> <char> <char> <char> <char> <num>
1: IDH1 M1 Rest 11 of 44 7 of 149 0.0002597371
2: TP53 M7 Rest 3 of 3 12 of 190 0.0003857187
3: DNMT3A M5 Rest 10 of 19 38 of 174 0.0089427384
4: CEBPA M2 Rest 7 of 44 6 of 149 0.0117352110
5: RUNX1 M0 Rest 5 of 19 11 of 174 0.0117436825
6: NPM1 M5 Rest 7 of 19 26 of 174 0.0248582372
7: NPM1 M3 Rest 0 of 21 33 of 172 0.0278630823
8: DNMT3A M3 Rest 1 of 21 47 of 172 0.0294005111
OR OR_low OR_high fdr
<num> <num> <num> <num>
1: 6.670592 2.173829026 21.9607250 0.0308575
2: Inf 5.355415451 Inf 0.0308575
3: 3.941207 1.333635173 11.8455979 0.3757978
4: 4.463237 1.204699322 17.1341278 0.3757978
5: 5.216902 1.243812880 19.4051505 0.3757978
6: 3.293201 1.001404899 10.1210509 0.5880102
7: 0.000000 0.000000000 0.8651972 0.5880102
8: 0.133827 0.003146708 0.8848897 0.5880102
Above results shows IDH1 mutations are enriched in M1 subtype of leukemia compared to rest of the cohort. Similarly DNMT3A is in M5, RUNX1 is in M0, and so on. These are well known results and this function effectively recaptures them. One can use any sort of clincial feature to perform such an analysis. There is also a small function - plotEnrichmentResults which can be used to plot these results.
plotEnrichmentResults(enrich_res = fab.ce, pVal = 0.05, geneFontSize = 0.5, annoFontSize = 1)Drug-Gene Interactions
drugInteractions function checks for drug–gene interactions and gene druggability information compiled from Drug Gene Interaction database.
dgi = drugInteractions(maf = laml, fontSize = 0.75)dgi.brca = drugInteractions(maf = brca, fontSize = 0.75)Above plot shows potential druggable gene categories along with upto top 5 genes involved in them. One can also extract information on drug-gene interactions. For example below is the results for known/reported drugs to interact with DNMT3A.
dnmt3a.dgi = drugInteractions(genes = "DNMT3A", drugs = TRUE)Number of claimed drugs for given genes:
Gene N
<char> <int>
1: DNMT3A 7
#Printing selected columns.
dnmt3a.dgi[,.(Gene, interaction_types, drug_name, drug_claim_name)] Gene interaction_types drug_name drug_claim_name
<char> <char> <char> <char>
1: DNMT3A N/A
2: DNMT3A DAUNORUBICIN Daunorubicin
3: DNMT3A DECITABINE Decitabine
4: DNMT3A IDARUBICIN IDARUBICIN
5: DNMT3A DECITABINE DECITABINE
6: DNMT3A inhibitor DECITABINE CHEMBL1201129
7: DNMT3A inhibitor AZACITIDINE CHEMBL1489
Oncogenic Signaling Pathways
OncogenicPathways function checks for enrichment of known Oncogenic Signaling Pathways in TCGA cohorts
Its also possible to visualize complete pathway.
Tumor suppressor genes are in red, and oncogenes are in blue font.
Tumor heterogeneity and MATH scores
Heterogeneity in tumor samples
Tumors are generally heterogeneous i.e, consist of multiple clones. This heterogeneity can be inferred by clustering variant allele frequencies. inferHeterogeneity function uses vaf information to cluster variants (using mclust), to infer clonality. By default, inferHeterogeneity function looks for column t_vaf containing vaf information. However, if the field name is different from t_vaf, we can manually specify it using argument vafCol. For example, in this case study vaf is stored under the field name i_TumorVAF_WU.
#Heterogeneity in sample TCGA.AB.2972
library("mclust")Package 'mclust' version 6.1.1
Type 'citation("mclust")' for citing this R package in publications.
tcga.ab.2972.het = inferHeterogeneity(maf = laml, tsb = 'TCGA-AB-2972', vafCol = 'i_TumorVAF_WU')Processing TCGA-AB-2972..
print(tcga.ab.2972.het$clusterMeans) Tumor_Sample_Barcode cluster meanVaf
<fctr> <char> <num>
1: TCGA-AB-2972 2 0.4496571
2: TCGA-AB-2972 1 0.2454750
3: TCGA-AB-2972 outlier 0.3695000
#Visualizing results
plotClusters(clusters = tcga.ab.2972.het)Above figure shows clear separation of two clones clustered at mean variant allele frequencies of ~45% (major clone) and another minor clone at variant allele frequency of ~25%.
Although clustering of variant allele frequencies gives us a fair idea on heterogeneity, it is also possible to measure the extent of heterogeneity in terms of a numerical value. MATH score (mentioned as a subtitle in above plot) is a simple quantitative measure of intra-tumor heterogeneity, which calculates the width of the vaf distribution. Higher MATH scores are found to be associated with poor outcome. MATH score can also be used a proxy variable for survival analysis 11.
Ignoring variants in copy number altered regions
We can use copy number information to ignore variants located on copy-number altered regions. Copy number alterations results in abnormally high/low variant allele frequencies, which tends to affect clustering. Removing such variants improves clustering and density estimation while retaining biologically meaningful results. Copy number information can be provided as a segmented file generated from segmentation programmes, such as Circular Binary Segmentation from “DNACopy” Bioconductor package 6.
seg = system.file('extdata', 'TCGA.AB.3009.hg19.seg.txt', package = 'maftools')
tcga.ab.3009.het = inferHeterogeneity(maf = laml, tsb = 'TCGA-AB-3009', segFile = seg, vafCol = 'i_TumorVAF_WU')Processing TCGA-AB-3009..
Removed 1 variants with no copy number data.
Hugo_Symbol Chromosome Start_Position End_Position Tumor_Sample_Barcode
<char> <char> <num> <num> <fctr>
1: PHF6 23 133551224 133551224 TCGA-AB-3009
t_vaf Segment_Start Segment_End Segment_Mean CN
<num> <int> <int> <num> <num>
1: 0.9349112 NA NA NA NA
Copy number altered variants:
Hugo_Symbol Chromosome Start_Position End_Position Tumor_Sample_Barcode
<char> <char> <num> <num> <fctr>
1: NFKBIL2 8 145668658 145668658 TCGA-AB-3009
2: NF1 17 29562981 29562981 TCGA-AB-3009
3: SUZ12 17 30293198 30293198 TCGA-AB-3009
t_vaf Segment_Start Segment_End Segment_Mean CN cluster
<num> <int> <int> <num> <num> <char>
1: 0.4415584 145232496 145760746 0.3976 2.634629 CN_altered
2: 0.8419000 29054355 30363868 -0.9157 1.060173 CN_altered
3: 0.8958333 29054355 30363868 -0.9157 1.060173 CN_altered
#Visualizing results. Highlighting those variants on copynumber altered variants.
plotClusters(clusters = tcga.ab.3009.het, genes = 'CN_altered', showCNvars = TRUE)Above figure shows two genes NF1 and SUZ12 with high VAF’s, which is due to copy number alterations (deletion). Those two genes are ignored from analysis.