What follows outlines the steps needed to apply the DISCOVER test to search for driver mutations and then visualize the results in our application.

This tutorial covers installing the required packages, collecting the results, and uploading the resulting rds file for analysis within the Shiny application. Be aware that the DISCOVER test takes several hours to run and is best run overnight. The output file will be available for download as an example for those who prefer not to run this tutorial themselves.

For data we will use the TCGA BRCA set, and we will also show how to retrieve it with the TCGAbiolinks package.

Background

Tumors are formed through the accumulation of mutations that gradually lead to malignant growth. Understanding which driver mutations are particularly crucial to enabling this malignant growth is one task that must be accomplished in order to identify potential targets for precision medicine.

Numerous methods exist to identify potential driver mutations, and there are a number of ways of classifying them. Based on the work of Dimitrakopoulos et al., however, they can largely be grouped into three approaches.

First are the approaches that focus on finding heavily mutated pathways of genes using a known list, such as the work of Grossman et al., which uses Gene Ontology terms to define pathways.

Second are the network-based approaches which rely on protein interaction networks, such as Mutated Core Modules in Cancer (iMCMC).

Third, and most relevant to this application note, are the combinatorial approaches. A significant advantage of these approaches is that they take no prior information and instead search for patterns in the mutations among genes and tumors [9,12]. Generally, this involves searching for “mutually exclusive” mutations, such as those in KRAS and EGFR in lung cancers: mutations in these two genes generally do not occur together. The underlying idea presumes that both genes are involved in the same pathway, and that affecting this pathway offers a selective advantage to the tumor cell. Therefore, a mutation in one gene is likely to be observed, but a second mutation offers no further advantage. Examples of such methods include the Weighted Exact Test (WeXT) [13] and the Discrete Independence Statistic Controlling for Observations with Varying Event Rates (DISCOVER).
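To make the idea of mutual exclusivity concrete, the following is a minimal sketch on made-up data. The indicator vectors below are hypothetical and are not taken from any study; they only show how the overlap between two genes can be counted.

```r
# Hypothetical mutation indicators for eight tumors (1 = mutated, 0 = not mutated)
kras <- c(1, 1, 1, 0, 0, 0, 0, 0)
egfr <- c(0, 0, 0, 1, 1, 0, 0, 1)

# Cross-tabulate the two genes; the cell at row "1", column "1" counts
# tumors that carry both mutations
overlap_table <- table(KRAS = kras, EGFR = egfr)
print(overlap_table)

n_both   <- sum(kras == 1 & egfr == 1)  # tumors mutated in both genes
n_either <- sum(kras == 1 | egfr == 1)  # tumors mutated in at least one gene
```

In this toy example six tumors carry a mutation in one of the two genes and none carries both, which is the pattern a mutual exclusivity test looks for.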

The DISCOVER test, used here, first estimates a background tumor mutation rate to reflect the variation between individual tumors. Then, based on the Poisson-binomial distribution, the observed overlap between a pair of mutated genes is compared with the overlap expected by chance, and a p-value is generated for one of two alternative hypotheses: that we observe either significantly less or significantly more mutation overlap than expected by chance. If the overlap is less than expected by chance, we can conclude that the two gene mutations may be mutually exclusive, potentially indicating an overlap in function. Driver mutations tend to be exclusive with a large number of other genes, as they affect a large number of pathways. On the other hand, if the mutation overlap is greater than expected by chance, we can conclude that the two gene mutations may co-occur. This is thought to indicate either synergy between the mutations or simply the overall level of genomic disruption in the tumor. Additionally, copy number variation (CNV) events will naturally appear as clusters of co-occurring gene pairs if they are not excluded from the analysis.
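The comparison against a Poisson-binomial distribution can be sketched in base R as follows. This is a simplified illustration, not the implementation used by the discover package: the per-tumor background probabilities are hypothetical values supplied by hand, whereas the package estimates them from the full mutation matrix, and the function name poisson_binomial_pmf is introduced here for illustration only.

```r
# Poisson-binomial probability mass function via dynamic programming.
# p: vector of per-trial success probabilities (they need not be equal).
poisson_binomial_pmf <- function(p) {
  pmf <- 1  # before any trial, P(0 successes) = 1
  for (pj in p) {
    # Trial j either fails (count unchanged) or succeeds (count shifts up by one)
    pmf <- c(pmf * (1 - pj), 0) + c(0, pmf * pj)
  }
  pmf  # pmf[k + 1] = P(exactly k successes)
}

# Hypothetical background mutation probabilities for two genes across six tumors
p_gene_a <- c(0.10, 0.20, 0.60, 0.70, 0.30, 0.50)
p_gene_b <- c(0.20, 0.10, 0.50, 0.80, 0.40, 0.60)

# Under independence, tumor j carries both mutations with probability p_a[j] * p_b[j]
p_both <- p_gene_a * p_gene_b
pmf <- poisson_binomial_pmf(p_both)

observed_overlap <- 0  # hypothetical: no tumor carries both mutations

# One-sided p-values for the two alternatives
p_exclusive   <- sum(pmf[seq_len(observed_overlap + 1)])       # P(overlap <= observed)
p_cooccurring <- sum(pmf[(observed_overlap + 1):length(pmf)])  # P(overlap >= observed)
```

A small p_exclusive suggests fewer shared mutations than chance would predict, while a small p_cooccurring suggests more. Because each tumor has its own probability, tumors with a high overall mutation rate contribute more expected overlap than quiet ones.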

MSK: Using the Application online

If you have not already, visit this page to view the application and follow this overview. You should be greeted with the following screen upon doing so:

The overview displays a network visualization of mutually exclusive and co-occurring mutations. The initial sample data show mutations from the Memorial Sloan Kettering IMPACT study, focusing on lung adenocarcinoma. “Search Type” sets the search criteria (by gene or by pathway), and “Selection Type” allows for either including or excluding neighboring connections to the search term.

Red lines connect mutations that are mutually exclusive with one another. Mutually exclusive genes are thought to indicate overlapping functional effects of the mutations: if a mutation in either gene A or gene B leads to the same (cancerous) effect, then only one mutation is necessary for malignant behavior. Driver mutations, which lead to a large number of cancerous effects, therefore tend to be mutually exclusive with a large number of other genes. EGFR, a well-known driver gene, accordingly shows up quite prominently with a large number of mutually exclusive mutations.

Green lines indicate co-occurrence. These are thought to indicate synergistic effects.

The default visualization displays all genes for which a relationship has been found. However, interactively searching through these results can simplify the more complex visualizations, allowing greater insight into the results.

Another prominent and well-known driver mutation is KRAS. This is somewhat less visible in the results, but we can filter the results to make the KRAS mutation more visible.

First use the search bar to search for and select “KRAS.”
Change the “Selection Type” option to “First Degree” neighbors.

The display will change to show only the KRAS gene mutation and its immediate connections.
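Outside the application, a first-degree neighbor filter of this kind can be sketched on an edge list. The data frame below, including its column names and gene pairs, is hypothetical and only illustrates the filtering logic; it does not reproduce the application's internal data or any real result.

```r
# Hypothetical edge list: each row is one gene pair with a detected relationship
edges <- data.frame(
  gene_a = c("KRAS", "KRAS", "EGFR", "TP53", "STK11"),
  gene_b = c("EGFR", "BRAF", "TP53", "STK11", "KEAP1"),
  type   = c("exclusive", "exclusive", "exclusive", "co-occurring", "co-occurring"),
  stringsAsFactors = FALSE
)

target <- "KRAS"

# Keep only the edges that touch the searched gene
first_degree <- edges[edges$gene_a == target | edges$gene_b == target, ]

# The immediate neighbors are the other endpoints of those edges
neighbors <- setdiff(unique(c(first_degree$gene_a, first_degree$gene_b)), target)
print(neighbors)
```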

The application also allows searching by Gene Ontology term in order to focus on a specific pathway. The three options are Biological Processes, Cellular Components, and Molecular Functions.

To illustrate this utility, we can again use KRAS and EGFR. These two genes are within the Biological Process (domain) “MAPK Cascade.” To see mutually exclusive mutations within this pathway we can filter to only those mutations within it.

Change the “Search Type” option to “Biological Process”
Search for and select “MAPK Cascade.”
Change “Selection Type” to “Only Searched For.”
Uncheck “Only show genes with significant pairs.”
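The effect of restricting the display to a single Gene Ontology term can be sketched as follows. The annotation list and edge list are hypothetical: only the membership of KRAS and EGFR in the MAPK cascade comes from the text above, and GENE_X and GENE_Y are placeholders rather than real genes.

```r
# Hypothetical gene-to-term annotation; only KRAS and EGFR come from the text
annotation <- list(
  "MAPK cascade"     = c("KRAS", "EGFR"),
  "placeholder term" = c("GENE_X", "GENE_Y")
)

# Hypothetical gene pairs with a detected relationship
edges <- data.frame(
  gene_a = c("KRAS", "KRAS", "GENE_X"),
  gene_b = c("EGFR", "GENE_X", "GENE_Y"),
  stringsAsFactors = FALSE
)

term_genes <- annotation[["MAPK cascade"]]

# "Only Searched For": keep pairs in which both genes belong to the selected term
within_term <- edges[edges$gene_a %in% term_genes & edges$gene_b %in% term_genes, ]
print(within_term)
```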

Next this tutorial will outline the process for uploading custom data to the application.

TCGA: Preparing a custom file

Installation of required packages

Though the number of commands necessary to run the analysis is relatively small, the process relies on several packages. Most CRAN dependencies will be pulled down automatically when the application package is installed. The Bioconductor packages and the DISCOVER package itself will not, as these are drawn from other repositories.

# This package is used to pull additional gene ontology information from MyGene.info's repository
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("mygene", version = "3.8")

# Installation of the DISCOVER package, which uses a custom repository
options(repos=c(getOption("repos"), "http://ccb.nki.nl/software/discover/repos/r"))
install.packages("discover",dependencies = TRUE)

# Install the application package, used in conjunction with the Shiny application
if (!requireNamespace("devtools", quietly = TRUE))
    install.packages("devtools")
devtools::install_github("connorH982/driverAnalytics",dependencies = TRUE)

Additionally, we will be using the TCGAbiolinks package to acquire breast cancer mutation data from The Cancer Genome Atlas Program (TCGA). This package can be installed as below.

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("TCGAbiolinks", version = "3.8")
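Before moving on, it may be worth confirming that the installations succeeded. The following is a minimal sketch using base R only; the package names are the ones installed above, and the variable names are illustrative.

```r
# Packages installed in the steps above
required <- c("mygene", "discover", "driverAnalytics", "TCGAbiolinks")

# requireNamespace() returns TRUE when a package can be loaded, FALSE otherwise
installed <- vapply(required, requireNamespace, logical(1), quietly = TRUE)

missing_pkgs <- required[!installed]
if (length(missing_pkgs) > 0) {
  message("Still missing: ", paste(missing_pkgs, collapse = ", "))
} else {
  message("All required packages are available.")
}
```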

Running Analysis

File preparation

The first step to running the analysis is to prepare an rds file that can be uploaded to the online application for visualization. This is handled by the driverAnalytics package. To build an indicator matrix we will use tidyr. Additionally magrittr, dplyr, and data.table are loaded for data manipulation convenience.

library(magrittr)
library(dplyr)
library(data.table)
library(TCGAbiolinks)
library(discover)
library(tidyr)
library(driverAnalytics)

Next we use the GDCquery_Maf function (MAF stands for Mutation Annotation Format) to acquire the TCGA BRCA set. The commands below will also reformat the data into an indicator matrix in which the rows denote genes and the columns denote samples. Each cell is either 1 (mutated) or 0 (not mutated). This is the format used by the DISCOVER test.

To do this we use the spread function from the tidyr package. First we select the gene names and sample IDs (these will be our rows and columns). Every gene/sample pair listed here has a mutation, so we label each of them with a “1” for mutated.

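# Download the MuSE pipeline mutation calls for TCGA BRCA and keep a single sequencer
maf <- GDCquery_Maf("BRCA", pipelines = "muse") %>%
  subset(Sequencer == "Illumina HiSeq 2000")

# Keep only the gene symbol and sample barcode of each mutation
dd <- maf %>% dplyr::select(Hugo_Symbol, Tumor_Sample_Barcode)
dd$one <- 1                   # flag every listed gene/sample pair as mutated
dd <- dd[!duplicated(dd), ]   # count a gene at most once per sample
print(head(dd))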

This is essentially our desired data, but in “long” format and including only the mutated pairs. This is where we use the spread() function to convert from long to wide format, with 0 as the fill value (the only pairs that do not appear are those without a mutation). As a last step, we remove the gene name column from the table and use it for the row names instead.

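# Convert from long to wide: one row per gene, one column per sample, 0 where no mutation is listed
mtdt <- spread(dd, key = "Tumor_Sample_Barcode", value = "one", fill = 0)

# Move the gene names from the first column into the row names
mt <- as.data.frame(mtdt[, -1])
rownames(mt) <- mtdt$Hugo_Symbol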
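To check what the resulting indicator matrix should look like without downloading any data, the same long-to-wide conversion can be sketched in base R on a toy table. The rows below are made up, and table() is used here as a stand-in for spread(); it is an alternative technique, not the one used in the tutorial code.

```r
# Toy long-format data: one row per mutated gene/sample pair (values are made up)
dd_toy <- data.frame(
  Hugo_Symbol          = c("TP53", "TP53", "CDH1", "DST"),
  Tumor_Sample_Barcode = c("S1", "S2", "S2", "S3"),
  stringsAsFactors = FALSE
)

# Cross-tabulate genes against samples; pairs that never appear become 0
counts <- table(dd_toy$Hugo_Symbol, dd_toy$Tumor_Sample_Barcode)

# Convert to a data frame with genes as row names and samples as columns
mt_toy <- as.data.frame.matrix(counts)

# Cap at 1 so that repeated gene/sample rows would still give a 0/1 indicator
mt_toy[mt_toy > 1] <- 1
print(mt_toy)
```

The result has one row per gene and one column per sample, matching the layout of mt above.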

We are now ready to run the DISCOVER test. For the sake of performance we will (a) only consider genes with more than 15 observed mutations (the others are still included in the background probability matrix calculation), and (b) only seek mutually exclusive pairs and not co-occurring ones.

Warning: Even under these restrictions the operation is time-consuming and should be run overnight.

res.combined <- runDiscover(mt, q.threshold = 0.05, alternative = "less", min_mutation = 15)

check<-buildAppFile(mt,path_to_file = "BRCA_App_File.rds",res.combined = res.combined)
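Because the amount of work grows with the number of gene pairs tested, it can help to check how many genes pass the mutation threshold. The sketch below uses a toy matrix and assumes a strict "more than" comparison, as described in the text; the exact rule applied inside runDiscover is not confirmed here.

```r
# Toy indicator matrix: rows are genes, columns are samples (values are made up)
mt_toy <- matrix(
  c(1, 1, 1, 0,
    1, 0, 0, 0,
    1, 1, 0, 1,
    0, 0, 0, 1),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("G1", "G2", "G3", "G4"), c("S1", "S2", "S3", "S4"))
)

min_mutation <- 2  # toy threshold; the tutorial uses 15

# Number of mutated samples per gene
mutation_counts <- rowSums(mt_toy)

# Assumed rule: keep genes with strictly more mutations than the threshold
tested_genes <- names(mutation_counts)[mutation_counts > min_mutation]

# Number of gene pairs that a pairwise test would have to evaluate
n_pairs <- choose(length(tested_genes), 2)
```

Applying the same two lines to mt with a threshold of 15 gives a rough sense of how many pairs the overnight run has to evaluate.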

Uploading the File

Now that we have built the file, we can upload it to our online application. Visit the driverAnalytics Shiny application page. Place the file in an easily accessible location and upload the file on the page. After toggling the display to the uploaded data, the TCGA BRCA results can be explored.
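If the upload fails, a quick check is to confirm that the file exists and can be read back into R. The sketch below performs this round trip on a placeholder object written to a temporary file; the real file is BRCA_App_File.rds, and its internal structure is not described here, so the object below is purely illustrative.

```r
# Placeholder object standing in for the real application file contents
app_object <- list(note = "placeholder", values = c(1, 0, 1))

# The tutorial writes "BRCA_App_File.rds"; a temporary file is used here instead
path <- tempfile(fileext = ".rds")
saveRDS(app_object, path)

# Confirm the file exists, is not empty, and can be read back
file_ok  <- file.exists(path) && file.size(path) > 0
reloaded <- readRDS(path)
str(reloaded)
```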

CDH1 and TP53 both stand out in the breast cancer TCGA results, having the largest number of observed potential mutual exclusivities. Both are known tumor suppressors, with TP53 being mutated in approximately 60% of all cancers and CDH1 being commonly mutated in certain types of breast cancer. Another interesting result is the exclusivity between DST mutations and CDH1 mutations. Both genes play a role in epithelial cell adhesion, although DST is not considered prognostic. The mutual exclusivity between CDH1 and DST nevertheless suggests an overlap in function. It is therefore possible that DST mutations have effects that can lead to malignant growth, though without the widespread effects of the CDH1 driver mutation.
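Ranking genes by their number of mutually exclusive partners, as done informally above, can be sketched on an edge list. Apart from the CDH1 and DST pair mentioned in the text, the pairs below are placeholders (GENE_X, GENE_Y, GENE_Z) and do not represent actual results.

```r
# Hypothetical list of mutually exclusive pairs; only CDH1/DST comes from the text
edges <- data.frame(
  gene_a = c("CDH1", "CDH1", "CDH1", "TP53", "TP53"),
  gene_b = c("DST", "GENE_X", "GENE_Y", "GENE_X", "GENE_Z"),
  stringsAsFactors = FALSE
)

# Count how many pairs each gene takes part in, largest first
degree <- sort(table(c(edges$gene_a, edges$gene_b)), decreasing = TRUE)
print(degree)
```

Genes at the top of such a ranking are the ones that appear most prominently in the network view.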