Contents

1 Introduction

The Cancer Genome Atlas has provided a comprehensive omics study of 33 cancers. The National Cancer Institute (NCI) has made this data pubically available on the Genomic Data Commons (GDC) Data Portal. There are several R packages that download the data, RTCGAToolbox, RTCGA, TCGAbiolinks, TCGA2STAT, MultiAssayExperiment, and TCGA-Assembler.

The motivation for TCGAbrowser was to provide convenience functions and wrappers to analyze the omics datasets by gene expression or mutation subsetting. Typically, biologists are studying one gene in one cancer by overexpression and/or knockdown experiments. TCGAbrowser was designed to subset the datasets by gene expression or gene mutation and perform differential gene expresion, mutation, copy number, protein and survival analysis.

2 Installation

To install TCGAbrowser use the code below.

require(devtools)
devtools::install_github("pcheng84/TCGAbrowser", ref = "multiAssayExperiment")
## 
##   
  
  
   checking for file 'C:\Users\Phil\AppData\Local\Temp\RtmpqOWUf2\remotes960420d2a61\pcheng84-TCGAbrowser-0fd0755/DESCRIPTION' ...
  
   checking for file 'C:\Users\Phil\AppData\Local\Temp\RtmpqOWUf2\remotes960420d2a61\pcheng84-TCGAbrowser-0fd0755/DESCRIPTION' ... 
  
v  checking for file 'C:\Users\Phil\AppData\Local\Temp\RtmpqOWUf2\remotes960420d2a61\pcheng84-TCGAbrowser-0fd0755/DESCRIPTION' (348ms)
## 
  
  
  
-  preparing 'TCGAbrowser':
##    checking DESCRIPTION meta-information ...
  
   checking DESCRIPTION meta-information ... 
  
v  checking DESCRIPTION meta-information
## 
  
  
  
-  checking for LF line-endings in source and make files and shell scripts
## 
  
  
  
-  checking for empty or unneeded directories
## 
  
  
  
-  looking to see if a 'data/datalist' file should be added
## 
  
  
  
-  building 'TCGAbrowser_0.1.2.tar.gz'
## 
  
   
## 
require(TCGAbrowser)

The workflow for TCGAbrowser is shown in the graphic below.

TCGAbrowser functions were designed either to subset the data, perform statistical analysis or for plotting.

Subset functions include rnasubset and mutsubset which that subset the data based on RNA expression and mutation status respectively.

Analysis functions include: diffmut for differential mutation analysis; diffcp for differential copy number analysis; rnadeg for differential RNA expression analysis; rnagsva and rnagsvasig for differential GSVA analysis; rnareact for reactome pathway analysis on the differential RNA expression analysis; rppadeg for differential RPPA analysis.

Plotting functions include: genesurv for Kaplan-Meier survival plots; plotlymut for an Oncoprint-like plot of mutation data; plotlycp for an Oncoprint-like plot of copy number data; rnaheat for a heatmap of differential expressed genes; rnagsvaheat for a heatmap of differential expressed pathways; plotreact for Reactome dotplots, enrichment plots and cnetmap plots; rppaheat for a heatmap of differential expressed proteins.

3 Data download and formating

This vignette assumes the user will use one of the R packages mentioned above to download the TCGA data. In this example, we used curatedTCGAData to download the data. We will transform the mutation data into a binary matrix with qreduceTCGA from TCGAutils and subset the data for just tumor samples using splitAssays.

library(curatedTCGAData)
library(TCGAutils)

lusc <- curatedTCGAData("LUSC", c("Mutation", "RNASeq2GeneNorm", "GISTICT", "RPPAArray"), FALSE)

#convert genome to hg19 format
genome(lusc[[2]]) <- vapply(genome(lusc[[2]]), translateBuild, character(1L))
seqlevelsStyle(lusc[[2]]) <- "NCBI"

#convert mutation into binary matrix
lusc2 <- qreduceTCGA(lusc)

#remove normal samples and just focus on primaries
tnmae <- splitAssays(lusc2, c("01", "11"))

lusc_t <- tnmae[, , grep("^01", names(tnmae))]

4 Subset by gene expression workflow

rnasubset needs 3 parameters: the multiassayexperiment, a gene and a percentage cutoff. This function will intersect the TCGA patient barcode with the sample barcodes from RNAseq.

rnasubset will return a multiassayexperiment object with an assay called “Cohort”. The “Cohort” assay contains all samples from the RNAseq dataset and has 3 rows, the first row identifies if the sample belongs in the “low”, “middle”, or “high” gene expression category, the second row is the rank of gene across all samples, and the third row are the gene expression values themselves.

In this example, we will compare patients in the top 10% of EGFR expression and to the bottom 10% of EGFR expression.

lusc_egfr <- rnasubset(lusc_t, "EGFR", 10)
## harmonizing input:
##   removing 3 colData rownames not in sampleMap 'primary'
## Warning in .local(x, ...): Assuming column order in the data provided 
##  matches the order in 'mapFrom' experiment(s) colnames

4.1 Expression plotting

The output table from rnasubset can be used for downstream analysis and plotting functions. plotlygenelevel and hchartgenelevel are wrappers for plot_ly from plotly and hchart from highcharter respectively. These two functions will plot the gene expression of your gene of interest across all samples with coloring for the low, middle and high groups.

plotlygenelevel(lusc_egfr)
## harmonizing input:
##   removing 3 colData rownames not in sampleMap 'primary'

4.2 Survival Analysis

genesurv needs 2 parameters: the output multiassayexperiment object from rnasubset and the gene name. genesurv is a wrapper for ggsurvplot from survminer. It will plot the Kaplan-Meier curves for the high gene and low gene expression groups and report the p value from the log-rank test.

genesurv(lusc_egfr, "EGFR")

4.3 Differential gene expression analysis

rnadeg needs 2 parameters: the output multiassayexperiment object from rnasubset. rnadeg is a wrapper for voom from limma. It will calculate the genes differentially expressed between the high gene and low gene expression groups.

egfr_deg <- rnadeg(lusc_egfr)
head(egfr_deg)
##           genes    logFC  AveExpr         t      P.Value    adj.P.Val
## EGFR       EGFR 5.710251 6.435829 22.812771 1.971747e-42 2.784501e-38
## IL1RAP   IL1RAP 2.314432 5.077417  9.620414 4.453599e-16 3.144687e-12
## FAT2       FAT2 4.728391 5.034816  9.208779 3.716886e-15 1.632871e-11
## MICALL1 MICALL1 1.599966 6.345514  9.166296 4.625041e-15 1.632871e-11
## PTPN14   PTPN14 1.874528 3.611484  9.045830 8.592199e-15 2.426781e-11
## AHNAK     AHNAK 1.913286 9.462019  8.905248 1.768292e-14 4.161970e-11
##                B
## EGFR    82.26632
## IL1RAP  25.97586
## FAT2    23.75491
## MICALL1 23.80224
## PTPN14  22.90095
## AHNAK   22.50582

rnaheat needs 4 parameters: the output multiassayexperiment object from rnasubset, the output data.frame from rnadeg, the gene name, and the number of genes to plot for the heatmap. rnaheat is a wrapper for Heatmap fromComplexHeatmap. It will plot a heatmap of the top 100 most significant differentially expressed genes, draw a colored column bar to indicate the high gene and low gene expression groups, and a colored column bar to indicate the rank of the gene of interest.

rnaheat(lusc_egfr, egfr_deg, "EGFR", 100)
## Warning in Heatmap(h1, top_annotation = top_ha, name = "color scale",
## show_column_names = T, : NAs introduced by coercion

4.4 Reactome pathway analysis

rnareact needs 1 parameter: the output table from rnasubset. rnareact is wrapper for enrichPathway from ReactomePA. It will calculate enriched pathways from Reactome database.

plotreact needs 4 parameters: the output table from rnareact, the output table from rnadeg, the graph selection which can be ‘dot’, ‘map’, or ‘cnet’, and the number of pathways for plotting. plotreact is a wrapper for dotplot, enrichMap, and cnetplot from ReactomePA.

egfr_react <- rnareact(egfr_deg)
## Loading required package: org.Hs.eg.db
## Loading required package: AnnotationDbi
#dotplot
plotreact(egfr_react, egfr_deg, "dot", 15)

#pathway plot
plotreact(egfr_react, egfr_deg, "map", 15)

#cnet plot
plotreact(egfr_react, egfr_deg, "cnet", 5)

4.5 Differential mutation

diffmut needs a MultiAssayExperiment object from rnasubset or mutsubset. The MultiAssayExperiment should contain an assay with a simplified Mutation from qreduceTCGA or a binary matrix representing mutated and wild-type genes. It will perform a chi-squared test for the proportions of mutated and wild-type genes between the high gene and low gene expression groups from rnasubset or mutated and wild-type groups from mutsubset. Since we use a binary matrix as a representation of mutation, we assume is any mutation in a gene would have the same effect.

plotlymut needs 4 parameters: the MultiAssayExperiment object from rnasuset or mutsubset, the output table from diffmut, the gene name, and the number of genes to plot. It will create an Oncoprint-like plot of the differentially mutated genes and your gene of interest.

egfr_dm <- diffmut(lusc_egfr)
plotlymut(lusc_egfr, egfr_dm, "EGFR", 20)

4.6 Differential copy number analysis

diffcp needs a MultiAssayExperiment object from rnasubset or mutsubset. The MultiAssayExperiment should contain a GISTIC assay where the copy number values are represented by -2, -1, 0, 1, and 2. It will perform a chi-squared test between copy number gain and copy number neutral and another chi-squared test for copy number neutral and copy number loss.

plotlycp needs 4 parameters: the MultiAssayExperiment object from rnasuset or mutsubset, the output table from diffcp, the gene name, and the number of genes to plot. It will create an Oncoprint-like plot of the copy number differences and your gene of interest.

egfr_dcp <- diffcp(lusc_egfr)
plotlycp(lusc_egfr, egfr_dcp, "EGFR", 20)