1 Introduction

The Cancer Genome Atlas has provided a comprehensive omics study of 33 cancers. The National Cancer Institute (NCI) has made this data pubically available on the Genomic Data Commons (GDC) Data Portal. There are several R packages that download the data, RTCGAToolbox, RTCGA, TCGAbiolinks, TCGA2STAT, MultiAssayExperiment, and TCGA-Assembler.

The motivation for TCGAbrowser was to provide convenience functions and wrappers to analyze the omics datasets by gene expression or mutation subsetting. Typically, biologists are studying one gene in one cancer by overexpression and/or knockdown experiments. TCGAbrowser was designed to subset the datasets by gene expression or gene mutation and perform differential gene expresion, mutation, copy number, protein and survival analysis.

2 Installation

To install TCGAbrowser use the code below.

require(devtools)
devtools::install_github("pcheng84/TCGAbrowser")
require(TCGAbrowser)

The workflow for TCGAbrowser is shown in the graphic below.

TCGAbrowser functions were designed either to subset the data, perform statistical analysis or for plotting.

Subset functions include rnasubset and mutsubset which that subset the data based on RNA expression and mutation status respectively.

Analysis functions include: diffmut for differential mutation analysis; diffcp for differential copy number analysis; rnadeg for differential RNA expression analysis; rnagsva and rnagsvasig for differential GSVA analysis; rnareact for reactome pathway analysis on the differential RNA expression analysis; rppadeg for differential RPPA analysis.

Plotting functions include: genesurv for Kaplan-Meier survival plots; plotlymut for an Oncoprint-like plot of mutation data; plotlycp for an Oncoprint-like plot of copy number data; rnaheat for a heatmap of differential expressed genes; rnagsvaheat for a heatmap of differential expressed pathways; plotreact for Reactome dotplots, enrichment plots and cnetmap plots; rppaheat for a heatmap of differential expressed proteins.

3 Data download and formating

This vignette assumes the user will use one of the R packages mentioned above to download the TCGA data. In this example, we used TCGA2STAT to download the data. We also used a customized function to download the RPPA data. Some processing steps are needed for the data.frames to be used in TCGAbrowser.

require(TCGA2STAT)
#download RSEM normalized gene expression values and clinical data
rnaseq.skcm <- getTCGA(disease = "SKCM", data.type = "RNASeq2", clinical = TRUE)

#make data.table version of RNAseq data and shortens sample name
require(data.table)
rna <- data.table(Gene = rownames(rnaseq.skcm$dat), rnaseq.skcm$dat)

#renames RNAseq data
setnames(rna, sub("(TCGA-.*?-.*?-.*?)-.*", "\\1", colnames(rna)))

#reorders samples alphabetically
setcolorder(rna, order(colnames(rna)))

#download binary matrix of mutations
mut.skcm <- getTCGA(disease = "SKCM", data.type = "Mutation", type = "somatic")

#make data.table version of mutation data
mut <- data.table(Gene = rownames(mut.skcm$dat), mut.skcm$dat)

#download copy number values, remove Y chromosome genes
cnv.skcm <- getTCGA(disease = "SKCM", data.type = "CNV_SNP", filter = "Y")

#make data.table version of copy number data and sets thresholds for loss and gain at -0.2 and 0.2 respectively
cp <- data.table(Gene = rownames(cnv.skcm$dat), cnv.skcm$dat)
for (j in seq_len(ncol(cp))[-1])  {
    set(cp, which(cp[[j]] > 0.2),j,1)
    set(cp, which(cp[[j]] < -0.2),j,-1)
    set(cp, which(cp[[j]] >= -0.2 & cp[[j]] <= 0.2),j,0)
  }
  
setnames(cp, sub("(TCGA-.*?-.*?-.*?)-.*", "\\1", colnames(cp)))
setcolorder(cp, order(colnames(cp)))

#get rppa data
'%&%' <- function(a, b) paste(a, b, sep="")

getTCGA_rppa = function(disease){
  # disease = "BRCA"
  ldoc <- XML::htmlTreeParse("http://gdac.broadinstitute.org/runs/stddata__latest/", useInternalNodes = T)
  llinks = unlist(XML::xpathApply(ldoc, "//a[@href]", XML::xmlGetAttr, "href"))
  d.links = grep("/data/" %&% disease,llinks,value = T)
  ddoc = XML::htmlTreeParse(d.links, useInternalNodes = T)
  keyWord = "//a[contains(@href, 'RPPA')]"
  plinks = XML::xpathSApply(ddoc, keyWord, XML::xmlGetAttr, 'href')
  plinks = plinks[grepl("Level_3.*.gz$", plinks)]
  download.file(d.links %&% "/" %&% plinks, destfile = "temp.tar.gz")
  untar("temp.tar.gz")
  file.remove("temp.tar.gz")
  infile = sub("^(gdac.broadinstitute.org_.*).tar.gz$", "\\1", plinks) 
  #annot = fread(infile %&% "/" %&% disease %&% ".antibody_annotation.txt")
  if(file.exists(infile %&% "/" %&% disease %&% ".rppa.txt")){
    rppa = read.table(infile %&% "/" %&% disease %&% ".rppa.txt", sep="\t", head=T, stringsAsFactors = F)
    names(rppa) = gsub("\\.","-",names(rppa))
    t.ceref = t(sapply(rppa$'Composite-Element-REF', function(x) unlist(strsplit(x, "\\|")[[1]])))
    rppa$Gene = t.ceref[,1]
    rppa$CE.REF = t.ceref[,2]
    rppa = rppa[,c(ncol(rppa)-1,ncol(rppa),2:(ncol(rppa)-2))]
   } else {
    rppa = "file not found"
  } 
  unlink(infile, recursive=TRUE)
  rppa
}

rppa <- data.table(getTCGA_rppa("SKCM"))
setnames(rppa, sub("(TCGA-.*?-.*?-.*?)-.*", "\\1", colnames(rppa)))

#put survival dates into one column and convert to years
pat <- data.table(bcr_patient_barcode = rownames(rnaseq.skcm$clinical), rnaseq.skcm$clinical)
pat$days <- as.numeric(ifelse(is.na(pat$daystodeath), pat$daystolastfollowup, pat$daystodeath)) 
pat$TCGA_year <- pat$days/365.25
pat$vitalstatus <- as.numeric(pat$vitalstatus)

#adds RNAseq sample name to clinical data
samp <- data.table(name = colnames(rna), bcr_patient_barcode = sub("(TCGA-.*?-.*?)-.*", "\\1", colnames(rna)))
pat <- merge(pat, samp, by = "bcr_patient_barcode")

4 Subset by gene expression workflow

rnasubset needs 4 parameters: the clinical table, the RNAseq count table, a gene and a percentage cutoff. This function will intersect the TCGA patient barcode with the sample barcodes from RNAseq. If there is more than one sample for one patient, the patient row will be duplicated in the output table. Some patients have a primary tumor and a metastatic tumor sample.

rnasubset will return a data.table with 5 additional columns to the original clinical table. The level column will contain the RSEM values for the gene of interest. The exprs_rank column will contain the rank of gene expression. The high column will contain TRUE values for samples that are in the upper quantile of expression. The low column will contain TRUE values for samples that are in the lower quantile of expression. The gene2 column contains low, middle, high identifiers for which gene expression group the sample belongs to.

In this example, we will compare patients in the top 20% of SOX10 expression and to the bottom 20% of SOX10 expression.

sox10.rna <- rnasubset(pat, rna, "SOX10", 20)

4.1 Expression plotting

The output table from rnasubset can be used for downstream analysis and plotting functions. plotlygenelevel and hchartgenelevel are wrappers for plot_ly from plotly and hchart from highcharter respectively. These two functions will plot the gene expression of your gene of interest across all samples with coloring for the low, middle and high groups.

plotlygenelevel(sox10.rna)

hchartgenelevel(sox10.rna)

4.2 Survival Analysis

genesurv needs 2 parameters: the output table from rnasubset and the gene name. genesurv is a wrapper for ggsurvplot from survminer. It will plot the Kaplan-Meier curves for the high gene and low gene expression groups and report the p value from the log-rank test.

genesurv(sox10.rna, "SOX10")

4.3 Differential gene expression analysis

rnadeg needs 2 parameters: the output table from rnasubset and the RNAseq count table. rnadeg is a wrapper for voom from limma. It will calculate the genes differentially expressed between the high gene and low gene expression groups.

sox10.deg <- rnadeg(sox10.rna, rna)
head(sox10.deg)

##            genes     logFC  AveExpr         t      P.Value    adj.P.Val
## 6348        FGD6 -2.294677 2.240081 -14.25145 6.116085e-32 8.141732e-28
## 19218      USP15 -1.860727 4.691167 -13.83370 1.126107e-30 6.372695e-27
## 2493     C4orf32 -2.338852 2.130889 -13.79884 1.436154e-30 6.372695e-27
## 15893      SGMS1 -2.115494 2.716887 -13.74117 2.147329e-30 7.146311e-27
## 4281  CSGALNACT2 -1.691013 4.097050 -13.43576 1.807036e-29 4.811053e-26
## 5499        ELL2 -2.155000 4.418038 -13.39108 2.467639e-29 5.474868e-26
##              B
## 6348  61.75460
## 19218 59.06146
## 2493  58.64543
## 15893 58.34007
## 4281  56.31973
## 5499  56.00969

rnaheat needs 4 parameters: the output table from rnasubset, the RNAseq count table, the output table from rnadeg, and the gene name. rnaheat is a wrapper for Heatmap fromComplexHeatmap. It will plot a heatmap of the top 100 most significant differentially expressed genes, draw a colored column bar to indicate the high gene and low gene expression groups, and a colored column bar to indicate the rank of the gene of interest.

rnaheat(sox10.rna, rna, sox10.deg, "SOX10")

4.4 GSVA pathway analysis

rnagsva needs 2 parameters: the output table from rnasubset and the RNAseq count table. rnagsva is a wrapper for gsva from GSVA. It will calculate the GSVA enrichment scores for all samples using pathway from the c2BroadSets.

rnagsvasig needs 2 parameters: the output table from rnasubset and the output table from rnagsva. rnagsvasig is a wrapper for lmFit from limma. It will determine the significant differences in GSVA enrichment scores between the high gene and low gene expression group.

rnagsvaheat needs 4 parameters: the output table from rnasubset, the output table from rnagsva, the output table from rnagsvasig, and the gene name. rnagsvaheat is a wrapper for Heatmap from ComplexHeatmap. It will display a heatmap of the top 100 most signifcant pathways with a colored column bar to indicate the high gene and low gene expression groups.

require(GSVAdata)
data(c2BroadSets)
sox10.gsva <- rnagsva(sox10.rna, rna)

## Estimating GSVA scores for 2917 gene sets.
## Computing observed enrichment scores
## Estimating ECDFs in rnaseq data with Poisson kernels
## Using parallel with 4 cores
## 
  |                                                                       
  |                                                                 |   0%

sox10.gsvasig <- rnagsvasig(sox10.rna, sox10.gsva)
head(sox10.gsvasig)

##                                                 logFC       AveExpr
## JIANG_TIP30_TARGETS_DN                      0.5749123 -0.0540496698
## EHLERS_ANEUPLOIDY_DN                        0.6387584 -0.0257061454
## HOEBEKE_LYMPHOID_STEM_CELL_UP              -0.3770043 -0.0033900669
## KIM_GASTRIC_CANCER_CHEMOSENSITIVITY         0.3116832 -0.0214516127
## HALMOS_CEBPA_TARGETS_UP                    -0.3823368 -0.0064836597
## BACOLOD_RESISTANCE_TO_ALKYLATING_AGENTS_UP -0.4969431 -0.0001819253
##                                                    t      P.Value
## JIANG_TIP30_TARGETS_DN                      14.45752 1.232265e-32
## EHLERS_ANEUPLOIDY_DN                        13.41198 1.859839e-29
## HOEBEKE_LYMPHOID_STEM_CELL_UP              -12.93016 5.409450e-28
## KIM_GASTRIC_CANCER_CHEMOSENSITIVITY         12.91066 6.199391e-28
## HALMOS_CEBPA_TARGETS_UP                    -12.87979 7.691832e-28
## BACOLOD_RESISTANCE_TO_ALKYLATING_AGENTS_UP -12.62644 4.509977e-27
##                                               adj.P.Val        B
## JIANG_TIP30_TARGETS_DN                     3.594517e-29 63.49259
## EHLERS_ANEUPLOIDY_DN                       2.712575e-26 56.27447
## HOEBEKE_LYMPHOID_STEM_CELL_UP              4.487415e-25 52.95033
## KIM_GASTRIC_CANCER_CHEMOSENSITIVITY        4.487415e-25 52.81589
## HALMOS_CEBPA_TARGETS_UP                    4.487415e-25 52.60313
## BACOLOD_RESISTANCE_TO_ALKYLATING_AGENTS_UP 2.192600e-24 50.85857

rnagsvaheat(sox10.rna, sox10.gsva, sox10.gsvasig, "SOX10")

4.5 Reactome pathway analysis

rnareact needs 1 parameter: the output table from rnasubset. rnareact is wrapper for enrichPathway from ReactomePA. It will calculate enriched pathways from Reactome database.

plotreact needs 4 parameters: the output table from rnareact, the output table from rnadeg, the graph selection which can be ‘dot’, ‘map’, or ‘cnet’, and the number of pathways for plotting. plotreact is a wrapper for dotplot, enrichMap, and cnetplot from ReactomePA.

sox10.react <- rnareact(sox10.deg)

#dotplot
plotreact(sox10.react, sox10.deg, "dot", 15)

#pathway plot
plotreact(sox10.react, sox10.deg, "map", 15)

#cnet plot
plotreact(sox10.react, sox10.deg, "cnet", 5)

4.6 Differential mutation analysis

diffmut needs 2 parameters: the output table from rnasubset and the binary mutation matrix table. It will perform a chi-squared test between the proportions of mutated and wild-type genes between the high gene and low gene expression groups. The binary mutation matrix is an over-simplification of mutation status of a gene. Here the assumption is every mutation in a gene causes the same effect.

plotlymut needs 4 parameters: the output table from rnasuset, the binary mutation matrix table, the output table from diffmut, and the gene name. It will create an Oncoprint-like plot of the top 20 differentially mutated genes and your gene of interest.

sox10.mut <- diffmut(sox10.rna, mut)
head(sox10.mut)

##      Gene N.high N.hightotal N.low N.lowtotal     p.value FDR
## 1: LILRB4     14          67     2         71 0.002293250   1
## 2:  POTEF      9          67     0         71 0.004382139   1
## 3: ZCCHC5      8          67     0         71 0.008402271   1
## 4:  CCNB3      2          67    13         71 0.008868286   1
## 5:   FASN      0          67     8         71 0.013645791   1
## 6:  STON2      0          67     8         71 0.013645791   1

plotlymut(sox10.rna, mut, sox10.mut, "SOX10")

4.7 Differential copy number analysis

diffcp needs 2 parameters: the output table from rnasubset and the processed copy number table with only -1, 0, and 1 levels. It will perform a chi-squared test between copy number gain and copy number neutral and another chi-squared test for copy number neutral and copy number loss.

plotlycp needs 4 parameters: the output table from rnasubset, the processed copy number table, the output table from diffcp, and the gene name. It will create an Oncoprint-like plot of the top 20 genes with copy number differenes and your gene of interest.

sox10.cp <- diffcp(sox10.rna, cp)
head(sox10.cp)

##        Gene p.value.high  p.value.low FDR.high     FDR.low
## 1:    ACTA2          NaN 1.357802e-03      NaN 0.020450056
## 2: ADAMTS14          NaN 2.843105e-04      NaN 0.008240953
## 3:    ADIRF          NaN 8.323026e-04      NaN 0.015039620
## 4:      ADO          NaN 4.832831e-05      NaN 0.005724420
## 5:   AGAP11          NaN 8.323026e-04      NaN 0.015039620
## 6:    AIFM2          NaN 2.843105e-04      NaN 0.008240953

plotlycp(sox10.rna, cp, sox10.cp, "SOX10")

4.8 Differential RPPA analysis

rppaheat needs 3 parameters: the output table from rnasubset, the RPPA expression table, and the gene name. rppaheat will draw the heatmap for all proteins. rppadeg needs 2 parameters: the output table from rnasubset and the RPPA expression table. rppadeg is a wrapper for lmFit from limma. Since not all samples with RNAseq data have corresponding RPPA data, rppadeg will perform the intersection and calculate the differentially expressed proteins.

rppaheat(sox10.rna, rppa, "SOX10")

sox10.rppa <- rppadeg(sox10.rna, rppa)
head(sox10.rppa)

##               logFC     AveExpr       t     P.Value    adj.P.Val        B
## E-Cadherin 1.068963 -0.08984641 4.56378 1.08384e-05 0.0005193874 2.475999

5 Subset by mutation workflow

mutsubset needs 4 parameters: the clinical table, the RNAseq count table, a binary mutation matrix, and a gene name. It will return a data.table with 4 additional columns to the clinical table. The gene2 column contains if the gene is WT or mutated. The level column contains the RSEM values of the gene. The exprs_rank column contains the expression rank of the gene.

In this workflow, we will use NRAS as an example since it is mutated about 30% in melanoma. First we will identify which samples are mutated from the binary mutation matrix and add a column to the clinical table using mutsubset.

nras.mut <- mutsubset(pat, rna, mut, "NRAS")

5.1 Expression plotting

The expression of NRAS can be plotted with plotlygenelevel or hchartgenelevel identifying the mutated and wild-type NRAS samples.

plotlygenelevel(nras.mut)

hchartgenelevel(nras.mut)

5.2 Survival Analysis

Survival analysis comparing the NRAS mutant and wild-type samples is performed with genesurv.

genesurv(nras.mut, "NRAS")

5.3 Differential gene expression analysis

Differentially expressed genes between the NRAS mutant and wild-type samples is determined with rnadeg.

nras.deg <- rnadeg(nras.mut, rna)
head(nras.deg)

##          genes     logFC    AveExpr         t      P.Value    adj.P.Val
## 19472    WDR17  1.658291 -0.1211830  6.409451 4.806984e-10 1.065468e-06
## 8995  KIAA1755 -1.007829  4.3329956 -4.732267 3.246504e-06 3.722005e-04
## 13161   PDLIM4 -1.647180  3.8021957 -4.401209 1.437817e-05 9.051001e-04
## 6660       FST -1.475561  1.1227029 -4.189131 3.561891e-05 1.590754e-03
## 3240      CCNO -1.236970  0.8462824 -4.144502 4.291137e-05 1.777814e-03
## 5965   FAM178B  1.621504  1.9647182  4.043829 6.493079e-05 2.266442e-03
##               B
## 19472 10.682159
## 8995   4.212745
## 13161  2.817319
## 6660   1.639425
## 3240   1.458504
## 5965   1.478388

rnaheat will produce the heatmap with the top 100 differentially expressed genes between the NRAS wild-type and mutated populations

rnaheat(nras.mut, rna, nras.deg, "NRAS")

5.4 GSVA pathway analysis

The significant pathways between the NRAS wild-type and mutant populations can be determined by rnagsva and rnagsvasig, and visualized with rnagsvaheat.

require(GSVAdata)
data(c2BroadSets)
nras.gsva <- rnagsva(nras.mut, rna)

## Estimating GSVA scores for 2917 gene sets.
## Computing observed enrichment scores
## Estimating ECDFs in rnaseq data with Poisson kernels
## Using parallel with 4 cores
## 
  |                                                                       
  |                                                                 |   0%

nras.gsvasig <- rnagsvasig(nras.mut, nras.gsva)
head(nras.gsvasig)

##                                                          logFC
## REACTOME_RNA_POLYMERASE_I_TRANSCRIPTION_INITIATION  0.19259257
## CHANDRAN_METASTASIS_UP                              0.09672968
## MATTIOLI_MGUS_VS_MULTIPLE_MYELOMA                   0.16757530
## HOSHIDA_LIVER_CANCER_SUBCLASS_S2                    0.14966950
## BIOCARTA_PITX2_PATHWAY                              0.18446976
## YANG_BREAST_CANCER_ESR1_BULK_UP                    -0.15130505
##                                                         AveExpr         t
## REACTOME_RNA_POLYMERASE_I_TRANSCRIPTION_INITIATION -0.014090117  5.404492
## CHANDRAN_METASTASIS_UP                             -0.009443104  5.241218
## MATTIOLI_MGUS_VS_MULTIPLE_MYELOMA                  -0.016254619  5.212805
## HOSHIDA_LIVER_CANCER_SUBCLASS_S2                   -0.013801843  5.190258
## BIOCARTA_PITX2_PATHWAY                             -0.020002332  5.157885
## YANG_BREAST_CANCER_ESR1_BULK_UP                    -0.020198188 -5.059070
##                                                         P.Value
## REACTOME_RNA_POLYMERASE_I_TRANSCRIPTION_INITIATION 1.212959e-07
## CHANDRAN_METASTASIS_UP                             2.781074e-07
## MATTIOLI_MGUS_VS_MULTIPLE_MYELOMA                  3.206500e-07
## HOSHIDA_LIVER_CANCER_SUBCLASS_S2                   3.588370e-07
## BIOCARTA_PITX2_PATHWAY                             4.214687e-07
## YANG_BREAST_CANCER_ESR1_BULK_UP                    6.852993e-07
##                                                       adj.P.Val        B
## REACTOME_RNA_POLYMERASE_I_TRANSCRIPTION_INITIATION 0.0002458848 7.267885
## CHANDRAN_METASTASIS_UP                             0.0002458848 6.498342
## MATTIOLI_MGUS_VS_MULTIPLE_MYELOMA                  0.0002458848 6.366461
## HOSHIDA_LIVER_CANCER_SUBCLASS_S2                   0.0002458848 6.262242
## BIOCARTA_PITX2_PATHWAY                             0.0002458848 6.113276
## YANG_BREAST_CANCER_ESR1_BULK_UP                    0.0003221519 5.663487

rnagsvaheat(nras.mut, nras.gsva, nras.gsvasig, "NRAS")

5.5 Reactome pathway analysis

Reactome pathway analysis can also be performed on the NRAS wild-type and mutant populations with rnareact.

nras.react <- rnareact(nras.deg)

#dotplot
plotreact(nras.react, nras.deg, "dot", 15)

#pathway plot
plotreact(nras.react, nras.deg, "map", 15)

#cnet plot
plotreact(nras.react, nras.deg, "cnet", 5)

5.6 Differential mutation analysis

The mutations different between the NRAS wild-type and mutant populations can be determined with diffmut and visualized with plotlymut.

nras.mut2 <- diffmut(nras.mut, mut)
head(nras.mut2)

##        Gene N.high N.hightotal N.low N.lowtotal      p.value          FDR
## 1:     NRAS     96          96     0        246 2.779897e-75 4.252964e-71
## 2:     BRAF      5          96   172        246 1.924594e-26 1.472218e-22
## 3: RAB3GAP1      9          96     1        246 4.775129e-05 2.287547e-01
## 4:     MTTP     13          96     5        246 5.980906e-05 2.287547e-01
## 5:  PKHD1L1     45          96    60        246 8.843756e-05 2.706013e-01
## 6:  CLEC18B      8          96     2        246 8.020348e-04 1.000000e+00

plotlymut(nras.mut, mut, nras.mut2, "NRAS")

5.7 Differential copy number analysis

The copy number alterations betweeh the NRAS wild-type and mutant populations can be determined with diffcp and visualizyed with plotlycp.

nras.cp <- diffcp(nras.mut, cp)
head(nras.cp)

##       Gene p.value.high p.value.low     FDR.high FDR.low
## 1: C7orf34 2.619719e-07   0.8729025 0.0002024957       1
## 2:   CASP2 2.619719e-07   0.8729025 0.0002024957       1
## 3:   CLCN1 2.619719e-07   0.8729025 0.0002024957       1
## 4:   EPHA1 2.619719e-07   0.8729025 0.0002024957       1
## 5: FAM131B 2.619719e-07   0.8729025 0.0002024957       1
## 6:   GSTK1 2.619719e-07   0.8729025 0.0002024957       1

plotlycp(nras.mut, cp, nras.cp, "NRAS")

5.8 Differential RPPA analysis

The differentially expressed proteins between the NRAS wild-type and NRAS mutant populations can be determined by rppadeg and visualized with rppaheat.

rppaheat(nras.mut, rppa, "NRAS")

nras.rppa <- rppadeg(nras.mut, rppa)
head(nras.rppa)

## data frame with 0 columns and 0 rows

TCGAbrowser workflow

Phil Cheng

8 March 2017

Package version: TCGAbrowser 0.1.0

Contents