The Cancer Genome Atlas has provided a comprehensive omics study of 33 cancers. The National Cancer Institute (NCI) has made this data pubically available on the Genomic Data Commons (GDC) Data Portal. There are several R packages that download the data, RTCGAToolbox, RTCGA, TCGAbiolinks, TCGA2STAT, MultiAssayExperiment, and TCGA-Assembler.
The motivation for TCGAbrowser was to provide convenience functions and wrappers to analyze the omics datasets by gene expression or mutation subsetting. Typically, biologists are studying one gene in one cancer by overexpression and/or knockdown experiments. TCGAbrowser was designed to subset the datasets by gene expression or gene mutation and perform differential gene expresion, mutation, copy number, protein and survival analysis.
To install TCGAbrowser use the code below.
require(devtools)
devtools::install_github("pcheng84/TCGAbrowser")
require(TCGAbrowser)
The workflow for TCGAbrowser is shown in the graphic below.
TCGAbrowser functions were designed either to subset the data, perform statistical analysis or for plotting.
Subset functions include rnasubset and mutsubset which that subset the data based on RNA expression and mutation status respectively.
Analysis functions include: diffmut for differential mutation analysis; diffcp for differential copy number analysis; rnadeg for differential RNA expression analysis; rnagsva and rnagsvasig for differential GSVA analysis; rnareact for reactome pathway analysis on the differential RNA expression analysis; rppadeg for differential RPPA analysis.
Plotting functions include: genesurv for Kaplan-Meier survival plots; plotlymut for an Oncoprint-like plot of mutation data; plotlycp for an Oncoprint-like plot of copy number data; rnaheat for a heatmap of differential expressed genes; rnagsvaheat for a heatmap of differential expressed pathways; plotreact for Reactome dotplots, enrichment plots and cnetmap plots; rppaheat for a heatmap of differential expressed proteins.
This vignette assumes the user will use one of the R packages mentioned above to download the TCGA data. In this example, we used TCGA2STAT to download the data. We also used a customized function to download the RPPA data. Some processing steps are needed for the data.frames to be used in TCGAbrowser.
require(TCGA2STAT)
#download RSEM normalized gene expression values and clinical data
rnaseq.skcm <- getTCGA(disease = "SKCM", data.type = "RNASeq2", clinical = TRUE)
#make data.table version of RNAseq data and shortens sample name
require(data.table)
rna <- data.table(Gene = rownames(rnaseq.skcm$dat), rnaseq.skcm$dat)
#renames RNAseq data
setnames(rna, sub("(TCGA-.*?-.*?-.*?)-.*", "\\1", colnames(rna)))
#reorders samples alphabetically
setcolorder(rna, order(colnames(rna)))
#download binary matrix of mutations
mut.skcm <- getTCGA(disease = "SKCM", data.type = "Mutation", type = "somatic")
#make data.table version of mutation data
mut <- data.table(Gene = rownames(mut.skcm$dat), mut.skcm$dat)
#download copy number values, remove Y chromosome genes
cnv.skcm <- getTCGA(disease = "SKCM", data.type = "CNV_SNP", filter = "Y")
#make data.table version of copy number data and sets thresholds for loss and gain at -0.2 and 0.2 respectively
cp <- data.table(Gene = rownames(cnv.skcm$dat), cnv.skcm$dat)
for (j in seq_len(ncol(cp))[-1]) {
set(cp, which(cp[[j]] > 0.2),j,1)
set(cp, which(cp[[j]] < -0.2),j,-1)
set(cp, which(cp[[j]] >= -0.2 & cp[[j]] <= 0.2),j,0)
}
setnames(cp, sub("(TCGA-.*?-.*?-.*?)-.*", "\\1", colnames(cp)))
setcolorder(cp, order(colnames(cp)))
#get rppa data
'%&%' <- function(a, b) paste(a, b, sep="")
getTCGA_rppa = function(disease){
# disease = "BRCA"
ldoc <- XML::htmlTreeParse("http://gdac.broadinstitute.org/runs/stddata__latest/", useInternalNodes = T)
llinks = unlist(XML::xpathApply(ldoc, "//a[@href]", XML::xmlGetAttr, "href"))
d.links = grep("/data/" %&% disease,llinks,value = T)
ddoc = XML::htmlTreeParse(d.links, useInternalNodes = T)
keyWord = "//a[contains(@href, 'RPPA')]"
plinks = XML::xpathSApply(ddoc, keyWord, XML::xmlGetAttr, 'href')
plinks = plinks[grepl("Level_3.*.gz$", plinks)]
download.file(d.links %&% "/" %&% plinks, destfile = "temp.tar.gz")
untar("temp.tar.gz")
file.remove("temp.tar.gz")
infile = sub("^(gdac.broadinstitute.org_.*).tar.gz$", "\\1", plinks)
#annot = fread(infile %&% "/" %&% disease %&% ".antibody_annotation.txt")
if(file.exists(infile %&% "/" %&% disease %&% ".rppa.txt")){
rppa = read.table(infile %&% "/" %&% disease %&% ".rppa.txt", sep="\t", head=T, stringsAsFactors = F)
names(rppa) = gsub("\\.","-",names(rppa))
t.ceref = t(sapply(rppa$'Composite-Element-REF', function(x) unlist(strsplit(x, "\\|")[[1]])))
rppa$Gene = t.ceref[,1]
rppa$CE.REF = t.ceref[,2]
rppa = rppa[,c(ncol(rppa)-1,ncol(rppa),2:(ncol(rppa)-2))]
} else {
rppa = "file not found"
}
unlink(infile, recursive=TRUE)
rppa
}
rppa <- data.table(getTCGA_rppa("SKCM"))
setnames(rppa, sub("(TCGA-.*?-.*?-.*?)-.*", "\\1", colnames(rppa)))
#put survival dates into one column and convert to years
pat <- data.table(bcr_patient_barcode = rownames(rnaseq.skcm$clinical), rnaseq.skcm$clinical)
pat$days <- as.numeric(ifelse(is.na(pat$daystodeath), pat$daystolastfollowup, pat$daystodeath))
pat$TCGA_year <- pat$days/365.25
pat$vitalstatus <- as.numeric(pat$vitalstatus)
#adds RNAseq sample name to clinical data
samp <- data.table(name = colnames(rna), bcr_patient_barcode = sub("(TCGA-.*?-.*?)-.*", "\\1", colnames(rna)))
pat <- merge(pat, samp, by = "bcr_patient_barcode")
rnasubset needs 4 parameters: the clinical table, the RNAseq count table, a gene and a percentage cutoff. This function will intersect the TCGA patient barcode with the sample barcodes from RNAseq. If there is more than one sample for one patient, the patient row will be duplicated in the output table. Some patients have a primary tumor and a metastatic tumor sample.
rnasubset will return a data.table with 5 additional columns to the original clinical table. The level column will contain the RSEM values for the gene of interest. The exprs_rank column will contain the rank of gene expression. The high column will contain TRUE values for samples that are in the upper quantile of expression. The low column will contain TRUE values for samples that are in the lower quantile of expression. The gene2 column contains low, middle, high identifiers for which gene expression group the sample belongs to.
In this example, we will compare patients in the top 20% of SOX10 expression and to the bottom 20% of SOX10 expression.
sox10.rna <- rnasubset(pat, rna, "SOX10", 20)
The output table from rnasubset can be used for downstream analysis and plotting functions. plotlygenelevel and hchartgenelevel are wrappers for plot_ly from plotly and hchart from highcharter respectively. These two functions will plot the gene expression of your gene of interest across all samples with coloring for the low, middle and high groups.
plotlygenelevel(sox10.rna)
hchartgenelevel(sox10.rna)
genesurv needs 2 parameters: the output table from rnasubset and the gene name. genesurv is a wrapper for ggsurvplot from survminer. It will plot the Kaplan-Meier curves for the high gene and low gene expression groups and report the p value from the log-rank test.
genesurv(sox10.rna, "SOX10")
rnadeg needs 2 parameters: the output table from rnasubset and the RNAseq count table. rnadeg is a wrapper for voom from limma. It will calculate the genes differentially expressed between the high gene and low gene expression groups.
sox10.deg <- rnadeg(sox10.rna, rna)
head(sox10.deg)
## genes logFC AveExpr t P.Value adj.P.Val
## 6348 FGD6 -2.294677 2.240081 -14.25145 6.116085e-32 8.141732e-28
## 19218 USP15 -1.860727 4.691167 -13.83370 1.126107e-30 6.372695e-27
## 2493 C4orf32 -2.338852 2.130889 -13.79884 1.436154e-30 6.372695e-27
## 15893 SGMS1 -2.115494 2.716887 -13.74117 2.147329e-30 7.146311e-27
## 4281 CSGALNACT2 -1.691013 4.097050 -13.43576 1.807036e-29 4.811053e-26
## 5499 ELL2 -2.155000 4.418038 -13.39108 2.467639e-29 5.474868e-26
## B
## 6348 61.75460
## 19218 59.06146
## 2493 58.64543
## 15893 58.34007
## 4281 56.31973
## 5499 56.00969
rnaheat needs 4 parameters: the output table from rnasubset, the RNAseq count table, the output table from rnadeg, and the gene name. rnaheat is a wrapper for Heatmap fromComplexHeatmap. It will plot a heatmap of the top 100 most significant differentially expressed genes, draw a colored column bar to indicate the high gene and low gene expression groups, and a colored column bar to indicate the rank of the gene of interest.
rnaheat(sox10.rna, rna, sox10.deg, "SOX10")
rnagsva needs 2 parameters: the output table from rnasubset and the RNAseq count table. rnagsva is a wrapper for gsva from GSVA. It will calculate the GSVA enrichment scores for all samples using pathway from the c2BroadSets.
rnagsvasig needs 2 parameters: the output table from rnasubset and the output table from rnagsva. rnagsvasig is a wrapper for lmFit from limma. It will determine the significant differences in GSVA enrichment scores between the high gene and low gene expression group.
rnagsvaheat needs 4 parameters: the output table from rnasubset, the output table from rnagsva, the output table from rnagsvasig, and the gene name. rnagsvaheat is a wrapper for Heatmap from ComplexHeatmap. It will display a heatmap of the top 100 most signifcant pathways with a colored column bar to indicate the high gene and low gene expression groups.
require(GSVAdata)
data(c2BroadSets)
sox10.gsva <- rnagsva(sox10.rna, rna)
## Estimating GSVA scores for 2917 gene sets.
## Computing observed enrichment scores
## Estimating ECDFs in rnaseq data with Poisson kernels
## Using parallel with 4 cores
##
|
| | 0%
sox10.gsvasig <- rnagsvasig(sox10.rna, sox10.gsva)
head(sox10.gsvasig)
## logFC AveExpr
## JIANG_TIP30_TARGETS_DN 0.5749123 -0.0540496698
## EHLERS_ANEUPLOIDY_DN 0.6387584 -0.0257061454
## HOEBEKE_LYMPHOID_STEM_CELL_UP -0.3770043 -0.0033900669
## KIM_GASTRIC_CANCER_CHEMOSENSITIVITY 0.3116832 -0.0214516127
## HALMOS_CEBPA_TARGETS_UP -0.3823368 -0.0064836597
## BACOLOD_RESISTANCE_TO_ALKYLATING_AGENTS_UP -0.4969431 -0.0001819253
## t P.Value
## JIANG_TIP30_TARGETS_DN 14.45752 1.232265e-32
## EHLERS_ANEUPLOIDY_DN 13.41198 1.859839e-29
## HOEBEKE_LYMPHOID_STEM_CELL_UP -12.93016 5.409450e-28
## KIM_GASTRIC_CANCER_CHEMOSENSITIVITY 12.91066 6.199391e-28
## HALMOS_CEBPA_TARGETS_UP -12.87979 7.691832e-28
## BACOLOD_RESISTANCE_TO_ALKYLATING_AGENTS_UP -12.62644 4.509977e-27
## adj.P.Val B
## JIANG_TIP30_TARGETS_DN 3.594517e-29 63.49259
## EHLERS_ANEUPLOIDY_DN 2.712575e-26 56.27447
## HOEBEKE_LYMPHOID_STEM_CELL_UP 4.487415e-25 52.95033
## KIM_GASTRIC_CANCER_CHEMOSENSITIVITY 4.487415e-25 52.81589
## HALMOS_CEBPA_TARGETS_UP 4.487415e-25 52.60313
## BACOLOD_RESISTANCE_TO_ALKYLATING_AGENTS_UP 2.192600e-24 50.85857
rnagsvaheat(sox10.rna, sox10.gsva, sox10.gsvasig, "SOX10")
rnareact needs 1 parameter: the output table from rnasubset. rnareact is wrapper for enrichPathway from ReactomePA. It will calculate enriched pathways from Reactome database.
plotreact needs 4 parameters: the output table from rnareact, the output table from rnadeg, the graph selection which can be ‘dot’, ‘map’, or ‘cnet’, and the number of pathways for plotting. plotreact is a wrapper for dotplot, enrichMap, and cnetplot from ReactomePA.
sox10.react <- rnareact(sox10.deg)
#dotplot
plotreact(sox10.react, sox10.deg, "dot", 15)
#pathway plot
plotreact(sox10.react, sox10.deg, "map", 15)
#cnet plot
plotreact(sox10.react, sox10.deg, "cnet", 5)
diffmut needs 2 parameters: the output table from rnasubset and the binary mutation matrix table. It will perform a chi-squared test between the proportions of mutated and wild-type genes between the high gene and low gene expression groups. The binary mutation matrix is an over-simplification of mutation status of a gene. Here the assumption is every mutation in a gene causes the same effect.
plotlymut needs 4 parameters: the output table from rnasuset, the binary mutation matrix table, the output table from diffmut, and the gene name. It will create an Oncoprint-like plot of the top 20 differentially mutated genes and your gene of interest.
sox10.mut <- diffmut(sox10.rna, mut)
head(sox10.mut)
## Gene N.high N.hightotal N.low N.lowtotal p.value FDR
## 1: LILRB4 14 67 2 71 0.002293250 1
## 2: POTEF 9 67 0 71 0.004382139 1
## 3: ZCCHC5 8 67 0 71 0.008402271 1
## 4: CCNB3 2 67 13 71 0.008868286 1
## 5: FASN 0 67 8 71 0.013645791 1
## 6: STON2 0 67 8 71 0.013645791 1
plotlymut(sox10.rna, mut, sox10.mut, "SOX10")
diffcp needs 2 parameters: the output table from rnasubset and the processed copy number table with only -1, 0, and 1 levels. It will perform a chi-squared test between copy number gain and copy number neutral and another chi-squared test for copy number neutral and copy number loss.
plotlycp needs 4 parameters: the output table from rnasubset, the processed copy number table, the output table from diffcp, and the gene name. It will create an Oncoprint-like plot of the top 20 genes with copy number differenes and your gene of interest.
sox10.cp <- diffcp(sox10.rna, cp)
head(sox10.cp)
## Gene p.value.high p.value.low FDR.high FDR.low
## 1: ACTA2 NaN 1.357802e-03 NaN 0.020450056
## 2: ADAMTS14 NaN 2.843105e-04 NaN 0.008240953
## 3: ADIRF NaN 8.323026e-04 NaN 0.015039620
## 4: ADO NaN 4.832831e-05 NaN 0.005724420
## 5: AGAP11 NaN 8.323026e-04 NaN 0.015039620
## 6: AIFM2 NaN 2.843105e-04 NaN 0.008240953
plotlycp(sox10.rna, cp, sox10.cp, "SOX10")
rppaheat needs 3 parameters: the output table from rnasubset, the RPPA expression table, and the gene name. rppaheat will draw the heatmap for all proteins. rppadeg needs 2 parameters: the output table from rnasubset and the RPPA expression table. rppadeg is a wrapper for lmFit from limma. Since not all samples with RNAseq data have corresponding RPPA data, rppadeg will perform the intersection and calculate the differentially expressed proteins.
rppaheat(sox10.rna, rppa, "SOX10")
sox10.rppa <- rppadeg(sox10.rna, rppa)
head(sox10.rppa)
## logFC AveExpr t P.Value adj.P.Val B
## E-Cadherin 1.068963 -0.08984641 4.56378 1.08384e-05 0.0005193874 2.475999
mutsubset needs 4 parameters: the clinical table, the RNAseq count table, a binary mutation matrix, and a gene name. It will return a data.table with 4 additional columns to the clinical table. The gene2 column contains if the gene is WT or mutated. The level column contains the RSEM values of the gene. The exprs_rank column contains the expression rank of the gene.
In this workflow, we will use NRAS as an example since it is mutated about 30% in melanoma. First we will identify which samples are mutated from the binary mutation matrix and add a column to the clinical table using mutsubset.
nras.mut <- mutsubset(pat, rna, mut, "NRAS")
The expression of NRAS can be plotted with plotlygenelevel or hchartgenelevel identifying the mutated and wild-type NRAS samples.
plotlygenelevel(nras.mut)
hchartgenelevel(nras.mut)
Survival analysis comparing the NRAS mutant and wild-type samples is performed with genesurv.
genesurv(nras.mut, "NRAS")
Differentially expressed genes between the NRAS mutant and wild-type samples is determined with rnadeg.
nras.deg <- rnadeg(nras.mut, rna)
head(nras.deg)
## genes logFC AveExpr t P.Value adj.P.Val
## 19472 WDR17 1.658291 -0.1211830 6.409451 4.806984e-10 1.065468e-06
## 8995 KIAA1755 -1.007829 4.3329956 -4.732267 3.246504e-06 3.722005e-04
## 13161 PDLIM4 -1.647180 3.8021957 -4.401209 1.437817e-05 9.051001e-04
## 6660 FST -1.475561 1.1227029 -4.189131 3.561891e-05 1.590754e-03
## 3240 CCNO -1.236970 0.8462824 -4.144502 4.291137e-05 1.777814e-03
## 5965 FAM178B 1.621504 1.9647182 4.043829 6.493079e-05 2.266442e-03
## B
## 19472 10.682159
## 8995 4.212745
## 13161 2.817319
## 6660 1.639425
## 3240 1.458504
## 5965 1.478388
rnaheat will produce the heatmap with the top 100 differentially expressed genes between the NRAS wild-type and mutated populations
rnaheat(nras.mut, rna, nras.deg, "NRAS")
The significant pathways between the NRAS wild-type and mutant populations can be determined by rnagsva and rnagsvasig, and visualized with rnagsvaheat.
require(GSVAdata)
data(c2BroadSets)
nras.gsva <- rnagsva(nras.mut, rna)
## Estimating GSVA scores for 2917 gene sets.
## Computing observed enrichment scores
## Estimating ECDFs in rnaseq data with Poisson kernels
## Using parallel with 4 cores
##
|
| | 0%
nras.gsvasig <- rnagsvasig(nras.mut, nras.gsva)
head(nras.gsvasig)
## logFC
## REACTOME_RNA_POLYMERASE_I_TRANSCRIPTION_INITIATION 0.19259257
## CHANDRAN_METASTASIS_UP 0.09672968
## MATTIOLI_MGUS_VS_MULTIPLE_MYELOMA 0.16757530
## HOSHIDA_LIVER_CANCER_SUBCLASS_S2 0.14966950
## BIOCARTA_PITX2_PATHWAY 0.18446976
## YANG_BREAST_CANCER_ESR1_BULK_UP -0.15130505
## AveExpr t
## REACTOME_RNA_POLYMERASE_I_TRANSCRIPTION_INITIATION -0.014090117 5.404492
## CHANDRAN_METASTASIS_UP -0.009443104 5.241218
## MATTIOLI_MGUS_VS_MULTIPLE_MYELOMA -0.016254619 5.212805
## HOSHIDA_LIVER_CANCER_SUBCLASS_S2 -0.013801843 5.190258
## BIOCARTA_PITX2_PATHWAY -0.020002332 5.157885
## YANG_BREAST_CANCER_ESR1_BULK_UP -0.020198188 -5.059070
## P.Value
## REACTOME_RNA_POLYMERASE_I_TRANSCRIPTION_INITIATION 1.212959e-07
## CHANDRAN_METASTASIS_UP 2.781074e-07
## MATTIOLI_MGUS_VS_MULTIPLE_MYELOMA 3.206500e-07
## HOSHIDA_LIVER_CANCER_SUBCLASS_S2 3.588370e-07
## BIOCARTA_PITX2_PATHWAY 4.214687e-07
## YANG_BREAST_CANCER_ESR1_BULK_UP 6.852993e-07
## adj.P.Val B
## REACTOME_RNA_POLYMERASE_I_TRANSCRIPTION_INITIATION 0.0002458848 7.267885
## CHANDRAN_METASTASIS_UP 0.0002458848 6.498342
## MATTIOLI_MGUS_VS_MULTIPLE_MYELOMA 0.0002458848 6.366461
## HOSHIDA_LIVER_CANCER_SUBCLASS_S2 0.0002458848 6.262242
## BIOCARTA_PITX2_PATHWAY 0.0002458848 6.113276
## YANG_BREAST_CANCER_ESR1_BULK_UP 0.0003221519 5.663487
rnagsvaheat(nras.mut, nras.gsva, nras.gsvasig, "NRAS")
Reactome pathway analysis can also be performed on the NRAS wild-type and mutant populations with rnareact.
nras.react <- rnareact(nras.deg)
#dotplot
plotreact(nras.react, nras.deg, "dot", 15)
#pathway plot
plotreact(nras.react, nras.deg, "map", 15)
#cnet plot
plotreact(nras.react, nras.deg, "cnet", 5)
The mutations different between the NRAS wild-type and mutant populations can be determined with diffmut and visualized with plotlymut.
nras.mut2 <- diffmut(nras.mut, mut)
head(nras.mut2)
## Gene N.high N.hightotal N.low N.lowtotal p.value FDR
## 1: NRAS 96 96 0 246 2.779897e-75 4.252964e-71
## 2: BRAF 5 96 172 246 1.924594e-26 1.472218e-22
## 3: RAB3GAP1 9 96 1 246 4.775129e-05 2.287547e-01
## 4: MTTP 13 96 5 246 5.980906e-05 2.287547e-01
## 5: PKHD1L1 45 96 60 246 8.843756e-05 2.706013e-01
## 6: CLEC18B 8 96 2 246 8.020348e-04 1.000000e+00
plotlymut(nras.mut, mut, nras.mut2, "NRAS")
The copy number alterations betweeh the NRAS wild-type and mutant populations can be determined with diffcp and visualizyed with plotlycp.
nras.cp <- diffcp(nras.mut, cp)
head(nras.cp)
## Gene p.value.high p.value.low FDR.high FDR.low
## 1: C7orf34 2.619719e-07 0.8729025 0.0002024957 1
## 2: CASP2 2.619719e-07 0.8729025 0.0002024957 1
## 3: CLCN1 2.619719e-07 0.8729025 0.0002024957 1
## 4: EPHA1 2.619719e-07 0.8729025 0.0002024957 1
## 5: FAM131B 2.619719e-07 0.8729025 0.0002024957 1
## 6: GSTK1 2.619719e-07 0.8729025 0.0002024957 1
plotlycp(nras.mut, cp, nras.cp, "NRAS")
The differentially expressed proteins between the NRAS wild-type and NRAS mutant populations can be determined by rppadeg and visualized with rppaheat.
rppaheat(nras.mut, rppa, "NRAS")
nras.rppa <- rppadeg(nras.mut, rppa)
head(nras.rppa)
## data frame with 0 columns and 0 rows