Introduction

We will be using the RTCGAToolbox, an experimental package for analysis of data from The Cancer Genome Atlas.

The analisis in this workshop follows the RTCGAToolbox vignette, available online. See the RTCGAToolbox vignette.

The Toolbox uses limma and other bioconductor tools to do the analysis of the data. You can get more information and guidance for further analysis from the limma User’s Guide, and the bioconductor website.

The analysis of RNASeq data is similar to microarray analysis. I have given other workshops on the analysis of these kinds of data.

  1. Ortiz-Zuazaga, Humberto (2014): Microarray Analysis With Bioconductor Workshop. figshare. doi:10.6084/m9.figshare.1251183 Retrieved 13:33, Nov 25, 2014 (GMT)

  2. Using limma for microarray and RNA-Seq analysis. A workshop for the Research Design, Biostatistics and Clinical Research Ethics (DBE) key function of the PRCTRC, San Juan, PR, March 7, 2013. Handout

Instalation

To install the R TCGA Toolbox you need to run these commands once:

source("http://bioconductor.org/biocLite.R")
biocLite("limma")
biocLite("RTCGAToolbox")

After that, you can load the toolbox with just one command.

library(RTCGAToolbox)

Obtaining TCGA Data

The toolbox provides functions to examine the TCGA data available at the Broad Institute:

getFirehoseDatasets()
##  [1] "ACC"      "BLCA"     "BRCA"     "CESC"     "CHOL"     "COADREAD"
##  [7] "COAD"     "DLBC"     "ESCA"     "FPPP"     "GBMLGG"   "GBM"     
## [13] "HNSC"     "KICH"     "KIPAN"    "KIRC"     "KIRP"     "LAML"    
## [19] "LGG"      "LIHC"     "LUAD"     "LUSC"     "MESO"     "OV"      
## [25] "PAAD"     "PCPG"     "PRAD"     "READ"     "SARC"     "SKCM"    
## [31] "STAD"     "STES"     "TGCT"     "THCA"     "THYM"     "UCEC"    
## [37] "UCS"      "UVM"

These data are subject to the conditions of use available at the site. Please make sure you comply with all the conditions of use.

Each study has been updated multiple times:

stddata = getFirehoseRunningDates()
stddata
##  [1] "20151101" "20150821" "20150601" "20150402" "20150204" "20141206"
##  [7] "20141017" "20140902" "20140715" "20140614" "20140518" "20140416"
## [13] "20140316" "20140215" "20140115" "20131210" "20131114" "20131010"
## [19] "20130923" "20130809" "20130715" "20130623" "20130606" "20130523"
## [25] "20130508" "20130421" "20130406" "20130326" "20130309" "20130222"
## [31] "20130203" "20130116" "20121221" "20121206" "20121114" "20121102"
## [37] "20121024" "20121020" "20121018" "20121004" "20120913" "20120825"
## [43] "20120804" "20120725" "20120707" "20120623" "20120606" "20120525"
## [49] "20120515" "20120425" "20120412" "20120321" "20120306" "20120217"
## [55] "20120124" "20120110" "20111230" "20111206" "20111128" "20111115"
## [61] "20111026"

Different experiments are updated on different dates.

gisticDate = getFirehoseAnalyzeDates(last=3)
gisticDate
## [1] "20150821" "20150402" "20141017"

We could obtain the data for breast cancer, including copy number variants, the clinical data, RNASeq and SNP with a single call. This would download the data and cache a local copy.

brcaData = getFirehoseData (dataset="BRCA", runDate="20150204",
                            gistic2_Date="20141017",
               Clinic=TRUE, RNAseq_Gene=TRUE, mRNA_Array=FALSE, Mutation=TRUE)

For this workshop, we will work instead with a sample dataset included with the toolbox. It contains simulated clinical, mutation, copy number and RNASeq data for 800 genes in 80 participants.

data(RTCGASample)
## RTCGASample dataset is artificially created for function test.
## It isn't biologically meaninful and it has no relation with any cancer type.
## For real datasets, please use client function to get data from data portal.
RTCGASample
## TEST FirehoseData object
## Available data types:
## Clinical: A data frame, dim: 100 7
## RNASeqGene: A matrix withraw read counts or normalized data, dim: 800    80
## GISTIC: A FirehoseGISTIC object to store copy number data
## Mutations: A data.frame, dim: 2685   30
## To export data, you may use getData() function.

Examining the TCGA data

What’s in the data? There is clinical data on each participant.

clinicData <- getData(RTCGASample, type="Clinical")
## TEST FirehoseData object
## Available data types:
## Clinical: A data frame, dim: 100 7
## RNASeqGene: A matrix withraw read counts or normalized data, dim: 800    80
## GISTIC: A FirehoseGISTIC object to store copy number data
## Mutations: A data.frame, dim: 2685   30
## To export data, you may use getData() function.
head(clinicData)
##              Composite.Element.REF yearstobirth vitalstatus daystodeath
## TEST.00.0026                 value           53           0          NA
## TEST.00.0052                 value           50           0          NA
## TEST.00.0088                 value           56           0          NA
## TEST.00.0056                 value           56           0          NA
## TEST.00.0023                 value           56           0          NA
## TEST.00.0092                 value           52           0          NA
##              daystolastfollowup neoplasm.diseasestage pathology.T.stage
## TEST.00.0026               1183            stage iiia        stage iiia
## TEST.00.0052                897             stage iib         stage iib
## TEST.00.0088               1000             stage iib         stage iib
## TEST.00.0056               1134             stage iia         stage iia
## TEST.00.0023                794             stage iib         stage iib
## TEST.00.0092               1104             stage iia         stage iia

We also have raw RNASeq read data for some participants.

rnaData <- getData(RTCGASample, type="RNASeqGene")
## TEST FirehoseData object
## Available data types:
## Clinical: A data frame, dim: 100 7
## RNASeqGene: A matrix withraw read counts or normalized data, dim: 800    80
## GISTIC: A FirehoseGISTIC object to store copy number data
## Mutations: A data.frame, dim: 2685   30
## To export data, you may use getData() function.
rnaData[1:3,1:5]
##          TEST-00-0011-11A-01A-0000-00 TEST-00-0094-11A-01A-0000-00
## LYZL1                             169                          158
## SEPHS1                              0                            0
## FLJ45983                           54                            0
##          TEST-00-0004-11A-01A-0000-00 TEST-00-0006-11A-01A-0000-00
## LYZL1                             171                          158
## SEPHS1                              0                            0
## FLJ45983                           89                            0
##          TEST-00-0073-11A-01A-0000-00
## LYZL1                             166
## SEPHS1                              0
## FLJ45983                            1

The column names encode the disease status of the samples. We can separate the tumor samples from the control samples by using some R code. A large part of any analysis in R will be manipulating the data to extract information on the samples or the genes.

The limma users guide has example code to extract information from many different kinds of studies, and the U54 BEBiC or the HPCf helpdesk may be able to assist you in creating appropriate data analysis scripts.

Differential gene expression analysis

Run differential expression analysis on tumor vs normal with sample data.

diffGeneExprs = getDiffExpressedGenes(dataObject=RTCGASample)

The real analysis could be run as follows

diffGeneExprs = getDiffExpressedGenes(dataObject=brcaData,DrawPlots=TRUE,
                                   adj.method="BH",adj.pval=0.05,raw.pval=0.05,
                                   logFC=2,hmTopUpN=10,hmTopDownN=10)

Reporting the results

The diffGeneExprs object contains the results for each type of data present in our data object. The sample data only has RNASeq data (not microarray).

for(i in 1:length(diffGeneExprs)) {
   writeLines(diffGeneExprs[[i]]@Dataset);
   print(head(diffGeneExprs[[i]]@Toptable))
}
## RNASeq
##           logFC  AveExpr        t      P.Value    adj.P.Val         B
## TAP2   5.288573 1.743410 76.38279 7.496531e-76 4.760297e-73 150.82307
## GRTP1  6.187648 2.193494 48.25410 2.030875e-60 6.448029e-58 123.44071
## ENPP5  7.215676 2.707012 45.79069 1.110692e-58 2.350964e-56 120.11226
## APH1B  8.118533 3.158063 37.31744 5.796089e-52 6.134194e-50 106.20724
## INSR   8.055541 3.126521 32.82711 7.967149e-48 1.945823e-46  97.36010
## MINPP1 7.097777 2.647495 30.36322 2.431258e-45 3.508747e-44  91.95723

Survival Plots

The clinical data contains the vital status (deceased=1 or unknown=0) and time in days to death or the most recent followup. This data can be used to create a survival plot for a particular gene.

head(clinicData)
##              Composite.Element.REF yearstobirth vitalstatus daystodeath
## TEST.00.0026                 value           53           0          NA
## TEST.00.0052                 value           50           0          NA
## TEST.00.0088                 value           56           0          NA
## TEST.00.0056                 value           56           0          NA
## TEST.00.0023                 value           56           0          NA
## TEST.00.0092                 value           52           0          NA
##              daystolastfollowup neoplasm.diseasestage pathology.T.stage
## TEST.00.0026               1183            stage iiia        stage iiia
## TEST.00.0052                897             stage iib         stage iib
## TEST.00.0088               1000             stage iib         stage iib
## TEST.00.0056               1134             stage iia         stage iia
## TEST.00.0023                794             stage iib         stage iib
## TEST.00.0092               1104             stage iia         stage iia
clinicData = clinicData[,3:5]
# Replace missing followup values with day of death
clinicData[is.na(clinicData[,3]),3] = clinicData[is.na(clinicData[,3]),2]
survData <- data.frame(Samples=rownames(clinicData),
                      Time=as.numeric(clinicData[,3]),
                    Censor=as.numeric(clinicData[,1]))
getSurvival(dataObject=RTCGASample,geneSymbols=c("FCGBP"),sampleTimeCensor=survData)