We will be using the RTCGAToolbox, an experimental package for analysis of data from The Cancer Genome Atlas.
The analisis in this workshop follows the RTCGAToolbox vignette, available online. See the RTCGAToolbox vignette.
The Toolbox uses limma and other bioconductor tools to do the analysis of the data. You can get more information and guidance for further analysis from the limma User’s Guide, and the bioconductor website.
The analysis of RNASeq data is similar to microarray analysis. I have given other workshops on the analysis of these kinds of data.
Ortiz-Zuazaga, Humberto (2014): Microarray Analysis With Bioconductor Workshop. figshare. doi:10.6084/m9.figshare.1251183 Retrieved 13:33, Nov 25, 2014 (GMT)
Using limma for microarray and RNA-Seq analysis. A workshop for the Research Design, Biostatistics and Clinical Research Ethics (DBE) key function of the PRCTRC, San Juan, PR, March 7, 2013. Handout
To install the R TCGA Toolbox you need to run these commands once:
source("http://bioconductor.org/biocLite.R")
biocLite("limma")
biocLite("RTCGAToolbox")
After that, you can load the toolbox with just one command.
library(RTCGAToolbox)
The toolbox provides functions to examine the TCGA data available at the Broad Institute:
getFirehoseDatasets()
## [1] "ACC" "BLCA" "BRCA" "CESC" "CHOL" "COADREAD"
## [7] "COAD" "DLBC" "ESCA" "FPPP" "GBMLGG" "GBM"
## [13] "HNSC" "KICH" "KIPAN" "KIRC" "KIRP" "LAML"
## [19] "LGG" "LIHC" "LUAD" "LUSC" "MESO" "OV"
## [25] "PAAD" "PCPG" "PRAD" "READ" "SARC" "SKCM"
## [31] "STAD" "STES" "TGCT" "THCA" "THYM" "UCEC"
## [37] "UCS" "UVM"
These data are subject to the conditions of use available at the site. Please make sure you comply with all the conditions of use.
Each study has been updated multiple times:
stddata = getFirehoseRunningDates()
stddata
## [1] "20151101" "20150821" "20150601" "20150402" "20150204" "20141206"
## [7] "20141017" "20140902" "20140715" "20140614" "20140518" "20140416"
## [13] "20140316" "20140215" "20140115" "20131210" "20131114" "20131010"
## [19] "20130923" "20130809" "20130715" "20130623" "20130606" "20130523"
## [25] "20130508" "20130421" "20130406" "20130326" "20130309" "20130222"
## [31] "20130203" "20130116" "20121221" "20121206" "20121114" "20121102"
## [37] "20121024" "20121020" "20121018" "20121004" "20120913" "20120825"
## [43] "20120804" "20120725" "20120707" "20120623" "20120606" "20120525"
## [49] "20120515" "20120425" "20120412" "20120321" "20120306" "20120217"
## [55] "20120124" "20120110" "20111230" "20111206" "20111128" "20111115"
## [61] "20111026"
Different experiments are updated on different dates.
gisticDate = getFirehoseAnalyzeDates(last=3)
gisticDate
## [1] "20150821" "20150402" "20141017"
We could obtain the data for breast cancer, including copy number variants, the clinical data, RNASeq and SNP with a single call. This would download the data and cache a local copy.
brcaData = getFirehoseData (dataset="BRCA", runDate="20150204",
gistic2_Date="20141017",
Clinic=TRUE, RNAseq_Gene=TRUE, mRNA_Array=FALSE, Mutation=TRUE)
For this workshop, we will work instead with a sample dataset included with the toolbox. It contains simulated clinical, mutation, copy number and RNASeq data for 800 genes in 80 participants.
data(RTCGASample)
## RTCGASample dataset is artificially created for function test.
## It isn't biologically meaninful and it has no relation with any cancer type.
## For real datasets, please use client function to get data from data portal.
RTCGASample
## TEST FirehoseData object
## Available data types:
## Clinical: A data frame, dim: 100 7
## RNASeqGene: A matrix withraw read counts or normalized data, dim: 800 80
## GISTIC: A FirehoseGISTIC object to store copy number data
## Mutations: A data.frame, dim: 2685 30
## To export data, you may use getData() function.
What’s in the data? There is clinical data on each participant.
clinicData <- getData(RTCGASample, type="Clinical")
## TEST FirehoseData object
## Available data types:
## Clinical: A data frame, dim: 100 7
## RNASeqGene: A matrix withraw read counts or normalized data, dim: 800 80
## GISTIC: A FirehoseGISTIC object to store copy number data
## Mutations: A data.frame, dim: 2685 30
## To export data, you may use getData() function.
head(clinicData)
## Composite.Element.REF yearstobirth vitalstatus daystodeath
## TEST.00.0026 value 53 0 NA
## TEST.00.0052 value 50 0 NA
## TEST.00.0088 value 56 0 NA
## TEST.00.0056 value 56 0 NA
## TEST.00.0023 value 56 0 NA
## TEST.00.0092 value 52 0 NA
## daystolastfollowup neoplasm.diseasestage pathology.T.stage
## TEST.00.0026 1183 stage iiia stage iiia
## TEST.00.0052 897 stage iib stage iib
## TEST.00.0088 1000 stage iib stage iib
## TEST.00.0056 1134 stage iia stage iia
## TEST.00.0023 794 stage iib stage iib
## TEST.00.0092 1104 stage iia stage iia
We also have raw RNASeq read data for some participants.
rnaData <- getData(RTCGASample, type="RNASeqGene")
## TEST FirehoseData object
## Available data types:
## Clinical: A data frame, dim: 100 7
## RNASeqGene: A matrix withraw read counts or normalized data, dim: 800 80
## GISTIC: A FirehoseGISTIC object to store copy number data
## Mutations: A data.frame, dim: 2685 30
## To export data, you may use getData() function.
rnaData[1:3,1:5]
## TEST-00-0011-11A-01A-0000-00 TEST-00-0094-11A-01A-0000-00
## LYZL1 169 158
## SEPHS1 0 0
## FLJ45983 54 0
## TEST-00-0004-11A-01A-0000-00 TEST-00-0006-11A-01A-0000-00
## LYZL1 171 158
## SEPHS1 0 0
## FLJ45983 89 0
## TEST-00-0073-11A-01A-0000-00
## LYZL1 166
## SEPHS1 0
## FLJ45983 1
The column names encode the disease status of the samples. We can separate the tumor samples from the control samples by using some R code. A large part of any analysis in R will be manipulating the data to extract information on the samples or the genes.
The limma users guide has example code to extract information from many different kinds of studies, and the U54 BEBiC or the HPCf helpdesk may be able to assist you in creating appropriate data analysis scripts.
Run differential expression analysis on tumor vs normal with sample data.
diffGeneExprs = getDiffExpressedGenes(dataObject=RTCGASample)
The real analysis could be run as follows
diffGeneExprs = getDiffExpressedGenes(dataObject=brcaData,DrawPlots=TRUE,
adj.method="BH",adj.pval=0.05,raw.pval=0.05,
logFC=2,hmTopUpN=10,hmTopDownN=10)
The diffGeneExprs object contains the results for each type of data present in our data object. The sample data only has RNASeq data (not microarray).
for(i in 1:length(diffGeneExprs)) {
writeLines(diffGeneExprs[[i]]@Dataset);
print(head(diffGeneExprs[[i]]@Toptable))
}
## RNASeq
## logFC AveExpr t P.Value adj.P.Val B
## TAP2 5.288573 1.743410 76.38279 7.496531e-76 4.760297e-73 150.82307
## GRTP1 6.187648 2.193494 48.25410 2.030875e-60 6.448029e-58 123.44071
## ENPP5 7.215676 2.707012 45.79069 1.110692e-58 2.350964e-56 120.11226
## APH1B 8.118533 3.158063 37.31744 5.796089e-52 6.134194e-50 106.20724
## INSR 8.055541 3.126521 32.82711 7.967149e-48 1.945823e-46 97.36010
## MINPP1 7.097777 2.647495 30.36322 2.431258e-45 3.508747e-44 91.95723
The clinical data contains the vital status (deceased=1 or unknown=0) and time in days to death or the most recent followup. This data can be used to create a survival plot for a particular gene.
head(clinicData)
## Composite.Element.REF yearstobirth vitalstatus daystodeath
## TEST.00.0026 value 53 0 NA
## TEST.00.0052 value 50 0 NA
## TEST.00.0088 value 56 0 NA
## TEST.00.0056 value 56 0 NA
## TEST.00.0023 value 56 0 NA
## TEST.00.0092 value 52 0 NA
## daystolastfollowup neoplasm.diseasestage pathology.T.stage
## TEST.00.0026 1183 stage iiia stage iiia
## TEST.00.0052 897 stage iib stage iib
## TEST.00.0088 1000 stage iib stage iib
## TEST.00.0056 1134 stage iia stage iia
## TEST.00.0023 794 stage iib stage iib
## TEST.00.0092 1104 stage iia stage iia
clinicData = clinicData[,3:5]
# Replace missing followup values with day of death
clinicData[is.na(clinicData[,3]),3] = clinicData[is.na(clinicData[,3]),2]
survData <- data.frame(Samples=rownames(clinicData),
Time=as.numeric(clinicData[,3]),
Censor=as.numeric(clinicData[,1]))
getSurvival(dataObject=RTCGASample,geneSymbols=c("FCGBP"),sampleTimeCensor=survData)