TCGAbiolinks can access the National Cancer Institute (NCI) Genomic Data Commons (GDC) through its
GDC Application Programming Interface (API) to search, download, and prepare relevant data for analysis in R.
You may install the stable version from Bioconductor, or the development version using devtools::install_github("BioinformaticsFMRP/TCGAbiolinks").
Please use GitHub issues if you want to file bug reports or feature requests.
library(TCGAbiolinks)
library(SummarizedExperiment)
library(dplyr)
library(DT)
There are two available sources to download GDC data using TCGAbiolinks:
- GDC Legacy Archive: provides access to an unmodified copy of data that was previously stored in CGHub and in the TCGA Data Portal hosted by the TCGA Data Coordinating Center (DCC). These data use GRCh37 (hg19) and GRCh36 (hg18) as references.
- GDC harmonized database: the data were harmonized against GRCh38 (hg38) using the GDC Bioinformatics Pipelines, which provide methods for the standardization of biospecimen and clinical data.
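The choice between the two sources comes down to the reference genome you need. As a minimal sketch, the hypothetical helper below (`legacy_for_build` is not part of TCGAbiolinks) maps a genome build to the value one would pass as the `legacy` argument of `GDCquery`, following the description above.

```r
# Hypothetical helper, not a TCGAbiolinks function: returns the value to use
# for the `legacy` argument given the reference genome build you want.
legacy_for_build <- function(build) {
  build <- tolower(build)
  if (build %in% c("hg19", "grch37", "hg18", "grch36")) {
    TRUE   # GDC Legacy Archive
  } else if (build %in% c("hg38", "grch38")) {
    FALSE  # GDC harmonized database
  } else {
    stop("Unknown genome build: ", build)
  }
}

legacy_for_build("hg19")
legacy_for_build("GRCh38")
```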
In this example we will access the harmonized database (legacy = FALSE) and search for all DNA methylation data for recurrent glioblastoma multiforme (GBM) and low grade glioma (LGG) samples.
query <- GDCquery(project = c("TCGA-GBM", "TCGA-LGG"),
data.category = "DNA Methylation",
legacy = FALSE,
platform = c("Illumina Human Methylation 450"),
sample.type = "Recurrent Solid Tumor"
)
datatable(query$results[[1]],
filter = 'top',
options = list(scrollX = TRUE, keys = TRUE, pageLength = 5),
rownames = FALSE)
This example shows how the user can search for breast cancer Raw Sequencing Data ("Controlled") and verify the names of the files and the barcodes associated with them.
query <- GDCquery(project = c("TCGA-BRCA"),
data.category = "Raw Sequencing Data",
sample.type = "Primary solid Tumor")
# Only first 100 to make render faster
datatable(query$results[[1]][1:100,c("file_name","cases")],
filter = 'top',
options = list(scrollX = TRUE, keys = TRUE, pageLength = 5),
rownames = FALSE)
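Indexing with `1:100` assumes the query returned at least 100 files; with fewer, the extra rows come back filled with `NA`. A minimal sketch with a made-up results table (the file names and barcodes below are invented for illustration) shows how `head()` avoids this:

```r
# Toy stand-in for query$results[[1]]; the values are invented for illustration
results <- data.frame(
  file_name = c("a.bam", "b.bam", "c.bam"),
  cases     = c("TCGA-XX-0001-01A", "TCGA-XX-0002-01A", "TCGA-XX-0003-01A"),
  stringsAsFactors = FALSE
)

# Asking for rows 1 to 5 of a 3-row table pads the result with NA rows
padded <- results[1:5, c("file_name", "cases")]

# head() returns at most 5 rows and never pads
trimmed <- head(results[, c("file_name", "cases")], 5)

nrow(padded)
nrow(trimmed)
```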
There are two methods to download GDC data using TCGAbiolinks:
- client: this method creates a MANIFEST file and downloads the data using the GDC Data Transfer Tool. It is more reliable, but it might be slower than the api method.
- api: this method uses the GDC Application Programming Interface (API) to download the data. It creates a MANIFEST file, and the downloaded data are compressed into a tar.gz file. If the files are large or numerous, the tar.gz file becomes very large and the download is more likely to fail. To solve that, we created the chunks.per.download argument, which splits the files into small chunks. For example, if chunks.per.download is equal to 10, only 10 files are downloaded inside each tar.gz.
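To illustrate the idea behind chunks.per.download, here is a minimal base R sketch that splits a list of files into groups of at most 10. It only mimics the grouping; it does not reproduce how TCGAbiolinks implements it internally, and the file names are made up.

```r
# 25 made-up file names standing in for the files matched by a query
files <- sprintf("file_%03d.txt", 1:25)
chunk_size <- 10  # plays the role of chunks.per.download

# Assign each file to a chunk: files 1-10 -> chunk 1, 11-20 -> chunk 2, ...
chunk_id <- ceiling(seq_along(files) / chunk_size)
chunks <- split(files, chunk_id)

# Number of files that would go into each tar.gz
lengths(chunks)
```

With 25 files and a chunk size of 10, this yields three groups, the last one holding the remaining 5 files.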
A SummarizedExperiment object has three main matrices that can be accessed using the SummarizedExperiment package:
- colData(data): stores sample information. TCGAbiolinks will add indexed clinical data and subtype information from marker TCGA papers.
- assay(data): stores molecular data.
- rowRanges(data): stores metadata about the features, including their genomic ranges.
In this example we will download gene expression data from the legacy database (data aligned against the hg19 reference genome) using the GDC api method, and we will show the object data and metadata.
query <- GDCquery(project = "TCGA-GBM",
data.category = "Gene expression",
data.type = "Gene expression quantification",
platform = "Illumina HiSeq",
file.type = "normalized_results",
experimental.strategy = "RNA-Seq",
barcode = c("TCGA-14-0736-02A-01R-2005-01", "TCGA-06-0211-02A-02R-2005-01"),
legacy = TRUE)
GDCdownload(query, method = "api", chunks.per.download = 10)
data <- GDCprepare(query)
# Gene expression aligned against hg19.
datatable(as.data.frame(colData(data)),
options = list(scrollX = TRUE, keys = TRUE, pageLength = 5),
rownames = FALSE)
# Only first 100 to make render faster
datatable(assay(data)[1:100,],
options = list(scrollX = TRUE, keys = TRUE, pageLength = 5),
rownames = TRUE)
rowRanges(data)
## GRanges object with 21022 ranges and 3 metadata columns:
## seqnames ranges strand | gene_id
## <Rle> <IRanges> <Rle> | <character>
## A1BG chr19 [58856544, 58864865] - | A1BG
## A2M chr12 [ 9220260, 9268825] - | A2M
## NAT1 chr8 [18027986, 18081198] + | NAT1
## NAT2 chr8 [18248755, 18258728] + | NAT2
## RP11-986E7.7 chr14 [95058395, 95090983] + | RP11-986E7.7
## ... ... ... ... . ...
## FTX chrX [ 73183790, 73513409] - | FTX
## TMED7-TICAM2 chr5 [114914339, 114961858] - | TMED7-TICAM2
## TMED7 chr5 [114949205, 114968689] - | TMED7
## TICAM2 chr5 [114914339, 114961876] - | TICAM2
## SLC25A5-AS1 chrX [118599997, 118603061] - | SLC25A5-AS1
## entrezgene ensembl_gene_id
## <numeric> <character>
## A1BG 1 ENSG00000121410
## A2M 2 ENSG00000175899
## NAT1 9 ENSG00000171428
## NAT2 10 ENSG00000156006
## RP11-986E7.7 12 ENSG00000273259
## ... ... ...
## FTX 100302692 ENSG00000230590
## TMED7-TICAM2 100302736 ENSG00000251201
## TMED7 100302736 ENSG00000134970
## TICAM2 100302736 ENSG00000243414
## SLC25A5-AS1 100303728 ENSG00000224281
## -------
## seqinfo: 24 sequences from an unspecified genome; no seqlengths
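To make the three accessors concrete without downloading anything, the sketch below builds a tiny SummarizedExperiment by hand. All gene names, coordinates, and values are invented for illustration; only the constructor and accessor functions come from the SummarizedExperiment package and its dependencies.

```r
library(SummarizedExperiment)  # also attaches GenomicRanges, IRanges, S4Vectors

# Toy expression matrix: 3 genes (rows) x 2 samples (columns); values are made up
counts <- matrix(c(10, 0, 5, 20, 3, 8), nrow = 3,
                 dimnames = list(c("geneA", "geneB", "geneC"),
                                 c("sample1", "sample2")))

# Genomic ranges for the features; names must match the assay row names
gene_ranges <- GRanges(seqnames = c("chr1", "chr1", "chr2"),
                       ranges = IRanges(start = c(100, 500, 50),
                                        end = c(200, 900, 150)),
                       strand = c("+", "-", "+"))
names(gene_ranges) <- rownames(counts)

# Sample annotation; row names must match the assay column names
sample_info <- DataFrame(condition = c("tumor", "normal"),
                         row.names = colnames(counts))

toy <- SummarizedExperiment(assays = list(counts = counts),
                            rowRanges = gene_ranges,
                            colData = sample_info)

colData(toy)    # sample information
assay(toy)      # molecular data
rowRanges(toy)  # feature metadata with genomic ranges
```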
In this example we will download gene expression quantification from the harmonized database (data aligned against the hg38 reference genome) using the GDC Data Transfer Tool. We will also show the object data and metadata.
# Gene expression aligned against hg38
query <- GDCquery(project = "TCGA-GBM",
data.category = "Transcriptome Profiling",
data.type = "Gene Expression Quantification",
workflow.type = "HTSeq - Counts",
barcode = c("TCGA-14-0736-02A-01R-2005-01", "TCGA-06-0211-02A-02R-2005-01"))
GDCdownload(query, method = "client")
data <- GDCprepare(query)
datatable(as.data.frame(colData(data)),
options = list(scrollX = TRUE, keys = TRUE, pageLength = 5),
rownames = FALSE)
datatable(assay(data)[1:100,],
options = list(scrollX = TRUE, keys = TRUE, pageLength = 5),
rownames = TRUE)
datatable(as.data.frame(values(data)),
options = list(scrollX = TRUE, keys = TRUE, pageLength = 5),
rownames = TRUE)
In the GDC database, clinical data can be retrieved from two sources: the indexed clinical data and the clinical XML files.
There are two main differences:
In this example we will fetch clinical indexed data.
clinical <- GDCquery_clinic(project = "TCGA-LUAD", type = "clinical")
datatable(clinical, filter = 'top',
options = list(scrollX = TRUE, keys = TRUE, pageLength = 5),
rownames = FALSE)
In this example we will fetch clinical data directly from the clinical XML files.
query <- GDCquery(project = "TCGA-COAD",
data.category = "Clinical",
barcode = c("TCGA-RU-A8FL","TCGA-AA-3972"))
GDCdownload(query)
clinical <- GDCprepare_clinic(query, clinical.info = "patient")
datatable(clinical, options = list(scrollX = TRUE, keys = TRUE), rownames = FALSE)
clinical.drug <- GDCprepare_clinic(query, clinical.info = "drug")
datatable(clinical.drug, options = list(scrollX = TRUE, keys = TRUE), rownames = FALSE)
clinical.radiation <- GDCprepare_clinic(query, clinical.info = "radiation")
datatable(clinical.radiation, options = list(scrollX = TRUE, keys = TRUE), rownames = FALSE)
clinical.admin <- GDCprepare_clinic(query, clinical.info = "admin")
datatable(clinical.admin, options = list(scrollX = TRUE, keys = TRUE), rownames = FALSE)
Some inconsistencies have been found in the indexed clinical data and are being investigated by the GDC team. These inconsistencies are:
# Get XML files and parse them
clin.query <- GDCquery(project = "TCGA-READ", data.category = "Clinical", barcode = "TCGA-F5-6702")
GDCdownload(clin.query)
clinical.patient <- GDCprepare_clinic(clin.query, clinical.info = "patient")
clinical.patient.followup <- GDCprepare_clinic(clin.query, clinical.info = "follow_up")
# Get indexed data
clinical.index <- GDCquery_clinic("TCGA-READ")
select(clinical.patient,vital_status,days_to_death,days_to_last_followup) %>% datatable
select(clinical.patient.followup, vital_status,days_to_death,days_to_last_followup) %>% datatable
# Vital status should be the same in the follow up table
filter(clinical.index,submitter_id == "TCGA-F5-6702") %>% select(vital_status,days_to_death,days_to_last_follow_up) %>% datatable
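A comparison like the one above can also be done programmatically. The sketch below uses two small made-up tables (patient IDs and values are invented) in place of the parsed XML data and the indexed data, and flags patients whose vital status disagrees between them. The case-insensitive comparison is an assumption, in case the two sources capitalize values differently.

```r
# Made-up stand-ins for the XML-derived and indexed clinical tables
xml_patient <- data.frame(
  submitter_id = c("P1", "P2", "P3"),
  vital_status = c("Alive", "Alive", "Dead"),
  stringsAsFactors = FALSE
)
indexed <- data.frame(
  submitter_id = c("P1", "P2", "P3"),
  vital_status = c("alive", "dead", "dead"),
  stringsAsFactors = FALSE
)

# Join on patient ID, keeping one vital_status column per source
both <- merge(xml_patient, indexed, by = "submitter_id",
              suffixes = c(".xml", ".indexed"))

# Flag rows where the two sources disagree (ignoring case)
both$mismatch <- tolower(both$vital_status.xml) != tolower(both$vital_status.indexed)
both[both$mismatch, ]
```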
# Get XML files and parse them
recurrent.samples <- GDCquery(project = "TCGA-LIHC",
data.category = "Transcriptome Profiling",
data.type = "Gene Expression Quantification",
workflow.type = "HTSeq - Counts",
sample.type = "Recurrent Solid Tumor")$results[[1]] %>% select(cases)
recurrent.patients <- substr(recurrent.samples$cases,1,12)
clin.query <- GDCquery(project = "TCGA-LIHC", data.category = "Clinical", barcode = recurrent.patients)
GDCdownload(clin.query)
clinical.patient <- GDCprepare_clinic(clin.query, clinical.info = "patient")
# Get indexed data
GDCquery_clinic("TCGA-LIHC") %>% filter(submitter_id %in% recurrent.patients) %>%
select(progression_or_recurrence,days_to_recurrence,tumor_grade) %>% datatable
# XML data
clinical.patient %>% select(bcr_patient_barcode,neoplasm_histologic_grade) %>% datatable
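The `substr(..., 1, 12)` call above works because the first 12 characters of a TCGA barcode identify the patient. As a small sketch, the same approach can pull out the two-digit sample type code (characters 14 and 15), where 01 denotes a primary solid tumor and 02 a recurrent solid tumor. The barcodes are the ones used earlier in this document.

```r
barcodes <- c("TCGA-14-0736-02A-01R-2005-01",
              "TCGA-06-0211-02A-02R-2005-01")

# Characters 1-12: project, tissue source site, and participant (the patient ID)
patient_id <- substr(barcodes, 1, 12)

# Characters 14-15: sample type code
sample_code <- substr(barcodes, 14, 15)

data.frame(barcode = barcodes, patient = patient_id, sample_code = sample_code)
```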
This example will download MAF (Mutation Annotation Format) files for the variant calling pipeline muse. Pipeline options are: muse, varscan2, somaticsniper, mutect. For more information, please see the GDC docs.
acc.maf <- GDCquery_Maf("ACC", pipelines = "muse")
# Only first 100 to make render faster
datatable(acc.maf[1:100,],
filter = 'top',
options = list(scrollX = TRUE, keys = TRUE, pageLength = 5),
rownames = FALSE)
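Once a MAF table is loaded, a common first step is to tabulate mutations per gene and per variant class. The sketch below does this on a tiny invented table; it assumes the standard MAF column names `Hugo_Symbol` and `Variant_Classification`, and the rows are made up rather than taken from the ACC data.

```r
# Invented rows using standard MAF column names
toy_maf <- data.frame(
  Hugo_Symbol = c("TP53", "TP53", "CTNNB1", "MEN1", "TP53"),
  Variant_Classification = c("Missense_Mutation", "Nonsense_Mutation",
                             "Missense_Mutation", "Frame_Shift_Del",
                             "Missense_Mutation"),
  stringsAsFactors = FALSE
)

# Mutations per gene, most frequently mutated first
gene_counts <- sort(table(toy_maf$Hugo_Symbol), decreasing = TRUE)

# Mutations per variant class
class_counts <- table(toy_maf$Variant_Classification)

gene_counts
class_counts
```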
devtools::session_info('TCGAbiolinks')
## setting value
## version R Under development (unstable) (2017-01-23 r72020)
## system x86_64, linux-gnu
## ui X11
## language (EN)
## collate en_US.UTF-8
## tz America/Sao_Paulo
## date 2017-02-04
##
## package * version date
## affy 1.52.0 2016-10-18
## affyio 1.44.0 2016-10-18
## ALL 1.16.0 2016-10-20
## annotate 1.52.1 2016-12-23
## AnnotationDbi 1.36.1 2017-01-24
## ape 4.0 2016-12-01
## aroma.light 3.4.0 2016-10-18
## assertthat 0.1 2013-12-06
## BH 1.62.0-1 2016-11-19
## Biobase * 2.34.0 2016-10-18
## BiocGenerics * 0.20.0 2016-10-18
## BiocInstaller 1.24.0 2016-10-18
## BiocParallel 1.8.1 2016-11-01
## biomaRt 2.30.0 2016-10-18
## Biostrings 2.42.1 2017-01-24
## bitops 1.0-6 2013-08-17
## c3net 1.1.1 2012-07-23
## caTools 1.17.1 2014-09-10
## circlize 0.3.9 2016-09-26
## class 7.3-14 2015-08-30
## cluster 2.0.5 2016-10-08
## clusterProfiler 3.2.11 2017-01-24
## codetools 0.2-15 2016-10-05
## colorspace 1.3-2 2016-12-14
## ComplexHeatmap 1.12.0 2016-10-14
## ConsensusClusterPlus 1.38.0 2016-10-18
## curl 2.3 2016-11-24
## data.table 1.10.4 2017-02-01
## DBI 0.5-1 2016-09-10
## dendextend 1.4.0 2017-01-21
## DEoptimR 1.0-8 2016-11-19
## DESeq 1.26.0 2016-10-18
## dichromat 2.0-0 2013-01-24
## digest 0.6.12 2017-01-27
## diptest 0.75-7 2015-06-08
## dnet 1.0.10 2017-01-27
## DO.db 2.9 2017-01-24
## doParallel 1.0.10 2015-10-14
## DOSE 3.0.10 2017-01-24
## downloader 0.4 2015-07-09
## dplyr * 0.5.0 2016-06-24
## EDASeq 2.8.0 2016-10-18
## edgeR 3.16.5 2016-12-23
## evaluate 0.10 2016-10-11
## exactRankTests 0.8-28 2015-02-20
## fastmatch 1.1-0 2017-01-28
## fgsea 1.0.2 2016-12-23
## flexmix 2.3-13 2015-01-17
## foreach 1.4.3 2015-10-13
## fpc 2.1-10 2015-08-14
## futile.logger 1.4.3 2016-07-10
## futile.options 1.0.0 2010-04-06
## gdata 2.17.0 2015-07-04
## genefilter 1.56.0 2016-10-18
## geneplotter 1.52.0 2016-10-18
## GenomeInfoDb * 1.10.2 2016-12-31
## GenomicAlignments 1.10.0 2017-01-24
## GenomicFeatures 1.26.2 2016-12-23
## GenomicRanges * 1.26.2 2017-01-24
## GetoptLong 0.1.5 2016-09-26
## ggplot2 2.2.1 2016-12-30
## ggpubr 0.1.1 2016-12-05
## ggrepel 0.6.5 2016-11-24
## ggsci 2.0 2016-11-21
## ggthemes 3.3.0 2016-11-24
## GlobalOptions 0.0.10 2016-04-17
## GO.db 3.4.0 2017-01-24
## GOSemSim 2.0.4 2017-01-24
## gplots 3.0.1 2016-03-30
## graph 1.52.0 2016-10-18
## gridBase 0.4-7 2014-02-24
## gridExtra 2.2.1 2016-02-29
## gtable 0.2.0 2016-02-26
## gtools 3.5.0 2015-05-29
## hexbin 1.27.1 2015-08-19
## highr 0.6 2016-05-09
## hms 0.3 2016-11-22
## httr 1.2.1 2016-07-03
## hwriter 1.3.2 2014-09-10
## igraph 1.0.1 2015-06-26
## infotheo 1.2.0 2014-07-26
## IRanges * 2.8.1 2017-01-24
## irlba 2.1.2 2016-09-21
## iterators 1.0.8 2015-10-13
## jsonlite 1.2 2016-12-31
## KEGGgraph 1.32.0 2016-10-18
## KEGGREST 1.14.0 2016-10-18
## kernlab 0.9-25 2016-10-03
## KernSmooth 2.23-15 2015-06-29
## knitr 1.15.8 2017-01-31
## labeling 0.3 2014-08-23
## lambda.r 1.1.9 2016-07-10
## lattice 0.20-34 2016-09-06
## latticeExtra 0.6-28 2016-02-09
## lazyeval 0.2.0 2016-06-12
## limma 3.30.9 2017-01-27
## locfit 1.5-9.1 2013-04-20
## magrittr 1.5 2014-11-22
## markdown 0.7.7 2015-04-22
## MASS 7.3-45 2016-04-21
## matlab 1.0.2 2014-06-24
## Matrix 1.2-8 2017-01-20
## matrixStats 0.51.0 2016-10-09
## maxstat 0.7-24 2016-04-06
## mclust 5.2.2 2017-01-22
## memoise 1.0.0 2016-01-29
## mime 0.5 2016-07-07
## minet 3.32.0 2017-01-24
## modeltools 0.2-21 2013-09-02
## munsell 0.4.3 2016-02-13
## mvtnorm 1.0-5 2016-02-02
## nlme 3.1-130 2017-01-24
## NMF 0.20.6 2015-05-26
## nnet 7.3-12 2016-02-02
## openssl 0.9.6 2016-12-31
## org.Hs.eg.db 3.4.0 2016-10-18
## parmigene 1.0.2 2012-07-23
## pathview 1.14.0 2017-01-24
## pkgmaker 0.22 2014-05-14
## plogr 0.1-1 2016-09-24
## plyr 1.8.4 2016-06-08
## png 0.1-7 2013-12-03
## prabclus 2.2-6 2015-01-14
## preprocessCore 1.36.0 2016-10-18
## qvalue 2.6.0 2016-10-18
## R.methodsS3 1.7.1 2016-02-16
## R.oo 1.21.0 2016-11-01
## R.utils 2.5.0 2016-11-07
## R6 2.2.0 2016-10-05
## RColorBrewer 1.1-2 2014-12-07
## Rcpp 0.12.9.1 2017-01-24
## RCurl 1.95-4.8 2016-03-01
## readr 1.0.0 2016-08-03
## registry 0.3 2015-07-08
## reshape2 1.4.2 2016-10-22
## Rgraphviz 2.18.0 2016-10-18
## rjson 0.2.15 2014-11-03
## rngtools 1.2.4 2014-03-06
## robustbase 0.92-7 2016-12-09
## Rsamtools 1.26.1 2016-11-01
## RSQLite 1.1-2 2017-01-08
## rtracklayer 1.34.1 2017-01-24
## rvest 0.3.2 2016-06-17
## S4Vectors * 0.12.1 2017-01-24
## scales 0.4.1 2016-11-09
## selectr 0.3-1 2016-12-19
## shape 1.4.2 2014-11-05
## ShortRead 1.32.0 2016-10-18
## snow 0.4-2 2016-10-14
## stringi 1.1.2 2016-10-01
## stringr 1.1.0 2016-08-19
## SummarizedExperiment * 1.4.0 2016-10-18
## supraHex 1.12.0 2016-10-18
## survival 2.40-1 2016-10-30
## survminer 0.2.4 2016-12-11
## TCGAbiolinks * 2.3.16 2017-02-01
## tibble 1.2 2016-08-26
## tidyr 0.6.1 2017-01-10
## trimcluster 0.1-2 2012-10-29
## viridis 0.3.4 2016-03-12
## whisker 0.3-2 2013-04-28
## XML 3.98-1.5 2016-11-10
## xml2 1.1.1 2017-01-24
## xtable 1.8-2 2016-02-05
## XVector 0.14.0 2017-01-24
## yaml 2.1.14 2016-11-12
## zlibbioc 1.20.0 2016-10-18
## source
## Bioconductor
## Bioconductor
## Bioconductor
## Bioconductor
## cran (@1.36.1)
## CRAN (R 3.3.2)
## Bioconductor
## CRAN (R 3.2.2)
## CRAN (R 3.3.2)
## Bioconductor
## Bioconductor
## Bioconductor
## Bioconductor
## Bioconductor
## Bioconductor
## CRAN (R 3.2.2)
## CRAN (R 3.4.0)
## CRAN (R 3.2.2)
## CRAN (R 3.4.0)
## CRAN (R 3.4.0)
## CRAN (R 3.4.0)
## Bioconductor
## CRAN (R 3.4.0)
## CRAN (R 3.3.2)
## cran (@1.12.0)
## Bioconductor
## CRAN (R 3.3.2)
## CRAN (R 3.4.0)
## CRAN (R 3.3.0)
## cran (@1.4.0)
## CRAN (R 3.3.2)
## Bioconductor
## CRAN (R 3.2.2)
## cran (@0.6.12)
## CRAN (R 3.3.0)
## cran (@1.0.10)
## Bioconductor
## CRAN (R 3.2.2)
## cran (@3.0.10)
## CRAN (R 3.2.2)
## CRAN (R 3.3.0)
## Bioconductor
## Bioconductor
## CRAN (R 3.3.0)
## cran (@0.8-28)
## CRAN (R 3.4.0)
## Bioconductor
## CRAN (R 3.3.0)
## CRAN (R 3.2.2)
## CRAN (R 3.3.0)
## CRAN (R 3.3.0)
## CRAN (R 3.2.2)
## CRAN (R 3.2.2)
## Bioconductor
## Bioconductor
## Bioconductor
## Bioconductor
## Bioconductor
## Bioconductor
## CRAN (R 3.3.0)
## CRAN (R 3.3.2)
## cran (@0.1.1)
## CRAN (R 3.3.2)
## cran (@2.0)
## CRAN (R 3.4.0)
## CRAN (R 3.3.0)
## Bioconductor
## cran (@2.0.4)
## CRAN (R 3.2.4)
## Bioconductor
## CRAN (R 3.2.2)
## CRAN (R 3.2.3)
## CRAN (R 3.2.3)
## CRAN (R 3.2.2)
## CRAN (R 3.2.2)
## CRAN (R 3.3.0)
## CRAN (R 3.4.0)
## CRAN (R 3.3.0)
## CRAN (R 3.2.2)
## CRAN (R 3.2.2)
## CRAN (R 3.4.0)
## Bioconductor
## CRAN (R 3.3.0)
## CRAN (R 3.2.2)
## CRAN (R 3.3.2)
## Bioconductor
## Bioconductor
## CRAN (R 3.3.0)
## CRAN (R 3.4.0)
## Github (yihui/knitr@b936c1e)
## CRAN (R 3.2.2)
## CRAN (R 3.3.0)
## CRAN (R 3.4.0)
## CRAN (R 3.2.3)
## CRAN (R 3.3.0)
## Bioconductor
## CRAN (R 3.2.2)
## CRAN (R 3.2.2)
## CRAN (R 3.2.2)
## CRAN (R 3.4.0)
## CRAN (R 3.3.0)
## CRAN (R 3.4.0)
## CRAN (R 3.3.0)
## cran (@0.7-24)
## CRAN (R 3.4.0)
## CRAN (R 3.2.3)
## CRAN (R 3.3.0)
## Bioconductor
## CRAN (R 3.2.2)
## CRAN (R 3.2.3)
## CRAN (R 3.2.3)
## CRAN (R 3.4.0)
## CRAN (R 3.2.2)
## CRAN (R 3.4.0)
## CRAN (R 3.3.2)
## Bioconductor
## CRAN (R 3.4.0)
## Bioconductor
## CRAN (R 3.2.2)
## CRAN (R 3.3.2)
## CRAN (R 3.3.0)
## CRAN (R 3.2.2)
## CRAN (R 3.3.0)
## Bioconductor
## Bioconductor
## CRAN (R 3.2.3)
## CRAN (R 3.3.1)
## CRAN (R 3.3.2)
## CRAN (R 3.3.0)
## CRAN (R 3.2.2)
## Github (RcppCore/Rcpp@5a99a86)
## CRAN (R 3.2.3)
## CRAN (R 3.4.0)
## CRAN (R 3.2.2)
## CRAN (R 3.3.1)
## Bioconductor
## CRAN (R 3.4.0)
## CRAN (R 3.2.2)
## CRAN (R 3.3.2)
## Bioconductor
## cran (@1.1-2)
## Bioconductor
## CRAN (R 3.3.0)
## Bioconductor
## CRAN (R 3.3.2)
## CRAN (R 3.3.2)
## CRAN (R 3.3.0)
## Bioconductor
## CRAN (R 3.3.0)
## CRAN (R 3.3.0)
## CRAN (R 3.4.0)
## Bioconductor
## Bioconductor
## CRAN (R 3.4.0)
## cran (@0.2.4)
## Github (BioinformaticsFMRP/TCGAbiolinks@7cc7bb0)
## CRAN (R 3.3.0)
## cran (@0.6.1)
## CRAN (R 3.3.0)
## CRAN (R 3.3.2)
## CRAN (R 3.2.2)
## CRAN (R 3.3.2)
## CRAN (R 3.4.0)
## CRAN (R 3.2.3)
## Bioconductor
## CRAN (R 3.3.2)
## Bioconductor