1 Packages used in this tutorial
The major packages used in this tutorial as listed below. Vignettes with example are available in the links provided.
- TCGAbiolinks: http://bioconductor.org/packages/TCGAbiolinks/ (Colaprico et al. 2015)
- ELMER: http://bioconductor.org/packages/ELMER/ (Silva et al. 2018)
2 Creating a MultiAssayExperiment object
A MultiAssayExperiment is a R data structure created to store different assays (RNA-Seq, DNA methylation, ATAC-Sec, …) and samples metadata (age, gender, treatment information, etc.) in one single object. This data structure is used as to create the input for the ELMER analysis, which requires a MultiAssayExperiment containing DNA methylation and gene expression for all samples.
For more information about the MultiAssayExperiment you can check its Bioconductor page: http://bioconductor.org/packages/MultiAssayExperiment/.
We can use the TCGAbiolinks package (http://bioconductor.org/packages/TCGAbiolinks/) to retrieve TCGA DNA methylation and gene expression data from GDC data portal https://portal.gdc.cancer.gov/. TCGAbiolinks searches the data, download it and then transform it into either a matrix or a SummarizedExperiment (http://bioconductor.org/packages/SummarizedExperiment/), which is another Bioconductor data structure to handle a single assay and its samples metadata.
In this tutorial we will download data for 4 TCGA ESCA samples and create a MultiAssayExperiment. The code to downaload all samples aligned to the genome of reference hg38 is available in the next section.
2.1 Downloading DNA methylation array with hg19 annotation
library(TCGAbiolinks)
library(SummarizedExperiment)
# Define samples to be downloaded - to download all samples just remove the barcode argument
samples <- c("TCGA-2H-A9GF","TCGA-2H-A9GG","TCGA-2H-A9GH","TCGA-2H-A9GI")
# DNA methylation data aligned to hg19
query.met <- GDCquery(project = "TCGA-ESCA",
legacy = TRUE,
data.category = "DNA methylation",
barcode = samples, # Filter by barcode
platform = "Illumina Human Methylation 450",
sample.type = "Primary solid Tumor")
GDCdownload(query.met)
methylation <- GDCprepare(query = query.met,
save = TRUE,
save.filename = paste0("TCGA-ESCA_450K_hg19.rda"),
summarizedExperiment = TRUE
)
##
| | 0%
|=========== | 25% ~9 s remaining
|======================= | 50% ~4 s remaining
|=================================== | 75% ~2 s remaining
|===============================================|100% ~0 s remaining
|===============================================|100% Completed after 6 s
## class: RangedSummarizedExperiment
## dim: 485577 4
## metadata(1): data_release
## assays(1): ''
## rownames(485577): cg00000029 cg00000108 ... rs966367 rs9839873
## rowData names(2): probeID Gene_Symbol
## colnames(4): TCGA-2H-A9GF-01A-11D-A37D-05
## TCGA-2H-A9GG-01A-11D-A37D-05 TCGA-2H-A9GH-01A-11D-A37D-05
## TCGA-2H-A9GI-01A-11D-A37D-05
## colData names(134): sample patient ...
## subtype_GEA.CIN.Integrated.Cluster...MKL.KNN.4
## subtype_GEA.CIN.Integrated.Cluster...MKL.KNN.7
# Checking DNA methylation beta-values of the first 5 probes (rows)
as.data.frame(assay(methylation)[1:4,])
2.2 Downloading RNA-seq aligned against hg19
# Gene expression
query.exp <- GDCquery(project = "TCGA-ESCA",
data.category = "Gene expression",
data.type = "Gene expression quantification",
platform = "Illumina HiSeq",
file.type = "normalized_results",
experimental.strategy = "RNA-Seq",
barcode = samples,
legacy = TRUE)
GDCdownload(query.exp, method = "api", files.per.chunk = 10)
gene.expression <- GDCprepare(query.exp)
##
|
| | 0%
|
|================ | 25%
|
|================================ | 50%
|
|================================================= | 75%
|
|=================================================================| 100%
## class: RangedSummarizedExperiment
## dim: 21022 4
## metadata(1): data_release
## assays(1): normalized_count
## rownames(21022): A1BG A2M ... TMED7 SLC25A5-AS1
## rowData names(3): gene_id entrezgene ensembl_gene_id
## colnames(4): TCGA-2H-A9GF-01A-11R-A37I-31
## TCGA-2H-A9GG-01A-11R-A37I-31 TCGA-2H-A9GH-01A-11R-A37I-31
## TCGA-2H-A9GI-01A-11R-A37I-31
## colData names(134): sample patient ...
## subtype_GEA.CIN.Integrated.Cluster...MKL.KNN.4
## subtype_GEA.CIN.Integrated.Cluster...MKL.KNN.7
# Checking expression values of the first 5 genes (rows)
as.data.frame(assay(gene.expression)[1:5,])
2.3 Creating a MultiAssayExperiment
To create a MultiAssayExperiment object, ELMER provides the function createMAE
, which requires the DNA methylation data (either a matrix or a Summarized Experiment object) [function argument met
] and Gene expression data (either a matrix or a Summarized Experiment object) [function argument exp
].
For DNA methylation we only keep distal probes (at least \(2Kbp\) away from TSS) since it tries to infer for distal interactions regulating genes. ELMER provides the function get.feature.probe
to retrieve those set of probes.
The DNA methylation data from the 450k platform has three types of probes cg (CpG loci), ch (non-CpG loci) and rs (SNP assay). The last type of probe can be used for sample identification and tracking and should be excluded for differential methylation analysis according to the ilumina manual (https://www.illumina.com/content/dam/illumina-support/documents/documentation/software_documentation/genomestudio/genomestudio-2011-1/genomestudio-methylation-v1-8-user-guide-11319130-b.pdf). Along with the filter based on the distance to TSS, probes with the rs prefix is also removed by get.feature.probe
.
Finally, some probes should also be masked for other reasons presented by Zhou, Laird, and Shen (2017) and listed below (source: https://zwdzwd.github.io/InfiniumAnnotation).
- MASK.mapping - whether the probe is masked for mapping reason. Probes retained should have high quality (>=40 on 0-60 scale) consistent (with designed MAPINFO) * mapping (for both in the case of type I) without INDELs.
- MASK.typeINextBaseSwitch - whether the probe has a SNP in the extension base that causes a color channel switch from the official annotation (described as color-channel-switching, or CCS SNP in the reference). These probes should be processed differently than designed (by summing up both color channels instead of just the annotated color channel).
- MASK.rmsk15 - whether the 15bp 3’-subsequence of the probe overlap with repeat masker, this MASK is NOT recommended.
- MASK.sub25.copy, MASK.sub30.copy, MASK.sub35.copy and MASK.sub40.copy - whether the 25bp, 30bp, 35bp and 40bp 3’-subsequence of the probe is non-unique.
- MASK.snp5.common - whether 5bp 3’-subsequence (including extension for typeII) overlap with any of the common SNPs from dbSNP (global MAF can be under 1%).
- MASK.snp5.GMAF1p - whether 5bp 3’-subsequence (including extension for typeII) overlap with any of the SNPs with global MAF >1%. *MASK.extBase - probes masked for extension base inconsistent with specified color channel (type-I) or CpG (type-II) based on mapping.
- MASK.general - recommended general purpose masking merged from “MASK.sub30.copy”, “MASK.mapping”, “MASK.extBase”, “MASK.typeINextBaseSwitch” and “MASK.snp5.GMAF1p”.
Thus, MASK.general
is used by ELMER to remove the masked probes from the eavluated distal probes.
# 3) Create ELMER input
# A MultiAssayExperiment with samples
# that has both RNA-seq and 450K DNA methylation (distal probes)
library(SummarizedExperiment)
library(ELMER)
# 1) retrieve distal probes
genome <- "hg19"
distal.probes <- get.feature.probe(feature = NULL,
genome = genome,
met.platform = "450K")
ranges(distal.probes)
## IRanges object with 165935 ranges and 0 metadata columns:
## start end width
## <integer> <integer> <integer>
## cg00168193 790667 790668 2
## cg08258224 800083 800084 2
## cg13938959 834183 834184 2
## cg12445832 834295 834296 2
## cg23999112 834356 834357 2
## ... ... ... ...
## cg14931215 21867702 21867703 2
## cg00308367 21868532 21868533 2
## cg00063477 22741795 22741796 2
## cg01900066 22754881 22754882 2
## cg25918849 23753332 23753333 2
## [1] 165935
For Gene expression data the row names accepted are Ensembl gene IDs, since our data has gene name as inout we will change the row names.
# 2) ELMER required ENSEMBL ID as rownames for gene expression
# for hg19, gene symbols|entre_id were the default default for RNA-seq
# so we need to rename our rows to the ensembl_gene_id which is
# already in the rowRanges information
rowRanges(gene.expression)
## GRanges object with 21022 ranges and 3 metadata columns:
## seqnames ranges strand | gene_id
## <Rle> <IRanges> <Rle> | <character>
## A1BG chr19 58856544-58864865 - | A1BG
## A2M chr12 9220260-9268825 - | A2M
## NAT1 chr8 18027986-18081198 + | NAT1
## NAT2 chr8 18248755-18258728 + | NAT2
## SERPINA3 chr14 95078714-95090392 + | SERPINA3
## ... ... ... ... . ...
## FTX chrX 73183790-73513409 - | FTX
## TICAM2 chr5 114914339-114961876 - | TICAM2
## TMED7-TICAM2 chr5 114914339-114961858 - | TMED7-TICAM2
## TMED7 chr5 114949205-114968689 - | TMED7
## SLC25A5-AS1 chrX 118599997-118603061 - | SLC25A5-AS1
## entrezgene ensembl_gene_id
## <numeric> <character>
## A1BG 1 ENSG00000121410
## A2M 2 ENSG00000175899
## NAT1 9 ENSG00000171428
## NAT2 10 ENSG00000156006
## SERPINA3 12 ENSG00000196136
## ... ... ...
## FTX 100302692 ENSG00000230590
## TICAM2 100302736 ENSG00000243414
## TMED7-TICAM2 100302736 ENSG00000251201
## TMED7 100302736 ENSG00000134970
## SLC25A5-AS1 100303728 ENSG00000224281
## -------
## seqinfo: 24 sequences from an unspecified genome; no seqlengths
Finally, we can use createMAE
setting the genome of reference (“hg19” or “hg38”) used to add DNA methylation probes metadata and gene metadata information. To perform a correlation between DNA methylation and gene expression it is better to take the log2 of expression data (executed if the argument linearize.exp
is TRUE). Also, if samples are from TCGA
, ELMER will automatically pull samples information from TCGA, otherwise it should be provided by the user using colData
and sampleMap
arguments (tip: you can check the documentation of a function in R with the following command ?createMAE
)
# Create MAE
mae <- createMAE(exp = gene.expression,
met = methylation,
met.platform = "450K",
genome = genome,
linearize.exp = TRUE, # takes log2(expression + 1)
filter.probes = distal.probes,
save = FALSE,
TCGA = TRUE)
You can check the MAE object using some accessor described in this cheatsheet.
## A MultiAssayExperiment object of 2 listed
## experiments with user-defined names and respective classes.
## Containing an ExperimentList class object of length 2:
## [1] DNA methylation: RangedSummarizedExperiment with 135158 rows and 4 columns
## [2] Gene expression: RangedSummarizedExperiment with 21022 rows and 4 columns
## Features:
## experiments() - obtain the ExperimentList instance
## colData() - the primary/phenotype DataFrame
## sampleMap() - the sample availability DataFrame
## `$`, `[`, `[[` - extract colData columns, subset, or experiment
## *Format() - convert into a long or wide DataFrame
## assays() - convert ExperimentList to a SimpleList of matrices
# Since it is TCGA ELMER add autmaticalle the samples information
as.data.frame(colData(mae)[1:4,1:5])
The next sections we will provide the code to download the complete TCGA ESCA dataset.
3 Code to retrieve all TCGA-ESCA data
The code below downloaded all TCGA-ESCA data (DNA methylation and gene expression), but due to the long time to download the data it will not be run.
# 1) get all DNA methylation data aligned annotated with hg38 information
query.met <- GDCquery(project = "TCGA-ESCA",
data.category = "DNA Methylation",
platform = "Illumina Human Methylation 450",
sample.type = "Primary solid Tumor")
GDCdownload(query.met)
methylation <- GDCprepare(query = query.met,
save = TRUE,
save.filename = "TCGA-ESCA_450K_hg38.rda"
)
# 1) get all RNA-seq data aligned against hg38
query.exp <- GDCquery(project = "TCGA-ESCA",
data.category = "Transcriptome Profiling",
data.type = "Gene Expression Quantification",
workflow.type = "HTSeq - FPKM-UQ")
GDCdownload(query.exp, method = "api", files.per.chunk = 10)
gene.expression <- GDCprepare(query.exp,
save = TRUE,
save.filename = "TCGA-ESCA_450K_hg38.rda")
# 3) Create ELMER input
# A MultiAssayExperiment with samples
# that has both RNA-seq and 450K DNA methylation (distal probes)
genome <- "hg38"
distal.probes <- get.feature.probe(feature = NULL,
genome = genome,
met.platform = "450K")
mae <- createMAE(exp = gene.expression,
met = methylation,
met.platform = "450K",
genome = genome,
linearize.exp = TRUE,
filter.probes = distal.probes,
save = TRUE,
save.filename = "TCGA_ESCA_MAE_distal_regions.rda",
TCGA = TRUE)
4 Session Information
## R version 3.6.0 (2019-04-26)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Mojave 10.14.5
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] parallel stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] ELMER_2.9.2 ELMER.data_2.9.3
## [3] SummarizedExperiment_1.14.0 DelayedArray_0.10.0
## [5] BiocParallel_1.18.0 matrixStats_0.54.0
## [7] Biobase_2.44.0 GenomicRanges_1.36.0
## [9] GenomeInfoDb_1.20.0 IRanges_2.18.1
## [11] S4Vectors_0.22.0 BiocGenerics_0.30.0
## [13] TCGAbiolinks_2.13.3
##
## loaded via a namespace (and not attached):
## [1] backports_1.1.4 circlize_0.4.6
## [3] Hmisc_4.2-0 aroma.light_3.14.0
## [5] BiocFileCache_1.8.0 plyr_1.8.4
## [7] selectr_0.4-1 ConsensusClusterPlus_1.48.0
## [9] lazyeval_0.2.2 splines_3.6.0
## [11] ggplot2_3.2.0 sva_3.32.1
## [13] digest_0.6.20 ensembldb_2.8.0
## [15] foreach_1.4.4 htmltools_0.3.6
## [17] checkmate_1.9.4 magrittr_1.5
## [19] memoise_1.1.0 BSgenome_1.52.0
## [21] cluster_2.1.0 doParallel_1.0.14
## [23] limma_3.40.2 ComplexHeatmap_2.0.0
## [25] Biostrings_2.52.0 readr_1.3.1
## [27] annotate_1.62.0 R.utils_2.9.0
## [29] askpass_1.1 prettyunits_1.0.2
## [31] colorspace_1.4-1 blob_1.2.0
## [33] rvest_0.3.4 rappdirs_0.3.1
## [35] ggrepel_0.8.1 xfun_0.8
## [37] dplyr_0.8.3 crayon_1.3.4
## [39] RCurl_1.95-4.12 jsonlite_1.6
## [41] genefilter_1.66.0 zeallot_0.1.0
## [43] VariantAnnotation_1.30.1 survival_2.44-1.1
## [45] zoo_1.8-6 iterators_1.0.10
## [47] glue_1.3.1 survminer_0.4.4
## [49] gtable_0.3.0 zlibbioc_1.30.0
## [51] XVector_0.24.0 GetoptLong_0.1.7
## [53] shape_1.4.4 scales_1.0.0
## [55] DESeq_1.36.0 DBI_1.0.0
## [57] edgeR_3.26.5 ggthemes_4.2.0
## [59] Rcpp_1.0.1 viridisLite_0.3.0
## [61] htmlTable_1.13.1 xtable_1.8-4
## [63] progress_1.2.2 cmprsk_2.2-8
## [65] clue_0.3-57 foreign_0.8-71
## [67] bit_1.1-14 matlab_1.0.2
## [69] km.ci_0.5-2 Formula_1.2-3
## [71] htmlwidgets_1.3 httr_1.4.0
## [73] RColorBrewer_1.1-2 acepack_1.4.1
## [75] reshape_0.8.8 pkgconfig_2.0.2
## [77] XML_3.98-1.20 R.methodsS3_1.7.1
## [79] nnet_7.3-12 Gviz_1.28.0
## [81] dbplyr_1.4.2 locfit_1.5-9.1
## [83] tidyselect_0.2.5 rlang_0.4.0
## [85] AnnotationDbi_1.46.0 munsell_0.5.0
## [87] tools_3.6.0 downloader_0.4
## [89] generics_0.0.2 RSQLite_2.1.1
## [91] broom_0.5.2 evaluate_0.14
## [93] stringr_1.4.0 yaml_2.2.0
## [95] knitr_1.23 bit64_0.9-7
## [97] survMisc_0.5.5 purrr_0.3.2
## [99] AnnotationFilter_1.8.0 EDASeq_2.18.0
## [101] nlme_3.1-140 R.oo_1.22.0
## [103] xml2_1.2.0 biomaRt_2.41.7
## [105] rstudioapi_0.10 compiler_3.6.0
## [107] plotly_4.9.0 curl_3.3
## [109] png_0.1-7 ggsignif_0.5.0
## [111] tibble_2.1.3 geneplotter_1.62.0
## [113] stringi_1.4.3 GenomicFeatures_1.36.4
## [115] lattice_0.20-38 ProtGenerics_1.16.0
## [117] Matrix_1.2-17 KMsurv_0.1-5
## [119] vctrs_0.2.0 pillar_1.4.2
## [121] GlobalOptions_0.1.0 data.table_1.12.2
## [123] bitops_1.0-6 rtracklayer_1.44.0
## [125] R6_2.4.0 latticeExtra_0.6-28
## [127] hwriter_1.3.2 ShortRead_1.42.0
## [129] gridExtra_2.3 codetools_0.2-16
## [131] dichromat_2.0-0 assertthat_0.2.1
## [133] openssl_1.4.1 rjson_0.2.20
## [135] GenomicAlignments_1.20.1 Rsamtools_2.0.0
## [137] GenomeInfoDbData_1.2.1 MultiAssayExperiment_1.10.4
## [139] mgcv_1.8-28 hms_0.5.0
## [141] rpart_4.1-15 grid_3.6.0
## [143] prettydoc_0.2.1 tidyr_0.8.3
## [145] rmarkdown_1.13 biovizBase_1.32.0
## [147] ggpubr_0.2.1 base64enc_0.1-3
References
Colaprico, Antonio, Tiago C Silva, Catharina Olsen, Luciano Garofano, Claudia Cava, Davide Garolini, Thais S Sabedot, et al. 2015. “TCGAbiolinks: An R/Bioconductor Package for Integrative Analysis of Tcga Data.” Nucleic Acids Research 44 (8): e71–e71.
Silva, Tiago C, Simon G Coetzee, Nicole Gull, Lijing Yao, Dennis J Hazelett, Houtan Noushmehr, De-Chen Lin, and Benjamin P Berman. 2018. “ELMER V. 2: An R/Bioconductor Package to Reconstruct Gene Regulatory Networks from Dna Methylation and Transcriptome Profiles.” Bioinformatics 35 (11): 1974–7.
Zhou, Wanding, Peter W Laird, and Hui Shen. 2017. “Comprehensive Characterization, Annotation and Innovative Use of Infinium Dna Methylation Beadchip Probes.” Nucleic Acids Research 45 (4): e22–e22.