1 Packages used in this tutorial

The major packages used in this tutorial as listed below. Vignettes with example are available in the links provided.

TCGAbiolinks: http://bioconductor.org/packages/TCGAbiolinks/ (Colaprico et al. 2015)
ELMER: http://bioconductor.org/packages/ELMER/ (Silva et al. 2018)

2 Creating a MultiAssayExperiment object

A MultiAssayExperiment is a R data structure created to store different assays (RNA-Seq, DNA methylation, ATAC-Sec, …) and samples metadata (age, gender, treatment information, etc.) in one single object. This data structure is used as to create the input for the ELMER analysis, which requires a MultiAssayExperiment containing DNA methylation and gene expression for all samples.

For more information about the MultiAssayExperiment you can check its Bioconductor page: http://bioconductor.org/packages/MultiAssayExperiment/.

We can use the TCGAbiolinks package (http://bioconductor.org/packages/TCGAbiolinks/) to retrieve TCGA DNA methylation and gene expression data from GDC data portal https://portal.gdc.cancer.gov/. TCGAbiolinks searches the data, download it and then transform it into either a matrix or a SummarizedExperiment (http://bioconductor.org/packages/SummarizedExperiment/), which is another Bioconductor data structure to handle a single assay and its samples metadata.

In this tutorial we will download data for 4 TCGA ESCA samples and create a MultiAssayExperiment. The code to downaload all samples aligned to the genome of reference hg38 is available in the next section.

2.1 Downloading DNA methylation array with hg19 annotation

library(TCGAbiolinks)
library(SummarizedExperiment)

# Define samples to be downloaded - to download all samples just remove the barcode argument
samples <- c("TCGA-2H-A9GF","TCGA-2H-A9GG","TCGA-2H-A9GH","TCGA-2H-A9GI")


# DNA methylation data aligned to hg19
query.met <- GDCquery(project =  "TCGA-ESCA",
                      legacy = TRUE,
                      data.category = "DNA methylation",
                      barcode = samples, # Filter by barcode
                      platform = "Illumina Human Methylation 450",
                      sample.type = "Primary solid Tumor")

GDCdownload(query.met)

methylation <- GDCprepare(query = query.met,
                          save = TRUE,
                          save.filename = paste0("TCGA-ESCA_450K_hg19.rda"),
                          summarizedExperiment = TRUE
)

## 
|                                               |  0%                      
|===========                                    | 25% ~9 s remaining       
|=======================                        | 50% ~4 s remaining       
|===================================            | 75% ~2 s remaining       
|===============================================|100% ~0 s remaining       
|===============================================|100%                      Completed after 6 s

# Checking object
methylation

## class: RangedSummarizedExperiment 
## dim: 485577 4 
## metadata(1): data_release
## assays(1): ''
## rownames(485577): cg00000029 cg00000108 ... rs966367 rs9839873
## rowData names(2): probeID Gene_Symbol
## colnames(4): TCGA-2H-A9GF-01A-11D-A37D-05
##   TCGA-2H-A9GG-01A-11D-A37D-05 TCGA-2H-A9GH-01A-11D-A37D-05
##   TCGA-2H-A9GI-01A-11D-A37D-05
## colData names(134): sample patient ...
##   subtype_GEA.CIN.Integrated.Cluster...MKL.KNN.4
##   subtype_GEA.CIN.Integrated.Cluster...MKL.KNN.7

# Checking DNA methylation beta-values  of the first 5 probes (rows)
as.data.frame(assay(methylation)[1:4,])

2.2 Downloading RNA-seq aligned against hg19

# Gene expression
query.exp <- GDCquery(project = "TCGA-ESCA",
                      data.category = "Gene expression",
                      data.type = "Gene expression quantification",
                      platform = "Illumina HiSeq", 
                      file.type  = "normalized_results",
                      experimental.strategy = "RNA-Seq",
                      barcode = samples,
                      legacy = TRUE)
GDCdownload(query.exp, method = "api", files.per.chunk = 10)
gene.expression <- GDCprepare(query.exp)

## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |================                                                 |  25%
  |                                                                       
  |================================                                 |  50%
  |                                                                       
  |=================================================                |  75%
  |                                                                       
  |=================================================================| 100%

# Checking object
gene.expression

## class: RangedSummarizedExperiment 
## dim: 21022 4 
## metadata(1): data_release
## assays(1): normalized_count
## rownames(21022): A1BG A2M ... TMED7 SLC25A5-AS1
## rowData names(3): gene_id entrezgene ensembl_gene_id
## colnames(4): TCGA-2H-A9GF-01A-11R-A37I-31
##   TCGA-2H-A9GG-01A-11R-A37I-31 TCGA-2H-A9GH-01A-11R-A37I-31
##   TCGA-2H-A9GI-01A-11R-A37I-31
## colData names(134): sample patient ...
##   subtype_GEA.CIN.Integrated.Cluster...MKL.KNN.4
##   subtype_GEA.CIN.Integrated.Cluster...MKL.KNN.7

# Checking expression values of the first 5 genes (rows)
as.data.frame(assay(gene.expression)[1:5,])

2.3 Creating a MultiAssayExperiment

To create a MultiAssayExperiment object, ELMER provides the function createMAE, which requires the DNA methylation data (either a matrix or a Summarized Experiment object) [function argument met] and Gene expression data (either a matrix or a Summarized Experiment object) [function argument exp].

For DNA methylation we only keep distal probes (at least \(2Kbp\) away from TSS) since it tries to infer for distal interactions regulating genes. ELMER provides the function get.feature.probe to retrieve those set of probes.

The DNA methylation data from the 450k platform has three types of probes cg (CpG loci), ch (non-CpG loci) and rs (SNP assay). The last type of probe can be used for sample identification and tracking and should be excluded for differential methylation analysis according to the ilumina manual (https://www.illumina.com/content/dam/illumina-support/documents/documentation/software_documentation/genomestudio/genomestudio-2011-1/genomestudio-methylation-v1-8-user-guide-11319130-b.pdf). Along with the filter based on the distance to TSS, probes with the rs prefix is also removed by get.feature.probe.

Finally, some probes should also be masked for other reasons presented by Zhou, Laird, and Shen (2017) and listed below (source: https://zwdzwd.github.io/InfiniumAnnotation).

MASK.mapping - whether the probe is masked for mapping reason. Probes retained should have high quality (>=40 on 0-60 scale) consistent (with designed MAPINFO) * mapping (for both in the case of type I) without INDELs.
MASK.typeINextBaseSwitch - whether the probe has a SNP in the extension base that causes a color channel switch from the official annotation (described as color-channel-switching, or CCS SNP in the reference). These probes should be processed differently than designed (by summing up both color channels instead of just the annotated color channel).
MASK.rmsk15 - whether the 15bp 3’-subsequence of the probe overlap with repeat masker, this MASK is NOT recommended.
MASK.sub25.copy, MASK.sub30.copy, MASK.sub35.copy and MASK.sub40.copy - whether the 25bp, 30bp, 35bp and 40bp 3’-subsequence of the probe is non-unique.
MASK.snp5.common - whether 5bp 3’-subsequence (including extension for typeII) overlap with any of the common SNPs from dbSNP (global MAF can be under 1%).
MASK.snp5.GMAF1p - whether 5bp 3’-subsequence (including extension for typeII) overlap with any of the SNPs with global MAF >1%. *MASK.extBase - probes masked for extension base inconsistent with specified color channel (type-I) or CpG (type-II) based on mapping.
MASK.general - recommended general purpose masking merged from “MASK.sub30.copy”, “MASK.mapping”, “MASK.extBase”, “MASK.typeINextBaseSwitch” and “MASK.snp5.GMAF1p”.

Thus, MASK.general is used by ELMER to remove the masked probes from the eavluated distal probes.

# 3) Create ELMER input
#    A MultiAssayExperiment with samples 
#    that has both RNA-seq and 450K DNA methylation (distal probes)
library(SummarizedExperiment)
library(ELMER)

# 1) retrieve distal probes
genome <- "hg19"
distal.probes <- get.feature.probe(feature = NULL,
                                   genome = genome,
                                   met.platform = "450K") 
ranges(distal.probes)

## IRanges object with 165935 ranges and 0 metadata columns:
##                  start       end     width
##              <integer> <integer> <integer>
##   cg00168193    790667    790668         2
##   cg08258224    800083    800084         2
##   cg13938959    834183    834184         2
##   cg12445832    834295    834296         2
##   cg23999112    834356    834357         2
##          ...       ...       ...       ...
##   cg14931215  21867702  21867703         2
##   cg00308367  21868532  21868533         2
##   cg00063477  22741795  22741796         2
##   cg01900066  22754881  22754882         2
##   cg25918849  23753332  23753333         2

# Number of distal probes
length(distal.probes)

## [1] 165935

For Gene expression data the row names accepted are Ensembl gene IDs, since our data has gene name as inout we will change the row names.

# 2) ELMER required ENSEMBL ID as rownames for gene expression
# for hg19, gene symbols|entre_id were the default default for RNA-seq
# so we need to rename our rows to the ensembl_gene_id which is 
# already in the rowRanges information
rowRanges(gene.expression)

## GRanges object with 21022 ranges and 3 metadata columns:
##                seqnames              ranges strand |      gene_id
##                   <Rle>           <IRanges>  <Rle> |  <character>
##           A1BG    chr19   58856544-58864865      - |         A1BG
##            A2M    chr12     9220260-9268825      - |          A2M
##           NAT1     chr8   18027986-18081198      + |         NAT1
##           NAT2     chr8   18248755-18258728      + |         NAT2
##       SERPINA3    chr14   95078714-95090392      + |     SERPINA3
##            ...      ...                 ...    ... .          ...
##            FTX     chrX   73183790-73513409      - |          FTX
##         TICAM2     chr5 114914339-114961876      - |       TICAM2
##   TMED7-TICAM2     chr5 114914339-114961858      - | TMED7-TICAM2
##          TMED7     chr5 114949205-114968689      - |        TMED7
##    SLC25A5-AS1     chrX 118599997-118603061      - |  SLC25A5-AS1
##                entrezgene ensembl_gene_id
##                 <numeric>     <character>
##           A1BG          1 ENSG00000121410
##            A2M          2 ENSG00000175899
##           NAT1          9 ENSG00000171428
##           NAT2         10 ENSG00000156006
##       SERPINA3         12 ENSG00000196136
##            ...        ...             ...
##            FTX  100302692 ENSG00000230590
##         TICAM2  100302736 ENSG00000243414
##   TMED7-TICAM2  100302736 ENSG00000251201
##          TMED7  100302736 ENSG00000134970
##    SLC25A5-AS1  100303728 ENSG00000224281
##   -------
##   seqinfo: 24 sequences from an unspecified genome; no seqlengths

rownames(gene.expression)  <- rowRanges(gene.expression)$ensembl_gene_id

Finally, we can use createMAE setting the genome of reference (“hg19” or “hg38”) used to add DNA methylation probes metadata and gene metadata information. To perform a correlation between DNA methylation and gene expression it is better to take the log2 of expression data (executed if the argument linearize.exp is TRUE). Also, if samples are from TCGA, ELMER will automatically pull samples information from TCGA, otherwise it should be provided by the user using colData and sampleMap arguments (tip: you can check the documentation of a function in R with the following command ?createMAE)

# Create MAE
mae <- createMAE(exp = gene.expression,
                 met = methylation,
                 met.platform = "450K",
                 genome = genome,
                 linearize.exp = TRUE, # takes log2(expression + 1)
                 filter.probes = distal.probes,
                 save = FALSE,
                 TCGA = TRUE)

You can check the MAE object using some accessor described in this cheatsheet.

mae

## A MultiAssayExperiment object of 2 listed
##  experiments with user-defined names and respective classes. 
##  Containing an ExperimentList class object of length 2: 
##  [1] DNA methylation: RangedSummarizedExperiment with 135158 rows and 4 columns 
##  [2] Gene expression: RangedSummarizedExperiment with 21022 rows and 4 columns 
## Features: 
##  experiments() - obtain the ExperimentList instance 
##  colData() - the primary/phenotype DataFrame 
##  sampleMap() - the sample availability DataFrame 
##  `$`, `[`, `[[` - extract colData columns, subset, or experiment 
##  *Format() - convert into a long or wide DataFrame 
##  assays() - convert ExperimentList to a SimpleList of matrices

# Since it is TCGA ELMER add autmaticalle the samples information
as.data.frame(colData(mae)[1:4,1:5])

# Available information
as.data.frame(colnames(colData(mae)))

The next sections we will provide the code to download the complete TCGA ESCA dataset.

3 Code to retrieve all TCGA-ESCA data

The code below downloaded all TCGA-ESCA data (DNA methylation and gene expression), but due to the long time to download the data it will not be run.

# 1) get all DNA methylation data aligned annotated with hg38 information
query.met <- GDCquery(project =  "TCGA-ESCA",
                      data.category = "DNA Methylation",
                      platform = "Illumina Human Methylation 450",
                      sample.type = "Primary solid Tumor")
GDCdownload(query.met)
methylation <- GDCprepare(query = query.met,
                          save = TRUE,
                          save.filename = "TCGA-ESCA_450K_hg38.rda"
)



# 1) get all RNA-seq data aligned against hg38
query.exp <- GDCquery(project = "TCGA-ESCA",
                  data.category = "Transcriptome Profiling",
                  data.type = "Gene Expression Quantification", 
                  workflow.type = "HTSeq - FPKM-UQ")
GDCdownload(query.exp, method = "api", files.per.chunk = 10)
gene.expression <- GDCprepare(query.exp, 
                              save = TRUE, 
                              save.filename = "TCGA-ESCA_450K_hg38.rda")
                              

# 3) Create ELMER input
#    A MultiAssayExperiment with samples 
#    that has both RNA-seq and 450K DNA methylation (distal probes)
genome <- "hg38"
distal.probes <- get.feature.probe(feature = NULL,
                                   genome = genome,
                                   met.platform = "450K") 

mae <- createMAE(exp = gene.expression,
                 met = methylation,
                 met.platform = "450K",
                 genome = genome,
                 linearize.exp = TRUE,
                 filter.probes = distal.probes,
                 save = TRUE,
                 save.filename = "TCGA_ESCA_MAE_distal_regions.rda",
                 TCGA = TRUE)

4 Session Information

sessionInfo()

## R version 3.6.0 (2019-04-26)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Mojave 10.14.5
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] ELMER_2.9.2                 ELMER.data_2.9.3           
##  [3] SummarizedExperiment_1.14.0 DelayedArray_0.10.0        
##  [5] BiocParallel_1.18.0         matrixStats_0.54.0         
##  [7] Biobase_2.44.0              GenomicRanges_1.36.0       
##  [9] GenomeInfoDb_1.20.0         IRanges_2.18.1             
## [11] S4Vectors_0.22.0            BiocGenerics_0.30.0        
## [13] TCGAbiolinks_2.13.3        
## 
## loaded via a namespace (and not attached):
##   [1] backports_1.1.4             circlize_0.4.6             
##   [3] Hmisc_4.2-0                 aroma.light_3.14.0         
##   [5] BiocFileCache_1.8.0         plyr_1.8.4                 
##   [7] selectr_0.4-1               ConsensusClusterPlus_1.48.0
##   [9] lazyeval_0.2.2              splines_3.6.0              
##  [11] ggplot2_3.2.0               sva_3.32.1                 
##  [13] digest_0.6.20               ensembldb_2.8.0            
##  [15] foreach_1.4.4               htmltools_0.3.6            
##  [17] checkmate_1.9.4             magrittr_1.5               
##  [19] memoise_1.1.0               BSgenome_1.52.0            
##  [21] cluster_2.1.0               doParallel_1.0.14          
##  [23] limma_3.40.2                ComplexHeatmap_2.0.0       
##  [25] Biostrings_2.52.0           readr_1.3.1                
##  [27] annotate_1.62.0             R.utils_2.9.0              
##  [29] askpass_1.1                 prettyunits_1.0.2          
##  [31] colorspace_1.4-1            blob_1.2.0                 
##  [33] rvest_0.3.4                 rappdirs_0.3.1             
##  [35] ggrepel_0.8.1               xfun_0.8                   
##  [37] dplyr_0.8.3                 crayon_1.3.4               
##  [39] RCurl_1.95-4.12             jsonlite_1.6               
##  [41] genefilter_1.66.0           zeallot_0.1.0              
##  [43] VariantAnnotation_1.30.1    survival_2.44-1.1          
##  [45] zoo_1.8-6                   iterators_1.0.10           
##  [47] glue_1.3.1                  survminer_0.4.4            
##  [49] gtable_0.3.0                zlibbioc_1.30.0            
##  [51] XVector_0.24.0              GetoptLong_0.1.7           
##  [53] shape_1.4.4                 scales_1.0.0               
##  [55] DESeq_1.36.0                DBI_1.0.0                  
##  [57] edgeR_3.26.5                ggthemes_4.2.0             
##  [59] Rcpp_1.0.1                  viridisLite_0.3.0          
##  [61] htmlTable_1.13.1            xtable_1.8-4               
##  [63] progress_1.2.2              cmprsk_2.2-8               
##  [65] clue_0.3-57                 foreign_0.8-71             
##  [67] bit_1.1-14                  matlab_1.0.2               
##  [69] km.ci_0.5-2                 Formula_1.2-3              
##  [71] htmlwidgets_1.3             httr_1.4.0                 
##  [73] RColorBrewer_1.1-2          acepack_1.4.1              
##  [75] reshape_0.8.8               pkgconfig_2.0.2            
##  [77] XML_3.98-1.20               R.methodsS3_1.7.1          
##  [79] nnet_7.3-12                 Gviz_1.28.0                
##  [81] dbplyr_1.4.2                locfit_1.5-9.1             
##  [83] tidyselect_0.2.5            rlang_0.4.0                
##  [85] AnnotationDbi_1.46.0        munsell_0.5.0              
##  [87] tools_3.6.0                 downloader_0.4             
##  [89] generics_0.0.2              RSQLite_2.1.1              
##  [91] broom_0.5.2                 evaluate_0.14              
##  [93] stringr_1.4.0               yaml_2.2.0                 
##  [95] knitr_1.23                  bit64_0.9-7                
##  [97] survMisc_0.5.5              purrr_0.3.2                
##  [99] AnnotationFilter_1.8.0      EDASeq_2.18.0              
## [101] nlme_3.1-140                R.oo_1.22.0                
## [103] xml2_1.2.0                  biomaRt_2.41.7             
## [105] rstudioapi_0.10             compiler_3.6.0             
## [107] plotly_4.9.0                curl_3.3                   
## [109] png_0.1-7                   ggsignif_0.5.0             
## [111] tibble_2.1.3                geneplotter_1.62.0         
## [113] stringi_1.4.3               GenomicFeatures_1.36.4     
## [115] lattice_0.20-38             ProtGenerics_1.16.0        
## [117] Matrix_1.2-17               KMsurv_0.1-5               
## [119] vctrs_0.2.0                 pillar_1.4.2               
## [121] GlobalOptions_0.1.0         data.table_1.12.2          
## [123] bitops_1.0-6                rtracklayer_1.44.0         
## [125] R6_2.4.0                    latticeExtra_0.6-28        
## [127] hwriter_1.3.2               ShortRead_1.42.0           
## [129] gridExtra_2.3               codetools_0.2-16           
## [131] dichromat_2.0-0             assertthat_0.2.1           
## [133] openssl_1.4.1               rjson_0.2.20               
## [135] GenomicAlignments_1.20.1    Rsamtools_2.0.0            
## [137] GenomeInfoDbData_1.2.1      MultiAssayExperiment_1.10.4
## [139] mgcv_1.8-28                 hms_0.5.0                  
## [141] rpart_4.1-15                grid_3.6.0                 
## [143] prettydoc_0.2.1             tidyr_0.8.3                
## [145] rmarkdown_1.13              biovizBase_1.32.0          
## [147] ggpubr_0.2.1                base64enc_0.1-3

References

Colaprico, Antonio, Tiago C Silva, Catharina Olsen, Luciano Garofano, Claudia Cava, Davide Garolini, Thais S Sabedot, et al. 2015. “TCGAbiolinks: An R/Bioconductor Package for Integrative Analysis of Tcga Data.” Nucleic Acids Research 44 (8): e71–e71.

Silva, Tiago C, Simon G Coetzee, Nicole Gull, Lijing Yao, Dennis J Hazelett, Houtan Noushmehr, De-Chen Lin, and Benjamin P Berman. 2018. “ELMER V. 2: An R/Bioconductor Package to Reconstruct Gene Regulatory Networks from Dna Methylation and Transcriptome Profiles.” Bioinformatics 35 (11): 1974–7.

Zhou, Wanding, Peter W Laird, and Hui Shen. 2017. “Comprehensive Characterization, Annotation and Innovative Use of Infinium Dna Methylation Beadchip Probes.” Nucleic Acids Research 45 (4): e22–e22.

Workshop for multi-omics analysis with ELMER - retrieving TCGA data

Benjamin P. Berman, Tiago C. Silva

2019-07-18