1 Instructor names and contact information

Marcel Ramos (City University of New York, New York, NY, USA; Roswell Park Comprehensive Cancer Center, Buffalo, NY; marcel.ramos@roswellpark.org)
Ludwig Geistlinger (City University of New York, New York, NY, USA)
Levi Waldron (City University of New York, New York, NY, USA)

2 Workshop Description

This workshop demonstrates how two companion packages, curatedTCGAData and TCGAutils, make it easier to work with TCGA data. Both are built on the MultiAssayExperiment class, which makes the management of multiple assays easier and more efficient. The workshop also covers relevant data classes such as RaggedExperiment, SummarizedExperiment, and RangedSummarizedExperiment, which provide efficient and powerful operations on copy number, mutation, variant, and expression data, including data where each specimen is described by a different set of genomic ranges.

There is a built version of this workshop available at http://rpubs.com/mramos/curatedTCGAWorkshop. The source is available at https://github.com/waldronlab/curatedTCGAWorkshop.

Presentation slides are available at the link below:

browseURL("https://tinyurl.com/curatedTCGAWorkshop")

2.1 Pre-requisites

Basic knowledge of R syntax
Familiarity with the GRanges and SummarizedExperiment classes
Familiarity with ’omics data types including copy number and gene expression

2.2 Workshop Participation

Participants will have a chance to build a MultiAssayExperiment object from scratch, and will also work with more complex objects provided by the curatedTCGAData package.

2.3 R/Bioconductor packages used

To install the workshop dependencies, assuming you have already installed Bioconductor:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

## Bioconductor 3.8 Stable Release
BiocManager::install(version = "3.8")

BiocManager::install(
    c("org.Hs.eg.db", "mirbase.db", "EnsDb.Hsapiens.v86")
)

BiocManager::install(
    "waldronlab/curatedTCGAWorkshop",
    dependencies = TRUE,
    build_vignettes = TRUE
)

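Note that the first BiocManager::install() call with package names installs annotation data packages (org.Hs.eg.db, mirbase.db, and EnsDb.Hsapiens.v86) that are not installed automatically with the workshop package.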

Here is a list of Bioconductor packages required for this workshop:

library(curatedTCGAData)
library(MultiAssayExperiment)
library(SummarizedExperiment)
library(GenomicRanges)
library(RaggedExperiment)
library(GenomicDataCommons)
library(TCGAutils)
library(UpSetR)
library(mirbase.db)
library(EnsDb.Hsapiens.v86)
library(org.Hs.eg.db)
library(readxl)
library(dplyr)
library(kableExtra)

2.4 Time outline

2h total

Activity	Time
Overview of key packages	30 min
curatedTCGAData multi-assay dataset	30 min
TCGAutils functionality	30 min
Wrap-up and questions	~ 30 min

3 Workshop goals and objectives

3.1 Learning goals

identify advantages of providing integrative data in an analysis-ready platform
gain an overview of common data classes available in Bioconductor
gain familiarity with available functionality in TCGAutils for the management and coordination of TCGA data

3.2 Learning objectives

use curatedTCGAData to create on-the-fly TCGA MultiAssayExperiment datasets
create a MultiAssayExperiment for TCGA or other multi’omics data
explore functionality available in TCGAutils with curatedTCGAData objects

4 Overview of key packages

4.1 `curatedTCGAData`

Many tools exist for accessing and downloading The Cancer Genome Atlas (TCGA) data. These include but are not limited to RTCGAToolbox, GenomicDataCommons (package and website), TCGAbiolinks, cBioPortal, and Broad GDAC Firehose. These tools encompass a spectrum of strengths in ease-of-use, integration, and completeness of data. Few tools provide an integrative and user-friendly representation of TCGA data in a widely used analysis platform such as Bioconductor.

The curatedTCGAData experiment data package provides on-the-fly construction of TCGA datasets for 33 different cancer types from the Broad GDAC Firehose pipeline. Generally, it provides data using build hg19. curatedTCGAData facilitates access and integration of TCGA data by providing multi-’omics data objects using the MultiAssayExperiment data class. Where other platforms provide fragmented datasets, curatedTCGAData ensures that all data provided is matched and accounted for within the phenotypic metadata.

A list of available cancer types can be obtained from TCGAutils::diseaseCodes:

data("diseaseCodes")
knitr::kable(diseaseCodes, align = "l", escape = FALSE,
    caption = "List of available cancer types from curatedTCGAData") %>%
    kableExtra::kable_styling(
        bootstrap_options = c("hover", "striped", "responsive"),
        full_width = FALSE
    )

Table 1: List of available cancer types from curatedTCGAData
Study.Abbreviation	Available	SubtypeData	Study.Name
ACC	Yes	Yes	Adrenocortical carcinoma
BLCA	Yes	Yes	Bladder Urothelial Carcinoma
BRCA	Yes	Yes	Breast invasive carcinoma
CESC	Yes	No	Cervical squamous cell carcinoma and endocervical adenocarcinoma
CHOL	Yes	No	Cholangiocarcinoma
CNTL	No	No	Controls
COAD	Yes	Yes	Colon adenocarcinoma
DLBC	Yes	No	Lymphoid Neoplasm Diffuse Large B-cell Lymphoma
ESCA	Yes	No	Esophageal carcinoma
FPPP	No	No	FFPE Pilot Phase II
GBM	Yes	Yes	Glioblastoma multiforme
HNSC	Yes	Yes	Head and Neck squamous cell carcinoma
KICH	Yes	Yes	Kidney Chromophobe
KIRC	Yes	Yes	Kidney renal clear cell carcinoma
KIRP	Yes	Yes	Kidney renal papillary cell carcinoma
LAML	Yes	Yes	Acute Myeloid Leukemia
LCML	No	No	Chronic Myelogenous Leukemia
LGG	Yes	Yes	Brain Lower Grade Glioma
LIHC	Yes	No	Liver hepatocellular carcinoma
LUAD	Yes	Yes	Lung adenocarcinoma
LUSC	Yes	Yes	Lung squamous cell carcinoma
MESO	Yes	No	Mesothelioma
MISC	No	No	Miscellaneous
OV	Yes	Yes	Ovarian serous cystadenocarcinoma
PAAD	Yes	No	Pancreatic adenocarcinoma
PCPG	Yes	No	Pheochromocytoma and Paraganglioma
PRAD	Yes	Yes	Prostate adenocarcinoma
READ	Yes	No	Rectum adenocarcinoma
SARC	Yes	No	Sarcoma
SKCM	Yes	Yes	Skin Cutaneous Melanoma
STAD	Yes	Yes	Stomach adenocarcinoma
TGCT	Yes	No	Testicular Germ Cell Tumors
THCA	Yes	Yes	Thyroid carcinoma
THYM	Yes	No	Thymoma
UCEC	Yes	Yes	Uterine Corpus Endometrial Carcinoma
UCS	Yes	No	Uterine Carcinosarcoma
UVM	Yes	No	Uveal Melanoma
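
The same reference table can be filtered directly in R. The following is a minimal sketch that assumes the `Available` and `Study.Abbreviation` columns shown in Table 1; the object name `availableCodes` is only illustrative.

```r
library(TCGAutils)

## load the reference table shipped with TCGAutils
data("diseaseCodes", package = "TCGAutils")

## keep the study abbreviations flagged as available in curatedTCGAData
isAvailable <- diseaseCodes[["Available"]] == "Yes"
availableCodes <- diseaseCodes[["Study.Abbreviation"]][isAvailable]

head(availableCodes)
length(availableCodes)
```

Any element of `availableCodes` can then be passed as the disease code to curatedTCGAData.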

A descriptive table of available ’omics types in curatedTCGAData (thanks to Ludwig G. @lgeistlinger):

## from dataTypesHTML.R
knitr::kable(dataTypes, align = "l", escape = FALSE,
    caption = "Descriptions of data types available by Bioconductor data class") %>%
    kable_styling(bootstrap_options = c("hover", "striped", "responsive"),
        full_width = FALSE) %>% group_rows(index = setNames(iwidth, groups)) %>%
    footnote(symbol = footnote)

Table 2: Descriptions of data types available by Bioconductor data class
ExperimentList data types	Description
SummarizedExperiment*
RNASeqGene	RSEM TPM gene expression values
RNASeq2GeneNorm	Upper quartile normalized RSEM TPM gene expression values
miRNAArray	Probe-level miRNA expression values
miRNASeqGene	Gene-level log2 RPM miRNA expression values
mRNAArray	Unified gene-level mRNA expression values
mRNAArray_huex	Gene-level mRNA expression values from Affymetrix Human Exon Array
mRNAArray_TX_g4502a	Gene-level mRNA expression values from Agilent 244K Array
mRNAArray_TX_ht_hg_u133a	Gene-level mRNA expression values from Affymetrix Human Genome U133 Array
GISTIC_AllByGene	Gene-level GISTIC2 copy number values
GISTIC_ThresholdedByGene	Gene-level GISTIC2 thresholded discrete copy number values
RPPAArray	Reverse Phase Protein Array normalized protein expression values
RangedSummarizedExperiment
GISTIC_Peaks	GISTIC2 thresholded discrete copy number values in recurrent peak regions
SummarizedExperiment with HDF5Array DelayedMatrix
Methylation_methyl27	Probe-level methylation beta values from Illumina HumanMethylation 27K BeadChip
Methylation_methyl450	Probe-level methylation beta values from Infinium HumanMethylation 450K BeadChip
RaggedExperiment
CNASNP	Segmented somatic Copy Number Alteration calls from SNP array
CNVSNP	Segmented germline Copy Number Variant calls from SNP Array
CNASeq	Segmented somatic Copy Number Alteration calls from low pass DNA Sequencing
Mutation*	Somatic mutations calls
CNACGH_CGH_hg_244a	Segmented somatic Copy Number Alteration calls from CGH Agilent Microarray 244A
CNACGH_CGH_hg_415k_g4124a	Segmented somatic Copy Number Alteration calls from CGH Agilent Microarray 415K
* All can be converted to RangedSummarizedExperiment (except RPPAArray) with TCGAutils

4.2 `TCGAutils`

TCGAutils is a companion package that enhances curatedTCGAData by allowing additional exploration and manipulation of samples and metadata in rows and columns.

Available operations in TCGAutils and MultiAssayExperiment enable user-friendly operations for subsetting, separating, converting, and reshaping of sample and feature TCGA data.

TCGAutils was developed specifically for TCGA data and for curatedTCGAData products. It provides convenience / helper functions in three major areas:

conversion / summarization of row annotations to genomic ranges
identification and separation of samples
translation and interpretation of TCGA identifiers

along with additional reference data sets to explore TCGA data.
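
As a brief sketch of the third area, the identifier helpers can be applied to a character vector of TCGA barcodes. The barcodes below are illustrative values in the standard aliquot format, and `isNormal` is a name chosen here for the example.

```r
library(TCGAutils)

## illustrative barcodes in the standard TCGA aliquot format
barcodes <- c(
    "TCGA-B0-5117-11A-01D-1421-08",
    "TCGA-B0-5094-11A-01D-1421-08",
    "TCGA-E9-A295-10A-01D-A16D-09"
)

## patient-level identifiers (first three barcode fields)
TCGAbarcode(barcodes)

## patient identifiers plus the sample and vial field
TCGAbarcode(barcodes, sample = TRUE)

## table of biospecimen fields parsed from each barcode
TCGAbiospec(barcodes)

## logical vector marking solid tissue normal samples (sample code "11")
isNormal <- TCGAsampleSelect(barcodes, sampleCodes = "11")
isNormal
```

The sample type codes and their definitions are listed in the `sampleTypes` reference table.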

## from TCGAutils_cheatHTML.R script
knitr::kable(tabInfo, align = "l", escape = FALSE,
    caption = "Summary of available functionality in TCGAutils") %>%
    kable_styling(bootstrap_options = c("hover", "striped", "responsive"),
        full_width = FALSE) %>% group_rows(index = setNames(iwidth, groups))

Table 3: Summary of available functionality in TCGAutils
Category and Function	Description
MultiAssayExperiment helpers
generateMap	Automatically generate a sampleMap structure from assays
imputeAssay	Impute values in numerical assays based on KNN
mergeColData	Add additional data to the colData slot of MAE
qreduceTCGA	Convert RaggedExperiment assays to RSE based on heuristics
symbolsToRanges	Convert gene symbols to genomic ranges using org.db
mirToRanges	Convert microRNA sequences to genomic ranges with mirbase.db
simplifyTCGA	Use qreduceTCGA, symbolsToRanges, and mirToRanges in succession
curatedTCGAData helpers
getSubtypeMap	Obtain the available subtype information
getClinicalNames	Get a list of clinical variable names for all cancer types
splitAssays	Separate assays based on sample data found in barcodes
sampleTables	Get a list of samples in each assay in a MAE
TCGA Identifiers
TCGAbarcode	Chop TCGA barcode into sections
TCGAbiospec	Get a table of information extracted from a vector of barcodes
TCGAsampleSelect	Indicate which barcodes belong to a specific sample type
UUIDtoBarcode	Translate universal identifiers to TCGA barcodes
UUIDtoUUID	Translate between case and file universal identifiers
barcodeToUUID	Translate TCGA barcodes to universal identifiers
filenameToBarcode	Obtain TCGA barcodes from a vector of TCGA file names
Flat to Bioconductor classes
makeGRangesListFromCopyNumber	Create a GRangesList from a copy number data.frame
makeGRangesListFromExonFiles	Obtain a GRangesList object from a list of individual exon files
makeSummarizedExperimentFromGISTIC	Create a SummarizedExperiment object from a Firehose GISTIC RTCGAToolbox object
Genome Builds
translateBuild	Translate build version name between UCSC and NCBI
extractBuild	Find build in string pattern such as a file name
uniformBuilds	Homogenize a vector of builds based on a threshold for the alternative build name
Reference data
diseaseCodes	Get a table of TCGA cancer codes and subtype availability
sampleTypes	Get a table of sample type codes and their definition
clinicalNames	Obtain a CharacterList of common variable names for each TCGA disease code
getFilename	Obtain a file name string for the relevant data query
Miscellaneous
findGRangesCols	Find the minimum necessary variable names for conversion to GRanges

To better understand how it all fits together, this schematic shows how these packages and data sources relate to one another within the curatedTCGAData pipeline.

Figure 1: Schematic of curatedTCGAData Pipeline

5 Using Docker containers

Getting started with the proper installation of R and Bioconductor can be tricky.

Bioconductor version	R Version
`Bioc-release 3.8`	R release (`>= 3.5.0` and `< 3.6.0`)
`Bioc-devel 3.9`	R devel (`>= 3.6.0`)

For guidance on installing and managing package libraries for different versions, see http://bioconductor.org/install/

Alternatively, you can get started with either the release or the development version of Bioconductor using Docker containers. Bioconductor regularly publishes Docker images for both release and devel versions, and we have created a script to load either image using Docker.

Refer to the GitHub repository at https://github.com/waldronlab/bioconductor_devel to use the images and skip the management of installation directories and versioning.

6 Major Data Classes

`(Ranged)SummarizedExperiment`

Figure 2: A matrix-like container where rows represent features of interest and columns represent samples
The objects contain one or more assays, each represented by a matrix-like object of numeric or other mode.

SummarizedExperiment is the most important Bioconductor class for matrix-like experimental data, including data from RNA sequencing and microarray experiments. It can store multiple experimental data matrices of identical dimensions, with associated metadata on the rows/genes/transcripts/other measurements (rowData), column/sample phenotype or clinical data (colData), and the overall experiment (metadata). The derivative class RangedSummarizedExperiment associates a GRanges or GRangesList vector with the rows. These classes supersede the use of ExpressionSet. Note that many other classes for experimental data are actually derived from SummarizedExperiment.
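
To make the structure concrete, here is a minimal sketch that builds a toy SummarizedExperiment and a RangedSummarizedExperiment. All values, feature names, sample names, and genomic coordinates are made up for illustration.

```r
library(SummarizedExperiment)

## toy assay matrix: 4 features (rows) by 3 samples (columns)
counts <- matrix(
    1:12, nrow = 4,
    dimnames = list(paste0("gene", 1:4), paste0("sample", 1:3))
)

## assay data plus row, column, and experiment-level metadata
se <- SummarizedExperiment(
    assays = list(counts = counts),
    rowData = DataFrame(symbol = c("A", "B", "C", "D")),
    colData = DataFrame(
        group = c("tumor", "tumor", "normal"),
        row.names = colnames(counts)
    ),
    metadata = list(note = "toy example")
)

## supplying genomic ranges for the rows gives a RangedSummarizedExperiment
rowRng <- GRanges(
    seqnames = "chr1",
    ranges = IRanges(start = c(1, 101, 201, 301), width = 50)
)
names(rowRng) <- rownames(counts)

rse <- SummarizedExperiment(
    assays = list(counts = counts),
    rowRanges = rowRng,
    colData = colData(se)
)

dim(se)
class(rse)
rowRanges(rse)
```

Both objects are subset with the usual `[rows, columns]` syntax, which keeps the assay data and its annotations in sync.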

6.1 `RaggedExperiment`

RaggedExperiment is a flexible data representation for segmented copy number, somatic mutations such as represented in .vcf files, and other ragged array schema for genomic location data. Like the GRangesList class from GenomicRanges, RaggedExperiment can be used to represent differing genomic ranges on each of a set of samples. In fact, RaggedExperiment contains a GRangesList:

showClass("RaggedExperiment")
#> Class "RaggedExperiment" [package "RaggedExperiment"]
#> 
#> Slots:
#>                                                       
#> Name:       assays      rowidx      colidx    metadata
#> Class: GRangesList     integer     integer        list
#> 
#> Extends: "Annotated"

However, RaggedExperiment provides a flexible set of Assay methods to support transformation of such data to matrix format.
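
The sketch below illustrates these Assay methods on a toy object with two samples that carry different ranges. The ranges, scores, and the `weightedmean` helper are illustrative choices in the style of the RaggedExperiment package vignette, not part of this workshop's data.

```r
library(RaggedExperiment)

## two samples with differing genomic ranges and a numeric score
sample1 <- GRanges(
    c(A = "chr1:1-10:-", B = "chr1:8-14:+", C = "chr2:15-18:+"),
    score = 3:5
)
sample2 <- GRanges(
    c(D = "chr1:1-10:-", E = "chr2:11-18:+"),
    score = 1:2
)
ragexp <- RaggedExperiment(
    sample1 = sample1, sample2 = sample2,
    colData = DataFrame(id = 1:2)
)

## one row per range in the object, NA where a sample lacks that range
sparseAssay(ragexp)

## identical ranges across samples are collapsed into a single row
compactAssay(ragexp)

## summarize scores over user-supplied query ranges with a custom function
query <- GRanges(c("chr1:1-14:-", "chr2:11-18:+"))
weightedmean <- function(scores, ranges, qranges) {
    isects <- pintersect(ranges, qranges)
    sum(scores * width(isects)) / sum(width(isects))
}
qreduced <- qreduceAssay(ragexp, query, simplifyReduce = weightedmean)
qreduced
```

Which method is appropriate depends on the data type: the sparse and compact forms keep the original ranges, while the qreduce form projects every sample onto a shared set of ranges.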

Figure 3: RaggedExperiment object schematic
Rows and columns represent genomic ranges and samples, respectively. Assay operations can be performed with (from left to right) compactAssay, qreduceAssay, and sparseAssay.

6.2 `MultiAssayExperiment`

MultiAssayExperiment is an integrative container for coordinating multi-omics experiment data on a set of biological specimens. As much as possible, its methods adopt the same vocabulary as SummarizedExperiment. A MultiAssayExperiment can contain any number of assays with different representations. Assays may be ID-based, where measurements are indexed by identifiers of genes, microRNA, proteins, microbes, etc. Alternatively, assays may be range-based, where measurements correspond to genomic ranges that can be represented as GRanges objects, such as gene expression or copy number. For ID-based assays, there is no requirement that the same IDs be present for different experiments. For range-based assays, there is also no requirement that the same ranges be present for different experiments; furthermore, it is possible for different samples within an experiment to be represented by different ranges. The following data classes have been tested to work as elements of a MultiAssayExperiment:

matrix: the most basic class for ID-based datasets, could be used for example for gene expression summarized per-gene, microRNA, metabolomics, or microbiome data.
SummarizedExperiment and derived classes: described above, could be used for miRNA, gene expression, proteomics, or any matrix-like data where measurements are represented by IDs.
RangedSummarizedExperiment: described above, could be used for gene expression, methylation, or other data types referring to genomic positions.
ExpressionSet: Another rich representation for ID-based datasets, supported only for legacy reasons
RaggedExperiment: described above, for non-rectangular (ragged) range-based datasets such as segmented copy number, where segmentation of copy number alterations occurs at different genomic locations in each sample.
RangedVcfStack: For VCF archives broken up by chromosome (see VcfStack class defined in the GenomicFiles package)
DelayedMatrix: An on-disk representation of matrix-like objects for large datasets. It reduces memory usage and optimizes performance with delayed operations. This class is part of the DelayedArray package.

Note that any data class extending these classes, and in fact any data class supporting row and column names and subsetting can be used as an element of a MultiAssayExperiment.

Figure 4: MultiAssayExperiment object schematic
colData provides data about the patients, cell lines, or other biological units, with one row per unit and one column per variable. The experiments are a list of assay datasets of arbitrary class. The sampleMap relates each column (observation) in ExperimentList to exactly one row (biological unit) in colData; however, one row of colData may map to zero, one, or more columns per assay, allowing for missing and replicate assays. sampleMap allows for per-assay sample naming conventions. Metadata can be used to store information in arbitrary format about the MultiAssayExperiment. Green stripes indicate a mapping of one subject to multiple observations across experiments.
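
The relationship between colData and sampleMap can be inspected on a real object. This sketch assumes the small `miniACC` example dataset that ships with the MultiAssayExperiment package.

```r
library(MultiAssayExperiment)

## small adrenocortical carcinoma example shipped with MultiAssayExperiment
data("miniACC", package = "MultiAssayExperiment")

## one row per patient, one column per clinical variable
dim(colData(miniACC))

## sampleMap links each assay column ("colname") to a patient ("primary")
head(sampleMap(miniACC))

## number of observations recorded for each assay
table(sampleMap(miniACC)[["assay"]])

## every "primary" value refers to a row of colData
all(sampleMap(miniACC)[["primary"]] %in% rownames(colData(miniACC)))
```

Because the mapping is explicit, assays may have fewer columns than there are patients, which is how missing assays are represented.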

7 Working with MultiAssayExperiment

7.1 API cheat sheet

Figure 5: The MultiAssayExperiment API for construction, access, subsetting, management, and reshaping to formats for application of R/Bioconductor graphics and analysis packages

7.1.1 Building a MultiAssayExperiment from scratch

To build your own MultiAssayExperiment from scratch, see the Coordinating Analysis of Multi-Assay Experiments vignette in the MultiAssayExperiment package. The package cheat sheet is also helpful.
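
As a rough sketch of what construction from scratch involves, the example below assembles a MultiAssayExperiment from two toy matrices. All patient identifiers, assay names, feature names, and values are made up for illustration.

```r
library(MultiAssayExperiment)

## patient-level data: one row per biological unit
patients <- DataFrame(
    sex = c("F", "M", "F"),
    row.names = c("p1", "p2", "p3")
)

## two toy assays; column names follow per-assay naming conventions
rna <- matrix(
    rnorm(12), nrow = 3,
    dimnames = list(paste0("gene", 1:3), c("r1", "r2", "r3a", "r3b"))
)
prot <- matrix(
    rnorm(4), nrow = 2,
    dimnames = list(c("protA", "protB"), c("x1", "x2"))
)

## sampleMap: p3 has two RNA replicates and no protein measurement
smap <- DataFrame(
    assay = factor(c(rep("rna", 4), rep("prot", 2))),
    primary = c("p1", "p2", "p3", "p3", "p1", "p2"),
    colname = c("r1", "r2", "r3a", "r3b", "x1", "x2")
)

mae <- MultiAssayExperiment(
    experiments = ExperimentList(list(rna = rna, prot = prot)),
    colData = patients,
    sampleMap = smap
)
mae
```

The explicit sampleMap is what allows replicate and missing observations; when assay column names already match the colData row names, the map can usually be generated for you.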

If anything is unclear, please ask a question at https://support.bioconductor.org/ or create an issue on the MultiAssayExperiment or the curatedTCGAData issue tracker.

8 The Cancer Genome Atlas (TCGA) as MultiAssayExperiment objects

Most unrestricted TCGA data are available as MultiAssayExperiment objects from the curatedTCGAData package. This represents a lot of harmonization!

curatedTCGAData("ACC")
#>                                    Title DispatchClass
#> 1                    ACC_CNASNP-20160128           Rda
#> 2                    ACC_CNVSNP-20160128           Rda
#> 4          ACC_GISTIC_AllByGene-20160128           Rda
#> 5              ACC_GISTIC_Peaks-20160128           Rda
#> 6  ACC_GISTIC_ThresholdedByGene-20160128           Rda
#> 8        ACC_Methylation-20160128_assays        H5File
#> 9            ACC_Methylation-20160128_se           Rds
#> 10             ACC_miRNASeqGene-20160128           Rda
#> 11                 ACC_Mutation-20160128           Rda
#> 12          ACC_RNASeq2GeneNorm-20160128           Rda
#> 13                ACC_RPPAArray-20160128           Rda
suppressMessages({
    (acc <- curatedTCGAData("ACC",
        assays = c("miRNASeqGene", "RPPAArray", "Mutation",
            "RNASeq2GeneNorm", "CNVSNP", "GISTIC"), dry.run = FALSE))
})
#> A MultiAssayExperiment object of 8 listed
#>  experiments with user-defined names and respective classes. 
#>  Containing an ExperimentList class object of length 8: 
#>  [1] ACC_CNVSNP-20160128: RaggedExperiment with 21052 rows and 180 columns 
#>  [2] ACC_GISTIC_AllByGene-20160128: SummarizedExperiment with 24776 rows and 90 columns 
#>  [3] ACC_GISTIC_Peaks-20160128: RangedSummarizedExperiment with 42 rows and 90 columns 
#>  [4] ACC_GISTIC_ThresholdedByGene-20160128: SummarizedExperiment with 24776 rows and 90 columns 
#>  [5] ACC_miRNASeqGene-20160128: SummarizedExperiment with 1046 rows and 80 columns 
#>  [6] ACC_Mutation-20160128: RaggedExperiment with 20166 rows and 90 columns 
#>  [7] ACC_RNASeq2GeneNorm-20160128: SummarizedExperiment with 20501 rows and 79 columns 
#>  [8] ACC_RPPAArray-20160128: SummarizedExperiment with 192 rows and 46 columns 
#> Features: 
#>  experiments() - obtain the ExperimentList instance 
#>  colData() - the primary/phenotype DataFrame 
#>  sampleMap() - the sample availability DataFrame 
#>  `$`, `[`, `[[` - extract colData columns, subset, or experiment 
#>  *Format() - convert into a long or wide DataFrame 
#>  assays() - convert ExperimentList to a SimpleList of matrices

Note: Methylation files will differ depending on the Bioconductor version (release vs. devel).

8.1 Important clinical information

These objects contain most unrestricted TCGA assay and clinical / pathological data, as well as material curated from the supplements of published TCGA primary papers at the end of the colData columns:

dim(colData(acc))
#> [1]  92 822
tail(colnames(colData(acc)), 10)
#>  [1] "MethyLevel"       "miRNA.cluster"    "SCNA.cluster"    
#>  [4] "protein.cluster"  "COC"              "OncoSign"        
#>  [7] "purity"           "ploidy"           "genome_doublings"
#> [10] "ADS"

The TCGAutils::getClinicalNames function will display relevant clinical column names obtained from RTCGAToolbox that are also commonly found in other cancer types.

(acccol <- getClinicalNames("ACC"))
#>  [1] "years_to_birth"                      
#>  [2] "vital_status"                        
#>  [3] "days_to_death"                       
#>  [4] "days_to_last_followup"               
#>  [5] "tumor_tissue_site"                   
#>  [6] "pathologic_stage"                    
#>  [7] "pathology_T_stage"                   
#>  [8] "pathology_N_stage"                   
#>  [9] "gender"                              
#> [10] "date_of_initial_pathologic_diagnosis"
#> [11] "radiation_therapy"                   
#> [12] "histological_type"                   
#> [13] "residual_tumor"                      
#> [14] "number_of_lymph_nodes"               
#> [15] "race"                                
#> [16] "ethnicity"

all(acccol %in% names(colData(acc)))
#> [1] TRUE

head(colData(acc)[, acccol])
#> DataFrame with 6 rows and 16 columns
#>              years_to_birth vital_status days_to_death days_to_last_followup
#>                   <integer>    <integer>     <integer>             <integer>
#> TCGA-OR-A5J1             58            1          1355                    NA
#> TCGA-OR-A5J2             44            1          1677                    NA
#> TCGA-OR-A5J3             23            0            NA                  2091
#> TCGA-OR-A5J4             23            1           423                    NA
#> TCGA-OR-A5J5             30            1           365                    NA
#> TCGA-OR-A5J6             29            0            NA                  2703
#>              tumor_tissue_site pathologic_stage pathology_T_stage
#>                    <character>      <character>       <character>
#> TCGA-OR-A5J1           adrenal         stage ii                t2
#> TCGA-OR-A5J2           adrenal         stage iv                t3
#> TCGA-OR-A5J3           adrenal        stage iii                t3
#> TCGA-OR-A5J4           adrenal         stage iv                t3
#> TCGA-OR-A5J5           adrenal        stage iii                t4
#> TCGA-OR-A5J6           adrenal         stage ii                t2
#>              pathology_N_stage      gender
#>                    <character> <character>
#> TCGA-OR-A5J1                n0        male
#> TCGA-OR-A5J2                n0      female
#> TCGA-OR-A5J3                n0      female
#> TCGA-OR-A5J4                n1      female
#> TCGA-OR-A5J5                n0        male
#> TCGA-OR-A5J6                n0      female
#>              date_of_initial_pathologic_diagnosis radiation_therapy
#>                                         <integer>       <character>
#> TCGA-OR-A5J1                                 2000                no
#> TCGA-OR-A5J2                                 2004                no
#> TCGA-OR-A5J3                                 2008                no
#> TCGA-OR-A5J4                                 2000                no
#> TCGA-OR-A5J5                                 2000                no
#> TCGA-OR-A5J6                                 2006                no
#>                                 histological_type residual_tumor
#>                                       <character>    <character>
#> TCGA-OR-A5J1 adrenocortical carcinoma- usual type             r0
#> TCGA-OR-A5J2 adrenocortical carcinoma- usual type             r2
#> TCGA-OR-A5J3 adrenocortical carcinoma- usual type             r0
#> TCGA-OR-A5J4 adrenocortical carcinoma- usual type             r2
#> TCGA-OR-A5J5 adrenocortical carcinoma- usual type             r2
#> TCGA-OR-A5J6 adrenocortical carcinoma- usual type             r0
#>              number_of_lymph_nodes                      race
#>                          <integer>               <character>
#> TCGA-OR-A5J1                    NA                     white
#> TCGA-OR-A5J2                     0                     white
#> TCGA-OR-A5J3                     0                     white
#> TCGA-OR-A5J4                     2                     white
#> TCGA-OR-A5J5                    NA                     white
#> TCGA-OR-A5J6                    NA black or african american
#>                       ethnicity
#>                     <character>
#> TCGA-OR-A5J1                 NA
#> TCGA-OR-A5J2 hispanic or latino
#> TCGA-OR-A5J3 hispanic or latino
#> TCGA-OR-A5J4 hispanic or latino
#> TCGA-OR-A5J5 hispanic or latino
#> TCGA-OR-A5J6 hispanic or latino

Reference documentation in Firehose Broad GDAC:

browseURL("https://dx.doi.org/10.7908/C1RX9BD4")

8.2 Subtype Information

Using TCGAutils, getSubtypeMap shows the user the column names associated with published subtype information.

getSubtypeMap(acc)
#>          ACC_annotations     ACC_subtype
#> 1             Patient_ID          SAMPLE
#> 2  histological_subtypes       Histology
#> 3          mrna_subtypes         C1A/C1B
#> 4          mrna_subtypes         mRNA_K4
#> 5                   cimp      MethyLevel
#> 6      microrna_subtypes   miRNA cluster
#> 7          scna_subtypes    SCNA cluster
#> 8       protein_subtypes protein cluster
#> 9   integrative_subtypes             COC
#> 10     mutation_subtypes        OncoSign

(subtypeCols <- getSubtypeMap(acc)[["ACC_subtype"]])
#>  [1] "SAMPLE"          "Histology"       "C1A/C1B"         "mRNA_K4"        
#>  [5] "MethyLevel"      "miRNA cluster"   "SCNA cluster"    "protein cluster"
#>  [9] "COC"             "OncoSign"

## for older versions of TCGAutils
subtypeCols <- gsub("SAMPLE", "patientID", subtypeCols)

colData(acc)[, make.names(subtypeCols)]
#> DataFrame with 92 rows and 10 columns
#>                 patientID   Histology     C1A.C1B
#>               <character> <character> <character>
#> TCGA-OR-A5J1 TCGA-OR-A5J1  Usual Type         C1A
#> TCGA-OR-A5J2 TCGA-OR-A5J2  Usual Type         C1A
#> TCGA-OR-A5J3 TCGA-OR-A5J3  Usual Type         C1A
#> TCGA-OR-A5J4 TCGA-OR-A5J4  Usual Type          NA
#> TCGA-OR-A5J5 TCGA-OR-A5J5  Usual Type         C1A
#> ...                   ...         ...         ...
#> TCGA-PK-A5H9 TCGA-PK-A5H9  Usual Type         C1B
#> TCGA-PK-A5HA TCGA-PK-A5HA  Usual Type         C1B
#> TCGA-PK-A5HB TCGA-PK-A5HB  Usual Type         C1A
#> TCGA-PK-A5HC TCGA-PK-A5HC  Usual Type          NA
#> TCGA-P6-A5OG TCGA-P6-A5OG          NA          NA
#>                                           mRNA_K4        MethyLevel
#>                                       <character>       <character>
#> TCGA-OR-A5J1 steroid-phenotype-high+proliferation         CIMP-high
#> TCGA-OR-A5J2 steroid-phenotype-high+proliferation          CIMP-low
#> TCGA-OR-A5J3               steroid-phenotype-high CIMP-intermediate
#> TCGA-OR-A5J4                                   NA         CIMP-high
#> TCGA-OR-A5J5               steroid-phenotype-high CIMP-intermediate
#> ...                                           ...               ...
#> TCGA-PK-A5H9                steroid-phenotype-low          CIMP-low
#> TCGA-PK-A5HA                steroid-phenotype-low         CIMP-high
#> TCGA-PK-A5HB               steroid-phenotype-high         CIMP-high
#> TCGA-PK-A5HC                                   NA                NA
#> TCGA-P6-A5OG                                   NA                NA
#>              miRNA.cluster SCNA.cluster protein.cluster         COC
#>                <character>  <character>       <integer> <character>
#> TCGA-OR-A5J1       miRNA_1        Quiet              NA        COC3
#> TCGA-OR-A5J2       miRNA_1        Noisy               1        COC3
#> TCGA-OR-A5J3       miRNA_6  Chromosomal               3        COC2
#> TCGA-OR-A5J4       miRNA_6  Chromosomal              NA          NA
#> TCGA-OR-A5J5       miRNA_2  Chromosomal              NA        COC2
#> ...                    ...          ...             ...         ...
#> TCGA-PK-A5H9       miRNA_5        Quiet               3        COC1
#> TCGA-PK-A5HA       miRNA_5  Chromosomal               2        COC1
#> TCGA-PK-A5HB       miRNA_6        Noisy              NA        COC3
#> TCGA-PK-A5HC            NA  Chromosomal              NA          NA
#> TCGA-P6-A5OG            NA           NA              NA          NA
#>                 OncoSign
#>              <character>
#> TCGA-OR-A5J1         CN2
#> TCGA-OR-A5J2    TP53/NF1
#> TCGA-OR-A5J3         CN2
#> TCGA-OR-A5J4         CN1
#> TCGA-OR-A5J5    TP53/NF1
#> ...                  ...
#> TCGA-PK-A5H9    TP53/NF1
#> TCGA-PK-A5HA         CN2
#> TCGA-PK-A5HB    TP53/NF1
#> TCGA-PK-A5HC    TP53/NF1
#> TCGA-P6-A5OG          NA

8.3 Pan-Cancer Data

curatedTCGAData builds datasets on the fly, including Pan-Cancer datasets that combine assays from two or more cancer types. Here we demonstrate a Pan-Cancer MultiAssayExperiment using ovarian and breast cancer data.

(
ovbrca <- curatedTCGAData(diseaseCode = c("OV", "BRCA"),
    assays = c("RNA*", "Mutation"), dry.run = FALSE)
)

#> A MultiAssayExperiment object of 6 listed
#>  experiments with user-defined names and respective classes. 
#>  Containing an ExperimentList class object of length 6: 
#>  [1] BRCA_Mutation-20160128: RaggedExperiment with 90490 rows and 993 columns 
#>  [2] BRCA_RNASeq2GeneNorm-20160128: SummarizedExperiment with 20501 rows and 1212 columns 
#>  [3] BRCA_RNASeqGene-20160128: SummarizedExperiment with 20502 rows and 878 columns 
#>  [4] OV_Mutation-20160128: RaggedExperiment with 20219 rows and 316 columns 
#>  [5] OV_RNASeq2GeneNorm-20160128: SummarizedExperiment with 20501 rows and 307 columns 
#>  [6] OV_RNASeqGene-20160128: SummarizedExperiment with 19990 rows and 299 columns 
#> Features: 
#>  experiments() - obtain the ExperimentList instance 
#>  colData() - the primary/phenotype DataFrame 
#>  sampleMap() - the sample availability DataFrame 
#>  `$`, `[`, `[[` - extract colData columns, subset, or experiment 
#>  *Format() - convert into a long or wide DataFrame 
#>  assays() - convert ExperimentList to a SimpleList of matrices
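
Every assay name in the combined object carries its cancer code as a prefix, so the assays can be grouped by cancer type with ordinary string handling. The sketch below works on the assay names printed above, copied by hand into a character vector so that it runs without downloading any data. Applying the same pattern to the real object (last comment) is an assumption about how one might proceed, not a step from the original workshop.

```r
# Assay names copied from the printed ovbrca object above
assay_names <- c(
    "BRCA_Mutation-20160128", "BRCA_RNASeq2GeneNorm-20160128",
    "BRCA_RNASeqGene-20160128", "OV_Mutation-20160128",
    "OV_RNASeq2GeneNorm-20160128", "OV_RNASeqGene-20160128"
)

# The cancer code is everything before the first underscore
cancer_code <- sub("_.*$", "", assay_names)

# Group the assay names by cancer type
by_cancer <- split(assay_names, cancer_code)
by_cancer

# With the real object, one cancer type could then be selected using the
# three-index subsetting shown elsewhere in this workshop, e.g.
# ovbrca[, , by_cancer[["OV"]]]
```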

9 `TCGAutils` functionality

Aside from the available reshaping functions already included in the MultiAssayExperiment package, the TCGAutils package provides additional helper functions for working with TCGA data.

9.1 “Simplification” of `curatedTCGAData` objects

A number of helper functions are available for managing datasets from curatedTCGAData. These include:

9.1.1 Conversions of `SummarizedExperiment` to `RangedSummarizedExperiment` based on TxDb.Hsapiens.UCSC.hg19.knownGene for:

9.1.1.1 `mirToRanges` for microRNA

mirToRanges(acc)
#> harmonizing input:
#>   removing 80 sampleMap rows not in names(experiments)
#> A MultiAssayExperiment object of 9 listed
#>  experiments with user-defined names and respective classes. 
#>  Containing an ExperimentList class object of length 9: 
#>  [1] ACC_CNVSNP-20160128: RaggedExperiment with 21052 rows and 180 columns 
#>  [2] ACC_GISTIC_AllByGene-20160128: SummarizedExperiment with 24776 rows and 90 columns 
#>  [3] ACC_GISTIC_Peaks-20160128: RangedSummarizedExperiment with 42 rows and 90 columns 
#>  [4] ACC_GISTIC_ThresholdedByGene-20160128: SummarizedExperiment with 24776 rows and 90 columns 
#>  [5] ACC_Mutation-20160128: RaggedExperiment with 20166 rows and 90 columns 
#>  [6] ACC_RNASeq2GeneNorm-20160128: SummarizedExperiment with 20501 rows and 79 columns 
#>  [7] ACC_RPPAArray-20160128: SummarizedExperiment with 192 rows and 46 columns 
#>  [8] ACC_miRNASeqGene-20160128_ranged: RangedSummarizedExperiment with 1002 rows and 80 columns 
#>  [9] ACC_miRNASeqGene-20160128_unranged: SummarizedExperiment with 44 rows and 80 columns 
#> Features: 
#>  experiments() - obtain the ExperimentList instance 
#>  colData() - the primary/phenotype DataFrame 
#>  sampleMap() - the sample availability DataFrame 
#>  `$`, `[`, `[[` - extract colData columns, subset, or experiment 
#>  *Format() - convert into a long or wide DataFrame 
#>  assays() - convert ExperimentList to a SimpleList of matrices

Note about microRNA: You can set ranges for the microRNA assay according to the genomic locations of the microRNAs themselves, or the locations of their predicted targets, but we don’t do it here. Assigning the genomic ranges of microRNA targets would be an easy way to subset microRNAs according to their targets.

9.1.1.2 `symbolsToRanges` for gene symbols

symbolsToRanges(acc)
#> 'select()' returned 1:many mapping between keys and columns
#> 'select()' returned 1:1 mapping between keys and columns
#> 'select()' returned 1:many mapping between keys and columns
#> 'select()' returned 1:1 mapping between keys and columns
#> 'select()' returned 1:many mapping between keys and columns
#> 'select()' returned 1:1 mapping between keys and columns
#> harmonizing input:
#>   removing 259 sampleMap rows not in names(experiments)
#> A MultiAssayExperiment object of 11 listed
#>  experiments with user-defined names and respective classes. 
#>  Containing an ExperimentList class object of length 11: 
#>  [1] ACC_CNVSNP-20160128: RaggedExperiment with 21052 rows and 180 columns 
#>  [2] ACC_GISTIC_Peaks-20160128: RangedSummarizedExperiment with 42 rows and 90 columns 
#>  [3] ACC_miRNASeqGene-20160128: SummarizedExperiment with 1046 rows and 80 columns 
#>  [4] ACC_Mutation-20160128: RaggedExperiment with 20166 rows and 90 columns 
#>  [5] ACC_RPPAArray-20160128: SummarizedExperiment with 192 rows and 46 columns 
#>  [6] ACC_GISTIC_AllByGene-20160128_ranged: RangedSummarizedExperiment with 19538 rows and 90 columns 
#>  [7] ACC_GISTIC_AllByGene-20160128_unranged: SummarizedExperiment with 5238 rows and 90 columns 
#>  [8] ACC_GISTIC_ThresholdedByGene-20160128_ranged: RangedSummarizedExperiment with 19538 rows and 90 columns 
#>  [9] ACC_GISTIC_ThresholdedByGene-20160128_unranged: SummarizedExperiment with 5238 rows and 90 columns 
#>  [10] ACC_RNASeq2GeneNorm-20160128_ranged: RangedSummarizedExperiment with 17527 rows and 79 columns 
#>  [11] ACC_RNASeq2GeneNorm-20160128_unranged: SummarizedExperiment with 2974 rows and 79 columns 
#> Features: 
#>  experiments() - obtain the ExperimentList instance 
#>  colData() - the primary/phenotype DataFrame 
#>  sampleMap() - the sample availability DataFrame 
#>  `$`, `[`, `[[` - extract colData columns, subset, or experiment 
#>  *Format() - convert into a long or wide DataFrame 
#>  assays() - convert ExperimentList to a SimpleList of matrices
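
In the output above, each gene-level assay is split in two: rows whose symbols could be mapped to genomic coordinates go to a `_ranged` assay and the remaining rows to an `_unranged` assay (for example, 19538 + 5238 = 24776 rows for ACC_GISTIC_AllByGene). The base R sketch below illustrates that partition on a toy matrix. The feature names, values, and the annotation table are all made up for illustration, and the code does not reproduce how symbolsToRanges works internally.

```r
# Toy matrix: 4 features x 2 samples (values are made up)
expr <- matrix(
    1:8, nrow = 4,
    dimnames = list(c("GENE_A", "GENE_B", "GENE_C", "GENE_D"), c("S1", "S2"))
)

# Hypothetical annotation: only some symbols have known coordinates
annot <- data.frame(
    symbol = c("GENE_A", "GENE_C"),
    chr    = c("chr1", "chr7"),
    start  = c(100L, 500L),
    end    = c(200L, 900L),
    stringsAsFactors = FALSE
)

# Rows with coordinates form the "ranged" part, the rest the "unranged" part
has_range <- rownames(expr) %in% annot$symbol
ranged    <- expr[has_range, , drop = FALSE]
unranged  <- expr[!has_range, , drop = FALSE]

# The two parts always add up to the original number of rows
nrow(ranged) + nrow(unranged) == nrow(expr)
```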

9.1.1.3 `qreduceTCGA` for mutation and copy number

qreduceTCGA can convert RaggedExperiment objects to RangedSummarizedExperiment with one row per gene symbol, for:

- segmented copy number datasets (“CNVSNP” and “CNASNP”)
- somatic mutation datasets (“Mutation”), with a value of 1 for any non-silent mutation and a value of 0 for no mutation or silent mutation

genome(acc[["ACC_Mutation-20160128"]]) <-
    vapply(
        X = genome(acc[["ACC_Mutation-20160128"]]),
        FUN = TCGAutils::translateBuild,
        FUN.VALUE = character(1L)
    )

seqlevelsStyle(acc[["ACC_Mutation-20160128"]]) <- "UCSC"

seqlevelsStyle(acc[["ACC_Mutation-20160128"]])
#> [1] "UCSC"

rowRanges(acc[["ACC_Mutation-20160128"]])
#> GRanges object with 20166 ranges and 0 metadata columns:
#>           seqnames              ranges strand
#>              <Rle>           <IRanges>  <Rle>
#>       [1]     chr1            11561526      +
#>       [2]     chr1            12309384      +
#>       [3]     chr1            33820015      +
#>       [4]     chr1 152785074-152785097      +
#>       [5]     chr1           152800122      +
#>       ...      ...                 ...    ...
#>   [20162]     chr5 131007363-131007364      +
#>   [20163]     chr7   90894459-90894460      +
#>   [20164]     chr9 139581758-139581759      +
#>   [20165]    chr16   90095596-90095597      +
#>   [20166]    chr19   58385798-58385799      +
#>   -------
#>   seqinfo: 24 sequences from hg19 genome; no seqlengths
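
The two assignments above relabel the genome build with its UCSC name and switch the chromosome names to UCSC style, so that the mutation ranges line up with the hg19 gene annotation. The base R sketch below mimics both steps on plain character vectors. The lookup table and the input labels are illustrative assumptions; they do not reproduce the internals of `translateBuild` or `seqlevelsStyle`.

```r
# Illustrative lookup from GRC build names to the equivalent UCSC names
grc_to_ucsc <- c(GRCh37 = "hg19", GRCh38 = "hg38")

# Assumed input: one build label per sequence
build <- c("GRCh37", "GRCh37", "GRCh37")
ucsc_build <- unname(grc_to_ucsc[build])
ucsc_build

# NCBI-style chromosome names lack the "chr" prefix that UCSC uses
ncbi_chr <- c("1", "7", "X")
ucsc_chr <- paste0("chr", ncbi_chr)
ucsc_chr
```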

qreduceTCGA(acc[ , , "ACC_Mutation-20160128"])
#> 'select()' returned 1:1 mapping between keys and columns
#> harmonizing input:
#>   removing 655 sampleMap rows not in names(experiments)
#>   removing 2 colData rownames not in sampleMap 'primary'
#> Warning in .Seqinfo.mergexy(x, y): The 2 combined objects have no sequence levels in common. (Use
#>   suppressWarnings() to suppress this warning.)
#> harmonizing input:
#>   removing 90 sampleMap rows not in names(experiments)
#> A MultiAssayExperiment object of 1 listed
#>  experiment with a user-defined name and respective class. 
#>  Containing an ExperimentList class object of length 1: 
#>  [1] ACC_Mutation-20160128_simplified: RangedSummarizedExperiment with 22942 rows and 90 columns 
#> Features: 
#>  experiments() - obtain the ExperimentList instance 
#>  colData() - the primary/phenotype DataFrame 
#>  sampleMap() - the sample availability DataFrame 
#>  `$`, `[`, `[[` - extract colData columns, subset, or experiment 
#>  *Format() - convert into a long or wide DataFrame 
#>  assays() - convert ExperimentList to a SimpleList of matrices
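
The 0/1 coding described for mutation data can be pictured with a small base R example. This is only a sketch of the idea on made-up records, not the implementation used by qreduceTCGA: the sample names are hypothetical, and the `Variant_Classification` column with a "Silent" level follows the usual MAF convention, which is assumed here.

```r
# Made-up mutation records: one row per mutation call
muts <- data.frame(
    sample = c("S1", "S1", "S2", "S3"),
    gene   = c("TP53", "EZH2", "TP53", "EZH2"),
    Variant_Classification = c(
        "Missense_Mutation", "Silent", "Silent", "Nonsense_Mutation"
    ),
    stringsAsFactors = FALSE
)

# Keep only non-silent mutations
nonsilent <- muts[muts$Variant_Classification != "Silent", ]

# Gene x sample matrix: start at 0 (no mutation or silent mutation only)
genes   <- sort(unique(muts$gene))
samples <- sort(unique(muts$sample))
mat <- matrix(
    0L, nrow = length(genes), ncol = length(samples),
    dimnames = list(genes, samples)
)

# Set 1 wherever a gene has at least one non-silent mutation in a sample
mat[cbind(nonsilent$gene, nonsilent$sample)] <- 1L
mat
```

Every sample keeps a column even when all of its entries are 0, which is what allows the result to sit in a rectangular RangedSummarizedExperiment.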

9.1.1.4 `simplifyTCGA` - combine all

The simplifyTCGA function combines all of the above operations to create a more manageable MultiAssayExperiment object, using RangedSummarizedExperiment assays where possible.

(simpa <- TCGAutils::simplifyTCGA(acc))

9.1.2 Identification and separation of samples

What types of samples are in the data?

Solution

The sampleTables function gives you an overview of samples in each assay:

sampleTables(acc)
#> $`ACC_CNVSNP-20160128`
#> 
#> 01 10 11 
#> 90 85  5 
#> 
#> $`ACC_GISTIC_AllByGene-20160128`
#> 
#> 01 
#> 90 
#> 
#> $`ACC_GISTIC_Peaks-20160128`
#> 
#> 01 
#> 90 
#> 
#> $`ACC_GISTIC_ThresholdedByGene-20160128`
#> 
#> 01 
#> 90 
#> 
#> $`ACC_miRNASeqGene-20160128`
#> 
#> 01 
#> 80 
#> 
#> $`ACC_Mutation-20160128`
#> 
#> 01 
#> 90 
#> 
#> $`ACC_RNASeq2GeneNorm-20160128`
#> 
#> 01 
#> 79 
#> 
#> $`ACC_RPPAArray-20160128`
#> 
#> 01 
#> 46

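The two-digit names in each table are TCGA sample type codes. The sampleTypes reference table in TCGAutils gives their definitions:

data("sampleTypes")
head(sampleTypes)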
#>   Code                                      Definition Short.Letter.Code
#> 1   01                             Primary Solid Tumor                TP
#> 2   02                           Recurrent Solid Tumor                TR
#> 3   03 Primary Blood Derived Cancer - Peripheral Blood                TB
#> 4   04    Recurrent Blood Derived Cancer - Bone Marrow              TRBM
#> 5   05                        Additional - New Primary               TAP
#> 6   06                                      Metastatic                TM

How can I separate tumor from normal samples?

Solution

The splitAssays function separates each assay into one assay per sample type, using the sample type code embedded in the TCGA barcodes.

splitAssays(acc)
#> Selecting 'Primary Solid Tumor' samples
#> Selecting 'Blood Derived Normal' samples
#> Selecting 'Solid Tissue Normal' samples
#> Selecting 'Primary Solid Tumor' samples
#> Selecting 'Primary Solid Tumor' samples
#> Selecting 'Primary Solid Tumor' samples
#> Selecting 'Primary Solid Tumor' samples
#> Selecting 'Primary Solid Tumor' samples
#> Selecting 'Primary Solid Tumor' samples
#> Selecting 'Primary Solid Tumor' samples
#> A MultiAssayExperiment object of 10 listed
#>  experiments with user-defined names and respective classes. 
#>  Containing an ExperimentList class object of length 10: 
#>  [1] 01_ACC_CNVSNP-20160128: RaggedExperiment with 21052 rows and 90 columns 
#>  [2] 10_ACC_CNVSNP-20160128: RaggedExperiment with 21052 rows and 85 columns 
#>  [3] 11_ACC_CNVSNP-20160128: RaggedExperiment with 21052 rows and 5 columns 
#>  [4] 01_ACC_GISTIC_AllByGene-20160128: SummarizedExperiment with 24776 rows and 90 columns 
#>  [5] 01_ACC_GISTIC_Peaks-20160128: RangedSummarizedExperiment with 42 rows and 90 columns 
#>  [6] 01_ACC_GISTIC_ThresholdedByGene-20160128: SummarizedExperiment with 24776 rows and 90 columns 
#>  [7] 01_ACC_miRNASeqGene-20160128: SummarizedExperiment with 1046 rows and 80 columns 
#>  [8] 01_ACC_Mutation-20160128: RaggedExperiment with 20166 rows and 90 columns 
#>  [9] 01_ACC_RNASeq2GeneNorm-20160128: SummarizedExperiment with 20501 rows and 79 columns 
#>  [10] 01_ACC_RPPAArray-20160128: SummarizedExperiment with 192 rows and 46 columns 
#> Features: 
#>  experiments() - obtain the ExperimentList instance 
#>  colData() - the primary/phenotype DataFrame 
#>  sampleMap() - the sample availability DataFrame 
#>  `$`, `[`, `[[` - extract colData columns, subset, or experiment 
#>  *Format() - convert into a long or wide DataFrame 
#>  assays() - convert ExperimentList to a SimpleList of matrices
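
The prefixes 01, 10, and 11 in the assay names above are the sample type codes carried in characters 14 and 15 of each TCGA barcode. The base R sketch below shows that extraction and the resulting grouping. The first two barcodes come from the ACC mutation assay used in this workshop; the last two are hypothetical barcodes made up to represent normal samples. The code only illustrates the idea and is not how splitAssays is implemented.

```r
barcodes <- c(
    "TCGA-OR-A5J1-01A-11D-A29I-10",  # from the ACC mutation assay
    "TCGA-OR-A5J2-01A-11D-A29I-10",  # from the ACC mutation assay
    "TCGA-OR-A5J1-10A-01D-A29K-01",  # hypothetical, for illustration
    "TCGA-OR-A5J2-11A-01D-A29K-01"   # hypothetical, for illustration
)

# Characters 14-15 of a full barcode hold the sample type code
sample_code <- substr(barcodes, 14, 15)

# Labels for the three codes that occur in the ACC data
code_labels <- c(
    "01" = "Primary Solid Tumor",
    "10" = "Blood Derived Normal",
    "11" = "Solid Tissue Normal"
)

# Group the barcodes by sample type
by_type <- split(barcodes, unname(code_labels[sample_code]))
by_type
```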

9.1.3 Translation and Interpretation of TCGA barcodes

TCGAutils provides a number of ID translation functions. These allow the user to translate from either file or case UUIDs to TCGA barcodes and back. They work by querying the Genomic Data Commons API via the GenomicDataCommons package (thanks to Sean Davis for the original template). These include:

UUIDtoBarcode()

UUIDtoBarcode("ae55b2d3-62a1-419e-9f9a-5ddfac356db4", from_type = "case_id")
#>                                case_id submitter_id
#> 1 ae55b2d3-62a1-419e-9f9a-5ddfac356db4 TCGA-B0-5117

barcodeToUUID()

(xcode <- head(colnames(acc)[["ACC_Mutation-20160128"]], 4))
#> [1] "TCGA-OR-A5J1-01A-11D-A29I-10" "TCGA-OR-A5J2-01A-11D-A29I-10"
#> [3] "TCGA-OR-A5J3-01A-11D-A29I-10" "TCGA-OR-A5J4-01A-11D-A29I-10"
barcodeToUUID(xcode)
#>           submitter_aliquot_ids                          aliquot_ids
#> 32 TCGA-OR-A5J1-01A-11D-A29I-10 352062e7-9b06-41cd-880c-38fe268c9bf3
#> 5  TCGA-OR-A5J2-01A-11D-A29I-10 d97c6076-7e4f-4dbe-85de-12bc0d84d8e8
#> 16 TCGA-OR-A5J3-01A-11D-A29I-10 7691e5bf-7a03-4e6a-9873-74f2c8390a41
#> 39 TCGA-OR-A5J4-01A-11D-A29I-10 4d76daa9-a336-435c-aeef-47344d97ac5b

UUIDtoUUID()

head(UUIDtoUUID("ae55b2d3-62a1-419e-9f9a-5ddfac356db4", to_type = "file_id"))
#>                                case_id                        files.file_id
#> 1 ae55b2d3-62a1-419e-9f9a-5ddfac356db4 d6625424-1503-4735-a7cc-2a3de606278f
#> 2 ae55b2d3-62a1-419e-9f9a-5ddfac356db4 bc62b7e8-1b54-438b-944b-3e0655cdf6ac
#> 3 ae55b2d3-62a1-419e-9f9a-5ddfac356db4 fad3f577-6e5e-4039-b35b-69269f42d488
#> 4 ae55b2d3-62a1-419e-9f9a-5ddfac356db4 48c342b0-e7a2-4a7b-8556-55bcd8ad9ea0
#> 5 ae55b2d3-62a1-419e-9f9a-5ddfac356db4 e8eb689b-c5e1-4a21-acec-5f2ac25dbf97
#> 6 ae55b2d3-62a1-419e-9f9a-5ddfac356db4 8f345db1-686b-4e72-9b90-704978cde526

filenameToBarcode()

library(GenomicDataCommons)
fquery <- files(legacy = TRUE) %>%
    GenomicDataCommons::filter(~ cases.project.project_id == "TCGA-ACC" &
        data_category == "Gene expression" &
        data_type == "Exon quantification")

fnames <- results(fquery)$file_name[1:6]

filenameToBarcode(fnames)

See the TCGAutils help pages for details.
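
The translation functions above need network access to the Genomic Data Commons, but a barcode can also be interpreted locally, because its fields are separated by hyphens. The base R sketch below takes two aliquot barcodes from the ACC mutation assay and recovers the 12-character patient identifier, which is the form used for the rownames of the colData (for example TCGA-OR-A5J1). The field names are labels chosen here for illustration, following the standard TCGA barcode layout.

```r
aliquot <- c(
    "TCGA-OR-A5J1-01A-11D-A29I-10",
    "TCGA-OR-A5J2-01A-11D-A29I-10"
)

# The first 12 characters identify the patient (project-TSS-participant)
patient_id <- substr(aliquot, 1, 12)
patient_id

# Split one barcode into its hyphen-separated fields
fields <- strsplit(aliquot[1], "-", fixed = TRUE)[[1]]
names(fields) <- c(
    "project", "tss", "participant", "sample_vial",
    "portion_analyte", "plate", "center"
)
fields
```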

9.2 Other TCGA data types

Helper functions to add TCGA exon files (from legacy archive), copy number and GISTIC copy number calls to MultiAssayExperiment objects are also available in TCGAutils.

10 Plotting, correlation, and other analyses

10.1 How many samples have data for each combination of assays?

Solution

The built-in upsetSamples function draws an UpSet plot, an alternative to a Venn diagram that scales to many sets, to answer this question:

upsetSamples(acc)

In this dataset only 44 samples have all 5 assays, 33 are missing RNA gene expression data, 2 are missing reverse-phase protein array (RPPA), 12 have only mutations and RPPA, etc.
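
The counts behind such a plot can be computed by hand from a table that links patients to assays. The base R sketch below does this on a made-up table with hypothetical patient and assay names; it illustrates the counting only and is not the code that upsetSamples runs.

```r
# Toy sample map: one row per (assay, patient) pair; all values are made up
smap <- data.frame(
    assay   = c("RNASeq", "Mutation", "RPPA", "RNASeq", "Mutation",
                "Mutation", "RPPA"),
    primary = c("P1", "P1", "P1", "P2", "P2", "P3", "P3"),
    stringsAsFactors = FALSE
)

# Patient x assay presence matrix (TRUE if the patient has that assay)
present <- table(smap$primary, smap$assay) > 0

# Label each patient with its set of assays, then count each distinct set
combo <- apply(present, 1, function(x) {
    paste(colnames(present)[x], collapse = "+")
})
combo_counts <- table(combo)
combo_counts
```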

10.2 Kaplan-Meier plot stratified by pathology_N_stage

Create a Kaplan-Meier plot, using pathology_N_stage as a stratifying variable.

Solution

The colData holds the clinical variables needed for a Kaplan-Meier plot of overall survival stratified by nodal stage. First, build the survival object from the time to death and the vital status:

library(survival)
Surv(acc$days_to_death, acc$vital_status)

Then remove any patients missing overall survival information:

accsurv <- acc[, complete.cases(acc$days_to_death, acc$vital_status), ]

library(survminer)

fit <- survfit(Surv(days_to_death, vital_status) ~ pathology_N_stage, data = colData(accsurv))
ggsurvplot(fit, data = colData(accsurv), risk.table = TRUE)
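
To see what the filtering and fitting steps do without downloading TCGA data, the sketch below repeats them on a small made-up clinical table. The column names match those used above, but every value is invented for illustration, and the survival package is assumed to be installed.

```r
library(survival)

# Made-up clinical table with the same column names as the ACC colData
clin <- data.frame(
    days_to_death     = c(100, 250, NA, 400, 150, NA),
    vital_status      = c(1, 1, 0, 1, 1, 0),
    pathology_N_stage = c("n0", "n0", "n0", "n1", "n1", "n1"),
    stringsAsFactors  = FALSE
)

# Keep only rows with both survival time and status recorded
keep <- complete.cases(clin$days_to_death, clin$vital_status)
clin_surv <- clin[keep, ]

# Kaplan-Meier fit with one curve per nodal stage
fit <- survfit(
    Surv(days_to_death, vital_status) ~ pathology_N_stage,
    data = clin_surv
)
fit
```

Note that filtering on days_to_death removes every patient without a recorded death time, so the fit describes only the remaining patients.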

10.3 Multivariate Cox regression including RNA-seq, copy number, and pathology

Choose the EZH2 gene for demonstration. This subsetting will drop assays with no row named EZH2:

wideacc <- wideFormat(acc["EZH2", , ],
    colDataCols=c("vital_status", "days_to_death", "pathology_N_stage"))

wideacc$y <- Surv(wideacc$days_to_death, wideacc$vital_status)

head(wideacc)
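
The wide format has one row per patient, with the selected clinical columns followed by one column per assay and feature, named in the style `assay_feature`. The base R sketch below builds such a table by hand from made-up values, to show the shape of the result and how a patient missing from one assay ends up with NA. The patient names and numbers are hypothetical, and the code is not what wideFormat does internally.

```r
# Made-up EZH2 values in two assays, named by patient
rna <- c(P1 = 5.2, P2 = 7.9, P3 = 6.1)
cn  <- c(P1 = 0L, P2 = 1L)          # P3 has no copy number measurement

# Made-up clinical columns
clin <- data.frame(
    primary       = c("P1", "P2", "P3"),
    vital_status  = c(1, 0, 1),
    days_to_death = c(300, NA, 120),
    stringsAsFactors = FALSE
)

# One row per patient, one column per assay/feature pair
wide <- clin
wide$RNASeq2GeneNorm_EZH2 <- unname(rna[wide$primary])
wide$gistict_EZH2         <- unname(cn[wide$primary])
wide
```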

Perform a multivariate Cox regression with EZH2 copy number (gistict), log2-transformed EZH2 expression (RNASeq2GeneNorm), and nodal status (pathology_N_stage) as predictors:

coxph(
    Surv(days_to_death, vital_status) ~
        gistict_EZH2 + log2(RNASeq2GeneNorm_EZH2) + pathology_N_stage,
    data=wideacc
)

We see that EZH2 expression is significantly associated with overall survival (p < 0.001), but EZH2 copy number and nodal status are not. This analysis could easily be extended to the whole genome for discovery of prognostic features by repeated univariate regressions over columns, penalized multivariate regression, etc.
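
The repeated univariate regressions mentioned above could look like the following sketch. It runs on simulated data with hypothetical feature names, so the p-values carry no biological meaning; the point is only the looping pattern, with one Cox model per feature column followed by a multiple-testing adjustment. The survival package is assumed to be installed.

```r
library(survival)

# Simulated data: survival columns plus three hypothetical feature columns
set.seed(1)
n <- 60
toy <- data.frame(
    days_to_death = rexp(n, rate = 1 / 500),
    vital_status  = rbinom(n, 1, 0.7),
    geneA = rnorm(n),
    geneB = rnorm(n),
    geneC = rnorm(n)
)

features <- c("geneA", "geneB", "geneC")

# Fit one univariate Cox model per feature and collect its Wald p-value
pvals <- vapply(features, function(f) {
    fml <- as.formula(paste("Surv(days_to_death, vital_status) ~", f))
    fit <- coxph(fml, data = toy)
    summary(fit)$coefficients[1, "Pr(>|z|)"]
}, numeric(1))

# Adjust for testing many features (Benjamini-Hochberg)
padj <- p.adjust(pvals, method = "BH")
padj
```

With a real wide-format table, `features` would be the assay columns of interest, and the adjustment step matters far more because thousands of features are tested.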

For further detail, see the main MultiAssayExperiment vignette.

11 Citing MultiAssayExperiment

citation("MultiAssayExperiment")
#> 
#> To cite MultiAssayExperiment in publications use:
#> 
#>   Marcel Ramos et al. Software For The Integration Of Multiomics
#>   Experiments In Bioconductor. Cancer Research, 2017 November 1;
#>   77(21); e39-42. DOI: 10.1158/0008-5472.CAN-17-0344
#> 
#> A BibTeX entry for LaTeX users is
#> 
#>   @Article{,
#>     title = {Software For The Integration Of Multi-Omics Experiments In Bioconductor},
#>     author = {Marcel Ramos and Lucas Schiffer and Angela Re and Rimsha Azhar and Azfar Basunia and Carmen Rodriguez Cabrera and Tiffany Chan and Philip Chapman and Sean Davis and David Gomez-Cabrero and Aedin C. Culhane and Benjamin Haibe-Kains and Kasper Hansen and Hanish Kodali and Marie Stephie Louis and Arvind Singh Mer and Markus Reister and Martin Morgan and Vincent Carey and Levi Waldron},
#>     journal = {Cancer Research},
#>     year = {2017},
#>     volume = {77(21); e39-42},
#>   }

12 Session Info

sessionInfo()
#> R Under development (unstable) (2018-11-02 r75536)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 18.04.1 LTS
#> 
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/libf77blas.so.3.10.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas/liblapack.so.3
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> attached base packages:
#> [1] stats4    parallel  stats     graphics  grDevices utils     datasets 
#> [8] methods   base     
#> 
#> other attached packages:
#>  [1] kableExtra_1.0.1            dplyr_0.7.8                
#>  [3] readxl_1.2.0                org.Hs.eg.db_3.7.0         
#>  [5] EnsDb.Hsapiens.v86_2.99.0   ensembldb_2.7.8            
#>  [7] AnnotationFilter_1.7.0      GenomicFeatures_1.35.6     
#>  [9] mirbase.db_1.2.0            AnnotationDbi_1.45.0       
#> [11] UpSetR_1.3.3                TCGAutils_1.3.15           
#> [13] GenomicDataCommons_1.7.3    magrittr_1.5               
#> [15] RaggedExperiment_1.7.4      curatedTCGAData_1.5.7      
#> [17] MultiAssayExperiment_1.9.10 SummarizedExperiment_1.13.0
#> [19] DelayedArray_0.9.8          BiocParallel_1.17.9        
#> [21] matrixStats_0.54.0          Biobase_2.43.1             
#> [23] GenomicRanges_1.35.1        GenomeInfoDb_1.19.1        
#> [25] IRanges_2.17.4              S4Vectors_0.21.10          
#> [27] BiocGenerics_0.29.1         BiocStyle_2.11.0           
#> 
#> loaded via a namespace (and not attached):
#>  [1] ProtGenerics_1.15.0                    
#>  [2] bitops_1.0-6                           
#>  [3] bit64_0.9-7                            
#>  [4] webshot_0.5.1                          
#>  [5] progress_1.2.0                         
#>  [6] httr_1.4.0                             
#>  [7] tools_3.6.0                            
#>  [8] R6_2.3.0                               
#>  [9] DBI_1.0.0                              
#> [10] lazyeval_0.2.1                         
#> [11] colorspace_1.4-0                       
#> [12] tidyselect_0.2.5                       
#> [13] gridExtra_2.3                          
#> [14] prettyunits_1.0.2                      
#> [15] curl_3.3                               
#> [16] bit_1.1-14                             
#> [17] compiler_3.6.0                         
#> [18] rvest_0.3.2                            
#> [19] xml2_1.2.0                             
#> [20] rtracklayer_1.43.1                     
#> [21] bookdown_0.9                           
#> [22] scales_1.0.0                           
#> [23] readr_1.3.1                            
#> [24] rappdirs_0.3.1                         
#> [25] stringr_1.3.1                          
#> [26] digest_0.6.18                          
#> [27] Rsamtools_1.35.2                       
#> [28] rmarkdown_1.11                         
#> [29] XVector_0.23.0                         
#> [30] pkgconfig_2.0.2                        
#> [31] htmltools_0.3.6                        
#> [32] highr_0.7                              
#> [33] rlang_0.3.1                            
#> [34] rstudioapi_0.9.0                       
#> [35] RSQLite_2.1.1                          
#> [36] shiny_1.2.0                            
#> [37] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2
#> [38] bindr_0.1.1                            
#> [39] jsonlite_1.6                           
#> [40] RCurl_1.95-4.11                        
#> [41] GenomeInfoDbData_1.2.0                 
#> [42] Matrix_1.2-15                          
#> [43] Rcpp_1.0.0                             
#> [44] munsell_0.5.0                          
#> [45] stringi_1.2.4                          
#> [46] yaml_2.2.0                             
#> [47] zlibbioc_1.29.0                        
#> [48] plyr_1.8.4                             
#> [49] AnnotationHub_2.15.5                   
#> [50] grid_3.6.0                             
#> [51] blob_1.1.1                             
#> [52] promises_1.0.1                         
#> [53] ExperimentHub_1.9.1                    
#> [54] crayon_1.3.4                           
#> [55] lattice_0.20-38                        
#> [56] Biostrings_2.51.2                      
#> [57] hms_0.4.2                              
#> [58] knitr_1.21                             
#> [59] pillar_1.3.1                           
#> [60] codetools_0.2-16                       
#> [61] biomaRt_2.39.2                         
#> [62] XML_3.98-1.16                          
#> [63] glue_1.3.0                             
#> [64] evaluate_0.12                          
#> [65] BiocManager_1.30.4                     
#> [66] httpuv_1.4.5.1                         
#> [67] cellranger_1.1.0                       
#> [68] gtable_0.2.0                           
#> [69] purrr_0.3.0                            
#> [70] assertthat_0.2.0                       
#> [71] ggplot2_3.1.0                          
#> [72] xfun_0.4                               
#> [73] mime_0.6                               
#> [74] xtable_1.8-3                           
#> [75] later_0.7.5                            
#> [76] viridisLite_0.3.0                      
#> [77] tibble_2.0.1                           
#> [78] GenomicAlignments_1.19.1               
#> [79] memoise_1.1.0                          
#> [80] bindrcpp_0.2.2                         
#> [81] interactiveDisplayBase_1.21.0

curatedTCGAData and TCGAutils: integration of TCGA in Bioconductor

February 07, 2019
