Bioconductor infrastructures

Juan R Gonzalez1*

1Bioinformatics Research Group in Epidemiology, Barcelona Institute for Global Health, Spain

May, 2021

Abstract

Brief introduction to some of the main Bioconductor Bioconductor data structures (ExpressionSet and SummarizedExperiments).

1 Gettig started
2 Global description
3 Introduction
4 Bioconductor
5 ExpressionSet objects
6 SummarizedExperiment and RangedSummarizedExperiment
7 Session info

1 Gettig started

This document can be reproduced by using the R code and the data (for illustrating purposes and for the exercises) that are available here: https://github.com/isglobal-brge/short_course_omic_R

2 Global description

It is assumed that people who read this material are familiar with omic association analyses (GWAS, gene expression, methylation, RNA-seq, …). For this short course, we have created different vignettes to:

Provide a global overview about Bioconductor and some of the infrastructures used to encapsulate omic data;
Perform differential analyses that can be used for both transcriptomic and epigenomic data;
Visualize significant results using circos plots; and
Perform enrichment analysis

This short course can be seen as an brief introduction of what we can do with R/Bioconductor packages in the particular case of analyzing transcriptomic and/or epigenomic data. Other association analyses such as genomic (e.g. GWAS) or multi-omic data integration can be seen in our book: https://isglobal-brge.github.io/book_omic_association/

Omic association analyses

3 Introduction

This lecture offers a summary of two data structures (ExpressionSet and SummarizedExperiments) that are implemented in Bioconductor for dealing with transcriptomic or epigenomic data. Omic data are typically composed of three datasets: one containing the actual high-dimensional data of omic variables per individuals, annotation data that specifies the characteristics of the variables and phenotypic information that encodes the subject’s traits of interest, covariates and sampling characteristics. For instance, transcriptomic data can be stored in a ExpressionSet object, which is a data structure that contains the transcription values of individuals at each transcription probe, the genomic information for the transcription probes and the phenotypes of the individuals. Specific data is accessed, processed and analyzed with specific functions from diverse packages, conceived as methods acting on the ExpressionSet objects.

Data in Bioconductor

4 Bioconductor

Bioconductor’s goal: Analysis and comprehension of high-throughput genomic data
Statistical analysis: large data, technological artifacts, designed experiments; rigorous
Comprehension: biological context, visualization, reproducibility
High-throughput
- Sequencing: RNASeq, ChIPSeq, variants, copy number, …
- Microarrays: expression, SNP, …
- Flow cytometry, proteomics, images, …

4.1 Packages, vignettes, work flows, …

1649 software packages (Jan’19); also…
- ‘Annotation’ packages – static data bases of identifier maps,gene models, pathways, etc; e.g., TxDb.Hsapiens.UCSC.hg19.knownGene
- ’Experiment packages – data sets used to illustrate software functionality, e.g., airway
Discover and navigate via biocViews
Package ‘landing page’
- Title, author / maintainer, short description, citation, installation instructions, …, download statistics
All user-visible functions have help pages, most with runnable examples

4.2 Vignettes

‘Vignettes’ an important feature in Bioconductor – narrative documents illustrating how to use the package, with integrated code
‘Release’ (every six months) and ‘devel’ branches
Support site; videos, recent courses

4.3 Bioconductor packages installation

install.packages("BiocManager")
library(BiocManager)
install(c("DESeq2", "org.Hs.eg.db"))

# or

BiocManager::install("DESeq2")

Remember that Github packages can be install by

install.packages("devtools")
devtools::install_github("isglobal-brge/SNPassoc")

Once installed, the package can be loaded into an R session

library(GenomicRanges)

and the help system queried interactively, as outlined above:

help(package="GenomicRanges")
vignette(package="GenomicRanges")
vignette(package="GenomicRanges",
         "GenomicRangesHOWTOs")
?GRanges

5 `ExpressionSet` objects

ExpressionSet was one of the first implementations of Bioconductor to manage experiments.
It is discouraged in Bioconductor’s guidelines for the development of current and future packages
However, most publicly available data is available in this structure while future packages are still required to be able to upload and operate with it.
The rows of data are features and columns are subjects.
Information is coordinated across the object’s slots. For instance, subsetting samples in the assay matrix automatically subsets them in the phenotype metadata.

GEO repository (https://www.ncbi.nlm.nih.gov/geo/) contains thousands of transcriptomic experiments that are available in ExpressionSet format. Data can be loaded into R by:

library(GEOquery)
gse69683 <- getGEO("GSE69683", destdir = ".")
gse69683.expr <- gse69683[[1]]

You can directly loaded it into R with:

load("data/GSE69683.Rdata")

This is how an ExpressionSet looks like:

gse69683.expr

ExpressionSet (storageMode: lockedEnvironment)
assayData: 54715 features, 498 samples 
  element names: exprs 
protocolData: none
phenoData
  sampleNames: GSM1706836 GSM1706837 ... GSM1707333 (498 total)
  varLabels: title geo_accession ... tissue:ch1 (37 total)
  varMetadata: labelDescription
featureData
  featureNames: 1007_PM_s_at 1053_PM_at ... AFFX-TrpnX-M_at (54715 total)
  fvarLabels: ID GB_ACC ... Gene Ontology Molecular Function (16 total)
  fvarMetadata: Column Description labelDescription
experimentData: use 'experimentData(object)'
  pubMedIds: 27925796 
Annotation: GPL13158

gse69683.expr is an object of class ExpressionSet that has three main slots:
- Transcriptomic data is stored in the assayData
- Phenotypes (i.e. covariates) are in phenoData
- Probe annotation in featuredData.
- There are three other slots protocolData, experimentData and annotation with other information

Gene expression data can be retrieved by

expr <- exprs(gse69683.expr)
dim(expr)

[1] 54715   498

expr[1:5,1:5]

             GSM1706836 GSM1706837 GSM1706838 GSM1706839 GSM1706840
1007_PM_s_at       4.58       5.62       4.90       4.33       5.30
1053_PM_at         6.72       6.44       6.75       6.57       6.64
117_PM_at          8.75       8.44       8.26       9.16       9.29
121_PM_at          5.43       4.53       5.04       5.04       4.85
1255_PM_g_at       2.63       2.48       2.54       2.71       2.59

Phenotypic data (i.e. covariates) are accesed by

pheno <- phenoData(gse69683.expr)
pheno

An object of class 'AnnotatedDataFrame'
  sampleNames: GSM1706836 GSM1706837 ... GSM1707333 (498 total)
  varLabels: title geo_accession ... tissue:ch1 (37 total)
  varMetadata: labelDescription

colnames(pheno)[1:10]

 [1] "title"               "geo_accession"       "status"              "submission_date"     "last_update_date"   
 [6] "type"                "channel_count"       "source_name_ch1"     "organism_ch1"        "characteristics_ch1"

Data are properly organized. So that, we can run any statistical model or method you want

group <- pheno$characteristics_ch1
table(group)

group
        cohort: Healthy, non-smoking cohort: Moderate asthma, non-smoking   cohort: Severe asthma, non-smoking 
                                  87                                   77                                  246 
      cohort: Severe asthma, smoking 
                                  88

boxplot(expr["1007_PM_s_at",] ~ group)

The fData function gets the probes’ annotation that will be required to genome data visualization and post-data analysis

probes <- fData(gse69683.expr)
probes[1:5, 1:5]

                       ID GB_ACC SPOT_ID Species Scientific Name Annotation Date
1007_PM_s_at 1007_PM_s_at U48705      NA            Homo sapiens    Aug 20, 2010
1053_PM_at     1053_PM_at M87338      NA            Homo sapiens    Aug 20, 2010
117_PM_at       117_PM_at X51757      NA            Homo sapiens    Aug 20, 2010
121_PM_at       121_PM_at X69699      NA            Homo sapiens    Aug 20, 2010
1255_PM_g_at 1255_PM_g_at L36861      NA            Homo sapiens    Aug 20, 2010

Subsetting acts as in any other R object. Let us assume we want to select only healthy individuals

sel <- "cohort: Healthy, non-smoking"
mask <- gse69683.expr$characteristics_ch1%in%sel
gse <- gse69683.expr[ , mask]
gse

ExpressionSet (storageMode: lockedEnvironment)
assayData: 54715 features, 87 samples 
  element names: exprs 
protocolData: none
phenoData
  sampleNames: GSM1706838 GSM1706840 ... GSM1707332 (87 total)
  varLabels: title geo_accession ... tissue:ch1 (37 total)
  varMetadata: labelDescription
featureData
  featureNames: 1007_PM_s_at 1053_PM_at ... AFFX-TrpnX-M_at (54715 total)
  fvarLabels: ID GB_ACC ... Gene Ontology Molecular Function (16 total)
  fvarMetadata: Column Description labelDescription
experimentData: use 'experimentData(object)'
  pubMedIds: 27925796 
Annotation: GPL13158

6 `SummarizedExperiment` and `RangedSummarizedExperiment`

It is an extension of ExpressionSet objects.
The SummarizedExperiment package contains two classes: SummarizedExperiment and RangedSummarizedExperiment.
The fundamental difference between the two classes is that the rows of a RangedSummarizedExperiment object represent genomic ranges of interest instead of a DataFrame of feature. The ranges are accesses with rowRanges()

NOTE: Altough it is not necessary for this course, we recommend to have a quick look at this introduction on GenomicRanges.

SummarizedExperiment

airway package contains an example dataset from an RNA-Seq experiment of read counts per gene for airway smooth muscles. These data are stored in a RangedSummarizedExperiment object which contains 8 different experimental and 64,102 gene transcripts.

library(SummarizedExperiment)
data(airway, package="airway")
se <- airway
se

class: RangedSummarizedExperiment 
dim: 64102 8 
metadata(1): ''
assays(1): counts
rownames(64102): ENSG00000000003 ENSG00000000005 ... LRG_98 LRG_99
rowData names(0):
colnames(8): SRR1039508 SRR1039509 ... SRR1039520 SRR1039521
colData names(9): SampleName cell ... Sample BioSample

Experiment data is accessed with assay():

names(assays(se))

[1] "counts"

gene.dat <- assays(se)$counts
gene.dat[1:5, 1:5]

                SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516
ENSG00000000003        679        448        873        408       1138
ENSG00000000005          0          0          0          0          0
ENSG00000000419        467        515        621        365        587
ENSG00000000457        260        211        263        164        245
ENSG00000000460         60         55         40         35         78

Phenotypic data is accessed with colData():

colData(se)

DataFrame with 8 rows and 9 columns
           SampleName     cell      dex    albut        Run avgLength Experiment    Sample    BioSample
             <factor> <factor> <factor> <factor>   <factor> <integer>   <factor>  <factor>     <factor>
SRR1039508 GSM1275862  N61311     untrt    untrt SRR1039508       126  SRX384345 SRS508568 SAMN02422669
SRR1039509 GSM1275863  N61311     trt      untrt SRR1039509       126  SRX384346 SRS508567 SAMN02422675
SRR1039512 GSM1275866  N052611    untrt    untrt SRR1039512       126  SRX384349 SRS508571 SAMN02422678
SRR1039513 GSM1275867  N052611    trt      untrt SRR1039513        87  SRX384350 SRS508572 SAMN02422670
SRR1039516 GSM1275870  N080611    untrt    untrt SRR1039516       120  SRX384353 SRS508575 SAMN02422682
SRR1039517 GSM1275871  N080611    trt      untrt SRR1039517       126  SRX384354 SRS508576 SAMN02422673
SRR1039520 GSM1275874  N061011    untrt    untrt SRR1039520       101  SRX384357 SRS508579 SAMN02422683
SRR1039521 GSM1275875  N061011    trt      untrt SRR1039521        98  SRX384358 SRS508580 SAMN02422677

Subset for only those samples treated with dexamethasone

se[, se$dex == "trt"]

class: RangedSummarizedExperiment 
dim: 64102 4 
metadata(1): ''
assays(1): counts
rownames(64102): ENSG00000000003 ENSG00000000005 ... LRG_98 LRG_99
rowData names(0):
colnames(4): SRR1039509 SRR1039513 SRR1039517 SRR1039521
colData names(9): SampleName cell ... Sample BioSample

We can also subset a SummarizedExperiment by a given interval or genomic region

roi <- GRanges(seqnames="chr1", ranges=100000:1100000)
# or
roi <- GRanges(seqnames="1", IRanges(start=100000,
                                        end=1100000))
se.roi <- subsetByOverlaps(se, roi)
se.roi

class: RangedSummarizedExperiment 
dim: 74 8 
metadata(1): ''
assays(1): counts
rownames(74): ENSG00000131591 ENSG00000177757 ... ENSG00000272512 ENSG00000273443
rowData names(0):
colnames(8): SRR1039508 SRR1039509 ... SRR1039520 SRR1039521
colData names(9): SampleName cell ... Sample BioSample

It is worth to notice that the chromosome is accessed by using seqnames="1" instead of seqnames="chr1" since they are annotated in the NCBI or Ensembl style:

seqlevelsStyle(se.roi)

[1] "NCBI"    "Ensembl"

NOTE: seqnames="chr1" corresponds to UCSC style

EXERCISE: Recount2 provides data for different RNA-seq experiments. These includes data from GTEx or TCGA projects. We have donwloaded a subset of data corresponding to breast cancer and created a variable called er which encodes the estrogen receptor status (Negative and Positive). The SummarizedExperiment object is called breast and is available in the file data_exercises/breast_tcga.Rdata. Load the data into R and answer the next questions

How many samples are in the SummarizedExperiment object?
And how many genes?
Which is the number of samples with positive estrogen receptor status (variable er)?
Subset the individuals having Negative strogen receptor status and draw a boxplot of the first gene.
Create a SummarizedExperiment object of the genomic region chr6:151.2-151.8Mb. How many genes are in that region? How many of them are annotated? That is, how many of them have a gene symbol name (HINT: use rowRanges() function and remember that mcols() function is used to get acccess to columns in a GRanges object) ?

7 Session info

sessionInfo()

R version 4.0.2 (2020-06-22)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19042)

Matrix products: default

locale:
[1] LC_COLLATE=Spanish_Spain.1252  LC_CTYPE=Spanish_Spain.1252    LC_MONETARY=Spanish_Spain.1252
[4] LC_NUMERIC=C                   LC_TIME=Spanish_Spain.1252    

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] SummarizedExperiment_1.18.2 DelayedArray_0.14.1         matrixStats_0.57.0          Biobase_2.48.0             
 [5] GenomicRanges_1.40.0        GenomeInfoDb_1.24.2         IRanges_2.22.2              S4Vectors_0.26.1           
 [9] BiocGenerics_0.34.0         BiocStyle_2.16.0           

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.6             knitr_1.29             XVector_0.28.0         magrittr_2.0.1         zlibbioc_1.34.0       
 [6] lattice_0.20-41        rlang_0.4.10           stringr_1.4.0          tools_4.0.2            grid_4.0.2            
[11] xfun_0.16              htmltools_0.5.1.1      yaml_2.2.1             digest_0.6.27          bookdown_0.20         
[16] Matrix_1.2-18          GenomeInfoDbData_1.2.3 BiocManager_1.30.10    bitops_1.0-6           codetools_0.2-16      
[21] RCurl_1.98-1.2         evaluate_0.14          rmarkdown_2.7          stringi_1.4.6          compiler_4.0.2        
[26] magick_2.4.0