ExpressionSet and SummarizedExperiments).
This document can be reproduced by using the R code and the data (for illustrating purposes and for the exercises) that are available here: https://github.com/isglobal-brge/short_course_omic_R
It is assumed that people who read this material are familiar with omic association analyses (GWAS, gene expression, methylation, RNA-seq, …). For this short course, we have created different vignettes to:
This short course can be seen as an brief introduction of what we can do with R/Bioconductor packages in the particular case of analyzing transcriptomic and/or epigenomic data. Other association analyses such as genomic (e.g. GWAS) or multi-omic data integration can be seen in our book: https://isglobal-brge.github.io/book_omic_association/
Omic association analyses
This lecture offers a summary of two data structures (ExpressionSet and SummarizedExperiments) that are implemented in Bioconductor for dealing with transcriptomic or epigenomic data. Omic data are typically composed of three datasets: one containing the actual high-dimensional data of omic variables per individuals, annotation data that specifies the characteristics of the variables and phenotypic information that encodes the subject’s traits of interest, covariates and sampling characteristics. For instance, transcriptomic data can be stored in a ExpressionSet object, which is a data structure that contains the transcription values of individuals at each transcription probe, the genomic information for the transcription probes and the phenotypes of the individuals. Specific data is accessed, processed and analyzed with specific functions from diverse packages, conceived as methods acting on the ExpressionSet objects.
Data in Bioconductor
Bioconductor’s goal: Analysis and comprehension of high-throughput genomic data
Statistical analysis: large data, technological artifacts, designed experiments; rigorous
Comprehension: biological context, visualization, reproducibility
High-throughput
TxDb.Hsapiens.UCSC.hg19.knownGeneairwaybiocViewsinstall.packages("BiocManager")
library(BiocManager)
install(c("DESeq2", "org.Hs.eg.db"))
# or
BiocManager::install("DESeq2")
Remember that Github packages can be install by
install.packages("devtools")
devtools::install_github("isglobal-brge/SNPassoc")
Once installed, the package can be loaded into an R session
library(GenomicRanges)
and the help system queried interactively, as outlined above:
help(package="GenomicRanges")
vignette(package="GenomicRanges")
vignette(package="GenomicRanges",
"GenomicRangesHOWTOs")
?GRanges
ExpressionSet objectsExpressionSet was one of the first implementations of Bioconductor to manage experiments.
It is discouraged in Bioconductor’s guidelines for the development of current and future packages
However, most publicly available data is available in this structure while future packages are still required to be able to upload and operate with it.
The rows of data are features and columns are subjects.
Information is coordinated across the object’s slots. For instance, subsetting samples in the assay matrix automatically subsets them in the phenotype metadata.
GEO repository (https://www.ncbi.nlm.nih.gov/geo/) contains thousands of transcriptomic experiments that are available in ExpressionSet format. Data can be loaded into R by:
library(GEOquery)
gse69683 <- getGEO("GSE69683", destdir = ".")
gse69683.expr <- gse69683[[1]]
You can directly loaded it into R with:
load("data/GSE69683.Rdata")
This is how an ExpressionSet looks like:
gse69683.expr
ExpressionSet (storageMode: lockedEnvironment)
assayData: 54715 features, 498 samples
element names: exprs
protocolData: none
phenoData
sampleNames: GSM1706836 GSM1706837 ... GSM1707333 (498 total)
varLabels: title geo_accession ... tissue:ch1 (37 total)
varMetadata: labelDescription
featureData
featureNames: 1007_PM_s_at 1053_PM_at ... AFFX-TrpnX-M_at (54715 total)
fvarLabels: ID GB_ACC ... Gene Ontology Molecular Function (16 total)
fvarMetadata: Column Description labelDescription
experimentData: use 'experimentData(object)'
pubMedIds: 27925796
Annotation: GPL13158
gse69683.expr is an object of class ExpressionSet that has three main slots:
Transcriptomic data is stored in the assayData
Phenotypes (i.e. covariates) are in phenoData
Probe annotation in featuredData.
There are three other slots protocolData, experimentData and annotation with other information
Gene expression data can be retrieved by
expr <- exprs(gse69683.expr)
dim(expr)
[1] 54715 498
expr[1:5,1:5]
GSM1706836 GSM1706837 GSM1706838 GSM1706839 GSM1706840
1007_PM_s_at 4.58 5.62 4.90 4.33 5.30
1053_PM_at 6.72 6.44 6.75 6.57 6.64
117_PM_at 8.75 8.44 8.26 9.16 9.29
121_PM_at 5.43 4.53 5.04 5.04 4.85
1255_PM_g_at 2.63 2.48 2.54 2.71 2.59
Phenotypic data (i.e. covariates) are accesed by
pheno <- phenoData(gse69683.expr)
pheno
An object of class 'AnnotatedDataFrame'
sampleNames: GSM1706836 GSM1706837 ... GSM1707333 (498 total)
varLabels: title geo_accession ... tissue:ch1 (37 total)
varMetadata: labelDescription
colnames(pheno)[1:10]
[1] "title" "geo_accession" "status" "submission_date" "last_update_date"
[6] "type" "channel_count" "source_name_ch1" "organism_ch1" "characteristics_ch1"
Data are properly organized. So that, we can run any statistical model or method you want
group <- pheno$characteristics_ch1
table(group)
group
cohort: Healthy, non-smoking cohort: Moderate asthma, non-smoking cohort: Severe asthma, non-smoking
87 77 246
cohort: Severe asthma, smoking
88
boxplot(expr["1007_PM_s_at",] ~ group)
The fData function gets the probes’ annotation that will be required to genome data visualization and post-data analysis
probes <- fData(gse69683.expr)
probes[1:5, 1:5]
ID GB_ACC SPOT_ID Species Scientific Name Annotation Date
1007_PM_s_at 1007_PM_s_at U48705 NA Homo sapiens Aug 20, 2010
1053_PM_at 1053_PM_at M87338 NA Homo sapiens Aug 20, 2010
117_PM_at 117_PM_at X51757 NA Homo sapiens Aug 20, 2010
121_PM_at 121_PM_at X69699 NA Homo sapiens Aug 20, 2010
1255_PM_g_at 1255_PM_g_at L36861 NA Homo sapiens Aug 20, 2010
Subsetting acts as in any other R object. Let us assume we want to select only healthy individuals
sel <- "cohort: Healthy, non-smoking"
mask <- gse69683.expr$characteristics_ch1%in%sel
gse <- gse69683.expr[ , mask]
gse
ExpressionSet (storageMode: lockedEnvironment)
assayData: 54715 features, 87 samples
element names: exprs
protocolData: none
phenoData
sampleNames: GSM1706838 GSM1706840 ... GSM1707332 (87 total)
varLabels: title geo_accession ... tissue:ch1 (37 total)
varMetadata: labelDescription
featureData
featureNames: 1007_PM_s_at 1053_PM_at ... AFFX-TrpnX-M_at (54715 total)
fvarLabels: ID GB_ACC ... Gene Ontology Molecular Function (16 total)
fvarMetadata: Column Description labelDescription
experimentData: use 'experimentData(object)'
pubMedIds: 27925796
Annotation: GPL13158
SummarizedExperiment and RangedSummarizedExperimentIt is an extension of ExpressionSet objects.
The SummarizedExperiment package contains two classes: SummarizedExperiment and RangedSummarizedExperiment.
The fundamental difference between the two classes is that the rows of a RangedSummarizedExperiment object represent genomic ranges of interest instead of a DataFrame of feature. The ranges are accesses with rowRanges()
NOTE: Altough it is not necessary for this course, we recommend to have a quick look at this introduction on GenomicRanges.
SummarizedExperiment
airway package contains an example dataset from an RNA-Seq experiment of read counts per gene for airway smooth muscles. These data are stored in a RangedSummarizedExperiment object which contains 8 different experimental and 64,102 gene transcripts.
library(SummarizedExperiment)
data(airway, package="airway")
se <- airway
se
class: RangedSummarizedExperiment
dim: 64102 8
metadata(1): ''
assays(1): counts
rownames(64102): ENSG00000000003 ENSG00000000005 ... LRG_98 LRG_99
rowData names(0):
colnames(8): SRR1039508 SRR1039509 ... SRR1039520 SRR1039521
colData names(9): SampleName cell ... Sample BioSample
Experiment data is accessed with assay():
names(assays(se))
[1] "counts"
gene.dat <- assays(se)$counts
gene.dat[1:5, 1:5]
SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516
ENSG00000000003 679 448 873 408 1138
ENSG00000000005 0 0 0 0 0
ENSG00000000419 467 515 621 365 587
ENSG00000000457 260 211 263 164 245
ENSG00000000460 60 55 40 35 78
Phenotypic data is accessed with colData():
colData(se)
DataFrame with 8 rows and 9 columns
SampleName cell dex albut Run avgLength Experiment Sample BioSample
<factor> <factor> <factor> <factor> <factor> <integer> <factor> <factor> <factor>
SRR1039508 GSM1275862 N61311 untrt untrt SRR1039508 126 SRX384345 SRS508568 SAMN02422669
SRR1039509 GSM1275863 N61311 trt untrt SRR1039509 126 SRX384346 SRS508567 SAMN02422675
SRR1039512 GSM1275866 N052611 untrt untrt SRR1039512 126 SRX384349 SRS508571 SAMN02422678
SRR1039513 GSM1275867 N052611 trt untrt SRR1039513 87 SRX384350 SRS508572 SAMN02422670
SRR1039516 GSM1275870 N080611 untrt untrt SRR1039516 120 SRX384353 SRS508575 SAMN02422682
SRR1039517 GSM1275871 N080611 trt untrt SRR1039517 126 SRX384354 SRS508576 SAMN02422673
SRR1039520 GSM1275874 N061011 untrt untrt SRR1039520 101 SRX384357 SRS508579 SAMN02422683
SRR1039521 GSM1275875 N061011 trt untrt SRR1039521 98 SRX384358 SRS508580 SAMN02422677
Subset for only those samples treated with dexamethasone
se[, se$dex == "trt"]
class: RangedSummarizedExperiment
dim: 64102 4
metadata(1): ''
assays(1): counts
rownames(64102): ENSG00000000003 ENSG00000000005 ... LRG_98 LRG_99
rowData names(0):
colnames(4): SRR1039509 SRR1039513 SRR1039517 SRR1039521
colData names(9): SampleName cell ... Sample BioSample
We can also subset a SummarizedExperiment by a given interval or genomic region
roi <- GRanges(seqnames="chr1", ranges=100000:1100000)
# or
roi <- GRanges(seqnames="1", IRanges(start=100000,
end=1100000))
se.roi <- subsetByOverlaps(se, roi)
se.roi
class: RangedSummarizedExperiment
dim: 74 8
metadata(1): ''
assays(1): counts
rownames(74): ENSG00000131591 ENSG00000177757 ... ENSG00000272512 ENSG00000273443
rowData names(0):
colnames(8): SRR1039508 SRR1039509 ... SRR1039520 SRR1039521
colData names(9): SampleName cell ... Sample BioSample
It is worth to notice that the chromosome is accessed by using seqnames="1" instead of seqnames="chr1" since they are annotated in the NCBI or Ensembl style:
seqlevelsStyle(se.roi)
[1] "NCBI" "Ensembl"
NOTE: seqnames="chr1" corresponds to UCSC style
EXERCISE: Recount2 provides data for different RNA-seq experiments. These includes data from GTEx or TCGA projects. We have donwloaded a subset of data corresponding to breast cancer and created a variable called er which encodes the estrogen receptor status (Negative and Positive). The SummarizedExperiment object is called breast and is available in the file data_exercises/breast_tcga.Rdata. Load the data into R and answer the next questions
How many samples are in the SummarizedExperiment object?
And how many genes?
Which is the number of samples with positive estrogen receptor status (variable er)?
Subset the individuals having Negative strogen receptor status and draw a boxplot of the first gene.
Create a SummarizedExperiment object of the genomic region chr6:151.2-151.8Mb. How many genes are in that region? How many of them are annotated? That is, how many of them have a gene symbol name (HINT: use rowRanges() function and remember that mcols() function is used to get acccess to columns in a GRanges object) ?
sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19042)
Matrix products: default
locale:
[1] LC_COLLATE=Spanish_Spain.1252 LC_CTYPE=Spanish_Spain.1252 LC_MONETARY=Spanish_Spain.1252
[4] LC_NUMERIC=C LC_TIME=Spanish_Spain.1252
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets methods base
other attached packages:
[1] SummarizedExperiment_1.18.2 DelayedArray_0.14.1 matrixStats_0.57.0 Biobase_2.48.0
[5] GenomicRanges_1.40.0 GenomeInfoDb_1.24.2 IRanges_2.22.2 S4Vectors_0.26.1
[9] BiocGenerics_0.34.0 BiocStyle_2.16.0
loaded via a namespace (and not attached):
[1] Rcpp_1.0.6 knitr_1.29 XVector_0.28.0 magrittr_2.0.1 zlibbioc_1.34.0
[6] lattice_0.20-41 rlang_0.4.10 stringr_1.4.0 tools_4.0.2 grid_4.0.2
[11] xfun_0.16 htmltools_0.5.1.1 yaml_2.2.1 digest_0.6.27 bookdown_0.20
[16] Matrix_1.2-18 GenomeInfoDbData_1.2.3 BiocManager_1.30.10 bitops_1.0-6 codetools_0.2-16
[21] RCurl_1.98-1.2 evaluate_0.14 rmarkdown_2.7 stringi_1.4.6 compiler_4.0.2
[26] magick_2.4.0