We are now ready to demonstrate some of the powerful tools and resources available through the Bioconductor project. The data analysis behind the science being conducted with high-throughput technologies is complex. A successful analysis involves developing appropriate experimental designs, quality assessments, dealing with noisy measurements, conducting statistical analysis (with inference, discovery or machine learning approaches) and finally connecting the results with meaningful biological information. In this lecture we provide an example demonstrating the complexity of the endeavour. Don’t worry if you do not understand all of the biological and computational concepts being presented here. The purpose of this video is to motivate the learning of Bioconductor. Once you complete this course, you will be ready to attack scientific questions that involve analyses that are complex in multiple facets.
First, install the package Homo.sapiens using biocLite().
The package Homo.sapiens contains an organism database object, which has the same name as the package Homo.sapiens.
library(Homo.sapiens)
## Loading required package: AnnotationDbi
## Loading required package: stats4
## Loading required package: BiocGenerics
## Loading required package: parallel
##
## Attaching package: 'BiocGenerics'
##
## The following objects are masked from 'package:parallel':
##
## clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
## clusterExport, clusterMap, parApply, parCapply, parLapply,
## parLapplyLB, parRapply, parSapply, parSapplyLB
##
## The following object is masked from 'package:stats':
##
## xtabs
##
## The following objects are masked from 'package:base':
##
## anyDuplicated, append, as.data.frame, as.vector, cbind,
## colnames, do.call, duplicated, eval, evalq, Filter, Find, get,
## intersect, is.unsorted, lapply, Map, mapply, match, mget,
## order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
## rbind, Reduce, rep.int, rownames, sapply, setdiff, sort,
## table, tapply, union, unique, unlist, unsplit
##
## Loading required package: Biobase
## Welcome to Bioconductor
##
## Vignettes contain introductory material; view with
## 'browseVignettes()'. To cite Bioconductor, see
## 'citation("Biobase")', and for packages 'citation("pkgname")'.
##
## Loading required package: GenomeInfoDb
## Loading required package: S4Vectors
## Creating a generic function for 'nchar' from package 'base' in package 'S4Vectors'
## Loading required package: IRanges
## Loading required package: OrganismDbi
## Loading required package: GenomicFeatures
## Loading required package: GenomicRanges
## Loading required package: GO.db
## Loading required package: DBI
##
## Loading required package: org.Hs.eg.db
##
## Loading required package: TxDb.Hsapiens.UCSC.hg19.knownGene
class(Homo.sapiens)
## [1] "OrganismDb"
## attr(,"package")
## [1] "OrganismDbi"
The OrganismDb object Homo.sapiens contains annotation information about human genes, in particular a series of relations between identifiers about human genes. Some of the identifiers can be provided as “keys”, that is, as values to look up in the database, in order to find pieces of information which match that key. We can list all the possible types of keys:
keytypes(Homo.sapiens)
## [1] "GOID" "TERM" "ONTOLOGY" "DEFINITION"
## [5] "ENTREZID" "PFAM" "IPI" "PROSITE"
## [9] "ACCNUM" "ALIAS" "ENZYME" "MAP"
## [13] "PATH" "PMID" "REFSEQ" "SYMBOL"
## [17] "UNIGENE" "ENSEMBL" "ENSEMBLPROT" "ENSEMBLTRANS"
## [21] "GENENAME" "UNIPROT" "GO" "EVIDENCE"
## [25] "GOALL" "EVIDENCEALL" "ONTOLOGYALL" "OMIM"
## [29] "UCSCKG" "GENEID" "TXID" "TXNAME"
## [33] "EXONID" "EXONNAME" "CDSID" "CDSNAME"
There are also columns in the database, not all of which are keys. To list all the columns:
columns(Homo.sapiens)
## [1] "GOID" "TERM" "ONTOLOGY" "DEFINITION"
## [5] "ENTREZID" "PFAM" "IPI" "PROSITE"
## [9] "ACCNUM" "ALIAS" "CHR" "CHRLOC"
## [13] "CHRLOCEND" "ENZYME" "MAP" "PATH"
## [17] "PMID" "REFSEQ" "SYMBOL" "UNIGENE"
## [21] "ENSEMBL" "ENSEMBLPROT" "ENSEMBLTRANS" "GENENAME"
## [25] "UNIPROT" "GO" "EVIDENCE" "GOALL"
## [29] "EVIDENCEALL" "ONTOLOGYALL" "OMIM" "UCSCKG"
## [33] "CDSID" "CDSNAME" "CDSCHROM" "CDSSTRAND"
## [37] "CDSSTART" "CDSEND" "EXONID" "EXONNAME"
## [41] "EXONCHROM" "EXONSTRAND" "EXONSTART" "EXONEND"
## [45] "GENEID" "TXID" "EXONRANK" "TXNAME"
## [49] "TXTYPE" "TXCHROM" "TXSTRAND" "TXSTART"
## [53] "TXEND"
Two widely used catalogs of genes are NCBI Entrez Gene and EBI ENSEMBL.
We can use Homo.sapiens to count the number of unique Entrez identifiers. To do so, we use the keys() function, which takes two arguments: the database object, and the name of the type of key we are interested in. We use head() to limit the output:
head(keys(Homo.sapiens, keytype = "ENTREZID"))
## [1] "1" "2" "3" "9" "10" "11"
How many unique Entrez identifiers are there? Hint: remember the unique() function
length(unique(keys(Homo.sapiens,keytype = "ENTREZID")))
## [1] 56340
Substitute keytype “ENSEMBL” in the keys command of the previous problem. How many unique Ensembl identifiers are found?
length(unique(keys(Homo.sapiens,keytype = "ENSEMBL")))
## [1] 28423
We can use the select() function to look up specific values in the database. The following line of code looks up the gene symbol, the Ensembl ID, the Entrez ID, and the chromosome number of the gene with the given Entrez ID, “123”.
select(Homo.sapiens,keys = "123", keytype = "ENTREZID", columns = c("SYMBOL","ENSEMBL","ENTREZID","CHR"))
## ENTREZID CHR SYMBOL ENSEMBL
## 1 123 9 PLIN2 ENSG00000147872
Note that we ask for four columns: c("SYMBOL", "ENSEMBL", "ENTREZID", "CHR"). We know to use this by exploring the possibilities by typing columns(Homo.sapiens). You can get more information on what these column names are by typing, for example, ?SYMBOL.
What is the Ensembl ID of the gene with the Entrez ID “9575”? Type in the full text name including ENSG… (As a side note: you might notice this gene has an interesting symbol name which gives a clue to its function in the cell.)
select(Homo.sapiens,keys = "9575", keytype = "ENTREZID", columns = c("SYMBOL","ENSEMBL","ENTREZID","CHR"))
## ENTREZID CHR SYMBOL ENSEMBL
## 1 9575 4 CLOCK ENSG00000134852
We can list genes associated with certain biological processes or molecular functions if we know the right vocabulary. One such vocabulary is the “Gene Ontology”, often shortened “GO”. One term of interest in GO is “circadian rhythm”. We use the following command to enumerate genes by Entrez ID annotated to this process:
tab = select(Homo.sapiens, key="circadian rhythm", keytype="TERM", columns=c("ENTREZID"))
How many unique Entrez IDs are associated with circadian rhythm according to this GO term?
length(unique(tab$ENTREZID))
## [1] 91
We will now load the sample ExpressionSet object from the module and give it a short variable name.
library(Biobase)
data(sample.ExpressionSet)
sample.ExpressionSet
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 500 features, 26 samples
## element names: exprs, se.exprs
## protocolData: none
## phenoData
## sampleNames: A B ... Z (26 total)
## varLabels: sex type score
## varMetadata: labelDescription
## featureData: none
## experimentData: use 'experimentData(object)'
## Annotation: hgu95av2
samp = sample.ExpressionSet
We can now perform convenient operations on the object, where any associated information about the columns or rows remains linked to the experiment information.
Note that you can access the information about the samples with:
pData(samp)
## sex type score
## A Female Control 0.75
## B Male Case 0.40
## C Male Control 0.73
## D Male Case 0.42
## E Female Case 0.93
## F Male Control 0.22
## G Male Case 0.96
## H Male Case 0.79
## I Female Case 0.37
## J Male Control 0.63
## K Male Case 0.26
## L Female Control 0.36
## M Male Case 0.41
## N Male Case 0.80
## O Female Case 0.10
## P Female Control 0.41
## Q Female Case 0.16
## R Male Control 0.72
## S Male Case 0.17
## T Female Case 0.74
## U Male Control 0.35
## V Female Control 0.77
## W Male Control 0.27
## X Male Control 0.98
## Y Female Case 0.94
## Z Female Case 0.32
For example, to access information about the sex of the samples:
pData(samp)$sex
## [1] Female Male Male Male Female Male Male Male Female Male
## [11] Male Female Male Male Female Female Female Male Male Female
## [21] Male Female Male Male Female Female
## Levels: Female Male
you could also use shorthand
samp$sex
## [1] Female Male Male Male Female Male Male Male Female Male
## [11] Male Female Male Male Female Female Female Male Male Female
## [21] Male Female Male Male Female Female
## Levels: Female Male
The experimental data matrix is stored in exprs(samp). We will refer to this as the experiment data in the following problems.
Make a new ExpressionSet which contains the subset of the samples which are female. Remember you can subset the ExpressionSet object directly with [ and ]. What is the sum of the first row of experiment data, accessed by exprs(), for only the female samples?
samp.f = samp[,samp$sex=="Female"]
sum(exprs(samp.f[1,]))
## [1] 1231.46
Look up the experiment data of the sample ExpressionSet with experimentData(). Suppose we have a question about the experiment, who can we email for more information? Type in the email address.
experimentData(samp)
## Experiment data
## Experimenter name: Pierre Fermat
## Laboratory: Francis Galton Lab
## Contact information: pfermat@lab.not.exist
## Title: Smoking-Cancer Experiment
## URL: www.lab.not.exist
## PMIDs:
##
## Abstract: A 8 word abstract is available. Use 'abstract' method.
## notes:
## notes:
## An example object of expression set (exprSet) class
Use the annotation() function to look up the annotation information about the experiment data. What is the annotation string associated with ‘samp’ (do not include the quote marks)? This is often important if we need more detailed information about the features.
annotation(samp)
## [1] "hgu95av2"
Note that because this annotation is the same for all experiments using this microarray product, we don’t need to carry around the information in each object. Instead we make it available as an annotation project. So, for example, to get the gene symbols we can do the following:
#library(BiocInstaller)
#biocLite("hgu95av2.db")
library(hgu95av2.db)
##
fns = featureNames(samp)
annot = select (hgu95av2.db, keys=fns, keytype="PROBEID", columns="SYMBOL")
## this map is not one to one, so pick one:
geneSymbols = annot [ match(fns, annot$PROBEID), "SYMBOL"]
What is the correlation - the Pearson correlation given by cor() - of the ‘score’ information for the samples and the experiment data for feature ‘31489_at’? Compute the correlation on the full ‘samp’ dataset, not the subsetted dataset of females. Note that, regardless of re-ordering of columns or rows, by using ExpressionSet objects, we can obtain the same answer with a single line of code.
cor(exprs(samp)["31489_at",], pData(samp)$score, method = "pearson")
## [1] 0.1376892
# or simply
cor(samp$score, exprs(samp)["31489_at",])
## [1] 0.1376892