Bioconductor is based on R. Three key reasons for this are:
In summary, R’s ease-of-use and central role in statistics and “data science” make it a natural choice for a tool-set for use by biologists and statisticians confronting genome-scale experimental data. Since the Bioconductor project’s inception in 2001, it has kept pace with growing volumes and complexity of data emerging in genome-scale biology.
R combines functional and object-oriented programming paradigms.
square = function(x) x^2
is valid R code that defines the symbol square as a
function that computes the second power of its input. The body of the
function is the program code x^2, in which x
is the function’s formal argument (a “free variable” within the
expression x^2 taken on its own). Once square has been defined in this
way, square(3) has value 9. We say the
square function has been evaluated on argument
3. In R, all computations proceed by evaluation of
functions.
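As a small sketch of the functional style described above, the following base R code defines square, evaluates it, and passes it to another function. The variable name squares is introduced here for illustration only.

```r
# Define square as in the text
square <- function(x) x^2

# Evaluate the function on the argument 3
print(square(3))

# Functions are ordinary values, so they can be handed to other functions;
# sapply evaluates square on each element of 1:4
squares <- sapply(1:4, square)
print(squares)

# Operators are functions too: this is the same computation as 1 + 2
print(`+`(1, 2))
```

The last line illustrates the claim that computations in R proceed by function evaluation: the familiar infix form 1 + 2 is shorthand for a call to the function named `+`.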
library(Homo.sapiens)
class(Homo.sapiens)
## [1] "OrganismDb"
## attr(,"package")
## [1] "OrganismDbi"
methods(class=class(Homo.sapiens))
## [1] asBED asGFF cds
## [4] cdsBy coerce<- columns
## [7] dbconn dbfile distance
## [10] exons exonsBy extractUpstreamSeqs
## [13] fiveUTRsByTranscript genes getTxDbIfAvailable
## [16] intronsByTranscript isActiveSeq isActiveSeq<-
## [19] keys keytypes mapIds
## [22] mapToTranscripts metadata microRNAs
## [25] promoters resources select
## [28] selectByRanges selectRangesById seqinfo
## [31] show taxonomyId threeUTRsByTranscript
## [34] transcripts transcriptsBy tRNAs
## [37] TxDb TxDb<-
## see '?methods' for accessing help and source code
We say that Homo.sapiens is an instance
of the OrganismDb class. Every instance of
this class will respond meaningfully to the methods listed above. Each
method is implemented as an R function. What the function does depends
upon the class of its arguments. Of special note at this juncture are
the methods genes, exons, and
transcripts, which yield information about fundamental
components of genomes.
These methods will succeed for human and for other model organisms such
as Mus musculus, S. cerevisiae, C. elegans,
and others for which the Bioconductor project and its contributors have
defined OrganismDb representations.
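The idea that one method name can behave differently depending on the class of its argument can be sketched with R's S4 system from the methods package. Everything named below (ToyHuman, ToyYeast, toyGenes, and the identifiers stored in the objects) is hypothetical toy material for illustration; it is not the OrganismDbi implementation.

```r
library(methods)

# Two hypothetical classes standing in for organism-level resources
setClass("ToyHuman", representation(geneSymbols = "character"))
setClass("ToyYeast", representation(orfIds = "character"))

# A generic function: the method that runs depends on the class of x
setGeneric("toyGenes", function(x) standardGeneric("toyGenes"))

setMethod("toyGenes", "ToyHuman", function(x) x@geneSymbols)
setMethod("toyGenes", "ToyYeast", function(x) x@orfIds)

# Create one instance of each class (toy data only)
human <- new("ToyHuman", geneSymbols = c("BRCA1", "TP53"))
yeast <- new("ToyYeast", orfIds = c("YAL001C", "YAL002W"))

# The same call is dispatched to a different method for each instance
print(toyGenes(human))
print(toyGenes(yeast))

# Check class membership, analogous to class(Homo.sapiens) in the text
print(is(human, "ToyHuman"))
```

A user of these objects only needs to know the generic's name; the class of the argument determines which implementation is evaluated.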
This section can be skipped on a first reading.
We can perform object-oriented functional programming with R by writing R code. A basic approach is to create “scripts” that define all the steps underlying processes of data import and analysis. When scripts are written in such a way that they only define functions and data structures, it becomes possible to package them for convenient distribution to other users confronting similar data management and data analysis problems.
The R software packaging protocol specifies how source code in R and other languages can be organized together with metadata and documentation to foster convenient testing and redistribution. For example, an early version of the package defining this document had the folder layout given below:
├── DESCRIPTION (text file with metadata on provenance, licensing)
├── NAMESPACE (text file defining imports and exports)
├── R (folder for R source code)
├── README.md (optional for GitHub face page)
├── data (folder for exemplary data)
├── man (folder for detailed documentation)
├── tests (folder for formal software testing code)
└── vignettes (folder for high-level documentation)
    ├── biocOv1.Rmd
    └── biocOv1.html
The packaging protocol document “Writing R Extensions” provides full
details. The command R CMD build [foldername], run from a system shell, will
operate on the contents of a package folder to create an archive that
can be added to an R installation using
R CMD INSTALL [archivename]. The RStudio system performs
these tasks with GUI elements.
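To see a package folder of this kind take shape, one option is the package.skeleton function from the utils package, which writes a starting layout for objects defined in an R session. This is a minimal sketch rather than the workflow used for the package behind this document; the package name demoPkg and the function it contains are hypothetical, and the folder is written to a temporary directory.

```r
# A function we would like to distribute (hypothetical example content)
square <- function(x) x^2

# Write the skeleton under a temporary directory so nothing is left behind
pkg_parent <- tempdir()
utils::package.skeleton(
  name = "demoPkg",             # hypothetical package name
  list = "square",              # names of objects to include
  environment = environment(),  # where to look up those objects
  path = pkg_parent,
  force = TRUE                  # overwrite the folder if the code is rerun
)

# Inspect the top level of the generated package folder
pkg_dir <- file.path(pkg_parent, "demoPkg")
created <- list.files(pkg_dir)
print(created)
```

The generated folder includes DESCRIPTION, NAMESPACE, R, and man entries corresponding to the layout shown above; folders such as tests and vignettes are added by the developer as needed.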
The packaging protocol helps us to isolate software that performs a limited set of operations, and to identify the version of a program collection that is inherently changing over time. There is no objective way to determine whether a set of operations is the right size for packaging. Some very useful packages carry out only a small number of tasks, while others have very broad scope. What matters is that the package concept permits modularization of software along two dimensions: scope and time. Modularization of scope allows parallel, independent development of software tools that address distinct problems. Modularization in time allows identification of versions of software whose behavior is stable.
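Modularization in time is visible from within R, because every installed package carries a version number that can be queried and compared. The sketch below uses the stats package only because it ships with every R installation; the threshold "3.0.0" is an arbitrary value chosen for illustration.

```r
# Version of the 'stats' package, which ships with every R installation
v <- packageVersion("stats")
print(v)

# Version objects compare component by component, not as text,
# so 1.10.0 is treated as newer than 1.9.0
newer <- package_version("1.10.0") > package_version("1.9.0")
print(newer)

# A script can check the version it is running against
# (the threshold here is arbitrary and for illustration only)
meets_minimum <- v >= "3.0.0"
print(meets_minimum)
```

Recording such version information alongside an analysis makes it possible to state which release of each package produced a given result.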
The figure below is a snapshot of the build report for the development branch of Bioconductor.
The six-column subtable in the upper half of the display includes a column “Installed pkgs”, with entry 1857 for the Linux platform. This number varies between platforms and is generally increasing over time for the devel branch.
Bioconductor’s core developer group works hard to develop data structures that allow users to work conveniently with genomes and genome-scale data. Structures are devised to support the main phases of experimentation in genome-scale biology:
In this course we will review the objects and functions that you can use to perform these and related tasks in your own research.