Bioconductor R Packages for Genomics Data

This is an example of a genomic analyses for the Austin R Users Group Demonstration on July 22, 2015. This code and tutorial is a supplement to the presentation “Biconductor R Packages for Genomic Data”.

What is Bioconductor?

Bioconductor is a bioinformatics software consortium of academics and professionals who provide tools for the comprehensive analysis of high-throughput genomic data using the programming language R. Bioconductor, an completely open-source project, started in 2001 and currently has over 1,000 R packages for bioinformatics. Bioconductor is also open development (meaning you can look under the hood at the code base and change or make contributions yourself). One unique contribution of the Bioconductor community is the creation of Amazon Machine Images and a series of Docker images which will run Bioconductor packages. This enables repeatable analyses. See more in the upcoming discussion on how to safe-guard your analyses and avoid ‘disaster’ or repeatablility problems. For some packages you will need to create your own AMI or Docker containers.

Bioconductor is supported by the Fred Hutchinson Cancer Research Center and support can be contacted at support.bioconductor.org. There are also regularly scheduled conferences in which you can meet and mingle with other bioinformaticians and researchers who use and create Bioconductor R packages. Its a dynamic and wonderful community to be part of–so if you can take advantage of opportunities. The core team and participants in Bioconductor include scientists from business and academic institutions such as Oracle, Novartis (Switzerland), Paterson Institute of Bioinformatics (UK) among others.

Downloading and Installing Bioconductor

The first task in using Bioconductor is to download it. The code for downloading bioconductor and installing it is listed below. The next step, downloading the package we will use for gene visualization, GenomeGraphs will be accomplished using Bioconductor. These commands can be performed either in RStudio IDE or at the command line. It is a good idea to use Bioconductor to download these files because the release schedule for Bioconductor differs from the CRAN directory. In addition if you use Bioconductor you can download ‘in development’ releases and test them yourself for new features. A final reason to use Bioconductor is that it will automatically download the most common R scripts used for genomics analysis with Bioconductor packages.

A couple of pages on the Bioconductor site which are helpful:

For basic workflows:

http://bioconductor.org/help/workflows/sequencing/

http://bioconductor.org/developers/how-to/workflows/

For detailed information about using Docker:

http://bioconductor.org/help/docker/

For detailed information about using Amazon Machine Images and other Cloud Computing Options:

http://bioconductor.org/help/bioconductor-cloud-ami/

source("http://bioconductor.org/biocLite.R")

## Bioconductor version 3.0 (BiocInstaller 1.16.5), ?biocLite for help
## A new version of Bioconductor is available after installing the most
##   recent version of R; see http://bioconductor.org/install

biocLite()

## BioC_mirror: http://bioconductor.org
## Using Bioconductor version 3.0 (BiocInstaller 1.16.5), R version 3.1.3.
## Old packages: 'caret', 'cluster', 'codetools', 'crayon', 'e1071',
##   'gridExtra', 'lattice', 'MASS', 'rattle', 'rversions', 'TTR'

biocLite(c("GenomeGraphs"))

## BioC_mirror: http://bioconductor.org
## Using Bioconductor version 3.0 (BiocInstaller 1.16.5), R version 3.1.3.
## Installing package(s) 'GenomeGraphs'

## 
## The downloaded binary packages are in
##  /var/folders/3g/k_y108k10l9729971mkb2ql40000gn/T//RtmpFaj8Fu/downloaded_packages

## Old packages: 'caret', 'cluster', 'codetools', 'crayon', 'e1071',
##   'gridExtra', 'lattice', 'MASS', 'rattle', 'rversions', 'TTR'

Using GenomeGraphs Package: Plotting Genomic Information from Ensembl

Next we perform some simple genomics analysis (looking this time at RNA transcripts or what we call mRNA). The analysis is demonstrated looking at the DNA Sequence for EHM2, a gene known to be over expressed in metastatic skin cancer. For this analysis we use the Bioconductor R package GenomeGraphs which provides tools for visualizing high-dimensional genomic data.

Citation: Durinck S and Bullard J. GenomeGraphs: Plotting genomic information from Ensembl. R package version 1.28.0.

Here’s some summary information on EHM2

Gene name: erythrocyte membrane protein band 4.1 like 4B

Short name: EHM2

Location: Chromosome 9, location: 9q31.3

Key Traits: Over-expressed in highly metastatic cells 2 (that is where the short name is derived)

Found in the following tissues and cancer types: skin, liver, bone marrow, kidney and skin cancer.

The visualization performed here will give us some information about the transcripts from this gene. There are two we will show in this demonstration. Try not to focus too much on the genomics for this talk as we are focused on learning the software. However there are many good tutorials and courses on genomics that can be found both on Coursera & edX free courseware websites. In addition, you can see me afterwards for more discussion.

Our first plot will just show the relative general location of introns and exons along the gene axis. This is a way to visually assess what information might be important for example looking at a mRNA expressed gene mutation. Again, though we are focusing on the R code here. Please see me after the talk for questions on genomics.

The code below calls the GenomeGraphs library and imports a part of a gene sequence.

library(GenomeGraphs)

## Loading required package: biomaRt
## Loading required package: grid

#example 1  Plotting a gene and its variants
mart <- useMart("ensembl", dataset="hsapiens_gene_ensembl")
gene <- makeGene(id = "ENSG00000095203", type="ensembl_gene_id", biomart = mart)
gdPlot(gene)

For our second example we plot the gene transcript for Genesmin along with its variant (that is where there is a version 2 of the mRNA output). Sometimes this becomes very important in genomics as variants can cause disease.

#example 2 Plotting complex data about a gene
transcript <- makeTranscript(id = "ENSG00000095203", type="ensembl_gene_id", biomart = mart)
gdPlot(list(gene, transcript))

A Second Example Using Comparative Genomic Hybridization Visualization

A technique called array comparative genomic hybridization (also microarray-based comparative genomic, hybridization or CGH) is a technqiue for the detection of chromosomal copy number changes. The next visualization is of CGH array data combined with a visualization of gene segments. The goal of the visualization would be to look for chromosomal abnormalities in the form of excess chromosomal copy numbers, missing chromosomal regions or translocated chromosomal regions.

Our final visualization below shows some interesting information. The letters signify the following: (a) image of a typical cytogenetic stain of a chromosome, (b) gene expression, (c) cloning of polymorphims, (d) mRNA transcripts, (e) chromosomal location and (f) genesmin represented by a line with 5’ & 3’ ends.

data("exampleData", package="GenomeGraphs")
minbase <- 180292097
maxbase <- 180492096
genesplus <- makeGeneRegion(start = minbase, end = maxbase, strand = "+", chromosome = "3", biomart=mart)
genesmin <- makeGeneRegion(start = minbase, end = maxbase, strand = "-", chromosome = "3", biomart=mart)
seg <- makeSegmentation(segStart[[1]], segEnd[[1]], segments[[1]], dp = DisplayPars(color = "black", lwd=2,lty = "solid"))
cop <- makeGenericArray(intensity = cn, probeStart = probestart,trackOverlay = seg, dp = DisplayPars(size=3, color = "seagreen", type="dot"))
ideog <- makeIdeogram(chromosome = 3)
expres <- makeGenericArray(intensity = intensity, probeStart = exonProbePos,dp = DisplayPars(color="darkred", type="point"))
genomeAxis <- makeGenomeAxis(add53 = TRUE, add35=TRUE)
plotSystemTime <- system.time(gdPlot(list(a=ideog,b=expres,c=cop,d=genesplus,e=genomeAxis,f=genesmin), minBase = minbase, maxBase =maxbase, labelCex=2))

plotSystemTime

##    user  system elapsed 
##   0.358   0.006   0.801

Options to Improve Wall-Clock Time

Running the code on a single transcript or hybridization data, does not take a very long wall-clock time (~0.4 seconds), but if we were processing and visualizing multiple genes in a large dataset–which is the most common case–it would be very slow. For example most genome-wide mRNA analyses will look for at a low number for example, 5 genes, still a very small number, but theoretically increasing wall-clock time to a few seconds. The number of mRNA transcripts to be analzyed however can be much higher as there are 20,000-25,000 protein-coding genes in the human genome. This means at least 25,000 or more mRNA could be compared. Several variants of mRNA exist and so the actual number is higher. Choosing only 10,000 such genes would create a wall-clock time of at minimum over an hour on a this example where I use a single-node with 2 threads.

Paired with variable expression levels, the fact that we have 46 chromosomes to analyze and replicates to accomplish. One can see why genomics data is ‘big data’. There are a number of ways to address these problems including using C/C++ scripts to speed up calculations, using Amazon Machine Images, using Docker on a cloud platform to containerize workflows and run them on clusters. Another tutorial in this repository uses some parallel R as well as clusters (see Practical Machine Learning Approach to Reducing Dimensionality). Also take a peek at the Bioconductor site for more information on using Docker from R Studio or the command-line.