This is an example of a genomic analysis for the Austin R Users Group demonstration on July 22, 2015. This code and tutorial supplement the presentation “Bioconductor R Packages for Genomic Data”.

What is Bioconductor?

Bioconductor is a bioinformatics software consortium of academics and professionals who provide tools for the comprehensive analysis of high-throughput genomic data using the programming language R. Bioconductor, a completely open-source project, started in 2001 and currently has over 1,000 R packages for bioinformatics. Bioconductor is also open development, meaning you can look under the hood at the code base and change it or make contributions yourself. One unique contribution of the Bioconductor community is the creation of Amazon Machine Images (AMIs) and a series of Docker images that run Bioconductor packages. These images enable repeatable analyses. See more in the upcoming discussion on how to safeguard your analyses and avoid ‘disaster’ or repeatability problems. For some packages you will need to create your own AMI or Docker container.
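
Container images pin the whole software stack, but even without one it helps to record which R and package versions produced a result. The following is a minimal sketch using base R only; the helper name `record_session` and the file name `session-info.txt` are illustrative assumptions, not part of Bioconductor.

```r
# Write the current R session details (R version, platform, attached
# packages and their versions) to a plain-text file.
# `record_session` and the default file name are hypothetical choices.
record_session <- function(path = "session-info.txt") {
  info <- capture.output(sessionInfo())
  writeLines(info, path)
  invisible(path)
}

# Example: save the record in a temporary directory
out_file <- record_session(file.path(tempdir(), "session-info.txt"))
```

Keeping this file alongside the analysis output gives a simple record to compare against if results later differ on another machine.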

Bioconductor is supported by the Fred Hutchinson Cancer Research Center, and help is available at support.bioconductor.org. There are also regularly scheduled conferences where you can meet and mingle with other bioinformaticians and researchers who use and create Bioconductor R packages. It's a dynamic and wonderful community to be part of, so take advantage of these opportunities if you can. The core team and participants in Bioconductor include scientists from business and academic institutions such as Oracle, Novartis (Switzerland) and the Paterson Institute of Bioinformatics (UK), among others.

Downloading and Installing Bioconductor

The first task in using Bioconductor is to install it. The code for downloading and installing Bioconductor is listed below. The next step, installing GenomeGraphs, the package we will use for gene visualization, is also done through Bioconductor. These commands can be run either in the RStudio IDE or at the R command line. It is a good idea to use the Bioconductor installer for these packages because the Bioconductor release schedule differs from that of CRAN. In addition, Bioconductor lets you download ‘in development’ releases and test new features yourself. A final reason is that calling biocLite() with no arguments automatically installs a core set of commonly used Bioconductor packages.

A couple of pages on the Bioconductor site which are helpful:

For basic workflows:

http://bioconductor.org/help/workflows/sequencing/

http://bioconductor.org/developers/how-to/workflows/

For detailed information about using Docker:

http://bioconductor.org/help/docker/

For detailed information about using Amazon Machine Images and other Cloud Computing Options:

http://bioconductor.org/help/bioconductor-cloud-ami/

source("http://bioconductor.org/biocLite.R")
## Bioconductor version 3.0 (BiocInstaller 1.16.5), ?biocLite for help
## A new version of Bioconductor is available after installing the most
##   recent version of R; see http://bioconductor.org/install
biocLite()
## BioC_mirror: http://bioconductor.org
## Using Bioconductor version 3.0 (BiocInstaller 1.16.5), R version 3.1.3.
## Old packages: 'caret', 'cluster', 'codetools', 'crayon', 'e1071',
##   'gridExtra', 'lattice', 'MASS', 'rattle', 'rversions', 'TTR'
biocLite(c("GenomeGraphs"))
## BioC_mirror: http://bioconductor.org
## Using Bioconductor version 3.0 (BiocInstaller 1.16.5), R version 3.1.3.
## Installing package(s) 'GenomeGraphs'
## 
## The downloaded binary packages are in
##  /var/folders/3g/k_y108k10l9729971mkb2ql40000gn/T//RtmpFaj8Fu/downloaded_packages
## Old packages: 'caret', 'cluster', 'codetools', 'crayon', 'e1071',
##   'gridExtra', 'lattice', 'MASS', 'rattle', 'rversions', 'TTR'
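
The `biocLite()` installer shown above was the standard route when this tutorial was written. Since Bioconductor 3.8, the `BiocManager` package from CRAN has taken over that role. The sketch below shows one way to install only what is missing. The helper names `missing_packages` and `install_bioc` and the `dry_run` flag are hypothetical, and GenomeGraphs itself may no longer be offered in recent Bioconductor releases, so reproducing this tutorial may require an older R and Bioconductor pairing such as the one in the output above.

```r
# Return the packages in `pkgs` that are not installed yet.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Install missing packages through BiocManager.
# With dry_run = TRUE (the default in this sketch) nothing is installed;
# the function only returns the names it would install.
install_bioc <- function(pkgs, dry_run = TRUE) {
  todo <- missing_packages(pkgs)
  if (dry_run || length(todo) == 0) {
    return(todo)
  }
  if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
  }
  BiocManager::install(todo)
  todo
}

# Dry run: returns "GenomeGraphs" if it is not installed, nothing otherwise
install_bioc("GenomeGraphs")
```

Calling `install_bioc("GenomeGraphs", dry_run = FALSE)` would perform the actual installation, assuming the package is available for the Bioconductor version in use.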

Using GenomeGraphs Package: Plotting Genomic Information from Ensembl

Next we perform a simple genomic analysis, this time looking at RNA transcripts (mRNA). The analysis is demonstrated on the gene model for EHM2, a gene known to be over-expressed in metastatic skin cancer. For this analysis we use the Bioconductor R package GenomeGraphs, which provides tools for visualizing high-dimensional genomic data.

Citation: Durinck S and Bullard J. GenomeGraphs: Plotting genomic information from Ensembl. R package version 1.28.0.

Here’s some summary information on EHM2:

Gene name: erythrocyte membrane protein band 4.1 like 4B

Short name: EHM2

Location: Chromosome 9, band 9q31.3

Key Traits: Over-expressed in highly metastatic cells (the short name EHM2 derives from “expressed in highly metastatic cells 2”)

Found in the following tissues and cancer types: skin, liver, bone marrow, kidney and skin cancer.

The visualization performed here will give us some information about the transcripts from this gene. There are two we will show in this demonstration. Try not to focus too much on the genomics for this talk, as we are focused on learning the software. However, there are many good tutorials and courses on genomics on the free courseware websites Coursera and edX. In addition, you can see me afterwards for more discussion.

Our first plot simply shows the relative locations of introns and exons along the gene axis. This is a way to visually assess which regions might be important, for example when looking at a mutation in an expressed (mRNA) gene. Again, though, we are focusing on the R code here. Please see me after the talk for questions on genomics.
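
To make the exon and intron layout concrete, here is a small base-R sketch that derives intron intervals from exon coordinates. The coordinates are made up for illustration and are not the real EHM2 exon positions; the function name `introns_from_exons` is likewise hypothetical.

```r
# Exons of one transcript (1-based, inclusive coordinates).
# These numbers are invented for illustration only.
exons <- data.frame(
  start = c(100, 400, 900),
  end   = c(250, 650, 1200)
)

# Introns are the gaps between consecutive exons.
introns_from_exons <- function(exons) {
  exons <- exons[order(exons$start), ]
  n <- nrow(exons)
  if (n < 2) {
    return(data.frame(start = numeric(0), end = numeric(0)))
  }
  data.frame(
    start = exons$end[-n] + 1,   # first base after each exon but the last
    end   = exons$start[-1] - 1  # last base before each exon but the first
  )
}

introns <- introns_from_exons(exons)
introns$width <- introns$end - introns$start + 1
print(introns)
```

In a gene plot, exons typically appear as boxes and the introns computed here correspond to the connecting gaps between them.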

The code below loads the GenomeGraphs library and retrieves the gene model for EHM2 from Ensembl.

library(GenomeGraphs)
## Loading required package: biomaRt
## Loading required package: grid
# Example 1: plot the gene model for EHM2 (Ensembl gene ID ENSG00000095203)
mart <- useMart("ensembl", dataset="hsapiens_gene_ensembl")
gene <- makeGene(id = "ENSG00000095203", type="ensembl_gene_id", biomart = mart)
gdPlot(gene)

For our second example we plot the same gene, EHM2, along with its transcript variants (that is, cases where there is more than one version of the mRNA output). This can become very important in genomics, as variants can cause disease.

# Example 2: plot the gene together with its transcripts
transcript <- makeTranscript(id = "ENSG00000095203", type="ensembl_gene_id", biomart = mart)
gdPlot(list(gene, transcript))

A Second Example Using Comparative Genomic Hybridization Visualization

Array comparative genomic hybridization (also called microarray-based comparative genomic hybridization, array CGH or simply CGH) is a technique for detecting chromosomal copy number changes. The next visualization combines CGH array data with a view of gene segments. The goal of the visualization is to look for chromosomal abnormalities in the form of excess chromosomal copy numbers, missing chromosomal regions or translocated chromosomal regions.

Our final visualization, below, combines several tracks. The letters signify the following: (a) an ideogram, the image of a typical cytogenetic stain of the chromosome, (b) gene expression measured at exon probes, (c) copy number probe data with the segmentation overlaid, (d) genes on the plus strand (genesplus), (e) the genome axis showing chromosomal location with the 5’ and 3’ ends labeled and (f) genes on the minus strand (genesmin).
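
As a rough illustration of what copy number calling involves, the sketch below labels segments as gains or losses from their mean log2 ratio. The probe values, the segment boundaries and the 0.3 threshold are all assumptions chosen for the example; real CGH pipelines estimate the segments with dedicated algorithms rather than fixing them by hand.

```r
# Simulated log2(test / reference) ratios for 12 probes along a chromosome.
# Values near 0 suggest a normal copy number; positive values suggest a
# gain and negative values a loss.
log2_ratio <- c(0.02, -0.05, 0.04, 0.61, 0.55, 0.58, 0.63,
                -0.03, 0.01, -0.72, -0.68, -0.75)

# Hand-picked segment membership for each probe (illustrative only).
segment_id <- c(1, 1, 1, 2, 2, 2, 2, 3, 3, 4, 4, 4)

# Mean log2 ratio per segment
seg_mean <- tapply(log2_ratio, segment_id, mean)

# Call each segment with a simple symmetric threshold (an assumption).
call_segment <- function(m, threshold = 0.3) {
  ifelse(m > threshold, "gain", ifelse(m < -threshold, "loss", "normal"))
}

calls <- call_segment(seg_mean)
print(data.frame(segment   = names(seg_mean),
                 mean_log2 = round(as.numeric(seg_mean), 2),
                 call      = as.character(calls)))
```

With these invented values, the second segment is called a gain and the fourth a loss, which is the kind of pattern a segmentation line drawn over copy number probes makes visible.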

# Load the example CGH and expression data shipped with GenomeGraphs
data("exampleData", package = "GenomeGraphs")
minbase <- 180292097
maxbase <- 180492096
# Genes on the plus and minus strands of chromosome 3 within the region
genesplus <- makeGeneRegion(start = minbase, end = maxbase, strand = "+", chromosome = "3", biomart = mart)
genesmin <- makeGeneRegion(start = minbase, end = maxbase, strand = "-", chromosome = "3", biomart = mart)
# Copy number probes with the segmentation overlaid as a solid black line
seg <- makeSegmentation(segStart[[1]], segEnd[[1]], segments[[1]], dp = DisplayPars(color = "black", lwd = 2, lty = "solid"))
cop <- makeGenericArray(intensity = cn, probeStart = probestart, trackOverlay = seg, dp = DisplayPars(size = 3, color = "seagreen", type = "dot"))
# Ideogram, exon-level expression and genome axis tracks
ideog <- makeIdeogram(chromosome = 3)
expres <- makeGenericArray(intensity = intensity, probeStart = exonProbePos, dp = DisplayPars(color = "darkred", type = "point"))
genomeAxis <- makeGenomeAxis(add53 = TRUE, add35 = TRUE)
# Draw all tracks and record how long the plot takes
plotSystemTime <- system.time(gdPlot(list(a = ideog, b = expres, c = cop, d = genesplus, e = genomeAxis, f = genesmin), minBase = minbase, maxBase = maxbase, labelCex = 2))