We will use the following libraries to demonstrate visualization of NGS data.

#biocLite("pasillaBamSubset")
#biocLite("TxDb.Dmelanogaster.UCSC.dm3.ensGene")
library(pasillaBamSubset)
library(TxDb.Dmelanogaster.UCSC.dm3.ensGene)

## Loading required package: GenomicFeatures
## Loading required package: BiocGenerics
## Loading required package: parallel
## 
## Attaching package: 'BiocGenerics'
## 
## The following objects are masked from 'package:parallel':
## 
##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
##     clusterExport, clusterMap, parApply, parCapply, parLapply,
##     parLapplyLB, parRapply, parSapply, parSapplyLB
## 
## The following object is masked from 'package:stats':
## 
##     xtabs
## 
## The following objects are masked from 'package:base':
## 
##     anyDuplicated, append, as.data.frame, as.vector, cbind,
##     colnames, do.call, duplicated, eval, evalq, Filter, Find, get,
##     intersect, is.unsorted, lapply, Map, mapply, match, mget,
##     order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
##     rbind, Reduce, rep.int, rownames, sapply, setdiff, sort,
##     table, tapply, union, unique, unlist, unsplit
## 
## Loading required package: S4Vectors
## Loading required package: stats4
## Creating a generic function for 'nchar' from package 'base' in package 'S4Vectors'
## Loading required package: IRanges
## Loading required package: GenomeInfoDb
## Loading required package: GenomicRanges
## Loading required package: AnnotationDbi
## Loading required package: Biobase
## Welcome to Bioconductor
## 
##     Vignettes contain introductory material; view with
##     'browseVignettes()'. To cite Bioconductor, see
##     'citation("Biobase")', and for packages 'citation("pkgname")'.

fl1 <- untreated1_chr4()
fl2 <- untreated3_chr4()

We will try four ways to look at NGS coverage: using the standalone Java program IGV, using simple plot commands, and using the Gviz and ggbio packages in Bioconductor.

IGV

Copy these files from the R library directory to the current working directory. First set the working directory to the source file location. We need to use the Rsamtools library to index the BAM files for using IGV.

file.copy(from=fl1,to=basename(fl1))

## [1] FALSE

file.copy(from=fl2,to=basename(fl2))

## [1] FALSE

library(Rsamtools)

## Loading required package: XVector
## Loading required package: Biostrings

indexBam(basename(fl1))

##       untreated1_chr4.bam 
## "untreated1_chr4.bam.bai"

indexBam(basename(fl2))

##       untreated3_chr4.bam 
## "untreated3_chr4.bam.bai"

IGV is freely available for download here: https://www.broadinstitute.org/igv/home

You will need to provide an email, and then you will get a download link.

Using IGV, look for gene lgs.

Note that if you have trouble downloading IGV, another option for visualization is the UCSC Genome Browser: http://genome.ucsc.edu/cgi-bin/hgTracks

The UCSC Genome Browser is a great resource, having many tracks involving gene annotations, conservation over multiple species, and the ENCODE epigenetic tracks already available. However, the UCSC Genome Browser requires that you upload your genomic files to their server, or put your data on a publicly available server. This is not always possible if you are working with confidential data.

Simple plot

Next we will look at the same gene using the simple plot function in R.

library(GenomicRanges)

Note: if you are using Bioconductor version 14, paired with R 3.1, you should also load this library. You do not need to load this library, and it will not be available to you, if you are using Bioconductor version 13, paired with R 3.0.x.

library(GenomicAlignments)

We read in the alignments from the file fl1. Then we use the coverage function to tally up the basepair coverage. We then extract the subset of coverage which overlaps our gene of interest, and convert this coverage from an RleList into a numeric vector. Remember from Week 2, that Rle objects are compressed, such that repeating numbers are stored as a number and a length.

x <- readGAlignments(fl1)
xcov <- coverage(x)
z <- GRanges("chr4",IRanges(456500,466000))
# Bioconductor 2.14
xcov[z]

## RleList of length 1
## $chr4
## integer-Rle of length 9501 with 1775 runs
##   Lengths: 1252   10   52    4    7    2 ...   10    7   12  392   75 1041
##   Values :    0    2    3    4    5    6 ...    3    2    1    0    1    0

# Bioconductor 2.13
xcov$chr4[ranges(z)]

## integer-Rle of length 9501 with 1775 runs
##   Lengths: 1252   10   52    4    7    2 ...   10    7   12  392   75 1041
##   Values :    0    2    3    4    5    6 ...    3    2    1    0    1    0

xnum <- as.numeric(xcov$chr4[ranges(z)])
plot(xnum)

We can do the same for another file:

y <- readGAlignmentPairs(fl2)

## Warning in .make_GAlignmentPairs_from_GAlignments(gal, use.mcols = use.mcols): 63 pairs (0 proper, 63 not proper) were dropped because the seqname
##   or strand of the alignments in the pair were not concordant.
##   Note that a GAlignmentPairs object can only hold concordant pairs at the
##   moment, that is, pairs where the 2 alignments are on the opposite strands
##   of the same chromosome.

ycov <- coverage(y)
ynum <- as.numeric(ycov$chr4[ranges(z)])
plot(xnum, type="l", col="blue", lwd=2)
lines(ynum, col="red", lwd=2)

We can zoom in on a single exon:

plot(xnum, type="l", col="blue", lwd=2, xlim=c(6200,6600))
lines(ynum, col="red", lwd=2)

Extracting the gene of interest using the transcript database

Suppose we are interested in visualizing the gene lgs. We can extract it from the transcript database TxDb.Dmelanogaster.UCSC.dm3.ensGene on Bioconductor, but first we need to look up the Ensembl gene name. We will use the functions that we learned in Week 7 to find the name.

# biocLite("biomaRt")
library(biomaRt)
m <- useMart("ensembl", dataset = "dmelanogaster_gene_ensembl")
lf <- listFilters(m)
lf[grep("name", lf$description, ignore.case=TRUE),]

##                             name
## 1                chromosome_name
## 17         with_flybasename_gene
## 19   with_flybasename_transcript
## 23  with_flybasename_translation
## 52            external_gene_name
## 66              flybasename_gene
## 67        flybasename_transcript
## 68       flybasename_translation
## 93                 wikigene_name
## 107               go_parent_name
## 208               so_parent_name
##                                      description
## 1                                Chromosome name
## 17                   with FlyBaseName gene ID(s)
## 19             with FlyBaseName transcript ID(s)
## 23                with FlyBaseName protein ID(s)
## 52          Associated Gene Name(s) [e.g. BRCA2]
## 66           FlyBaseName Gene ID(s) [e.g. cul-2]
## 67  FlyBaseName Transcript ID(s) [e.g. cul-2-RB]
## 68     FlyBaseName Protein ID(s) [e.g. cul-2-PB]
## 93                 WikiGene Name(s) [e.g. Ir21a]
## 107                             Parent term name
## 208                             Parent term name

map <- getBM(mart = m,
  attributes = c("ensembl_gene_id", "flybasename_gene"),
  filters = "flybasename_gene", 
  values = "lgs")
map

##   ensembl_gene_id flybasename_gene
## 1     FBgn0039907              lgs

Now we extract the exons for each gene, and then the exons for the gene lgs.

library(GenomicFeatures)
grl <- exonsBy(TxDb.Dmelanogaster.UCSC.dm3.ensGene, by="gene")
gene <- grl[[map$ensembl_gene_id[1]]]
gene

## GRanges object with 6 ranges and 2 metadata columns:
##       seqnames           ranges strand |   exon_id   exon_name
##          <Rle>        <IRanges>  <Rle> | <integer> <character>
##   [1]     chr4 [457583, 459544]      - |     63350        <NA>
##   [2]     chr4 [459601, 459791]      - |     63351        <NA>
##   [3]     chr4 [460074, 462077]      - |     63352        <NA>
##   [4]     chr4 [462806, 463015]      - |     63353        <NA>
##   [5]     chr4 [463490, 463780]      - |     63354        <NA>
##   [6]     chr4 [463839, 464533]      - |     63355        <NA>
##   -------
##   seqinfo: 15 sequences (1 circular) from dm3 genome

Finally we can plot these ranges to see what it looks like:

rg <- range(gene)
plot(c(start(rg), end(rg)), c(0,0), type="n", xlab=seqnames(gene)[1], ylab="")
arrows(start(gene),rep(0,length(gene)),
       end(gene),rep(0,length(gene)),
       lwd=3, length=.1)

But actually, the gene is on the minus strand. We should add a line which corrects for minus strand genes:

plot(c(start(rg), end(rg)), c(0,0), type="n", xlab=seqnames(gene)[1], ylab="")
arrows(start(gene),rep(0,length(gene)),
       end(gene),rep(0,length(gene)),
       lwd=3, length=.1, 
       code=ifelse(as.character(strand(gene)[1]) == "+", 2, 1))

Gviz

We will briefly show two packages for visualizing genomic data in Bioconductor. Note that each of these have extensive vignettes for plotting many kinds of data. We will show here how to make the coverage plots as before:

#biocLite("Gviz")
library(Gviz)

## Loading required package: grid

gtrack <- GenomeAxisTrack()
atrack <- AnnotationTrack(gene, name = "Gene Model")
plotTracks(list(gtrack, atrack))

Extract the coverage. Gviz expects that data will be provided as GRanges objects, so we convert the RleList coverage to a GRanges object:

xgr <- as(xcov, "GRanges")
ygr <- as(ycov, "GRanges")
dtrack1 <- DataTrack(xgr[xgr %over% z], name = "sample 1")
dtrack2 <- DataTrack(ygr[ygr %over% z], name = "sample 2")
plotTracks(list(gtrack, atrack, dtrack1, dtrack2))

plotTracks(list(gtrack, atrack, dtrack1, dtrack2), type="polygon")

ggbio

#biocLite("ggbio")
library(ggbio)

## Loading required package: ggplot2
## Need specific help about ggbio? try mailing 
##  the maintainer or visit http://tengfei.github.com/ggbio/
## 
## Attaching package: 'ggbio'
## 
## The following objects are masked from 'package:ggplot2':
## 
##     geom_bar, geom_rect, geom_segment, ggsave, stat_bin,
##     stat_identity, xlim

autoplot(gene)

autoplot(fl1, which=z)

## reading in as Bamfile
## Parsing raw coverage...
## Read GAlignments from BamFile...
## extracting information...

autoplot(fl2, which=z)

## reading in as Bamfile
## Parsing raw coverage...
## Read GAlignments from BamFile...
## extracting information...

Footnotes

IGV https://www.broadinstitute.org/igv/home
Gviz http://www.bioconductor.org/packages/release/bioc/html/Gviz.html
ggbio http://www.bioconductor.org/packages/release/bioc/html/ggbio.html
UCSC Genome Browser: zooms and scrolls over chromosomes, showing the work of annotators worldwide http://genome.ucsc.edu/
Ensembl genome browser: genome databases for vertebrates and other eukaryotic species http://ensembl.org
Roadmap Epigenome browser: public resource of human epigenomic data http://www.epigenomebrowser.org http://genomebrowser.wustl.edu/ http://epigenomegateway.wustl.edu/
“Sashimi plots” for RNA-Seq http://genes.mit.edu/burgelab/miso/docs/sashimi.html
Circos: designed for visualizing genomic data in a cirlce http://circos.ca/
SeqMonk: a tool to visualise and analyse high throughput mapped sequence data http://www.bioinformatics.babraham.ac.uk/projects/seqmonk/

——————————————————————-

Exercises

We recall the HepG2 GRanges from the ERBS package, which provides very limited information in the narrowPeak format on the locations and sizes of ESRRA binding to DNA from a liver cell line.

A higher resolution view of the ESRRA binding events is available in the bigWig format here (319 MB). The bigWig format gives us coverage information, similar to the coverage we saw from RNA-seq in the previous videos.

Download that bigWig file to your current working folder and import to a GRanges

library(rtracklayer)
# this next line takes a few minutes to run
h2bw = import("wgEncodeSydhTfbsHepg2ErraForsklnStdSig.bigWig")
h2bw

## GRanges object with 53093512 ranges and 1 metadata column:
##              seqnames               ranges strand   |            score
##                 <Rle>            <IRanges>  <Rle>   |        <numeric>
##          [1]     chr1       [10446, 10456]      *   |                0
##          [2]     chr1       [10457, 10465]      *   | 5.40000009536743
##          [3]     chr1       [10466, 10468]      *   | 5.30000019073486
##          [4]     chr1       [10469, 10474]      *   | 5.19999980926514
##          [5]     chr1       [10475, 10476]      *   | 5.30000019073486
##          ...      ...                  ...    ... ...              ...
##   [53093508]     chrY [59033190, 59033190]      *   | 10.6999998092651
##   [53093509]     chrY [59033191, 59033192]      *   | 10.8000001907349
##   [53093510]     chrY [59033193, 59033246]      *   | 10.8999996185303
##   [53093511]     chrY [59033247, 59033269]      *   | 5.40000009536743
##   [53093512]     chrY [59033282, 59033299]      *   |                0
##   -------
##   seqinfo: 25 sequences from an unspecified genome

Note that the output in bigWig presents much smaller genomic intervals than the narrowPeak data. We now have the coverage of ChIP-seq reads at each location in the genome which passed some minimal threshold (not the same as “peaks” as we will see below). If the coverage is identical at adjacent basepairs, then the region will be as wide as possible. The coverage information is present in the ‘score’ metadata column.

Question 3.10.1

What is the median width in bases of ranges in h3bw?

summary(width(h2bw))

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##     1.00     2.00     4.00    48.44    25.00 19940.00

Question 3.10.2

We can compare the bigwig intervals to the reported peaks

# if you don't have ERBS: install_github("genomicsclass/ERBS")
library(ERBS) 
data(HepG2)
fo = findOverlaps(h2bw, HepG2)
inpeak = queryHits(fo)

What is the median score for the h2bw bigWig data in regions identified as peaks in HepG2?

sum(duplicated(inpeak))

## [1] 0

## no duplicated indices here, so:
median(h2bw[ inpeak ]$score)

## [1] 45.7

Question 3.10.3

What is the median score for the h2bw bigWig data in regions NOT identified as peaks in HepG2?

median(h2bw[ -inpeak ]$score)

## [1] 4.2

Question 3.10.4

Find the location of gene with the SYMBOL ‘ESRRA’ in hg19 (hint: ‘CHRLOC’ column in Homo.sapiens), and the narrowPeak range that includes it in HepG2. Obtain the scores from h2bw that lie in this peak. What is the maximum score in this peak region?

library(Homo.sapiens)

## Loading required package: OrganismDbi
## Loading required package: GO.db
## Loading required package: DBI
## 
## Loading required package: org.Hs.eg.db
## 
## Loading required package: TxDb.Hsapiens.UCSC.hg19.knownGene

select(Homo.sapiens, keys="ESRRA", keytype="SYMBOL",columns="CHRLOC")

##   SYMBOL   CHRLOC CHRLOCCHR
## 1  ESRRA 64073044        11

narrind = queryHits(findOverlaps(HepG2, GRanges("chr11", IRanges(64073044, width=1))))

bwind = queryHits(fo)[ subjectHits(fo)==narrind]
max( h2bw$score[ bwind ] )

## [1] 202.1

Question 3.10.5

We have now found the coverage of the ChIP-seq peak nearest to the ESRRA gene. We can write this as:

peakcov = h2bw[ queryHits(fo)[subjectHits(fo) == 5] ]

We can plot the coverage vs the middle point of each range:

plot( 0.5*( start(peakcov) + end(peakcov) ), peakcov$score)

From the information in ‘peakcov’, what is the location of the basepair with the highest coverage?

peakcov[ which.max(peakcov$score) ]

## GRanges object with 1 range and 1 metadata column:
##       seqnames               ranges strand |            score
##          <Rle>            <IRanges>  <Rle> |        <numeric>
##   [1]    chr11 [64072320, 64072320]      * | 202.100006103516
##   -------
##   seqinfo: 25 sequences from an unspecified genome

Note that this information (just the location of the highest peak, not the coverage details) is annotated in the narrowPeak file. The “peak” column of HepG2 annotates the location of the highest coverage relative to the leftmost point of the range:

start(HepG2[5]) + HepG2[5]$peak

## [1] 64072320

Visualizing NGS data

IGV