The following lab will describe how to count NGS reads which fall into genomic features. We want to end up with a count matrix which has rows corresponding to genomic ranges and columns which correspond to different experiments or samples. As an example, we will use an RNA-Seq experiment, with files in the pasillaBamSubset Bioconductor data package. However, the same functions can be used for DNA-Seq, ChIP-Seq, etc.

#biocLite("pasillaBamSubset")
#biocLite("TxDb.Dmelanogaster.UCSC.dm3.ensGene")
library(pasillaBamSubset)
library(TxDb.Dmelanogaster.UCSC.dm3.ensGene)

We load a transcript database object. These are prebuilt in R for various well studied organisms, for example TxDb.Hsapiens.UCSC.hg19.knownGene. In addition the makeTranscriptDbFromGFF file can be used to import GFF or GTF gene models. We use the exonsBy function to get a GRangesList object of the exons for each gene.

txdb <- TxDb.Dmelanogaster.UCSC.dm3.ensGene
grl <- exonsBy(txdb, by="gene") # to get genes from the list
grl[100] # in order to view the 100th gene, each line is an exon.

## GRangesList object of length 1:
## $FBgn0000286 
## GRanges object with 8 ranges and 2 metadata columns:
##       seqnames             ranges strand |   exon_id   exon_name
##          <Rle>          <IRanges>  <Rle> | <integer> <character>
##   [1]    chr2L [4876890, 4879196]      - |      8515        <NA>
##   [2]    chr2L [4877289, 4879196]      - |      8516        <NA>
##   [3]    chr2L [4880294, 4880472]      - |      8517        <NA>
##   [4]    chr2L [4880378, 4880472]      - |      8518        <NA>
##   [5]    chr2L [4881215, 4882492]      - |      8519        <NA>
##   [6]    chr2L [4882865, 4883113]      - |      8520        <NA>
##   [7]    chr2L [4882889, 4883113]      - |      8521        <NA>
##   [8]    chr2L [4882889, 4883341]      - |      8522        <NA>
## 
## -------
## seqinfo: 15 sequences (1 circular) from dm3 genome

grl[[100]] # in order to filter out only the GRangesList out of 100th exon data

## GRanges object with 8 ranges and 2 metadata columns:
##       seqnames             ranges strand |   exon_id   exon_name
##          <Rle>          <IRanges>  <Rle> | <integer> <character>
##   [1]    chr2L [4876890, 4879196]      - |      8515        <NA>
##   [2]    chr2L [4877289, 4879196]      - |      8516        <NA>
##   [3]    chr2L [4880294, 4880472]      - |      8517        <NA>
##   [4]    chr2L [4880378, 4880472]      - |      8518        <NA>
##   [5]    chr2L [4881215, 4882492]      - |      8519        <NA>
##   [6]    chr2L [4882865, 4883113]      - |      8520        <NA>
##   [7]    chr2L [4882889, 4883113]      - |      8521        <NA>
##   [8]    chr2L [4882889, 4883341]      - |      8522        <NA>
##   -------
##   seqinfo: 15 sequences (1 circular) from dm3 genome

grl[[100]][1] # in order to filter out the GRanges of exon no.1 of the 100th gene.

## GRanges object with 1 range and 2 metadata columns:
##       seqnames             ranges strand |   exon_id   exon_name
##          <Rle>          <IRanges>  <Rle> | <integer> <character>
##   [1]    chr2L [4876890, 4879196]      - |      8515        <NA>
##   -------
##   seqinfo: 15 sequences (1 circular) from dm3 genome

These functions in the pasillaBamSubset package just point us to the BAM files.

fl1 <- untreated1_chr4()
fl2 <- untreated3_chr4()
fl1

## [1] "C:/Users/Prasanth/Documents/R/win-library/3.2/pasillaBamSubset/extdata/untreated1_chr4.bam"

We need the following libraries for counting BAM files.

library(Rsamtools)

## Loading required package: XVector
## Loading required package: Biostrings

library(GenomicRanges)
library(GenomicAlignments)

We specify the files using the BamFileList function. For more fine grain control, you can tell the read counting functions how many reads to load at once with the yieldSize argument.

fls <- BamFileList(c(fl1, fl2))
names(fls) <- c("first","second")

The following function counts the overlaps of the reads in the BAM files which are found in the features in the genes of Drosophila. We tell the counting function to ignore the strand, i.e., to allow minus strand reads to count in plus strand genes, and vice versa.

so1 <- summarizeOverlaps(features=grl,
                         reads=fls,
                         ignore.strand=TRUE)
so1

## class: SummarizedExperiment 
## dim: 15682 2 
## exptData(0):
## assays(1): counts
## rownames(15682): FBgn0000003 FBgn0000008 ... FBgn0264726
##   FBgn0264727
## rowRanges metadata column names(0):
## colnames(2): first second
## colData names(0):

We can examine the count matrix, which is stored in the assay slot:

head(assay(so1))

##             first second
## FBgn0000003     0      0
## FBgn0000008     0      0
## FBgn0000014     0      0
## FBgn0000015     0      0
## FBgn0000017     0      0
## FBgn0000018     0      0

colSums(assay(so1))

##  first second 
## 156469 122872

The other parts of a SummarizedExperiment.

rowRanges(so1) # rowData will become rowRanges in the next release of Bioc

## GRangesList object of length 15682:
## $FBgn0000003 
## GRanges object with 1 range and 2 metadata columns:
##       seqnames             ranges strand |   exon_id   exon_name
##          <Rle>          <IRanges>  <Rle> | <integer> <character>
##   [1]    chr3R [2648220, 2648518]      + |     45123        <NA>
## 
## $FBgn0000008 
## GRanges object with 13 ranges and 2 metadata columns:
##        seqnames               ranges strand   | exon_id exon_name
##    [1]    chr2R [18024494, 18024531]      +   |   20314      <NA>
##    [2]    chr2R [18024496, 18024713]      +   |   20315      <NA>
##    [3]    chr2R [18024938, 18025756]      +   |   20316      <NA>
##    [4]    chr2R [18025505, 18025756]      +   |   20317      <NA>
##    [5]    chr2R [18039159, 18039200]      +   |   20322      <NA>
##    ...      ...                  ...    ... ...     ...       ...
##    [9]    chr2R [18058283, 18059490]      +   |   20326      <NA>
##   [10]    chr2R [18059587, 18059757]      +   |   20327      <NA>
##   [11]    chr2R [18059821, 18059938]      +   |   20328      <NA>
##   [12]    chr2R [18060002, 18060339]      +   |   20329      <NA>
##   [13]    chr2R [18060002, 18060346]      +   |   20330      <NA>
## 
## ...
## <15680 more elements>
## -------
## seqinfo: 15 sequences (1 circular) from dm3 genome

colData(so1)

## DataFrame with 2 rows and 0 columns

colData(so1)$sample <- c("one","two")
colData(so1)

## DataFrame with 2 rows and 1 column
##             sample
##        <character>
## first          one
## second         two

metadata(rowRanges(so1))

## $genomeInfo
## $genomeInfo$`Db type`
## [1] "TxDb"
## 
## $genomeInfo$`Supporting package`
## [1] "GenomicFeatures"
## 
## $genomeInfo$`Data source`
## [1] "UCSC"
## 
## $genomeInfo$Genome
## [1] "dm3"
## 
## $genomeInfo$Organism
## [1] "Drosophila melanogaster"
## 
## $genomeInfo$`UCSC Table`
## [1] "ensGene"
## 
## $genomeInfo$`Resource URL`
## [1] "http://genome.ucsc.edu/"
## 
## $genomeInfo$`Type of Gene ID`
## [1] "Ensembl gene ID"
## 
## $genomeInfo$`Full dataset`
## [1] "yes"
## 
## $genomeInfo$`miRBase build ID`
## [1] NA
## 
## $genomeInfo$transcript_nrow
## [1] "29173"
## 
## $genomeInfo$exon_nrow
## [1] "76920"
## 
## $genomeInfo$cds_nrow
## [1] "62135"
## 
## $genomeInfo$`Db created by`
## [1] "GenomicFeatures package from Bioconductor"
## 
## $genomeInfo$`Creation time`
## [1] "2015-03-19 14:00:50 -0700 (Thu, 19 Mar 2015)"
## 
## $genomeInfo$`GenomicFeatures version at creation time`
## [1] "1.19.32"
## 
## $genomeInfo$`RSQLite version at creation time`
## [1] "1.0.0"
## 
## $genomeInfo$DBSCHEMAVERSION
## [1] "1.1"

We can do some basic exploratory data analysis of the counts:

x <- assay(so1)[,1]
hist(x[x > 0], col="grey")

hist(x[x > 0 & x < 10000], col="grey")

plot(assay(so1) + 1, log="xy")

The second file should actually be counted in a special manner, as it contains pairs of reads which come from a single fragment. We do not want to count these twice, so we set singleEnd = FALSE. Additionally, we specify fragments = TRUE which counts reads if only one of the pair aligns to the features, and the other pair aligns to no feature.

# ?untreated3_chr4
# ?summarizeOverlaps
fls <- BamFileList(fl2)
so2 <- summarizeOverlaps(features=grl,
                         reads=fls,
                         ignore.strand=TRUE,
                         singleEnd=FALSE, 
                         fragments=TRUE)
colSums(assay(so2)) # it is only half of the previous (so1)

## untreated3_chr4.bam 
##               65591

colSums(assay(so1)) # this shows that it reads paired ends

##  first second 
## 156469 122872

plot(assay(so1)[,2], assay(so2)[,1], xlim=c(0,5000), ylim=c(0,5000),
     xlab="single end counting", ylab="paired end counting")
abline(0,1)
abline(0,.5)

Footnotes

Methods for counting reads which overlap features

Bioconductor packages:

summarizeOverlaps in the GenomicAlignments package

http://www.bioconductor.org/packages/release/bioc/html/GenomicAlignments.html

featureCounts in the Rsubread package

Liao Y, Smyth GK, Shi W., “featureCounts: an efficient general purpose program for assigning sequence reads to genomic features.” Bioinformatics. 2014 http://www.ncbi.nlm.nih.gov/pubmed/24227677 http://bioinf.wehi.edu.au/featureCounts/

Command line tools:

htseq-count, a program in the htseq Python package

Simon Anders, Paul Theodor Pyl, Wolfgang Huber. HTSeq — A Python framework to work with high-throughput sequencing data bioRxiv preprint (2014), doi: 10.1101/002824

http://www-huber.embl.de/users/anders/HTSeq/doc/count.html

bedtools https://code.google.com/p/bedtools/
bedops https://code.google.com/p/bedops/

——————————————————–

Exercise

Question

We will first examine the read counts in genes generated with summarizeOverlaps using:

genes(), which gives a range of each gene, from the start position to the end position exonsBy(), which produces a GRangesList, with a GRanges of the exons for each gene.

Load the package with the BAM files and the transcript databse:

library(psillaBamSubset)

library(TxDb.Dmelanogaster.UCSC.dm3.ensGene)

txdb = TxDb.Dmelanogaster.UCSC.dm3.ensGene

Create the genes objects and subset to chromosome 4:

g = genes(txdb)
g = g[seqnames(g) == "chr4"]

Create GRangesList with exons by gene, nd subset to the same genes as in ‘g’:

grl = exonsBy(txdb, by = "gene")
grl = grl[names(g)]

Test for the same names:

all.equal(names(g), names(grl))

## [1] TRUE

Load the BamFile we have been using of single-end RNA-seq reads:

bf = BamFile(untreated1_chr4())

Question 3.5B.1

Run summarizeOverlaps() on the reads in the BamFile 'bf', once using the genes object 'g', and once using the GRangesList with the exons per gene, 'grl'. Because the experiment was not strand-specific, use the ignore.strand=TRUE setting. You don’t have to specify any of the other arguments besides ignore.strand.

Plot the reads from g and from grl in a scatterplot, on the log scale. Add an abline with a=0, b=1. Which feature set generated lower counts?

What is the average ratio of the counts in grl over the counts in g, after removing the features where g had zero counts?

so1 = summarizeOverlaps(features = g,
                        reads = bf,
                        ignore.strand=TRUE)
so2 = summarizeOverlaps(features = grl,
                        reads = bf,
                        ignore.strand = TRUE)
plot(assay(so1), assay(so2), log = "xy")

## Warning in xy.coords(x, y, xlabel, ylabel, log): 6 x values <= 0 omitted
## from logarithmic plot

## Warning in xy.coords(x, y, xlabel, ylabel, log): 3 y values <= 0 omitted
## from logarithmic plot

# this gives a warning about 0 counts, do the following instead: 
plot(assay(so1)+1, assay(so2)+1, log = "xy")
abline(a=0,b=1)

x = assay(so1)[,1]
y = assay(so2)[,1]
mean ((y/x)[x>0])

## [1] 0.8929273

# is the x is not selected non-zero, it will give a warning since the ratio will become invalid.

The GRangesList feature set gives a lower count value on average, because this method of counting does not count the reads which fall in introns. The reads have to fall in exons to count, if we use the exons as features. There are many options for counting modes which can be specified by summarizeOverlaps, including fully customized counting rules which the user provides as a function. See ?summarizeOverlaps for more details.

The reason for prefering the lower count, is that reads which fall in introns might not be from the fully mature mRNA, but might be contamination of precursor mRNA or possibly from some other source.

Question 3.5B.2

The number of reads which align to a gene depend upon, among other factors, the sequencing depth (the number of reads total), and the size of the gene.

Using the count table (one column) prepared with the GRangesList ‘grl’ in the problem above, divide the counts by the sum of counts. Now multiply by 1 million. This operation scales each column of the count table such that we get the number of reads expected if all the samples were sequenced to have 1 million reads mapping to genes on chromosome 4. If the sequencing depth was 1 million, the scaled count would be equal. If the sequencing depth was higher, the scaled count will now be lower. If the sequencing depth was lower, the scaled count will now be higher.

Remember, we are only looking at chromosome 4, but suppose now that all the genes were on chromosome 4 (just ignore the other genes), this quantity would represent the RPM (or FPM), the number of reads (or fragments) per million mapped reads. (“FPM” and “fragments” are used more generally, because in a paired-end experiment two reads represent one observed fragment, and in a single-end experiment one read represents one fragment.)

Supposing that all the genes were on chromosome 4, what is the FPM you calculated for the first gene, FBgn0002521?

mr = 1e6 *(length(so2)/sum(assay(so2)[,1]))

count = assay(so2)
# here, we can just use sum() because there is only one column
# otherwise we would use: sweep(count, 2, colSums(count), "/")

fpm = (count/sum(count)) *1e6
head(fpm,1)

##             untreated1_chr4.bam
## FBgn0002521            4275.607

Question 3.5B.3

We said that the count depends on the number of mapped reads, and also on the size of the gene. If you think about reads randomly arising from fragmented RNA molecules, the number of reads (or fragments) should be proportional to the number of basepairs in the exons (roughly), if all other variables are held equal (gene expression and sequencing depth).

Using ‘grl’ with the functions reduce() and width(), come up with the number of basepairs in the exons of each gene. What is the summary() of this number? The mean should be 4383. The reduce() call is necessary to not count basepairs from overlapping exons.

Divide the FPM by the number of basepairs in the exons of each gene, and then finally multiply by 1000. This is called the FPKM, the number of fragments per kilobase of exonic basepairs, per million mapped reads (again we suppose that we’ve observed all the genes, rather than just the ones on chromosome 4).

What is the FPKM for the first gene, FBgn0002521?

sm = summary(reduce(ranges(grl)))
mean(sm[,2])

## [1] 4383.387

ebp = sum(width(reduce(grl)))
count = assay(so2)
fpm = (count/sum(count)) * 1e6
fpkm = (fpm/ebp) * 1e3

head(fpkm)

##             untreated1_chr4.bam
## FBgn0002521         1782.245648
## FBgn0004607            7.325300
## FBgn0004859          432.025865
## FBgn0005558           15.578409
## FBgn0005561            8.363675
## FBgn0005666          114.105006

Note that there are more robust ways of estimating FPM and FPKM, for example using a more robust estimator for the sequencing depth than the sum of counts, and taking into account multiple transcript isoforms, which might be expressed at different levels. Here we have performed the simplest kind of sequencing depth and gene length normalization.

NGS read counting