true

Introduction: Interval data on ER binding

A 2006 Nature Genetics paper (v38, n11) by Carroll, Meyer, Song et al. describes the notion that a group of genes whose expression correlates with the expression of the estrogen receptor (ER) gene forms a signature of a breast cancer subtype called "luminal". This finding motivates interest in determining the set of all genomic elements to which ER binds, and this has been carried out using ChIP-seq experiments. An important finding of the Carroll et al. paper was that ER binding in the MCF-7 cell line occurred only rarely promoter-proximal regions. This study therefore unearthed previously unexplored cis-regulatory sites.

We will show how to relate binding peak locations to gene transcription start sites in two cell lines analyzed in the ENCODE project.

Acquisition of binding data

The examples shown here are reported binding sites for estrogen related receptor binding sites derinved from ChIPSeq experiments on two of the main cell lines used by ENCODE: HepG2 and GM12878. In this section we focus on the analysis that comes after the genomic regions are defined. These data can be found as NarrowPeak files on the public repositories but we have created a R package with GRanges objects already created. If you have not done so already you can install it like this:

library(devtools)
install_github("genomicsclass/ERBS")

Once installed you can attach two object, one for each cell line, and view it's content:

library(ERBS)
data(HepG2)  # cell line of liver origin
data(GM12878)  # immortalized B cell
HepG2  # locations of ER binding peaks
## GRanges object with 303 ranges and 7 metadata columns:
##         seqnames              ranges strand |      name     score       col
##            <Rle>           <IRanges>  <Rle> | <numeric> <integer> <logical>
##     [1]     chr2   20335378-20335787      * |        NA         0      <NA>
##     [2]    chr20       328285-329145      * |        NA         0      <NA>
##     [3]     chr6 168135432-168136587      * |        NA         0      <NA>
##     [4]    chr19     1244419-1245304      * |        NA         0      <NA>
##     [5]    chr11   64071828-64073069      * |        NA         0      <NA>
##     ...      ...                 ...    ... .       ...       ...       ...
##   [299]     chr4     1797182-1797852      * |        NA         0      <NA>
##   [300]     chr1 110198573-110199126      * |        NA         0      <NA>
##   [301]    chr17   17734052-17734469      * |        NA         0      <NA>
##   [302]     chr1   48306453-48306908      * |        NA         0      <NA>
##   [303]    chr12 123867207-123867554      * |        NA         0      <NA>
##         signalValue    pValue      qValue      peak
##           <numeric> <numeric>   <numeric> <integer>
##     [1]      58.251    75.899 6.14371e-72       195
##     [2]      10.808    69.685 5.02806e-66       321
##     [3]      17.103    54.311 7.93067e-51       930
##     [4]      12.427    43.855 1.35976e-40       604
##     [5]      10.850    40.977 7.33386e-38       492
##     ...         ...       ...         ...       ...
##   [299]       9.681    10.057 1.42334e-08       402
##   [300]       7.929    10.047 1.44208e-08       197
##   [301]       5.864     9.990 1.63892e-08       227
##   [302]       5.660     9.948 1.79941e-08       211
##   [303]      13.211     9.918 1.92180e-08       163
##   -------
##   seqinfo: 93 sequences (1 circular) from hg19 genome

Note that these object contain two main parts, the regions of the genome which we can extract with granges:

granges(HepG2)
## GRanges object with 303 ranges and 0 metadata columns:
##         seqnames              ranges strand
##            <Rle>           <IRanges>  <Rle>
##     [1]     chr2   20335378-20335787      *
##     [2]    chr20       328285-329145      *
##     [3]     chr6 168135432-168136587      *
##     [4]    chr19     1244419-1245304      *
##     [5]    chr11   64071828-64073069      *
##     ...      ...                 ...    ...
##   [299]     chr4     1797182-1797852      *
##   [300]     chr1 110198573-110199126      *
##   [301]    chr17   17734052-17734469      *
##   [302]     chr1   48306453-48306908      *
##   [303]    chr12 123867207-123867554      *
##   -------
##   seqinfo: 93 sequences (1 circular) from hg19 genome

and other information that we can obtain with either the mcols or values functions:

values(HepG2) ##or mcols(HepG2)
## DataFrame with 303 rows and 7 columns
##          name     score       col signalValue    pValue      qValue      peak
##     <numeric> <integer> <logical>   <numeric> <numeric>   <numeric> <integer>
## 1          NA         0        NA      58.251    75.899 6.14371e-72       195
## 2          NA         0        NA      10.808    69.685 5.02806e-66       321
## 3          NA         0        NA      17.103    54.311 7.93067e-51       930
## 4          NA         0        NA      12.427    43.855 1.35976e-40       604
## 5          NA         0        NA      10.850    40.977 7.33386e-38       492
## ...       ...       ...       ...         ...       ...         ...       ...
## 299        NA         0        NA       9.681    10.057 1.42334e-08       402
## 300        NA         0        NA       7.929    10.047 1.44208e-08       197
## 301        NA         0        NA       5.864     9.990 1.63892e-08       227
## 302        NA         0        NA       5.660     9.948 1.79941e-08       211
## 303        NA         0        NA      13.211     9.918 1.92180e-08       163

# Basic operations

These object are of class GRanges defined in the GenomicsRanges package:

class(HepG2)
## [1] "GRanges"
## attr(,"package")
## [1] "GenomicRanges"

There are a large number of tools available to use for this class and the best way to learn about them is through the vignette browseVignettes("GenomicRanges")

Here we start with some very basic example. Suppose these regions are ordered and we want to work with just the first 10. We can subset these objects as you would expect:

granges( HepG2[1:10] )
## GRanges object with 10 ranges and 0 metadata columns:
##        seqnames              ranges strand
##           <Rle>           <IRanges>  <Rle>
##    [1]     chr2   20335378-20335787      *
##    [2]    chr20       328285-329145      *
##    [3]     chr6 168135432-168136587      *
##    [4]    chr19     1244419-1245304      *
##    [5]    chr11   64071828-64073069      *
##    [6]    chr20   22410891-22411863      *
##    [7]    chr19   45439028-45439525      *
##    [8]     chr2   16938364-16938840      *
##    [9]    chr16   70191209-70192150      *
##   [10]     chr3   53290050-53290602      *
##   -------
##   seqinfo: 93 sequences (1 circular) from hg19 genome

Now what if we want to subset the object by chromosomes? We can access the chromosomes with seqnames. Note that not all organisms have chromosomes so Bioconductor uses a more genreal name:

chr <- seqnames(HepG2)
chr
## factor-Rle of length 303 with 292 runs
##   Lengths:     1     1     1     1     1 ...     1     1     1     1     1
##   Values : chr2  chr20 chr6  chr19 chr11 ... chr4  chr1  chr17 chr1  chr12
## Levels(93): chr1 chr2 chr3 ... chrUn_gl000247 chrUn_gl000248 chrUn_gl000249

Note that chr is not a factor or character vector as we would expect. Instead run-length encoding is used. The reason for this is that it provides a much more efficient way to store large granges, for example those obtained from stroing short reads. Note also that we have chromosome names that are not the standard chromosome names such as chrUn_gl000247. The human genome actually has some sequences that have not been mapped into one of the chromosomes and are given names like these.

The Rle class behaves like factors in several useful ways. For example we can tabulate:

table(chr)[1:24]
## chr
##  chr1  chr2  chr3  chr4  chr5  chr6  chr7  chr8  chr9 chr10 chr11 chr12 chr13 
##    18    38    15     4     8    24    14    11    12     9     9    13     2 
## chr14 chr15 chr16 chr17 chr18 chr19 chr20 chr21 chr22  chrX  chrY 
##     8     5    31    21     6    16    27     3     5     4     0

We can also subset the object to data based on these. Here is the subset of regions on chr20:

granges( HepG2[ chr == "chr20" ] )
## GRanges object with 27 ranges and 0 metadata columns:
##        seqnames            ranges strand
##           <Rle>         <IRanges>  <Rle>
##    [1]    chr20     328285-329145      *
##    [2]    chr20 22410891-22411863      *
##    [3]    chr20 56039583-56040249      *
##    [4]    chr20 16455811-16456232      *
##    [5]    chr20   3140243-3140774      *
##    ...      ...               ...    ...
##   [23]    chr20   5591571-5592037      *
##   [24]    chr20 25519664-25520238      *
##   [25]    chr20 19900951-19901275      *
##   [26]    chr20 35156796-35157140      *
##   [27]    chr20 25036720-25037716      *
##   -------
##   seqinfo: 93 sequences (1 circular) from hg19 genome

Genomic Ranges tools

Before we continue on to finding gene near our binding sites we are going to construct a consensus GRanges. Specifically, we are going to construct an object represnenting the regions that are reported for both cell lines. We can easily find these regions using the findOverlaps function:

res = findOverlaps(HepG2,GM12878)
res
## Hits object with 75 hits and 0 metadata columns:
##        queryHits subjectHits
##        <integer>   <integer>
##    [1]         1          12
##    [2]         2          78
##    [3]         4         777
##    [4]         5           8
##    [5]         8          13
##    ...       ...         ...
##   [71]       285         621
##   [72]       287         174
##   [73]       291        1855
##   [74]       294         512
##   [75]       300         144
##   -------
##   queryLength: 303 / subjectLength: 1873

TO BE CONTINUED

erbs = HepG2[queryHits(res)]
erbs = granges(erbs)

Acquisition of gene transcription start sites

We can use Homo.sapiens:

library(Homo.sapiens)
## Loading required package: OrganismDbi
## Loading required package: GO.db
## 
## Loading required package: org.Hs.eg.db
## 
## Loading required package: TxDb.Hsapiens.UCSC.hg19.knownGene
ghs = genes(Homo.sapiens)
##   403 genes were dropped because they have exons located on both strands
##   of the same reference sequence or on more than one reference sequence,
##   so cannot be represented by a single genomic range.
##   Use 'single.strand.genes.only=FALSE' to get all the genes in a
##   GRangesList object, or use suppressMessages() to suppress this message.
ghs
## GRanges object with 23056 ranges and 1 metadata column:
##         seqnames              ranges strand |          GENEID
##            <Rle>           <IRanges>  <Rle> | <CharacterList>
##       1    chr19   58858172-58874214      - |               1
##      10     chr8   18248755-18258723      + |              10
##     100    chr20   43248163-43280376      - |             100
##    1000    chr18   25530930-25757445      - |            1000
##   10000     chr1 243651535-244006886      - |           10000
##     ...      ...                 ...    ... .             ...
##    9991     chr9 114979995-115095944      - |            9991
##    9992    chr21   35736323-35743440      + |            9992
##    9993    chr22   19023795-19109967      - |            9993
##    9994     chr6   90539619-90584155      + |            9994
##    9997    chr22   50961997-50964905      - |            9997
##   -------
##   seqinfo: 93 sequences (1 circular) from hg19 genome

Explain genes have direction and what is a transcription start site

tssgr = resize(ghs,1)
granges(ghs)[1:3,]
## GRanges object with 3 ranges and 0 metadata columns:
##       seqnames            ranges strand
##          <Rle>         <IRanges>  <Rle>
##     1    chr19 58858172-58874214      -
##    10     chr8 18248755-18258723      +
##   100    chr20 43248163-43280376      -
##   -------
##   seqinfo: 93 sequences (1 circular) from hg19 genome
granges(tssgr)[1:3,]
## GRanges object with 3 ranges and 0 metadata columns:
##       seqnames    ranges strand
##          <Rle> <IRanges>  <Rle>
##     1    chr19  58874214      -
##    10     chr8  18248755      +
##   100    chr20  43280376      -
##   -------
##   seqinfo: 93 sequences (1 circular) from hg19 genome
index <- precede(HepG2,ghs)
HepG2[1:3,]
## GRanges object with 3 ranges and 7 metadata columns:
##       seqnames              ranges strand |      name     score       col
##          <Rle>           <IRanges>  <Rle> | <numeric> <integer> <logical>
##   [1]     chr2   20335378-20335787      * |        NA         0      <NA>
##   [2]    chr20       328285-329145      * |        NA         0      <NA>
##   [3]     chr6 168135432-168136587      * |        NA         0      <NA>
##       signalValue    pValue      qValue      peak
##         <numeric> <numeric>   <numeric> <integer>
##   [1]      58.251    75.899 6.14371e-72       195
##   [2]      10.808    69.685 5.02806e-66       321
##   [3]      17.103    54.311 7.93067e-51       930
##   -------
##   seqinfo: 93 sequences (1 circular) from hg19 genome
granges(HepG2)[1:3,]
## GRanges object with 3 ranges and 0 metadata columns:
##       seqnames              ranges strand
##          <Rle>           <IRanges>  <Rle>
##   [1]     chr2   20335378-20335787      *
##   [2]    chr20       328285-329145      *
##   [3]     chr6 168135432-168136587      *
##   -------
##   seqinfo: 93 sequences (1 circular) from hg19 genome
granges(ghs)[index[1:3],]
## GRanges object with 3 ranges and 0 metadata columns:
##         seqnames              ranges strand
##            <Rle>           <IRanges>  <Rle>
##    9741     chr2   20232411-20251789      -
##   57761    chr20       361308-378203      +
##    4301     chr6 168227671-168372700      +
##   -------
##   seqinfo: 93 sequences (1 circular) from hg19 genome
d = distance(HepG2,ghs[index])

If we want the closest to a start site, regardless of befor or after, for example if overlaps. We want something different.

We can create a granges of TSS like this:

tssgr = resize(ghs,1)

Finding nearest gene for each binding event

The distanceToNearest function from GenomicFeatures will create a Hits object with metadata giving the positive distance between peak location and nearest TSS. We'll discuss the Hits object in the video.

d = distanceToNearest(HepG2, tssgr)
d
## Hits object with 303 hits and 1 metadata column:
##         queryHits subjectHits |  distance
##         <integer>   <integer> | <integer>
##     [1]         1       22817 |     83588
##     [2]         2       19883 |       914
##     [3]         3        8191 |     60951
##     [4]         4       13475 |      2669
##     [5]         5        6316 |         0
##     ...       ...         ... .       ...
##   [299]       299        6699 |      2142
##   [300]       300        9555 |         0
##   [301]       301       17983 |      5855
##   [302]       302       11212 |    155653
##   [303]       303       11132 |      1149
##   -------
##   queryLength: 303 / subjectLength: 23056
##note this gives an error
try(d[,3])
## Error : subscript contains out-of-bounds indices
###instead we have to do this
dists = values(d)$distance

We will call a distance negative if the peak is 5' to the nearest TSS. The density estimate given below shows that the vast majority of events are remote from the 1kb region around TSS; we use dashed lines to denote that region.

index = subjectHits(d)
sdists = ifelse(end(HepG2) < start(tssgr[index]), dists, -dists)
hist(sdists, xlim=c(-100000,100000), main="Density of d(ER binding peak, nearest TSS)" ,breaks=seq(min(sdists),max(sdists),len=1000))
abline(v=-c(10000,10000), lty=2)

Getting information about genes

Annotating genes

Get information on genes

index <- subjectHits(d)[dists<1000]
dists = values(d)$distance
##better way to do this?
geneids <- mcols(tssgr[index])$GENEID
library(Homo.sapiens)
?select
## Help on topic 'select' was found in the following packages:
## 
##   Package               Library
##   biomaRt               /home/jhuang/R/x86_64-pc-linux-gnu-library/4.1
##   AnnotationDbi         /home/jhuang/R/x86_64-pc-linux-gnu-library/4.1
##   dplyr                 /usr/local/lib/R/site-library
## 
## 
## Using the first match ...
columns(Homo.sapiens)
##  [1] "ACCNUM"       "ALIAS"        "CDSCHROM"     "CDSEND"       "CDSID"       
##  [6] "CDSNAME"      "CDSSTART"     "CDSSTRAND"    "DEFINITION"   "ENSEMBL"     
## [11] "ENSEMBLPROT"  "ENSEMBLTRANS" "ENTREZID"     "ENZYME"       "EVIDENCE"    
## [16] "EVIDENCEALL"  "EXONCHROM"    "EXONEND"      "EXONID"       "EXONNAME"    
## [21] "EXONRANK"     "EXONSTART"    "EXONSTRAND"   "GENEID"       "GENENAME"    
## [26] "GENETYPE"     "GO"           "GOALL"        "GOID"         "IPI"         
## [31] "MAP"          "OMIM"         "ONTOLOGY"     "ONTOLOGYALL"  "PATH"        
## [36] "PFAM"         "PMID"         "PROSITE"      "REFSEQ"       "SYMBOL"      
## [41] "TERM"         "TXCHROM"      "TXEND"        "TXID"         "TXNAME"      
## [46] "TXSTART"      "TXSTRAND"     "TXTYPE"       "UCSCKG"       "UNIPROT"
keytypes(Homo.sapiens)
##  [1] "ACCNUM"       "ALIAS"        "CDSID"        "CDSNAME"      "DEFINITION"  
##  [6] "ENSEMBL"      "ENSEMBLPROT"  "ENSEMBLTRANS" "ENTREZID"     "ENZYME"      
## [11] "EVIDENCE"     "EVIDENCEALL"  "EXONID"       "EXONNAME"     "GENEID"      
## [16] "GENENAME"     "GENETYPE"     "GO"           "GOALL"        "GOID"        
## [21] "IPI"          "MAP"          "OMIM"         "ONTOLOGY"     "ONTOLOGYALL" 
## [26] "PATH"         "PFAM"         "PMID"         "PROSITE"      "REFSEQ"      
## [31] "SYMBOL"       "TERM"         "TXID"         "TXNAME"       "UCSCKG"      
## [36] "UNIPROT"
geneids <- as.character(geneids)
select(Homo.sapiens,keys=geneids,columns="GENENAME",keytype="GENEID")
## 'select()' returned 1:1 mapping between keys and columns
##       GENEID
## 1      80023
## 2       2101
## 3       7086
## 4       5478
## 5     286262
## 6       2875
## 7      79696
## 8      79763
## 9      60493
## 10    130074
## 11     80344
## 12     58484
## 13    724033
## 14     64284
## 15      4967
## 16    693226
## 17      8508
## 18     26873
## 19      6648
## 20      5833
## 21      9377
## 22    136853
## 23      3032
## 24     91252
## 25        95
## 26    374291
## 27     84331
## 28    114770
## 29     81572
## 30       518
## 31    126295
## 32      9563
## 33      2665
## 34      5524
## 35     10656
## 36     10116
## 37     10939
## 38     58476
## 39      2063
## 40     90121
## 41     81620
## 42     54963
## 43      8731
## 44      6666
## 45    400548
## 46     10312
## 47     23498
## 48 100134015
## 49     56910
## 50     51491
## 51     79814
## 52      8328
## 53      8502
## 54    119559
## 55 100616250
## 56     84883
## 57     10131
## 58     54800
## 59        39
## 60      8028
## 61     57478
## 62     56261
## 63      4705
## 64     25831
## 65     51660
## 66    346606
## 67     63924
## 68       183
## 69     79675
## 70     57019
## 71     23408
## 72     55256
## 73        20
## 74     55090
## 75      3364
## 76     11165
## 77       670
## 78     51112
## 79     56181
## 80    347734
## 81      8165
## 82      2948
##                                                                        GENENAME
## 1                                                                   neurensin 2
## 2                                               estrogen related receptor alpha
## 3                                                                 transketolase
## 4                                                    peptidylprolyl isomerase A
## 5                                                                       taperin
## 6                                                glutamic--pyruvic transaminase
## 7                                           zinc finger C2HC-type containing 1C
## 8                                           isochorismatase domain containing 2
## 9                                                         FAST kinase domains 5
## 10                                 family with sequence similarity 168 member B
## 11                                           DDB1 and CUL4 associated factor 11
## 12                                          NLR family CARD domain containing 4
## 13                                                                microRNA 663a
## 14                                            RAB17, member RAS oncogene family
## 15                                                   oxoglutarate dehydrogenase
## 16                                                                 microRNA 641
## 17                                                            nipsnap homolog 1
## 18                                              5-oxoprolinase, ATP-hydrolysing
## 19                                                       superoxide dismutase 2
## 20                               phosphate cytidylyltransferase 2, ethanolamine
## 21                                              cytochrome c oxidase subunit 5A
## 22                scavenger receptor cysteine rich family member with 4 domains
## 23 hydroxyacyl-CoA dehydrogenase trifunctional multienzyme complex subunit beta
## 24                                           solute carrier family 39 member 13
## 25                                                               aminoacylase 1
## 26                               NADH:ubiquinone oxidoreductase core subunit S7
## 27                             MAPK regulated corepressor interacting protein 2
## 28                                          peptidoglycan recognition protein 2
## 29                                               p53 and DNA damage regulated 1
## 30                                      ATP synthase membrane subunit c locus 3
## 31                                                       zinc finger protein 57
## 32                     hexose-6-phosphate dehydrogenase/glucose 1-dehydrogenase
## 33                                                 GDP dissociation inhibitor 2
## 34                                  protein phosphatase 2 phosphatase activator
## 35           KH RNA binding domain containing, signal transduction associated 3
## 36                                                              fem-1 homolog B
## 37                                     AFG3 like matrix AAA peptidase subunit 2
## 38                                tumor protein p53 inducible nuclear protein 2
## 39                                nuclear receptor subfamily 2 group F member 6
## 40                                              TSR2 ribosome maturation factor
## 41                             chromatin licensing and DNA replication factor 1
## 42                                             uridine-cytidine kinase 1 like 1
## 43                                              RNA guanine-7 methyltransferase
## 44                                              SRY-box transcription factor 12
## 45                                  long intergenic non-protein coding RNA 2139
## 46              T cell immune regulator 1, ATPase H+ transporting V0 subunit a3
## 47                                        3-hydroxyanthranilate 3,4-dioxygenase
## 48                                                        UBOX5 antisense RNA 1
## 49                              StAR related lipid transfer domain containing 7
## 50                                                      NOP16 nucleolar protein
## 51                                                                   agmatinase
## 52                       growth factor independent 1B transcriptional repressor
## 53                                                                plakophilin 4
## 54                                                               sideroflexin 4
## 55                                                                microRNA 3960
## 56                          apoptosis inducing factor mitochondria associated 2
## 57                                            TNF receptor associated protein 1
## 58                                                  kelch like family member 24
## 59                                               acetyl-CoA acetyltransferase 2
## 60                       MLLT10 histone lysine methyltransferase DOT1L cofactor
## 61                                              ubiquitin specific peptidase 31
## 62                                    glycerophosphocholine phosphodiesterase 1
## 63                                   NADH:ubiquinone oxidoreductase subunit A10
## 64                                    HECT domain E3 ubiquitin protein ligase 1
## 65                                             mitochondrial pyruvate carrier 1
## 66                                         monoacylglycerol O-acyltransferase 3
## 67                                     cell death inducing DFFA like effector c
## 68                                                              angiotensinogen
## 69                                                        FAST kinase domains 1
## 70                                       cytokine induced apoptosis inhibitor 1
## 71                                                                    sirtuin 5
## 72                                                   acireductone dioxygenase 1
## 73                                    ATP binding cassette subfamily A member 2
## 74                                                   mediator complex subunit 9
## 75                                              HUS1 checkpoint clamp component
## 76                                                            nudix hydrolase 3
## 77                                                      biphenyl hydrolase like
## 78                              trafficking protein particle complex subunit 12
## 79                                       mitochondrial fission regulator 1 like
## 80                                           solute carrier family 35 member B2
## 81                                                 A-kinase anchoring protein 1
## 82                                               glutathione S-transferase mu 4