A 2006 Nature Genetics paper (v38, n11) by Carroll, Meyer, Song et al. describes the notion that a group of genes whose expression correlates with the expression of the estrogen receptor (ER) gene forms a signature of a breast cancer subtype called "luminal". This finding motivates interest in determining the set of all genomic elements to which ER binds, and this has been carried out using ChIP-seq experiments. An important finding of the Carroll et al. paper was that ER binding in the MCF-7 cell line occurred only rarely promoter-proximal regions. This study therefore unearthed previously unexplored cis-regulatory sites.
We will show how to relate binding peak locations to gene transcription start sites in two cell lines analyzed in the ENCODE project.
The examples shown here are reported binding sites for estrogen related receptor binding sites derinved from ChIPSeq experiments on two of the main cell lines used by ENCODE: HepG2 and GM12878. In this section we focus on the analysis that comes after the genomic regions are defined. These data can be found as NarrowPeak files on the public repositories but we have created a R package with GRanges objects already created. If you have not done so already you can install it like this:
library(devtools)
install_github("genomicsclass/ERBS")
Once installed you can attach two object, one for each cell line, and view it's content:
library(ERBS)
data(HepG2) # cell line of liver origin
data(GM12878) # immortalized B cell
HepG2 # locations of ER binding peaks
## GRanges object with 303 ranges and 7 metadata columns:
## seqnames ranges strand | name score col
## <Rle> <IRanges> <Rle> | <numeric> <integer> <logical>
## [1] chr2 20335378-20335787 * | NA 0 <NA>
## [2] chr20 328285-329145 * | NA 0 <NA>
## [3] chr6 168135432-168136587 * | NA 0 <NA>
## [4] chr19 1244419-1245304 * | NA 0 <NA>
## [5] chr11 64071828-64073069 * | NA 0 <NA>
## ... ... ... ... . ... ... ...
## [299] chr4 1797182-1797852 * | NA 0 <NA>
## [300] chr1 110198573-110199126 * | NA 0 <NA>
## [301] chr17 17734052-17734469 * | NA 0 <NA>
## [302] chr1 48306453-48306908 * | NA 0 <NA>
## [303] chr12 123867207-123867554 * | NA 0 <NA>
## signalValue pValue qValue peak
## <numeric> <numeric> <numeric> <integer>
## [1] 58.251 75.899 6.14371e-72 195
## [2] 10.808 69.685 5.02806e-66 321
## [3] 17.103 54.311 7.93067e-51 930
## [4] 12.427 43.855 1.35976e-40 604
## [5] 10.850 40.977 7.33386e-38 492
## ... ... ... ... ...
## [299] 9.681 10.057 1.42334e-08 402
## [300] 7.929 10.047 1.44208e-08 197
## [301] 5.864 9.990 1.63892e-08 227
## [302] 5.660 9.948 1.79941e-08 211
## [303] 13.211 9.918 1.92180e-08 163
## -------
## seqinfo: 93 sequences (1 circular) from hg19 genome
Note that these object contain two main parts, the regions of the genome which we can extract with granges:
granges(HepG2)
## GRanges object with 303 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chr2 20335378-20335787 *
## [2] chr20 328285-329145 *
## [3] chr6 168135432-168136587 *
## [4] chr19 1244419-1245304 *
## [5] chr11 64071828-64073069 *
## ... ... ... ...
## [299] chr4 1797182-1797852 *
## [300] chr1 110198573-110199126 *
## [301] chr17 17734052-17734469 *
## [302] chr1 48306453-48306908 *
## [303] chr12 123867207-123867554 *
## -------
## seqinfo: 93 sequences (1 circular) from hg19 genome
and other information that we can obtain with either the mcols or values functions:
values(HepG2) ##or mcols(HepG2)
## DataFrame with 303 rows and 7 columns
## name score col signalValue pValue qValue peak
## <numeric> <integer> <logical> <numeric> <numeric> <numeric> <integer>
## 1 NA 0 NA 58.251 75.899 6.14371e-72 195
## 2 NA 0 NA 10.808 69.685 5.02806e-66 321
## 3 NA 0 NA 17.103 54.311 7.93067e-51 930
## 4 NA 0 NA 12.427 43.855 1.35976e-40 604
## 5 NA 0 NA 10.850 40.977 7.33386e-38 492
## ... ... ... ... ... ... ... ...
## 299 NA 0 NA 9.681 10.057 1.42334e-08 402
## 300 NA 0 NA 7.929 10.047 1.44208e-08 197
## 301 NA 0 NA 5.864 9.990 1.63892e-08 227
## 302 NA 0 NA 5.660 9.948 1.79941e-08 211
## 303 NA 0 NA 13.211 9.918 1.92180e-08 163
These object are of class GRanges defined in the GenomicsRanges package:
class(HepG2)
## [1] "GRanges"
## attr(,"package")
## [1] "GenomicRanges"
There are a large number of tools available to use for this class and the best way to learn about them is through the vignette browseVignettes("GenomicRanges")
Here we start with some very basic example. Suppose these regions are ordered and we want to work with just the first 10. We can subset these objects as you would expect:
granges( HepG2[1:10] )
## GRanges object with 10 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chr2 20335378-20335787 *
## [2] chr20 328285-329145 *
## [3] chr6 168135432-168136587 *
## [4] chr19 1244419-1245304 *
## [5] chr11 64071828-64073069 *
## [6] chr20 22410891-22411863 *
## [7] chr19 45439028-45439525 *
## [8] chr2 16938364-16938840 *
## [9] chr16 70191209-70192150 *
## [10] chr3 53290050-53290602 *
## -------
## seqinfo: 93 sequences (1 circular) from hg19 genome
Now what if we want to subset the object by chromosomes? We can access the chromosomes with seqnames. Note that not all organisms have chromosomes so Bioconductor uses a more genreal name:
chr <- seqnames(HepG2)
chr
## factor-Rle of length 303 with 292 runs
## Lengths: 1 1 1 1 1 ... 1 1 1 1 1
## Values : chr2 chr20 chr6 chr19 chr11 ... chr4 chr1 chr17 chr1 chr12
## Levels(93): chr1 chr2 chr3 ... chrUn_gl000247 chrUn_gl000248 chrUn_gl000249
Note that chr is not a factor or character vector as we would expect. Instead run-length encoding is used. The reason for this is that it provides a much more efficient way to store large granges, for example those obtained from stroing short reads. Note also that we have chromosome names that are not the standard chromosome names such as chrUn_gl000247. The human genome actually has some sequences that have not been mapped into one of the chromosomes and are given names like these.
The Rle class behaves like factors in several useful ways. For example we can tabulate:
table(chr)[1:24]
## chr
## chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 chr12 chr13
## 18 38 15 4 8 24 14 11 12 9 9 13 2
## chr14 chr15 chr16 chr17 chr18 chr19 chr20 chr21 chr22 chrX chrY
## 8 5 31 21 6 16 27 3 5 4 0
We can also subset the object to data based on these. Here is the subset of regions on chr20:
granges( HepG2[ chr == "chr20" ] )
## GRanges object with 27 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chr20 328285-329145 *
## [2] chr20 22410891-22411863 *
## [3] chr20 56039583-56040249 *
## [4] chr20 16455811-16456232 *
## [5] chr20 3140243-3140774 *
## ... ... ... ...
## [23] chr20 5591571-5592037 *
## [24] chr20 25519664-25520238 *
## [25] chr20 19900951-19901275 *
## [26] chr20 35156796-35157140 *
## [27] chr20 25036720-25037716 *
## -------
## seqinfo: 93 sequences (1 circular) from hg19 genome
Before we continue on to finding gene near our binding sites we are going to construct a consensus GRanges. Specifically, we are going to construct an object represnenting the regions that are reported for both cell lines. We can easily find these regions using the findOverlaps function:
res = findOverlaps(HepG2,GM12878)
res
## Hits object with 75 hits and 0 metadata columns:
## queryHits subjectHits
## <integer> <integer>
## [1] 1 12
## [2] 2 78
## [3] 4 777
## [4] 5 8
## [5] 8 13
## ... ... ...
## [71] 285 621
## [72] 287 174
## [73] 291 1855
## [74] 294 512
## [75] 300 144
## -------
## queryLength: 303 / subjectLength: 1873
TO BE CONTINUED
erbs = HepG2[queryHits(res)]
erbs = granges(erbs)
We can use Homo.sapiens:
library(Homo.sapiens)
## Loading required package: OrganismDbi
## Loading required package: GO.db
##
## Loading required package: org.Hs.eg.db
##
## Loading required package: TxDb.Hsapiens.UCSC.hg19.knownGene
ghs = genes(Homo.sapiens)
## 403 genes were dropped because they have exons located on both strands
## of the same reference sequence or on more than one reference sequence,
## so cannot be represented by a single genomic range.
## Use 'single.strand.genes.only=FALSE' to get all the genes in a
## GRangesList object, or use suppressMessages() to suppress this message.
ghs
## GRanges object with 23056 ranges and 1 metadata column:
## seqnames ranges strand | GENEID
## <Rle> <IRanges> <Rle> | <CharacterList>
## 1 chr19 58858172-58874214 - | 1
## 10 chr8 18248755-18258723 + | 10
## 100 chr20 43248163-43280376 - | 100
## 1000 chr18 25530930-25757445 - | 1000
## 10000 chr1 243651535-244006886 - | 10000
## ... ... ... ... . ...
## 9991 chr9 114979995-115095944 - | 9991
## 9992 chr21 35736323-35743440 + | 9992
## 9993 chr22 19023795-19109967 - | 9993
## 9994 chr6 90539619-90584155 + | 9994
## 9997 chr22 50961997-50964905 - | 9997
## -------
## seqinfo: 93 sequences (1 circular) from hg19 genome
Explain genes have direction and what is a transcription start site
tssgr = resize(ghs,1)
granges(ghs)[1:3,]
## GRanges object with 3 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## 1 chr19 58858172-58874214 -
## 10 chr8 18248755-18258723 +
## 100 chr20 43248163-43280376 -
## -------
## seqinfo: 93 sequences (1 circular) from hg19 genome
granges(tssgr)[1:3,]
## GRanges object with 3 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## 1 chr19 58874214 -
## 10 chr8 18248755 +
## 100 chr20 43280376 -
## -------
## seqinfo: 93 sequences (1 circular) from hg19 genome
index <- precede(HepG2,ghs)
HepG2[1:3,]
## GRanges object with 3 ranges and 7 metadata columns:
## seqnames ranges strand | name score col
## <Rle> <IRanges> <Rle> | <numeric> <integer> <logical>
## [1] chr2 20335378-20335787 * | NA 0 <NA>
## [2] chr20 328285-329145 * | NA 0 <NA>
## [3] chr6 168135432-168136587 * | NA 0 <NA>
## signalValue pValue qValue peak
## <numeric> <numeric> <numeric> <integer>
## [1] 58.251 75.899 6.14371e-72 195
## [2] 10.808 69.685 5.02806e-66 321
## [3] 17.103 54.311 7.93067e-51 930
## -------
## seqinfo: 93 sequences (1 circular) from hg19 genome
granges(HepG2)[1:3,]
## GRanges object with 3 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chr2 20335378-20335787 *
## [2] chr20 328285-329145 *
## [3] chr6 168135432-168136587 *
## -------
## seqinfo: 93 sequences (1 circular) from hg19 genome
granges(ghs)[index[1:3],]
## GRanges object with 3 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## 9741 chr2 20232411-20251789 -
## 57761 chr20 361308-378203 +
## 4301 chr6 168227671-168372700 +
## -------
## seqinfo: 93 sequences (1 circular) from hg19 genome
d = distance(HepG2,ghs[index])
If we want the closest to a start site, regardless of befor or after, for example if overlaps. We want something different.
We can create a granges of TSS like this:
tssgr = resize(ghs,1)
The distanceToNearest function from GenomicFeatures will create a Hits object with metadata giving the positive distance between peak location and nearest TSS. We'll discuss the Hits object in the video.
d = distanceToNearest(HepG2, tssgr)
d
## Hits object with 303 hits and 1 metadata column:
## queryHits subjectHits | distance
## <integer> <integer> | <integer>
## [1] 1 22817 | 83588
## [2] 2 19883 | 914
## [3] 3 8191 | 60951
## [4] 4 13475 | 2669
## [5] 5 6316 | 0
## ... ... ... . ...
## [299] 299 6699 | 2142
## [300] 300 9555 | 0
## [301] 301 17983 | 5855
## [302] 302 11212 | 155653
## [303] 303 11132 | 1149
## -------
## queryLength: 303 / subjectLength: 23056
##note this gives an error
try(d[,3])
## Error : subscript contains out-of-bounds indices
###instead we have to do this
dists = values(d)$distance
We will call a distance negative if the peak is 5' to the nearest TSS. The density estimate given below shows that the vast majority of events are remote from the 1kb region around TSS; we use dashed lines to denote that region.
index = subjectHits(d)
sdists = ifelse(end(HepG2) < start(tssgr[index]), dists, -dists)
hist(sdists, xlim=c(-100000,100000), main="Density of d(ER binding peak, nearest TSS)" ,breaks=seq(min(sdists),max(sdists),len=1000))
abline(v=-c(10000,10000), lty=2)
Get information on genes
index <- subjectHits(d)[dists<1000]
dists = values(d)$distance
##better way to do this?
geneids <- mcols(tssgr[index])$GENEID
library(Homo.sapiens)
?select
## Help on topic 'select' was found in the following packages:
##
## Package Library
## biomaRt /home/jhuang/R/x86_64-pc-linux-gnu-library/4.1
## AnnotationDbi /home/jhuang/R/x86_64-pc-linux-gnu-library/4.1
## dplyr /usr/local/lib/R/site-library
##
##
## Using the first match ...
columns(Homo.sapiens)
## [1] "ACCNUM" "ALIAS" "CDSCHROM" "CDSEND" "CDSID"
## [6] "CDSNAME" "CDSSTART" "CDSSTRAND" "DEFINITION" "ENSEMBL"
## [11] "ENSEMBLPROT" "ENSEMBLTRANS" "ENTREZID" "ENZYME" "EVIDENCE"
## [16] "EVIDENCEALL" "EXONCHROM" "EXONEND" "EXONID" "EXONNAME"
## [21] "EXONRANK" "EXONSTART" "EXONSTRAND" "GENEID" "GENENAME"
## [26] "GENETYPE" "GO" "GOALL" "GOID" "IPI"
## [31] "MAP" "OMIM" "ONTOLOGY" "ONTOLOGYALL" "PATH"
## [36] "PFAM" "PMID" "PROSITE" "REFSEQ" "SYMBOL"
## [41] "TERM" "TXCHROM" "TXEND" "TXID" "TXNAME"
## [46] "TXSTART" "TXSTRAND" "TXTYPE" "UCSCKG" "UNIPROT"
keytypes(Homo.sapiens)
## [1] "ACCNUM" "ALIAS" "CDSID" "CDSNAME" "DEFINITION"
## [6] "ENSEMBL" "ENSEMBLPROT" "ENSEMBLTRANS" "ENTREZID" "ENZYME"
## [11] "EVIDENCE" "EVIDENCEALL" "EXONID" "EXONNAME" "GENEID"
## [16] "GENENAME" "GENETYPE" "GO" "GOALL" "GOID"
## [21] "IPI" "MAP" "OMIM" "ONTOLOGY" "ONTOLOGYALL"
## [26] "PATH" "PFAM" "PMID" "PROSITE" "REFSEQ"
## [31] "SYMBOL" "TERM" "TXID" "TXNAME" "UCSCKG"
## [36] "UNIPROT"
geneids <- as.character(geneids)
select(Homo.sapiens,keys=geneids,columns="GENENAME",keytype="GENEID")
## 'select()' returned 1:1 mapping between keys and columns
## GENEID
## 1 80023
## 2 2101
## 3 7086
## 4 5478
## 5 286262
## 6 2875
## 7 79696
## 8 79763
## 9 60493
## 10 130074
## 11 80344
## 12 58484
## 13 724033
## 14 64284
## 15 4967
## 16 693226
## 17 8508
## 18 26873
## 19 6648
## 20 5833
## 21 9377
## 22 136853
## 23 3032
## 24 91252
## 25 95
## 26 374291
## 27 84331
## 28 114770
## 29 81572
## 30 518
## 31 126295
## 32 9563
## 33 2665
## 34 5524
## 35 10656
## 36 10116
## 37 10939
## 38 58476
## 39 2063
## 40 90121
## 41 81620
## 42 54963
## 43 8731
## 44 6666
## 45 400548
## 46 10312
## 47 23498
## 48 100134015
## 49 56910
## 50 51491
## 51 79814
## 52 8328
## 53 8502
## 54 119559
## 55 100616250
## 56 84883
## 57 10131
## 58 54800
## 59 39
## 60 8028
## 61 57478
## 62 56261
## 63 4705
## 64 25831
## 65 51660
## 66 346606
## 67 63924
## 68 183
## 69 79675
## 70 57019
## 71 23408
## 72 55256
## 73 20
## 74 55090
## 75 3364
## 76 11165
## 77 670
## 78 51112
## 79 56181
## 80 347734
## 81 8165
## 82 2948
## GENENAME
## 1 neurensin 2
## 2 estrogen related receptor alpha
## 3 transketolase
## 4 peptidylprolyl isomerase A
## 5 taperin
## 6 glutamic--pyruvic transaminase
## 7 zinc finger C2HC-type containing 1C
## 8 isochorismatase domain containing 2
## 9 FAST kinase domains 5
## 10 family with sequence similarity 168 member B
## 11 DDB1 and CUL4 associated factor 11
## 12 NLR family CARD domain containing 4
## 13 microRNA 663a
## 14 RAB17, member RAS oncogene family
## 15 oxoglutarate dehydrogenase
## 16 microRNA 641
## 17 nipsnap homolog 1
## 18 5-oxoprolinase, ATP-hydrolysing
## 19 superoxide dismutase 2
## 20 phosphate cytidylyltransferase 2, ethanolamine
## 21 cytochrome c oxidase subunit 5A
## 22 scavenger receptor cysteine rich family member with 4 domains
## 23 hydroxyacyl-CoA dehydrogenase trifunctional multienzyme complex subunit beta
## 24 solute carrier family 39 member 13
## 25 aminoacylase 1
## 26 NADH:ubiquinone oxidoreductase core subunit S7
## 27 MAPK regulated corepressor interacting protein 2
## 28 peptidoglycan recognition protein 2
## 29 p53 and DNA damage regulated 1
## 30 ATP synthase membrane subunit c locus 3
## 31 zinc finger protein 57
## 32 hexose-6-phosphate dehydrogenase/glucose 1-dehydrogenase
## 33 GDP dissociation inhibitor 2
## 34 protein phosphatase 2 phosphatase activator
## 35 KH RNA binding domain containing, signal transduction associated 3
## 36 fem-1 homolog B
## 37 AFG3 like matrix AAA peptidase subunit 2
## 38 tumor protein p53 inducible nuclear protein 2
## 39 nuclear receptor subfamily 2 group F member 6
## 40 TSR2 ribosome maturation factor
## 41 chromatin licensing and DNA replication factor 1
## 42 uridine-cytidine kinase 1 like 1
## 43 RNA guanine-7 methyltransferase
## 44 SRY-box transcription factor 12
## 45 long intergenic non-protein coding RNA 2139
## 46 T cell immune regulator 1, ATPase H+ transporting V0 subunit a3
## 47 3-hydroxyanthranilate 3,4-dioxygenase
## 48 UBOX5 antisense RNA 1
## 49 StAR related lipid transfer domain containing 7
## 50 NOP16 nucleolar protein
## 51 agmatinase
## 52 growth factor independent 1B transcriptional repressor
## 53 plakophilin 4
## 54 sideroflexin 4
## 55 microRNA 3960
## 56 apoptosis inducing factor mitochondria associated 2
## 57 TNF receptor associated protein 1
## 58 kelch like family member 24
## 59 acetyl-CoA acetyltransferase 2
## 60 MLLT10 histone lysine methyltransferase DOT1L cofactor
## 61 ubiquitin specific peptidase 31
## 62 glycerophosphocholine phosphodiesterase 1
## 63 NADH:ubiquinone oxidoreductase subunit A10
## 64 HECT domain E3 ubiquitin protein ligase 1
## 65 mitochondrial pyruvate carrier 1
## 66 monoacylglycerol O-acyltransferase 3
## 67 cell death inducing DFFA like effector c
## 68 angiotensinogen
## 69 FAST kinase domains 1
## 70 cytokine induced apoptosis inhibitor 1
## 71 sirtuin 5
## 72 acireductone dioxygenase 1
## 73 ATP binding cassette subfamily A member 2
## 74 mediator complex subunit 9
## 75 HUS1 checkpoint clamp component
## 76 nudix hydrolase 3
## 77 biphenyl hydrolase like
## 78 trafficking protein particle complex subunit 12
## 79 mitochondrial fission regulator 1 like
## 80 solute carrier family 35 member B2
## 81 A-kinase anchoring protein 1
## 82 glutathione S-transferase mu 4