The question that we address in this component is very precise and can be answered with one experimental paradigm. Given genomic DNA extracted from human liver cells (in this case, the widely used cell lines HepG2 and GM12878), where on the genome does the nuclear protein known as ESRRA (estrogen related receptor alpha) bind. We motivate this by describing an important publication that furthered our understanding of the role of estrogen receptors in breast cancer.
The data we examine this week comes from the analysis of ChIP-seq experiments. We will learn more about how to process raw experimental data. These in the week of this course and our short two week courses. At present we will work with results of a ChIP-seq experiment performed in the ENCODE proj ect. We will see how to import information for files in the “narrowPeak” format, will examine the results in the form of a Bioconductor GRanges object, and will illustrate how to identify, for each binding peak, the nearest transcriptional start site (as recorded for UCSC defined genes). This enables us to assess whether regulatory activity of ESRRA occurs in regions conventionally regarded as transcriptional promoter regions.
This week we are using two sets of genomic regions as examples. These are the reported Estrogen Related Receptor binding sites obtained for a ENCODE ChIP-seq experiment on two cell lines: HepG2 and GM12787. We have put these regions into an R package for your convenience. If you have not done so already, please download and install the package:
library(devtools)
install_github("genomicsclass/ERBS")
Before we get started we want to ensure that you have the latest version of Bioconductor installed, since obtaining the correct answers to these questions depends on this. Load the library. Use the biocVersion function to determine your version and enter it below (enter it as decimal like this: 1.2):
library(BiocInstaller)
## Bioconductor version 3.1 (BiocInstaller 1.18.4), ?biocLite for help
biocVersion()
## [1] '3.1'
In Bioconductor we make use of what are referred to as S4 classes and methods. You can learn more about this style of object oriented programming (OOP) here. Although we do not expect you to become an expert in the S4 system, it is helpful to have some general understanding as it will help you, for example, retrieve help files for our classes.
You can learn about the Bioconductor class system here. We are going to be working mostly with the GRanges S4 class. Mike has prepared a reference card type guide.
library(ERBS)
Load two data sets
data("GM12878")
data("HepG2")
Explore the HepG2 project
How many regions are represented
class(HepG2)
## [1] "GRanges"
## attr(,"package")
## [1] "GenomicRanges"
# You can read by just typing and reading the provided information
HepG2
## Loading required package: GenomicRanges
## Loading required package: BiocGenerics
## Loading required package: parallel
##
## Attaching package: 'BiocGenerics'
##
## The following objects are masked from 'package:parallel':
##
## clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
## clusterExport, clusterMap, parApply, parCapply, parLapply,
## parLapplyLB, parRapply, parSapply, parSapplyLB
##
## The following object is masked from 'package:stats':
##
## xtabs
##
## The following objects are masked from 'package:base':
##
## anyDuplicated, append, as.data.frame, as.vector, cbind,
## colnames, do.call, duplicated, eval, evalq, Filter, Find, get,
## intersect, is.unsorted, lapply, Map, mapply, match, mget,
## order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
## rbind, Reduce, rep.int, rownames, sapply, setdiff, sort,
## table, tapply, union, unique, unlist, unsplit
##
## Loading required package: S4Vectors
## Loading required package: stats4
## Creating a generic function for 'nchar' from package 'base' in package 'S4Vectors'
## Loading required package: IRanges
## Loading required package: GenomeInfoDb
## GRanges object with 303 ranges and 7 metadata columns:
## seqnames ranges strand | name score
## <Rle> <IRanges> <Rle> | <numeric> <integer>
## [1] chr2 [ 20335378, 20335787] * | <NA> 0
## [2] chr20 [ 328285, 329145] * | <NA> 0
## [3] chr6 [168135432, 168136587] * | <NA> 0
## [4] chr19 [ 1244419, 1245304] * | <NA> 0
## [5] chr11 [ 64071828, 64073069] * | <NA> 0
## ... ... ... ... ... ... ...
## [299] chr4 [ 1797182, 1797852] * | <NA> 0
## [300] chr1 [110198573, 110199126] * | <NA> 0
## [301] chr17 [ 17734052, 17734469] * | <NA> 0
## [302] chr1 [ 48306453, 48306908] * | <NA> 0
## [303] chr12 [123867207, 123867554] * | <NA> 0
## col signalValue pValue qValue peak
## <logical> <numeric> <numeric> <numeric> <integer>
## [1] <NA> 58.251 75.899 6.143712e-72 195
## [2] <NA> 10.808 69.685 5.028065e-66 321
## [3] <NA> 17.103 54.311 7.930665e-51 930
## [4] <NA> 12.427 43.855 1.359756e-40 604
## [5] <NA> 10.85 40.977 7.333863e-38 492
## ... ... ... ... ... ...
## [299] <NA> 9.681 10.057 1.423343e-08 402
## [300] <NA> 7.929 10.047 1.442076e-08 197
## [301] <NA> 5.864 9.99 1.638918e-08 227
## [302] <NA> 5.66 9.948 1.799414e-08 211
## [303] <NA> 13.211 9.918 1.921805e-08 163
## -------
## seqinfo: 93 sequences (1 circular) from hg19 genome
# or you can also type
length(HepG2)
## [1] 303
We will continue using the ERBS example. We can see that it is not ordered. We can access the columns by extracting data using value.
values(HepG2)
## DataFrame with 303 rows and 7 columns
## name score col signalValue pValue qValue
## <numeric> <integer> <logical> <numeric> <numeric> <numeric>
## 1 NA 0 NA 58.251 75.899 6.143712e-72
## 2 NA 0 NA 10.808 69.685 5.028065e-66
## 3 NA 0 NA 17.103 54.311 7.930665e-51
## 4 NA 0 NA 12.427 43.855 1.359756e-40
## 5 NA 0 NA 10.850 40.977 7.333863e-38
## ... ... ... ... ... ... ...
## 299 NA 0 NA 9.681 10.057 1.423343e-08
## 300 NA 0 NA 7.929 10.047 1.442076e-08
## 301 NA 0 NA 5.864 9.990 1.638918e-08
## 302 NA 0 NA 5.660 9.948 1.799414e-08
## 303 NA 0 NA 13.211 9.918 1.921805e-08
## peak
## <integer>
## 1 195
## 2 321
## 3 930
## 4 604
## 5 492
## ... ...
## 299 402
## 300 197
## 301 227
## 302 211
## 303 163
Some more basic funcitons are below.
Some of these information can be extracted by treating this data set as matrix. So we can use square brackets for subsetting. Top 10 rows can be extracted by
HepG2[1:10,]
## GRanges object with 10 ranges and 7 metadata columns:
## seqnames ranges strand | name score
## <Rle> <IRanges> <Rle> | <numeric> <integer>
## [1] chr2 [ 20335378, 20335787] * | <NA> 0
## [2] chr20 [ 328285, 329145] * | <NA> 0
## [3] chr6 [168135432, 168136587] * | <NA> 0
## [4] chr19 [ 1244419, 1245304] * | <NA> 0
## [5] chr11 [ 64071828, 64073069] * | <NA> 0
## [6] chr20 [ 22410891, 22411863] * | <NA> 0
## [7] chr19 [ 45439028, 45439525] * | <NA> 0
## [8] chr2 [ 16938364, 16938840] * | <NA> 0
## [9] chr16 [ 70191209, 70192150] * | <NA> 0
## [10] chr3 [ 53290050, 53290602] * | <NA> 0
## col signalValue pValue qValue peak
## <logical> <numeric> <numeric> <numeric> <integer>
## [1] <NA> 58.251 75.899 6.143712e-72 195
## [2] <NA> 10.808 69.685 5.028065e-66 321
## [3] <NA> 17.103 54.311 7.930665e-51 930
## [4] <NA> 12.427 43.855 1.359756e-40 604
## [5] <NA> 10.85 40.977 7.333863e-38 492
## [6] <NA> 6.419 41.02 7.749606e-38 660
## [7] <NA> 8.78 39.204 3.804879e-36 267
## [8] <NA> 12.783 38.004 5.360291e-35 255
## [9] <NA> 8.371 37.774 8.192772e-35 688
## [10] <NA> 16.196 37.633 9.446036e-35 281
## -------
## seqinfo: 93 sequences (1 circular) from hg19 genome
How do I get a region from from Chr 20. this is done by seqnames
chr= seqnames(HepG2)
This will give the chromosome name of each row. This is an Rle object. We can coerce it by as.character(). The Rle class would order different chromosomes together automatically. Many functions can be done even with Rle format. Say table function would work perfectly well
as.character(chr)
## [1] "chr2" "chr20" "chr6" "chr19" "chr11" "chr20" "chr19" "chr2"
## [9] "chr16" "chr3" "chr6" "chr20" "chr7" "chr16" "chr9" "chr11"
## [17] "chr22" "chrX" "chr8" "chr16" "chr16" "chr19" "chr17" "chr17"
## [25] "chr16" "chr1" "chr16" "chr9" "chr17" "chr16" "chr12" "chr6"
## [33] "chr2" "chr3" "chr11" "chr16" "chr6" "chr2" "chr8" "chr1"
## [41] "chr17" "chr20" "chr4" "chr14" "chr19" "chr20" "chr9" "chr2"
## [49] "chr2" "chr19" "chr8" "chr14" "chr22" "chr2" "chr14" "chr6"
## [57] "chr20" "chr2" "chr19" "chr8" "chr2" "chr19" "chr12" "chr2"
## [65] "chr2" "chr11" "chr12" "chr7" "chr19" "chr22" "chr17" "chr3"
## [73] "chr8" "chr3" "chr15" "chr6" "chr9" "chr10" "chr6" "chr2"
## [81] "chr19" "chr11" "chr8" "chr17" "chr15" "chr21" "chr7" "chr2"
## [89] "chr2" "chr3" "chr2" "chr16" "chr10" "chr20" "chr17" "chr13"
## [97] "chr2" "chr5" "chr14" "chr11" "chr8" "chr20" "chr3" "chr7"
## [105] "chr1" "chr1" "chr3" "chr17" "chrX" "chr19" "chr20" "chr6"
## [113] "chr7" "chr16" "chr7" "chr17" "chr20" "chr2" "chr5" "chrX"
## [121] "chr7" "chr6" "chr19" "chr17" "chr16" "chr5" "chr12" "chr9"
## [129] "chr20" "chr2" "chr12" "chr3" "chr7" "chr2" "chr20" "chr20"
## [137] "chr17" "chr12" "chr19" "chr1" "chr7" "chr20" "chr14" "chr12"
## [145] "chr10" "chr6" "chr9" "chr6" "chr1" "chr18" "chr8" "chr15"
## [153] "chr6" "chr2" "chr1" "chr18" "chr16" "chr9" "chr20" "chr19"
## [161] "chr17" "chr10" "chr6" "chr2" "chrX" "chr16" "chr20" "chr16"
## [169] "chr20" "chr16" "chr20" "chr5" "chr16" "chr17" "chr17" "chr3"
## [177] "chr8" "chr18" "chr18" "chr7" "chr20" "chr16" "chr19" "chr11"
## [185] "chr12" "chr2" "chr17" "chr1" "chr20" "chr4" "chr17" "chr1"
## [193] "chr6" "chr5" "chr13" "chr7" "chr20" "chr2" "chr16" "chr6"
## [201] "chr11" "chr5" "chr20" "chr1" "chr9" "chr2" "chr16" "chr10"
## [209] "chr9" "chr2" "chr2" "chr21" "chr1" "chr16" "chr18" "chr10"
## [217] "chr16" "chr3" "chr6" "chr16" "chr2" "chr6" "chr10" "chr16"
## [225] "chr22" "chr2" "chr16" "chr8" "chr20" "chr19" "chr16" "chr20"
## [233] "chr2" "chr3" "chr10" "chr14" "chr6" "chr18" "chr15" "chr9"
## [241] "chr14" "chr7" "chr20" "chr3" "chr6" "chr10" "chr4" "chr1"
## [249] "chr9" "chr15" "chr6" "chr16" "chr2" "chr3" "chr14" "chr19"
## [257] "chr2" "chr5" "chr22" "chr16" "chr6" "chr16" "chr17" "chr11"
## [265] "chr8" "chr3" "chr1" "chr16" "chr21" "chr12" "chr16" "chr1"
## [273] "chr2" "chr2" "chr9" "chr2" "chr16" "chr17" "chr12" "chr17"
## [281] "chr7" "chr20" "chr7" "chr6" "chr12" "chr2" "chr1" "chr5"
## [289] "chr6" "chr2" "chr1" "chr12" "chr2" "chr6" "chr20" "chr2"
## [297] "chr17" "chr3" "chr4" "chr1" "chr17" "chr1" "chr12"
# it will show the names of chromosomes of all the rows.
table(chr)
## chr
## chr1 chr2 chr3
## 18 38 15
## chr4 chr5 chr6
## 4 8 24
## chr7 chr8 chr9
## 14 11 12
## chr10 chr11 chr12
## 9 9 13
## chr13 chr14 chr15
## 2 8 5
## chr16 chr17 chr18
## 31 21 6
## chr19 chr20 chr21
## 16 27 3
## chr22 chrX chrY
## 5 4 0
## chrM chr1_gl000191_random chr1_gl000192_random
## 0 0 0
## chr4_ctg9_hap1 chr4_gl000193_random chr4_gl000194_random
## 0 0 0
## chr6_apd_hap1 chr6_cox_hap2 chr6_dbb_hap3
## 0 0 0
## chr6_mann_hap4 chr6_mcf_hap5 chr6_qbl_hap6
## 0 0 0
## chr6_ssto_hap7 chr7_gl000195_random chr8_gl000196_random
## 0 0 0
## chr8_gl000197_random chr9_gl000198_random chr9_gl000199_random
## 0 0 0
## chr9_gl000200_random chr9_gl000201_random chr11_gl000202_random
## 0 0 0
## chr17_ctg5_hap1 chr17_gl000203_random chr17_gl000204_random
## 0 0 0
## chr17_gl000205_random chr17_gl000206_random chr18_gl000207_random
## 0 0 0
## chr19_gl000208_random chr19_gl000209_random chr21_gl000210_random
## 0 0 0
## chrUn_gl000211 chrUn_gl000212 chrUn_gl000213
## 0 0 0
## chrUn_gl000214 chrUn_gl000215 chrUn_gl000216
## 0 0 0
## chrUn_gl000217 chrUn_gl000218 chrUn_gl000219
## 0 0 0
## chrUn_gl000220 chrUn_gl000221 chrUn_gl000222
## 0 0 0
## chrUn_gl000223 chrUn_gl000224 chrUn_gl000225
## 0 0 0
## chrUn_gl000226 chrUn_gl000227 chrUn_gl000228
## 0 0 0
## chrUn_gl000229 chrUn_gl000230 chrUn_gl000231
## 0 0 0
## chrUn_gl000232 chrUn_gl000233 chrUn_gl000234
## 0 0 0
## chrUn_gl000235 chrUn_gl000236 chrUn_gl000237
## 0 0 0
## chrUn_gl000238 chrUn_gl000239 chrUn_gl000240
## 0 0 0
## chrUn_gl000241 chrUn_gl000242 chrUn_gl000243
## 0 0 0
## chrUn_gl000244 chrUn_gl000245 chrUn_gl000246
## 0 0 0
## chrUn_gl000247 chrUn_gl000248 chrUn_gl000249
## 0 0 0
# it will show the number of rows for each chromosomes. This contain other weird names for chromosomes will be understood when we do more human genome studies.
Furthermore, we can sbset the entire dataset by selecting the chromosome numbers. Like what is shown below.
HepG2[chr=="chr20"]
## GRanges object with 27 ranges and 7 metadata columns:
## seqnames ranges strand | name score
## <Rle> <IRanges> <Rle> | <numeric> <integer>
## [1] chr20 [ 328285, 329145] * | <NA> 0
## [2] chr20 [22410891, 22411863] * | <NA> 0
## [3] chr20 [56039583, 56040249] * | <NA> 0
## [4] chr20 [16455811, 16456232] * | <NA> 0
## [5] chr20 [ 3140243, 3140774] * | <NA> 0
## ... ... ... ... ... ... ...
## [23] chr20 [ 5591571, 5592037] * | <NA> 0
## [24] chr20 [25519664, 25520238] * | <NA> 0
## [25] chr20 [19900951, 19901275] * | <NA> 0
## [26] chr20 [35156796, 35157140] * | <NA> 0
## [27] chr20 [25036720, 25037716] * | <NA> 0
## col signalValue pValue qValue peak
## <logical> <numeric> <numeric> <numeric> <integer>
## [1] <NA> 10.808 69.685 5.028065e-66 321
## [2] <NA> 6.419 41.02 7.749606e-38 660
## [3] <NA> 7.796 36.977 3.666932e-34 315
## [4] <NA> 7.351 21.831 1.596682e-19 199
## [5] <NA> 7.296 21.587 2.625365e-19 315
## ... ... ... ... ... ...
## [23] <NA> 8.766 11.433 7.677415e-10 249
## [24] <NA> 3.3 11.419 7.895197e-10 206
## [25] <NA> 4.809 11.155 1.379539e-09 140
## [26] <NA> 10.154 10.313 8.309712e-09 163
## [27] <NA> 4.381 10.087 1.332784e-08 170
## -------
## seqinfo: 93 sequences (1 circular) from hg19 genome
This will select only the data rows obtained from chromosome 20.
We can order them also like the following. So that it will order the chromosome number followed by the regions with in the chromosome and so on.
x <-HepG2[order(HepG2),]
# This is very convenient feature. We can find the seqnames in this so that it will order them too
seqnames(x)
## factor-Rle of length 303 with 23 runs
## Lengths: 18 38 15 4 8 ... 16 27 3 5 4
## Values : chr1 chr2 chr3 chr4 chr5 ... chr19 chr20 chr21 chr22 chrX
## Levels(93): chr1 chr2 chr3 ... chrUn_gl000248 chrUn_gl000249
# this would give ordered chromosomes and number of repititions in that too.
In the video we used values method to extract meta-data on the regions. An alternative, and actually preferred method is going forward with mcols.
What is the median of signalValue column?
mcols(HepG2)
## DataFrame with 303 rows and 7 columns
## name score col signalValue pValue qValue
## <numeric> <integer> <logical> <numeric> <numeric> <numeric>
## 1 NA 0 NA 58.251 75.899 6.143712e-72
## 2 NA 0 NA 10.808 69.685 5.028065e-66
## 3 NA 0 NA 17.103 54.311 7.930665e-51
## 4 NA 0 NA 12.427 43.855 1.359756e-40
## 5 NA 0 NA 10.850 40.977 7.333863e-38
## ... ... ... ... ... ... ...
## 299 NA 0 NA 9.681 10.057 1.423343e-08
## 300 NA 0 NA 7.929 10.047 1.442076e-08
## 301 NA 0 NA 5.864 9.990 1.638918e-08
## 302 NA 0 NA 5.660 9.948 1.799414e-08
## 303 NA 0 NA 13.211 9.918 1.921805e-08
## peak
## <integer>
## 1 195
## 2 321
## 3 930
## 4 604
## 5 492
## ... ...
## 299 402
## 300 197
## 301 227
## 302 211
## 303 163
median( mcols(HepG2)$signalValue )
## [1] 7.024
In what chromosome is the region with the highest signalValue (copy and paste your answer)?
i=which.max( mcols(HepG2)$signalValue )
seqnames(HepG2[i])
## factor-Rle of length 1 with 1 run
## Lengths: 1
## Values : chrX
## Levels(93): chr1 chr2 chr3 ... chrUn_gl000248 chrUn_gl000249
How many regions are from chromosome 16?
HepG2[chr=="chr16"]
## GRanges object with 31 ranges and 7 metadata columns:
## seqnames ranges strand | name score
## <Rle> <IRanges> <Rle> | <numeric> <integer>
## [1] chr16 [70191209, 70192150] * | <NA> 0
## [2] chr16 [ 1701039, 1702137] * | <NA> 0
## [3] chr16 [25189109, 25190026] * | <NA> 0
## [4] chr16 [85325101, 85325686] * | <NA> 0
## [5] chr16 [29986461, 29986872] * | <NA> 0
## ... ... ... ... ... ... ...
## [27] chr16 [57481218, 57481854] * | <NA> 0
## [28] chr16 [85322504, 85322950] * | <NA> 0
## [29] chr16 [19134897, 19135280] * | <NA> 0
## [30] chr16 [ 2586101, 2586737] * | <NA> 0
## [31] chr16 [29975932, 29976255] * | <NA> 0
## col signalValue pValue qValue peak
## <logical> <numeric> <numeric> <numeric> <integer>
## [1] <NA> 8.371 37.774 8.192772e-35 688
## [2] <NA> 16.157 36.264 1.656956e-33 783
## [3] <NA> 5.979 31.808 3.443564e-29 606
## [4] <NA> 7.664 31.429 7.883209e-29 223
## [5] <NA> 14.795 29.018 1.730084e-26 198
## ... ... ... ... ... ...
## [27] <NA> 5.126 10.761 3.209779e-09 196
## [28] <NA> 4.331 10.725 3.434943e-09 223
## [29] <NA> 5.38 10.562 4.925627e-09 203
## [30] <NA> 6.521 10.514 5.421234e-09 472
## [31] <NA> 6.897 10.436 6.371957e-09 145
## -------
## seqinfo: 93 sequences (1 circular) from hg19 genome
# or
chr = seqnames(HepG2)
table(chr)[16]
## chr16
## 31
Make a histogram of the widths of the regions from all chromosomes (not just chr16).
What is the median width?
median( width(HepG2) )
## [1] 560
##You can see the histogram
hist(width(HepG2),nclass=25)
Note it has a heavy right tail.
In the previous video we saw a number of functions for manipulating interval ranges:
shift, narrow, flank, *, +, -, range, reduce, gaps, disjoin.
This is just a subset of all the possible operations, and remember, the rest are documented in the help pages mentioned in the video and in the book page. We will first do a simple review of these operations, so that you get a sense of using them in your R console. Then we will have a few questions which require more thought.
Load the IRanges package. Define an integer range starting at 101 and ending at 200. If we use the operation *2, this will zoom in, giving us a range with half the width. What is the starting point of the resulting range?
library(IRanges)
ir <- IRanges(101,200,100)
start(ir*2)
## [1] 126
Define an integer range starting at 101 and ending at 200. If we use the operation narrow(x,start=20) what is the new starting point of the range?
narrow(ir,start = 20)
## IRanges of length 1
## start end width
## [1] 120 200 81
Define an integer range starting at 101 and ending at 200. If we use the operation +25, what is the width of the resulting range?
ir+25
## IRanges of length 1
## start end width
## [1] 76 225 150
Define an IRanges with starts at 1,11,21 and ends at 3,15,27. width() gives the widths for each range. What is the sum of the widths of all the ranges?
ir <- IRanges(start = c(1,11,21), end = c(3,15,27))
width(ir)
## [1] 3 5 7
sum(width(ir))
## [1] 15
Define an IRanges object, x, with the following set of ranges:
Starts at 101,106,201,211,221,301,306,311,351,361,401,411,501
Ends at 150,160,210,270,225,310,310,330,390,380,415,470,510
Plot these ranges using the plotRanges function in the ph525x package. You can install this library if you have not done so already, with the command: install_github(“genomicsclass/ph525x”)
What is the total width from 101 to 510 which is not covered by ranges in x?
library(ph525x)
## Loading required package: png
## Loading required package: grid
## Loading required package: Biobase
## Welcome to Bioconductor
##
## Vignettes contain introductory material; view with
## 'browseVignettes()'. To cite Bioconductor, see
## 'citation("Biobase")', and for packages 'citation("pkgname")'.
x <- IRanges(start = c(101,106,201,211,221,301,306,311,351,361,401,411,501),end = c(150,160,210,270,225,310,310,330,390,380,415,470,510))
# find out the range first
range(x)
## IRanges of length 1
## start end width
## [1] 101 510 410
# Because the range of x is from 101 to 510, we just need to take the sum of the gap widths
sum(width(gaps(x)))
## [1] 130
How many disjoint ranges are contained within the ranges in ‘x’ from the previous question? By disjoint ranges, we mean the following: for two ranges [1,10] and [6,15], there are three disjoint ranges contained within: [1,5], [6,10], and [11,15].
disjoin(x)
## IRanges of length 17
## start end width
## [1] 101 105 5
## [2] 106 150 45
## [3] 151 160 10
## [4] 201 210 10
## [5] 211 220 10
## ... ... ... ...
## [13] 381 390 10
## [14] 401 410 10
## [15] 411 415 5
## [16] 416 470 55
## [17] 501 510 10
length(disjoin(x))
## [1] 17
An intra-range function we didn’t show in the video is resize(). Set up a grid of 2 stacked plots:
par(mfrow=c(2,1))
Now use plotRanges() to plot the original ‘x’ and resize(x,1). You will have to set the xlim to make sure that the plots line up vertically. You can use plotRanges(x,xlim=c(,600)) for example. What is the best description for the operation resize(x,1)
From the man page: resize resizes the ranges to the specified width where either the start, end, or center is used as an anchor. The default is fix=“start”, so resize(x,1) gives you the starting integer of each range in x.
plotRanges(x)
x1 <- resize(x,1)
plotRanges(x1)
In the first week, in the subsection “What We Measure and Why”, we learned that DNA has two strands. These two strands are often called plus, “+”, and minus, “-”.
The GRanges object in the GenomicRanges package extends the concept of interval ranges in two major ways. The ranges are now also identified by:
the chromosome we are referring to (in Bioconductor, this is called “seqnames”)
the strand of the DNA we are referring to (“+” or “-”). No strand is labelled with a star, “*“.
Without these two pieces of information, a specification of a range of DNA would be ambiguous. Let’s make two ranges, with strand and chromosome information, and see how the range operations act based on strand.
library(IRanges)
library(ph525x)
library(GenomicRanges)
x = GRanges("chr1", IRanges(c(1,101),c(50,150)), strand=c("+","-"))
In the last assessment, we visualized IRanges with the plotRanges function in the ph525x library. We can get the internal IRanges from a GRanges object with the following code:
ranges(x)
## IRanges of length 2
## start end width
## [1] 1 50 50
## [2] 101 150 50
So let’s define a new plotting function:
plotGRanges = function(x) plotRanges(ranges(x))
Compare x and resize(x,1) using plotGRanges. The result of running resize(x,1) is two ranges of width 1 which start…
par(mfrow=c(2,1))
plotGRanges(x)
x1 = resize(x,1)
plotGRanges(x1)
Suppose we have two different sets of ranges, which overlap somewhat but not entirely. This is the case for many genes, in which there are different versions of transcripts, also called isoforms. The different transcripts consist of exons which end up in the final mRNA molecule, and a number of transcripts can share exons or have exons which are overlapping but not identical ranges.
We’ll start with a toy example, and learn how to load real genes later:
x = GRanges("chr1", IRanges(c(101,201,401,501),c(150,250,450,550)), strand="+")
y = GRanges("chr1", IRanges(c(101,221,301,401,541),c(150,250,350,470,550)), strand="+")
Plot these two sets of ranges using par(mfrow=c(2,1)) and two calls to plotGRanges.
par(mfrow=c(2,1))
plotGRanges(x)
plotGRanges(y)
If we want to keep the information about which set the ranges belong to, we could combine the two GRanges into a GRangesList:
GRangesList(x,y)
## GRangesList object of length 2:
## [[1]]
## GRanges object with 4 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chr1 [101, 150] +
## [2] chr1 [201, 250] +
## [3] chr1 [401, 450] +
## [4] chr1 [501, 550] +
##
## [[2]]
## GRanges object with 5 ranges and 0 metadata columns:
## seqnames ranges strand
## [1] chr1 [101, 150] +
## [2] chr1 [221, 250] +
## [3] chr1 [301, 350] +
## [4] chr1 [401, 470] +
## [5] chr1 [541, 550] +
##
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
However, if we want to combine them into a single GRanges, we can use c():
c(x,y)
## GRanges object with 9 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chr1 [101, 150] +
## [2] chr1 [201, 250] +
## [3] chr1 [401, 450] +
## [4] chr1 [501, 550] +
## [5] chr1 [101, 150] +
## [6] chr1 [221, 250] +
## [7] chr1 [301, 350] +
## [8] chr1 [401, 470] +
## [9] chr1 [541, 550] +
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
Find the total width which is covered by ranges in both x and y. Hint: use c(), disjoin() and %over%.
dis = disjoin(c(x,y))
both = dis %over% x & dis %over% y
sum(width(dis[both]))
## [1] 140
What is the total which is in x or y but not in both?
not =! (dis %over% x & dis %over% y)
sum(width(dis[not]))
## [1] 130
Define the new genomic range ‘z’, which covers range(ranges(x)) but has the opposite strand. What is the number of ranges in x which overlap z according to the %over% command?
z = GRanges("chr1", IRanges(c(101,201,401,501),c(150,250,450,550)), strand="-")
sum(x %over% z)
## [1] 0
The %over% command is specific to strand. If we want to find the ranges regardless of strand, we need to assign a strand of “*“.
In this assessment we will be working with two sets of regions.
library(GenomicRanges)
library(GenomicFeatures)
## Loading required package: AnnotationDbi
library(IRanges)
library(ERBS)
data("HepG2") # cell line for the liver origin
data("GM12878") # immortalized B cell
# browseVignettes("GenomicRanges") to read full review of functions.
Where does the 17th HepG2 region start?
start(HepG2)[17]
## [1] 46528596
Where does 17th region of HepG2 start? (This is the same as previous question, but it was kept since students might have already answered).
Use ?distanceToNearest to find the closest region in GM12878 to the 17th region in HepG2. What is the start site of this region.
d = distanceToNearest(HepG2[17],GM12878)
i = subjectHits(d)
start(GM12878[i])
## [1] 46524762
What is the distance between the 17th region of HepG2 and its closest region in GM12878?
d = distanceToNearest(HepG2[17],GM12878)
mcols(d)$distance
## [1] 2284
For each region in the HepG2 find the closest region in GM12878 and record the distance. What is the proportion of these distances are smaller than 2000 base pairs? Distance is a metadata column on the Hits object, so consider mcols().
d = distanceToNearest(HepG2,GM12878)
mean(mcols(d)$distance < 2000)
## [1] 0.2673267
In this assessment we will be getting gene information for human genes from the Homo.sapiens library.
library(Homo.sapiens)
## Loading required package: OrganismDbi
## Loading required package: GO.db
## Loading required package: DBI
##
## Loading required package: org.Hs.eg.db
##
## Loading required package: TxDb.Hsapiens.UCSC.hg19.knownGene
ghs = genes(Homo.sapiens)
What genome build was used here?
genome(ghs)
## chr1 chr2 chr3
## "hg19" "hg19" "hg19"
## chr4 chr5 chr6
## "hg19" "hg19" "hg19"
## chr7 chr8 chr9
## "hg19" "hg19" "hg19"
## chr10 chr11 chr12
## "hg19" "hg19" "hg19"
## chr13 chr14 chr15
## "hg19" "hg19" "hg19"
## chr16 chr17 chr18
## "hg19" "hg19" "hg19"
## chr19 chr20 chr21
## "hg19" "hg19" "hg19"
## chr22 chrX chrY
## "hg19" "hg19" "hg19"
## chrM chr1_gl000191_random chr1_gl000192_random
## "hg19" "hg19" "hg19"
## chr4_ctg9_hap1 chr4_gl000193_random chr4_gl000194_random
## "hg19" "hg19" "hg19"
## chr6_apd_hap1 chr6_cox_hap2 chr6_dbb_hap3
## "hg19" "hg19" "hg19"
## chr6_mann_hap4 chr6_mcf_hap5 chr6_qbl_hap6
## "hg19" "hg19" "hg19"
## chr6_ssto_hap7 chr7_gl000195_random chr8_gl000196_random
## "hg19" "hg19" "hg19"
## chr8_gl000197_random chr9_gl000198_random chr9_gl000199_random
## "hg19" "hg19" "hg19"
## chr9_gl000200_random chr9_gl000201_random chr11_gl000202_random
## "hg19" "hg19" "hg19"
## chr17_ctg5_hap1 chr17_gl000203_random chr17_gl000204_random
## "hg19" "hg19" "hg19"
## chr17_gl000205_random chr17_gl000206_random chr18_gl000207_random
## "hg19" "hg19" "hg19"
## chr19_gl000208_random chr19_gl000209_random chr21_gl000210_random
## "hg19" "hg19" "hg19"
## chrUn_gl000211 chrUn_gl000212 chrUn_gl000213
## "hg19" "hg19" "hg19"
## chrUn_gl000214 chrUn_gl000215 chrUn_gl000216
## "hg19" "hg19" "hg19"
## chrUn_gl000217 chrUn_gl000218 chrUn_gl000219
## "hg19" "hg19" "hg19"
## chrUn_gl000220 chrUn_gl000221 chrUn_gl000222
## "hg19" "hg19" "hg19"
## chrUn_gl000223 chrUn_gl000224 chrUn_gl000225
## "hg19" "hg19" "hg19"
## chrUn_gl000226 chrUn_gl000227 chrUn_gl000228
## "hg19" "hg19" "hg19"
## chrUn_gl000229 chrUn_gl000230 chrUn_gl000231
## "hg19" "hg19" "hg19"
## chrUn_gl000232 chrUn_gl000233 chrUn_gl000234
## "hg19" "hg19" "hg19"
## chrUn_gl000235 chrUn_gl000236 chrUn_gl000237
## "hg19" "hg19" "hg19"
## chrUn_gl000238 chrUn_gl000239 chrUn_gl000240
## "hg19" "hg19" "hg19"
## chrUn_gl000241 chrUn_gl000242 chrUn_gl000243
## "hg19" "hg19" "hg19"
## chrUn_gl000244 chrUn_gl000245 chrUn_gl000246
## "hg19" "hg19" "hg19"
## chrUn_gl000247 chrUn_gl000248 chrUn_gl000249
## "hg19" "hg19" "hg19"
seqinfo(ghs)
## Seqinfo object with 93 sequences (1 circular) from hg19 genome:
## seqnames seqlengths isCircular genome
## chr1 249250621 FALSE hg19
## chr2 243199373 FALSE hg19
## chr3 198022430 FALSE hg19
## chr4 191154276 FALSE hg19
## chr5 180915260 FALSE hg19
## ... ... ... ...
## chrUn_gl000245 36651 FALSE hg19
## chrUn_gl000246 38154 FALSE hg19
## chrUn_gl000247 36422 FALSE hg19
## chrUn_gl000248 39786 FALSE hg19
## chrUn_gl000249 38502 FALSE hg19
How many genes are represented in ghs?
length(ghs)
## [1] 23056
Whatis the chromosome with the most genes?
which.max( table( seqnames( ghs ) ))
## chr1
## 1
Make a histogram of the widths of genes (use the width() on the GRanges object). This width gives the number of basepairs from the start of the gene to the end, so including exons and introns.
w = width( ghs )
hist( w )
## the larger values are so much larger than the bulk of the data
## that we can barely see the frequencies of large values in the histogram
## the log transformation better shows how fat the right tail is:
hist( log10(w))
What is the median gene width
median(w)
## [1] 20115.5
In the next assessment we will find the closest genes to some of our binding sites. We will use a consensus set of regions. In the video we did it like this:
res = findOverlaps(HepG2,GM12878)
erbs = HepG2[queryHits(res)]
erbs = granges(erbs)
The following command is similar.
erbs2= intersect(HepG2,GM12878)
Which is TRUE?
## first order them
erbs3 = erbs[order(erbs),]
##confirm same chr
all( seqnames(erbs2)==seqnames(erbs3) )
## [1] TRUE
mean( start(erbs2)==start(erbs3) & end(erbs2)==end(erbs3) )
## [1] 0.9333333
##the intersection should be smaller
all( width(erbs2) <= width(erbs3) )
## [1] TRUE
# More than 90% of these two are similar with different regions being smaller in erbs2
Using ghs regions:
library(Homo.sapiens)
ghs = genes(Homo.sapiens)
and what you learned in the video, convert the ghs object to one that represents just the tss. What is the TSS (Transcription Start Site) of the gene with ID: 100113402?
Hint: Look at the ghs in the console. Note that the names of the ranges are the same as the GENEID column. So you can index the ranges directly with "100113402".
tssgr = resize(ghs,1)
start(tssgr["100113402"])
## [1] 70563402
Now using the erbs regions defined in a previous question:
library(ERBS)
data("HepG2")
data("GM12878")
res = findOverlaps(HepG2,GM12878)
erbs = HepG2[queryHits(res)]
erbs = granges(erbs)
What is the GENEID of the gene with TSS closest to the 4th region of erbs?
res = findOverlaps(erbs[4],tssgr)
res = tssgr[subjectHits(res)]
# or
i = nearest(erbs[4],tssgr)
mcols(tssgr)$GENEID[i]
## FactorList of length 1
## [["2101"]] 2101
In the question above, you identified a gene. Use select function to determine which is the SYMBOL of this gene.
key = as.character(mcols(tssgr)$GENEID[i])
# columns(Homo.sapiens) would give us the list of names of columns in the Homo.sapiens
select(Homo.sapiens, keys = key, columns = c("SYMBOL", "GENENAME"), keytype = "GENEID")
## GENEID SYMBOL GENENAME
## 1 2101 ESRRA estrogen-related receptor alpha
In the final assessment of this module, we will get the sequence for the regions created like this:
library(ERBS)
data("HepG2")
data("GM12878")
res = findOverlaps(HepG2,GM12878)
erbs = HepG2[queryHits(res)]
erbs = granges(erbs)
What genome build was used to create these regions?
genome(erbs)
## chr1 chr2 chr3
## "hg19" "hg19" "hg19"
## chr4 chr5 chr6
## "hg19" "hg19" "hg19"
## chr7 chr8 chr9
## "hg19" "hg19" "hg19"
## chr10 chr11 chr12
## "hg19" "hg19" "hg19"
## chr13 chr14 chr15
## "hg19" "hg19" "hg19"
## chr16 chr17 chr18
## "hg19" "hg19" "hg19"
## chr19 chr20 chr21
## "hg19" "hg19" "hg19"
## chr22 chrX chrY
## "hg19" "hg19" "hg19"
## chrM chr1_gl000191_random chr1_gl000192_random
## "hg19" "hg19" "hg19"
## chr4_ctg9_hap1 chr4_gl000193_random chr4_gl000194_random
## "hg19" "hg19" "hg19"
## chr6_apd_hap1 chr6_cox_hap2 chr6_dbb_hap3
## "hg19" "hg19" "hg19"
## chr6_mann_hap4 chr6_mcf_hap5 chr6_qbl_hap6
## "hg19" "hg19" "hg19"
## chr6_ssto_hap7 chr7_gl000195_random chr8_gl000196_random
## "hg19" "hg19" "hg19"
## chr8_gl000197_random chr9_gl000198_random chr9_gl000199_random
## "hg19" "hg19" "hg19"
## chr9_gl000200_random chr9_gl000201_random chr11_gl000202_random
## "hg19" "hg19" "hg19"
## chr17_ctg5_hap1 chr17_gl000203_random chr17_gl000204_random
## "hg19" "hg19" "hg19"
## chr17_gl000205_random chr17_gl000206_random chr18_gl000207_random
## "hg19" "hg19" "hg19"
## chr19_gl000208_random chr19_gl000209_random chr21_gl000210_random
## "hg19" "hg19" "hg19"
## chrUn_gl000211 chrUn_gl000212 chrUn_gl000213
## "hg19" "hg19" "hg19"
## chrUn_gl000214 chrUn_gl000215 chrUn_gl000216
## "hg19" "hg19" "hg19"
## chrUn_gl000217 chrUn_gl000218 chrUn_gl000219
## "hg19" "hg19" "hg19"
## chrUn_gl000220 chrUn_gl000221 chrUn_gl000222
## "hg19" "hg19" "hg19"
## chrUn_gl000223 chrUn_gl000224 chrUn_gl000225
## "hg19" "hg19" "hg19"
## chrUn_gl000226 chrUn_gl000227 chrUn_gl000228
## "hg19" "hg19" "hg19"
## chrUn_gl000229 chrUn_gl000230 chrUn_gl000231
## "hg19" "hg19" "hg19"
## chrUn_gl000232 chrUn_gl000233 chrUn_gl000234
## "hg19" "hg19" "hg19"
## chrUn_gl000235 chrUn_gl000236 chrUn_gl000237
## "hg19" "hg19" "hg19"
## chrUn_gl000238 chrUn_gl000239 chrUn_gl000240
## "hg19" "hg19" "hg19"
## chrUn_gl000241 chrUn_gl000242 chrUn_gl000243
## "hg19" "hg19" "hg19"
## chrUn_gl000244 chrUn_gl000245 chrUn_gl000246
## "hg19" "hg19" "hg19"
## chrUn_gl000247 chrUn_gl000248 chrUn_gl000249
## "hg19" "hg19" "hg19"
Now load the human genome data
library(BSgenome.Hsapiens.UCSC.hg19)
## Loading required package: BSgenome
## Loading required package: Biostrings
## Loading required package: rtracklayer
Now use the getSeq function to extract the sequence of each region in erbs. Then compute the GC -content (the number of C’s + the number of G’s divided by the length of sequence) of each Q: What is the median GC-content
# two methods can be done
erbseq = getSeq(Hsapiens, erbs)
af = alphabetFrequency(erbseq,as.prob = TRUE)
gc = af[,2]+af[,3]
gcM = median(gc)
gcM
## [1] 0.652568
# or
seqs = getSeq(Hsapiens,erbs)
gc = alphabetFrequency(seqs)[,2:3]
n = width(erbs)
gccontent = rowSums(gc)/n
Now create a control set of regions by shifting erbs by 10000.
What is the median GC-content of these control regions:
erbsControl = shift(erbs, 10000)
erbseqControl = getSeq(Hsapiens, erbsControl)
af = alphabetFrequency(erbseqControl,as.prob = TRUE)
gcControl = (af[,2]+af[,3])
gcControlM = median(gcControl)
gcControlM
## [1] 0.4860174
Compare boxplots of the gc-content of the binding sites and the control regions. Note that the binding sites have, in general, higher GC-content.
boxplot(gcM,gcControlM)
A likely explanation for this is that we have higher GC-content in the promoter of genes (upstream from the transcription start site). And as we saw in the videos and earlier assessments, our binding sites tend to be close to the transcription start sites.