The question that we address in this component is very precise and can be answered with one experimental paradigm. Given genomic DNA extracted from human liver cells (in this case, the widely used cell lines HepG2 and GM12878), where on the genome does the nuclear protein known as ESRRA (estrogen related receptor alpha) bind. We motivate this by describing an important publication that furthered our understanding of the role of estrogen receptors in breast cancer.

The data we examine this week comes from the analysis of ChIP-seq experiments. We will learn more about how to process raw experimental data. These in the week of this course and our short two week courses. At present we will work with results of a ChIP-seq experiment performed in the ENCODE proj ect. We will see how to import information for files in the “narrowPeak” format, will examine the results in the form of a Bioconductor GRanges object, and will illustrate how to identify, for each binding peak, the nearest transcriptional start site (as recorded for UCSC defined genes). This enables us to assess whether regulatory activity of ESRRA occurs in regions conventionally regarded as transcriptional promoter regions.

This week we are using two sets of genomic regions as examples. These are the reported Estrogen Related Receptor binding sites obtained for a ENCODE ChIP-seq experiment on two cell lines: HepG2 and GM12787. We have put these regions into an R package for your convenience. If you have not done so already, please download and install the package:

library(devtools)
install_github("genomicsclass/ERBS")

Quesiton 2.1.1

Before we get started we want to ensure that you have the latest version of Bioconductor installed, since obtaining the correct answers to these questions depends on this. Load the library. Use the biocVersion function to determine your version and enter it below (enter it as decimal like this: 1.2):

library(BiocInstaller)

## Bioconductor version 3.1 (BiocInstaller 1.18.4), ?biocLite for help

biocVersion()

## [1] '3.1'

Question 2.1.2

In Bioconductor we make use of what are referred to as S4 classes and methods. You can learn more about this style of object oriented programming (OOP) here. Although we do not expect you to become an expert in the S4 system, it is helpful to have some general understanding as it will help you, for example, retrieve help files for our classes.

You can learn about the Bioconductor class system here. We are going to be working mostly with the GRanges S4 class. Mike has prepared a reference card type guide.

library(ERBS)

Load two data sets

data("GM12878")
data("HepG2")

Question 2.1.3

Explore the HepG2 project

How many regions are represented

class(HepG2)

## [1] "GRanges"
## attr(,"package")
## [1] "GenomicRanges"

# You can read by just typing and reading the provided information
HepG2

## Loading required package: GenomicRanges
## Loading required package: BiocGenerics
## Loading required package: parallel
## 
## Attaching package: 'BiocGenerics'
## 
## The following objects are masked from 'package:parallel':
## 
##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
##     clusterExport, clusterMap, parApply, parCapply, parLapply,
##     parLapplyLB, parRapply, parSapply, parSapplyLB
## 
## The following object is masked from 'package:stats':
## 
##     xtabs
## 
## The following objects are masked from 'package:base':
## 
##     anyDuplicated, append, as.data.frame, as.vector, cbind,
##     colnames, do.call, duplicated, eval, evalq, Filter, Find, get,
##     intersect, is.unsorted, lapply, Map, mapply, match, mget,
##     order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
##     rbind, Reduce, rep.int, rownames, sapply, setdiff, sort,
##     table, tapply, union, unique, unlist, unsplit
## 
## Loading required package: S4Vectors
## Loading required package: stats4
## Creating a generic function for 'nchar' from package 'base' in package 'S4Vectors'
## Loading required package: IRanges
## Loading required package: GenomeInfoDb

## GRanges object with 303 ranges and 7 metadata columns:
##         seqnames                 ranges strand   |      name     score
##            <Rle>              <IRanges>  <Rle>   | <numeric> <integer>
##     [1]     chr2 [ 20335378,  20335787]      *   |      <NA>         0
##     [2]    chr20 [   328285,    329145]      *   |      <NA>         0
##     [3]     chr6 [168135432, 168136587]      *   |      <NA>         0
##     [4]    chr19 [  1244419,   1245304]      *   |      <NA>         0
##     [5]    chr11 [ 64071828,  64073069]      *   |      <NA>         0
##     ...      ...                    ...    ... ...       ...       ...
##   [299]     chr4 [  1797182,   1797852]      *   |      <NA>         0
##   [300]     chr1 [110198573, 110199126]      *   |      <NA>         0
##   [301]    chr17 [ 17734052,  17734469]      *   |      <NA>         0
##   [302]     chr1 [ 48306453,  48306908]      *   |      <NA>         0
##   [303]    chr12 [123867207, 123867554]      *   |      <NA>         0
##               col signalValue    pValue       qValue      peak
##         <logical>   <numeric> <numeric>    <numeric> <integer>
##     [1]      <NA>      58.251    75.899 6.143712e-72       195
##     [2]      <NA>      10.808    69.685 5.028065e-66       321
##     [3]      <NA>      17.103    54.311 7.930665e-51       930
##     [4]      <NA>      12.427    43.855 1.359756e-40       604
##     [5]      <NA>       10.85    40.977 7.333863e-38       492
##     ...       ...         ...       ...          ...       ...
##   [299]      <NA>       9.681    10.057 1.423343e-08       402
##   [300]      <NA>       7.929    10.047 1.442076e-08       197
##   [301]      <NA>       5.864      9.99 1.638918e-08       227
##   [302]      <NA>        5.66     9.948 1.799414e-08       211
##   [303]      <NA>      13.211     9.918 1.921805e-08       163
##   -------
##   seqinfo: 93 sequences (1 circular) from hg19 genome

# or you can also type 
length(HepG2)

## [1] 303

We will continue using the ERBS example. We can see that it is not ordered. We can access the columns by extracting data using value.

values(HepG2)

## DataFrame with 303 rows and 7 columns
##          name     score       col signalValue    pValue       qValue
##     <numeric> <integer> <logical>   <numeric> <numeric>    <numeric>
## 1          NA         0        NA      58.251    75.899 6.143712e-72
## 2          NA         0        NA      10.808    69.685 5.028065e-66
## 3          NA         0        NA      17.103    54.311 7.930665e-51
## 4          NA         0        NA      12.427    43.855 1.359756e-40
## 5          NA         0        NA      10.850    40.977 7.333863e-38
## ...       ...       ...       ...         ...       ...          ...
## 299        NA         0        NA       9.681    10.057 1.423343e-08
## 300        NA         0        NA       7.929    10.047 1.442076e-08
## 301        NA         0        NA       5.864     9.990 1.638918e-08
## 302        NA         0        NA       5.660     9.948 1.799414e-08
## 303        NA         0        NA      13.211     9.918 1.921805e-08
##          peak
##     <integer>
## 1         195
## 2         321
## 3         930
## 4         604
## 5         492
## ...       ...
## 299       402
## 300       197
## 301       227
## 302       211
## 303       163

Some more basic funcitons are below.

Some of these information can be extracted by treating this data set as matrix. So we can use square brackets for subsetting. Top 10 rows can be extracted by

HepG2[1:10,]

## GRanges object with 10 ranges and 7 metadata columns:
##        seqnames                 ranges strand |      name     score
##           <Rle>              <IRanges>  <Rle> | <numeric> <integer>
##    [1]     chr2 [ 20335378,  20335787]      * |      <NA>         0
##    [2]    chr20 [   328285,    329145]      * |      <NA>         0
##    [3]     chr6 [168135432, 168136587]      * |      <NA>         0
##    [4]    chr19 [  1244419,   1245304]      * |      <NA>         0
##    [5]    chr11 [ 64071828,  64073069]      * |      <NA>         0
##    [6]    chr20 [ 22410891,  22411863]      * |      <NA>         0
##    [7]    chr19 [ 45439028,  45439525]      * |      <NA>         0
##    [8]     chr2 [ 16938364,  16938840]      * |      <NA>         0
##    [9]    chr16 [ 70191209,  70192150]      * |      <NA>         0
##   [10]     chr3 [ 53290050,  53290602]      * |      <NA>         0
##              col signalValue    pValue       qValue      peak
##        <logical>   <numeric> <numeric>    <numeric> <integer>
##    [1]      <NA>      58.251    75.899 6.143712e-72       195
##    [2]      <NA>      10.808    69.685 5.028065e-66       321
##    [3]      <NA>      17.103    54.311 7.930665e-51       930
##    [4]      <NA>      12.427    43.855 1.359756e-40       604
##    [5]      <NA>       10.85    40.977 7.333863e-38       492
##    [6]      <NA>       6.419     41.02 7.749606e-38       660
##    [7]      <NA>        8.78    39.204 3.804879e-36       267
##    [8]      <NA>      12.783    38.004 5.360291e-35       255
##    [9]      <NA>       8.371    37.774 8.192772e-35       688
##   [10]      <NA>      16.196    37.633 9.446036e-35       281
##   -------
##   seqinfo: 93 sequences (1 circular) from hg19 genome

How do I get a region from from Chr 20. this is done by seqnames

chr= seqnames(HepG2)

This will give the chromosome name of each row. This is an Rle object. We can coerce it by as.character(). The Rle class would order different chromosomes together automatically. Many functions can be done even with Rle format. Say table function would work perfectly well

as.character(chr)

##   [1] "chr2"  "chr20" "chr6"  "chr19" "chr11" "chr20" "chr19" "chr2" 
##   [9] "chr16" "chr3"  "chr6"  "chr20" "chr7"  "chr16" "chr9"  "chr11"
##  [17] "chr22" "chrX"  "chr8"  "chr16" "chr16" "chr19" "chr17" "chr17"
##  [25] "chr16" "chr1"  "chr16" "chr9"  "chr17" "chr16" "chr12" "chr6" 
##  [33] "chr2"  "chr3"  "chr11" "chr16" "chr6"  "chr2"  "chr8"  "chr1" 
##  [41] "chr17" "chr20" "chr4"  "chr14" "chr19" "chr20" "chr9"  "chr2" 
##  [49] "chr2"  "chr19" "chr8"  "chr14" "chr22" "chr2"  "chr14" "chr6" 
##  [57] "chr20" "chr2"  "chr19" "chr8"  "chr2"  "chr19" "chr12" "chr2" 
##  [65] "chr2"  "chr11" "chr12" "chr7"  "chr19" "chr22" "chr17" "chr3" 
##  [73] "chr8"  "chr3"  "chr15" "chr6"  "chr9"  "chr10" "chr6"  "chr2" 
##  [81] "chr19" "chr11" "chr8"  "chr17" "chr15" "chr21" "chr7"  "chr2" 
##  [89] "chr2"  "chr3"  "chr2"  "chr16" "chr10" "chr20" "chr17" "chr13"
##  [97] "chr2"  "chr5"  "chr14" "chr11" "chr8"  "chr20" "chr3"  "chr7" 
## [105] "chr1"  "chr1"  "chr3"  "chr17" "chrX"  "chr19" "chr20" "chr6" 
## [113] "chr7"  "chr16" "chr7"  "chr17" "chr20" "chr2"  "chr5"  "chrX" 
## [121] "chr7"  "chr6"  "chr19" "chr17" "chr16" "chr5"  "chr12" "chr9" 
## [129] "chr20" "chr2"  "chr12" "chr3"  "chr7"  "chr2"  "chr20" "chr20"
## [137] "chr17" "chr12" "chr19" "chr1"  "chr7"  "chr20" "chr14" "chr12"
## [145] "chr10" "chr6"  "chr9"  "chr6"  "chr1"  "chr18" "chr8"  "chr15"
## [153] "chr6"  "chr2"  "chr1"  "chr18" "chr16" "chr9"  "chr20" "chr19"
## [161] "chr17" "chr10" "chr6"  "chr2"  "chrX"  "chr16" "chr20" "chr16"
## [169] "chr20" "chr16" "chr20" "chr5"  "chr16" "chr17" "chr17" "chr3" 
## [177] "chr8"  "chr18" "chr18" "chr7"  "chr20" "chr16" "chr19" "chr11"
## [185] "chr12" "chr2"  "chr17" "chr1"  "chr20" "chr4"  "chr17" "chr1" 
## [193] "chr6"  "chr5"  "chr13" "chr7"  "chr20" "chr2"  "chr16" "chr6" 
## [201] "chr11" "chr5"  "chr20" "chr1"  "chr9"  "chr2"  "chr16" "chr10"
## [209] "chr9"  "chr2"  "chr2"  "chr21" "chr1"  "chr16" "chr18" "chr10"
## [217] "chr16" "chr3"  "chr6"  "chr16" "chr2"  "chr6"  "chr10" "chr16"
## [225] "chr22" "chr2"  "chr16" "chr8"  "chr20" "chr19" "chr16" "chr20"
## [233] "chr2"  "chr3"  "chr10" "chr14" "chr6"  "chr18" "chr15" "chr9" 
## [241] "chr14" "chr7"  "chr20" "chr3"  "chr6"  "chr10" "chr4"  "chr1" 
## [249] "chr9"  "chr15" "chr6"  "chr16" "chr2"  "chr3"  "chr14" "chr19"
## [257] "chr2"  "chr5"  "chr22" "chr16" "chr6"  "chr16" "chr17" "chr11"
## [265] "chr8"  "chr3"  "chr1"  "chr16" "chr21" "chr12" "chr16" "chr1" 
## [273] "chr2"  "chr2"  "chr9"  "chr2"  "chr16" "chr17" "chr12" "chr17"
## [281] "chr7"  "chr20" "chr7"  "chr6"  "chr12" "chr2"  "chr1"  "chr5" 
## [289] "chr6"  "chr2"  "chr1"  "chr12" "chr2"  "chr6"  "chr20" "chr2" 
## [297] "chr17" "chr3"  "chr4"  "chr1"  "chr17" "chr1"  "chr12"

# it will show the names of chromosomes of all the rows. 
table(chr)

## chr
##                  chr1                  chr2                  chr3 
##                    18                    38                    15 
##                  chr4                  chr5                  chr6 
##                     4                     8                    24 
##                  chr7                  chr8                  chr9 
##                    14                    11                    12 
##                 chr10                 chr11                 chr12 
##                     9                     9                    13 
##                 chr13                 chr14                 chr15 
##                     2                     8                     5 
##                 chr16                 chr17                 chr18 
##                    31                    21                     6 
##                 chr19                 chr20                 chr21 
##                    16                    27                     3 
##                 chr22                  chrX                  chrY 
##                     5                     4                     0 
##                  chrM  chr1_gl000191_random  chr1_gl000192_random 
##                     0                     0                     0 
##        chr4_ctg9_hap1  chr4_gl000193_random  chr4_gl000194_random 
##                     0                     0                     0 
##         chr6_apd_hap1         chr6_cox_hap2         chr6_dbb_hap3 
##                     0                     0                     0 
##        chr6_mann_hap4         chr6_mcf_hap5         chr6_qbl_hap6 
##                     0                     0                     0 
##        chr6_ssto_hap7  chr7_gl000195_random  chr8_gl000196_random 
##                     0                     0                     0 
##  chr8_gl000197_random  chr9_gl000198_random  chr9_gl000199_random 
##                     0                     0                     0 
##  chr9_gl000200_random  chr9_gl000201_random chr11_gl000202_random 
##                     0                     0                     0 
##       chr17_ctg5_hap1 chr17_gl000203_random chr17_gl000204_random 
##                     0                     0                     0 
## chr17_gl000205_random chr17_gl000206_random chr18_gl000207_random 
##                     0                     0                     0 
## chr19_gl000208_random chr19_gl000209_random chr21_gl000210_random 
##                     0                     0                     0 
##        chrUn_gl000211        chrUn_gl000212        chrUn_gl000213 
##                     0                     0                     0 
##        chrUn_gl000214        chrUn_gl000215        chrUn_gl000216 
##                     0                     0                     0 
##        chrUn_gl000217        chrUn_gl000218        chrUn_gl000219 
##                     0                     0                     0 
##        chrUn_gl000220        chrUn_gl000221        chrUn_gl000222 
##                     0                     0                     0 
##        chrUn_gl000223        chrUn_gl000224        chrUn_gl000225 
##                     0                     0                     0 
##        chrUn_gl000226        chrUn_gl000227        chrUn_gl000228 
##                     0                     0                     0 
##        chrUn_gl000229        chrUn_gl000230        chrUn_gl000231 
##                     0                     0                     0 
##        chrUn_gl000232        chrUn_gl000233        chrUn_gl000234 
##                     0                     0                     0 
##        chrUn_gl000235        chrUn_gl000236        chrUn_gl000237 
##                     0                     0                     0 
##        chrUn_gl000238        chrUn_gl000239        chrUn_gl000240 
##                     0                     0                     0 
##        chrUn_gl000241        chrUn_gl000242        chrUn_gl000243 
##                     0                     0                     0 
##        chrUn_gl000244        chrUn_gl000245        chrUn_gl000246 
##                     0                     0                     0 
##        chrUn_gl000247        chrUn_gl000248        chrUn_gl000249 
##                     0                     0                     0

# it will show the number of rows for each chromosomes. This contain other weird names for chromosomes will be understood when we do more human genome studies.

Furthermore, we can sbset the entire dataset by selecting the chromosome numbers. Like what is shown below.

HepG2[chr=="chr20"]

## GRanges object with 27 ranges and 7 metadata columns:
##        seqnames               ranges strand   |      name     score
##           <Rle>            <IRanges>  <Rle>   | <numeric> <integer>
##    [1]    chr20 [  328285,   329145]      *   |      <NA>         0
##    [2]    chr20 [22410891, 22411863]      *   |      <NA>         0
##    [3]    chr20 [56039583, 56040249]      *   |      <NA>         0
##    [4]    chr20 [16455811, 16456232]      *   |      <NA>         0
##    [5]    chr20 [ 3140243,  3140774]      *   |      <NA>         0
##    ...      ...                  ...    ... ...       ...       ...
##   [23]    chr20 [ 5591571,  5592037]      *   |      <NA>         0
##   [24]    chr20 [25519664, 25520238]      *   |      <NA>         0
##   [25]    chr20 [19900951, 19901275]      *   |      <NA>         0
##   [26]    chr20 [35156796, 35157140]      *   |      <NA>         0
##   [27]    chr20 [25036720, 25037716]      *   |      <NA>         0
##              col signalValue    pValue       qValue      peak
##        <logical>   <numeric> <numeric>    <numeric> <integer>
##    [1]      <NA>      10.808    69.685 5.028065e-66       321
##    [2]      <NA>       6.419     41.02 7.749606e-38       660
##    [3]      <NA>       7.796    36.977 3.666932e-34       315
##    [4]      <NA>       7.351    21.831 1.596682e-19       199
##    [5]      <NA>       7.296    21.587 2.625365e-19       315
##    ...       ...         ...       ...          ...       ...
##   [23]      <NA>       8.766    11.433 7.677415e-10       249
##   [24]      <NA>         3.3    11.419 7.895197e-10       206
##   [25]      <NA>       4.809    11.155 1.379539e-09       140
##   [26]      <NA>      10.154    10.313 8.309712e-09       163
##   [27]      <NA>       4.381    10.087 1.332784e-08       170
##   -------
##   seqinfo: 93 sequences (1 circular) from hg19 genome

This will select only the data rows obtained from chromosome 20.

We can order them also like the following. So that it will order the chromosome number followed by the regions with in the chromosome and so on.

x <-HepG2[order(HepG2),]
# This is very convenient feature. We can find the seqnames in this so that it will order them too 
seqnames(x)

## factor-Rle of length 303 with 23 runs
##   Lengths:    18    38    15     4     8 ...    16    27     3     5     4
##   Values :  chr1  chr2  chr3  chr4  chr5 ... chr19 chr20 chr21 chr22  chrX
## Levels(93): chr1 chr2 chr3 ... chrUn_gl000248 chrUn_gl000249

# this would give ordered chromosomes and number of repititions in that too.

Question 2.2.1

In the video we used values method to extract meta-data on the regions. An alternative, and actually preferred method is going forward with mcols.

What is the median of signalValue column?

mcols(HepG2)

## DataFrame with 303 rows and 7 columns
##          name     score       col signalValue    pValue       qValue
##     <numeric> <integer> <logical>   <numeric> <numeric>    <numeric>
## 1          NA         0        NA      58.251    75.899 6.143712e-72
## 2          NA         0        NA      10.808    69.685 5.028065e-66
## 3          NA         0        NA      17.103    54.311 7.930665e-51
## 4          NA         0        NA      12.427    43.855 1.359756e-40
## 5          NA         0        NA      10.850    40.977 7.333863e-38
## ...       ...       ...       ...         ...       ...          ...
## 299        NA         0        NA       9.681    10.057 1.423343e-08
## 300        NA         0        NA       7.929    10.047 1.442076e-08
## 301        NA         0        NA       5.864     9.990 1.638918e-08
## 302        NA         0        NA       5.660     9.948 1.799414e-08
## 303        NA         0        NA      13.211     9.918 1.921805e-08
##          peak
##     <integer>
## 1         195
## 2         321
## 3         930
## 4         604
## 5         492
## ...       ...
## 299       402
## 300       197
## 301       227
## 302       211
## 303       163

median( mcols(HepG2)$signalValue )

## [1] 7.024

Question 2.2.2

In what chromosome is the region with the highest signalValue (copy and paste your answer)?

i=which.max(  mcols(HepG2)$signalValue  )
seqnames(HepG2[i])

## factor-Rle of length 1 with 1 run
##   Lengths:    1
##   Values : chrX
## Levels(93): chr1 chr2 chr3 ... chrUn_gl000248 chrUn_gl000249

Question 2.2.3

How many regions are from chromosome 16?

HepG2[chr=="chr16"]

## GRanges object with 31 ranges and 7 metadata columns:
##        seqnames               ranges strand   |      name     score
##           <Rle>            <IRanges>  <Rle>   | <numeric> <integer>
##    [1]    chr16 [70191209, 70192150]      *   |      <NA>         0
##    [2]    chr16 [ 1701039,  1702137]      *   |      <NA>         0
##    [3]    chr16 [25189109, 25190026]      *   |      <NA>         0
##    [4]    chr16 [85325101, 85325686]      *   |      <NA>         0
##    [5]    chr16 [29986461, 29986872]      *   |      <NA>         0
##    ...      ...                  ...    ... ...       ...       ...
##   [27]    chr16 [57481218, 57481854]      *   |      <NA>         0
##   [28]    chr16 [85322504, 85322950]      *   |      <NA>         0
##   [29]    chr16 [19134897, 19135280]      *   |      <NA>         0
##   [30]    chr16 [ 2586101,  2586737]      *   |      <NA>         0
##   [31]    chr16 [29975932, 29976255]      *   |      <NA>         0
##              col signalValue    pValue       qValue      peak
##        <logical>   <numeric> <numeric>    <numeric> <integer>
##    [1]      <NA>       8.371    37.774 8.192772e-35       688
##    [2]      <NA>      16.157    36.264 1.656956e-33       783
##    [3]      <NA>       5.979    31.808 3.443564e-29       606
##    [4]      <NA>       7.664    31.429 7.883209e-29       223
##    [5]      <NA>      14.795    29.018 1.730084e-26       198
##    ...       ...         ...       ...          ...       ...
##   [27]      <NA>       5.126    10.761 3.209779e-09       196
##   [28]      <NA>       4.331    10.725 3.434943e-09       223
##   [29]      <NA>        5.38    10.562 4.925627e-09       203
##   [30]      <NA>       6.521    10.514 5.421234e-09       472
##   [31]      <NA>       6.897    10.436 6.371957e-09       145
##   -------
##   seqinfo: 93 sequences (1 circular) from hg19 genome

# or 
chr = seqnames(HepG2)
table(chr)[16]

## chr16 
##    31

Question 2.2.4

Make a histogram of the widths of the regions from all chromosomes (not just chr16).

What is the median width?

median( width(HepG2) )

## [1] 560

##You can see the histogram
hist(width(HepG2),nclass=25)

Note it has a heavy right tail.

Question 2.3.1

In the previous video we saw a number of functions for manipulating interval ranges:

shift, narrow, flank, *, +, -, range, reduce, gaps, disjoin.

This is just a subset of all the possible operations, and remember, the rest are documented in the help pages mentioned in the video and in the book page. We will first do a simple review of these operations, so that you get a sense of using them in your R console. Then we will have a few questions which require more thought.

Load the IRanges package. Define an integer range starting at 101 and ending at 200. If we use the operation *2, this will zoom in, giving us a range with half the width. What is the starting point of the resulting range?

library(IRanges)
ir <- IRanges(101,200,100)
start(ir*2)

## [1] 126

Question 2.3.2

Define an integer range starting at 101 and ending at 200. If we use the operation narrow(x,start=20) what is the new starting point of the range?

narrow(ir,start = 20)

## IRanges of length 1
##     start end width
## [1]   120 200    81

Question 2.3.3

Define an integer range starting at 101 and ending at 200. If we use the operation +25, what is the width of the resulting range?

ir+25

## IRanges of length 1
##     start end width
## [1]    76 225   150

Question 2.3.4

Define an IRanges with starts at 1,11,21 and ends at 3,15,27. width() gives the widths for each range. What is the sum of the widths of all the ranges?

ir <- IRanges(start = c(1,11,21), end = c(3,15,27))
width(ir)

## [1] 3 5 7

sum(width(ir))

## [1] 15

Question 2.3.5

Define an IRanges object, x, with the following set of ranges:

Starts at 101,106,201,211,221,301,306,311,351,361,401,411,501

Ends at 150,160,210,270,225,310,310,330,390,380,415,470,510

Plot these ranges using the plotRanges function in the ph525x package. You can install this library if you have not done so already, with the command: install_github(“genomicsclass/ph525x”)

What is the total width from 101 to 510 which is not covered by ranges in x?

library(ph525x)

## Loading required package: png
## Loading required package: grid
## Loading required package: Biobase
## Welcome to Bioconductor
## 
##     Vignettes contain introductory material; view with
##     'browseVignettes()'. To cite Bioconductor, see
##     'citation("Biobase")', and for packages 'citation("pkgname")'.

x <- IRanges(start = c(101,106,201,211,221,301,306,311,351,361,401,411,501),end = c(150,160,210,270,225,310,310,330,390,380,415,470,510))

# find out the range first
range(x)

## IRanges of length 1
##     start end width
## [1]   101 510   410

# Because the range of x is from 101 to 510, we just need to take the sum of the gap widths
sum(width(gaps(x)))

## [1] 130

Question 2.3.6

How many disjoint ranges are contained within the ranges in ‘x’ from the previous question? By disjoint ranges, we mean the following: for two ranges [1,10] and [6,15], there are three disjoint ranges contained within: [1,5], [6,10], and [11,15].

disjoin(x)

## IRanges of length 17
##      start end width
## [1]    101 105     5
## [2]    106 150    45
## [3]    151 160    10
## [4]    201 210    10
## [5]    211 220    10
## ...    ... ...   ...
## [13]   381 390    10
## [14]   401 410    10
## [15]   411 415     5
## [16]   416 470    55
## [17]   501 510    10

length(disjoin(x))

## [1] 17

Question 2.3.7

An intra-range function we didn’t show in the video is resize(). Set up a grid of 2 stacked plots:

par(mfrow=c(2,1))

Now use plotRanges() to plot the original ‘x’ and resize(x,1). You will have to set the xlim to make sure that the plots line up vertically. You can use plotRanges(x,xlim=c(,600)) for example. What is the best description for the operation resize(x,1)

From the man page: resize resizes the ranges to the specified width where either the start, end, or center is used as an anchor. The default is fix=“start”, so resize(x,1) gives you the starting integer of each range in x.

plotRanges(x)

x1 <- resize(x,1)
plotRanges(x1)

Question 2.4.1

In the first week, in the subsection “What We Measure and Why”, we learned that DNA has two strands. These two strands are often called plus, “+”, and minus, “-”.

The GRanges object in the GenomicRanges package extends the concept of interval ranges in two major ways. The ranges are now also identified by:

the chromosome we are referring to (in Bioconductor, this is called “seqnames”)
the strand of the DNA we are referring to (“+” or “-”). No strand is labelled with a star, “*“.

Without these two pieces of information, a specification of a range of DNA would be ambiguous. Let’s make two ranges, with strand and chromosome information, and see how the range operations act based on strand.

library(IRanges)
library(ph525x)
library(GenomicRanges)
x = GRanges("chr1", IRanges(c(1,101),c(50,150)), strand=c("+","-"))

In the last assessment, we visualized IRanges with the plotRanges function in the ph525x library. We can get the internal IRanges from a GRanges object with the following code:

ranges(x)

## IRanges of length 2
##     start end width
## [1]     1  50    50
## [2]   101 150    50

So let’s define a new plotting function:

plotGRanges = function(x) plotRanges(ranges(x))

Compare x and resize(x,1) using plotGRanges. The result of running resize(x,1) is two ranges of width 1 which start…

par(mfrow=c(2,1))
plotGRanges(x)
x1 = resize(x,1)
plotGRanges(x1)

Question 2.4.2

Suppose we have two different sets of ranges, which overlap somewhat but not entirely. This is the case for many genes, in which there are different versions of transcripts, also called isoforms. The different transcripts consist of exons which end up in the final mRNA molecule, and a number of transcripts can share exons or have exons which are overlapping but not identical ranges.

We’ll start with a toy example, and learn how to load real genes later:

x = GRanges("chr1", IRanges(c(101,201,401,501),c(150,250,450,550)), strand="+")
y = GRanges("chr1", IRanges(c(101,221,301,401,541),c(150,250,350,470,550)), strand="+")

Plot these two sets of ranges using par(mfrow=c(2,1)) and two calls to plotGRanges.

par(mfrow=c(2,1))
plotGRanges(x)
plotGRanges(y)

If we want to keep the information about which set the ranges belong to, we could combine the two GRanges into a GRangesList:

GRangesList(x,y)

## GRangesList object of length 2:
## [[1]] 
## GRanges object with 4 ranges and 0 metadata columns:
##       seqnames     ranges strand
##          <Rle>  <IRanges>  <Rle>
##   [1]     chr1 [101, 150]      +
##   [2]     chr1 [201, 250]      +
##   [3]     chr1 [401, 450]      +
##   [4]     chr1 [501, 550]      +
## 
## [[2]] 
## GRanges object with 5 ranges and 0 metadata columns:
##       seqnames     ranges strand
##   [1]     chr1 [101, 150]      +
##   [2]     chr1 [221, 250]      +
##   [3]     chr1 [301, 350]      +
##   [4]     chr1 [401, 470]      +
##   [5]     chr1 [541, 550]      +
## 
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths

However, if we want to combine them into a single GRanges, we can use c():

c(x,y)

## GRanges object with 9 ranges and 0 metadata columns:
##       seqnames     ranges strand
##          <Rle>  <IRanges>  <Rle>
##   [1]     chr1 [101, 150]      +
##   [2]     chr1 [201, 250]      +
##   [3]     chr1 [401, 450]      +
##   [4]     chr1 [501, 550]      +
##   [5]     chr1 [101, 150]      +
##   [6]     chr1 [221, 250]      +
##   [7]     chr1 [301, 350]      +
##   [8]     chr1 [401, 470]      +
##   [9]     chr1 [541, 550]      +
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths

Find the total width which is covered by ranges in both x and y. Hint: use c(), disjoin() and %over%.

dis = disjoin(c(x,y))
both = dis %over% x & dis %over% y
sum(width(dis[both]))

## [1] 140

Question 2.4.3

What is the total which is in x or y but not in both?

not =! (dis %over% x & dis %over% y)
sum(width(dis[not]))

## [1] 130

Question 2.4.4

Define the new genomic range ‘z’, which covers range(ranges(x)) but has the opposite strand. What is the number of ranges in x which overlap z according to the %over% command?

z = GRanges("chr1", IRanges(c(101,201,401,501),c(150,250,450,550)), strand="-")
sum(x %over% z)

## [1] 0

The %over% command is specific to strand. If we want to find the ranges regardless of strand, we need to assign a strand of “*“.

In this assessment we will be working with two sets of regions.

library(GenomicRanges)
library(GenomicFeatures)

## Loading required package: AnnotationDbi

library(IRanges)
library(ERBS)
data("HepG2") # cell line for the liver origin
data("GM12878") # immortalized B cell
# browseVignettes("GenomicRanges") to read full review of functions.

Question 2.5.1

Where does the 17th HepG2 region start?

start(HepG2)[17]

## [1] 46528596

Question 2.5.2

Where does 17th region of HepG2 start? (This is the same as previous question, but it was kept since students might have already answered).

Question 2.5.3

Use ?distanceToNearest to find the closest region in GM12878 to the 17th region in HepG2. What is the start site of this region.

d = distanceToNearest(HepG2[17],GM12878)
i = subjectHits(d)
start(GM12878[i])

## [1] 46524762

Question 2.5.4

What is the distance between the 17th region of HepG2 and its closest region in GM12878?

d = distanceToNearest(HepG2[17],GM12878)
mcols(d)$distance

## [1] 2284

Question 2.5.5

For each region in the HepG2 find the closest region in GM12878 and record the distance. What is the proportion of these distances are smaller than 2000 base pairs? Distance is a metadata column on the Hits object, so consider mcols().

d = distanceToNearest(HepG2,GM12878)
mean(mcols(d)$distance < 2000)

## [1] 0.2673267

In this assessment we will be getting gene information for human genes from the Homo.sapiens library.

library(Homo.sapiens)

## Loading required package: OrganismDbi
## Loading required package: GO.db
## Loading required package: DBI
## 
## Loading required package: org.Hs.eg.db
## 
## Loading required package: TxDb.Hsapiens.UCSC.hg19.knownGene

ghs = genes(Homo.sapiens)

Question 2.6.1

What genome build was used here?

genome(ghs)

##                  chr1                  chr2                  chr3 
##                "hg19"                "hg19"                "hg19" 
##                  chr4                  chr5                  chr6 
##                "hg19"                "hg19"                "hg19" 
##                  chr7                  chr8                  chr9 
##                "hg19"                "hg19"                "hg19" 
##                 chr10                 chr11                 chr12 
##                "hg19"                "hg19"                "hg19" 
##                 chr13                 chr14                 chr15 
##                "hg19"                "hg19"                "hg19" 
##                 chr16                 chr17                 chr18 
##                "hg19"                "hg19"                "hg19" 
##                 chr19                 chr20                 chr21 
##                "hg19"                "hg19"                "hg19" 
##                 chr22                  chrX                  chrY 
##                "hg19"                "hg19"                "hg19" 
##                  chrM  chr1_gl000191_random  chr1_gl000192_random 
##                "hg19"                "hg19"                "hg19" 
##        chr4_ctg9_hap1  chr4_gl000193_random  chr4_gl000194_random 
##                "hg19"                "hg19"                "hg19" 
##         chr6_apd_hap1         chr6_cox_hap2         chr6_dbb_hap3 
##                "hg19"                "hg19"                "hg19" 
##        chr6_mann_hap4         chr6_mcf_hap5         chr6_qbl_hap6 
##                "hg19"                "hg19"                "hg19" 
##        chr6_ssto_hap7  chr7_gl000195_random  chr8_gl000196_random 
##                "hg19"                "hg19"                "hg19" 
##  chr8_gl000197_random  chr9_gl000198_random  chr9_gl000199_random 
##                "hg19"                "hg19"                "hg19" 
##  chr9_gl000200_random  chr9_gl000201_random chr11_gl000202_random 
##                "hg19"                "hg19"                "hg19" 
##       chr17_ctg5_hap1 chr17_gl000203_random chr17_gl000204_random 
##                "hg19"                "hg19"                "hg19" 
## chr17_gl000205_random chr17_gl000206_random chr18_gl000207_random 
##                "hg19"                "hg19"                "hg19" 
## chr19_gl000208_random chr19_gl000209_random chr21_gl000210_random 
##                "hg19"                "hg19"                "hg19" 
##        chrUn_gl000211        chrUn_gl000212        chrUn_gl000213 
##                "hg19"                "hg19"                "hg19" 
##        chrUn_gl000214        chrUn_gl000215        chrUn_gl000216 
##                "hg19"                "hg19"                "hg19" 
##        chrUn_gl000217        chrUn_gl000218        chrUn_gl000219 
##                "hg19"                "hg19"                "hg19" 
##        chrUn_gl000220        chrUn_gl000221        chrUn_gl000222 
##                "hg19"                "hg19"                "hg19" 
##        chrUn_gl000223        chrUn_gl000224        chrUn_gl000225 
##                "hg19"                "hg19"                "hg19" 
##        chrUn_gl000226        chrUn_gl000227        chrUn_gl000228 
##                "hg19"                "hg19"                "hg19" 
##        chrUn_gl000229        chrUn_gl000230        chrUn_gl000231 
##                "hg19"                "hg19"                "hg19" 
##        chrUn_gl000232        chrUn_gl000233        chrUn_gl000234 
##                "hg19"                "hg19"                "hg19" 
##        chrUn_gl000235        chrUn_gl000236        chrUn_gl000237 
##                "hg19"                "hg19"                "hg19" 
##        chrUn_gl000238        chrUn_gl000239        chrUn_gl000240 
##                "hg19"                "hg19"                "hg19" 
##        chrUn_gl000241        chrUn_gl000242        chrUn_gl000243 
##                "hg19"                "hg19"                "hg19" 
##        chrUn_gl000244        chrUn_gl000245        chrUn_gl000246 
##                "hg19"                "hg19"                "hg19" 
##        chrUn_gl000247        chrUn_gl000248        chrUn_gl000249 
##                "hg19"                "hg19"                "hg19"

seqinfo(ghs)

## Seqinfo object with 93 sequences (1 circular) from hg19 genome:
##   seqnames       seqlengths isCircular genome
##   chr1            249250621      FALSE   hg19
##   chr2            243199373      FALSE   hg19
##   chr3            198022430      FALSE   hg19
##   chr4            191154276      FALSE   hg19
##   chr5            180915260      FALSE   hg19
##   ...                   ...        ...    ...
##   chrUn_gl000245      36651      FALSE   hg19
##   chrUn_gl000246      38154      FALSE   hg19
##   chrUn_gl000247      36422      FALSE   hg19
##   chrUn_gl000248      39786      FALSE   hg19
##   chrUn_gl000249      38502      FALSE   hg19

How many genes are represented in ghs?

length(ghs)

## [1] 23056

Question 2.6.2

Whatis the chromosome with the most genes?

which.max( table( seqnames( ghs ) ))

## chr1 
##    1

Question 2.6.3

Make a histogram of the widths of genes (use the width() on the GRanges object). This width gives the number of basepairs from the start of the gene to the end, so including exons and introns.

w = width( ghs )
hist( w )

## the larger values are so much larger than the bulk of the data
## that we can barely see the frequencies of large values in the histogram 
## the log transformation better shows how fat the right tail is:
hist( log10(w))

Quetion 2.6.4

What is the median gene width

median(w)

## [1] 20115.5

In the next assessment we will find the closest genes to some of our binding sites. We will use a consensus set of regions. In the video we did it like this:

res = findOverlaps(HepG2,GM12878)
erbs = HepG2[queryHits(res)]
erbs = granges(erbs)

The following command is similar.

erbs2= intersect(HepG2,GM12878)

Question 2.7.1

Which is TRUE?

## first order them
erbs3 = erbs[order(erbs),]
##confirm same chr
all( seqnames(erbs2)==seqnames(erbs3) )

## [1] TRUE

mean( start(erbs2)==start(erbs3) & end(erbs2)==end(erbs3) )

## [1] 0.9333333

##the intersection should be smaller
all( width(erbs2) <= width(erbs3) )

## [1] TRUE

# More than 90% of these two are similar with different regions being smaller in erbs2

Question 2.7.2

Using ghs regions:

library(Homo.sapiens)
ghs = genes(Homo.sapiens)

and what you learned in the video, convert the ghs object to one that represents just the tss. What is the TSS (Transcription Start Site) of the gene with ID: 100113402?

Hint: Look at the ghs in the console. Note that the names of the ranges are the same as the GENEID column. So you can index the ranges directly with "100113402".

tssgr = resize(ghs,1)
start(tssgr["100113402"])

## [1] 70563402

Question 2.7.3

Now using the erbs regions defined in a previous question:

library(ERBS)
data("HepG2")
data("GM12878")
res = findOverlaps(HepG2,GM12878)
erbs = HepG2[queryHits(res)]
erbs = granges(erbs)

What is the GENEID of the gene with TSS closest to the 4th region of erbs?

res = findOverlaps(erbs[4],tssgr)
res = tssgr[subjectHits(res)]

# or 
i = nearest(erbs[4],tssgr)
mcols(tssgr)$GENEID[i]

## FactorList of length 1
## [["2101"]] 2101

Question 2.7.4

In the question above, you identified a gene. Use select function to determine which is the SYMBOL of this gene.

key = as.character(mcols(tssgr)$GENEID[i])
# columns(Homo.sapiens) would give us the list of names of columns in the Homo.sapiens
select(Homo.sapiens, keys = key, columns = c("SYMBOL", "GENENAME"), keytype = "GENEID")

##   GENEID SYMBOL                        GENENAME
## 1   2101  ESRRA estrogen-related receptor alpha

In the final assessment of this module, we will get the sequence for the regions created like this:

library(ERBS)
data("HepG2")
data("GM12878")
res = findOverlaps(HepG2,GM12878)
erbs = HepG2[queryHits(res)]
erbs = granges(erbs)

Question 2.8.1

What genome build was used to create these regions?

genome(erbs)

##                  chr1                  chr2                  chr3 
##                "hg19"                "hg19"                "hg19" 
##                  chr4                  chr5                  chr6 
##                "hg19"                "hg19"                "hg19" 
##                  chr7                  chr8                  chr9 
##                "hg19"                "hg19"                "hg19" 
##                 chr10                 chr11                 chr12 
##                "hg19"                "hg19"                "hg19" 
##                 chr13                 chr14                 chr15 
##                "hg19"                "hg19"                "hg19" 
##                 chr16                 chr17                 chr18 
##                "hg19"                "hg19"                "hg19" 
##                 chr19                 chr20                 chr21 
##                "hg19"                "hg19"                "hg19" 
##                 chr22                  chrX                  chrY 
##                "hg19"                "hg19"                "hg19" 
##                  chrM  chr1_gl000191_random  chr1_gl000192_random 
##                "hg19"                "hg19"                "hg19" 
##        chr4_ctg9_hap1  chr4_gl000193_random  chr4_gl000194_random 
##                "hg19"                "hg19"                "hg19" 
##         chr6_apd_hap1         chr6_cox_hap2         chr6_dbb_hap3 
##                "hg19"                "hg19"                "hg19" 
##        chr6_mann_hap4         chr6_mcf_hap5         chr6_qbl_hap6 
##                "hg19"                "hg19"                "hg19" 
##        chr6_ssto_hap7  chr7_gl000195_random  chr8_gl000196_random 
##                "hg19"                "hg19"                "hg19" 
##  chr8_gl000197_random  chr9_gl000198_random  chr9_gl000199_random 
##                "hg19"                "hg19"                "hg19" 
##  chr9_gl000200_random  chr9_gl000201_random chr11_gl000202_random 
##                "hg19"                "hg19"                "hg19" 
##       chr17_ctg5_hap1 chr17_gl000203_random chr17_gl000204_random 
##                "hg19"                "hg19"                "hg19" 
## chr17_gl000205_random chr17_gl000206_random chr18_gl000207_random 
##                "hg19"                "hg19"                "hg19" 
## chr19_gl000208_random chr19_gl000209_random chr21_gl000210_random 
##                "hg19"                "hg19"                "hg19" 
##        chrUn_gl000211        chrUn_gl000212        chrUn_gl000213 
##                "hg19"                "hg19"                "hg19" 
##        chrUn_gl000214        chrUn_gl000215        chrUn_gl000216 
##                "hg19"                "hg19"                "hg19" 
##        chrUn_gl000217        chrUn_gl000218        chrUn_gl000219 
##                "hg19"                "hg19"                "hg19" 
##        chrUn_gl000220        chrUn_gl000221        chrUn_gl000222 
##                "hg19"                "hg19"                "hg19" 
##        chrUn_gl000223        chrUn_gl000224        chrUn_gl000225 
##                "hg19"                "hg19"                "hg19" 
##        chrUn_gl000226        chrUn_gl000227        chrUn_gl000228 
##                "hg19"                "hg19"                "hg19" 
##        chrUn_gl000229        chrUn_gl000230        chrUn_gl000231 
##                "hg19"                "hg19"                "hg19" 
##        chrUn_gl000232        chrUn_gl000233        chrUn_gl000234 
##                "hg19"                "hg19"                "hg19" 
##        chrUn_gl000235        chrUn_gl000236        chrUn_gl000237 
##                "hg19"                "hg19"                "hg19" 
##        chrUn_gl000238        chrUn_gl000239        chrUn_gl000240 
##                "hg19"                "hg19"                "hg19" 
##        chrUn_gl000241        chrUn_gl000242        chrUn_gl000243 
##                "hg19"                "hg19"                "hg19" 
##        chrUn_gl000244        chrUn_gl000245        chrUn_gl000246 
##                "hg19"                "hg19"                "hg19" 
##        chrUn_gl000247        chrUn_gl000248        chrUn_gl000249 
##                "hg19"                "hg19"                "hg19"

Question 2.8.2

Now load the human genome data

library(BSgenome.Hsapiens.UCSC.hg19)

## Loading required package: BSgenome
## Loading required package: Biostrings
## Loading required package: rtracklayer

Now use the getSeq function to extract the sequence of each region in erbs. Then compute the GC -content (the number of C’s + the number of G’s divided by the length of sequence) of each Q: What is the median GC-content

# two methods can be done
erbseq = getSeq(Hsapiens, erbs)
af = alphabetFrequency(erbseq,as.prob = TRUE)
gc = af[,2]+af[,3]
gcM = median(gc)
gcM

## [1] 0.652568

# or 
seqs = getSeq(Hsapiens,erbs)
gc = alphabetFrequency(seqs)[,2:3]
n = width(erbs)
gccontent = rowSums(gc)/n

Question 2.8.3

Now create a control set of regions by shifting erbs by 10000.

What is the median GC-content of these control regions:

erbsControl = shift(erbs, 10000)
erbseqControl = getSeq(Hsapiens, erbsControl)
af = alphabetFrequency(erbseqControl,as.prob = TRUE)
gcControl = (af[,2]+af[,3])
gcControlM = median(gcControl)
gcControlM

## [1] 0.4860174

Compare boxplots of the gc-content of the binding sites and the control regions. Note that the binding sites have, in general, higher GC-content.

boxplot(gcM,gcControlM)

A likely explanation for this is that we have higher GC-content in the promoter of genes (upstream from the transcription start site). And as we saw in the videos and earlier assessments, our binding sites tend to be close to the transcription start sites.

Computing Genomic Regions

Prasanth

August 26, 2015