archWk4basic Exercises

Question 4.4.1 Structure of ErbViz juxta() output

In the video we demonstrated the creation of the new package, erbsViz, centered around the following visualization funciton:

juxta = function (chrname="chr22", ...) 
{
    require(ERBS)
    data(HepG2)
    data(GM12878)
    require(ggbio)
    require(GenomicRanges)  # "subset" is overused, need import detail
    ap1 = autoplot(GenomicRanges::subset(HepG2, seqnames==chrname))
    ap2 = autoplot(GenomicRanges::subset(GM12878, seqnames==chrname))
    tracks(HepG2 = ap1, Bcell = ap2, ...)
  }

Once this function is added to your workspace, you can create the erbsViz package using package.skeleton("erbsViz", "juxta")

After doing this, you get an installable package upon editing the manual pages generated so that each title field has proper content, essentially an unquoted string of descriptive prose. You use install.packages("erbsViz", repos = NULL, type = "source")

Use the erbsViz package as created during the video to perform the following computation:

library(erbsViz)

## 
## Attaching package: 'erbsViz'
## 
## The following object is masked _by_ '.GlobalEnv':
## 
##     juxta

jdemo = juxta()

## Loading required package: ERBS
## Loading required package: ggbio
## Loading required package: BiocGenerics
## Loading required package: parallel
## 
## Attaching package: 'BiocGenerics'
## 
## The following objects are masked from 'package:parallel':
## 
##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
##     clusterExport, clusterMap, parApply, parCapply, parLapply,
##     parLapplyLB, parRapply, parSapply, parSapplyLB
## 
## The following object is masked from 'package:stats':
## 
##     xtabs
## 
## The following objects are masked from 'package:base':
## 
##     anyDuplicated, append, as.data.frame, as.vector, cbind,
##     colnames, do.call, duplicated, eval, evalq, Filter, Find, get,
##     intersect, is.unsorted, lapply, Map, mapply, match, mget,
##     order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
##     rbind, Reduce, rep.int, rownames, sapply, setdiff, sort,
##     table, tapply, union, unique, unlist, unsplit
## 
## Loading required package: ggplot2
## Creating a generic function for 'nchar' from package 'base' in package 'S4Vectors'
## Need specific help about ggbio? try mailing 
##  the maintainer or visit http://tengfei.github.com/ggbio/
## 
## Attaching package: 'ggbio'
## 
## The following objects are masked from 'package:ggplot2':
## 
##     geom_bar, geom_rect, geom_segment, ggsave, stat_bin,
##     stat_identity, xlim
## 
## Loading required package: GenomicRanges
## Loading required package: S4Vectors
## Loading required package: stats4
## Loading required package: IRanges
## Loading required package: GenomeInfoDb

What is the class of jdemo?

class(jdemo)

## [1] "Tracks"
## attr(,"package")
## [1] "ggbio"

Have a look at getSlots(getClass(class(jdemo))) to get a sense of how the display might be customized.

label.text.color

4.5.1 Setup for a makrOrganismPackage call

In the preceding video, we saw how to make an OrganismDb package for S. cerevisiae. In this problem set, you will make such a package for C. elegans. The makeOrganismPackage function in OrganismDbi requires a list for the graphData parameter

gd = list(join1 = c(GO.db="GOID", [fixme]="GO"),
          join2 = c([fixme]="ENTREZID",
                    TxDb.Celegans.UCSC.ce6.ensGene="GENEID"))

What should you substitute for [fixme] in the call above? Hint: the package name to be used has eg.db as suffix. See the annotation cheat sheet for the contraction pattern.

org.Ce.eg.db

 gd = list(join1 = c(GO.db="GOID", org.Ce.eg.db="GO"),
           join2 = c(org.Ce.eg.db="ENTREZID",
                     TxDb.Celegans.UCSC.ce6.ensGene="GENEID"))

Question 4.5.2 A query to the new Cen.ele6 package

After correctly specifying gd above, use

library(OrganismDbi)

## Loading required package: AnnotationDbi
## Loading required package: Biobase
## Welcome to Bioconductor
## 
##     Vignettes contain introductory material; view with
##     'browseVignettes()'. To cite Bioconductor, see
##     'citation("Biobase")', and for packages 'citation("pkgname")'.
## 
## Loading required package: GenomicFeatures

if (!file.exists("Cen.ele6"))
makeOrganismPackage("Cen.ele6", gd, "C. elegans", "1.0.0", "me <me@abc.com>", 
      "me <me@abc.com>", ".")

and install the package via

install.packages("Cen.ele6", repos=NULL, type="source")

## Installing package into 'C:/Users/Prasanth/Documents/R/win-library/3.2'
## (as 'lib' is unspecified)

use library(Cen.ele6). What is the `sum(seqlengths(Cen.ele6))?

Question 4.8.1 Genes with High MBP1 binding scores have low MCM1 scores

Use harbChIP to display the distribution of binding scores for MCM1 over the yeast genome. Use the qqnormal display as follows:

library(harbChIP)

## Loading required package: tools
## Loading required package: Biostrings
## 
## Attaching package: 'harbChIP'
## 
## The following object is masked from 'package:OrganismDbi':
## 
##     keys
## 
## The following object is masked from 'package:GenomicFeatures':
## 
##     organism
## 
## The following object is masked from 'package:AnnotationDbi':
## 
##     keys
## 
## The following object is masked from 'package:GenomeInfoDb':
## 
##     organism
## 
## The following object is masked from 'package:BiocGenerics':
## 
##     organism

data(harbChIP)
sv=qqnorm(exprs(harbChIP)[,"MCM1"], main = "Mcm1 binding scores")

Superimporse five genes to which Mbp1 had high binding scores

plot(sv, main = "Mcm1 binding scores")
topb = names(sort(exprs(harbChIP)[, "MBP1"], decreasing = TRUE)[1:5])
points(sv$x[topb], sv$y[topb], col = "red", pch=19)

How many distinct red points do you see? Two of the genes that are neighbours ad very similar binding scores reported and are not distinguishable.

Question 4.8.2 Slices of the cell cycle transcriptional cascade

Plot the trajectory for the gene for which Mcm1 binding was reportedly strongest, and add to this in purpose the trajectory of the gene for which Mbp1 binding was strongest.

library(yeastCC)
data("spYCCES")
alp = spYCCES[,spYCCES$syncmeth=="alpha"]
nm = names(which.max(exprs(harbChIP)[,"MCM1"]))
nm2 = names(which.max(exprs(harbChIP)[,"MBP1"]))
plot(exprs(alp)[nm,]~alp$time, ylab=paste0(nm,"expression"), type="l")
lines(exprs(alp)[nm2,]~alp$time, ylab=paste0(nm,"expression"), col="purple")
legend(40, -.5, lty=1, col=c("black", "purple"), legend=c("MCM1", "MBP1"))

How far apart in minutes are the first peaks of the genes plotted here:

Question 4.8.3 Figure enhancement query

What is the name of a function that will help the viewer distinguish the genes plotted in different colors on the preceding display?

legend(40, -.5, lty=1, col=c(“black”, “purple”), legend=c(“MCM1”, “MBP1”))

Question 4.9.1

In the video we looked at the coincidence of GWAS hits from NHGRI GWAS catalog and the ESRRA binding sites in the B cell line GM12878.

library(ERBS)
data("GM12878")
library(gwascat)
data("gwrngs19")
fo = findOverlaps(GM12878, reduce(gwrngs19))
length(fo)

## [1] 28

How many distinct peaks include atleast one GWAS hit?

length(unique(queryHits(fo)))

## [1] 26

4.9.2 A p-value based on random repositioning

GWAS hits are distributed very widely over the genome. It may be the case that any set of genomic intervals with widths comparable to those of GM12878 would overlap with as many SNP. In this case we would regard the 28 GWAS hits covered by ESRRA binding sites to be something like a chance event.

The reposition function in the updated ph525x package takes a set of GRanges and repositions them to randomly selected starts on randomly selected chromosomes. By counting the number of GWAS hits covered by repositioned versions of GM12878, we get an approximate realization of the “null distribution” of this quantity.

Please update your installation of ph525x with

library(devtools)
install_github("genomicsclass/ph525x")

then

library(ph525x)

## Loading required package: png
## Loading required package: grid

library(gwascat)
rg = reduce(gwrngs19)
rsc = sapply(1:100, function(x) length(findOverlaps(reposition(GM12878),rg)))
mean(rsc > length(fo))

## [1] 0

What does this indicate as an approximate p-value for the null hypothesis: the tendency of an interval to contain a GWAS hit is unrelated to the status of the interval as an ESRRA binding site?

In one hundred realizations from the putative null distribution, no score exceeded the observed score. The probability of getting a score at least as high as the observed, under the null hypothesis, is therefore estimated to be at most 1%. This is an uncertain estimate however.

Q 4.10.1 Data acquisition; Identifying a sample by experimental characteristics

To learn about the experimental design, use

library(GEOquery)

## Setting options('download.file.method.GEOquery'='auto')

pcl = getGEO("GSE35463")[[1]]

## ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE35nnn/GSE35463/matrix/
## Found 1 file(s)
## GSE35463_series_matrix.txt.gz

## Warning in download.file(myurl, destfile, mode = mode, quiet = TRUE, method
## = getOption("download.file.method.GEOquery")): downloaded length 49383106 !
## = reported length 200

## File stored at: 
## C:\Users\Prasanth\AppData\Local\Temp\RtmpSKzp98/GPL6244.soft

pcl$source_name_ch1

##                                                                 V2 
##     Cells transfected with siCtrl and cultured for 9  hrs in EBSS. 
##                                                                 V3 
##     Cells transfected with siCtrl and cultured for 9  hrs in Mock. 
##                                                                 V4 
## Cells transfected with siRelA/p65 and cultured for 9  hrs in EBSS. 
##                                                                 V5 
## Cells transfected with siRelA/p65 and cultured for 9  hrs in Mock. 
##                                                                 V6 
##     Cells transfected with siRelB and cultured for 9  hrs in EBSS. 
##                                                                 V7 
##     Cells transfected with siRelB and cultured for 9  hrs in Mock. 
##                                                                 V8 
##     Cells transfected with siCtrl and cultured for 0  hrs in EBSS. 
##                                                                 V9 
##     Cells transfected with siCtrl and cultured for 3  hrs in EBSS. 
##                                                                V10 
##  Cells transfected with  siCtrl  and cultured for  6  hrs in EBSS. 
##                                                                V11 
##  Cells transfected with  siCtrl  and cultured for  9  hrs in EBSS. 
##                                                                V12 
##   Cells transfected with  siNupr1 and cultured for 0  hrs in EBSS. 
##                                                                V13 
##  Cells transfected with  siNupr1  and cultured for 3  hrs in EBSS. 
##                                                                V14 
## Cells transfected with  siNupr1  and cultured for  6  hrs in EBSS. 
##                                                                V15 
## Cells transfected with  siNupr1  and cultured for  9  hrs in EBSS. 
## 14 Levels: Cells transfected with  siCtrl  and cultured for  6  hrs in EBSS. ...

Which of the samples represents expression after 9 hours of culturing a direct knockdown of NUPR1? Respond with a GSM-prefixed label.

GSM869014

Question 4.10.2 Annotation for gene 1.0 ST Arrays

One of the more modern platforms identified in our search for pancreatic cancer studies conducted in GEO is the Affy gene 1.0 ST used in GSE35463. When we acquire this, we obtain an ExpressionSet instance. Elements in the phenoData inform us that experiments used a combination of MiaPaCa2 and Panc1 cells, and that Nupr1, RelA/p65, RelB and IER3 were knocked-down using 140 ng of specific siRNAs.

What is the transcript cluster id for NUPR1? Use select on

hugene10sttranscriptcluster.db.:

BiocInstaller::biocLite("hugene10sttranscriptcluster.db")

library(hugene10sttranscriptcluster.db)

##

select(hugene10sttranscriptcluster.db, keys="NUPR1", keytype = "SYMBOL", columns="PROBEID")

##   SYMBOL PROBEID
## 1  NUPR1 8000574

Question 4.10.3 Determining the preprocessing method

What is the three-letter acronym for the data processing method used in developing the expression quantifications in this experiment?

RMA

Use pcl$data_processing

Question 4.10.4 The distribution of differences of paired NUPR1 measures

Form the vector of four time-specific differences between siCtrl-treated and siNupr1-treated cultures. Use shapiro.test to assess whether there is a substantial departure from a Normal distribution for these differences.

What is the p-value for the test for non-normality?

# PROEID can be taken from previous question 

eshift = exprs(pcl)["8000574", 7:10] - exprs(pcl)["8000574", 11:14]
shapiro.test(eshift)

## 
##  Shapiro-Wilk normality test
## 
## data:  eshift
## W = 0.89471, p-value = 0.4052

Question 4.10.5 Does siNUPR1 knowndown NUPR1?

Given that the Shapiro-Wilk test does not reject a Gaussian model for this tiny dataset, we might proceed to use a paired t-test to assess the null hypothesis that siNupr1 does not shift the mean trajectory of NUPR1 expression from 0 to 9 hours in the pc cell cultures.

What is the p-value?:

t.test(eshift)

## 
##  One Sample t-test
## 
## data:  eshift
## t = 6.7499, df = 3, p-value = 0.006642
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  0.4593764 1.2789686
## sample estimates:
## mean of x 
## 0.8691725

Select the statement that represent a valid concern about this focused test

With the small sample size it may be possible to do a better job of estimating the standard error of the effect estimate by an empirical Bayes method that assesses variability present for all genes on the array

TRUE

The Gusssian assumptions of the t-test may not be satisfied by the data > FALSE

The data have been preprocessed by RMA which may eliminate some of the signal > FALSE

Explanation:

t-test is robust against departures from Gaussian model.

RMA: It is unlikely that the experimental perturbations would induce large-scale changes in distribution of expression.

Correct answer emphasizes importance of using all information on variability obtained in the assay.