In the video we demonstrated the creation of the new package, erbsViz, centered around the following visualization funciton:
juxta = function (chrname="chr22", ...)
{
require(ERBS)
data(HepG2)
data(GM12878)
require(ggbio)
require(GenomicRanges) # "subset" is overused, need import detail
ap1 = autoplot(GenomicRanges::subset(HepG2, seqnames==chrname))
ap2 = autoplot(GenomicRanges::subset(GM12878, seqnames==chrname))
tracks(HepG2 = ap1, Bcell = ap2, ...)
}
Once this function is added to your workspace, you can create the erbsViz package using package.skeleton("erbsViz", "juxta")
After doing this, you get an installable package upon editing the manual pages generated so that each title field has proper content, essentially an unquoted string of descriptive prose. You use install.packages("erbsViz", repos = NULL, type = "source")
Use the erbsViz package as created during the video to perform the following computation:
library(erbsViz)
##
## Attaching package: 'erbsViz'
##
## The following object is masked _by_ '.GlobalEnv':
##
## juxta
jdemo = juxta()
## Loading required package: ERBS
## Loading required package: ggbio
## Loading required package: BiocGenerics
## Loading required package: parallel
##
## Attaching package: 'BiocGenerics'
##
## The following objects are masked from 'package:parallel':
##
## clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
## clusterExport, clusterMap, parApply, parCapply, parLapply,
## parLapplyLB, parRapply, parSapply, parSapplyLB
##
## The following object is masked from 'package:stats':
##
## xtabs
##
## The following objects are masked from 'package:base':
##
## anyDuplicated, append, as.data.frame, as.vector, cbind,
## colnames, do.call, duplicated, eval, evalq, Filter, Find, get,
## intersect, is.unsorted, lapply, Map, mapply, match, mget,
## order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
## rbind, Reduce, rep.int, rownames, sapply, setdiff, sort,
## table, tapply, union, unique, unlist, unsplit
##
## Loading required package: ggplot2
## Creating a generic function for 'nchar' from package 'base' in package 'S4Vectors'
## Need specific help about ggbio? try mailing
## the maintainer or visit http://tengfei.github.com/ggbio/
##
## Attaching package: 'ggbio'
##
## The following objects are masked from 'package:ggplot2':
##
## geom_bar, geom_rect, geom_segment, ggsave, stat_bin,
## stat_identity, xlim
##
## Loading required package: GenomicRanges
## Loading required package: S4Vectors
## Loading required package: stats4
## Loading required package: IRanges
## Loading required package: GenomeInfoDb
What is the class of jdemo?
class(jdemo)
## [1] "Tracks"
## attr(,"package")
## [1] "ggbio"
Have a look at getSlots(getClass(class(jdemo))) to get a sense of how the display might be customized.
label.text.color
In the preceding video, we saw how to make an OrganismDb package for S. cerevisiae. In this problem set, you will make such a package for C. elegans. The makeOrganismPackage function in OrganismDbi requires a list for the graphData parameter
gd = list(join1 = c(GO.db="GOID", [fixme]="GO"),
join2 = c([fixme]="ENTREZID",
TxDb.Celegans.UCSC.ce6.ensGene="GENEID"))
What should you substitute for [fixme] in the call above? Hint: the package name to be used has eg.db as suffix. See the annotation cheat sheet for the contraction pattern.
org.Ce.eg.db
gd = list(join1 = c(GO.db="GOID", org.Ce.eg.db="GO"),
join2 = c(org.Ce.eg.db="ENTREZID",
TxDb.Celegans.UCSC.ce6.ensGene="GENEID"))
After correctly specifying gd above, use
library(OrganismDbi)
## Loading required package: AnnotationDbi
## Loading required package: Biobase
## Welcome to Bioconductor
##
## Vignettes contain introductory material; view with
## 'browseVignettes()'. To cite Bioconductor, see
## 'citation("Biobase")', and for packages 'citation("pkgname")'.
##
## Loading required package: GenomicFeatures
if (!file.exists("Cen.ele6"))
makeOrganismPackage("Cen.ele6", gd, "C. elegans", "1.0.0", "me <me@abc.com>",
"me <me@abc.com>", ".")
and install the package via
install.packages("Cen.ele6", repos=NULL, type="source")
## Installing package into 'C:/Users/Prasanth/Documents/R/win-library/3.2'
## (as 'lib' is unspecified)
use library(Cen.ele6). What is the `sum(seqlengths(Cen.ele6))?
Use harbChIP to display the distribution of binding scores for MCM1 over the yeast genome. Use the qqnormal display as follows:
library(harbChIP)
## Loading required package: tools
## Loading required package: Biostrings
##
## Attaching package: 'harbChIP'
##
## The following object is masked from 'package:OrganismDbi':
##
## keys
##
## The following object is masked from 'package:GenomicFeatures':
##
## organism
##
## The following object is masked from 'package:AnnotationDbi':
##
## keys
##
## The following object is masked from 'package:GenomeInfoDb':
##
## organism
##
## The following object is masked from 'package:BiocGenerics':
##
## organism
data(harbChIP)
sv=qqnorm(exprs(harbChIP)[,"MCM1"], main = "Mcm1 binding scores")
Superimporse five genes to which Mbp1 had high binding scores
plot(sv, main = "Mcm1 binding scores")
topb = names(sort(exprs(harbChIP)[, "MBP1"], decreasing = TRUE)[1:5])
points(sv$x[topb], sv$y[topb], col = "red", pch=19)
How many distinct red points do you see? Two of the genes that are neighbours ad very similar binding scores reported and are not distinguishable.
Plot the trajectory for the gene for which Mcm1 binding was reportedly strongest, and add to this in purpose the trajectory of the gene for which Mbp1 binding was strongest.
library(yeastCC)
data("spYCCES")
alp = spYCCES[,spYCCES$syncmeth=="alpha"]
nm = names(which.max(exprs(harbChIP)[,"MCM1"]))
nm2 = names(which.max(exprs(harbChIP)[,"MBP1"]))
plot(exprs(alp)[nm,]~alp$time, ylab=paste0(nm,"expression"), type="l")
lines(exprs(alp)[nm2,]~alp$time, ylab=paste0(nm,"expression"), col="purple")
legend(40, -.5, lty=1, col=c("black", "purple"), legend=c("MCM1", "MBP1"))
How far apart in minutes are the first peaks of the genes plotted here:
What is the name of a function that will help the viewer distinguish the genes plotted in different colors on the preceding display?
legend(40, -.5, lty=1, col=c(“black”, “purple”), legend=c(“MCM1”, “MBP1”))
In the video we looked at the coincidence of GWAS hits from NHGRI GWAS catalog and the ESRRA binding sites in the B cell line GM12878.
library(ERBS)
data("GM12878")
library(gwascat)
data("gwrngs19")
fo = findOverlaps(GM12878, reduce(gwrngs19))
length(fo)
## [1] 28
How many distinct peaks include atleast one GWAS hit?
length(unique(queryHits(fo)))
## [1] 26
GWAS hits are distributed very widely over the genome. It may be the case that any set of genomic intervals with widths comparable to those of GM12878 would overlap with as many SNP. In this case we would regard the 28 GWAS hits covered by ESRRA binding sites to be something like a chance event.
The reposition function in the updated ph525x package takes a set of GRanges and repositions them to randomly selected starts on randomly selected chromosomes. By counting the number of GWAS hits covered by repositioned versions of GM12878, we get an approximate realization of the “null distribution” of this quantity.
Please update your installation of ph525x with
library(devtools)
install_github("genomicsclass/ph525x")
then
library(ph525x)
## Loading required package: png
## Loading required package: grid
library(gwascat)
rg = reduce(gwrngs19)
rsc = sapply(1:100, function(x) length(findOverlaps(reposition(GM12878),rg)))
mean(rsc > length(fo))
## [1] 0
What does this indicate as an approximate p-value for the null hypothesis: the tendency of an interval to contain a GWAS hit is unrelated to the status of the interval as an ESRRA binding site?
To learn about the experimental design, use
library(GEOquery)
## Setting options('download.file.method.GEOquery'='auto')
pcl = getGEO("GSE35463")[[1]]
## ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE35nnn/GSE35463/matrix/
## Found 1 file(s)
## GSE35463_series_matrix.txt.gz
## Warning in download.file(myurl, destfile, mode = mode, quiet = TRUE, method
## = getOption("download.file.method.GEOquery")): downloaded length 49383106 !
## = reported length 200
## File stored at:
## C:\Users\Prasanth\AppData\Local\Temp\RtmpSKzp98/GPL6244.soft
pcl$source_name_ch1
## V2
## Cells transfected with siCtrl and cultured for 9 hrs in EBSS.
## V3
## Cells transfected with siCtrl and cultured for 9 hrs in Mock.
## V4
## Cells transfected with siRelA/p65 and cultured for 9 hrs in EBSS.
## V5
## Cells transfected with siRelA/p65 and cultured for 9 hrs in Mock.
## V6
## Cells transfected with siRelB and cultured for 9 hrs in EBSS.
## V7
## Cells transfected with siRelB and cultured for 9 hrs in Mock.
## V8
## Cells transfected with siCtrl and cultured for 0 hrs in EBSS.
## V9
## Cells transfected with siCtrl and cultured for 3 hrs in EBSS.
## V10
## Cells transfected with siCtrl and cultured for 6 hrs in EBSS.
## V11
## Cells transfected with siCtrl and cultured for 9 hrs in EBSS.
## V12
## Cells transfected with siNupr1 and cultured for 0 hrs in EBSS.
## V13
## Cells transfected with siNupr1 and cultured for 3 hrs in EBSS.
## V14
## Cells transfected with siNupr1 and cultured for 6 hrs in EBSS.
## V15
## Cells transfected with siNupr1 and cultured for 9 hrs in EBSS.
## 14 Levels: Cells transfected with siCtrl and cultured for 6 hrs in EBSS. ...
Which of the samples represents expression after 9 hours of culturing a direct knockdown of NUPR1? Respond with a GSM-prefixed label.
GSM869014
One of the more modern platforms identified in our search for pancreatic cancer studies conducted in GEO is the Affy gene 1.0 ST used in GSE35463. When we acquire this, we obtain an ExpressionSet instance. Elements in the phenoData inform us that experiments used a combination of MiaPaCa2 and Panc1 cells, and that Nupr1, RelA/p65, RelB and IER3 were knocked-down using 140 ng of specific siRNAs.
What is the transcript cluster id for NUPR1? Use select on
hugene10sttranscriptcluster.db.:
BiocInstaller::biocLite("hugene10sttranscriptcluster.db")
library(hugene10sttranscriptcluster.db)
##
select(hugene10sttranscriptcluster.db, keys="NUPR1", keytype = "SYMBOL", columns="PROBEID")
## SYMBOL PROBEID
## 1 NUPR1 8000574
What is the three-letter acronym for the data processing method used in developing the expression quantifications in this experiment?
RMA
Use pcl$data_processing
Form the vector of four time-specific differences between siCtrl-treated and siNupr1-treated cultures. Use shapiro.test to assess whether there is a substantial departure from a Normal distribution for these differences.
What is the p-value for the test for non-normality?
# PROEID can be taken from previous question
eshift = exprs(pcl)["8000574", 7:10] - exprs(pcl)["8000574", 11:14]
shapiro.test(eshift)
##
## Shapiro-Wilk normality test
##
## data: eshift
## W = 0.89471, p-value = 0.4052
Given that the Shapiro-Wilk test does not reject a Gaussian model for this tiny dataset, we might proceed to use a paired t-test to assess the null hypothesis that siNupr1 does not shift the mean trajectory of NUPR1 expression from 0 to 9 hours in the pc cell cultures.
What is the p-value?:
t.test(eshift)
##
## One Sample t-test
##
## data: eshift
## t = 6.7499, df = 3, p-value = 0.006642
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 0.4593764 1.2789686
## sample estimates:
## mean of x
## 0.8691725
With the small sample size it may be possible to do a better job of estimating the standard error of the effect estimate by an empirical Bayes method that assesses variability present for all genes on the array
TRUE
The Gusssian assumptions of the t-test may not be satisfied by the data > FALSE
The data have been preprocessed by RMA which may eliminate some of the signal > FALSE
t-test is robust against departures from Gaussian model.
RMA: It is unlikely that the experimental perturbations would induce large-scale changes in distribution of expression.
Correct answer emphasizes importance of using all information on variability obtained in the assay.