In general, the variability we observe across biological units, such as individuals, within a population is referred to as biological. We refer to the variability we observe across measurements of the same biological unit, such as aliquots from the same biological sample, as technical. Because newly developed measurement technologies are common in genomics, technical replicates are often generated to assess experimental data. By generating measurements from samples that are designed to be the same, we are able to measure and assess technical variability. We also use the terminology biological replicates and technical replicates to refer to samples from which we can measure biological and technical variability respectively.
It is important not to confuse biological and technical variability when performing statistical inference as the interpretation is quite different. For example, when analyzing data from technical replicates the population is just the one sample from which these come from, as opposed to the more general population, such as healthy humans or control mice. Here we explore this concept with an experiment that was designed to include both technical and biological replicates.
The dataset we will study includes data from gene expression arrays. In this experiment, RNA was extracted from 12 randomly selected mice from two strains. RNA from all 24 mice were hybridized to microarrays, but we also pooled RNA from different mice and hybridized those as well. Two such pooled samples included RNA from all 12 mice from each strain. Other pools were also created, as we will see below, but we will ignore these for this assessment.
library(devtools)
install_github("genomicsclass/maPooling")
We can see the experimental design using the pData function. The rows represent samples and the columns represent mice. A one in cell i,j indicates that RNA from mouse j was included in sample i. The strain can be identified from the row names. For e.g. rowname = a10a11a4 means sample was selected from mice a4,a10 and a11. This is not a recommended approach.
library(Biobase)
## Loading required package: BiocGenerics
## Loading required package: parallel
##
## Attaching package: 'BiocGenerics'
##
## The following objects are masked from 'package:parallel':
##
## clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
## clusterExport, clusterMap, parApply, parCapply, parLapply,
## parLapplyLB, parRapply, parSapply, parSapplyLB
##
## The following object is masked from 'package:stats':
##
## xtabs
##
## The following objects are masked from 'package:base':
##
## anyDuplicated, append, as.data.frame, as.vector, cbind,
## colnames, do.call, duplicated, eval, evalq, Filter, Find, get,
## intersect, is.unsorted, lapply, Map, mapply, match, mget,
## order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
## rbind, Reduce, rep.int, rownames, sapply, setdiff, sort,
## table, tapply, union, unique, unlist, unsplit
##
## Welcome to Bioconductor
##
## Vignettes contain introductory material; view with
## 'browseVignettes()'. To cite Bioconductor, see
## 'citation("Biobase")', and for packages 'citation("pkgname")'.
library(maPooling)
data(maPooling)
e = maPooling;head(pData(e))
## a2 a3 a4 a5 a6 a7 a8 a9 a10 a11 a12 a14 b2 b3 b5 b6 b8 b9 b10 b11
## a10 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
## a10a11 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0
## a10a11a4 0 0 1 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0
## a11 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
## a12 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0
## a12a14 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0
## b12 b13 b14 b15
## a10 0 0 0 0
## a10a11 0 0 0 0
## a10a11a4 0 0 0 0
## a11 0 0 0 0
## a12 0 0 0 0
## a12a14 0 0 0 0
Note that ultimately we are interested in detecting genes that are differentially expressed between the two strains of mice, which we will refer to as strain 0 and 1. We can apply tests to the technical replicates of pooled samples or the data from 12 individual mice. We can identify these pooled samples because all mice from each strain were represented in these samples and thus the sum of the rows of experimental design matrix add up to 12:
data(maPooling)
pd=pData(maPooling)
pooled=which(rowSums(pd)==12)
We also have microarray data for each individual mouse. For each strain we have 12 biological replicates. We can find them by looking for rows with just one 1. We remove some samples that were repeated so that we have biological replicates only:For e.g. a3, a3tr1, and a3tr2 has a one separate entry, though it is all a3 and it should only have one entry.
individuals=which(rowSums(pd)==1)
##remove replicates represented by "tr" in the names
individuals=individuals[-grep("tr",names(individuals))]
We can use this to create two measurement matrices representing technical replicates.
pool = exprs(maPooling)[,pooled];indiv = exprs(maPooling)[,individuals]
We can also get the mouse strain for each:
strain= ifelse(grepl("a",rownames(pData(maPooling))),0,1)
g_pool = strain[pooled]
g_indiv = strain[individuals]
We will use these objects in the questions below.
Compute the standard deviations for each gene for the strain defined by strain==1 across technical replicates and biological replicates.
For what proportion of genes is the estimated biological variability larger than the estimated technical variability
library(genefilter)
## Creating a generic function for 'nchar' from package 'base' in package 'S4Vectors'
##
## Attaching package: 'genefilter'
##
## The following object is masked from 'package:base':
##
## anyNA
library(rafalib)
technicalsd = rowSds(pool[,g_pool==1])
biologicalsd = rowSds(indiv[,g_indiv==1])
mean(biologicalsd>technicalsd)
## [1] 0.7994725
## we can also make a plot
plot(technicalsd,biologicalsd)
abline(0,1,col=2)
LIM=range(c(technicalsd,biologicalsd))
mypar(1,1)
boxplot(technicalsd,biologicalsd,names=c("technical","biological"),ylab="standard deviation")
For the data with technical replicates, pool compute t-test comparing strains 0 and 1 using the rowttest function in the genefilter package. Compute q-values using the qvalue package.
How many genes have q-values < 0.05 ?
library(genefilter)
library(qvalue)
tt = rowttests(pool,as.factor(g_pool))
qvals = qvalue(tt$p.value)
sum(qvals$qvalues < 0.05)
## [1] 2169
mean(qvals$qvalues < 0.05)
## [1] 0.136218
Now we are going to validate the genes found in the previous assessment question with biological replicates. Using the biological replicate dataset, compute p-values (using rowttest) for the genes that in the previous question we found to have q-values < 0.05 (using the technical replicates).
For what proportion of these genes do we achieve the p-value from the biological replicates above 0.05?
tt = rowttests(indiv,factor(g_indiv))
qvals = qvalue(tt$p.value)
mean(qvals$qvalues < 0.05)
## [1] 0.07404384
Note that it is much larger than we expected given that we estimated an FDR of 5% in the first list. This is because the pooled data is not accounting for biological variability.
For the biological replicate data we computed p-values and q-values using rowttests:
library(genefilter)
library(qvalue)
pvals = rowttests(indiv,factor(g_indiv))$p.value
qvals = qvalue(pvals)$qvalue
Now use the limma package to obtain p-values using the moderated t-tests provided by the ebayes function. Obtain q-vaues From these by applying the qvalue function.
What proportion of the genes with q-value < 0.05 obtained using t-tests have q-values < 0.05 when using the moderated t-test?
library(limma)
##
## Attaching package: 'limma'
##
## The following object is masked from 'package:BiocGenerics':
##
## plotMA
fit = lmFit(indiv, design = model.matrix(~g_indiv))
# perform eBayes for Heirarchical modeling
fit = eBayes(fit)
tt = topTable(fit, coef = 2, number = Inf, sort.by = "none")
library(qvalue)
pvals2 = tt$P.Value
qvals2 = qvalue(pvals2)$qvalue
sum( qvals2<0.05 & qvals<0.05)/sum(qvals<0.05)
## [1] 0.9889737
Note that we get very strong agreement because N is 12 here and thus the standard error estimates are not shrunk too much.
What does “genes set testing” refer to? To test for differences in expression…
Testing for differences between population iin predefined sets of genes.
Which of the following statements applies to gene set testing? The t-statistics from individual genes within a gene set:
can have positive, 0, or negative correlation.
In lecture, we saw that the variance of the average of a set N of Normal variables with mean 0 and variance 1 is 1/N. This can be demonstrated in R with a simulation of 10,000 averages for N=10:
var(rowMeans(matrix(rnorm(10000 * 10, 0, 1), ncol = 10)))
## [1] 0.09899337
which comes close to 1/10.
Suppose we have 10 normal variables with mean 0 and variance 1, but each random variable has a correlation of 0.7 with the other variables. We can create such random variables in R using the multivariate normal distribution:
library(MASS)
##
## Attaching package: 'MASS'
##
## The following object is masked from 'package:genefilter':
##
## area
Sigma = matrix(0.7, ncol = 10, nrow = 10)
diag(Sigma) = 1
mvrnorm(n=1, mu=rep(0,10), Sigma = Sigma)
## [1] -1.54479855 -0.93777168 -0.63150678 -0.82237133 -0.07157774
## [6] -0.58005320 -1.47182957 -0.59879337 -1.69170824 -1.39491149
This gives 10 numbers which fit the above description. If we set n=2 we will get two rows of 10 numbers.
What is the variance of the average of these 10 random variables? (Hint: set n=10000 and take the variance of the 10,000 averages). (Hint #2: this was also given as a formula in lecture)
# using simulation:
var(rowMeans(mvrnorm(n=10000, mu=rep(0,10), Sigma = Sigma)))
## [1] 0.7391023
# using the formula from the lecture
1/10 * (1+(10-1) * 0.7)
## [1] 0.73
What is the prarical consequence of the above demonstration?
We saw in the lecrure that the variance of the average of positively-correlated Normal random variables is larger than the variance of the average of independent Normal randomvariables? If we test the average of t-statistics from 10 positively-correlated genes…
we need to widenthe null distribution
The following two questions are advanced. Please ask in the Discussion Forum.
We will explore software for testing differential expression in a set of genes. These tests differ from the gene-by-gene tests we saw previously. The gene set testing software we will use lives in the limma package.
For this assessment we will download an experiment from the GEO website, using the getGEO function from the GEOquery package:
## Setting options('download.file.method.GEOquery'='auto')
## ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE34nnn/GSE34313/matrix/
## Found 1 file(s)
## GSE34313_series_matrix.txt.gz
## Warning in download.file(myurl, destfile, mode = mode, quiet = TRUE, method
## = getOption("download.file.method.GEOquery")): downloaded length 26060596 !
## = reported length 200
## File stored at:
## C:\Users\Prasanth\AppData\Local\Temp\RtmpIdiXOs/GPL6480.soft
This dataset is hosted by GEO by the following link. The experiment is described in the paper Masuno 2011. Birefly, the investigation applied a glucocorticoid hormone in cultured human airway smooth muscle. The glucocorticoid hormone is used to treat asthma, as it reduces the inflammation response. However, it has many other effects throughout the different tissues of the body.
The groups are defined in the characteristics_ch1.2 variable:
e$condition = e$characteristics_ch1.2
levels(e$condition) = c("dex24","dex4","control")
table(e$conditions)
## < table of extent 0 >
## Note that the ExpressionSets we can access the phenoData columns directly. Specifically, note the e$condition is equivalent to pData$condition
This data includes GO terms for every gene. We can see these in looking at the feature annotations:
names(fData(e))
## [1] "ID" "SPOT_ID" "CONTROL_TYPE"
## [4] "REFSEQ" "GB_ACC" "GENE"
## [7] "GENE_SYMBOL" "GENE_NAME" "UNIGENE_ID"
## [10] "ENSEMBL_ID" "TIGR_ID" "ACCESSION_STRING"
## [13] "CHROMOSOMAL_LOCATION" "CYTOBAND" "DESCRIPTION"
## [16] "GO_ID" "SEQUENCE"
fData(e)$GO_ID[1:4]
## [1] GO:0016020(membrane)|GO:0016021(integral to membrane)
## [2] GO:0005794(Golgi apparatus)|GO:0006886(intracellular protein transport)|GO:0008565(protein transporter activity)|GO:0016020(membrane)|GO:0016192(vesicle-mediated transport)|GO:0030117(membrane coat)|GO:0030659(cytoplasmic vesicle membrane)|GO:0031410(cytoplasmic vesicle)
## [3] GO:0001669(acrosomal vesicle)|GO:0006836(neurotransmitter transport)|GO:0016020(membrane)|GO:0016021(integral to membrane)|GO:0022857(transmembrane transporter activity)|GO:0030054(cell junction)|GO:0030672(synaptic vesicle membrane)|GO:0031410(cytoplasmic vesicle)|GO:0045202(synapse)
## [4] GO:0000166(nucleotide binding)|GO:0003676(nucleic acid binding)
## 13508 Levels: ...
We will compare the control samples to those treated with dexamethasone (the hormone) after 4 hours.
lvls = c("control", "dex4")
es = e[,e$condition %in% lvls]
es$condition = factor(es$condition, levels=lvls)
The following lines run the linear model in limma. We note that the top genes are common immune-response genes (CSF2, LIF, CCL2, IL6). Also present is FKBP5, a gene which regulates and is regulated by the protein which receives the glucocorticoid hormone.
library(limma)
library(qvalue)
design = model.matrix(~ es$condition)
fit = lmFit(es, design=design)
fit = eBayes(fit)
topTable(fit)[,c(6,7,18,22)]
## Removing intercept from test coefficients
## GENE GENE_SYMBOL logFC adj.P.Val
## A_23_P133408 1437 CSF2 -4.724965 6.248554e-08
## A_24_P122137 3976 LIF -6.689927 6.248554e-08
## A_32_P5276 26084 ARHGEF26 3.411468 1.744514e-07
## A_23_P89431 6347 CCL2 -3.618725 1.744514e-07
## A_23_P42257 8870 IER3 -4.808854 3.294838e-07
## A_23_P71037 3569 IL6 -4.568355 4.212726e-07
## A_24_P20327 28999 KLF15 4.013070 4.212726e-07
## A_24_P38081 2289 FKBP5 3.567059 4.212726e-07
## A_23_P213944 1839 HBEGF -2.902190 4.212726e-07
## A_24_P250922 5743 PTGS2 -3.518360 4.727798e-07
We will use the ROAST method for gene set testing. We can test a single gene set by looking up the genes which contain a certain GO ID, and providing this to the roast function.
The roast function performs an advanced statistical technique, rotation of residuals, in order to generate a sense of the null distribution for the test statistic. The test statistic in this case is the summary of the scores from each gene. The tests are self-contained because only the summary for a single set is used, whereas other gene set tests might compare a set to all the other genes in the dataset, e.g., a competitive gene set test.
The result here tells us that the immune response genes are significantly down-regulated, and additionally, mixed up and down.
# Immune response
set.seed(1)
idx = grep("GO:0006955", fData(es)$GO_ID)
length(idx)
## [1] 504
r1 = roast(es, idx, design)
r1
## Active.Prop P.Value
## Down 0.16269841 0.01750875
## Up 0.09325397 0.98299150
## UpOrDown 0.16269841 0.03500000
## Mixed 0.25595238 0.01300000
Set the seed at 1 (set.seed(1)) and repeat the above for the go term associated with “GO:0045454”. What is the p.value for genes in this gene set being down regulated?
set.seed(1)
idx = grep("GO:0045454", fData(es)$GO_ID)
length(idx)
## [1] 81
r1 = roast(es, idx, design)
r1
## Active.Prop P.Value
## Down 0.08641975 0.91195598
## Up 0.23456790 0.08854427
## UpOrDown 0.23456790 0.17700000
## Mixed 0.32098765 0.01200000
We can also use the mroast function to perform multiple roast tests. First we need to create a list, which contains the indices of genes in the ExpressionSet for each of a number of gene sets. We will use the org.Hs.eg.db package to gather the gene set information. The following code used R to extracts and organizes GO annotations from a somewhat messy annotation provided by the manufacturer.
library(org.Hs.eg.db)
## Loading required package: AnnotationDbi
## Loading required package: stats4
## Loading required package: GenomeInfoDb
## Loading required package: S4Vectors
## Loading required package: IRanges
##
## Attaching package: 'AnnotationDbi'
##
## The following object is masked from 'package:MASS':
##
## select
##
## Loading required package: DBI
org.Hs.egGO2EG
## GO2EG map for Human (object of class "Go3AnnDbBimap")
go2eg = as.list(org.Hs.egGO2EG)
head(go2eg)
## $`GO:0000002`
## TAS TAS ISS IMP IMP NAS IMP IDA IBA
## "291" "1890" "4205" "4358" "4976" "9361" "10000" "55186" "63875"
## IEA IDA IMP
## "80119" "84275" "92667"
##
## $`GO:0000003`
## IBA IEP
## "8208" "8510"
##
## $`GO:0000011`
## IBA
## "64145"
##
## $`GO:0000012`
## IDA IDA IEA IMP IDA IDA
## "3981" "7141" "7515" "23411" "54840" "55775"
## IMP IMP IEA
## "55775" "200558" "100133315"
##
## $`GO:0000018`
## TAS TAS TAS IMP IMP IEP
## "3575" "3836" "3838" "9984" "10189" "56916"
##
## $`GO:0000019`
## TAS IDA
## "4361" "10111"
The following code unlists the list, then gets matches for each Entrez gene ID to the index in the ExpressionSet. Finally, we rebuild the list.
govector = unlist(go2eg)
golengths = sapply(go2eg, length)
head(fData(es)$GENE)
## [1] "400451" "10239" "9899" "348093" "57099" "146050"
idxvector = match(govector, fData(es)$GENE);table(is.na(idxvector))
##
## FALSE TRUE
## 239596 8306
##This is the organized list of indexes for genes per GO term:
idx = split(idxvector, rep(names(go2eg), golengths));head(go2eg)
## $`GO:0000002`
## TAS TAS ISS IMP IMP NAS IMP IDA IBA
## "291" "1890" "4205" "4358" "4976" "9361" "10000" "55186" "63875"
## IEA IDA IMP
## "80119" "84275" "92667"
##
## $`GO:0000003`
## IBA IEP
## "8208" "8510"
##
## $`GO:0000011`
## IBA
## "64145"
##
## $`GO:0000012`
## IDA IDA IEA IMP IDA IDA
## "3981" "7141" "7515" "23411" "54840" "55775"
## IMP IMP IEA
## "55775" "200558" "100133315"
##
## $`GO:0000018`
## TAS TAS TAS IMP IMP IEP
## "3575" "3836" "3838" "9984" "10189" "56916"
##
## $`GO:0000019`
## TAS IDA
## "4361" "10111"
##We can see the genes like this:
go2eg[[1]]
## TAS TAS ISS IMP IMP NAS IMP IDA IBA
## "291" "1890" "4205" "4358" "4976" "9361" "10000" "55186" "63875"
## IEA IDA IMP
## "80119" "84275" "92667"
fData(es)$GENE[idx[[1]]]
## [1] "291" "1890" "4205" "4358" "4976" "9361" "10000" "55186"
## [9] "63875" "80119" "84275" "92667"
We need to clean this list such that there are no \(NA\) values. We also clean it to remove gene sets which have less than 10 genes.
idxclean = lapply(idx, function(x) x[!is.na(x)])
idxlengths = sapply(idxclean, length)
idxsub = idxclean[idxlengths > 10]
length(idxsub)
## [1] 3090
With these indexes in place we are now ready to run mroast. The following line of code runs the multiple ROAST test. This can take about 3 minutes.
set.seed(1)
r2 = mroast(es, idxsub, design)
head(r2)
## NGenes PropDown PropUp Direction PValue FDR
## GO:0005125 174 0.2758621 0.04597701 Down 0.001 0.009716981
## GO:0008083 167 0.2874251 0.07784431 Down 0.001 0.009716981
## GO:0070098 65 0.2461538 0.06153846 Down 0.001 0.009716981
## GO:0043433 63 0.2698413 0.12698413 Down 0.001 0.009716981
## GO:0042102 54 0.1666667 0.09259259 Down 0.001 0.009716981
## GO:0006959 53 0.2452830 0.09433962 Down 0.001 0.009716981
## PValue.Mixed FDR.Mixed
## GO:0005125 0.001 0.002566445
## GO:0008083 0.001 0.002566445
## GO:0070098 0.001 0.002566445
## GO:0043433 0.001 0.002566445
## GO:0042102 0.001 0.002566445
## GO:0006959 0.001 0.002566445
Find the GO term that has the the largest proportion of upregulated genes. How many genes does this term have?
r2[which.max(r2$PropDown),1]
## [1] 15
Note that this is a relatively small gene set. In fact, we can see that there is much more variability in this proportion for smaller gene sets:
plot(log2(r2$NGenes),r2$PropUp)
plot(log2(r2$NGenes),-log10(r2$PValue))
For the data in the previous question, rerun mroast using only gene sets with 50 or more genes.
In this new analysis, what is the GO term has the largest proportion of upregulated genes (answer should be in this form GO:000001)?
idxsub = idxclean[idxlengths > 50]
length(idxsub)
## [1] 707
r2 = mroast(es, idxsub, design)
head(r2)
## NGenes PropDown PropUp Direction PValue FDR
## GO:0005125 174 0.2758621 0.04597701 Down 0.001 0.05891667
## GO:0008083 167 0.2874251 0.07784431 Down 0.001 0.05891667
## GO:0071222 86 0.2441860 0.16279070 Down 0.001 0.05891667
## GO:0043433 63 0.2698413 0.12698413 Down 0.001 0.05891667
## GO:0006959 53 0.2452830 0.09433962 Down 0.001 0.05891667
## GO:0070098 65 0.2461538 0.06153846 Down 0.001 0.05891667
## PValue.Mixed FDR.Mixed
## GO:0005125 0.001 0.002472028
## GO:0008083 0.001 0.002472028
## GO:0071222 0.001 0.002472028
## GO:0043433 0.001 0.002472028
## GO:0006959 0.001 0.002472028
## GO:0070098 0.002 0.006427273
r2[which.max(r2$PropUp),]
## NGenes PropDown PropUp Direction PValue FDR
## GO:0000776 69 0.05797101 0.3768116 Up 0.028 0.4521512
## PValue.Mixed FDR.Mixed
## GO:0000776 0.016 0.04454675
Use select and the Go.db annotation package to findthe TERM for GO:0000776
#source("https://bioconductor.org/biocLite.R")
#biocLite("GO.db")
library(GO.db)
##
columns(GO.db)
## [1] "GOID" "TERM" "ONTOLOGY" "DEFINITION"
keytypes(GO.db)
## [1] "GOID" "TERM" "ONTOLOGY" "DEFINITION"
select(GO.db, keys="GO:0000776",columns="TERM")
## GOID TERM
## 1 GO:0000776 kinetochore
# there is another function simply uses
GOTERM[["GO:0000776"]]
## GOID: GO:0000776
## Term: kinetochore
## Ontology: CC
## Definition: A multisubunit complex that is located at the
## centromeric region of DNA and provides an attachment point for
## the spindle microtubules.
## Synonym: GO:0005699
## Secondary: GO:0005699