Inference_Exercises

In general, the variability we observe across biological units, such as individuals, within a population is referred to as biological. We refer to the variability we observe across measurements of the same biological unit, such as aliquots from the same biological sample, as technical. Because newly developed measurement technologies are common in genomics, technical replicates are often generated to assess experimental data. By generating measurements from samples that are designed to be the same, we are able to measure and assess technical variability. We also use the terminology biological replicates and technical replicates to refer to samples from which we can measure biological and technical variability respectively.

It is important not to confuse biological and technical variability when performing statistical inference as the interpretation is quite different. For example, when analyzing data from technical replicates the population is just the one sample from which these come from, as opposed to the more general population, such as healthy humans or control mice. Here we explore this concept with an experiment that was designed to include both technical and biological replicates.

The dataset we will study includes data from gene expression arrays. In this experiment, RNA was extracted from 12 randomly selected mice from two strains. RNA from all 24 mice were hybridized to microarrays, but we also pooled RNA from different mice and hybridized those as well. Two such pooled samples included RNA from all 12 mice from each strain. Other pools were also created, as we will see below, but we will ignore these for this assessment.

library(devtools)
install_github("genomicsclass/maPooling")

We can see the experimental design using the pData function. The rows represent samples and the columns represent mice. A one in cell i,j indicates that RNA from mouse j was included in sample i. The strain can be identified from the row names. For e.g. rowname = a10a11a4 means sample was selected from mice a4,a10 and a11. This is not a recommended approach.

library(Biobase)

## Loading required package: BiocGenerics
## Loading required package: parallel
## 
## Attaching package: 'BiocGenerics'
## 
## The following objects are masked from 'package:parallel':
## 
##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
##     clusterExport, clusterMap, parApply, parCapply, parLapply,
##     parLapplyLB, parRapply, parSapply, parSapplyLB
## 
## The following object is masked from 'package:stats':
## 
##     xtabs
## 
## The following objects are masked from 'package:base':
## 
##     anyDuplicated, append, as.data.frame, as.vector, cbind,
##     colnames, do.call, duplicated, eval, evalq, Filter, Find, get,
##     intersect, is.unsorted, lapply, Map, mapply, match, mget,
##     order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
##     rbind, Reduce, rep.int, rownames, sapply, setdiff, sort,
##     table, tapply, union, unique, unlist, unsplit
## 
## Welcome to Bioconductor
## 
##     Vignettes contain introductory material; view with
##     'browseVignettes()'. To cite Bioconductor, see
##     'citation("Biobase")', and for packages 'citation("pkgname")'.

library(maPooling)
data(maPooling)
e = maPooling;head(pData(e))

##          a2 a3 a4 a5 a6 a7 a8 a9 a10 a11 a12 a14 b2 b3 b5 b6 b8 b9 b10 b11
## a10       0  0  0  0  0  0  0  0   1   0   0   0  0  0  0  0  0  0   0   0
## a10a11    0  0  0  0  0  0  0  0   1   1   0   0  0  0  0  0  0  0   0   0
## a10a11a4  0  0  1  0  0  0  0  0   1   1   0   0  0  0  0  0  0  0   0   0
## a11       0  0  0  0  0  0  0  0   0   1   0   0  0  0  0  0  0  0   0   0
## a12       0  0  0  0  0  0  0  0   0   0   1   0  0  0  0  0  0  0   0   0
## a12a14    0  0  0  0  0  0  0  0   0   0   1   1  0  0  0  0  0  0   0   0
##          b12 b13 b14 b15
## a10        0   0   0   0
## a10a11     0   0   0   0
## a10a11a4   0   0   0   0
## a11        0   0   0   0
## a12        0   0   0   0
## a12a14     0   0   0   0

Note that ultimately we are interested in detecting genes that are differentially expressed between the two strains of mice, which we will refer to as strain 0 and 1. We can apply tests to the technical replicates of pooled samples or the data from 12 individual mice. We can identify these pooled samples because all mice from each strain were represented in these samples and thus the sum of the rows of experimental design matrix add up to 12:

data(maPooling)
pd=pData(maPooling)
pooled=which(rowSums(pd)==12)

We also have microarray data for each individual mouse. For each strain we have 12 biological replicates. We can find them by looking for rows with just one 1. We remove some samples that were repeated so that we have biological replicates only:For e.g. a3, a3tr1, and a3tr2 has a one separate entry, though it is all a3 and it should only have one entry.

individuals=which(rowSums(pd)==1)
##remove replicates represented by "tr" in the names
individuals=individuals[-grep("tr",names(individuals))]

We can use this to create two measurement matrices representing technical replicates.

pool = exprs(maPooling)[,pooled];indiv = exprs(maPooling)[,individuals]

We can also get the mouse strain for each:

strain= ifelse(grepl("a",rownames(pData(maPooling))),0,1)
g_pool = strain[pooled]
g_indiv = strain[individuals]

We will use these objects in the questions below.

Question 4.1.1

Compute the standard deviations for each gene for the strain defined by strain==1 across technical replicates and biological replicates.

For what proportion of genes is the estimated biological variability larger than the estimated technical variability

library(genefilter)

## Creating a generic function for 'nchar' from package 'base' in package 'S4Vectors'
## 
## Attaching package: 'genefilter'
## 
## The following object is masked from 'package:base':
## 
##     anyNA

library(rafalib)
technicalsd = rowSds(pool[,g_pool==1])
biologicalsd = rowSds(indiv[,g_indiv==1])
mean(biologicalsd>technicalsd)

## [1] 0.7994725

## we can also make a plot

plot(technicalsd,biologicalsd)
abline(0,1,col=2)

LIM=range(c(technicalsd,biologicalsd))
mypar(1,1)
boxplot(technicalsd,biologicalsd,names=c("technical","biological"),ylab="standard deviation")

Question 4.1.2

For the data with technical replicates, pool compute t-test comparing strains 0 and 1 using the rowttest function in the genefilter package. Compute q-values using the qvalue package.

How many genes have q-values < 0.05 ?

library(genefilter)
library(qvalue)
tt = rowttests(pool,as.factor(g_pool))
qvals = qvalue(tt$p.value)
sum(qvals$qvalues < 0.05)

## [1] 2169

mean(qvals$qvalues < 0.05)

## [1] 0.136218

Question 4.1.3

Now we are going to validate the genes found in the previous assessment question with biological replicates. Using the biological replicate dataset, compute p-values (using rowttest) for the genes that in the previous question we found to have q-values < 0.05 (using the technical replicates).

For what proportion of these genes do we achieve the p-value from the biological replicates above 0.05?

tt = rowttests(indiv,factor(g_indiv))
qvals = qvalue(tt$p.value)
mean(qvals$qvalues < 0.05)

## [1] 0.07404384

Note that it is much larger than we expected given that we estimated an FDR of 5% in the first list. This is because the pooled data is not accounting for biological variability.

Question 4.1.4

For the biological replicate data we computed p-values and q-values using rowttests:

library(genefilter)
library(qvalue)
pvals = rowttests(indiv,factor(g_indiv))$p.value
qvals = qvalue(pvals)$qvalue

Now use the limma package to obtain p-values using the moderated t-tests provided by the ebayes function. Obtain q-vaues From these by applying the qvalue function.

What proportion of the genes with q-value < 0.05 obtained using t-tests have q-values < 0.05 when using the moderated t-test?

library(limma)

## 
## Attaching package: 'limma'
## 
## The following object is masked from 'package:BiocGenerics':
## 
##     plotMA

fit = lmFit(indiv, design = model.matrix(~g_indiv))
# perform eBayes for Heirarchical modeling
fit = eBayes(fit)
tt = topTable(fit, coef = 2, number = Inf, sort.by = "none")

library(qvalue)
pvals2 = tt$P.Value
qvals2 = qvalue(pvals2)$qvalue
sum( qvals2<0.05 & qvals<0.05)/sum(qvals<0.05)

## [1] 0.9889737

Note that we get very strong agreement because N is 12 here and thus the standard error estimates are not shrunk too much.

Question 4.2.1

What does “genes set testing” refer to? To test for differences in expression…

Testing for differences between population iin predefined sets of genes.

Question 4.2.4

Which of the following statements applies to gene set testing? The t-statistics from individual genes within a gene set:

can have positive, 0, or negative correlation.

Question 4.2.3

In lecture, we saw that the variance of the average of a set N of Normal variables with mean 0 and variance 1 is 1/N. This can be demonstrated in R with a simulation of 10,000 averages for N=10:

var(rowMeans(matrix(rnorm(10000 * 10, 0, 1), ncol = 10)))

## [1] 0.09899337

which comes close to 1/10.

Suppose we have 10 normal variables with mean 0 and variance 1, but each random variable has a correlation of 0.7 with the other variables. We can create such random variables in R using the multivariate normal distribution:

library(MASS)

## 
## Attaching package: 'MASS'
## 
## The following object is masked from 'package:genefilter':
## 
##     area

Sigma = matrix(0.7, ncol = 10, nrow = 10)
diag(Sigma) = 1
mvrnorm(n=1, mu=rep(0,10), Sigma = Sigma)

##  [1] -1.54479855 -0.93777168 -0.63150678 -0.82237133 -0.07157774
##  [6] -0.58005320 -1.47182957 -0.59879337 -1.69170824 -1.39491149

This gives 10 numbers which fit the above description. If we set n=2 we will get two rows of 10 numbers.

What is the variance of the average of these 10 random variables? (Hint: set n=10000 and take the variance of the 10,000 averages). (Hint #2: this was also given as a formula in lecture)

# using simulation:
var(rowMeans(mvrnorm(n=10000, mu=rep(0,10), Sigma = Sigma)))

## [1] 0.7391023

# using the formula from the lecture
1/10 * (1+(10-1) * 0.7)

## [1] 0.73

What is the prarical consequence of the above demonstration?

We saw in the lecrure that the variance of the average of positively-correlated Normal random variables is larger than the variance of the average of independent Normal randomvariables? If we test the average of t-statistics from 10 positively-correlated genes…

we need to widenthe null distribution

Question 4.2.4

The following two questions are advanced. Please ask in the Discussion Forum.

We will explore software for testing differential expression in a set of genes. These tests differ from the gene-by-gene tests we saw previously. The gene set testing software we will use lives in the limma package.

For this assessment we will download an experiment from the GEO website, using the getGEO function from the GEOquery package:

## Setting options('download.file.method.GEOquery'='auto')
## ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE34nnn/GSE34313/matrix/
## Found 1 file(s)
## GSE34313_series_matrix.txt.gz

## Warning in download.file(myurl, destfile, mode = mode, quiet = TRUE, method
## = getOption("download.file.method.GEOquery")): downloaded length 26060596 !
## = reported length 200

## File stored at: 
## C:\Users\Prasanth\AppData\Local\Temp\RtmpIdiXOs/GPL6480.soft

This dataset is hosted by GEO by the following link. The experiment is described in the paper Masuno 2011. Birefly, the investigation applied a glucocorticoid hormone in cultured human airway smooth muscle. The glucocorticoid hormone is used to treat asthma, as it reduces the inflammation response. However, it has many other effects throughout the different tissues of the body.

The groups are defined in the characteristics_ch1.2 variable:

e$condition = e$characteristics_ch1.2
levels(e$condition) = c("dex24","dex4","control")
table(e$conditions)

## < table of extent 0 >

## Note that the ExpressionSets we can access the phenoData columns directly. Specifically, note the e$condition is equivalent to pData$condition

This data includes GO terms for every gene. We can see these in looking at the feature annotations:

names(fData(e))

##  [1] "ID"                   "SPOT_ID"              "CONTROL_TYPE"        
##  [4] "REFSEQ"               "GB_ACC"               "GENE"                
##  [7] "GENE_SYMBOL"          "GENE_NAME"            "UNIGENE_ID"          
## [10] "ENSEMBL_ID"           "TIGR_ID"              "ACCESSION_STRING"    
## [13] "CHROMOSOMAL_LOCATION" "CYTOBAND"             "DESCRIPTION"         
## [16] "GO_ID"                "SEQUENCE"

fData(e)$GO_ID[1:4]

## [1] GO:0016020(membrane)|GO:0016021(integral to membrane)                                                                                                                                                                                                                                        
## [2] GO:0005794(Golgi apparatus)|GO:0006886(intracellular protein transport)|GO:0008565(protein transporter activity)|GO:0016020(membrane)|GO:0016192(vesicle-mediated transport)|GO:0030117(membrane coat)|GO:0030659(cytoplasmic vesicle membrane)|GO:0031410(cytoplasmic vesicle)              
## [3] GO:0001669(acrosomal vesicle)|GO:0006836(neurotransmitter transport)|GO:0016020(membrane)|GO:0016021(integral to membrane)|GO:0022857(transmembrane transporter activity)|GO:0030054(cell junction)|GO:0030672(synaptic vesicle membrane)|GO:0031410(cytoplasmic vesicle)|GO:0045202(synapse)
## [4] GO:0000166(nucleotide binding)|GO:0003676(nucleic acid binding)                                                                                                                                                                                                                              
## 13508 Levels:  ...

We will compare the control samples to those treated with dexamethasone (the hormone) after 4 hours.

lvls = c("control", "dex4")
es = e[,e$condition %in% lvls]
es$condition = factor(es$condition, levels=lvls)

The following lines run the linear model in limma. We note that the top genes are common immune-response genes (CSF2, LIF, CCL2, IL6). Also present is FKBP5, a gene which regulates and is regulated by the protein which receives the glucocorticoid hormone.

library(limma)
library(qvalue)
design = model.matrix(~ es$condition)
fit = lmFit(es, design=design)
fit = eBayes(fit)
topTable(fit)[,c(6,7,18,22)]

## Removing intercept from test coefficients

##               GENE GENE_SYMBOL     logFC    adj.P.Val
## A_23_P133408  1437        CSF2 -4.724965 6.248554e-08
## A_24_P122137  3976         LIF -6.689927 6.248554e-08
## A_32_P5276   26084    ARHGEF26  3.411468 1.744514e-07
## A_23_P89431   6347        CCL2 -3.618725 1.744514e-07
## A_23_P42257   8870        IER3 -4.808854 3.294838e-07
## A_23_P71037   3569         IL6 -4.568355 4.212726e-07
## A_24_P20327  28999       KLF15  4.013070 4.212726e-07
## A_24_P38081   2289       FKBP5  3.567059 4.212726e-07
## A_23_P213944  1839       HBEGF -2.902190 4.212726e-07
## A_24_P250922  5743       PTGS2 -3.518360 4.727798e-07

We will use the ROAST method for gene set testing. We can test a single gene set by looking up the genes which contain a certain GO ID, and providing this to the roast function.

The roast function performs an advanced statistical technique, rotation of residuals, in order to generate a sense of the null distribution for the test statistic. The test statistic in this case is the summary of the scores from each gene. The tests are self-contained because only the summary for a single set is used, whereas other gene set tests might compare a set to all the other genes in the dataset, e.g., a competitive gene set test.

The result here tells us that the immune response genes are significantly down-regulated, and additionally, mixed up and down.

# Immune response
set.seed(1)
idx = grep("GO:0006955", fData(es)$GO_ID)
length(idx)

## [1] 504

r1 = roast(es, idx, design)
r1

##          Active.Prop    P.Value
## Down      0.16269841 0.01750875
## Up        0.09325397 0.98299150
## UpOrDown  0.16269841 0.03500000
## Mixed     0.25595238 0.01300000

Set the seed at 1 (set.seed(1)) and repeat the above for the go term associated with “GO:0045454”. What is the p.value for genes in this gene set being down regulated?

set.seed(1)
idx = grep("GO:0045454", fData(es)$GO_ID)
length(idx)

## [1] 81

r1 = roast(es, idx, design)
r1

##          Active.Prop    P.Value
## Down      0.08641975 0.91195598
## Up        0.23456790 0.08854427
## UpOrDown  0.23456790 0.17700000
## Mixed     0.32098765 0.01200000

Question 4.2.5

We can also use the mroast function to perform multiple roast tests. First we need to create a list, which contains the indices of genes in the ExpressionSet for each of a number of gene sets. We will use the org.Hs.eg.db package to gather the gene set information. The following code used R to extracts and organizes GO annotations from a somewhat messy annotation provided by the manufacturer.

library(org.Hs.eg.db)

## Loading required package: AnnotationDbi
## Loading required package: stats4
## Loading required package: GenomeInfoDb
## Loading required package: S4Vectors
## Loading required package: IRanges
## 
## Attaching package: 'AnnotationDbi'
## 
## The following object is masked from 'package:MASS':
## 
##     select
## 
## Loading required package: DBI

org.Hs.egGO2EG

## GO2EG map for Human (object of class "Go3AnnDbBimap")

go2eg = as.list(org.Hs.egGO2EG)
head(go2eg)

## $`GO:0000002`
##     TAS     TAS     ISS     IMP     IMP     NAS     IMP     IDA     IBA 
##   "291"  "1890"  "4205"  "4358"  "4976"  "9361" "10000" "55186" "63875" 
##     IEA     IDA     IMP 
## "80119" "84275" "92667" 
## 
## $`GO:0000003`
##    IBA    IEP 
## "8208" "8510" 
## 
## $`GO:0000011`
##     IBA 
## "64145" 
## 
## $`GO:0000012`
##         IDA         IDA         IEA         IMP         IDA         IDA 
##      "3981"      "7141"      "7515"     "23411"     "54840"     "55775" 
##         IMP         IMP         IEA 
##     "55775"    "200558" "100133315" 
## 
## $`GO:0000018`
##     TAS     TAS     TAS     IMP     IMP     IEP 
##  "3575"  "3836"  "3838"  "9984" "10189" "56916" 
## 
## $`GO:0000019`
##     TAS     IDA 
##  "4361" "10111"

The following code unlists the list, then gets matches for each Entrez gene ID to the index in the ExpressionSet. Finally, we rebuild the list.

govector = unlist(go2eg)
golengths = sapply(go2eg, length)
head(fData(es)$GENE)

## [1] "400451" "10239"  "9899"   "348093" "57099"  "146050"

idxvector = match(govector, fData(es)$GENE);table(is.na(idxvector))

## 
##  FALSE   TRUE 
## 239596   8306

##This is the organized list of indexes for genes per GO term:
idx = split(idxvector, rep(names(go2eg), golengths));head(go2eg)

## $`GO:0000002`
##     TAS     TAS     ISS     IMP     IMP     NAS     IMP     IDA     IBA 
##   "291"  "1890"  "4205"  "4358"  "4976"  "9361" "10000" "55186" "63875" 
##     IEA     IDA     IMP 
## "80119" "84275" "92667" 
## 
## $`GO:0000003`
##    IBA    IEP 
## "8208" "8510" 
## 
## $`GO:0000011`
##     IBA 
## "64145" 
## 
## $`GO:0000012`
##         IDA         IDA         IEA         IMP         IDA         IDA 
##      "3981"      "7141"      "7515"     "23411"     "54840"     "55775" 
##         IMP         IMP         IEA 
##     "55775"    "200558" "100133315" 
## 
## $`GO:0000018`
##     TAS     TAS     TAS     IMP     IMP     IEP 
##  "3575"  "3836"  "3838"  "9984" "10189" "56916" 
## 
## $`GO:0000019`
##     TAS     IDA 
##  "4361" "10111"

##We can see the genes like this:
go2eg[[1]]

##     TAS     TAS     ISS     IMP     IMP     NAS     IMP     IDA     IBA 
##   "291"  "1890"  "4205"  "4358"  "4976"  "9361" "10000" "55186" "63875" 
##     IEA     IDA     IMP 
## "80119" "84275" "92667"

fData(es)$GENE[idx[[1]]]

##  [1] "291"   "1890"  "4205"  "4358"  "4976"  "9361"  "10000" "55186"
##  [9] "63875" "80119" "84275" "92667"

We need to clean this list such that there are no \(NA\) values. We also clean it to remove gene sets which have less than 10 genes.

idxclean = lapply(idx, function(x) x[!is.na(x)])
idxlengths = sapply(idxclean, length)
idxsub = idxclean[idxlengths > 10]
length(idxsub)

## [1] 3090

With these indexes in place we are now ready to run mroast. The following line of code runs the multiple ROAST test. This can take about 3 minutes.

set.seed(1)
r2 = mroast(es, idxsub, design)
head(r2)

##            NGenes  PropDown     PropUp Direction PValue         FDR
## GO:0005125    174 0.2758621 0.04597701      Down  0.001 0.009716981
## GO:0008083    167 0.2874251 0.07784431      Down  0.001 0.009716981
## GO:0070098     65 0.2461538 0.06153846      Down  0.001 0.009716981
## GO:0043433     63 0.2698413 0.12698413      Down  0.001 0.009716981
## GO:0042102     54 0.1666667 0.09259259      Down  0.001 0.009716981
## GO:0006959     53 0.2452830 0.09433962      Down  0.001 0.009716981
##            PValue.Mixed   FDR.Mixed
## GO:0005125        0.001 0.002566445
## GO:0008083        0.001 0.002566445
## GO:0070098        0.001 0.002566445
## GO:0043433        0.001 0.002566445
## GO:0042102        0.001 0.002566445
## GO:0006959        0.001 0.002566445

Find the GO term that has the the largest proportion of upregulated genes. How many genes does this term have?

r2[which.max(r2$PropDown),1]

## [1] 15

Note that this is a relatively small gene set. In fact, we can see that there is much more variability in this proportion for smaller gene sets:

plot(log2(r2$NGenes),r2$PropUp)

plot(log2(r2$NGenes),-log10(r2$PValue))

Question 4.2.6

For the data in the previous question, rerun mroast using only gene sets with 50 or more genes.

In this new analysis, what is the GO term has the largest proportion of upregulated genes (answer should be in this form GO:000001)?

idxsub = idxclean[idxlengths > 50]
length(idxsub)

## [1] 707

r2 = mroast(es, idxsub, design)
head(r2)

##            NGenes  PropDown     PropUp Direction PValue        FDR
## GO:0005125    174 0.2758621 0.04597701      Down  0.001 0.05891667
## GO:0008083    167 0.2874251 0.07784431      Down  0.001 0.05891667
## GO:0071222     86 0.2441860 0.16279070      Down  0.001 0.05891667
## GO:0043433     63 0.2698413 0.12698413      Down  0.001 0.05891667
## GO:0006959     53 0.2452830 0.09433962      Down  0.001 0.05891667
## GO:0070098     65 0.2461538 0.06153846      Down  0.001 0.05891667
##            PValue.Mixed   FDR.Mixed
## GO:0005125        0.001 0.002472028
## GO:0008083        0.001 0.002472028
## GO:0071222        0.001 0.002472028
## GO:0043433        0.001 0.002472028
## GO:0006959        0.001 0.002472028
## GO:0070098        0.002 0.006427273

r2[which.max(r2$PropUp),]

##            NGenes   PropDown    PropUp Direction PValue       FDR
## GO:0000776     69 0.05797101 0.3768116        Up  0.028 0.4521512
##            PValue.Mixed  FDR.Mixed
## GO:0000776        0.016 0.04454675

Question 4.2.7

Use select and the Go.db annotation package to findthe TERM for GO:0000776

#source("https://bioconductor.org/biocLite.R")
#biocLite("GO.db")
library(GO.db)

##

columns(GO.db)

## [1] "GOID"       "TERM"       "ONTOLOGY"   "DEFINITION"

keytypes(GO.db)

## [1] "GOID"       "TERM"       "ONTOLOGY"   "DEFINITION"

select(GO.db, keys="GO:0000776",columns="TERM")

##         GOID        TERM
## 1 GO:0000776 kinetochore

# there is another function simply uses

GOTERM[["GO:0000776"]]

## GOID: GO:0000776
## Term: kinetochore
## Ontology: CC
## Definition: A multisubunit complex that is located at the
##     centromeric region of DNA and provides an attachment point for
##     the spindle microtubules.
## Synonym: GO:0005699
## Secondary: GO:0005699

Inference_Exercises

Prasanth

September 9, 2015

Question 4.1.1

Question 4.1.2

Question 4.1.3

Question 4.1.4

Question 4.2.1

Question 4.2.4

Question 4.2.3

Question 4.2.4

Question 4.2.5

Question 4.2.6

Question 4.2.7