Exercises

In order to explain the different methods of normalization and their problems with large array of data, which are assumed to have

most genes are not differentially expressed across conditions.

the distribution of genes across the samples are roughly the same

Question 3.8.1

Two sets of plot showing probability density and cumulative density of a vector of gene expression for a single sample and an average of sorted gene expression vectors.

Which of the normalization method (among Loess normalization, Quantile normalization or Variance stabilizing normalizations) would turn a value of 60 into 45 as we go from single to the average of multiple genes.

Quantile normalization

In the rest of this assessment we will use the spikeIn experiment again. Note that these arrays are considered replicated except for the spiked-in genes (the only ones what should be changing)

library(genefilter)
## Creating a generic function for 'nchar' from package 'base' in package 'S4Vectors'
## 
## Attaching package: 'genefilter'
## 
## The following object is masked from 'package:base':
## 
##     anyNA
library(SpikeIn)
## Loading required package: affy
## Loading required package: BiocGenerics
## Loading required package: parallel
## 
## Attaching package: 'BiocGenerics'
## 
## The following objects are masked from 'package:parallel':
## 
##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
##     clusterExport, clusterMap, parApply, parCapply, parLapply,
##     parLapplyLB, parRapply, parSapply, parSapplyLB
## 
## The following object is masked from 'package:stats':
## 
##     xtabs
## 
## The following objects are masked from 'package:base':
## 
##     anyDuplicated, append, as.data.frame, as.vector, cbind,
##     colnames, do.call, duplicated, eval, evalq, Filter, Find, get,
##     intersect, is.unsorted, lapply, Map, mapply, match, mget,
##     order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
##     rbind, Reduce, rep.int, rownames, sapply, setdiff, sort,
##     table, tapply, union, unique, unlist, unsplit
## 
## Loading required package: Biobase
## Welcome to Bioconductor
## 
##     Vignettes contain introductory material; view with
##     'browseVignettes()'. To cite Bioconductor, see
##     'citation("Biobase")', and for packages 'citation("pkgname")'.
data("SpikeIn95")
siNames = colnames(pData(SpikeIn95))
spikeinIndex = which(probeNames(SpikeIn95) %in% siNames)
## 

The spiked-in probes should change from array to array with the exception of the last eight arrays:

pData(SpikeIn95)[52:59,]
##                 37777_at 684_at 1597_at 38734_at 39058_at 36311_at
## 2353m99hpp_av08      512   1024    0.00     0.25      0.5        1
## 2353n99hpp_av08      512   1024    0.00     0.25      0.5        1
## 2353o99hpp_av08      512   1024    0.00     0.25      0.5        1
## 2353p99hpp_av08      512   1024    0.00     0.25      0.5        1
## 2353q99hpp_av08     1024      0    0.25     0.50      1.0        2
## 2353r99hpp_av08     1024      0    0.25     0.50      1.0        2
## 2353s99hpp_av08     1024      0    0.25     0.50      1.0        2
## 2353t99hpp_av08     1024      0    0.25     0.50      1.0        2
##                 36889_at 1024_at 36202_at 36085_at 40322_at 407_at 1091_at
## 2353m99hpp_av08        2       4        8       16       32    512     128
## 2353n99hpp_av08        2       4        8       16       32    512     128
## 2353o99hpp_av08        2       4        8       16       32    512     128
## 2353p99hpp_av08        2       4        8       16       32    512     128
## 2353q99hpp_av08        4       8       16       32       64   1024     256
## 2353r99hpp_av08        4       8       16       32       64   1024     256
## 2353s99hpp_av08        4       8       16       32       64   1024     256
## 2353t99hpp_av08        4       8       16       32       64   1024     256
##                 1708_at 33818_at 546_at
## 2353m99hpp_av08     256       64      8
## 2353n99hpp_av08     256       64      8
## 2353o99hpp_av08     256       64      8
## 2353p99hpp_av08     256       64      8
## 2353q99hpp_av08     512      128     16
## 2353r99hpp_av08     512      128     16
## 2353s99hpp_av08     512      128     16
## 2353t99hpp_av08     512      128     16

So, for the purpose of this assessment, we will remove the arrays in which we have repeated spike-in concentrations:

arrayIndex = c(1:51,52,56)
pm = pm(SpikeIn95)[,arrayIndex]

We can see the need for normalization by looking at boxplots:

boxplot(log2( pm ),range=0)

Make sure to use the matrix pm as defined above for the next two questions.

Question 3.8.2

Compute the across sample log (base 2) scale SD for each probe in pm.

What is the median SD for the non-spiked in genes (use the index spikeinIndexto exclude spike-ins)?

sds=rowSds(log2(pm))
median(sds[-spikeinIndex])
## [1] 0.1990065

what is the median SD fir the spiked-in genes?

median(sds[spikeinIndex])
## [1] 1.820172

Note that the technology does appear to be detecting the spike-in and distinguishing them from the rest:

boxplot(sds[-spikeinIndex],sds[spikeinIndex],range=0)

But we know there is room for improvement.

Question 3.8.3

Note that the first summary in the previous question relates to specificity: we want measurements across replicate arrays to be similar and thus the variability low. However, because we designed the experiment for concentrations in the spike-ins to vary across samples, we want the concentration of spike-in genes to change so that variability should be higher. A successful normalization approach will improve specificity (lower the SD of non-spiked-in genes) without affecting sensitivity (leave SD of spiked-in genes the same).

Now use the normalize.quantiles in the preprocessCore package to normalize all the probes together (make sure to normalize in the original scale)

Now recalculate the median SDs as in the previous assessment question. Remember to use the same pm

arrayIndex = c(1:51,52,56)
pm = pm(SpikeIn95)[,arrayIndex]

BiocInstaller::biocLite("preprocessCore")

library(preprocessCore)
npm = normalize.quantiles(pm)

What is the median SD for the non-spiked in genes now?

nsds=rowSds(log2(npm))
median(nsds[-spikeinIndex])
## [1] 0.1274938

We can see the advantages of normalizing by noting the large drop in variability in the non-spiked in genes:

boxplot(sds[-spikeinIndex],nsds[-spikeinIndex],range=0,ylim=c(0,2.5))

Now do we achieve this improvement by sacrificing sensitivity? To check for this we see if the variability across spiked-in genes remains high.

Question 3.8.4

What is the median SD for the spiked in genes after normalization?

median(nsds[spikeinIndex])
## [1] 1.813862

We can see that the SD goes down only a small fraction. These boxplots show that for most probes, the spike-in variability remaining high.

Question 3.8.5

As we discussed in the videos, normalization techniques such as quantile normalization are not always appropriate. An example dataset comes from the dataset described in the paper Loven et al. (2012) described in the videos and included in this package:

library(devtools)
install_github("stephaniehicks/mycAffyData")

Alternatively you can download directly from github here.

You will need to install this package as well:

BiocInstaller::biocLite("primeviewcdf")

BiocInstaller::biocLite("SQN")

We discussed the SQN approach in the videos and you can read more about it here:

?SQN

Note that we need control genes to run this function. This datasets has such controls and you can extract them like this:

library(mycAffyData)
data(mycData)
erccIndex=grep("ERCC",probeNames(mycData))
## 
## 
## Attaching package: 'primeviewcdf'
## 
## The following objects are masked from 'package:hgu95acdf':
## 
##     i2xy, xy2i

The ERCC stands for the external RNA control consortium, which is trying to define standards for high throughput technologies.

Let’s consider three approaches: 1) no normalization, 2) quantile normalization, 3) sqn normalization. We will put them in a list:

library(preprocessCore)
library(SQN)
## Loading required package: mclust
## Package 'mclust' version 5.0.2
## Type 'citation("mclust")' for citing this R package in publications.
## Loading required package: nor1mix
pms = pm(mycData)
pms = list( log2( pms ),
            log2( normalize.quantiles( pms )),
            SQN( log2(pms),ctrl.id=erccIndex ))
## Using 'sigma' instead 'sig2' (= sigma^2) is preferred now
names(pms)=c("none","qn","sqn")

This is an experiment in which we expect most genes to be expressed higher in samples 3,4 compared to 1,2. The control genes should not change. To explore which normalization technique worked best, compute the difference between samples 1 and 3:

M = sapply(pms,function(y){
  y[,3]-y[,1]
})

This is an experiment in which we expect most genes to be expressed higher in samples 3,4 compared to 1,2. The control genes should not change.

Make boxplots comparing these differences for control genes and the rest for all three techniques.

library(rafalib)
mypar(2,1)
boxplot(M[erccIndex,],range=0,ylim=c(-2,2))
abline(h=0,lty=2)
boxplot(M[-erccIndex,],range=0,ylim=c(-2,2))
abline(h=0,lty=2)

Which of the following is not correct?

In the raw data, the control genes are well below 0 and the rest are only slightly above. Given the experimental design, we want the controls at 0 and the rest above.

TRUE

Compared to the raw data, quantile normalized data is in general lower, for both control and non-control genes. This is slightly worse than the raw data.

TRUE

With SQN the control genes are now centered at 0 and the non-control genes are mostly above 0. This is the desired result.

TRUE

Quantile normalization has centered the majority of genes. The controls are only a minority so this is the preferred result.

FALSE

———————————————————–

SYSTEMATIC BIAS ASSESSMENT

In this assessment, we will examine the potential for sequence based systematic bias. In particular we will examine how the GC content of the gene transcript might affect the read counts in an RNA-seq experiment.

Systematic bias is important to keep in mind, especially in experiments with many batches. Comparisons across batches are complicated by technical biases, and it is essential that conditions of interest (treatment vs untreated) be balanced across batches, or else the experiment will uncover a mix of biological differences and systematic bias, with no hope to disentangle the two. This issue comes up over and over again in genomic studies, since the beginning of microarray data and continuing to the present day, with massive sequencing-based genomic data sets.

The ReCount project provides summaries of a number of RNA-seq experiments as ExpressionSet objects, so that labs around the world do not have to perform alignment just to “re-count” the number of reads which fall in genes. If you use data from this project make sure to cite the authors! Download the ExpressionSet for the Bottomly experiment, here. Load this object into R:

library(Biobase)
load("bottomly_eset.RData")
bottomly.eset
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 36536 features, 21 samples 
##   element names: exprs 
## protocolData: none
## phenoData
##   sampleNames: SRX033480 SRX033488 ... SRX033494 (21 total)
##   varLabels: sample.id num.tech.reps ... lane.number (5 total)
##   varMetadata: labelDescription
## featureData
##   featureNames: ENSMUSG00000000001 ENSMUSG00000000003 ...
##     ENSMUSG00000090268 (36536 total)
##   fvarLabels: gene
##   fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## Annotation:
head(exprs(bottomly.eset))
##                    SRX033480 SRX033488 SRX033481 SRX033489 SRX033482
## ENSMUSG00000000001       369       744       287       769       348
## ENSMUSG00000000003         0         0         0         0         0
## ENSMUSG00000000028         0         1         0         1         1
## ENSMUSG00000000031         0         0         0         0         0
## ENSMUSG00000000037         0         1         1         5         0
## ENSMUSG00000000049         0         1         0         1         0
##                    SRX033490 SRX033483 SRX033476 SRX033478 SRX033479
## ENSMUSG00000000001       803       433       469       585       321
## ENSMUSG00000000003         0         0         0         0         0
## ENSMUSG00000000028         1         0         7         6         1
## ENSMUSG00000000031         0         0         0         0         0
## ENSMUSG00000000037         4         0         0         0         0
## ENSMUSG00000000049         0         0         0         0         0
##                    SRX033472 SRX033473 SRX033474 SRX033475 SRX033491
## ENSMUSG00000000001       301       461       309       374       781
## ENSMUSG00000000003         0         0         0         0         0
## ENSMUSG00000000028         1         1         1         1         1
## ENSMUSG00000000031         0         0         0         0         0
## ENSMUSG00000000037         4         1         1         0         1
## ENSMUSG00000000049         0         0         0         0         0
##                    SRX033484 SRX033492 SRX033485 SRX033493 SRX033486
## ENSMUSG00000000001       555       820       294       758       419
## ENSMUSG00000000003         0         0         0         0         0
## ENSMUSG00000000028         2         1         1         4         1
## ENSMUSG00000000031         0         0         0         0         0
## ENSMUSG00000000037         2         1         1         1         1
## ENSMUSG00000000049         0         0         0         0         0
##                    SRX033494
## ENSMUSG00000000001       857
## ENSMUSG00000000003         0
## ENSMUSG00000000028         5
## ENSMUSG00000000031         0
## ENSMUSG00000000037         2
## ENSMUSG00000000049         0

Examine the histogram of sample-sample correlations of log counts:

hist(cor(log(exprs(bottomly.eset) + 1)))

There are a number of reasons for such high correlations:

Remove some of the rows with many zeros, and plot the histogram of correlations again:

mat = exprs(bottomly.eset)
rs = rowSums(mat)
hist(cor(log(mat[ rs > 10, ] + 1)))


Subset and rename the ExpressionSet to remove the genes which have 2 or fewer reads across all samples:

e = bottomly.eset[ rowSums(exprs(bottomly.eset)) > 2, ]

We now have 12971 genes and 21 samples. The genes are annotated with ENSEMBL IDs, stored as rownames (and in fData(e)). The quickest way for us to check the GC content of each gene is to use the precomputed TxDb for mouse in Bioconductor. If we were doing this analysis not just for quick check, we should load the ENSEMBL genes from the ENSEMBL GTF file and use these for computing GC.

Install the TxDb for mouse (13 Mb), the BSgenome for mouse (593 Mb), and the annotation database for mouse (56 Mb)

library("TxDb.Mmusculus.UCSC.mm9.knownGene")
## Loading required package: GenomicFeatures
## Loading required package: S4Vectors
## Loading required package: stats4
## Loading required package: IRanges
## Loading required package: GenomeInfoDb
## Loading required package: GenomicRanges
## Loading required package: AnnotationDbi
library("BSgenome.Mmusculus.UCSC.mm9")
## Loading required package: BSgenome
## Loading required package: Biostrings
## Loading required package: XVector
## Loading required package: rtracklayer
library("org.Mm.eg.db")
## Loading required package: DBI

Now we will connect the ENSEMBL IDs to Entrez IDs, and then get the sequence for the exons per gene.

res = select(org.Mm.eg.db, keys=rownames(e), keytype="ENSEMBL", columns="ENTREZID")

Note there are 1 to many mappings, which will we ignore for this quick analysis. Now add a column to the ExpressionSet, by getting the first match of the ENTREZ IDs to our ENSEMBL IDs.

fData(e)$ENTREZ = res$ENTREZID[ match(rownames(e), res$ENSEMBL) ]

There are some NA’s for those ENSEMBL IDs missing matches, which we will just remove from analysis:

sum(is.na(fData(e)$ENTREZ))
## [1] 1170
e = e[ !is.na(fData(e)$ENTREZ) , ]

Now it’s only two steps to get the GC content: get the genes, and get the sequence for the genes. First we get the exons for each gene:

txdb = TxDb.Mmusculus.UCSC.mm9.knownGene
grl = exonsBy(txdb, by="gene")

We need to subset the ExpressionSet again to only those genes for which we have ranges:

e = e[ fData(e)$ENTREZ %in% names(grl), ]

Now put the exon GRangesList in the order of the counts ExpressionSet and reduce the exons GRangesList. The reduce() call is necessary so that we don’t overcount any overlapping exon sequence.

reduced.exons = reduce(grl[ fData(e)$ENTREZ ])

Question 3.9.1

Get the sequence for the reduced exons using getSeq. Note that the returned object is a DNAStringSetList. For each gene, we have a DNAStringSet of the reduced exon sequence. By calling DNAStringSet(lapply(x, unlist)) on the DNAStringSetList, ‘x’, we get a DNAStringSet for each gene with the joined reduced exon sequence. Note that this unlisting does take a few minutes to perform though.

What is the GC-content (a ratio, answer between 0 and 1) for the first gene in reduced.exons? Hint: letterFrequency

library("BSgenome.Mmusculus.UCSC.mm9")
mssq = BSgenome.Mmusculus.UCSC.mm9
class(mssq)
## [1] "BSgenome"
## attr(,"package")
## [1] "BSgenome"
rdsq = getSeq(mssq, reduced.exons)
ul = DNAStringSet(lapply(rdsq, unlist))
sum(letterFrequency(ul[1], letters = c("C","G"), as.prob = TRUE))
## [1] 0.4151355
# or more simple way

x = getSeq(Mmusculus, reduced.exons)
dss = DNAStringSet(lapply(x, unlist))
gc = letterFrequency(dss,"GC", as.prob = TRUE)

Plot the log counts over the GC content for one sample:

plot( log(exprs(e)[,1]+1) ~ gc)

It might be easier to see a dependence with boxplots:

boxplot( log(exprs(e)[,1]+1) ~ cut(gc, 20))

We can calculate the median log count for each bin of GC content for a single sample:

sapply(split(log(exprs(e)[,1]+1), cut(gc, 20)), median)
## (0.333,0.354] (0.354,0.373] (0.373,0.393] (0.393,0.413] (0.413,0.432] 
##      2.426015      2.639057      3.433987      3.569433      3.713572 
## (0.432,0.452] (0.452,0.472] (0.472,0.492] (0.492,0.511] (0.511,0.531] 
##      3.688879      3.465736      3.526361      3.496508      3.178054 
## (0.531,0.551]  (0.551,0.57]   (0.57,0.59]   (0.59,0.61]   (0.61,0.63] 
##      3.178054      3.433987      3.465736      3.367296      3.901922 
##  (0.63,0.649] (0.649,0.669] (0.669,0.689] (0.689,0.709] (0.709,0.729] 
##      3.113268      1.945910      3.481122      2.845180      2.890372

We can perform this calculation across all samples as well:

gc.depend = sapply(1:ncol(e), function(i) sapply(split(log(exprs(e)[,i]+1), cut(gc, 20)), median))

Plot the GC dependence and color the lines by batch identifier:

plot(gc.depend[,1], type="n", ylim=c(0,6))
batch = factor(e$experiment.number)
for (i in 1:ncol(e)) lines(gc.depend[,i], col=batch[i])
legend("bottom", levels(batch), col=1:3, lty=1)

Question 3.9.2

Which batch number has the highest counts for the low GC genes?

7

Which batch number has the highest counts for the high GC genes – in particular for the second to highest GC bin?

4