October 10, 2014

Question

  • Role of read depth in experiment design
    • "Do you have enough reads?": Introducing subSeq software
    • "Do people usually have enough reads?": subSeq results across many RNA-Seq experiments
    • "Exactly how many reads do we need?": Predicting the curve

Multiplexing offers new challenges in experiment design

  • Multiplexed barcode sequencing allows multiple libraries to be run on the same flowcell
  • Investigators can choose how to allocate their reads
    • Additional replicates
    • Additional conditions/time points
    • Multiple experiments
  • Designing the experiment requires knowing effect of read depth

Goals

  • When designing a sequencing experiment, choose a read depth and number of replicates to satisfy any of the following requirements:
    • Number of significant genes
    • Statistical power for a given gene or set of genes
    • Statistical power for a given effect size
  • Given an existing experiment, determine if we have a high enough depth to trust the biological conclusions
  • Test how analysis methods are affected by read depth

We can simulate smaller read depths using subsampling

Where are we on the "saturation curve"

plot of chunk saturation_curves

The subSeq R package

Bioinformatics Applications Note

Key: subsampling can be performed on counts, not reads

  • Given an \(m \times n\) matrix \(X\) of counts
  • Construct a matrix \(Y\), where every read has a probability \(p\) of being included:

\[ Y_{i,j} \sim \mbox{Binom}(X_{i,j}, p) \]

  • Mathematically identical to (e.g.) DownsampleSam software

Key: subsampling can be performed on count matrix

What you need

  • Unnormalized matrix of read counts (e.g. from htseq-count)
    • One row per gene
    • One column per sample
  • Choice of subsampling proportions in (0, 1]
    • More proportions = more running time, more detail
  • Choice of one or more analysis methods
    • Anything you can do on the full count matrix in R

Example: Hammer et al data

library(subSeq)
data(hammer)
library(Biobase)
countmatrix <- exprs(hammer)[, 1:4]
countmatrix <- countmatrix[rowSums(countmatrix) >= 5, ]
head(countmatrix)
##                    SRX020102 SRX020103 SRX020104 SRX020105
## ENSRNOG00000000001         2         4        18        24
## ENSRNOG00000000007         4         1         3         1
## ENSRNOG00000000008         0         1         4         2
## ENSRNOG00000000010        19        10        19        13
## ENSRNOG00000000012         7         5         1         0
## ENSRNOG00000000017         4         1        12        18

Treatment

We compare a control group with rats treated with L5 SNL.

treatment <- phenoData(hammer)$protocol[1:4]

treatment
## [1] control control L5 SNL  L5 SNL 
## Levels: control L5 SNL

Proportions

We recommend choosing proportions on a logarithmic scale.

proportions <- 10^seq(-2, 0, .25)
proportions
## [1] 0.01000 0.01778 0.03162 0.05623 0.10000 0.17783 0.31623 0.56234 1.00000

Subsampling

Choose one or more RNA-Seq differential expression methods for each step.

subsamples <- subsample(countmatrix, proportions, method=c("edgeR", "DESeq2"),
                       treatment=treatment)

Summarize into a table

subsamples.summ <- summary(subsamples)
head(subsamples.summ)
##      depth proportion method replication significant pearson spearman
## 1:  198567    0.01000  edgeR           1         201  0.3296   0.3892
## 2:  352812    0.01778  edgeR           1         425  0.3920   0.4603
## 3:  627759    0.03162  edgeR           1         827  0.4606   0.5373
## 4: 1117230    0.05623  edgeR           1        1534  0.5503   0.6222
## 5: 1985611    0.10000  edgeR           1        2468  0.6417   0.7127
## 6: 3530798    0.17783  edgeR           1        3502  0.7456   0.8001
##       MSE   estFDP     rFDP percent
## 1: 2.3728 0.001191 0.000000 0.02403
## 2: 2.0724 0.004430 0.007059 0.05046
## 3: 1.7405 0.007126 0.008464 0.09805
## 4: 1.3607 0.005091 0.003911 0.18271
## 5: 1.0577 0.009132 0.009319 0.29236
## 6: 0.7355 0.013354 0.013992 0.41289

Summarize into a table

depth proportion significant pearson spearman MSE estFDP rFDP percent
198567 0.0100 201 0.3296 0.3892 2.3728 0.0012 0.0000 0.0240
352812 0.0178 425 0.3920 0.4603 2.0724 0.0044 0.0071 0.0505
627759 0.0316 827 0.4606 0.5373 1.7405 0.0071 0.0085 0.0981
1117230 0.0562 1534 0.5503 0.6222 1.3607 0.0051 0.0039 0.1827
1985611 0.1000 2468 0.6417 0.7127 1.0577 0.0091 0.0093 0.2924
3530798 0.1778 3502 0.7456 0.8001 0.7355 0.0134 0.0140 0.4129
6280315 0.3162 5152 0.8323 0.8728 0.4479 0.0243 0.0219 0.6025
11172016 0.5623 6763 0.9194 0.9419 0.1967 0.0340 0.0271 0.7868
19862441 1.0000 8363 1.0000 1.0000 0.0000 0.0510 0.0000 1.0000

Plot

plot(subsamples.summ)

plot of chunk hammer_plot

subSeq makes it easy to plot data other ways

plot of chunk pvalue_hists

Other features (see vignette)

  • Perform multiple replications of subsampling at each depth
  • Use your own custom analysis methods (anything that can be done on a count matrix in R)
    • Alternative splicing/differential exon usage
  • Recreate the exact count matrix used at any depth

Subsampling on many experiments

Expression atlas

  • Repository of processed gene expression experiments
  • We used 25 contrasts across 12 experiments
    • edgeR and DESeq2
    • Performed subsampling at 41 proportions, evenly spaced on a log scale from .01 to 1
    • 3 replications at each proportion

Significant genes in each experiment

plot of chunk significant_curves

Some experiments saturate more than others

plot of chunk metric_saturations

What Determines the Shape of the Curve?

  • Distribution of effect sizes
  • Distribution of gene expression level
  • Distribution of gene length
  • Experiment design (# of replicates)
  • Amount of variation added at each stage

We can see if saturation has happened…

plot of chunk saturation_curves_2

Can we tell when saturation will happen?

plot of chunk saturation_curves_predicted

Prediction requires a model

plot of chunk prediction_example

Log-Normal CDF Model

\[ G(p) = k\times \Phi(log(p - b), \mu, \sigma)\]

where

  • \(G(p)\) is the number of significant genes at proportion \(p\)
  • \(\Phi\) is the CDF of the normal distribution.
  • \(k,\mu,\sigma,b\) are the four parameters fit for each curve, found through nonlinear least squares

Log-Normal CDF Model

plot of chunk prediction_example_wpred

Log-Normal CDF model: log scale

plot of chunk prediction_example_logx

Predictive power

plot of chunk predictive_power

Predictions work in many experiments

plot of chunk prediction_many

Predictions on a log scale

plot of chunk prediction_many_log

Residuals demonstrate that the fit is reasonable

plot of chunk residuals_plot

Continuing the curve

Effect of doubling/tripling read depth

plot of chunk saturations_hist

Accuracy of such predictions

plot of chunk graph_prediction_full

Saturation is not easily predicted from read depth

plot of chunk saturations_pergene_depth

Are replications necessary?

plot of chunk plot_replication_fits

Takeaway

  • subSeq can predict the effect of increasing the read depth
    • Can predict effect of doubling quite accurately
    • Tripling with moderate accuracy
  • Different experiments need different numbers of reads: The extent of saturation can't easily be predicted from the depth
  • subSeq replications might be necessary when the number of significant genes is < 500, rarely necessary when > 1000.