subSeq: Determining appropriate sequencing depth through efficient read subsampling

October 10, 2014

Question

Role of read depth in experiment design
- "Do you have enough reads?": Introducing subSeq software
- "Do people usually have enough reads?": subSeq results across many RNA-Seq experiments
- "Exactly how many reads do we need?": Predicting the curve

Multiplexing offers new challenges in experiment design

Multiplexed barcode sequencing allows multiple libraries to be run on the same flowcell
Investigators can choose how to allocate their reads
- Additional replicates
- Additional conditions/time points
- Multiple experiments
Designing the experiment requires knowing effect of read depth

Goals

When designing a sequencing experiment, choose a read depth and number of replicates to satisfy any of the following requirements:
- Number of significant genes
- Statistical power for a given gene or set of genes
- Statistical power for a given effect size
Given an existing experiment, determine if we have a high enough depth to trust the biological conclusions
Test how analysis methods are affected by read depth

We can simulate smaller read depths using subsampling

Where are we on the "saturation curve"

plot of chunk saturation_curves

The subSeq R package

Bioinformatics Applications Note

Key: subsampling can be performed on counts, not reads

Given an \(m \times n\) matrix \(X\) of counts
Construct a matrix \(Y\), where every read has a probability \(p\) of being included:

\[ Y_{i,j} \sim \mbox{Binom}(X_{i,j}, p) \]

Mathematically identical to (e.g.) DownsampleSam software

Key: subsampling can be performed on count matrix

What you need

Unnormalized matrix of read counts (e.g. from htseq-count)
- One row per gene
- One column per sample
Choice of subsampling proportions in (0, 1]
- More proportions = more running time, more detail
Choice of one or more analysis methods
- Anything you can do on the full count matrix in R

Example: Hammer et al data

library(subSeq)
data(hammer)
library(Biobase)
countmatrix <- exprs(hammer)[, 1:4]
countmatrix <- countmatrix[rowSums(countmatrix) >= 5, ]
head(countmatrix)

##                    SRX020102 SRX020103 SRX020104 SRX020105
## ENSRNOG00000000001         2         4        18        24
## ENSRNOG00000000007         4         1         3         1
## ENSRNOG00000000008         0         1         4         2
## ENSRNOG00000000010        19        10        19        13
## ENSRNOG00000000012         7         5         1         0
## ENSRNOG00000000017         4         1        12        18

Treatment

We compare a control group with rats treated with L5 SNL.

treatment <- phenoData(hammer)$protocol[1:4]

treatment

## [1] control control L5 SNL  L5 SNL 
## Levels: control L5 SNL

Proportions

We recommend choosing proportions on a logarithmic scale.

proportions <- 10^seq(-2, 0, .25)
proportions

## [1] 0.01000 0.01778 0.03162 0.05623 0.10000 0.17783 0.31623 0.56234 1.00000

Subsampling

Choose one or more RNA-Seq differential expression methods for each step.

subsamples <- subsample(countmatrix, proportions, method=c("edgeR", "DESeq2"),
                       treatment=treatment)

Summarize into a table

subsamples.summ <- summary(subsamples)
head(subsamples.summ)

##      depth proportion method replication significant pearson spearman
## 1:  198567    0.01000  edgeR           1         201  0.3296   0.3892
## 2:  352812    0.01778  edgeR           1         425  0.3920   0.4603
## 3:  627759    0.03162  edgeR           1         827  0.4606   0.5373
## 4: 1117230    0.05623  edgeR           1        1534  0.5503   0.6222
## 5: 1985611    0.10000  edgeR           1        2468  0.6417   0.7127
## 6: 3530798    0.17783  edgeR           1        3502  0.7456   0.8001
##       MSE   estFDP     rFDP percent
## 1: 2.3728 0.001191 0.000000 0.02403
## 2: 2.0724 0.004430 0.007059 0.05046
## 3: 1.7405 0.007126 0.008464 0.09805
## 4: 1.3607 0.005091 0.003911 0.18271
## 5: 1.0577 0.009132 0.009319 0.29236
## 6: 0.7355 0.013354 0.013992 0.41289

Summarize into a table

depth	proportion	significant	pearson	spearman	MSE	estFDP	rFDP	percent
198567	0.0100	201	0.3296	0.3892	2.3728	0.0012	0.0000	0.0240
352812	0.0178	425	0.3920	0.4603	2.0724	0.0044	0.0071	0.0505
627759	0.0316	827	0.4606	0.5373	1.7405	0.0071	0.0085	0.0981
1117230	0.0562	1534	0.5503	0.6222	1.3607	0.0051	0.0039	0.1827
1985611	0.1000	2468	0.6417	0.7127	1.0577	0.0091	0.0093	0.2924
3530798	0.1778	3502	0.7456	0.8001	0.7355	0.0134	0.0140	0.4129
6280315	0.3162	5152	0.8323	0.8728	0.4479	0.0243	0.0219	0.6025
11172016	0.5623	6763	0.9194	0.9419	0.1967	0.0340	0.0271	0.7868
19862441	1.0000	8363	1.0000	1.0000	0.0000	0.0510	0.0000	1.0000

Plot

plot(subsamples.summ)

plot of chunk hammer_plot

subSeq makes it easy to plot data other ways

plot of chunk pvalue_hists

Other features (see vignette)

Perform multiple replications of subsampling at each depth
Use your own custom analysis methods (anything that can be done on a count matrix in R)
- Alternative splicing/differential exon usage
Recreate the exact count matrix used at any depth

Subsampling on many experiments

Expression atlas

Repository of processed gene expression experiments
We used 25 contrasts across 12 experiments
- edgeR and DESeq2
- Performed subsampling at 41 proportions, evenly spaced on a log scale from .01 to 1
- 3 replications at each proportion

Significant genes in each experiment

Some experiments saturate more than others

plot of chunk metric_saturations

What Determines the Shape of the Curve?

Distribution of effect sizes
Distribution of gene expression level
Distribution of gene length
Experiment design (# of replicates)
Amount of variation added at each stage

We can see if saturation has happened…

plot of chunk saturation_curves_2

Can we tell when saturation will happen?

plot of chunk saturation_curves_predicted

Prediction requires a model

plot of chunk prediction_example

Log-Normal CDF Model

\[ G(p) = k\times \Phi(log(p - b), \mu, \sigma)\]

where

\(G(p)\) is the number of significant genes at proportion \(p\)
\(\Phi\) is the CDF of the normal distribution.
\(k,\mu,\sigma,b\) are the four parameters fit for each curve, found through nonlinear least squares

Log-Normal CDF Model

plot of chunk prediction_example_wpred

Log-Normal CDF model: log scale

plot of chunk prediction_example_logx

Predictive power

plot of chunk predictive_power

Predictions work in many experiments

plot of chunk prediction_many

Predictions on a log scale

plot of chunk prediction_many_log

Residuals demonstrate that the fit is reasonable

plot of chunk residuals_plot

Continuing the curve

Effect of doubling/tripling read depth

plot of chunk saturations_hist

Accuracy of such predictions

plot of chunk graph_prediction_full

Saturation is not easily predicted from read depth

plot of chunk saturations_pergene_depth

Are replications necessary?

plot of chunk plot_replication_fits

Takeaway

subSeq can predict the effect of increasing the read depth
- Can predict effect of doubling quite accurately
- Tripling with moderate accuracy
Different experiments need different numbers of reads: The extent of saturation can't easily be predicted from the depth
subSeq replications might be necessary when the number of significant genes is < 500, rarely necessary when > 1000.