The question has come up a few times on the Bioconductor support site how to use methods like DESeq2 or edgeR to detect allele-specific expression, and how to see if the allelic ratio differs across condition. Aaron Lun has written up some examples about how to do this. The trick here and similarly with multi-assay designs (e.g. CLIP-seq) is that individual samples appear twice in the counts matrix. For testing for allele-specific expression (ASE) or condition-specific ASE, the counts for one allele and the other appear as subsequent columns in the counts matrix.

The design formula used for this analysis is given below. We have a term for the differences across condition (control and treated samples), as well as a term to account for variation among the different samples within control and treated groups. Finally, we use an interaction between the condition and the count (reference or alternate allele counts), to estimate the alternate vs reference allele ratios separately for the two conditions.

design <- ~condition + condition:sample + condition:count

We will show in the following code how the alternate vs reference allele ratios can be translated into more standard alternate vs total allele ratios.

One issue which users might not expect is that library size factors are not needed with this setup, as individual terms in the design are used to model the read count for a gene and for a sample. Actually estimating library sizes could potentially impair inference if there were large shifts in the expression ratio across many genes.

Below I set up some artificial read counts just for demonstration of how the design and steps would look for DESeq2 (it would look similar for edgeR or limma-voom, etc.).

Fair warning: I don’t know the degree to which the Beta Binomial models (arguably a better specified model) are better at ASE and condition-specific ASE. A good discussion of statistical testing and preparing data for ASE is Castel (2015). I discuss the difference between the Negative Binomial and the Beta Binomial a bit more at the end of the document.

n <- 1000 # number of genes
m <- 10 # number of samples
mu <- 100 # mean total counts
disp <- .1 # dispersion of total counts
mat <- matrix(rnbinom(n*m, mu=n, size=1/disp), ncol=m)
prob <- matrix(.5, nrow=n, ncol=m)
sim.ase.prob <- .25
sim.ase.genes <- (.8*n+1):(.9*n)
sim.cs.ase.genes <- (n*.9+1):n
prob[sim.ase.genes,] <- sim.ase.prob
prob[sim.cs.ase.genes,(m/2+1):m] <- sim.ase.prob
alt <- matrix(rbinom(n*m, mat, prob), ncol=m)
ref <- mat - alt

A quick plot of the alt / total ratios from our simulated counts:

image(t((alt/mat)[n:1,]), zlim=c(0,1), 
      col=colorRampPalette(c("red","white","darkgreen"))(101),
      main="ratio of alt allele / total count",
      xaxt="n", yaxt="n", xlab="samples", ylab="genes")
legend("top",c("0","0.5","1"),fill=c("red","white","darkgreen"))

The matrix bigmat will store all the allele counts, for reference and alternate in consecutive columns:

bigmat <- matrix(0, nrow=n, ncol=2*m)
bigmat[,2*(0:(m-1))+1] <- ref
bigmat[,2*(0:(m-1))+2] <- alt
head(bigmat) # the top genes
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
## [1,]  452  476  405  426  398  426  491  488  456   442   707   648   663
## [2,]  342  327  536  587  597  605  337  362  438   423   396   386   362
## [3,]  456  452  436  394  243  238  567  551  502   493   468   453   721
## [4,]  822  846  782  821  510  485  369  336  691   679   509   499   481
## [5,]  436  404  466  448  510  506  269  310  432   411   320   355   678
## [6,]  327  322  433  369  403  390  283  257  557   583   326   324   296
##      [,14] [,15] [,16] [,17] [,18] [,19] [,20]
## [1,]   752   769   725   351   371   439   410
## [2,]   352   291   281   942  1004   351   316
## [3,]   695   687   687   371   352   406   375
## [4,]   490   557   552   330   368   594   565
## [5,]   660   541   511   575   583   603   583
## [6,]   314   339   367   662   665   440   441
head(bigmat[sim.ase.genes,]) # the genes with ASE
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
## [1,]  354  135  646  218  952  346 1254  388  564   186   445   158   392
## [2,]  496  175  650  230  931  285  711  239  339   111   496   173   579
## [3,] 1167  339  818  221  349  143  634  231  461   174   745   227   922
## [4,]  595  194  775  260  764  266  741  225  790   238  1025   346   489
## [5,]  734  252 1038  339 1171  385  532  171 1181   376   557   189   559
## [6,] 1186  382  563  185  684  223 1057  353  636   175   385   164  1045
##      [,14] [,15] [,16] [,17] [,18] [,19] [,20]
## [1,]   166   836   285   478   155   944   324
## [2,]   191   872   281   785   204   813   298
## [3,]   336   427   113  1142   382   702   242
## [4,]   151   781   257   576   201   435   157
## [5,]   173   678   218   598   223   768   241
## [6,]   339   609   204   484   161   846   303
head(bigmat[sim.cs.ase.genes,]) # the genes with condition-specific ASE
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
## [1,]  426  446  489  494  489  502  556  547  452   439   655   206   433
## [2,]  392  428  246  229  652  655  369  351  537   506   724   241  1081
## [3,]  460  472  505  505  574  612  441  471  366   368   783   289   685
## [4,]  680  696  754  744  247  258  393  429  427   421   571   211   860
## [5,]  525  520  321  339  661  657  536  526  547   589   751   281  1153
## [6,]  458  474  500  474  275  295  475  494  388   370   716   241   934
##      [,14] [,15] [,16] [,17] [,18] [,19] [,20]
## [1,]   128   914   281  1171   417   691   221
## [2,]   360   251    73   519   155   747   242
## [3,]   232   649   216   940   272   811   268
## [4,]   254   920   323  1139   398   662   190
## [5,]   347   882   285   888   284   632   207
## [6,]   354   515   188   773   224   461   138
mode(bigmat) <- "integer"

The sample table looks like:

samps <- data.frame(sample=factor(rep(rep(1:(m/2),each=2),2)),
                    count=factor(rep(c("ref","alt"),m),levels=c("ref","alt")),
                    condition=factor(rep(c("control","treated"),each=m)))
samps
##    sample count condition
## 1       1   ref   control
## 2       1   alt   control
## 3       2   ref   control
## 4       2   alt   control
## 5       3   ref   control
## 6       3   alt   control
## 7       4   ref   control
## 8       4   alt   control
## 9       5   ref   control
## 10      5   alt   control
## 11      1   ref   treated
## 12      1   alt   treated
## 13      2   ref   treated
## 14      2   alt   treated
## 15      3   ref   treated
## 16      3   alt   treated
## 17      4   ref   treated
## 18      4   alt   treated
## 19      5   ref   treated
## 20      5   alt   treated

Now we can build the DESeqDataSet:

library("DESeq2")
dds <- DESeqDataSetFromMatrix(bigmat, samps, design)

Because we have sample in the design, and so we are looking at ratios within sample, we do not want to use DESeq2’s size factor normalization, and hence set the size factors all to 1.

Here fitType="mean" is needed because of artificial data simulation. "parametric" or "local" may be more appropriate for real data.

sizeFactors(dds) <- rep(1, 2*m)
dds <- DESeq(dds, fitType="mean")
## using pre-existing size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
resultsNames(dds)
##  [1] "Intercept"                    "condition_treated_vs_control"
##  [3] "conditioncontrol.sample2"     "conditiontreated.sample2"    
##  [5] "conditioncontrol.sample3"     "conditiontreated.sample3"    
##  [7] "conditioncontrol.sample4"     "conditiontreated.sample4"    
##  [9] "conditioncontrol.sample5"     "conditiontreated.sample5"    
## [11] "conditioncontrol.countalt"    "conditiontreated.countalt"

The term conditioncontrol.countalt gives the alt / ref ratio in control:

res.control <- results(dds, name="conditioncontrol.countalt")
cont.alt.vs.ref <- 2^res.control$log2FoldChange
plot(cont.alt.vs.ref)

We can translate this into the alt / total ratio:

cont.alt.vs.total <- cont.alt.vs.ref / (1 + cont.alt.vs.ref)
plot(cont.alt.vs.total)