The question has come up a few times on the Bioconductor support site how to use methods like DESeq2 or edgeR to detect allele-specific expression, and how to see if the allelic ratio differs across condition. Aaron Lun has written up some examples about how to do this. The trick here and similarly with multi-assay designs (e.g. CLIP-seq) is that individual samples appear twice in the counts matrix. For testing for allele-specific expression (ASE) or condition-specific ASE, the counts for one allele and the other appear as subsequent columns in the counts matrix.

The design formula used for this analysis is given below. We have a term for the differences across condition (control and treated samples), as well as a term to account for variation among the different samples within control and treated groups. Finally, we use an interaction between the condition and the count (reference or alternate allele counts), to estimate the alternate vs reference allele ratios separately for the two conditions.

design <- ~condition + condition:sample + condition:count

We will show in the following code how the alternate vs reference allele ratios can be translated into more standard alternate vs total allele ratios.

One issue which users might not expect is that library size factors are not needed with this setup, as individual terms in the design are used to model the read count for a gene and for a sample. Actually estimating library sizes could potentially impair inference if there were large shifts in the expression ratio across many genes.

Below I set up some artificial read counts just for demonstration of how the design and steps would look for DESeq2 (it would look similar for edgeR or limma-voom, etc.).

Fair warning: I don’t know the degree to which the Beta Binomial models (arguably a better specified model) are better at ASE and condition-specific ASE. A good discussion of statistical testing and preparing data for ASE is Castel (2015). I discuss the difference between the Negative Binomial and the Beta Binomial a bit more at the end of the document.

n <- 1000 # number of genes
m <- 10 # number of samples
mu <- 100 # mean total counts
disp <- .1 # dispersion of total counts
mat <- matrix(rnbinom(n*m, mu=n, size=1/disp), ncol=m)
prob <- matrix(.5, nrow=n, ncol=m)
sim.ase.prob <- .25
sim.ase.genes <- (.8*n+1):(.9*n)
sim.cs.ase.genes <- (n*.9+1):n
prob[sim.ase.genes,] <- sim.ase.prob
prob[sim.cs.ase.genes,(m/2+1):m] <- sim.ase.prob
alt <- matrix(rbinom(n*m, mat, prob), ncol=m)
ref <- mat - alt

A quick plot of the alt / total ratios from our simulated counts:

image(t((alt/mat)[n:1,]), zlim=c(0,1), 
      col=colorRampPalette(c("red","white","darkgreen"))(101),
      main="ratio of alt allele / total count",
      xaxt="n", yaxt="n", xlab="samples", ylab="genes")
legend("top",c("0","0.5","1"),fill=c("red","white","darkgreen"))

The matrix bigmat will store all the allele counts, for reference and alternate in consecutive columns:

bigmat <- matrix(0, nrow=n, ncol=2*m)
bigmat[,2*(0:(m-1))+1] <- ref
bigmat[,2*(0:(m-1))+2] <- alt
head(bigmat) # the top genes

##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
## [1,]  452  476  405  426  398  426  491  488  456   442   707   648   663
## [2,]  342  327  536  587  597  605  337  362  438   423   396   386   362
## [3,]  456  452  436  394  243  238  567  551  502   493   468   453   721
## [4,]  822  846  782  821  510  485  369  336  691   679   509   499   481
## [5,]  436  404  466  448  510  506  269  310  432   411   320   355   678
## [6,]  327  322  433  369  403  390  283  257  557   583   326   324   296
##      [,14] [,15] [,16] [,17] [,18] [,19] [,20]
## [1,]   752   769   725   351   371   439   410
## [2,]   352   291   281   942  1004   351   316
## [3,]   695   687   687   371   352   406   375
## [4,]   490   557   552   330   368   594   565
## [5,]   660   541   511   575   583   603   583
## [6,]   314   339   367   662   665   440   441

head(bigmat[sim.ase.genes,]) # the genes with ASE

##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
## [1,]  354  135  646  218  952  346 1254  388  564   186   445   158   392
## [2,]  496  175  650  230  931  285  711  239  339   111   496   173   579
## [3,] 1167  339  818  221  349  143  634  231  461   174   745   227   922
## [4,]  595  194  775  260  764  266  741  225  790   238  1025   346   489
## [5,]  734  252 1038  339 1171  385  532  171 1181   376   557   189   559
## [6,] 1186  382  563  185  684  223 1057  353  636   175   385   164  1045
##      [,14] [,15] [,16] [,17] [,18] [,19] [,20]
## [1,]   166   836   285   478   155   944   324
## [2,]   191   872   281   785   204   813   298
## [3,]   336   427   113  1142   382   702   242
## [4,]   151   781   257   576   201   435   157
## [5,]   173   678   218   598   223   768   241
## [6,]   339   609   204   484   161   846   303

head(bigmat[sim.cs.ase.genes,]) # the genes with condition-specific ASE

##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
## [1,]  426  446  489  494  489  502  556  547  452   439   655   206   433
## [2,]  392  428  246  229  652  655  369  351  537   506   724   241  1081
## [3,]  460  472  505  505  574  612  441  471  366   368   783   289   685
## [4,]  680  696  754  744  247  258  393  429  427   421   571   211   860
## [5,]  525  520  321  339  661  657  536  526  547   589   751   281  1153
## [6,]  458  474  500  474  275  295  475  494  388   370   716   241   934
##      [,14] [,15] [,16] [,17] [,18] [,19] [,20]
## [1,]   128   914   281  1171   417   691   221
## [2,]   360   251    73   519   155   747   242
## [3,]   232   649   216   940   272   811   268
## [4,]   254   920   323  1139   398   662   190
## [5,]   347   882   285   888   284   632   207
## [6,]   354   515   188   773   224   461   138

mode(bigmat) <- "integer"

The sample table looks like:

samps <- data.frame(sample=factor(rep(rep(1:(m/2),each=2),2)),
                    count=factor(rep(c("ref","alt"),m),levels=c("ref","alt")),
                    condition=factor(rep(c("control","treated"),each=m)))
samps

##    sample count condition
## 1       1   ref   control
## 2       1   alt   control
## 3       2   ref   control
## 4       2   alt   control
## 5       3   ref   control
## 6       3   alt   control
## 7       4   ref   control
## 8       4   alt   control
## 9       5   ref   control
## 10      5   alt   control
## 11      1   ref   treated
## 12      1   alt   treated
## 13      2   ref   treated
## 14      2   alt   treated
## 15      3   ref   treated
## 16      3   alt   treated
## 17      4   ref   treated
## 18      4   alt   treated
## 19      5   ref   treated
## 20      5   alt   treated

Now we can build the DESeqDataSet:

library("DESeq2")
dds <- DESeqDataSetFromMatrix(bigmat, samps, design)

Because we have sample in the design, and so we are looking at ratios within sample, we do not want to use DESeq2’s size factor normalization, and hence set the size factors all to 1.

Here fitType="mean" is needed because of artificial data simulation. "parametric" or "local" may be more appropriate for real data.

sizeFactors(dds) <- rep(1, 2*m)
dds <- DESeq(dds, fitType="mean")

## using pre-existing size factors

## estimating dispersions

## gene-wise dispersion estimates

## mean-dispersion relationship

## final dispersion estimates

## fitting model and testing

resultsNames(dds)

##  [1] "Intercept"                    "condition_treated_vs_control"
##  [3] "conditioncontrol.sample2"     "conditiontreated.sample2"    
##  [5] "conditioncontrol.sample3"     "conditiontreated.sample3"    
##  [7] "conditioncontrol.sample4"     "conditiontreated.sample4"    
##  [9] "conditioncontrol.sample5"     "conditiontreated.sample5"    
## [11] "conditioncontrol.countalt"    "conditiontreated.countalt"

The term conditioncontrol.countalt gives the alt / ref ratio in control:

res.control <- results(dds, name="conditioncontrol.countalt")
cont.alt.vs.ref <- 2^res.control$log2FoldChange
plot(cont.alt.vs.ref)

We can translate this into the alt / total ratio:

cont.alt.vs.total <- cont.alt.vs.ref / (1 + cont.alt.vs.ref)
plot(cont.alt.vs.total)

The p-value in this results table tests if the alt / ref ratio is not equal to 1, so if the alt / total ratio is not equal to 0.5:

plot(cont.alt.vs.total, 
     col=ifelse(res.control$padj < .1, "red", "black"))

Analagously, the conditiontreated.countalt term gives the in the alt vs ref ratio in treated samples:

res.trt <- results(dds, name="conditiontreated.countalt")

Again we can translate a difference in the alt / ref ratio to the alt / total ratio in treated samples:

trt.alt.vs.ref <- 2^res.trt$log2FoldChange
trt.alt.vs.total <- trt.alt.vs.ref / (1 + trt.alt.vs.ref)
plot(trt.alt.vs.total)

With p-values testing for the alt / total ratio not equal to 0.5:

plot(trt.alt.vs.total, 
     col=ifelse(res.trt$padj < .1, "red", "black"))

Finally we can test the difference between the two alt / ref ratios:

res.diff <- results(dds, contrast=list("conditiontreated.countalt",
                                       "conditioncontrol.countalt"))

We can translate this into the difference between the alt / total ratios. The numerator and denominator of the contrast both need to be divided by the factors used in the previous code chunks.

diff.alt.vs.ref <- 2^res.diff$log2FoldChange
diff.alt.vs.total <- diff.alt.vs.ref * (1/(1 + trt.alt.vs.ref)) / (1/(1 + cont.alt.vs.ref))

And we can plot this with p-values from our test of the contrast:

plot(diff.alt.vs.total, 
     col=ifelse(res.diff$padj < .1, "red", "black"))

On Beta Binomial and Negative Binomial

As discussed in Castel (2015), it is common that the alternate counts modeled as a random draw from the total count will be overdispersed for a Binomial random variable. This means that, while you expect many sites to have a ratio around 0.5, instead of seeing histograms that look like a binomial with a probability of 0.5…

hist(rbinom(1000,size=100,prob=0.5),breaks=0:100,
     freq=FALSE,col="grey",border="white", xlim=c(20,80),ylim=c(0,.1))
lines(0:100,dbinom(0:100,size=100,prob=0.5),col="magenta",lwd=3)

You see a wider histogram than expected, which can happen if the underlying ratio driving the counts is varying around 0.5, either for biological or technical reasons:

prob <- rbeta(1000,shape1=20,shape2=20)
hist(prob,freq=FALSE,col="grey",border="white")

hist(rbinom(1000,size=100,prob=prob),breaks=0:100,
     freq=FALSE,col="grey",border="white", xlim=c(20,80),ylim=c(0,.1))
lines(0:100,dbinom(0:100,size=100,prob=0.5),col="magenta",lwd=3)

It is therefore useful to model the overdispersed counts observed for a given allele using a model that incorporates a varying probability parameter of the Binomial. The Beta Binomial distribution gives this flexibility, and requires that we estimate the amount of overdispersion for each site.

This is analagous to the Negative Binomial, also called the Gamma Poisson distribution. This distribution can be generated by constructing a Poisson random variable with a mean parameter that itself is not fixed but distributed as a Gamma random variable. This provides a more flexible distribution for counts, as it allows for changes in location and scale separately, and requires that we estimate the amount of overdispersion at each gene (in the typical differential gene expression setting).

Modeling the alternate and reference allele counts separately as Negative Binomial counts could possibly be sup-optimal for statistical testing compared to a Beta Binomial. I could imagine that when the ratios are close to 0 or 1, considering the total count as fixed and modeling the count of the alternate may provide some benefit. But this is just speculation.

sessionInfo()

## R version 3.4.0 (2017-04-21)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: OS X El Capitan 10.11.6
## 
## Matrix products: default
## BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices datasets  utils    
## [8] methods   base     
## 
## other attached packages:
##  [1] DESeq2_1.16.0              SummarizedExperiment_1.6.1
##  [3] DelayedArray_0.2.0         matrixStats_0.52.2        
##  [5] Biobase_2.36.1             GenomicRanges_1.28.1      
##  [7] GenomeInfoDb_1.12.0        IRanges_2.10.0            
##  [9] S4Vectors_0.14.0           BiocGenerics_0.22.0       
## [11] rmarkdown_1.5              BiocInstaller_1.26.0      
## [13] testthat_1.0.2             devtools_1.12.0           
## 
## loaded via a namespace (and not attached):
##  [1] locfit_1.5-9.1          Rcpp_0.12.10           
##  [3] lattice_0.20-35         rprojroot_1.2          
##  [5] digest_0.6.12           R6_2.2.0               
##  [7] plyr_1.8.4              backports_1.0.5        
##  [9] acepack_1.4.1           RSQLite_1.1-2          
## [11] evaluate_0.10           ggplot2_2.2.1          
## [13] zlibbioc_1.22.0         lazyeval_0.2.0         
## [15] data.table_1.10.4       annotate_1.54.0        
## [17] rpart_4.1-11            Matrix_1.2-10          
## [19] checkmate_1.8.2         splines_3.4.0          
## [21] BiocParallel_1.10.1     geneplotter_1.54.0     
## [23] stringr_1.2.0           foreign_0.8-68         
## [25] htmlwidgets_0.8         RCurl_1.95-4.8         
## [27] munsell_0.4.3           compiler_3.4.0         
## [29] base64enc_0.1-3         htmltools_0.3.6        
## [31] nnet_7.3-12             tibble_1.3.0           
## [33] gridExtra_2.2.1         htmlTable_1.9          
## [35] GenomeInfoDbData_0.99.0 Hmisc_4.0-3            
## [37] XML_3.98-1.6            crayon_1.3.2           
## [39] withr_1.0.2             bitops_1.0-6           
## [41] grid_3.4.0              xtable_1.8-2           
## [43] gtable_0.2.0            DBI_0.6-1              
## [45] magrittr_1.5            scales_0.4.1           
## [47] stringi_1.1.5           XVector_0.16.0         
## [49] genefilter_1.58.0       latticeExtra_0.6-28    
## [51] Formula_1.2-1           RColorBrewer_1.1-2     
## [53] tools_3.4.0             rsconnect_0.8          
## [55] survival_2.41-3         yaml_2.1.14            
## [57] AnnotationDbi_1.38.0    colorspace_1.3-2       
## [59] cluster_2.0.6           memoise_1.1.0          
## [61] knitr_1.15.1

Using RNA-seq DE methods to detect allele-specific expression

Michael Love

12 May, 2017

On Beta Binomial and Negative Binomial