The question has come up a few times on the Bioconductor support site how to use methods like DESeq2 or edgeR to detect allele-specific expression, and how to see if the allelic ratio differs across condition. Aaron Lun has written up some examples about how to do this. The trick here and similarly with multi-assay designs (e.g. CLIP-seq) is that individual samples appear twice in the counts matrix. For testing for allele-specific expression (ASE) or condition-specific ASE, the counts for one allele and the other appear as subsequent columns in the counts matrix.
The design formula used for this analysis is given below. We have a term for the differences across condition
(control and treated samples), as well as a term to account for variation among the different samples within control and treated groups. Finally, we use an interaction between the condition
and the count
(reference or alternate allele counts), to estimate the alternate vs reference allele ratios separately for the two conditions.
design <- ~condition + condition:sample + condition:count
We will show in the following code how the alternate vs reference allele ratios can be translated into more standard alternate vs total allele ratios.
One issue which users might not expect is that library size factors are not needed with this setup, as individual terms in the design are used to model the read count for a gene and for a sample. Actually estimating library sizes could potentially impair inference if there were large shifts in the expression ratio across many genes.
Below I set up some artificial read counts just for demonstration of how the design and steps would look for DESeq2 (it would look similar for edgeR or limma-voom, etc.).
Fair warning: I don’t know the degree to which the Beta Binomial models (arguably a better specified model) are better at ASE and condition-specific ASE. A good discussion of statistical testing and preparing data for ASE is Castel (2015). I discuss the difference between the Negative Binomial and the Beta Binomial a bit more at the end of the document.
n <- 1000 # number of genes
m <- 10 # number of samples
mu <- 100 # mean total counts
disp <- .1 # dispersion of total counts
mat <- matrix(rnbinom(n*m, mu=n, size=1/disp), ncol=m)
prob <- matrix(.5, nrow=n, ncol=m)
sim.ase.prob <- .25
sim.ase.genes <- (.8*n+1):(.9*n)
sim.cs.ase.genes <- (n*.9+1):n
prob[sim.ase.genes,] <- sim.ase.prob
prob[sim.cs.ase.genes,(m/2+1):m] <- sim.ase.prob
alt <- matrix(rbinom(n*m, mat, prob), ncol=m)
ref <- mat - alt
A quick plot of the alt / total ratios from our simulated counts:
image(t((alt/mat)[n:1,]), zlim=c(0,1),
col=colorRampPalette(c("red","white","darkgreen"))(101),
main="ratio of alt allele / total count",
xaxt="n", yaxt="n", xlab="samples", ylab="genes")
legend("top",c("0","0.5","1"),fill=c("red","white","darkgreen"))
The matrix bigmat
will store all the allele counts, for reference and alternate in consecutive columns:
bigmat <- matrix(0, nrow=n, ncol=2*m)
bigmat[,2*(0:(m-1))+1] <- ref
bigmat[,2*(0:(m-1))+2] <- alt
head(bigmat) # the top genes
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
## [1,] 452 476 405 426 398 426 491 488 456 442 707 648 663
## [2,] 342 327 536 587 597 605 337 362 438 423 396 386 362
## [3,] 456 452 436 394 243 238 567 551 502 493 468 453 721
## [4,] 822 846 782 821 510 485 369 336 691 679 509 499 481
## [5,] 436 404 466 448 510 506 269 310 432 411 320 355 678
## [6,] 327 322 433 369 403 390 283 257 557 583 326 324 296
## [,14] [,15] [,16] [,17] [,18] [,19] [,20]
## [1,] 752 769 725 351 371 439 410
## [2,] 352 291 281 942 1004 351 316
## [3,] 695 687 687 371 352 406 375
## [4,] 490 557 552 330 368 594 565
## [5,] 660 541 511 575 583 603 583
## [6,] 314 339 367 662 665 440 441
head(bigmat[sim.ase.genes,]) # the genes with ASE
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
## [1,] 354 135 646 218 952 346 1254 388 564 186 445 158 392
## [2,] 496 175 650 230 931 285 711 239 339 111 496 173 579
## [3,] 1167 339 818 221 349 143 634 231 461 174 745 227 922
## [4,] 595 194 775 260 764 266 741 225 790 238 1025 346 489
## [5,] 734 252 1038 339 1171 385 532 171 1181 376 557 189 559
## [6,] 1186 382 563 185 684 223 1057 353 636 175 385 164 1045
## [,14] [,15] [,16] [,17] [,18] [,19] [,20]
## [1,] 166 836 285 478 155 944 324
## [2,] 191 872 281 785 204 813 298
## [3,] 336 427 113 1142 382 702 242
## [4,] 151 781 257 576 201 435 157
## [5,] 173 678 218 598 223 768 241
## [6,] 339 609 204 484 161 846 303
head(bigmat[sim.cs.ase.genes,]) # the genes with condition-specific ASE
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
## [1,] 426 446 489 494 489 502 556 547 452 439 655 206 433
## [2,] 392 428 246 229 652 655 369 351 537 506 724 241 1081
## [3,] 460 472 505 505 574 612 441 471 366 368 783 289 685
## [4,] 680 696 754 744 247 258 393 429 427 421 571 211 860
## [5,] 525 520 321 339 661 657 536 526 547 589 751 281 1153
## [6,] 458 474 500 474 275 295 475 494 388 370 716 241 934
## [,14] [,15] [,16] [,17] [,18] [,19] [,20]
## [1,] 128 914 281 1171 417 691 221
## [2,] 360 251 73 519 155 747 242
## [3,] 232 649 216 940 272 811 268
## [4,] 254 920 323 1139 398 662 190
## [5,] 347 882 285 888 284 632 207
## [6,] 354 515 188 773 224 461 138
mode(bigmat) <- "integer"
The sample table looks like:
samps <- data.frame(sample=factor(rep(rep(1:(m/2),each=2),2)),
count=factor(rep(c("ref","alt"),m),levels=c("ref","alt")),
condition=factor(rep(c("control","treated"),each=m)))
samps
## sample count condition
## 1 1 ref control
## 2 1 alt control
## 3 2 ref control
## 4 2 alt control
## 5 3 ref control
## 6 3 alt control
## 7 4 ref control
## 8 4 alt control
## 9 5 ref control
## 10 5 alt control
## 11 1 ref treated
## 12 1 alt treated
## 13 2 ref treated
## 14 2 alt treated
## 15 3 ref treated
## 16 3 alt treated
## 17 4 ref treated
## 18 4 alt treated
## 19 5 ref treated
## 20 5 alt treated
Now we can build the DESeqDataSet:
library("DESeq2")
dds <- DESeqDataSetFromMatrix(bigmat, samps, design)
Because we have sample
in the design, and so we are looking at ratios within sample, we do not want to use DESeq2’s size factor normalization, and hence set the size factors all to 1.
Here fitType="mean"
is needed because of artificial data simulation. "parametric"
or "local"
may be more appropriate for real data.
sizeFactors(dds) <- rep(1, 2*m)
dds <- DESeq(dds, fitType="mean")
## using pre-existing size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
resultsNames(dds)
## [1] "Intercept" "condition_treated_vs_control"
## [3] "conditioncontrol.sample2" "conditiontreated.sample2"
## [5] "conditioncontrol.sample3" "conditiontreated.sample3"
## [7] "conditioncontrol.sample4" "conditiontreated.sample4"
## [9] "conditioncontrol.sample5" "conditiontreated.sample5"
## [11] "conditioncontrol.countalt" "conditiontreated.countalt"
The term conditioncontrol.countalt
gives the alt / ref ratio in control:
res.control <- results(dds, name="conditioncontrol.countalt")
cont.alt.vs.ref <- 2^res.control$log2FoldChange
plot(cont.alt.vs.ref)
We can translate this into the alt / total ratio:
cont.alt.vs.total <- cont.alt.vs.ref / (1 + cont.alt.vs.ref)
plot(cont.alt.vs.total)
The p-value in this results table tests if the alt / ref ratio is not equal to 1, so if the alt / total ratio is not equal to 0.5:
plot(cont.alt.vs.total,
col=ifelse(res.control$padj < .1, "red", "black"))
Analagously, the conditiontreated.countalt
term gives the in the alt vs ref ratio in treated samples:
res.trt <- results(dds, name="conditiontreated.countalt")
Again we can translate a difference in the alt / ref ratio to the alt / total ratio in treated samples:
trt.alt.vs.ref <- 2^res.trt$log2FoldChange
trt.alt.vs.total <- trt.alt.vs.ref / (1 + trt.alt.vs.ref)
plot(trt.alt.vs.total)
With p-values testing for the alt / total ratio not equal to 0.5:
plot(trt.alt.vs.total,
col=ifelse(res.trt$padj < .1, "red", "black"))
Finally we can test the difference between the two alt / ref ratios:
res.diff <- results(dds, contrast=list("conditiontreated.countalt",
"conditioncontrol.countalt"))
We can translate this into the difference between the alt / total ratios. The numerator and denominator of the contrast both need to be divided by the factors used in the previous code chunks.
diff.alt.vs.ref <- 2^res.diff$log2FoldChange
diff.alt.vs.total <- diff.alt.vs.ref * (1/(1 + trt.alt.vs.ref)) / (1/(1 + cont.alt.vs.ref))
And we can plot this with p-values from our test of the contrast:
plot(diff.alt.vs.total,
col=ifelse(res.diff$padj < .1, "red", "black"))
As discussed in Castel (2015), it is common that the alternate counts modeled as a random draw from the total count will be overdispersed for a Binomial random variable. This means that, while you expect many sites to have a ratio around 0.5, instead of seeing histograms that look like a binomial with a probability of 0.5…
hist(rbinom(1000,size=100,prob=0.5),breaks=0:100,
freq=FALSE,col="grey",border="white", xlim=c(20,80),ylim=c(0,.1))
lines(0:100,dbinom(0:100,size=100,prob=0.5),col="magenta",lwd=3)
You see a wider histogram than expected, which can happen if the underlying ratio driving the counts is varying around 0.5, either for biological or technical reasons:
prob <- rbeta(1000,shape1=20,shape2=20)
hist(prob,freq=FALSE,col="grey",border="white")
hist(rbinom(1000,size=100,prob=prob),breaks=0:100,
freq=FALSE,col="grey",border="white", xlim=c(20,80),ylim=c(0,.1))
lines(0:100,dbinom(0:100,size=100,prob=0.5),col="magenta",lwd=3)
It is therefore useful to model the overdispersed counts observed for a given allele using a model that incorporates a varying probability parameter of the Binomial. The Beta Binomial distribution gives this flexibility, and requires that we estimate the amount of overdispersion for each site.
This is analagous to the Negative Binomial, also called the Gamma Poisson distribution. This distribution can be generated by constructing a Poisson random variable with a mean parameter that itself is not fixed but distributed as a Gamma random variable. This provides a more flexible distribution for counts, as it allows for changes in location and scale separately, and requires that we estimate the amount of overdispersion at each gene (in the typical differential gene expression setting).
Modeling the alternate and reference allele counts separately as Negative Binomial counts could possibly be sup-optimal for statistical testing compared to a Beta Binomial. I could imagine that when the ratios are close to 0 or 1, considering the total count as fixed and modeling the count of the alternate may provide some benefit. But this is just speculation.
sessionInfo()
## R version 3.4.0 (2017-04-21)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: OS X El Capitan 10.11.6
##
## Matrix products: default
## BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] parallel stats4 stats graphics grDevices datasets utils
## [8] methods base
##
## other attached packages:
## [1] DESeq2_1.16.0 SummarizedExperiment_1.6.1
## [3] DelayedArray_0.2.0 matrixStats_0.52.2
## [5] Biobase_2.36.1 GenomicRanges_1.28.1
## [7] GenomeInfoDb_1.12.0 IRanges_2.10.0
## [9] S4Vectors_0.14.0 BiocGenerics_0.22.0
## [11] rmarkdown_1.5 BiocInstaller_1.26.0
## [13] testthat_1.0.2 devtools_1.12.0
##
## loaded via a namespace (and not attached):
## [1] locfit_1.5-9.1 Rcpp_0.12.10
## [3] lattice_0.20-35 rprojroot_1.2
## [5] digest_0.6.12 R6_2.2.0
## [7] plyr_1.8.4 backports_1.0.5
## [9] acepack_1.4.1 RSQLite_1.1-2
## [11] evaluate_0.10 ggplot2_2.2.1
## [13] zlibbioc_1.22.0 lazyeval_0.2.0
## [15] data.table_1.10.4 annotate_1.54.0
## [17] rpart_4.1-11 Matrix_1.2-10
## [19] checkmate_1.8.2 splines_3.4.0
## [21] BiocParallel_1.10.1 geneplotter_1.54.0
## [23] stringr_1.2.0 foreign_0.8-68
## [25] htmlwidgets_0.8 RCurl_1.95-4.8
## [27] munsell_0.4.3 compiler_3.4.0
## [29] base64enc_0.1-3 htmltools_0.3.6
## [31] nnet_7.3-12 tibble_1.3.0
## [33] gridExtra_2.2.1 htmlTable_1.9
## [35] GenomeInfoDbData_0.99.0 Hmisc_4.0-3
## [37] XML_3.98-1.6 crayon_1.3.2
## [39] withr_1.0.2 bitops_1.0-6
## [41] grid_3.4.0 xtable_1.8-2
## [43] gtable_0.2.0 DBI_0.6-1
## [45] magrittr_1.5 scales_0.4.1
## [47] stringi_1.1.5 XVector_0.16.0
## [49] genefilter_1.58.0 latticeExtra_0.6-28
## [51] Formula_1.2-1 RColorBrewer_1.1-2
## [53] tools_3.4.0 rsconnect_0.8
## [55] survival_2.41-3 yaml_2.1.14
## [57] AnnotationDbi_1.38.0 colorspace_1.3-2
## [59] cluster_2.0.6 memoise_1.1.0
## [61] knitr_1.15.1