Microbiome is a set of the small life around us which can not see by eyes directly. It is important for our human health, enviroment protection or industrial and agricultural production and so on. With the rapid development of next generation sequencing technology, we have shed much light on microbiome for our phenotypes or disease, such as obesity, type II diabetes, inflammation. Meanwhile, many different data source have been collected by different studies including HMP, MetaHIT.
Although there are much findings in previous studies, efficient statistical methods are inadequate developed to extract meaningful information for microbiome. Microbiome data has several characteristics: non-negative integer numbers or compostional data, sparisity, high dimension, over-dispersion, phylogenetic relationships.There are two types of methods for microbiome data : based on count data directly and compostional data or other transform.
| Distribution | parameters | correlation | reference |
|---|---|---|---|
| Multinomial (MN) | p | negative | |
| Dirichlet Multinomial (DM) | p+1 | negative | |
| Negative Multinomial (NegMN) | p | positive | |
| Generalized Dirichlet Multinomial (GDM) | 2p+1 | both | |
| Dirichlet Tree Multinomial (DTM) | p | both | Wang and Zhao 2017 |
We incroporate phylogenetic relationship of this OTU to do log ratio transfrom adptively to build efficient test methods, called Log Ratio Tree Test (LRTT). It keeps the advantage of ANCOM while reduce the computational complexity from \(O(p^2)\) down to \(O(p)\) (p is the OTU numbers). At the same way, we extend DTM to ZIGDTM and use the posterior mean to smooth zero.
For DM distribution, it can prove that the prosterior distribution of the prior is also one dirichlet distribution which is conjugate. And the posterior mean:
\[ E_q(P_i) =\frac{a_i+y_i}{\sum_i{(a_i + y_i)}} \]
So we would like to smooth the count data by adding the correspond parameters. It is obviously that when using Jeffrey’s uninformative prior, it will degerate to adding 0.5.
For ZIDM distribution, the posterior mean is much more complicated and here we show binary case as follow:
\[ E_q(P^l) =\frac{\pi^lI(y^l=0) + (1-\pi^l)DFW}{\pi^lI(y^l=0) + (1-\pi^l)DF}, \quad W=\frac{y^l+a^l}{y^l+y^r+a^l+a^r}\\ D=\frac{\Gamma(y^l+y^r+1)}{\Gamma(y^l+1)\Gamma(y^r+1)B(a^l, a^r)},\quad F=B(y^r+a^r, y^l+a^l) \]
Consider whether y=0 or not, we can get results as follow and using the numerator to smooth the count data.
\[ y^l \neq 0,\quad E_q(P^l)=W \\ y^l=0, y^r=N, \quad DF=\frac{B(N+a^r,a^l+a^r)}{B(N+a^r+a^l, a^r)}<1, \quad W=\frac{a^l}{N+a^l+a^r}\\ E_q(P^l)= W+ \frac{1-W}{1+(\frac{1}{\pi^l}-1)DF} =\frac{a^l+\frac{N+a^r}{1+(1/\pi^l-1)DF}}{N+a^l+a^r} \]
It will degerate to DM smooth when using ZIDM estimate it. For DM, \(\pi^l=0\), term \(\frac{N+a^r}{1+(1/\pi^l-1)DF}\) will be 0. So we can use likelihood Ratio test to check which one to use. If one sample in all zero and it will be removed.
To explore our methods effectiveness, we simulated different data for case-control studies. We compare power and fdr of LRTT, ANCOM, t.test (log relative data), wilcox test, ZIG (CSS).
Give one tree (binary), we can simulate MN, DM, ZIGDM on each subtree with given depth. To set differential OTU, we simulate two groups with different parameters. Once the internal nodes is differential, all his offspring are differential one. We also explore degeneration situation for one small tree though it is small probability events.
parameter setting and tree structure
## List of 4
## $ parameters :'data.frame': 38 obs. of 8 variables:
## ..$ edge.1: int [1:38] 21 22 23 24 25 25 24 26 26 23 ...
## ..$ edge.2: int [1:38] 22 23 24 25 1 2 26 3 4 27 ...
## ..$ a1 : num [1:38] 0.322 0.343 0.457 0.695 0.538 ...
## ..$ a2 : num [1:38] 0.322 0.343 0.457 0.695 0.538 ...
## ..$ pi : num [1:38] 0 0 0 0 0 0 0 0 0 0 ...
## ..$ theta : num [1:38] 0 0 0 0 0 0 0 0 0 0 ...
## ..$ Cp1 : num [1:38] 0.3218 0.1105 0.0505 0.0351 0.0189 ...
## ..$ Cp2 : num [1:38] 0.3218 0.1105 0.0505 0.0351 0.0189 ...
## $ diff.otu : num [1:8] 6 7 8 9 10 11 19 20
## $ diff.taxa : num [1:3] 30 34 39
## $ degeneration: num(0)
| edge.1 | edge.2 | a1 | a2 | pi | theta | Cp1 | Cp2 | |
|---|---|---|---|---|---|---|---|---|
| 15 | 30 | 6 | 0.5916392 | 0.4083608 | 0 | 0 | 0.0021903 | 0.0015118 |
| 18 | 32 | 7 | 0.6458318 | 0.6458318 | 0 | 0 | 0.0004095 | 0.0005933 |
| 19 | 32 | 8 | 0.3541682 | 0.3541682 | 0 | 0 | 0.0002246 | 0.0003253 |
| 21 | 33 | 9 | 0.8545802 | 0.8545802 | 0 | 0 | 0.0007501 | 0.0010868 |
| 23 | 34 | 10 | 0.6690358 | 0.3309642 | 0 | 0 | 0.0000854 | 0.0000612 |
| 24 | 34 | 11 | 0.3309642 | 0.6690358 | 0 | 0 | 0.0000422 | 0.0001237 |
| 37 | 39 | 19 | 0.8679595 | 0.1320405 | 0 | 0 | 0.0670754 | 0.0102040 |
| 38 | 39 | 20 | 0.1320405 | 0.8679595 | 0 | 0 | 0.0102040 | 0.0670754 |
DTM using NR and ZIGDTM using EM eatimte on each nodes respectively. Do LR test between MN, DM, ZIGDM for MN(0.5) vs DM (a) and DM vs ZIGDM (f(a)).
## [1] 8.897692 7.916799 3.510668 5.762473 9.666163
## a
## 6->1 8.897692 9.380067 9.407711 -8.299961 7.611532
## 6->2 7.916799 8.608280 8.633248 -11.039870 7.427533
## 6->3 3.510668 3.695227 3.716450 -2.256591 2.842322
## 6->4 5.762473 5.930395 5.945079 -10.963638 5.506855
## 6->5 9.666163 9.994082 10.019246 -27.295086 8.784431
| Methods | Transformation | zero |
|---|---|---|
| t.test | log relative | +1 |
| wilcox | count | 0 |
| ZIG | CSS | 0 |
| ANCOM | log ratio | +1 |
| LRTT+1 | pesudo count | +1 |
| LRTT+0.5 | non-information prior | +0.5 |
| LRTT+a | DM prior | +a |
| LRTT+a+ | ZIDM prior | +f(a, y, pi) |
compare t.test, wilcox, ZIG, ANCOM, LRTT_T1, LRTT_T2, LRTT_T3 (+1/0.5)
| edge.1 | edge.2 | a1 | a2 | pi | theta | Cp1 | Cp2 | |
|---|---|---|---|---|---|---|---|---|
| 15 | 30 | 6 | 0.5916392 | 0.4083608 | 0 | 0 | 0.0021903 | 0.0015118 |
| 18 | 32 | 7 | 0.6458318 | 0.6458318 | 0 | 0 | 0.0004095 | 0.0005933 |
| 19 | 32 | 8 | 0.3541682 | 0.3541682 | 0 | 0 | 0.0002246 | 0.0003253 |
| 21 | 33 | 9 | 0.8545802 | 0.8545802 | 0 | 0 | 0.0007501 | 0.0010868 |
| 23 | 34 | 10 | 0.6690358 | 0.3309642 | 0 | 0 | 0.0000854 | 0.0000612 |
| 24 | 34 | 11 | 0.3309642 | 0.6690358 | 0 | 0 | 0.0000422 | 0.0001237 |
| 37 | 39 | 19 | 0.8679595 | 0.1320405 | 0 | 0 | 0.0670754 | 0.0102040 |
| 38 | 39 | 20 | 0.1320405 | 0.8679595 | 0 | 0 | 0.0102040 | 0.0670754 |
| POWER | SD_POWER | FDR | SD_FDR | |
|---|---|---|---|---|
| t.test | 0.83000 | 0.0932034 | 0.0256349 | 0.0602163 |
| wilcox | 0.76500 | 0.0408248 | 0.6552004 | 0.0164815 |
| ZIG | 0.78750 | 0.0652791 | 0.0581151 | 0.0759042 |
| ANCOM | 0.89125 | 0.0917269 | 0.0161905 | 0.0450960 |
| LRTT_T1+1 | 0.92250 | 0.0919417 | 0.0277341 | 0.0621652 |
| LRTT_T2+1 | 0.92625 | 0.0990724 | 0.0202341 | 0.0509816 |
| LRTT_T3+1 | 1.00000 | 0.0000000 | 0.0392929 | 0.0820371 |
| LRTT_T4+1 | 0.92250 | 0.0919417 | 0.0266230 | 0.0616508 |
| LRTT_T1+0.5 | 0.95375 | 0.0787669 | 0.0234008 | 0.0520611 |
| LRTT_T2+0.5 | 0.96000 | 0.0771984 | 0.0205000 | 0.0512802 |
| LRTT_T3+0.5 | 0.99875 | 0.0125000 | 0.0320606 | 0.0720419 |
| LRTT_T4+0.5 | 0.95375 | 0.0787669 | 0.0254008 | 0.0549164 |
compare t.test, wilcox, ZIG, ANCOM, LRTT_T1+1, LRTT_T1+0.5, LRTT_T1+a, LRTT_T2+1, LRTT_T2+0.5, LRTT_T2+a, LRTT_T3+1, LRTT_T3+0.5, LRTT_T3+a, LRTT_T4+1, LRTT_T4+0.5, LRTT_T4+a
| edge.1 | edge.2 | a1 | a2 | pi | theta | Cp1 | Cp2 | |
|---|---|---|---|---|---|---|---|---|
| 15 | 30 | 6 | 4.382876 | 4.617124 | 0 | 0.1 | 27441.6516 | 28908.3015 |
| 18 | 32 | 7 | 7.374891 | 7.374891 | 0 | 0.1 | 563630.0209 | 535034.5019 |
| 19 | 32 | 8 | 1.625109 | 1.625109 | 0 | 0.1 | 124199.8000 | 117898.5782 |
| 21 | 33 | 9 | 7.709279 | 7.709279 | 0 | 0.1 | 1416573.6667 | 1344704.4304 |
| 23 | 34 | 10 | 1.291536 | 7.708464 | 0 | 0.1 | 306312.1509 | 1735454.9048 |
| 24 | 34 | 11 | 7.708464 | 1.291536 | 0 | 0.1 | 1828208.2384 | 290771.5399 |
| 37 | 39 | 19 | 2.226938 | 6.773062 | 0 | 0.1 | 66.8599 | 203.3492 |
| 38 | 39 | 20 | 6.773062 | 2.226938 | 0 | 0.1 | 203.3492 | 66.8599 |
| POWER | SD_POWER | FDR | SD_FDR | |
|---|---|---|---|---|
| t.test | 0.50750 | 0.0496833 | 0.0185238 | 0.0598902 |
| wilcox | 0.50625 | 0.0411905 | 0.0000000 | 0.0000000 |
| ZIG | 0.26375 | 0.0612759 | 0.0033333 | 0.0333333 |
| ANCOM | 0.49625 | 0.0278377 | 0.0020000 | 0.0200000 |
| LRTT_T1+1 | 0.51125 | 0.0591795 | 0.0296111 | 0.0830797 |
| LRTT_T1+0.5 | 0.51500 | 0.0694495 | 0.0268030 | 0.0774508 |
| LRTT_T1+a | 0.57500 | 0.1080708 | 0.0308571 | 0.0857014 |
| LRTT_T2+1 | 0.50875 | 0.1039847 | 0.0301944 | 0.0745635 |
| LRTT_T2+0.5 | 0.51250 | 0.1058575 | 0.0290278 | 0.0718005 |
| LRTT_T2+a | 0.61000 | 0.1520898 | 0.0352857 | 0.0658081 |
| LRTT_T3+1 | 0.53625 | 0.0945066 | 0.0393013 | 0.1153118 |
| LRTT_T3+0.5 | 0.54000 | 0.1049290 | 0.0355833 | 0.1097828 |
| LRTT_T3+a | 0.64625 | 0.1786020 | 0.0836504 | 0.1513868 |
| LRTT_T4+1 | 0.51375 | 0.0708333 | 0.0423611 | 0.0888682 |
| LRTT_T4+0.5 | 0.51500 | 0.0716860 | 0.0453611 | 0.0929010 |
| LRTT_T4+a | 0.58375 | 0.1179248 | 0.0420635 | 0.0870441 |
compare t.test, wilcox, ZIG, ANCOM, LRTT_T1+1, LRTT_T1+0.5, LRTT_T1+a, LRTT_T1+a+, LRTT_T2+1, LRTT_T2+0.5, LRTT_T2+a,LRTT_T2+a+, LRTT_T3+1, LRTT_T3+0.5, LRTT_T3+a, LRTT_T3+a+, LRTT_T4+1, LRTT_T4+0.5, LRTT_T4+a, LRTT_T4+a+
| edge.1 | edge.2 | a1 | a2 | pi | theta | Cp1 | Cp2 | |
|---|---|---|---|---|---|---|---|---|
| 15 | 30 | 6 | 4.382876 | 4.617124 | 0 | 0.1 | 27441.6516 | 28908.3015 |
| 18 | 32 | 7 | 7.374891 | 7.374891 | 0 | 0.1 | 563630.0209 | 535034.5019 |
| 19 | 32 | 8 | 1.625109 | 1.625109 | 0 | 0.1 | 124199.8000 | 117898.5782 |
| 21 | 33 | 9 | 7.709279 | 7.709279 | 0 | 0.1 | 1416573.6667 | 1344704.4304 |
| 23 | 34 | 10 | 1.291536 | 7.708464 | 0 | 0.1 | 306312.1509 | 1735454.9048 |
| 24 | 34 | 11 | 7.708464 | 1.291536 | 0 | 0.1 | 1828208.2384 | 290771.5399 |
| 37 | 39 | 19 | 2.226938 | 6.773062 | 0 | 0.1 | 66.8599 | 203.3492 |
| 38 | 39 | 20 | 6.773062 | 2.226938 | 0 | 0.1 | 203.3492 | 66.8599 |
| POWER | SD_POWER | FDR | SD_FDR | |
|---|---|---|---|---|
| t.test | 0.37500 | 0.0852062 | 0.0267857 | 0.0843529 |
| wilcox | 0.46875 | 0.0761590 | 0.0168690 | 0.0668145 |
| ZIG | 0.34875 | 0.0541667 | 0.0281667 | 0.1000195 |
| ANCOM | 0.26625 | 0.0656528 | 0.0050000 | 0.0351763 |
| LRTT_T1+1 | 0.43250 | 0.0895767 | 0.0146905 | 0.0597691 |
| LRTT_T1+0.5 | 0.42750 | 0.0836735 | 0.0178333 | 0.0682988 |
| LRTT_T1+a | 0.44875 | 0.1067409 | 0.0401667 | 0.0916887 |
| LRTT_T1+a+ | 0.51625 | 0.0967512 | 0.0404048 | 0.0920196 |
| LRTT_T2+1 | 0.45250 | 0.0902088 | 0.0136190 | 0.0632645 |
| LRTT_T2+0.5 | 0.44375 | 0.0877162 | 0.0136190 | 0.0632645 |
| LRTT_T2+a | 0.47250 | 0.1348166 | 0.0363333 | 0.0824151 |
| LRTT_T2+a+ | 0.57125 | 0.1234180 | 0.0269365 | 0.0683579 |
| LRTT_T3+1 | 0.52250 | 0.0877252 | 0.0281319 | 0.0890917 |
| LRTT_T3+0.5 | 0.52250 | 0.0946485 | 0.0262857 | 0.0833584 |
| LRTT_T3+a | 0.54875 | 0.1279883 | 0.0581786 | 0.1167283 |
| LRTT_T3+a+ | 0.71375 | 0.1796944 | 0.1108330 | 0.1329043 |
| LRTT_T4+1 | 0.44250 | 0.0782946 | 0.0220238 | 0.0761050 |
| LRTT_T4+0.5 | 0.43750 | 0.0699477 | 0.0221667 | 0.0752868 |
| LRTT_T4+a | 0.45750 | 0.1008487 | 0.0455238 | 0.0976694 |
| LRTT_T4+a+ | 0.51000 | 0.0882633 | 0.0454762 | 0.0991064 |