1 INTRODUCTION


Microbiome is a set of the small life around us which can not see by eyes directly. It is important for our human health, enviroment protection or industrial and agricultural production and so on. With the rapid development of next generation sequencing technology, we have shed much light on microbiome for our phenotypes or disease, such as obesity, type II diabetes, inflammation. Meanwhile, many different data source have been collected by different studies including HMP, MetaHIT.

2 PREVIOUS STUDIES

Although there are much findings in previous studies, efficient statistical methods are inadequate developed to extract meaningful information for microbiome. Microbiome data has several characteristics: non-negative integer numbers or compostional data, sparisity, high dimension, over-dispersion, phylogenetic relationships.There are two types of methods for microbiome data : based on count data directly and compostional data or other transform.


  • count data
    • single variable methods such as t.test, rank test, generalized linear regression or some mixture distribution such as zero-inflated poission or negative binomial
    • multivariate methods are much more attractive
Distribution parameters correlation reference
Multinomial (MN) p negative
Dirichlet Multinomial (DM) p+1 negative
Negative Multinomial (NegMN) p positive
Generalized Dirichlet Multinomial (GDM) 2p+1 both
Dirichlet Tree Multinomial (DTM) p both Wang and Zhao 2017

  • compostional data
    • same as count data based on the transformed data
    • log ratio: center log ratio (clr), additive log ratio (alr), quantitle log ratio (qlr)
    • ANCOM takes log ratio for any other OTU for interest one to do test based on the emprical distribution of reject numbers.

  • zero
    • zero-inflated mixture: ZIP, ZIG, ZIGDM
    • adding pseudo count (such as 1)
    • emperical bayesian idea posterior mean, for example DM by adding the parameters or 0.5 for uninformative Jeffreys prior

3 METHODS

We incroporate phylogenetic relationship of this OTU to do log ratio transfrom adptively to build efficient test methods, called Log Ratio Tree Test (LRTT). It keeps the advantage of ANCOM while reduce the computational complexity from \(O(p^2)\) down to \(O(p)\) (p is the OTU numbers). At the same way, we extend DTM to ZIGDTM and use the posterior mean to smooth zero.

3.0.1 LRTT Algorithm


  • LRTT_T1 (down to up)
    • Test on each subtree on the bottom and cut the tree out until to the root
      • if differential, all his childs OTU is differential one
      • else not
    • Get p.value for each OTU
      • if differential one, log ratio test with the sum of non-differential as reference
      • if not differential, log ratio test with the other sum of non-differential as reference

  • LRTT_T2 (internal to leaves + nondifferential)
    • Test on each internal nodes the same time
    • For each OTU do log ratio test with the non-differential internal node’s childs and including his brother

  • LRTT_T3 (internal to leaves + differential )
    • Test on each internal nodes the same time
    • If the OTU is non-differential one, take the pvalue of his parents as his
    • else do log ratio test with the top differential internal node’s child and his brother
  • LRTT_T4 (internal to leaves + differential to non differential)
    • Test on each internal nodes the same time
    • Once internal nodes is differential, his offsprings are differential otherwise not
    • correct each OTU on the leaves to do log ratio test for the sum o the non-differential one or expected itselfs.

3.0.2 NR or EM estimation for DM and ZIGDM


3.0.3 Smooth

For DM distribution, it can prove that the prosterior distribution of the prior is also one dirichlet distribution which is conjugate. And the posterior mean:

\[ E_q(P_i) =\frac{a_i+y_i}{\sum_i{(a_i + y_i)}} \]

So we would like to smooth the count data by adding the correspond parameters. It is obviously that when using Jeffrey’s uninformative prior, it will degerate to adding 0.5.

For ZIDM distribution, the posterior mean is much more complicated and here we show binary case as follow:

\[ E_q(P^l) =\frac{\pi^lI(y^l=0) + (1-\pi^l)DFW}{\pi^lI(y^l=0) + (1-\pi^l)DF}, \quad W=\frac{y^l+a^l}{y^l+y^r+a^l+a^r}\\ D=\frac{\Gamma(y^l+y^r+1)}{\Gamma(y^l+1)\Gamma(y^r+1)B(a^l, a^r)},\quad F=B(y^r+a^r, y^l+a^l) \]

Consider whether y=0 or not, we can get results as follow and using the numerator to smooth the count data.

\[ y^l \neq 0,\quad E_q(P^l)=W \\ y^l=0, y^r=N, \quad DF=\frac{B(N+a^r,a^l+a^r)}{B(N+a^r+a^l, a^r)}<1, \quad W=\frac{a^l}{N+a^l+a^r}\\ E_q(P^l)= W+ \frac{1-W}{1+(\frac{1}{\pi^l}-1)DF} =\frac{a^l+\frac{N+a^r}{1+(1/\pi^l-1)DF}}{N+a^l+a^r} \]

It will degerate to DM smooth when using ZIDM estimate it. For DM, \(\pi^l=0\), term \(\frac{N+a^r}{1+(1/\pi^l-1)DF}\) will be 0. So we can use likelihood Ratio test to check which one to use. If one sample in all zero and it will be removed.

4 SIMULATION

To explore our methods effectiveness, we simulated different data for case-control studies. We compare power and fdr of LRTT, ANCOM, t.test (log relative data), wilcox test, ZIG (CSS).

Give one tree (binary), we can simulate MN, DM, ZIGDM on each subtree with given depth. To set differential OTU, we simulate two groups with different parameters. Once the internal nodes is differential, all his offspring are differential one. We also explore degeneration situation for one small tree though it is small probability events.


4.0.1 parameter setting

parameter setting and tree structure

## List of 4
##  $ parameters  :'data.frame':    38 obs. of  8 variables:
##   ..$ edge.1: int [1:38] 21 22 23 24 25 25 24 26 26 23 ...
##   ..$ edge.2: int [1:38] 22 23 24 25 1 2 26 3 4 27 ...
##   ..$ a1    : num [1:38] 0.322 0.343 0.457 0.695 0.538 ...
##   ..$ a2    : num [1:38] 0.322 0.343 0.457 0.695 0.538 ...
##   ..$ pi    : num [1:38] 0 0 0 0 0 0 0 0 0 0 ...
##   ..$ theta : num [1:38] 0 0 0 0 0 0 0 0 0 0 ...
##   ..$ Cp1   : num [1:38] 0.3218 0.1105 0.0505 0.0351 0.0189 ...
##   ..$ Cp2   : num [1:38] 0.3218 0.1105 0.0505 0.0351 0.0189 ...
##  $ diff.otu    : num [1:8] 6 7 8 9 10 11 19 20
##  $ diff.taxa   : num [1:3] 30 34 39
##  $ degeneration: num(0)
edge.1 edge.2 a1 a2 pi theta Cp1 Cp2
15 30 6 0.5916392 0.4083608 0 0 0.0021903 0.0015118
18 32 7 0.6458318 0.6458318 0 0 0.0004095 0.0005933
19 32 8 0.3541682 0.3541682 0 0 0.0002246 0.0003253
21 33 9 0.8545802 0.8545802 0 0 0.0007501 0.0010868
23 34 10 0.6690358 0.3309642 0 0 0.0000854 0.0000612
24 34 11 0.3309642 0.6690358 0 0 0.0000422 0.0001237
37 39 19 0.8679595 0.1320405 0 0 0.0670754 0.0102040
38 39 20 0.1320405 0.8679595 0 0 0.0102040 0.0670754

4.0.2 Parameters estimation

DTM using NR and ZIGDTM using EM eatimte on each nodes respectively. Do LR test between MN, DM, ZIGDM for MN(0.5) vs DM (a) and DM vs ZIGDM (f(a)).

## [1] 8.897692 7.916799 3.510668 5.762473 9.666163
##             a                                       
## 6->1 8.897692 9.380067  9.407711  -8.299961 7.611532
## 6->2 7.916799 8.608280  8.633248 -11.039870 7.427533
## 6->3 3.510668 3.695227  3.716450  -2.256591 2.842322
## 6->4 5.762473 5.930395  5.945079 -10.963638 5.506855
## 6->5 9.666163 9.994082 10.019246 -27.295086 8.784431

4.0.3 Compare with different methods for different simulate distribution

Methods Transformation zero
t.test log relative +1
wilcox count 0
ZIG CSS 0
ANCOM log ratio +1
LRTT+1 pesudo count +1
LRTT+0.5 non-information prior +0.5
LRTT+a DM prior +a
LRTT+a+ ZIDM prior +f(a, y, pi)

4.0.3.1 MN simulation

compare t.test, wilcox, ZIG, ANCOM, LRTT_T1, LRTT_T2, LRTT_T3 (+1/0.5)

edge.1 edge.2 a1 a2 pi theta Cp1 Cp2
15 30 6 0.5916392 0.4083608 0 0 0.0021903 0.0015118
18 32 7 0.6458318 0.6458318 0 0 0.0004095 0.0005933
19 32 8 0.3541682 0.3541682 0 0 0.0002246 0.0003253
21 33 9 0.8545802 0.8545802 0 0 0.0007501 0.0010868
23 34 10 0.6690358 0.3309642 0 0 0.0000854 0.0000612
24 34 11 0.3309642 0.6690358 0 0 0.0000422 0.0001237
37 39 19 0.8679595 0.1320405 0 0 0.0670754 0.0102040
38 39 20 0.1320405 0.8679595 0 0 0.0102040 0.0670754

POWER SD_POWER FDR SD_FDR
t.test 0.83000 0.0932034 0.0256349 0.0602163
wilcox 0.76500 0.0408248 0.6552004 0.0164815
ZIG 0.78750 0.0652791 0.0581151 0.0759042
ANCOM 0.89125 0.0917269 0.0161905 0.0450960
LRTT_T1+1 0.92250 0.0919417 0.0277341 0.0621652
LRTT_T2+1 0.92625 0.0990724 0.0202341 0.0509816
LRTT_T3+1 1.00000 0.0000000 0.0392929 0.0820371
LRTT_T4+1 0.92250 0.0919417 0.0266230 0.0616508
LRTT_T1+0.5 0.95375 0.0787669 0.0234008 0.0520611
LRTT_T2+0.5 0.96000 0.0771984 0.0205000 0.0512802
LRTT_T3+0.5 0.99875 0.0125000 0.0320606 0.0720419
LRTT_T4+0.5 0.95375 0.0787669 0.0254008 0.0549164

4.0.3.2 DM simulation

compare t.test, wilcox, ZIG, ANCOM, LRTT_T1+1, LRTT_T1+0.5, LRTT_T1+a, LRTT_T2+1, LRTT_T2+0.5, LRTT_T2+a, LRTT_T3+1, LRTT_T3+0.5, LRTT_T3+a, LRTT_T4+1, LRTT_T4+0.5, LRTT_T4+a

edge.1 edge.2 a1 a2 pi theta Cp1 Cp2
15 30 6 4.382876 4.617124 0 0.1 27441.6516 28908.3015
18 32 7 7.374891 7.374891 0 0.1 563630.0209 535034.5019
19 32 8 1.625109 1.625109 0 0.1 124199.8000 117898.5782
21 33 9 7.709279 7.709279 0 0.1 1416573.6667 1344704.4304
23 34 10 1.291536 7.708464 0 0.1 306312.1509 1735454.9048
24 34 11 7.708464 1.291536 0 0.1 1828208.2384 290771.5399
37 39 19 2.226938 6.773062 0 0.1 66.8599 203.3492
38 39 20 6.773062 2.226938 0 0.1 203.3492 66.8599

POWER SD_POWER FDR SD_FDR
t.test 0.50750 0.0496833 0.0185238 0.0598902
wilcox 0.50625 0.0411905 0.0000000 0.0000000
ZIG 0.26375 0.0612759 0.0033333 0.0333333
ANCOM 0.49625 0.0278377 0.0020000 0.0200000
LRTT_T1+1 0.51125 0.0591795 0.0296111 0.0830797
LRTT_T1+0.5 0.51500 0.0694495 0.0268030 0.0774508
LRTT_T1+a 0.57500 0.1080708 0.0308571 0.0857014
LRTT_T2+1 0.50875 0.1039847 0.0301944 0.0745635
LRTT_T2+0.5 0.51250 0.1058575 0.0290278 0.0718005
LRTT_T2+a 0.61000 0.1520898 0.0352857 0.0658081
LRTT_T3+1 0.53625 0.0945066 0.0393013 0.1153118
LRTT_T3+0.5 0.54000 0.1049290 0.0355833 0.1097828
LRTT_T3+a 0.64625 0.1786020 0.0836504 0.1513868
LRTT_T4+1 0.51375 0.0708333 0.0423611 0.0888682
LRTT_T4+0.5 0.51500 0.0716860 0.0453611 0.0929010
LRTT_T4+a 0.58375 0.1179248 0.0420635 0.0870441

4.0.3.3 ZIDM simulation

compare t.test, wilcox, ZIG, ANCOM, LRTT_T1+1, LRTT_T1+0.5, LRTT_T1+a, LRTT_T1+a+, LRTT_T2+1, LRTT_T2+0.5, LRTT_T2+a,LRTT_T2+a+, LRTT_T3+1, LRTT_T3+0.5, LRTT_T3+a, LRTT_T3+a+, LRTT_T4+1, LRTT_T4+0.5, LRTT_T4+a, LRTT_T4+a+

edge.1 edge.2 a1 a2 pi theta Cp1 Cp2
15 30 6 4.382876 4.617124 0 0.1 27441.6516 28908.3015
18 32 7 7.374891 7.374891 0 0.1 563630.0209 535034.5019
19 32 8 1.625109 1.625109 0 0.1 124199.8000 117898.5782
21 33 9 7.709279 7.709279 0 0.1 1416573.6667 1344704.4304
23 34 10 1.291536 7.708464 0 0.1 306312.1509 1735454.9048
24 34 11 7.708464 1.291536 0 0.1 1828208.2384 290771.5399
37 39 19 2.226938 6.773062 0 0.1 66.8599 203.3492
38 39 20 6.773062 2.226938 0 0.1 203.3492 66.8599

POWER SD_POWER FDR SD_FDR
t.test 0.37500 0.0852062 0.0267857 0.0843529
wilcox 0.46875 0.0761590 0.0168690 0.0668145
ZIG 0.34875 0.0541667 0.0281667 0.1000195
ANCOM 0.26625 0.0656528 0.0050000 0.0351763
LRTT_T1+1 0.43250 0.0895767 0.0146905 0.0597691
LRTT_T1+0.5 0.42750 0.0836735 0.0178333 0.0682988
LRTT_T1+a 0.44875 0.1067409 0.0401667 0.0916887
LRTT_T1+a+ 0.51625 0.0967512 0.0404048 0.0920196
LRTT_T2+1 0.45250 0.0902088 0.0136190 0.0632645
LRTT_T2+0.5 0.44375 0.0877162 0.0136190 0.0632645
LRTT_T2+a 0.47250 0.1348166 0.0363333 0.0824151
LRTT_T2+a+ 0.57125 0.1234180 0.0269365 0.0683579
LRTT_T3+1 0.52250 0.0877252 0.0281319 0.0890917
LRTT_T3+0.5 0.52250 0.0946485 0.0262857 0.0833584
LRTT_T3+a 0.54875 0.1279883 0.0581786 0.1167283
LRTT_T3+a+ 0.71375 0.1796944 0.1108330 0.1329043
LRTT_T4+1 0.44250 0.0782946 0.0220238 0.0761050
LRTT_T4+0.5 0.43750 0.0699477 0.0221667 0.0752868
LRTT_T4+a 0.45750 0.1008487 0.0455238 0.0976694
LRTT_T4+a+ 0.51000 0.0882633 0.0454762 0.0991064