On this report we denote \(D\) as the observed-part or extant-species, \(+_i\) as the missing-part or extinct-species of the tree and \(D^+\) is then the complete phylogenetic tree.

The EM algorithms consists on two steps. First, we calculate the conditional expectation:

\[ Q(\theta|\theta^*) = E_{\theta^* } [log P(D^+|\theta) | D] \]

and then we perform the maximization:

\[ \theta ^{**} = argmax_{(\theta)} Q( \theta | \theta ^*) \]

Given that the calculation of the conditional expectation is really hard (if not impossible), we use an approximation, sampling complete-phylogenies under a montecarlo approach. This simulations should be sampled from (real density)

\[ f_{\theta^*} (+_i | D) \]

But instead we sample it from

\[ g_{\theta^*}(+_i | D) \]

To correct this we re-weigh the approximation of the expectation by importance scaling:

\[w_i = \frac{f_{\theta^*} (+_i | D)}{g_{\theta^*}(+_i | D)} = \frac{f_i}{g_i}\]

Thus, the montecarlo re-weighted approximation has the form

\[ E_{\theta^* } [log P(D^+|\theta) | D] \approx \frac{1}{N} \sum^{N}_{i=1} log P(D_i^+ | \theta) \frac{f_i}{g_i}\]

we multiply the corresponding probabilities to have every outcome, this is done on the piece of code bellow

  if (t_spe < cwt){
      t_ext = rexp(1,mu0)
      t_ext = cbt + t_spe + t_ext
      if (t_ext < ct){
        up = update_tree(wt=wt,t_spe = (cbt + t_spe), t_ext = t_ext, E = E, n = n)
        E = up$E
        n = up$n
        wt = up$wt
        fake = FALSE
        prob[i] = dexp(t_ext,rate=mu0,log=TRUE) + dexp(t_spe,rate=s,log = TRUE)
      }else{
        prob[i] = pexp(q = ct,rate = mu0,lower.tail = F,log.p = TRUE) + dexp(t_spe,rate=s, log = TRUE)
        fake = TRUE
        i = i-1
      }
    }else{
      fake = FALSE
      prob[i] = pexp(q = cwt,rate = s,lower.tail = F,log.p = TRUE)
    }

Note that we are working with multiplication of many densities, which in the end are very small values. To avoid cutting most numbers to zero we consider the following fact:

\[ w_i = \frac{f_i}{g_i} = e^{log(f_i) - log(g_i)} = e^{loglik - \sum_j log(g_{i,j})} \]

and that is why on the code above the log of the densities is calculated. Considering that we calculate the weight in the following way

f_n = -llik(b=pars,n=n,E=E,t=wt)
logweight = f_n-sum(prob)
return(list(wt=wt,E=E,n=n,weight=logweight,L=L,g=prob,f_n=f_n))

another way to calculate \(g\) is considering the probabilities of not having a new species on the interval \(\Delta t_i\)

\[ \int_0^{\Delta t_i} s_{\lambda_i}e^{-s_{\lambda_i} t}[1-e^{-\mu (r_i-t)}] dt = s_{\lambda_i}\int_0^{\Delta t_i}e^{-s_{\lambda_i} t}-e^{ t(\mu-s_{\lambda_i})-\mu r_i} = s_{\lambda_i}\int_0^{\Delta t_i}e^{-s_{\lambda_i} t}dt - s_{\lambda_i} e^{-\mu r_i} \int_0^{\Delta t_i}e^{ t(\mu-s_{\lambda_i})} dt \] \[= 1-e^{-s_{\lambda_i} \Delta t} -\frac{s_{\lambda_i} e^{-\mu r_i}}{\mu - s_{\lambda_i}}[e^{\Delta t(\mu-s_{\lambda_i})}-1] \] so we do it with this function

convol <-function(wt,lambda,mu,remt){
  out = 1-exp(-lambda*wt)-lambda*exp(-mu*remt)/(mu-lambda)*(exp(wt*(mu-lambda))-1)
  return(out)
}

and we add this option to the code

 if (t_spe < cwt){
      t_ext = rexp(1,mu0)
      t_ext = cbt + t_spe + t_ext
      if (t_ext < ct){
        up = update_tree(wt=wt,t_spe = (cbt + t_spe), t_ext = t_ext, E = E, n = n)
        E = up$E
        n = up$n
        wt = up$wt
        fake = FALSE
        prob[i] = dexp(t_ext,rate=mu0,log=TRUE) + dexp(t_spe,rate=s,log = TRUE)
      }else{
        prob[i] = pexp(q = ct,rate = mu0,lower.tail = F,log.p = TRUE) + dexp(t_spe,rate=s,log = TRUE)
        if(v2){ prob[i] = log(convol(wt = t_spe,lambda = s,mu = mu,remt = ct-cbt))}
        fake = TRUE
        i = i-1
      }
    }else{
      fake = FALSE
      prob[i] = pexp(q = cwt,rate = s,lower.tail = F,log.p = TRUE)
      if(v2){prob[i] = log(convol(wt = t_spe,lambda = s,mu = mu,remt = ct-cbt))}
    }

Implementation

To have a better idea on how this implementation look like we do some simple observations.

We start simulating a whole tree

s <- sim_phyl()

Them we drop extinct species

s2 <- drop.fossil(s$newick)

we transform phylo format into branching-times, number of species, and topology vectors

s2 <- phylo2p(s2)

Now we simulate a complete tree (extinct+extans species) based on the observed tree (extant species only)

s3 <- rec_tree(wt=s2$wt)

and we observe the calculated weight for this tree

s3$weight
## [1] -133.4394

it seems very small

This is a random process, if we do it again we end with another complete-tree and its corresponding weight, we do it for 30 iterations just to have an idea of the variability of the weight

l = proc.time()
w=0
nLTT=0
n_it = 10000
st=sim_srt(wt=s2$wt,pars=c(0.8,0.1,40),n_trees = n_it,parallel = TRUE)
for(i in 1:n_it){
  st3 = st[[i]]
  w[i] = st3$weight 
  rec = p2phylo(st3)
  nLTT[i] = ltt_stat(s$newick, rec)
}
print(proc.time()-l)
##     user   system  elapsed 
## 1752.548    0.188 1878.265
su=summary(w)
su
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    -Inf  -194.4  -175.7    -Inf  -156.4  -102.1
#boxplot(w)
q3 = w>su[5]
#boxplot(w[q3])

They seems very small to be the logarithm of the weight, we centreate it to avoid numerical issues

w = w - max(w)
boxplot(w)
points(rep(1,length(w)),w)

now we would like to check if trees with larger weight are really ‘’better trees’’. To do that we check the nLTT statistic

qplot(w,nLTT)

qplot(w[q3],nLTT[q3])

we can see that there is not relationship between the weight and the nLTT statistic. There is clearly something wrong about it.

Now let´s check the real weights, that is

w = exp(w)
qplot(w,nLTT)

phyl = st[[which(w==1)]]
pars = c(1.6,0.02,0.5)
subplex(par = pars, fn = llik_st, setoftrees = list(phyl), impsam = F)$par
## [1] 0.64226061 0.01604241 0.12659078
subplex(par = pars, fn = llik_st, setoftrees = st, impsam = F)$par
## [1] 0.62841062 0.01368546 0.13678540
subplex(par = pars, fn = llik_st, setoftrees = st, impsam = T)$par
## [1] 0.442805789 0.008773418 0.085327450
subplex(par = pars, fn = llik_st, setoftrees = list(s), impsam = F)$par
## [1] 1.60 0.02 0.50
st[[which(w==1)]]
## $wt
##   [1] 0.9259415331 0.7390260599 0.2651301756 0.3672189970 0.7663782308
##   [6] 0.1624220017 0.1942216872 0.2648911180 0.2283570168 0.5877104264
##  [11] 0.0354189359 0.0030950031 0.0925330296 0.0043897271 0.2640140312
##  [16] 0.1321830267 0.3409955266 0.0546239057 0.0988871912 0.0897507580
##  [21] 0.0593160916 0.0852365376 0.3669789872 0.0543658942 0.1097759604
##  [26] 0.0387769969 0.0069189211 0.1249082327 0.2925504936 0.8776399417
##  [31] 0.2576620216 0.1738594519 0.2342213527 0.3147841468 0.1230877459
##  [36] 0.0985183776 0.0025746838 0.2743416301 0.0456943574 0.0597350152
##  [41] 0.0097878429 0.1212943139 0.0421904658 0.0086224360 0.0562085299
##  [46] 0.0142245213 0.0547203629 0.1108783814 0.0680931170 0.0221174474
##  [51] 0.0955028734 0.0772388520 0.0690446806 0.1235765915 0.0132075964
##  [56] 0.2082859087 0.1111278980 0.1845529220 0.0038576310 0.0790507818
##  [61] 0.0054198008 0.1387360019 0.1757739221 0.1842308443 0.1879917351
##  [66] 0.0869354559 0.1541876572 0.1335192701 0.2319830778 0.0053386393
##  [71] 0.1001301564 0.1790978741 0.0429072482 0.0405715525 0.0814097073
##  [76] 0.0509026512 0.0578426369 0.0388037901 0.0006203104 0.0400198548
##  [81] 0.0075839987 0.2142566748 0.1409950598 0.1155865368 0.0737075526
##  [86] 0.0761706844 0.0555760661 0.0541423401 0.0649064050 0.0465604025
##  [91] 0.1738110512 0.1230926284 0.0775637275 0.2506985652 0.1151191970
##  [96] 0.1155282058 0.0374325982 0.0276831724 0.1234668528 0.0855108664
## [101] 0.0108180123 0.0584998538 0.0721483568 0.1323057795 0.0643431249
## [106] 0.0543517249
## 
## $E
##   [1] 1 1 1 0 1 0 1 1 1 1 0 1 1 1 0 1 0 1 1 1 1 1 1 1 0 1 1 1 1 1 0 1 1 1 1
##  [36] 1 1 1 0 1 1 1 1 1 1 1 1 0 0 1 1 1 0 0 0 1 1 0 1 0 0 0 1 1 0 1 1 1 0 1
##  [71] 1 1 1 0 1 1 1 0 0 0 1 0 0 0 0 1 0 0 1 0 1 0 1 1 0 1 1 0 0 1 1 1 0 1 0
## 
## $n
##   [1]  1  2  3  4  3  4  3  4  5  6  7  6  7  8  9  8  9  8  9 10 11 12 13
##  [24] 14 15 14 15 16 17 18 19 18 19 20 21 22 23 24 25 24 25 26 27 28 29 30
##  [47] 31 32 31 30 31 32 33 32 31 30 31 32 31 32 31 30 29 30 31 30 31 32 33
##  [70] 32 33 34 35 36 35 36 37 38 37 36 35 36 35 34 33 32 33 32 31 32 31 32
##  [93] 31 32 33 32 33 34 33 32 33 34 35 34 35 34
## 
## $weight
## [1] -154.6177
## 
## $L
## [1] 0
## 
## $g
##   [1] -0.724549250 -4.502592057 -2.319279873 -1.072279471 -3.536208660
##   [6] -0.474272245 -2.384207530 -0.773482065 -2.856128743 -1.636106692
##  [11] -0.167974304 -1.377650018 -2.194022127 -0.023177759 -0.614984236
##  [16] -2.333391281 -0.578753196 -2.379275446 -2.437591200 -2.169332934
##  [21] -1.715312727 -0.603474687 -3.733182543 -1.304024112 -0.885068681
##  [26] -1.342744041 -1.241560731 -2.485973391 -0.200883285 -2.794990160
##  [31] -0.916118354 -1.868341439 -0.906323839 -4.378273380 -1.117944452
##  [36] -0.899472788 -0.023539046 -2.070058772 -0.414105114 -0.267331239
##  [41] -1.534905027 -1.088009996 -0.373069194 -0.074842745 -0.476788855
##  [46] -0.117352301 -1.687943030 -0.597938412 -0.543553307 -1.444012671
##  [51] -2.048566130 -2.115909969 -0.306357741 -0.917759238 -0.105429638
##  [56] -1.437671333 -0.661922267 -1.417366441 -1.587328701 -0.456211607
##  [61] -0.043263560 -1.144572016 -2.339675432 -1.804492704 -1.267801830
##  [66] -0.717217511 -1.230802974 -2.548205611 -0.198581668 -1.540820457
##  [71] -0.735205673 -0.438212520 -2.001889911 -0.248297901 -0.166866897
##  [76] -0.311524225 -0.326377079 -0.199063443 -0.003500101 -0.244921511
##  [81] -0.049769991 -0.228343484 -0.686918097 -0.743224288 -0.541197705
##  [86] -0.584990856 -0.136817608 -0.110407062 -1.850874658 -0.230923020
##  [91] -1.740509789 -0.126269055 -2.151298633 -1.572901263 -0.845262704
##  [96] -0.140195194 -0.274848852 -0.104742214 -0.906555367 -0.656723454
## [101] -0.079431255 -0.407743981 -0.473473592 -0.436907365 -0.422251757
## 
## $f_n
## [1] -270.3485
size = NULL
which(w==1)
## [1] 49
for(i in 1:1000){
  sss = st[[i]]
  size[i] = length(sss$wt)
}
hist(size)

summary(size)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    66.0   106.0   120.0   117.6   130.0   176.0
length(st[[which(w==1)]]$wt)
## [1] 106