On this report we denote \(D\) as the observed-part or extant-species, \(+_i\) as the missing-part or extinct-species of the tree and \(D^+\) is then the complete phylogenetic tree.

The EM algorithms consists on two steps. First, we calculate the conditional expectation:

\[ Q(\theta|\theta^*) = E_{\theta^* } [log P(D^+|\theta) | D] \]

and then we perform the maximization:

\[ \theta ^{**} = argmax_{(\theta)} Q( \theta | \theta ^*) \]

Given that the calculation of the conditional expectation is really hard (if not impossible), we use an approximation, sampling complete-phylogenies under a montecarlo approach. This simulations should be sampled from (real density)

\[ f_{\theta^*} (+_i | D) \]

But instead we sample it from

\[ g_{\theta^*}(+_i | D) \]

To correct this we re-weigh the approximation of the expectation by importance scaling:

\[w_i = \frac{f_{\theta^*} (+_i | D)}{g_{\theta^*}(+_i | D)} = \frac{f_i}{g_i}\]

Thus, the montecarlo re-weighted approximation has the form

\[ E_{\theta^* } [log P(D^+|\theta) | D] \approx \frac{1}{N} \sum^{N}_{i=1} log P(D_i^+ | \theta) \frac{f_i}{g_i}\]

we multiply the corresponding probabilities to have every outcome, this is done on the piece of code bellow

  if (t_spe < cwt){
      t_ext = rexp(1,mu0)
      t_ext = cbt + t_spe + t_ext
      if (t_ext < ct){
        up = update_tree(wt=wt,t_spe = (cbt + t_spe), t_ext = t_ext, E = E, n = n)
        E = up$E
        n = up$n
        wt = up$wt
        fake = FALSE
        prob[i] = dexp(t_ext,rate=mu0,log=TRUE) + dexp(t_spe,rate=s,log = TRUE)
      }else{
        prob[i] = pexp(q = ct,rate = mu0,lower.tail = F,log.p = TRUE) + dexp(t_spe,rate=s, log = TRUE)
        fake = TRUE
        i = i-1
      }
    }else{
      fake = FALSE
      prob[i] = pexp(q = cwt,rate = s,lower.tail = F,log.p = TRUE)
    }

Note that we are working with multiplication of many densities, which in the end are very small values. To avoid cutting most numbers to zero we consider the following fact:

\[ w_i = \frac{f_i}{g_i} = e^{log(f_i) - log(g_i)} = e^{loglik - \sum_j log(g_{i,j})} \]

and that is why on the code above the log of the densities is calculated. Considering that we calculate the weight in the following way

f_n = -llik(b=pars,n=n,E=E,t=wt)
weight = exp(f_n-sum(prob))
return(list(wt=wt,E=E,n=n,weight=weight,L=L,g=prob,f_n=f_n))

another way to calculate \(g\) is considering the probabilities of not having a new species on the interval \(\Delta t_i\)

\[ \int_0^{\Delta t_i} s_{\lambda_i}e^{-s_{\lambda_i} t}[1-e^{-\mu (r_i-t)}] dt = s_{\lambda_i}\int_0^{\Delta t_i}e^{-s_{\lambda_i} t}-e^{ t(\mu-s_{\lambda_i})-\mu r_i} = s_{\lambda_i}\int_0^{\Delta t_i}e^{-s_{\lambda_i} t}dt - s_{\lambda_i} e^{-\mu r_i} \int_0^{\Delta t_i}e^{ t(\mu-s_{\lambda_i})} dt \] \[= 1-e^{-s_{\lambda_i} \Delta t} -\frac{s_{\lambda_i} e^{-\mu r_i}}{\mu - s_{\lambda_i}}[e^{\Delta t(\mu-s_{\lambda_i})}-1] \] so we do it with this function

convol <-function(wt,lambda,mu,remt){
  out = 1-exp(-lambda*wt)-lambda*exp(-mu*remt)/(mu-lambda)*(exp(wt*(mu-lambda))-1)
  return(out)
}

and we add this option to the code

 if (t_spe < cwt){
      t_ext = rexp(1,mu0)
      t_ext = cbt + t_spe + t_ext
      if (t_ext < ct){
        up = update_tree(wt=wt,t_spe = (cbt + t_spe), t_ext = t_ext, E = E, n = n)
        E = up$E
        n = up$n
        wt = up$wt
        fake = FALSE
        prob[i] = dexp(t_ext,rate=mu0,log=TRUE) + dexp(t_spe,rate=s,log = TRUE)
      }else{
        prob[i] = pexp(q = ct,rate = mu0,lower.tail = F,log.p = TRUE) + dexp(t_spe,rate=s,log = TRUE)
        if(v2){ prob[i] = log(convol(wt = t_spe,lambda = s,mu = mu,remt = ct-cbt))}
        fake = TRUE
        i = i-1
      }
    }else{
      fake = FALSE
      prob[i] = pexp(q = cwt,rate = s,lower.tail = F,log.p = TRUE)
      if(v2){prob[i] = log(convol(wt = t_spe,lambda = s,mu = mu,remt = ct-cbt))}
    }

Implementation

To have a better idea on how this implementation look like we do some simple observations.

We start simulating a whole tree

s <- sim_phyl()

Them we drop extinct species

s2 <- drop.fossil(s$newick)

we transform phylo format into branching-times, number of species, and topology vectors

s2 <- phylo2p(s2)

Now we simulate a complete tree (extinct+extans species) based on the observed tree (extant species only)

s3 <- rec_tree(wt=s2$wt)

and we observe the calculated weight for this tree

s3$weight
## [1] -133.4394

it seems very small

This is a random process, if we do it again we end with another complete-tree and its corresponding weight, we do it for 30 iterations just to have an idea of the variability of the weight

l = proc.time()
w=0
nLTT=0
n_it = 100000
st=sim_srt(wt=s2$wt,pars=c(0.8,0.0175,0.1),n_trees = n_it,parallel = TRUE)
for(i in 1:n_it){
  st3 = st[[i]]
  w[i] = st3$weight 
  rec = p2phylo(st3)
  nLTT[i] = nLTTstat(s$newick, rec, distance_method = "abs")
}
print(proc.time()-l)
##      user    system   elapsed 
##  8389.112    18.380 11439.335
su=summary(w)
su
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    -Inf -194.90 -175.60    -Inf -156.80  -87.96
boxplot(w)

q3 = w>su[5]
boxplot(w[q3])

They seems very small to be the logarithm of the weight, we centreate it to avoid numerical issues

w = w - max(w)
boxplot(w)
## Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out = z$out[z
## $group == : Outlier (-Inf) in boxplot 1 is not drawn
points(rep(1,length(w)),w)

now we would like to check if trees with larger weight are really ‘’better trees’’. To do that we check the nLTT statistic

qplot(w,nLTT)

qplot(w[q3],nLTT[q3])

we can see that there is not relationship between the weight and the nLTT statistic. There is clearly something wrong about it.

Now let´s check the real weights, that is

w = exp(w)
qplot(w,nLTT)


now the same but with the second way

s3 <- rec_tree(wt=s2$wt,v2=T)

and we observe the calculated weight for this tree

s3$weight
## [1] -97.30386

it seems very small

This is a random process, if we do it again we end with another complete-tree and its corresponding weight, we do it for 30 iterations just to have an idea of the variability of the weight

w=0
for(i in 1:100){
st3 <- rec_tree(wt=s2$wt,v2=T)
w[i] = st3$weight 
}
print(w)
##   [1]  -75.50004 -119.22524  -49.92861  -76.12820  -55.78735  -54.42933
##   [7]  -86.56262  -75.91960  -58.82038  -92.10284  -57.37053  -59.20558
##  [13]  -68.98488  -66.00951  -45.66047  -85.12248  -76.78035  -62.95812
##  [19]  -68.81449  -96.27697    0.00000  -65.42044  -29.51350  -62.18309
##  [25]  -95.82134  -81.70200  -89.30489  -50.23001  -69.89029  -83.63750
##  [31]  -79.33740  -64.77941    0.00000    0.00000  -54.69348    0.00000
##  [37]    0.00000  -67.46212    0.00000  -75.40278  -49.18557    0.00000
##  [43]  -93.49133  -64.11404  -50.04962  -75.27772    0.00000  -97.35932
##  [49]  -90.37938  -65.92919  -44.49914  -87.49251  -53.23516 -122.27665
##  [55]  -75.75096  -79.75963  -85.89271  -56.37160  -67.91206    0.00000
##  [61]  -57.57453 -119.63715    0.00000  -89.74984  -64.70357    0.00000
##  [67]  -37.90212  -61.74611  -73.69774  -88.51482  -56.35048  -59.08719
##  [73] -213.00931  -53.49589    0.00000  -82.24215  -79.10703  -58.35727
##  [79] -102.86194 -105.45548 -102.80435    0.00000  -69.76295  -80.70787
##  [85]  -90.30860    0.00000 -112.11277  -86.59682  -82.75314  -49.90318
##  [91]  -78.99578  -67.27132  -84.41509  -48.00416  -61.36263  -73.90154
##  [97]  -74.37177  -54.95367  -96.00826  -70.78435
w = w - max(w)
print(w)
##   [1]  -75.50004 -119.22524  -49.92861  -76.12820  -55.78735  -54.42933
##   [7]  -86.56262  -75.91960  -58.82038  -92.10284  -57.37053  -59.20558
##  [13]  -68.98488  -66.00951  -45.66047  -85.12248  -76.78035  -62.95812
##  [19]  -68.81449  -96.27697    0.00000  -65.42044  -29.51350  -62.18309
##  [25]  -95.82134  -81.70200  -89.30489  -50.23001  -69.89029  -83.63750
##  [31]  -79.33740  -64.77941    0.00000    0.00000  -54.69348    0.00000
##  [37]    0.00000  -67.46212    0.00000  -75.40278  -49.18557    0.00000
##  [43]  -93.49133  -64.11404  -50.04962  -75.27772    0.00000  -97.35932
##  [49]  -90.37938  -65.92919  -44.49914  -87.49251  -53.23516 -122.27665
##  [55]  -75.75096  -79.75963  -85.89271  -56.37160  -67.91206    0.00000
##  [61]  -57.57453 -119.63715    0.00000  -89.74984  -64.70357    0.00000
##  [67]  -37.90212  -61.74611  -73.69774  -88.51482  -56.35048  -59.08719
##  [73] -213.00931  -53.49589    0.00000  -82.24215  -79.10703  -58.35727
##  [79] -102.86194 -105.45548 -102.80435    0.00000  -69.76295  -80.70787
##  [85]  -90.30860    0.00000 -112.11277  -86.59682  -82.75314  -49.90318
##  [91]  -78.99578  -67.27132  -84.41509  -48.00416  -61.36263  -73.90154
##  [97]  -74.37177  -54.95367  -96.00826  -70.78435
w = exp(w)
print(w)
##   [1] 1.624603e-33 1.663929e-52 2.071469e-22 8.668448e-34 5.913754e-25
##   [6] 2.299555e-24 2.548769e-38 1.067916e-33 2.848619e-26 1.000559e-40
##  [11] 1.214211e-25 1.937961e-26 1.097101e-30 2.149976e-29 1.478800e-20
##  [16] 1.075915e-37 4.515614e-34 4.546087e-28 1.300911e-30 1.539730e-42
##  [21] 1.000000e+00 3.874943e-29 1.522117e-13 9.867994e-28 2.428398e-42
##  [26] 3.290573e-36 1.642014e-39 1.532440e-22 4.436429e-31 4.750000e-37
##  [31] 3.501098e-35 7.356339e-29 1.000000e+00 1.000000e+00 1.765722e-24
##  [36] 1.000000e+00 1.000000e+00 5.030088e-30 1.000000e+00 1.790552e-33
##  [41] 4.354910e-22 1.000000e+00 2.495933e-41 1.430961e-28 1.835386e-22
##  [46] 2.029087e-33 1.000000e+00 5.216569e-43 5.607072e-40 2.329779e-29
##  [51] 4.723550e-20 1.005736e-38 7.590397e-24 7.869141e-54 1.264088e-33
##  [56] 2.295266e-35 4.980456e-38 3.297051e-25 3.207514e-30 1.000000e+00
##  [61] 9.901431e-26 1.102165e-52 1.000000e+00 1.052304e-39 7.935914e-29
##  [66] 1.000000e+00 3.461938e-17 1.527580e-27 9.850976e-33 3.618264e-39
##  [71] 3.367434e-25 2.181524e-26 3.099074e-93 5.848312e-24 1.000000e+00
##  [76] 1.917295e-36 4.408126e-35 4.526451e-26 2.126321e-45 1.589525e-46
##  [81] 2.252357e-45 1.000000e+00 5.038896e-31 8.892406e-36 6.018307e-40
##  [86] 1.000000e+00 2.041946e-49 2.463076e-38 1.150188e-36 2.124828e-22
##  [91] 4.926850e-35 6.087446e-30 2.182680e-37 1.419253e-21 2.241556e-27
##  [96] 8.034715e-33 5.020561e-33 1.361201e-24 2.014394e-42 1.814455e-31

So, a first observation, the values of w are really small and varies a lot. That is amaking a lot of numerical inestabilities.

boxplot(w)

summary(w)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    0.00    0.00    0.14    0.00    1.00

we can also see how few trees has almost the whole weight and defines the outcome