On this report we denote \(D\) as the observed-part or extant-species, \(+_i\) as the missing-part or extinct-species of the tree and \(D^+\) is then the complete phylogenetic tree.

The EM algorithms consists on two steps. First, we calculate the conditional expectation:

\[ Q(\theta|\theta^*) = E_{\theta^* } [log P(D^+|\theta) | D] \]

and then we perform the maximization:

\[ \theta ^{**} = argmax_{(\theta)} Q( \theta | \theta ^*) \]

Given that the calculation of the conditional expectation is really hard (if not impossible), we use an approximation, sampling complete-phylogenies under a montecarlo approach. This simulations should be sampled from (real density)

\[ f_{\theta^*} (+_i | D) \]

But instead we sample it from

\[ g_{\theta^*}(+_i | D) \]

To correct this we re-weigh the approximation of the expectation by importance scaling:

\[w_i = \frac{f_{\theta^*} (+_i | D)}{g_{\theta^*}(+_i | D)} = \frac{f_i}{g_i}\]

Thus, the montecarlo re-weighted approximation has the form

\[ E_{\theta^* } [log P(D^+|\theta) | D] \approx \frac{1}{N} \sum^{N}_{i=1} log P(D_i^+ | \theta) \frac{f_i}{g_i}\]

we multiply the corresponding probabilities to have every outcome, this is done on the piece of code bellow

  if (t_spe < cwt){
      t_ext = rexp(1,mu0)
      t_ext = cbt + t_spe + t_ext
      if (t_ext < ct){
        up = update_tree(wt=wt,t_spe = (cbt + t_spe), t_ext = t_ext, E = E, n = n)
        E = up$E
        n = up$n
        wt = up$wt
        fake = FALSE
        prob[i] = dexp(t_ext,rate=mu0,log=TRUE) + dexp(t_spe,rate=s,log = TRUE)
      }else{
        prob[i] = pexp(q = ct,rate = mu0,lower.tail = F,log.p = TRUE) + dexp(t_spe,rate=s, log = TRUE)
        fake = TRUE
        i = i-1
      }
    }else{
      fake = FALSE
      prob[i] = pexp(q = cwt,rate = s,lower.tail = F,log.p = TRUE)
    }

Note that we are working with multiplication of many densities, which in the end are very small values. To avoid cutting most numbers to zero we consider the following fact:

\[ w_i = \frac{f_i}{g_i} = e^{log(f_i) - log(g_i)} = e^{loglik - \sum_j log(g_{i,j})} \]

and that is why on the code above the log of the densities is calculated. Considering that we calculate the weight in the following way

f_n = -llik(b=pars,n=n,E=E,t=wt)
weight = exp(f_n-sum(prob))
return(list(wt=wt,E=E,n=n,weight=weight,L=L,g=prob,f_n=f_n))

another way to calculate \(g\) is considering the probabilities of not having a new species on the interval \(\Delta t_i\)

\[ \int_0^{\Delta t_i} s_{\lambda_i}e^{-s_{\lambda_i} t}[1-e^{-\mu (r_i-t)}] dt = s_{\lambda_i}\int_0^{\Delta t_i}e^{-s_{\lambda_i} t}-e^{ t(\mu-s_{\lambda_i})-\mu r_i} = s_{\lambda_i}\int_0^{\Delta t_i}e^{-s_{\lambda_i} t}dt - s_{\lambda_i} e^{-\mu r_i} \int_0^{\Delta t_i}e^{ t(\mu-s_{\lambda_i})} dt \] \[= 1-e^{-s_{\lambda_i} \Delta t} -\frac{s_{\lambda_i} e^{-\mu r_i}}{\mu - s_{\lambda_i}}[e^{\Delta t(\mu-s_{\lambda_i})}-1] \] so we do it with this function

convol <-function(wt,lambda,mu,remt){
  out = 1-exp(-lambda*wt)-lambda*exp(-mu*remt)/(mu-lambda)*(exp(wt*(mu-lambda))-1)
  return(out)
}

and we add this option to the code

 if (t_spe < cwt){
      t_ext = rexp(1,mu0)
      t_ext = cbt + t_spe + t_ext
      if (t_ext < ct){
        up = update_tree(wt=wt,t_spe = (cbt + t_spe), t_ext = t_ext, E = E, n = n)
        E = up$E
        n = up$n
        wt = up$wt
        fake = FALSE
        prob[i] = dexp(t_ext,rate=mu0,log=TRUE) + dexp(t_spe,rate=s,log = TRUE)
      }else{
        prob[i] = pexp(q = ct,rate = mu0,lower.tail = F,log.p = TRUE) + dexp(t_spe,rate=s,log = TRUE)
        if(v2){ prob[i] = log(convol(wt = t_spe,lambda = s,mu = mu,remt = ct-cbt))}
        fake = TRUE
        i = i-1
      }
    }else{
      fake = FALSE
      prob[i] = pexp(q = cwt,rate = s,lower.tail = F,log.p = TRUE)
      if(v2){prob[i] = log(convol(wt = t_spe,lambda = s,mu = mu,remt = ct-cbt))}
    }

Implementation

To have a better idea on how this implementation look like we do some simple observations.

We start simulating a whole tree

s <- sim_phyl()

Them we drop extinct species

s2 <- drop.fossil(s$newick)

we transform phylo format into branching-times, number of species, and topology vectors

s2 <- phylo2p(s2)

Now we simulate a complete tree (extinct+extans species) based on the observed tree (extant species only)

s3 <- rec_tree(wt=s2$wt)

and we observe the calculated weight for this tree

s3$weight
## [1] -Inf

it seems very small

This is a random process, if we do it again we end with another complete-tree and its corresponding weight, we do it for 30 iterations just to have an idea of the variability of the weight

w=0
nLTT=0
n_it = 100
st=sim_srt(wt=s2$wt,pars=c(0.8,0.0175,0.1),n_trees = n_it)
for(i in 1:n_it){
  st3 = st[[i]]
  w[i] = st3$weight 
  rec = p2phylo(st3)
  nLTT[i] = nLTTstat(s$newick, rec, distance_method = "abs")
}
su=summary(w)
su
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    -Inf  -222.4  -212.5    -Inf  -198.5  -174.5
boxplot(w)

q3 = w>su[5]
boxplot(w[q3])

They seems very small to be the logarithm of the weight, we centreate it to avoid numerical issues

w = w - max(w)
boxplot(w)
## Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out = z$out[z
## $group == : Outlier (-Inf) in boxplot 1 is not drawn
points(rep(1,length(w)),w)

now we would like to check if trees with larger weight are really ‘’better trees’’. To do that we check the nLTT statistic

qplot(w,nLTT)

qplot(w[q3],nLTT[q3])

we can see that there is not relationship between the weight and the nLTT statistic. There is clearly something wrong about it.

Now let´s check the real weights, that is

w = exp(w)
qplot(w,nLTT)


now the same but with the second way

s3 <- rec_tree(wt=s2$wt,v2=T)

and we observe the calculated weight for this tree

s3$weight
## [1] -194.5687

it seems very small

This is a random process, if we do it again we end with another complete-tree and its corresponding weight, we do it for 30 iterations just to have an idea of the variability of the weight

w=0
for(i in 1:100){
st3 <- rec_tree(wt=s2$wt,v2=T)
w[i] = st3$weight 
}
print(w)
##   [1]    0.0000    0.0000 -111.6870 -131.3811    0.0000    0.0000 -116.4134
##   [8] -103.9187 -131.2390 -110.5780    0.0000 -126.2164    0.0000 -127.8409
##  [15] -124.0819 -113.7044    0.0000    0.0000    0.0000    0.0000 -187.8608
##  [22] -120.7836    0.0000 -104.1887 -134.6352 -112.7757    0.0000 -121.1497
##  [29] -106.3640    0.0000 -124.7770 -135.5457 -127.6677      -Inf -119.5629
##  [36] -125.2948 -127.8070    0.0000 -125.0288 -118.8490 -104.3103 -112.2139
##  [43]    0.0000 -115.4659 -104.5541    0.0000 -119.6001 -123.9491    0.0000
##  [50] -129.0546 -131.0357    0.0000 -110.5199 -120.9251      -Inf    0.0000
##  [57] -231.5872    0.0000    0.0000 -127.4698 -138.7073 -121.3749 -111.1664
##  [64] -121.7523 -105.8351 -120.2086    0.0000 -105.6984 -143.6311 -186.6319
##  [71] -128.2664  -99.8520 -178.0863 -130.9410 -121.6407 -122.3238 -137.7087
##  [78] -120.0660 -123.5752 -111.7697 -201.1914 -137.8371 -133.5209 -187.5243
##  [85]    0.0000 -132.6997 -117.4974    0.0000 -181.0748      -Inf    0.0000
##  [92]    0.0000    0.0000 -196.5135      -Inf      -Inf -108.0838 -113.7509
##  [99]      -Inf -106.5920
w = w - max(w)
print(w)
##   [1]    0.0000    0.0000 -111.6870 -131.3811    0.0000    0.0000 -116.4134
##   [8] -103.9187 -131.2390 -110.5780    0.0000 -126.2164    0.0000 -127.8409
##  [15] -124.0819 -113.7044    0.0000    0.0000    0.0000    0.0000 -187.8608
##  [22] -120.7836    0.0000 -104.1887 -134.6352 -112.7757    0.0000 -121.1497
##  [29] -106.3640    0.0000 -124.7770 -135.5457 -127.6677      -Inf -119.5629
##  [36] -125.2948 -127.8070    0.0000 -125.0288 -118.8490 -104.3103 -112.2139
##  [43]    0.0000 -115.4659 -104.5541    0.0000 -119.6001 -123.9491    0.0000
##  [50] -129.0546 -131.0357    0.0000 -110.5199 -120.9251      -Inf    0.0000
##  [57] -231.5872    0.0000    0.0000 -127.4698 -138.7073 -121.3749 -111.1664
##  [64] -121.7523 -105.8351 -120.2086    0.0000 -105.6984 -143.6311 -186.6319
##  [71] -128.2664  -99.8520 -178.0863 -130.9410 -121.6407 -122.3238 -137.7087
##  [78] -120.0660 -123.5752 -111.7697 -201.1914 -137.8371 -133.5209 -187.5243
##  [85]    0.0000 -132.6997 -117.4974    0.0000 -181.0748      -Inf    0.0000
##  [92]    0.0000    0.0000 -196.5135      -Inf      -Inf -108.0838 -113.7509
##  [99]      -Inf -106.5920
w = exp(w)
print(w)
##   [1]  1.000000e+00  1.000000e+00  3.125850e-49  8.747748e-58  1.000000e+00
##   [6]  1.000000e+00  2.768883e-51  7.390368e-46  1.008427e-57  9.474798e-49
##  [11]  1.000000e+00  1.530780e-55  1.000000e+00  3.015763e-56  1.294008e-54
##  [16]  4.157152e-50  1.000000e+00  1.000000e+00  1.000000e+00  1.000000e+00
##  [21]  2.588722e-82  3.502226e-53  1.000000e+00  5.641896e-46  3.378213e-59
##  [26]  1.052274e-49  1.000000e+00  2.428484e-53  6.407605e-47  1.000000e+00
##  [31]  6.456903e-55  1.359120e-59  3.586219e-56  0.000000e+00  1.187123e-52
##  [36]  3.847463e-55  3.119747e-56  1.000000e+00  5.019619e-55  2.423942e-52
##  [41]  4.995741e-46  1.845534e-49  1.000000e+00  7.141399e-51  3.914848e-46
##  [46]  1.000000e+00  1.143792e-52  1.477681e-54  1.000000e+00  8.959484e-57
##  [51]  1.235673e-57  1.000000e+00  1.004223e-48  3.040044e-53  0.000000e+00
##  [56]  1.000000e+00 2.648124e-101  1.000000e+00  1.000000e+00  4.370842e-56
##  [61]  5.757119e-61  1.938897e-53  5.260800e-49  1.329424e-53  1.087426e-46
##  [66]  6.223931e-53  1.000000e+00  1.246675e-46  4.186119e-63  8.847096e-82
##  [71]  1.970696e-56  4.313476e-44  4.550750e-78  1.358498e-57  1.486298e-53
##  [76]  7.506445e-54  1.562729e-60  7.178102e-53  2.147716e-54  2.877661e-49
##  [81]  4.204082e-88  1.374422e-60  1.029511e-58  3.624429e-82  1.000000e+00
##  [86]  2.340148e-58  9.365684e-52  1.000000e+00  2.291980e-79  0.000000e+00
##  [91]  1.000000e+00  1.000000e+00  1.000000e+00  4.521481e-86  0.000000e+00
##  [96]  0.000000e+00  1.147589e-47  3.968524e-50  0.000000e+00  5.101265e-47

So, a first observation, the values of w are really small and varies a lot. That is amaking a lot of numerical inestabilities.

boxplot(w)

summary(w)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    0.00    0.00    0.27    1.00    1.00

we can also see how few trees has almost the whole weight and defines the outcome