On this report we denote \(D\) as the observed-part or extant-species, \(+_i\) as the missing-part or extinct-species of the tree and \(D^+\) is then the complete phylogenetic tree.
The EM algorithms consists on two steps. First, we calculate the conditional expectation:
\[ Q(\theta|\theta^*) = E_{\theta^* } [log P(D^+|\theta) | D] \]
and then we perform the maximization:
\[ \theta ^{**} = argmax_{(\theta)} Q( \theta | \theta ^*) \]
Given that the calculation of the conditional expectation is really hard (if not impossible), we use an approximation, sampling complete-phylogenies under a montecarlo approach. This simulations should be sampled from (real density)
\[ f_{\theta^*} (+_i | D) \]
But instead we sample it from
\[ g_{\theta^*}(+_i | D) \]
To correct this we re-weigh the approximation of the expectation by importance scaling:
\[w_i = \frac{f_{\theta^*} (+_i | D)}{g_{\theta^*}(+_i | D)} = \frac{f_i}{g_i}\]
Thus, the montecarlo re-weighted approximation has the form
\[ E_{\theta^* } [log P(D^+|\theta) | D] \approx \frac{1}{N} \sum^{N}_{i=1} log P(D_i^+ | \theta) \frac{f_i}{g_i}\]
\(f_i\) is the density function (likelihood) of the complete phylogenetic tree.
to calculate \(g_i\) we have 2 ways so far. The first one is related with the diagram above
we multiply the corresponding probabilities to have every outcome, this is done on the piece of code bellow
if (t_spe < cwt){
t_ext = rexp(1,mu0)
t_ext = cbt + t_spe + t_ext
if (t_ext < ct){
up = update_tree(wt=wt,t_spe = (cbt + t_spe), t_ext = t_ext, E = E, n = n)
E = up$E
n = up$n
wt = up$wt
fake = FALSE
prob[i] = dexp(t_ext,rate=mu0,log=TRUE) + dexp(t_spe,rate=s,log = TRUE)
}else{
prob[i] = pexp(q = ct,rate = mu0,lower.tail = F,log.p = TRUE) + dexp(t_spe,rate=s, log = TRUE)
fake = TRUE
i = i-1
}
}else{
fake = FALSE
prob[i] = pexp(q = cwt,rate = s,lower.tail = F,log.p = TRUE)
}
Note that we are working with multiplication of many densities, which in the end are very small values. To avoid cutting most numbers to zero we consider the following fact:
\[ w_i = \frac{f_i}{g_i} = e^{log(f_i) - log(g_i)} = e^{loglik - \sum_j log(g_{i,j})} \]
and that is why on the code above the log of the densities is calculated. Considering that we calculate the weight in the following way
f_n = -llik(b=pars,n=n,E=E,t=wt)
weight = exp(f_n-sum(prob))
return(list(wt=wt,E=E,n=n,weight=weight,L=L,g=prob,f_n=f_n))
another way to calculate \(g\) is considering the probabilities of not having a new species on the interval \(\Delta t_i\)
\[ \int_0^{\Delta t_i} s_{\lambda_i}e^{-s_{\lambda_i} t}[1-e^{-\mu (r_i-t)}] dt = s_{\lambda_i}\int_0^{\Delta t_i}e^{-s_{\lambda_i} t}-e^{ t(\mu-s_{\lambda_i})-\mu r_i} = s_{\lambda_i}\int_0^{\Delta t_i}e^{-s_{\lambda_i} t}dt - s_{\lambda_i} e^{-\mu r_i} \int_0^{\Delta t_i}e^{ t(\mu-s_{\lambda_i})} dt \] \[= 1-e^{-s_{\lambda_i} \Delta t} -\frac{s_{\lambda_i} e^{-\mu r_i}}{\mu - s_{\lambda_i}}[e^{\Delta t(\mu-s_{\lambda_i})}-1] \] so we do it with this function
convol <-function(wt,lambda,mu,remt){
out = 1-exp(-lambda*wt)-lambda*exp(-mu*remt)/(mu-lambda)*(exp(wt*(mu-lambda))-1)
return(out)
}
and we add this option to the code
if (t_spe < cwt){
t_ext = rexp(1,mu0)
t_ext = cbt + t_spe + t_ext
if (t_ext < ct){
up = update_tree(wt=wt,t_spe = (cbt + t_spe), t_ext = t_ext, E = E, n = n)
E = up$E
n = up$n
wt = up$wt
fake = FALSE
prob[i] = dexp(t_ext,rate=mu0,log=TRUE) + dexp(t_spe,rate=s,log = TRUE)
}else{
prob[i] = pexp(q = ct,rate = mu0,lower.tail = F,log.p = TRUE) + dexp(t_spe,rate=s,log = TRUE)
if(v2){ prob[i] = log(convol(wt = t_spe,lambda = s,mu = mu,remt = ct-cbt))}
fake = TRUE
i = i-1
}
}else{
fake = FALSE
prob[i] = pexp(q = cwt,rate = s,lower.tail = F,log.p = TRUE)
if(v2){prob[i] = log(convol(wt = t_spe,lambda = s,mu = mu,remt = ct-cbt))}
}
To have a better idea on how this implementation look like we do some simple observations.
We start simulating a whole tree
s <- sim_phyl()
Them we drop extinct species
s2 <- drop.fossil(s$newick)
we transform phylo format into branching-times, number of species, and topology vectors
s2 <- phylo2p(s2)
Now we simulate a complete tree (extinct+extans species) based on the observed tree (extant species only)
s3 <- rec_tree(wt=s2$wt)
and we observe the calculated weight for this tree
s3$weight
## [1] -133.4394
it seems very small
This is a random process, if we do it again we end with another complete-tree and its corresponding weight, we do it for 30 iterations just to have an idea of the variability of the weight
l = proc.time()
w=0
nLTT=0
n_it = 100000
st=sim_srt(wt=s2$wt,pars=c(0.8,0.0175,0.1),n_trees = n_it,parallel = TRUE)
for(i in 1:n_it){
st3 = st[[i]]
w[i] = st3$weight
rec = p2phylo(st3)
nLTT[i] = nLTTstat(s$newick, rec, distance_method = "abs")
}
print(proc.time()-l)
## user system elapsed
## 8389.112 18.380 11439.335
su=summary(w)
su
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -Inf -194.90 -175.60 -Inf -156.80 -87.96
boxplot(w)
q3 = w>su[5]
boxplot(w[q3])
They seems very small to be the logarithm of the weight, we centreate it to avoid numerical issues
w = w - max(w)
boxplot(w)
## Warning in bplt(at[i], wid = width[i], stats = z$stats[, i], out = z$out[z
## $group == : Outlier (-Inf) in boxplot 1 is not drawn
points(rep(1,length(w)),w)
now we would like to check if trees with larger weight are really ‘’better trees’’. To do that we check the nLTT statistic
qplot(w,nLTT)
qplot(w[q3],nLTT[q3])
we can see that there is not relationship between the weight and the nLTT statistic. There is clearly something wrong about it.
Now let´s check the real weights, that is
w = exp(w)
qplot(w,nLTT)
now the same but with the second way
s3 <- rec_tree(wt=s2$wt,v2=T)
and we observe the calculated weight for this tree
s3$weight
## [1] -97.30386
it seems very small
This is a random process, if we do it again we end with another complete-tree and its corresponding weight, we do it for 30 iterations just to have an idea of the variability of the weight
w=0
for(i in 1:100){
st3 <- rec_tree(wt=s2$wt,v2=T)
w[i] = st3$weight
}
print(w)
## [1] -75.50004 -119.22524 -49.92861 -76.12820 -55.78735 -54.42933
## [7] -86.56262 -75.91960 -58.82038 -92.10284 -57.37053 -59.20558
## [13] -68.98488 -66.00951 -45.66047 -85.12248 -76.78035 -62.95812
## [19] -68.81449 -96.27697 0.00000 -65.42044 -29.51350 -62.18309
## [25] -95.82134 -81.70200 -89.30489 -50.23001 -69.89029 -83.63750
## [31] -79.33740 -64.77941 0.00000 0.00000 -54.69348 0.00000
## [37] 0.00000 -67.46212 0.00000 -75.40278 -49.18557 0.00000
## [43] -93.49133 -64.11404 -50.04962 -75.27772 0.00000 -97.35932
## [49] -90.37938 -65.92919 -44.49914 -87.49251 -53.23516 -122.27665
## [55] -75.75096 -79.75963 -85.89271 -56.37160 -67.91206 0.00000
## [61] -57.57453 -119.63715 0.00000 -89.74984 -64.70357 0.00000
## [67] -37.90212 -61.74611 -73.69774 -88.51482 -56.35048 -59.08719
## [73] -213.00931 -53.49589 0.00000 -82.24215 -79.10703 -58.35727
## [79] -102.86194 -105.45548 -102.80435 0.00000 -69.76295 -80.70787
## [85] -90.30860 0.00000 -112.11277 -86.59682 -82.75314 -49.90318
## [91] -78.99578 -67.27132 -84.41509 -48.00416 -61.36263 -73.90154
## [97] -74.37177 -54.95367 -96.00826 -70.78435
w = w - max(w)
print(w)
## [1] -75.50004 -119.22524 -49.92861 -76.12820 -55.78735 -54.42933
## [7] -86.56262 -75.91960 -58.82038 -92.10284 -57.37053 -59.20558
## [13] -68.98488 -66.00951 -45.66047 -85.12248 -76.78035 -62.95812
## [19] -68.81449 -96.27697 0.00000 -65.42044 -29.51350 -62.18309
## [25] -95.82134 -81.70200 -89.30489 -50.23001 -69.89029 -83.63750
## [31] -79.33740 -64.77941 0.00000 0.00000 -54.69348 0.00000
## [37] 0.00000 -67.46212 0.00000 -75.40278 -49.18557 0.00000
## [43] -93.49133 -64.11404 -50.04962 -75.27772 0.00000 -97.35932
## [49] -90.37938 -65.92919 -44.49914 -87.49251 -53.23516 -122.27665
## [55] -75.75096 -79.75963 -85.89271 -56.37160 -67.91206 0.00000
## [61] -57.57453 -119.63715 0.00000 -89.74984 -64.70357 0.00000
## [67] -37.90212 -61.74611 -73.69774 -88.51482 -56.35048 -59.08719
## [73] -213.00931 -53.49589 0.00000 -82.24215 -79.10703 -58.35727
## [79] -102.86194 -105.45548 -102.80435 0.00000 -69.76295 -80.70787
## [85] -90.30860 0.00000 -112.11277 -86.59682 -82.75314 -49.90318
## [91] -78.99578 -67.27132 -84.41509 -48.00416 -61.36263 -73.90154
## [97] -74.37177 -54.95367 -96.00826 -70.78435
w = exp(w)
print(w)
## [1] 1.624603e-33 1.663929e-52 2.071469e-22 8.668448e-34 5.913754e-25
## [6] 2.299555e-24 2.548769e-38 1.067916e-33 2.848619e-26 1.000559e-40
## [11] 1.214211e-25 1.937961e-26 1.097101e-30 2.149976e-29 1.478800e-20
## [16] 1.075915e-37 4.515614e-34 4.546087e-28 1.300911e-30 1.539730e-42
## [21] 1.000000e+00 3.874943e-29 1.522117e-13 9.867994e-28 2.428398e-42
## [26] 3.290573e-36 1.642014e-39 1.532440e-22 4.436429e-31 4.750000e-37
## [31] 3.501098e-35 7.356339e-29 1.000000e+00 1.000000e+00 1.765722e-24
## [36] 1.000000e+00 1.000000e+00 5.030088e-30 1.000000e+00 1.790552e-33
## [41] 4.354910e-22 1.000000e+00 2.495933e-41 1.430961e-28 1.835386e-22
## [46] 2.029087e-33 1.000000e+00 5.216569e-43 5.607072e-40 2.329779e-29
## [51] 4.723550e-20 1.005736e-38 7.590397e-24 7.869141e-54 1.264088e-33
## [56] 2.295266e-35 4.980456e-38 3.297051e-25 3.207514e-30 1.000000e+00
## [61] 9.901431e-26 1.102165e-52 1.000000e+00 1.052304e-39 7.935914e-29
## [66] 1.000000e+00 3.461938e-17 1.527580e-27 9.850976e-33 3.618264e-39
## [71] 3.367434e-25 2.181524e-26 3.099074e-93 5.848312e-24 1.000000e+00
## [76] 1.917295e-36 4.408126e-35 4.526451e-26 2.126321e-45 1.589525e-46
## [81] 2.252357e-45 1.000000e+00 5.038896e-31 8.892406e-36 6.018307e-40
## [86] 1.000000e+00 2.041946e-49 2.463076e-38 1.150188e-36 2.124828e-22
## [91] 4.926850e-35 6.087446e-30 2.182680e-37 1.419253e-21 2.241556e-27
## [96] 8.034715e-33 5.020561e-33 1.361201e-24 2.014394e-42 1.814455e-31
So, a first observation, the values of w are really small and varies a lot. That is amaking a lot of numerical inestabilities.
boxplot(w)
summary(w)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.00 0.00 0.14 0.00 1.00
we can also see how few trees has almost the whole weight and defines the outcome