Following previous notation, if we call \(l(\theta;D^+)\) the -log-likelihood function of the complete phylogenetic tree we have a function that looks like this
\[ l(\theta;D^+) = \sum n_i t_i - log(\rho_i) \]
If we focuss only on the influence of extinctions, and we assume constant extinction rate \(\mu\), we can write this equation on the following way
\[ l(\theta;D^+) = \sum (n_i t_i-E_i log(\lambda_i)) \color{blue}{ - (\#extinctions \cdot log(\mu)) } \]
Then, if \(\mu < 1\) we would have that \(l(\theta;D^+)\) increases with the number of extinctions the tree has (note that we want to minimaze \(l(\theta;D^+)\)).
This show us two anoying facts:
Trees with less missing species have larger likelihood (if \(\mu\)<1)
The model is time scale-dependent (the inflextion point \(\mu\)<1 as different meanings with different time-scales).
The following are just simples exercises that helps on understanding the relationship beween the number of extinctions and the likelihood of a tree, in the context of importance sampling.
We would like to see the relationship between the importance weight and the extinctions on a tree. To do that, given an observed tree we reconstruct 100 (extinct included) complete trees.
num_miss = NULL
imp_scal = NULL
g = NULL
f = NULL
st = sim_phyl()
extant = phylo2p(drop.fossil(st$newick))
for(i in 1:100){
rec = rec_tree(wt=extant$t)
imp_scal[i] = rec$weight
num_miss[i] = (length(rec$wt)-length(extant$t))/2
g[i] = rec$prob
f[i] = rec$f_n
}
plot(num_miss,log(imp_scal))
imp_scal
## [1] 9.060714e+54 6.802360e+47 4.883689e+85 2.269092e+91 4.183298e+91
## [6] 2.946904e+86 2.330693e+66 2.386012e+83 3.679027e+52 1.975986e+53
## [11] 3.762526e+81 4.085102e+63 4.034515e+97 8.376088e+85 4.999974e+60
## [16] 3.386231e+63 3.396263e+61 2.746363e+66 5.159520e+68 3.278703e+63
## [21] 1.432248e+56 1.358603e+87 1.924156e+35 1.274405e+48 1.456130e+62
## [26] 3.199462e+67 6.996200e+95 1.313242e+63 4.410979e+75 2.781370e+60
## [31] 2.196557e+50 9.251947e+54 1.582399e+90 1.448022e+49 1.974348e+32
## [36] 1.078424e+85 1.098909e+85 3.254929e+62 5.145122e+35 6.195268e+62
## [41] 3.870969e+41 7.531817e+58 3.693287e+72 7.067143e+72 1.268280e+49
## [46] 3.613428e+56 9.384767e+70 1.580923e+55 5.751134e+78 4.785498e+57
## [51] 4.450246e+81 8.254196e+46 5.436612e+48 1.233795e+105 1.816817e+47
## [56] 2.376060e+54 4.928635e+60 5.075513e+27 4.611850e+03 1.620189e+47
## [61] 3.068153e+22 3.086737e+62 1.938424e+81 2.836441e+58 4.326288e+90
## [66] 6.138280e+81 7.519302e+100 5.036478e+42 7.980758e+67 2.630932e+26
## [71] 1.753466e+88 1.592166e+71 8.282988e+55 8.726683e+41 8.130268e+73
## [76] 2.742928e+67 6.609907e+60 9.843670e+95 5.307920e+65 1.505398e+77
## [81] 3.254044e+46 2.714390e+59 1.309315e+67 1.260203e+52 8.487946e+25
## [86] 0.000000e+00 8.832565e+67 4.029104e+76 6.832198e+71 2.819579e+29
## [91] 1.125884e+64 6.014144e+26 5.423067e+76 8.369506e+63 1.195083e+81
## [96] 6.580560e+69 5.478085e+25 1.228957e+61 4.938990e+112 3.086285e+67
#lm(log(imp_scal) ~ num_miss)
Moreover, the number of real missing species is
number_missing(st)
## [1] 32
and the set number of the simulated species are
summary(num_miss)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 24.00 32.00 35.00 35.46 39.00 47.00
the relationship between between f ang g as described here
plot(log(f),log(g))
the relationship between f and number of species
plot(num_miss,log(f))
the relationship between g and number of species
plot(num_miss,log(g))
The plots above are interesting, let’s go one step back and see the relationship between the likeihood of a tree and the number of species (extinct and extant)
We simulate trees, count number of extinct species and the corresponding -log-likelihood function
n_extinct = NULL
ll = NULL
sum_spec = NULL
sum_ext = NULL
for(i in 1:100){
s = sim_phyl(seed=TRUE,seed_val = i)
ll[i] = llik(b=c(0.8,0.0175,0.1),n=s$n,E=s$E,t=s$t)
n_extinct[i] = number_missing(s)
sum_spec[i] = sum(s$E)
sum_ext[i] = sum(1-s$E)
}
plot(n_extinct,ll)
lm(ll~n_extinct)
##
## Call:
## lm(formula = ll ~ n_extinct)
##
## Coefficients:
## (Intercept) n_extinct
## 91.938 5.527
(Ernst:) This may be the important missing piece of the puzzle: adding an additional event reduces the likelihood. We need to compare the same number of events (or average the likelihood by the number of events)
the -log-likelihood:
plot(ll/n_extinct)
the likelihood:
plot(exp(ll),exp(n_extinct))
summary(exp(ll))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.211e+63 6.415e+116 1.063e+133 1.183e+178 2.511e+143 1.183e+180