December 14, 2016. updated 27 jan

Following previous notation, if we call \(l(\theta;D^+)\) the -log-likelihood function of the complete phylogenetic tree we have a function that looks like this

\[ l(\theta;D^+) = \sum n_i t_i - log(\rho_i) \]

If we focuss only on the influence of extinctions, and we assume constant extinction rate \(\mu\), we can write this equation on the following way

\[ l(\theta;D^+) = \sum (n_i t_i-E_i log(\lambda_i)) \color{blue}{ - (\#extinctions \cdot log(\mu)) } \]

Then, if \(\mu < 1\) we would have that \(l(\theta;D^+)\) increases with the number of extinctions the tree has (note that we want to minimaze \(l(\theta;D^+)\)).

This show us two anoying facts:

Trees with less missing species have larger likelihood (if \(\mu\)<1)
The model is time scale-dependent (the inflextion point \(\mu\)<1 as different meanings with different time-scales).

The following are just simples exercises that helps on understanding the relationship beween the number of extinctions and the likelihood of a tree, in the context of importance sampling.

We would like to see the relationship between the importance weight and the extinctions on a tree. To do that, given an observed tree we reconstruct 100 (extinct included) complete trees.

num_miss = NULL
imp_scal = NULL
g = NULL
f = NULL
st = sim_phyl()
extant = phylo2p(drop.fossil(st$newick))
for(i in 1:100){
  rec = rec_tree(wt=extant$t)
  imp_scal[i] = rec$weight
  num_miss[i] = (length(rec$wt)-length(extant$t))/2
  g[i] = rec$prob
  f[i] = rec$f_n
}
plot(num_miss,log(imp_scal))

imp_scal

##   [1]  9.060714e+54  6.802360e+47  4.883689e+85  2.269092e+91  4.183298e+91
##   [6]  2.946904e+86  2.330693e+66  2.386012e+83  3.679027e+52  1.975986e+53
##  [11]  3.762526e+81  4.085102e+63  4.034515e+97  8.376088e+85  4.999974e+60
##  [16]  3.386231e+63  3.396263e+61  2.746363e+66  5.159520e+68  3.278703e+63
##  [21]  1.432248e+56  1.358603e+87  1.924156e+35  1.274405e+48  1.456130e+62
##  [26]  3.199462e+67  6.996200e+95  1.313242e+63  4.410979e+75  2.781370e+60
##  [31]  2.196557e+50  9.251947e+54  1.582399e+90  1.448022e+49  1.974348e+32
##  [36]  1.078424e+85  1.098909e+85  3.254929e+62  5.145122e+35  6.195268e+62
##  [41]  3.870969e+41  7.531817e+58  3.693287e+72  7.067143e+72  1.268280e+49
##  [46]  3.613428e+56  9.384767e+70  1.580923e+55  5.751134e+78  4.785498e+57
##  [51]  4.450246e+81  8.254196e+46  5.436612e+48 1.233795e+105  1.816817e+47
##  [56]  2.376060e+54  4.928635e+60  5.075513e+27  4.611850e+03  1.620189e+47
##  [61]  3.068153e+22  3.086737e+62  1.938424e+81  2.836441e+58  4.326288e+90
##  [66]  6.138280e+81 7.519302e+100  5.036478e+42  7.980758e+67  2.630932e+26
##  [71]  1.753466e+88  1.592166e+71  8.282988e+55  8.726683e+41  8.130268e+73
##  [76]  2.742928e+67  6.609907e+60  9.843670e+95  5.307920e+65  1.505398e+77
##  [81]  3.254044e+46  2.714390e+59  1.309315e+67  1.260203e+52  8.487946e+25
##  [86]  0.000000e+00  8.832565e+67  4.029104e+76  6.832198e+71  2.819579e+29
##  [91]  1.125884e+64  6.014144e+26  5.423067e+76  8.369506e+63  1.195083e+81
##  [96]  6.580560e+69  5.478085e+25  1.228957e+61 4.938990e+112  3.086285e+67

#lm(log(imp_scal) ~ num_miss)

Moreover, the number of real missing species is

number_missing(st)

## [1] 32

and the set number of the simulated species are

summary(num_miss)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   24.00   32.00   35.00   35.46   39.00   47.00

the relationship between between f ang g as described here

plot(log(f),log(g))

the relationship between f and number of species

plot(num_miss,log(f))

the relationship between g and number of species

plot(num_miss,log(g))

The plots above are interesting, let’s go one step back and see the relationship between the likeihood of a tree and the number of species (extinct and extant)

We simulate trees, count number of extinct species and the corresponding -log-likelihood function

n_extinct = NULL
ll = NULL
sum_spec = NULL
sum_ext = NULL
for(i in 1:100){
  s = sim_phyl(seed=TRUE,seed_val = i)
  ll[i] = llik(b=c(0.8,0.0175,0.1),n=s$n,E=s$E,t=s$t)
  n_extinct[i] = number_missing(s)
  sum_spec[i] = sum(s$E)
  sum_ext[i] = sum(1-s$E)
}
plot(n_extinct,ll)

lm(ll~n_extinct)

## 
## Call:
## lm(formula = ll ~ n_extinct)
## 
## Coefficients:
## (Intercept)    n_extinct  
##      91.938        5.527

(Ernst:) This may be the important missing piece of the puzzle: adding an additional event reduces the likelihood. We need to compare the same number of events (or average the likelihood by the number of events)

the -log-likelihood:

plot(ll/n_extinct)

the likelihood:

plot(exp(ll),exp(n_extinct))

summary(exp(ll))

##       Min.    1st Qu.     Median       Mean    3rd Qu.       Max. 
##  2.211e+63 6.415e+116 1.063e+133 1.183e+178 2.511e+143 1.183e+180