pcm_demo

set.seed(98109) # Fred Hutch zip code
library(pacman)
p_load(phytools, castor, caper)

Settings

Between 60 and 100 tips

lower_bound_tips <- 60
upper_bound_tips <- 100

Birth rate and death rate for the tree

birth.rate <- 0.28
death.rate <- 0.14

Helpers

count_tips <- function(phy) {
  return( length(phy$tip.label) )
}

generateQ <- function(transition_model, n.states = 2) {
  Q <- castor::get_random_mk_transition_matrix(
    Nstates = n.states,
    rate_model = transition_model
  )
  colnames(Q) <- c("A","B")
  rownames(Q) <- c("A","B")
  return( Q )
}
# Presetting the colors that will be used for each trait
traitCols <- setNames(c("#FFC0CB", "#4682B4"), c("A","B"))

rates_list_to_mat <- function(fitMk_obj) {
  rates <- fitMk_obj$rates
  if (length(rates) == 1) {
    rate_B_to_A <- rates[1]
    rate_A_to_B <- rates[1]
  } else {
    rate_B_to_A <- rates[1]
    rate_A_to_B <- rates[2]
  }
  
  matrix(
    c(-rate_A_to_B, rate_A_to_B,
      rate_B_to_A, -rate_B_to_A),
    nrow=2, byrow = TRUE
  )
}


write_matex2 <- function(x) {
  begin <- "\\begin{bmatrix}"
  end <- "\\end{bmatrix}"
  X <-
    apply(x, 1, function(x) {
      paste(
        paste(x, collapse = "&"),
        "\\\\"
      )
    })
  paste(c(begin, X, end), collapse = "")
} # https://stackoverflow.com/a/54088015

Simulate tree

ntips <- 0
retry_counter <- 0
while (lower_bound_tips) {
  tt <- pbtree(b = birth.rate, d = death.rate, n = lower_bound_tips - 10)
  ntips <- count_tips(tt)
  if (ntips >= lower_bound_tips && ntips <= upper_bound_tips) {
    cat(paste("Took", retry_counter, "retries before success"))
    break
  } else {
    retry_counter <- retry_counter + 1
  }
}

Took 3 retries before success

plot(tt, show.tip.label=FALSE); title(paste("Total tips:", count_tips(tt)))

Discrete Trait Simulation

Model Simulation

Equal Rates (ER) model

\(q_{A \rightarrow B} = q_{B \rightarrow A}\)

er.Q <- generateQ("ER")
er_trait <- rTraitDisc(tt, root.value = 1, model = er.Q)
er_simmap <- summary(make.simmap(tt, er_trait, nsim=10, model = "ER", pi = "equal")) # root prior for *i* in *N* states **Pr**(state i) = 1/*N*

make.simmap is sampling character histories conditioned on
the transition matrix

Q =
           A          B
A -0.3321349  0.3321349
B  0.3321349 -0.3321349
(estimated using likelihood);
and (mean) root node prior probabilities
pi =
  A   B 
0.5 0.5

Done.

plot(er_simmap, colors = traitCols, ftype = "off")

table(er_trait)

er_trait
 A  B 
43 56

\[ Q_{ER} = \begin{bmatrix}-0.98&0.98 \\0.98&-0.98 \\\end{bmatrix} \]

All Rates Different (ARD) model

\(q_{A \rightarrow B} \neq q_{B \rightarrow A}\)

ard.Q <- generateQ("ARD")
ard_trait <- rTraitDisc(tt, root.value = 1, model = ard.Q)
ard_simmap <- summary(make.simmap(tt, ard_trait, nsim=10, model = "ARD"))

make.simmap is sampling character histories conditioned on
the transition matrix

Q =
            A           B
A -0.01817035  0.01817035
B  0.30482590 -0.30482590
(estimated using likelihood);
and (mean) root node prior probabilities
pi =
  A   B 
0.5 0.5

Done.

plot(ard_simmap, colors = traitCols, ftype = "off")

table(ard_trait)

ard_trait
 A  B 
89 10

\[ Q_{ARD} = \begin{bmatrix}-0.04&0.04 \\0.31&-0.31 \\\end{bmatrix} \]

Inference

Estimating the \(Q\) matrix

Equal Rates Trait

fitER <- fitMk(tt, er_trait, model = "ER")
fitARD <- fitMk(tt, er_trait, model = "ARD")
teststat <- -2 * (fitER$logLik - fitARD$logLik)
pchisq(teststat, df = 1, lower.tail = FALSE)

[1] 0.4892939

There’s no significant difference between the equal rates model and all rates different model, so fitting the simmap earlier using an equal rates model was appropriate.

\[ Q_{ER} = \begin{bmatrix}-0.98&0.98 \\0.98&-0.98 \\\end{bmatrix}\\ \] \[ \hat{Q}_{ER} = \begin{bmatrix}-0.33&0.33 \\0.33&-0.33 \\\end{bmatrix} \] (The estimated rates \(\hat{Q}\) are 3 times smaller than their true values in \(Q\). I’m not sure if this will matter a lot for analyses. If we can differentiate ER from ARD reliably, then we can at least comment on the rate model most appropriate for the data. However, we should probably not comment on the actual estimated rates. Maybe we can do it better*??

- for our use case

Let’s check the ARD trait…

All Rates Different Trait

fitER <- fitMk(tt, ard_trait, model = "ER")
fitARD <- fitMk(tt, ard_trait, model = "ARD")
teststat <- -2 * (fitER$logLik - fitARD$logLik)
pchisq(teststat, df = 1, lower.tail = FALSE)

[1] 0.0001920527

\[ Q_{ARD} = \begin{bmatrix}-0.04&0.04 \\0.31&-0.31 \\\end{bmatrix} \] \[ \hat{Q}_{ARD} = \begin{bmatrix}-0.02&0.02 \\0.3&-0.3 \\\end{bmatrix} \] (The estimate for rates between states is not too bad, especially relative to the ER model. From prior experience, this is not necessarily reliable. It would be useful to see what downstream stats look like when simulated on a phylogeny, but that’s getting beyond the scope of this.)

There is a significant difference between the ER model and ARD, so generating a simmap with an ARD model prior was, still, appropriate.

We knew all of this beforehand because we generated the data, but I want to demonstrate that its possible to fit all of this, and estimate the transition rates betwen states in our data.

Phylogenetic Signal using \(D\) statistic (Fritz & Purvis 2010)

Fritz & Purvis 2010 – https://www.jstor.org/stable/40864204?seq=1 For binary traits; best used when N \(\ge\) 50 species

**FP2010, Table 1**: Low \(D\) values incidate high signal and high \(D\) values indiciate low signal

Phylogenetic signal of ER trait

er_df <- data.frame(tip = names(er_trait), state = unname(er_trait))
er_compdat <- comparative.data(tt, er_df, names.col = 'tip')
phylo.d(er_compdat, tt, binvar=state)


Calculation of D statistic for the phylogenetic structure of a binary variable

  Data :  er_df
  Binary variable :  state
  Counts of states:  A = 43
                     B = 56
  Phylogeny :  tt
  Number of permutations :  1000

Estimated D :  0.765063
Probability of E(D) resulting from no (random) phylogenetic structure :  0.042
Probability of E(D) resulting from Brownian phylogenetic structure    :  0

Phylogenetic signal of ARD trait

ard_df <- data.frame(tip = names(ard_trait), state = unname(ard_trait))
ard_compdat <- comparative.data(tt, ard_df, names.col = 'tip')
phylo.d(ard_compdat, tt, binvar=state)


Calculation of D statistic for the phylogenetic structure of a binary variable

  Data :  ard_df
  Binary variable :  state
  Counts of states:  A = 89
                     B = 10
  Phylogeny :  tt
  Number of permutations :  1000

Estimated D :  -0.03363157
Probability of E(D) resulting from no (random) phylogenetic structure :  0
Probability of E(D) resulting from Brownian phylogenetic structure    :  0.555

Potential phylodynamics use case

Testing the phylodynamic signal of host characteristics, like zip code median income. We could possibly test if SARS-CoV-2 has a heterogenous transmission structure or if transmission chains are highly “bunched.” In the case of zip code median income, high signal would mean that we see transmission chains stay within income ranges. Low signal would mean that we would see frequently transmissions from high income to low income and low to high.

We could also use Q matricies to estimate if there is bias in transmission rates between vaccinated and unvaccinated individuals (maybe not the actual rates though). If an ARD model was the most appropriate, we might be able to see if individuals in either group are responsible for increased transmission. The fitMk model doesn’t allow for sampled ancestors – I wonder if we could make a model potentially better fit for epidemics.

Continuous trait evolution

Simulation

Brownian Motion ~ Neutral Evolution

Random walk trait evolution, where \(\sigma^2_{t=T} \propto T\), where \(T\) is time since the root, and \(\mu_{t=T} = \mu\).

BM with \(\mu_0 = 0, \sigma^2 = 0.25\)

bm_trait <- fastBM(tt, a = 0, sig2 = 0.25)
phenogram(tt, bm_trait, ftype="off")

Ornstein-Uhlenbeck ~ Adaptive Evolution

“Rubber banded” BM. Imagine a random walk of a point, but it’s held by a rubber band around some fixed point (\(\theta\)). The further away the stick goes from the fixed point, the harder it is pulled. The closer the stick object gets, the weaker the pull. \(\alpha\) represents the strength that rubber band effect. Extremely high \(\alpha\) pulls the random walk towards \(\alpha\) quickly (almost immediately), extremely low \(\alpha\) resembles BM.

OU with \(\mu_0 = 0, \sigma^2 = 0.25, \textcolor{red}{\theta = 10}, \alpha = 0.05\)

ou_trait_1 <- fastBM(tt, a = 0, sig2 = 0.25, theta = 10, alpha = 0.05)
phenogram(tt, ou_trait_1, ftype="off")
abline(h = 10, col = "red", lwd = 2)

OU with \(\mu_0 = 0, \sigma^2 = 0.25, \textcolor{red}{\theta = -20}, \alpha = 0.1\)

ou_trait_2 <- fastBM(tt, a = 0, sig2 = 0.25, theta = -20, alpha = 0.1)
phenogram(tt, ou_trait_2, ftype="off")
abline(h = -20, col = "red", lwd = 2)