We now replicate the weighted multiplier method of notebook 2, but this time using logged data. Specifically, estimates of the root are generated for each leaf with known values as follows. Marginal leaf counts are combined with \(M\) sets of sampled branching probabilities along the path to the leaf, generating \(M\) back-calculated values of the root node per leaf. For \(I\) leaves, this process produces an \(M\) by \(I\) matrix \(\mathcal{M}\) of estimates of the root. Depending on the distribution of branching probabilities, variation within columns of \(\mathcal{M}\) can be large, and values in different columns may also differ substantially in scale. To generate weights, a covariance matrix of \(\mathcal{M}\) is computed, but this variation and difference in scale can make inverting the covariance matrix difficult. For computational reasons, notebook 2 has been taking the logarithm of the columns of \(\mathcal{M}\) before calculating the covariance matrix, which has stabilized the weights calculation. The weights \(\vec{w}\) generated from the logged data have then been combined with the untransformed values of \(\mathcal{M}\) in the weighted mean \(\mathcal{M}\vec{w}\) to generate the \(M\) root estimates \(\hat{N}\). Here we take a slightly different approach.
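As a concrete illustration of this back-calculation, here is a minimal sketch using the two-edge path to leaf \(B\) in the subtree loaded below, with uniform sampled branching probabilities; this is only illustrative and is not the weightedPopTree internals (the names `count.B`, `p`, `q` are ours):

set.seed(1)
M <- 5000
count.B <- 150                     # known value at leaf B (TrueEst in the table below)
p <- runif(M)                      # sampled probability on the edge N -> AA
q <- runif(M)                      # sampled probability on the edge AA -> B
root.from.B <- count.B / (p * q)   # one column of the M x I matrix of root estimates
summary(root.from.B)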

We now compare the estimates and performance if we generate the weights \(\vec{w}\) as above, but combine them with the logged data and only then transform back to obtain \(\hat{N}\). That is, with \(\vec{w}\) generated using the covariance matrix of \(\mathcal{L}=\log\mathcal{M}\) (the log applied to every element of \(\mathcal{M}\)), we form \(\vec{\theta} = \mathcal{L}\vec{w}\) and set \(\hat{N} = \exp(\vec{\theta})\). Intuitively, if the data are highly variable and skewed, it is defensible to use logged data in mean calculations, and we certainly expect these values to be better behaved in precision matrix calculations. However, the interpretation of the resulting \(\hat{N}\) is less clear: rather than a weighted arithmetic mean of the raw estimates from each leaf, \(\hat{N}_m\) is now a weighted geometric mean, the product of the leaf estimates raised to the power of their weights \(w_i\) (with \(\vec{w}\) generated using \(\mathcal{L}\)): \[ \hat{N}_m = \prod_{i=1}^I \mathcal{M}_{mi}^{w_i}. \]
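As a quick numerical check of this equivalence (toy values, not the notebook's data):

# exp(log(M) %*% w) equals the row-wise weighted geometric mean prod_i M[m,i]^w[i]
M.toy <- matrix(c(55, 160,
                  70, 300,
                  90, 450), ncol = 2, byrow = TRUE)  # 3 x 2 matrix of root estimates
w <- c(0.6, 0.4)                                      # weights summing to 1

N.hat <- exp(as.vector(log(M.toy) %*% w))             # exp of weighted mean of logs
N.geo <- apply(M.toy, 1, function(row) prod(row^w))   # weighted geometric mean
all.equal(N.hat, N.geo)                               # TRUE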

In this document, we repeat the experiments with the various distributions of \(p,q\) from notebook 2, but now use the log-transformed data in the weighted mean calculation and exponentiate the result to generate \(\hat{N}\).

treedata <- read.table("MM_subtree.txt", sep = ",", header = TRUE)
treedata %>%
  knitr::kable() %>%
  kable_styling()
From  To   Est  Total  TrueEst  Pop    Desc  Prob  ProbLabel
N     A    0    0      50       TRUE   A     0.5   a
N     AA   0    0      NA       FALSE  AA    0.5   aa
AA    B    0    0      150      TRUE   B     0.5   b
AA    BB   0    0      NA       FALSE  BB    0.5   bb

We first try using uniform branching distributions:

set.seed(1)

# import data
uniformdata <- treedata
uniformdata$Estimate <- uniformdata$Est
uniformdata$Population <- uniformdata$Pop

# create tree with uniform data
tree <- FromDataFrameNetwork(uniformdata)

# calculate estimates from A,B using 5000 simulated p,q values
weightedPopTree(tree,sample_length=5000,method='weightedEstimate')
## [1] "using weighted mean estimates"
tree$Do(function(node) {
  if (node$isRoot) {
    node$uncertainty <- pop.conf.int(tree)
  } else {
    node$uncertainty <- prettyConfidenceIntervals(node$overdose_samples)
  }
})

# create dataframe of N estimates from each leaf
x <- tree$Get('overdose_samples', filterFun = isLeaf, traversal = 'post-order')
# Drop leaves with no samples, then keep only leaves with Population == TRUE below (these are the data)
if (is.list(x)) {
  mat.x <- NULL
  for (i in seq_along(x)) {
    if (length(x[[i]]) > 0) {   # parentheses fixed: test the length of the sample vector
      mat.x <- cbind(mat.x, x[[i]])
      colnames(mat.x)[ncol(mat.x)] <- names(x)[i]   # names(), since x is a list here
    }
  }
  x <- mat.x
}
x <- as.data.frame(x)
  
# keep only the columns for leaves with Population == TRUE
g <- which(tree$Get('Population', filterFun = isLeaf, traversal = 'post-order'))
pop.x <- x %>% 
  select(all_of(g))

pop.x <- log(pop.x)           # take log of estimates to use in the weighted mean
uniform.pop.x <- exp(pop.x)   # keep a raw-scale copy for the summaries below

# generate values \hat{N} to plot
weights <- as.matrix(pop.ko.weights(tree)) #weights are internally calculated with log(data)
uniformweights <- weights
uniformNs <- as.data.frame(as.matrix(pop.x) %*% t(weights))
uniformNs <- round(exp(uniformNs)) # transform to population values using exp
uniformNs.restrict <- uniformNs[uniformNs$V1>=10 & uniformNs$V1<=750, , drop = FALSE]
p <- ggplot(uniformNs.restrict, aes(x = V1)) + geom_density(alpha=.2, fill="#FF6666") + 
   ggtitle("MM estimates for uniform distribution for q, uniform for p", subtitle = "One MM pass and set of weights, 5000 p,q estimates")
p

#ggsave("altuniformMMests.pdf")

Next, we try a peaked (towards 1) distribution \(f(q)\):

peakedq1 <- treedata
peakedq1[3,3] <- 149   # Est on the AA -> B row
peakedq1[3,4] <- 149   # Total on the AA -> B row; together these peak q towards 1
peakedq1$Estimate <- peakedq1$Est
peakedq1$Population <- peakedq1$Pop
peakedq1
# create tree with peaked data
tree <- FromDataFrameNetwork(peakedq1)

# calculate estimates from A,B using 5000 simulated p,q values
weightedPopTree(tree,sample_length=5000,method='weightedEstimate')
## [1] "using weighted mean estimates"
tree$Do(function(node) {
  if (node$isRoot) {
    node$uncertainty <- pop.conf.int(tree)
  } else {
    node$uncertainty <- prettyConfidenceIntervals(node$overdose_samples)
  }
})

# create dataframe of N estimates from each leaf
x <- tree$Get('overdose_samples', filterFun = isLeaf, traversal = 'post-order')

# Drop leaves with no samples, then keep only leaves with Population == TRUE below (these are the data)
if (is.list(x)) {
  mat.x <- NULL
  for (i in seq_along(x)) {
    if (length(x[[i]]) > 0) {   # parentheses fixed: test the length of the sample vector
      mat.x <- cbind(mat.x, x[[i]])
      colnames(mat.x)[ncol(mat.x)] <- names(x)[i]   # names(), since x is a list here
    }
  }
  x <- mat.x
}
x <- as.data.frame(x)
  
# keep only the columns for leaves with Population == TRUE
g <- which(tree$Get('Population', filterFun = isLeaf, traversal = 'post-order'))
pop.x <- x %>% 
  select(all_of(g))

pop.x <- log(pop.x)            # take log of estimates to use in the weighted mean
peakedq1.pop.x <- exp(pop.x)   # raw-scale copy kept for the summaries below


# generate values \hat{N} to plot
weights <- as.matrix(pop.ko.weights(tree))
peakedq1weights <- weights
peakedq1Ns <- as.data.frame(as.matrix(pop.x) %*% t(weights))
peakedq1Ns <- round(exp(peakedq1Ns)) # transform to population values using exp
peakedq1Ns.restrict <- peakedq1Ns[peakedq1Ns$V1>=10 & peakedq1Ns$V1<=750, , drop = FALSE] # restrict to the range from a=10 to b=750
p <- ggplot(peakedq1Ns.restrict, aes(x = V1)) + geom_density(alpha=.2, fill="#FF6666") + 
   ggtitle("MM estimates for peaked distribution for q, uniform for p", subtitle = "Point mass on 1. One MM pass and set of weights, 5000 p,q estimates.")
p

#ggsave("altpeakedq1MMests.pdf")

We see that this doesn't seem to have much of an effect on the mode, even though the estimates from node \(B\) are significantly higher in the uniform case. There is, however, more confidence that the estimate of \(N\) is smaller. This is because in the peaked case the majority of the mass of \(f(q)\) sits near 1, so we have higher confidence that the population at the parent of \(B\) is not much larger than that at \(B\):

print("Uniform estimates from A,B:")
## [1] "Uniform estimates from A,B:"
summary(uniform.pop.x)
##        A                   B          
##  Min.   :    50.02   Min.   :    153  
##  1st Qu.:    66.17   1st Qu.:    390  
##  Median :    97.84   Median :    777  
##  Mean   :   393.26   Mean   :  11320  
##  3rd Qu.:   197.61   3rd Qu.:   2316  
##  Max.   :114780.57   Max.   :9319242
print("Peaked q estimates from A,B:")
## [1] "Peaked q estimates from A,B:"
summary(peakedq1.pop.x)
##        A                   B            
##  Min.   :    50.01   Min.   :    150.2  
##  1st Qu.:    65.76   1st Qu.:    199.9  
##  Median :   100.28   Median :    302.8  
##  Mean   :   415.36   Mean   :   1605.6  
##  3rd Qu.:   197.48   3rd Qu.:    595.7  
##  Max.   :104249.36   Max.   :2193651.5

Once we examine the weighted estimates of \(N\), however, we can see that the estimates from \(B\) have been downweighted in the uniform case; combined with the increased confidence in the estimate from \(B\) in the peaked case, this results in similar distributions of the estimates \(\hat{N}\):

print("Uniform weighted estimates of N:")
## [1] "Uniform weighted estimates of N:"
summary(uniformNs)
##        V1         
##  Min.   :   75.0  
##  1st Qu.:  148.0  
##  Median :  228.0  
##  Mean   :  456.4  
##  3rd Qu.:  399.0  
##  Max.   :47160.0
print("Weights in uniform case:")
## [1] "Weights in uniform case:"
uniformweights
##                         
## [1,] 0.6721284 0.3278716
print("Peaked q weighted estimates of N:")
## [1] "Peaked q weighted estimates of N:"
summary(peakedq1Ns)
##        V1         
##  Min.   :   87.0  
##  1st Qu.:  139.0  
##  Median :  200.0  
##  Mean   :  340.5  
##  3rd Qu.:  338.0  
##  Max.   :24945.0
print("Weights in peaked f(q) case:")
## [1] "Weights in peaked f(q) case:"
peakedq1weights
##                         
## [1,] 0.4992985 0.5007015

We move the point mass of the peaked \(f(q)\) distribution towards 0 to observe the effects:

peakedq0 <- treedata
peakedq0[3,3] <- 0     # Est on the AA -> B row (this may need to be changed?)
peakedq0[3,4] <- 149   # Total on the AA -> B row; together these peak q towards 0
peakedq0$Estimate <- peakedq0$Est
peakedq0$Population <- peakedq0$Pop
peakedq0
# create tree with peaked data
tree <- FromDataFrameNetwork(peakedq0)

# calculate estimates from A,B using 5000 simulated p,q values
weightedPopTree(tree,sample_length=5000,method='weightedEstimate')
## [1] "using weighted mean estimates"
tree$Do(function(node) {
  if (node$isRoot) {
    node$uncertainty <- pop.conf.int(tree)
  } else {
    node$uncertainty <- prettyConfidenceIntervals(node$overdose_samples)
  }
})

# create dataframe of N estimates from each leaf
x <- tree$Get('overdose_samples', filterFun = isLeaf, traversal = 'post-order')

# Drop leaves with no samples, then keep only leaves with Population == TRUE below (these are the data)
if (is.list(x)) {
  mat.x <- NULL
  for (i in seq_along(x)) {
    if (length(x[[i]]) > 0) {   # parentheses fixed: test the length of the sample vector
      mat.x <- cbind(mat.x, x[[i]])
      colnames(mat.x)[ncol(mat.x)] <- names(x)[i]   # names(), since x is a list here
    }
  }
  x <- mat.x
}
x <- as.data.frame(x)
  
# keep only the columns for leaves with Population == TRUE
g <- which(tree$Get('Population', filterFun = isLeaf, traversal = 'post-order'))
pop.x <- x %>% 
  select(all_of(g))

pop.x <- log(pop.x)            # take log of estimates to use in the weighted mean
peakedq0.pop.x <- exp(pop.x)   # raw-scale copy kept for the summaries below


# generate values \hat{N} to plot
weights <- as.matrix(pop.ko.weights(tree))
peakedq0weights <- weights
peakedq0Ns <- as.data.frame(as.matrix(pop.x) %*% t(weights))
peakedq0Ns <- round(exp(peakedq0Ns))
peakedq0Ns.restrict <- peakedq0Ns[peakedq0Ns$V1>=10 & peakedq0Ns$V1<=750, , drop = FALSE] # restrict to the range from a=10 to b=750
p <- ggplot(peakedq0Ns.restrict, aes(x = V1)) + geom_density(alpha=.2, fill="#FF6666") + 
   ggtitle("MM estimates for peaked distribution for q, uniform for p", subtitle = "Point mass on 0. One MM pass and set of weights, 5000 p,q estimates.")
p

#ggsave("altpeakedq1MMests.pdf")

Interestingly, if we restricted to the range \((10,750)\) without using logged data, the plot would be essentially blank, as much higher values of \(N\) are estimated. Here we have a lot of mass in this zone, and although the distribution does extend to larger values, we do not observe the same uncertainty predicted without the log transformation.

peakedq0Ns.restrict <- peakedq0Ns[peakedq0Ns$V1>=10 & peakedq0Ns$V1<=5000, , drop = FALSE] # restrict to the range from a=10 to b=5000
p <- ggplot(peakedq0Ns.restrict, aes(x = V1)) + geom_density(alpha=.2, fill="#FF6666") + 
   ggtitle("MM estimates for peaked distribution for q, uniform for p", subtitle = "Point mass on 0. One MM pass and set of weights, 5000 p,q estimates.")
p

In the summary for this case, we do expect to see larger estimates \(\hat{N}\) from \(B\), since the peaked distribution \(f(q)\) with mass on 0 suggests the population at \(B\) only represents a small fraction of the population at the parent of \(B\):

print("Uniform estimates from A,B:")
## [1] "Uniform estimates from A,B:"
summary(uniform.pop.x)
##        A                   B          
##  Min.   :    50.02   Min.   :    153  
##  1st Qu.:    66.17   1st Qu.:    390  
##  Median :    97.84   Median :    777  
##  Mean   :   393.26   Mean   :  11320  
##  3rd Qu.:   197.61   3rd Qu.:   2316  
##  Max.   :114780.57   Max.   :9319242
print("Peaked q estimates from A,B (mass on 0):")
## [1] "Peaked q estimates from A,B (mass on 0):"
summary(peakedq0.pop.x)
##        A                   B            
##  Min.   :    50.00   Min.   :3.561e+03  
##  1st Qu.:    66.72   1st Qu.:3.324e+04  
##  Median :   101.10   Median :8.565e+04  
##  Mean   :   489.10   Mean   :1.901e+06  
##  3rd Qu.:   212.24   3rd Qu.:2.643e+05  
##  Max.   :174974.98   Max.   :3.496e+09

This is indeed supported by the estimates of \(N\) from leaf \(B\). Although the overall weighted estimates are heavily affected by these larger values, the weighting of paths is not as intuitive as I would have hoped. To clarify, consider the variance of the estimates from \(A,B\). Weights are based on the sample precision matrix generated from the estimates \(\hat{N}\) from each leaf. Take \(q \sim Beta(150,1)\), the case with mass on 1. Since \(p \sim Uniform(0,1)\), we have \(Var(p) = 1/12 \approx 0.083\), while \(Var(q) = \frac{150}{(151)^2(152)} \approx 4.33\times10^{-5}\). The longer path to leaf \(B\) does affect its overall weighting, placing more weight on the data from \(A\), but when comparing only estimates from \(B\), we should see smaller variation in the peaked case than in the uniform case. I am not sure why this is occurring.
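A quick check of the two variances quoted above:

# Var(p) for p ~ Uniform(0,1) and Var(q) for q ~ Beta(150,1)
var.p <- 1 / 12
a <- 150; b <- 1
var.q <- (a * b) / ((a + b)^2 * (a + b + 1))
c(var.p = var.p, var.q = var.q)   # approx 0.0833 and 4.33e-05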

Considering we have strong data from leaf \(B\), I would have expected to see a relative increase in that weight compared to the uniform case; however, this does not appear to be happening, and the coefficient of variation is actually higher for estimates from that leaf:

print("Uniform weighted estimates of N:")
## [1] "Uniform weighted estimates of N:"
summary(uniformNs)
##        V1         
##  Min.   :   75.0  
##  1st Qu.:  148.0  
##  Median :  228.0  
##  Mean   :  456.4  
##  3rd Qu.:  399.0  
##  Max.   :47160.0
print("Weights in uniform case:")
## [1] "Weights in uniform case:"
uniformweights
##                         
## [1,] 0.6721284 0.3278716
print("Peaked q weighted estimates of N (mass on 0):")
## [1] "Peaked q weighted estimates of N (mass on 0):"
summary(peakedq0Ns)
##        V1          
##  Min.   :   171.0  
##  1st Qu.:   476.8  
##  Median :   754.0  
##  Mean   :  1625.5  
##  3rd Qu.:  1309.0  
##  Max.   :291849.0
print("Weights in peaked f(q) case (mass on 0):")
## [1] "Weights in peaked f(q) case (mass on 0):"
peakedq0weights
##                         
## [1,] 0.7226716 0.2773284
print("Ratio of coefficients of variation:")
## [1] "Ratio of coefficients of variation:"
cv.uniform <- sd(uniform.pop.x$B)/mean(uniform.pop.x$B)
cv.peaked <- sd(peakedq0.pop.x$B)/mean(peakedq0.pop.x$B)
cv.peaked/cv.uniform
## [1] 1.839073

We may get a better sense of the differences between these three plots if we do not cut the upper end of the range so low. We re-plot here with the upper end set to \(b=5000\) instead:

uniformNs.restrict <- uniformNs[uniformNs$V1>=10 & uniformNs$V1<=5000, , drop = FALSE]
p <- ggplot(uniformNs.restrict, aes(x = V1)) + geom_density(alpha=.2, fill="#FF6666") + 
   ggtitle("MM estimates for uniform distribution for q, uniform for p", subtitle = "One MM pass and set of weights, 5000 p,q estimates")
peakedq1Ns.restrict <- peakedq1Ns[peakedq1Ns$V1>=10 & peakedq1Ns$V1<=5000, , drop = FALSE]
q <- ggplot(peakedq1Ns.restrict, aes(x = V1)) + geom_density(alpha=.2, fill="#FF6666") + 
   ggtitle("MM estimates for peaked distribution for q, uniform for p", subtitle = "Mass on 1. One MM pass and set of weights, 5000 p,q estimates")
peakedq0Ns.restrict <- peakedq0Ns[peakedq0Ns$V1>=10 & peakedq0Ns$V1<=5000, , drop = FALSE]
r <- ggplot(peakedq0Ns.restrict, aes(x = V1)) + geom_density(alpha=.2, fill="#FF6666") + 
   ggtitle("MM estimates for peaked distribution for q, uniform for p", subtitle = "Mass on 0. One MM pass and set of weights, 5000 p,q estimates")
p

q

r

Now, we add a “middle case” - that is, let \(q \sim Beta(15,15)\).
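Across these experiments, the Est and Total entries on the AA to B row appear to encode the Beta parameters of \(f(q)\) as \((\text{Est}+1,\ \text{Total}-\text{Est}+1)\); this mapping is inferred from the values used in this notebook rather than confirmed from the package code:

# Inferred mapping from (Est, Total) on the AA -> B row to the Beta parameters of f(q)
beta.params <- function(est, total) c(shape1 = est + 1, shape2 = total - est + 1)
beta.params(149, 149)   # Beta(150, 1): peaked towards 1
beta.params(0, 149)     # Beta(1, 150): peaked towards 0
beta.params(14, 28)     # Beta(15, 15): the middle case below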

middleq <- treedata
middleq[3,3] <- 14   # Est on the AA -> B row
middleq[3,4] <- 28   # Total on the AA -> B row; together these give q ~ Beta(15,15)
middleq$Estimate <- middleq$Est
middleq$Population <- middleq$Pop
middleq
# create tree with middle-case data
tree <- FromDataFrameNetwork(middleq)

# calculate estimates from A,B using 5000 simulated p,q values
weightedPopTree(tree,sample_length=5000,method='weightedEstimate')
## [1] "using weighted mean estimates"
tree$Do(function(node) {
  if (node$isRoot) {
    node$uncertainty <- pop.conf.int(tree)
  } else {
    node$uncertainty <- prettyConfidenceIntervals(node$overdose_samples)
  }
})

# create dataframe of N estimates from each leaf
x <- tree$Get('overdose_samples', filterFun = isLeaf, traversal = 'post-order')

# Drop leaves with no samples, then keep only leaves with Population == TRUE below (these are the data)
if (is.list(x)) {
  mat.x <- NULL
  for (i in seq_along(x)) {
    if (length(x[[i]]) > 0) {   # parentheses fixed: test the length of the sample vector
      mat.x <- cbind(mat.x, x[[i]])
      colnames(mat.x)[ncol(mat.x)] <- names(x)[i]   # names(), since x is a list here
    }
  }
  x <- mat.x
}
x <- as.data.frame(x)
  
# keep only the columns for leaves with Population == TRUE
g <- which(tree$Get('Population', filterFun = isLeaf, traversal = 'post-order'))
pop.x <- x %>% 
  select(all_of(g))

pop.x <- log(pop.x)           # take log of estimates to use in the weighted mean
middleq.pop.x <- exp(pop.x)   # raw-scale copy kept for the summaries below


# generate values \hat{N} to plot
weights <- as.matrix(pop.ko.weights(tree))
middleqweights <- weights
middleqNs <- as.data.frame(as.matrix(pop.x) %*% t(weights))
middleqNs <- round(exp(middleqNs))
middleqNs.restrict <- middleqNs[middleqNs$V1>=10 & middleqNs$V1<=750, , drop = FALSE] # restrict to the range from a=10 to b=750
p <- ggplot(middleqNs.restrict, aes(x = V1)) + geom_density(alpha=.2, fill="#FF6666") + 
   ggtitle("MM estimates for central distribution for q, uniform for p", subtitle = "Beta(15,15). One MM pass and set of weights, 5000 p,q estimates.")
middleqNs.restrict <- middleqNs[middleqNs$V1>=10 & middleqNs$V1<=5000, , drop = FALSE] # restrict to the range from a=10 to b=5000
q <- ggplot(middleqNs.restrict, aes(x = V1)) + geom_density(alpha=.2, fill="#FF6666") + 
   ggtitle("MM estimates for central distribution for q, uniform for p", subtitle = "Beta(15,15). One MM pass and set of weights, 5000 p,q estimates.")
p

q

The mode here looks like it sits just below 200, whereas using the distribution \(f(N|A,B)\) the mode occurs closer to 350. The MM is also much more uncertain about the prospect of \(N\) taking a higher value.

Now we take a look at the summary of \(\hat{N}\) estimates from \(B\) in the middle case, and the coefficient of variation:

print("Uniform estimates from A,B:")
## [1] "Uniform estimates from A,B:"
summary(uniform.pop.x)
##        A                   B          
##  Min.   :    50.02   Min.   :    153  
##  1st Qu.:    66.17   1st Qu.:    390  
##  Median :    97.84   Median :    777  
##  Mean   :   393.26   Mean   :  11320  
##  3rd Qu.:   197.61   3rd Qu.:   2316  
##  Max.   :114780.57   Max.   :9319242
print("Middle case q estimates from A,B:")
## [1] "Middle case q estimates from A,B:"
summary(middleq.pop.x)
##        A                   B           
##  Min.   :    50.01   Min.   :   209.7  
##  1st Qu.:    67.04   1st Qu.:   402.2  
##  Median :   101.01   Median :   613.3  
##  Mean   :   409.35   Mean   :  2324.6  
##  3rd Qu.:   201.61   3rd Qu.:  1230.8  
##  Max.   :235379.88   Max.   :625175.3
print("Uniform weighted estimates of N:")
## [1] "Uniform weighted estimates of N:"
summary(uniformNs)
##        V1         
##  Min.   :   75.0  
##  1st Qu.:  148.0  
##  Median :  228.0  
##  Mean   :  456.4  
##  3rd Qu.:  399.0  
##  Max.   :47160.0
print("Weights in uniform case:")
## [1] "Weights in uniform case:"
uniformweights
##                         
## [1,] 0.6721284 0.3278716
print("Middle case q weighted estimates of N:")
## [1] "Middle case q weighted estimates of N:"
summary(middleqNs)
##        V1         
##  Min.   :  105.0  
##  1st Qu.:  197.0  
##  Median :  281.0  
##  Mean   :  475.2  
##  3rd Qu.:  463.2  
##  Max.   :32471.0
print("Weights in middle case:")
## [1] "Weights in middle case:"
middleqweights
##                         
## [1,] 0.5105496 0.4894504
print("Ratio of coefficients of variation:")
## [1] "Ratio of coefficients of variation:"
cv.uniform <- sd(uniform.pop.x$B)/mean(uniform.pop.x$B)
cv.middle <- sd(middleq.pop.x$B)/mean(middleq.pop.x$B)
cv.middle/cv.uniform
## [1] 0.3958393

We see a corresponding increase in the weight on path \(B\), a sign that the method is recognizing there is more information contained in this branch than in the uniform case. However, this was not echoed in the peaked case with mass on 0, so we should not be so quick to celebrate this.

cv.peakedq1 <- sd(peakedq1.pop.x$B)/mean(peakedq1.pop.x$B)
cv.peakedq1/cv.uniform
## [1] 1.327885

In fact, the coefficient of variation is higher in both peaked cases than in the uniform case. For the mass-on-0 case this may not be surprising: since back-calculation requires multiplying by the inverse of the probability values, distributions that frequently draw small probabilities can produce highly variable estimates \(\hat{N}\) even for small variations in the sampled values.
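As a rough illustration of this point, a standalone simulation (not part of the notebook's pipeline) of the coefficient of variation of the multiplier \(1/q\) under the three choices of \(f(q)\):

# Coefficient of variation of the multiplier 1/q under the three f(q) choices
set.seed(1)
n <- 1e5
cv <- function(x) sd(x) / mean(x)
c(uniform   = cv(1 / runif(n)),
  mass.on.1 = cv(1 / rbeta(n, 150, 1)),
  mass.on.0 = cv(1 / rbeta(n, 1, 150)))
# the mass-on-0 case, which frequently draws small q, gives by far the largest spread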