Here we use the opioid tree to test the three methods - distribution plotting, weighted multiplier method (MM), and Bayesian model. First, we import the opioid data and delete the healthcare unattended branch entirely.

treedata <- read_csv("diagram_data_hc_nohc.csv")
## Rows: 23 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): from, to, Description
## dbl (2): Estimate, Total
## lgl (1): Population
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# adjust nodes g, g1, g2 to become one node (currently ignoring 
# underreporting here); sum all estimate from g nodes and delete g1, g2 from tree
treedata$Estimate[3] <- sum(treedata$Estimate[3],treedata$Estimate[4],
                         treedata$Estimate[5])
treedata$Total[6:9] <- treedata$Estimate[3]
treedata <- treedata[-c(4,5),]
# delete healthcare unattended branch
treedata <- treedata[-c(16,17,18,19,20),]

# adjust tree so that it looks like small subtree used in intuitive notebooks - 
# here we combine "Admitted to ED" and "attended by EMS but not transported".
treedata$Estimate[3] <- sum(treedata$Estimate[3],treedata$Estimate[2])
treedata$Total[4:7] <- treedata$Estimate[3]
treedata$Description[3] <- "Admitted to ED or not transported by EMS"
treedata <- treedata[-2,]


# adjust population values so only certain leaves are considered as data
treedata$Population <- c(FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE)

# show data and tree
treedata %>%
  knitr::kable() %>%
  kable_styling()
from to Estimate Total Population Description
b e 473 34113 FALSE Attended by EMS but dies(in transport, on scene)
b g 33640 34113 TRUE Admitted to ED or not transported by EMS
g m 45 33640 FALSE Died in ED or after LWBS
g n 12907 33640 FALSE Released (assume survival)
g o 2376 33640 FALSE Admitted to hospital
g p NA 33640 FALSE Underdiagnosed and under-reported hospitalizations
o q 106 2376 FALSE Died in hospital
o r 2270 2376 FALSE Released (assumed survived)
e w 448 473 TRUE Reported by BCCS
e x 25 473 FALSE Under-reported as opioid OD death1
m y 43 45 FALSE Reported by BCCS
m z 2 45 FALSE Under-diagnosed and under-notified
q ab 98 106 FALSE Reported by BCCS
q ac 8 106 FALSE Under-diagnosed and under-notified
xo b 1309 2350 FALSE Healthcare attended
treedata$TrueEst <- treedata$Estimate
tree <- FromDataFrameNetwork(treedata)
tree$Do(function(node){
  node$probability <- round(node$Estimate/node$Total,4)
})

SetGraphStyle(tree,scale=2)
SetEdgeStyle(tree,penwidth=3)
SetEdgeStyle(tree, arrowhead = "vee", color = "grey35", penwidth = 2,
             label = function(node) node$probability)
SetNodeStyle(tree, fontsize=25, penwidth=3,width=1,
             label = function(node) node$Description) #try with node$From to compare to data
plot(tree)

We adjust the data so that the tree resembles our simple subtree as in the intuition notebooks. We do so by assuming an estimate of the node on the right after the first binary split, “Admitted to ED or not transported by EMS”, is known. We eliminate all nodes below this one for this exercise. The other node assumed to be known is the endpoint of the left-most path, “Reported by BCCS”, from those “Admitted by EMS but dies”. The cropped tree is as follows, where the root is now node \(b\), “Healthcare attended overdoses”:

# cropped nodes below "admitted to ED or not transported by EMS"
treedata <- treedata[-c(3:8,11:15),]

# show tree and data
treedata %>%
  knitr::kable() %>%
  kable_styling()
from to Estimate Total Population Description TrueEst
b e 473 34113 FALSE Attended by EMS but dies(in transport, on scene) 473
b g 33640 34113 TRUE Admitted to ED or not transported by EMS 33640
e w 448 473 TRUE Reported by BCCS 448
e x 25 473 FALSE Under-reported as opioid OD death1 25
treedata$TrueEst <- treedata$Estimate
tree <- FromDataFrameNetwork(treedata)
tree$Do(function(node){
  node$probability <- round(node$Estimate/node$Total,4)
})

SetGraphStyle(tree,scale=2)
SetEdgeStyle(tree,penwidth=3)
SetEdgeStyle(tree, arrowhead = "vee", color = "grey35", penwidth = 2,
             label = function(node) node$probability)
SetNodeStyle(tree, fontsize=25, penwidth=3,width=1,
             label = function(node) node$Description) 
plot(tree)

Now we generate a root estimate of \(b\) assuming only the values at the left- and right-most leaves. Maintaining the convention that the path probability to the right-most leaf is \(p\) and that to the left-most leaf is \(q\), we have a subtree analogous to the simulation, but this time we have a cohort estimate of the root with which to compare the estimates given by each method. In this case, we know the cohort value at \(b\) to be 34113. We test each method’s sensitivity to the distributions of \(p,q\) in estimating node \(b\). In each case, we begin by using the actual values generating \(p,q\) to establish adequate estimation of \(b\) when the distributions of \(p,q\) are accurate with respect to the data. Next, we proceed much like in the “intuition notebook 1” - estimate \(b\) using uniform \(p,q\), followed by changing to the peaked distributions for \(q\), then a middle case distribution of \(q\), and finally we will try a peaked distributions of both \(p\) and \(q\), since this closely resembles observed branching structure of this subtree.

Estimation using data values

Distribution

For these high values of \(N\), we must plot the log density for computational reasons, then revert. The density when using the calculated posterior results in the following plot, with mode around \(\hat{N} = 34100\), as expected:

source("intuitiveexercise.R")
dens_plot(473 + 1, 34113 - 473 + 1, 448 + 1, 473 - 448 + 1, 33640, 448, 34075, 34150, 1)

Multiplier Method

We now use the weighted multiplier method to plot a comparable distribution. We use 5000 samples of \(p,q\) to generate weights, and then calculate and plot the 5000 corresponding estimates of \(N\) generated with paired samples \(p,q\):

# calculate estimates from A,B using 5000 simulated p,q values
tree <- FromDataFrameNetwork(treedata)
weightedPopTree(tree,sample_length=5000,method='weightedEstimate')
## [1] "using weighted mean estimates"
tree$Do(function(node) {
  if(node$isRoot){
    node$uncertainty <-pop.conf.int(tree)
    }
  else{
    node$uncertainty <- prettyConfidenceIntervals(node$overdose_samples)
    }}
  )

# create dataframe of N estimates from each leaf
x <- tree$Get('overdose_samples', filterFun = isLeaf, traversal = 'post-order')
# Ignore NAs and then also choose leaves with population=TRUE values only (these are the data)
if(is.list(x)){    
  mat.x <- NULL
  for (i in 1:length(x)) {
    if (length(x[[i]] > 0)) {
      mat.x <- cbind(mat.x, x[[i]])
      colnames(mat.x)[dim(mat.x)[2]] <- rownames(x)[[i]]
    }
    }
  x <- mat.x
  }
x <- as.data.frame(x)
  
# rename column to be of the correct leaves (with population=TRUE values)
getleaves <- which(tree$Get('Population', filterFun = isLeaf, traversal = 'post-order'))
pop.x <- x %>% 
  select(all_of(getleaves))

pop.x <- round(pop.x)
data.pop.x <- pop.x

# generate values \hat{N} to plot
weights <- as.matrix(pop.ko.weights(tree))
dataweights <- weights
dataNs <- as.data.frame(as.matrix(pop.x) %*% t(weights))
dataNs <- round(dataNs)
dataNs.restrict <- dataNs[dataNs$V1>=30000 & dataNs$V1<=60000, , drop = FALSE]
p <- ggplot(dataNs.restrict, aes(x = V1)) + geom_density(alpha=.2, fill="#FF6666") + 
   ggtitle("MM estimates for data-informed distribution for p,q", subtitle = "One MM pass and set of weights, 5000 p,q estimates")
p

We see the mode of the distribution accurately falls around \(N=34120\).

log-WMM

# calculate estimates from A,B using 5000 simulated p,q values
tree <- FromDataFrameNetwork(treedata)
weightedPopTree(tree,sample_length=5000,method='weightedEstimate')
## [1] "using weighted mean estimates"
tree$Do(function(node) {
  if(node$isRoot){
    node$uncertainty <-pop.conf.int(tree)
    }
  else{
    node$uncertainty <- prettyConfidenceIntervals(node$overdose_samples)
    }}
  )

# create dataframe of N estimates from each leaf
x <- tree$Get('overdose_samples', filterFun = isLeaf, traversal = 'post-order')
# Ignore NAs and then also choose leaves with population=TRUE values only (these are the data)
if(is.list(x)){    
  mat.x <- NULL
  for (i in 1:length(x)) {
    if (length(x[[i]] > 0)) {
      mat.x <- cbind(mat.x, x[[i]])
      colnames(mat.x)[dim(mat.x)[2]] <- rownames(x)[[i]]
    }
    }
  x <- mat.x
  }
x <- as.data.frame(x)
  
# rename column to be of the correct leaves (with population=TRUE values)
getleaves <- which(tree$Get('Population', filterFun = isLeaf, traversal = 'post-order'))
pop.x <- x %>% 
  select(all_of(getleaves))

pop.x <- log(pop.x)
data.pop.x <- exp(pop.x)

# generate values \hat{N} to plot
weights <- as.matrix(pop.ko.weights(tree))
dataweights <- weights
dataNs <- as.data.frame(as.matrix(pop.x) %*% t(weights))
dataNs <- round(exp(dataNs))
dataNs.restrict <- dataNs[dataNs$V1>=30000 & dataNs$V1<=60000, , drop = FALSE]
p <- ggplot(dataNs.restrict, aes(x = V1)) + geom_density(alpha=.2, fill="#FF6666") + 
   ggtitle("MM estimates for data-informed distribution for p,q", subtitle = "One MM pass and set of weights, 5000 p,q estimates")
p

Again, the mode accurately falls at approximately \(N = 34113\).

Bayesian Model

Using the Bayesian model, we get the following posterior plot of \(N\), the population at node \(b\): Posterior of N using data to inform p,q

Here we see the mode clearly sits around the true value given by the data.

Uniform p,q

Distribution

When we assume information about \(p,q\) is not readily available or reliable, we may be tempted to use a uniform distribution for these branching probabilities. The following density plot of the estimates \(\hat{N}\) results:

dens_plot(1, 1, 1, 1, 33640, 448, 33750, 45000, 250)

With no additional information about \(p,q\), we have a slight overestimation of the actual value \(N\) at node \(b\), with a mode around \(N=36500\).

Multiplier Method

# adjust data
# only Popultaion = TRUE values are used and must be changed 
uniformdata <- treedata
uniformdata[2,3] <- 0
uniformdata[2,4] <- 0
uniformdata[3,3] <- 0
uniformdata[3,4] <- 0 
uniformdata
# calculate estimates from A,B using 5000 simulated p,q values
tree <- FromDataFrameNetwork(uniformdata)
weightedPopTree(tree,sample_length=5000,method='weightedEstimate')
## [1] "using weighted mean estimates"
tree$Do(function(node) {
  if(node$isRoot){
    node$uncertainty <-pop.conf.int(tree)
    }
  else{
    node$uncertainty <- prettyConfidenceIntervals(node$overdose_samples)
    }}
  )

# create dataframe of N estimates from each leaf
x <- tree$Get('overdose_samples', filterFun = isLeaf, traversal = 'post-order')
# Ignore NAs and then also choose leaves with population=TRUE values only (these are the data)
if(is.list(x)){    
  mat.x <- NULL
  for (i in 1:length(x)) {
    if (length(x[[i]] > 0)) {
      mat.x <- cbind(mat.x, x[[i]])
      colnames(mat.x)[dim(mat.x)[2]] <- rownames(x)[[i]]
    }
    }
  x <- mat.x
  }
x <- as.data.frame(x)
  
# rename column to be of the correct leaves (with population=TRUE values)
getleaves <- which(tree$Get('Population', filterFun = isLeaf, traversal = 'post-order'))
pop.x <- x %>% 
  select(all_of(getleaves))

pop.x <- round(pop.x)
uniform.pop.x <- pop.x

# generate values \hat{N} to plot
weights <- as.matrix(pop.ko.weights(tree))
uniformweights <- weights
uniformNs <- as.data.frame(as.matrix(pop.x) %*% t(weights))
uniformNs <- round(uniformNs)
uniformNs.restrict <- uniformNs[uniformNs$V1>=30000 & uniformNs$V1<=60000, , drop = FALSE]
p <- ggplot(uniformNs.restrict, aes(x = V1)) + geom_density(alpha=.2, fill="#FF6666") + 
   ggtitle("MM estimates for uniform distributions for p,q", subtitle = "One MM pass and set of weights, 5000 p,q estimates")
p

summary(uniformNs)
##        V1          
##  Min.   :   32471  
##  1st Qu.:   54471  
##  Median :   81426  
##  Mean   :  266708  
##  3rd Qu.:  155378  
##  Max.   :73689673
uniformweights
##                         
## [1,] 0.5027943 0.4972057

The multiplier method does not predict a mode near the true value when the distributions of \(p,q\) are uniform. With uniform \(p,q\), backcalculating from the bottom left node will result in a high estimate of \(b\), which the same process will result in lower estimates of \(b\) from the rightmost leaf. The multiplier method is estimating a balance of these, weighted slightly more heavily in favor of the smaller estimates as these are the result of a shorter path to known data.

log-WMM

# adjust data
# only Popultaion = TRUE values are used and must be changed 
uniformdata <- treedata
uniformdata[2,3] <- 0
uniformdata[2,4] <- 0
uniformdata[3,3] <- 0
uniformdata[3,4] <- 0 
uniformdata
# calculate estimates from A,B using 5000 simulated p,q values
tree <- FromDataFrameNetwork(uniformdata)
weightedPopTree(tree,sample_length=5000,method='weightedEstimate')
## [1] "using weighted mean estimates"
tree$Do(function(node) {
  if(node$isRoot){
    node$uncertainty <-pop.conf.int(tree)
    }
  else{
    node$uncertainty <- prettyConfidenceIntervals(node$overdose_samples)
    }}
  )

# create dataframe of N estimates from each leaf
x <- tree$Get('overdose_samples', filterFun = isLeaf, traversal = 'post-order')
# Ignore NAs and then also choose leaves with population=TRUE values only (these are the data)
if(is.list(x)){    
  mat.x <- NULL
  for (i in 1:length(x)) {
    if (length(x[[i]] > 0)) {
      mat.x <- cbind(mat.x, x[[i]])
      colnames(mat.x)[dim(mat.x)[2]] <- rownames(x)[[i]]
    }
    }
  x <- mat.x
  }
x <- as.data.frame(x)
  
# rename column to be of the correct leaves (with population=TRUE values)
getleaves <- which(tree$Get('Population', filterFun = isLeaf, traversal = 'post-order'))
pop.x <- x %>% 
  select(all_of(getleaves))

pop.x <- log(pop.x)
uniform.pop.x <- exp(pop.x)

# generate values \hat{N} to plot
weights <- as.matrix(pop.ko.weights(tree))
uniformweights <- weights
uniformNs <- as.data.frame(as.matrix(pop.x) %*% t(weights))
uniformNs <- round(exp(uniformNs))
uniformNs.restrict <- uniformNs[uniformNs$V1>=30000 & uniformNs$V1<=60000, , drop = FALSE]
p <- ggplot(uniformNs.restrict, aes(x = V1)) + geom_density(alpha=.2, fill="#FF6666") + 
   ggtitle("MM estimates for uniform distributions for p,q", subtitle = "One MM pass and set of weights, 5000 p,q estimates")
p

summary(uniformNs)
##        V1          
##  Min.   :   32191  
##  1st Qu.:   53172  
##  Median :   76238  
##  Mean   :  137772  
##  3rd Qu.:  123921  
##  Max.   :31132806
uniformweights
##                         
## [1,] 0.5039039 0.4960961

The density is similar to the case of non-transformed data, but more uniformly distributed near the mode, which reamains around approximately \(N=43500\).

Bayesian Model

Using the uniform distributions with the Bayesian model, we get the following posterior plot, where the mode is providing a good estimate of the true value of node \(b\): Posterior of N, uniform p,q

Peaked q

Distribution

For the distribution with \(q\) peaked near 1, we expect to see a better estimate of the true value than with the uniform choice of \(q\), since the cohort data informs a branching probability of \(\approx 94\%\). This is indeed the case, as we see a mode around \(N=34100\):

dens_plot(1, 1, 150, 1, 33640, 448, 34000, 34200, 5)

When \(q\) is peaked “incorrectly”, with mass around 0:

dens_plot(1, 1, 1, 150, 33640, 448, 34000, 60000, 1000)

We can see in this case the mass of the distribution is moved to high values of \(N\), and we don’t have much hope of recovering an estimate close to the true value.

Multiplier Method

We now use the weighted MM with \(q\) peaked near 0:

# adjust data
# only Population = TRUE values are used and must be changed 
peakedq0 <- treedata
peakedq0[2,3] <- 0
peakedq0[2,4] <- 0
peakedq0[3,3] <- 0
peakedq0[3,4] <- 149 
peakedq0
# calculate estimates from A,B using 5000 simulated p,q values
tree <- FromDataFrameNetwork(peakedq0)
weightedPopTree(tree,sample_length=5000,method='weightedEstimate')
## [1] "using weighted mean estimates"
tree$Do(function(node) {
  if(node$isRoot){
    node$uncertainty <-pop.conf.int(tree)
    }
  else{
    node$uncertainty <- prettyConfidenceIntervals(node$overdose_samples)
    }}
  )

# create dataframe of N estimates from each leaf
x <- tree$Get('overdose_samples', filterFun = isLeaf, traversal = 'post-order')
# Ignore NAs and then also choose leaves with population=TRUE values only (these are the data)
if(is.list(x)){    
  mat.x <- NULL
  for (i in 1:length(x)) {
    if (length(x[[i]] > 0)) {
      mat.x <- cbind(mat.x, x[[i]])
      colnames(mat.x)[dim(mat.x)[2]] <- rownames(x)[[i]]
    }
    }
  x <- mat.x
  }
x <- as.data.frame(x)
  
# rename column to be of the correct leaves (with population=TRUE values)
getleaves <- which(tree$Get('Population', filterFun = isLeaf, traversal = 'post-order'))
pop.x <- x %>% 
  select(all_of(getleaves))

pop.x <- round(pop.x)
peakedq0.pop.x <- pop.x

# generate values \hat{N} to plot
weights <- as.matrix(pop.ko.weights(tree))
peakedq0weights <- weights
peakedq0Ns <- as.data.frame(as.matrix(pop.x) %*% t(weights))
peakedq0Ns <- round(peakedq0Ns)
peakedq0Ns.restrict <- peakedq0Ns[peakedq0Ns$V1>=30000 & peakedq0Ns$V1<=900000, , drop = FALSE]
p <- ggplot(peakedq0Ns.restrict, aes(x = V1)) + geom_density(alpha=.2, fill="#FF6666") + 
   ggtitle("MM estimates for peaked (0) distribution for q, uniform for p", subtitle = "One MM pass and set of weights, 5000 p,q estimates")
p

summary(peakedq0Ns)
##        V1           
##  Min.   :2.322e+05  
##  1st Qu.:1.415e+06  
##  Median :2.736e+06  
##  Mean   :2.108e+07  
##  3rd Qu.:6.483e+06  
##  Max.   :1.289e+10
peakedq0weights
##                         
## [1,] 0.3708844 0.6291156

As expected, we expect the mode of this distribution to be overestimated. In this case, it is around \(N=800000\).

Now when the distribution of \(q\) is peaked near 1, we expect to see good estimation of node \(b\):

# adjust data
# only Population = TRUE values are used and must be changed 
peakedq1 <- treedata
peakedq1[2,3] <- 0
peakedq1[2,4] <- 0
peakedq1[3,3] <- 149
peakedq1[3,4] <- 150 
peakedq1
# calculate estimates from A,B using 5000 simulated p,q values
tree <- FromDataFrameNetwork(peakedq1)
weightedPopTree(tree,sample_length=5000,method='weightedEstimate')
## [1] "using weighted mean estimates"
tree$Do(function(node) {
  if(node$isRoot){
    node$uncertainty <-pop.conf.int(tree)
    }
  else{
    node$uncertainty <- prettyConfidenceIntervals(node$overdose_samples)
    }}
  )

# create dataframe of N estimates from each leaf
x <- tree$Get('overdose_samples', filterFun = isLeaf, traversal = 'post-order')
# Ignore NAs and then also choose leaves with population=TRUE values only (these are the data)
if(is.list(x)){    
  mat.x <- NULL
  for (i in 1:length(x)) {
    if (length(x[[i]] > 0)) {
      mat.x <- cbind(mat.x, x[[i]])
      colnames(mat.x)[dim(mat.x)[2]] <- rownames(x)[[i]]
    }
    }
  x <- mat.x
  }
x <- as.data.frame(x)
  
# rename column to be of the correct leaves (with population=TRUE values)
getleaves <- which(tree$Get('Population', filterFun = isLeaf, traversal = 'post-order'))
pop.x <- x %>% 
  select(all_of(getleaves))

pop.x <- round(pop.x)
peakedq1.pop.x <- pop.x

# generate values \hat{N} to plot
weights <- as.matrix(pop.ko.weights(tree))
peakedq1weights <- weights
peakedq1Ns <- as.data.frame(as.matrix(pop.x) %*% t(weights))
peakedq1Ns <- round(peakedq1Ns)
peakedq1Ns.restrict <- peakedq1Ns[peakedq1Ns$V1>=30000 & peakedq1Ns$V1<=50000, , drop = FALSE]
p <- ggplot(peakedq1Ns.restrict, aes(x = V1)) + geom_density(alpha=.2, fill="#FF6666") + 
   ggtitle("MM estimates for uniform peaked (1) for q, uniform for p", subtitle = "One MM pass and set of weights, 5000 p,q estimates")
p

summary(peakedq1Ns)
##        V1         
##  Min.   :  27185  
##  1st Qu.:  31972  
##  Median :  33033  
##  Mean   :  33849  
##  3rd Qu.:  34147  
##  Max.   :2441167
peakedq1weights
##                           
## [1,] 0.9971794 0.002820579

In this case, we see good estimation of the population at node \(b\), with \(N \approx 33000\).

log-WMM

We now use the log data with weighted MM with \(q\) peaked near 0:

# adjust data
# only Population = TRUE values are used and must be changed 
peakedq0 <- treedata
peakedq0[2,3] <- 0
peakedq0[2,4] <- 0
peakedq0[3,3] <- 0
peakedq0[3,4] <- 149 
peakedq0
# calculate estimates from A,B using 5000 simulated p,q values
tree <- FromDataFrameNetwork(peakedq0)
weightedPopTree(tree,sample_length=5000,method='weightedEstimate')
## [1] "using weighted mean estimates"
tree$Do(function(node) {
  if(node$isRoot){
    node$uncertainty <-pop.conf.int(tree)
    }
  else{
    node$uncertainty <- prettyConfidenceIntervals(node$overdose_samples)
    }}
  )

# create dataframe of N estimates from each leaf
x <- tree$Get('overdose_samples', filterFun = isLeaf, traversal = 'post-order')
# Ignore NAs and then also choose leaves with population=TRUE values only (these are the data)
if(is.list(x)){    
  mat.x <- NULL
  for (i in 1:length(x)) {
    if (length(x[[i]] > 0)) {
      mat.x <- cbind(mat.x, x[[i]])
      colnames(mat.x)[dim(mat.x)[2]] <- rownames(x)[[i]]
    }
    }
  x <- mat.x
  }
x <- as.data.frame(x)
  
# rename column to be of the correct leaves (with population=TRUE values)
getleaves <- which(tree$Get('Population', filterFun = isLeaf, traversal = 'post-order'))
pop.x <- x %>% 
  select(all_of(getleaves))

pop.x <- log(pop.x)
peakedq0.pop.x <- exp(pop.x)

# generate values \hat{N} to plot
weights <- as.matrix(pop.ko.weights(tree))
peakedq0weights <- weights
peakedq0Ns <- as.data.frame(as.matrix(pop.x) %*% t(weights))
peakedq0Ns <- round(exp(peakedq0Ns))
peakedq0Ns.restrict <- peakedq0Ns[peakedq0Ns$V1>=30000 & peakedq0Ns$V1<=900000, , drop = FALSE]
p <- ggplot(peakedq0Ns.restrict, aes(x = V1)) + geom_density(alpha=.2, fill="#FF6666") + 
   ggtitle("MM estimates for peaked (0) distribution for q, uniform for p", subtitle = "One MM pass and set of weights, 5000 p,q estimates")
p

summary(peakedq0Ns)
##        V1          
##  Min.   :  108316  
##  1st Qu.:  278081  
##  Median :  420187  
##  Mean   :  754352  
##  3rd Qu.:  719924  
##  Max.   :61724052
peakedq0weights
##                         
## [1,] 0.3668003 0.6331997

As expected, we expect the mode of this distribution to be overestimated. In this case, it is around \(N=250000\). While this is still a massive overestimate, this is an improvement over the mode estimate of \(N=800000\) given when using the WMM without log-transformed data.

Now when the distribution of \(q\) is peaked near 1, we expect to see good estimation of node \(b\):

# adjust data
# only Population = TRUE values are used and must be changed 
peakedq1 <- treedata
peakedq1[2,3] <- 0
peakedq1[2,4] <- 0
peakedq1[3,3] <- 149
peakedq1[3,4] <- 150 
peakedq1
# calculate estimates from A,B using 5000 simulated p,q values
tree <- FromDataFrameNetwork(peakedq1)
weightedPopTree(tree,sample_length=5000,method='weightedEstimate')
## [1] "using weighted mean estimates"
tree$Do(function(node) {
  if(node$isRoot){
    node$uncertainty <-pop.conf.int(tree)
    }
  else{
    node$uncertainty <- prettyConfidenceIntervals(node$overdose_samples)
    }}
  )

# create dataframe of N estimates from each leaf
x <- tree$Get('overdose_samples', filterFun = isLeaf, traversal = 'post-order')
# Ignore NAs and then also choose leaves with population=TRUE values only (these are the data)
if(is.list(x)){    
  mat.x <- NULL
  for (i in 1:length(x)) {
    if (length(x[[i]] > 0)) {
      mat.x <- cbind(mat.x, x[[i]])
      colnames(mat.x)[dim(mat.x)[2]] <- rownames(x)[[i]]
    }
    }
  x <- mat.x
  }
x <- as.data.frame(x)
  
# rename column to be of the correct leaves (with population=TRUE values)
getleaves <- which(tree$Get('Population', filterFun = isLeaf, traversal = 'post-order'))
pop.x <- x %>% 
  select(all_of(getleaves))

pop.x <- log(pop.x)
peakedq1.pop.x <- exp(pop.x)

# generate values \hat{N} to plot
weights <- as.matrix(pop.ko.weights(tree))
peakedq1weights <- weights
peakedq1Ns <- as.data.frame(as.matrix(pop.x) %*% t(weights))
peakedq1Ns <- round(exp(peakedq1Ns))
peakedq1Ns.restrict <- peakedq1Ns[peakedq1Ns$V1>=30000 & peakedq1Ns$V1<=50000, , drop = FALSE]
p <- ggplot(peakedq1Ns.restrict, aes(x = V1)) + geom_density(alpha=.2, fill="#FF6666") + 
   ggtitle("MM estimates for uniform peaked (1) for q, uniform for p", subtitle = "One MM pass and set of weights, 5000 p,q estimates")
p

summary(peakedq1Ns)
##        V1       
##  Min.   :28029  
##  1st Qu.:31735  
##  Median :32738  
##  Mean   :32813  
##  3rd Qu.:33872  
##  Max.   :38694
peakedq1weights
##                           
## [1,] 0.9975978 0.002402227

In this case, we see better estimation of the population at node \(b\), with \(N \approx 33000\), as in the untransformed case. The main difference here the lack of estimates beyond \(N \sim 40000\), whereas the untransformed case is producing estimates beyond this.

Bayesian Model

In the Bayesian model case, with \(q\) ‘incorrectly’ peaked near 0, we have the following plot: Posterior of N, peaked q (0)

For \(q\) peaked near 1, we have very good estimation, as expected, with mode in line with what is predicted using the density: Posterior of N, peaked q (1)

Middle case

Distribution

In the middle case, recall we use \(q \sim Beta(15,15)\). This distribution is bell-shaped and centered around 0.5, with standard deviation just under 0.1. In this case, we don’t do too badly; the mode sits around \(N=34600\):

dens_plot(1, 1, 15, 15, 33640, 448, 34000, 35500, 10)

Multiplier Method

# adjust data
# only Population = TRUE values are used and must be changed 
middle <- treedata
middle[2,3] <- 0
middle[2,4] <- 0
middle[3,3] <- 14
middle[3,4] <- 28 
middle
# calculate estimates from A,B using 5000 simulated p,q values
tree <- FromDataFrameNetwork(middle)
weightedPopTree(tree,sample_length=5000,method='weightedEstimate')
## [1] "using weighted mean estimates"
tree$Do(function(node) {
  if(node$isRoot){
    node$uncertainty <-pop.conf.int(tree)
    }
  else{
    node$uncertainty <- prettyConfidenceIntervals(node$overdose_samples)
    }}
  )

# create dataframe of N estimates from each leaf
x <- tree$Get('overdose_samples', filterFun = isLeaf, traversal = 'post-order')
# Ignore NAs and then also choose leaves with population=TRUE values only (these are the data)
if(is.list(x)){    
  mat.x <- NULL
  for (i in 1:length(x)) {
    if (length(x[[i]] > 0)) {
      mat.x <- cbind(mat.x, x[[i]])
      colnames(mat.x)[dim(mat.x)[2]] <- rownames(x)[[i]]
    }
    }
  x <- mat.x
  }
x <- as.data.frame(x)
  
# rename column to be of the correct leaves (with population=TRUE values)
getleaves <- which(tree$Get('Population', filterFun = isLeaf, traversal = 'post-order'))
pop.x <- x %>% 
  select(all_of(getleaves))

pop.x <- round(pop.x)
middle.pop.x <- pop.x

# generate values \hat{N} to plot
weights <- as.matrix(pop.ko.weights(tree))
middleweights <- weights
middleNs <- as.data.frame(as.matrix(pop.x) %*% t(weights))
middleNs <- round(middleNs)
middleNs.restrict <- middleNs[middleNs$V1>=30000 & middleNs$V1<=60000, , drop = FALSE]
p <- ggplot(middleNs.restrict, aes(x = V1)) + geom_density(alpha=.2, fill="#FF6666") + 
   ggtitle("MM estimates for middle case q, uniform for p", subtitle = "One MM pass and set of weights, 5000 p,q estimates")
p

summary(middleNs)
##        V1          
##  Min.   :   39147  
##  1st Qu.:   58891  
##  Median :   66860  
##  Mean   :   85082  
##  3rd Qu.:   77427  
##  Max.   :47540386
middleweights
##                          
## [1,] 0.9677737 0.03222626

While the posterior distribution withstands this choice of distribution for \(q\), the multiplier method is estimating a population around \(N=57500\), a large overestimate of the true value.

log-WMM

# adjust data
# only Population = TRUE values are used and must be changed 
middle <- treedata
middle[2,3] <- 0
middle[2,4] <- 0
middle[3,3] <- 14
middle[3,4] <- 28 
middle
# calculate estimates from A,B using 5000 simulated p,q values
tree <- FromDataFrameNetwork(middle)
weightedPopTree(tree,sample_length=5000,method='weightedEstimate')
## [1] "using weighted mean estimates"
tree$Do(function(node) {
  if(node$isRoot){
    node$uncertainty <-pop.conf.int(tree)
    }
  else{
    node$uncertainty <- prettyConfidenceIntervals(node$overdose_samples)
    }}
  )

# create dataframe of N estimates from each leaf
x <- tree$Get('overdose_samples', filterFun = isLeaf, traversal = 'post-order')
# Ignore NAs and then also choose leaves with population=TRUE values only (these are the data)
if(is.list(x)){    
  mat.x <- NULL
  for (i in 1:length(x)) {
    if (length(x[[i]] > 0)) {
      mat.x <- cbind(mat.x, x[[i]])
      colnames(mat.x)[dim(mat.x)[2]] <- rownames(x)[[i]]
    }
    }
  x <- mat.x
  }
x <- as.data.frame(x)
  
# rename column to be of the correct leaves (with population=TRUE values)
getleaves <- which(tree$Get('Population', filterFun = isLeaf, traversal = 'post-order'))
pop.x <- x %>% 
  select(all_of(getleaves))

pop.x <- log(pop.x)
middle.pop.x <- exp(pop.x)

# generate values \hat{N} to plot
weights <- as.matrix(pop.ko.weights(tree))
middleweights <- weights
middleNs <- as.data.frame(as.matrix(pop.x) %*% t(weights))
middleNs <- round(exp(middleNs))
middleNs.restrict <- middleNs[middleNs$V1>=30000 & middleNs$V1<=60000, , drop = FALSE]
p <- ggplot(middleNs.restrict, aes(x = V1)) + geom_density(alpha=.2, fill="#FF6666") + 
   ggtitle("MM estimates for middle case q, uniform for p", subtitle = "One MM pass and set of weights, 5000 p,q estimates")
p

summary(middleNs)
##        V1        
##  Min.   : 38247  
##  1st Qu.: 57990  
##  Median : 65018  
##  Mean   : 67406  
##  3rd Qu.: 74137  
##  Max.   :160704
middleweights
##                          
## [1,] 0.9641484 0.03585157

Unfortunately we do not have an improvement in the mode with the transformed data. The distribution actually suggests more confidence in values around the incorrect mode.

Bayesian Model

For the middle case distribution of \(q\), we have the following posterior using the Bayesian model in blang: Posterior of N, middle case distribution of q

We don’t yet have a clear trend here. However, rerunning the simulation longer to promote better convergence we do see more tendancy towards a mode, but it is not yet in agreeance with the mode in the density: Posterior of N, middle case distribution of q

Peaked p (variable),q (correct - static)

Distribution

Lastly, we estimate using a “correctly” peaked distribution of \(q\), and also adjust the distribution of \(p\) to be peaked around 0, which is close to the values suggested by the data, and also to be peaked around 1, to see the effect of poor priors on \(p\) on estimation of \(N\). The first case shows good estimation, with the mode around \(N=34100\):

dens_plot(1, 150, 150, 1, 33640, 448, 34000, 34200, 5)

In the second case, where the distribution of \(p\) is pushed mostly towards 1 but the actual \(p\) value from the cohort is \(\approx 0.01\), we interesting still see quite good estimation, with the mode of \(N = 34300\):

dens_plot(150, 1, 150, 1, 33640, 448, 34000, 36000, 10)

Essentially, we only see an increase in uncertainty when \(f(p)\) is poorly chosen, and not a significant change in a mode-based estimate \(\hat{N}\).

Multiplier Method

We begin by peaking the \(p\) distribution around 0, which should result in accurate estimates of node \(b\):

# adjust data
# only Population = TRUE values are used and must be changed 
peakedp0 <- treedata
peakedp0[2,3] <- 0
peakedp0[2,4] <- 149
peakedp0[3,3] <- 149
peakedp0[3,4] <- 150
peakedp0
# calculate estimates from A,B using 5000 simulated p,q values
tree <- FromDataFrameNetwork(peakedp0)
weightedPopTree(tree,sample_length=5000,method='weightedEstimate')
## [1] "using weighted mean estimates"
tree$Do(function(node) {
  if(node$isRoot){
    node$uncertainty <-pop.conf.int(tree)
    }
  else{
    node$uncertainty <- prettyConfidenceIntervals(node$overdose_samples)
    }}
  )

# create dataframe of N estimates from each leaf
x <- tree$Get('overdose_samples', filterFun = isLeaf, traversal = 'post-order')
# Ignore NAs and then also choose leaves with population=TRUE values only (these are the data)
if(is.list(x)){    
  mat.x <- NULL
  for (i in 1:length(x)) {
    if (length(x[[i]] > 0)) {
      mat.x <- cbind(mat.x, x[[i]])
      colnames(mat.x)[dim(mat.x)[2]] <- rownames(x)[[i]]
    }
    }
  x <- mat.x
  }
x <- as.data.frame(x)
  
# rename column to be of the correct leaves (with population=TRUE values)
getleaves <- which(tree$Get('Population', filterFun = isLeaf, traversal = 'post-order'))
pop.x <- x %>% 
  select(all_of(getleaves))

pop.x <- round(pop.x)
peakedp0.pop.x <- pop.x

# generate values \hat{N} to plot
weights <- as.matrix(pop.ko.weights(tree))
peakedp0weights <- weights
peakedp0Ns <- as.data.frame(as.matrix(pop.x) %*% t(weights))
peakedp0Ns <- round(peakedp0Ns)
peakedp0Ns.restrict <- peakedp0Ns[peakedp0Ns$V1>=30000 & peakedp0Ns$V1<=60000, , drop = FALSE]
p <- ggplot(peakedp0Ns.restrict, aes(x = V1)) + geom_density(alpha=.2, fill="#FF6666") + 
   ggtitle("MM estimates for peaked q, peaked (0) p", subtitle = "One MM pass and set of weights, 5000 p,q estimates")
p

summary(peakedp0Ns)
##        V1           
##  Min.   :    30235  
##  1st Qu.:    38094  
##  Median :    43025  
##  Mean   :   176299  
##  3rd Qu.:    56844  
##  Max.   :230836733
peakedp0weights
##                           
## [1,] 0.9985963 0.001403686

We do see an overestimate in this case, with a mode close to \(N=37000\).
When the distribution of \(p\) is instead peaked around 1, we may see more overestimation, as the true value of \(p\) is very small:

# adjust data
# only Population = TRUE values are used and must be changed 
peakedp1 <- treedata
peakedp1[2,3] <- 149
peakedp1[2,4] <- 150
peakedp1[3,3] <- 149
peakedp1[3,4] <- 150
peakedp1
# calculate estimates from A,B using 5000 simulated p,q values
tree <- FromDataFrameNetwork(peakedp1)
weightedPopTree(tree,sample_length=5000,method='weightedEstimate')
## [1] "using weighted mean estimates"
tree$Do(function(node) {
  if(node$isRoot){
    node$uncertainty <-pop.conf.int(tree)
    }
  else{
    node$uncertainty <- prettyConfidenceIntervals(node$overdose_samples)
    }}
  )

# create dataframe of N estimates from each leaf
x <- tree$Get('overdose_samples', filterFun = isLeaf, traversal = 'post-order')
# Ignore NAs and then also choose leaves with population=TRUE values only (these are the data)
if(is.list(x)){    
  mat.x <- NULL
  for (i in 1:length(x)) {
    if (length(x[[i]] > 0)) {
      mat.x <- cbind(mat.x, x[[i]])
      colnames(mat.x)[dim(mat.x)[2]] <- rownames(x)[[i]]
    }
    }
  x <- mat.x
  }
x <- as.data.frame(x)
  
# rename column to be of the correct leaves (with population=TRUE values)
getleaves <- which(tree$Get('Population', filterFun = isLeaf, traversal = 'post-order'))
pop.x <- x %>% 
  select(all_of(getleaves))

pop.x <- round(pop.x)
peakedp1.pop.x <- pop.x

# generate values \hat{N} to plot
weights <- as.matrix(pop.ko.weights(tree))
peakedp1weights <- weights
peakedp1Ns <- as.data.frame(as.matrix(pop.x) %*% t(weights))
peakedp1Ns <- round(peakedp1Ns)
peakedp1Ns.restrict <- peakedp1Ns[peakedp1Ns$V1>=30000 & peakedp1Ns$V1<=60000, , drop = FALSE]
p <- ggplot(peakedp1Ns.restrict, aes(x = V1)) + geom_density(alpha=.2, fill="#FF6666") + 
   ggtitle("MM estimates for peaked q, peaked (1) for p", subtitle = "One MM pass and set of weights, 5000 p,q estimates")
p

summary(peakedp1Ns)
##        V1       
##  Min.   :33508  
##  1st Qu.:33814  
##  Median :33978  
##  Mean   :34041  
##  3rd Qu.:34195  
##  Max.   :36365
peakedp1weights
##                          
## [1,] 0.03740866 0.9625913

Interestingly, we don’t see the over-estimation of \(N\) predicted. This indicates that this method, like the posterior, is less affected by the prior knowledge of the distribution of \(p\).

log-WMM

We start again by peaking the \(p\) distribution around 0, which should result in accurate estimates of node \(b\):

# adjust data
# only Population = TRUE values are used and must be changed 
peakedp0 <- treedata
peakedp0[2,3] <- 0
peakedp0[2,4] <- 149
peakedp0[3,3] <- 149
peakedp0[3,4] <- 150
peakedp0
# calculate estimates from A,B using 5000 simulated p,q values
tree <- FromDataFrameNetwork(peakedp0)
weightedPopTree(tree,sample_length=5000,method='weightedEstimate')
## [1] "using weighted mean estimates"
tree$Do(function(node) {
  if(node$isRoot){
    node$uncertainty <-pop.conf.int(tree)
    }
  else{
    node$uncertainty <- prettyConfidenceIntervals(node$overdose_samples)
    }}
  )

# create dataframe of N estimates from each leaf
x <- tree$Get('overdose_samples', filterFun = isLeaf, traversal = 'post-order')
# Ignore NAs and then also choose leaves with population=TRUE values only (these are the data)
if(is.list(x)){    
  mat.x <- NULL
  for (i in 1:length(x)) {
    if (length(x[[i]] > 0)) {
      mat.x <- cbind(mat.x, x[[i]])
      colnames(mat.x)[dim(mat.x)[2]] <- rownames(x)[[i]]
    }
    }
  x <- mat.x
  }
x <- as.data.frame(x)
  
# rename column to be of the correct leaves (with population=TRUE values)
getleaves <- which(tree$Get('Population', filterFun = isLeaf, traversal = 'post-order'))
pop.x <- x %>% 
  select(all_of(getleaves))

pop.x <- log(pop.x)
peakedp0.pop.x <- exp(pop.x)

# generate values \hat{N} to plot
weights <- as.matrix(pop.ko.weights(tree))
peakedp0weights <- weights
peakedp0Ns <- as.data.frame(as.matrix(pop.x) %*% t(weights))
peakedp0Ns <- round(exp(peakedp0Ns))
peakedp0Ns.restrict <- peakedp0Ns[peakedp0Ns$V1>=30000 & peakedp0Ns$V1<=60000, , drop = FALSE]
p <- ggplot(peakedp0Ns.restrict, aes(x = V1)) + geom_density(alpha=.2, fill="#FF6666") + 
   ggtitle("MM estimates for peaked q, peaked (0) p", subtitle = "One MM pass and set of weights, 5000 p,q estimates")
p

summary(peakedp0Ns)
##        V1       
##  Min.   :27441  
##  1st Qu.:31847  
##  Median :32823  
##  Mean   :32861  
##  3rd Qu.:33855  
##  Max.   :39185
peakedp0weights
##                            
## [1,] 0.9992484 0.0007516463

We no longer see an overestimate in this case, with a mode close to \(N=33500\).
When the distribution of \(p\) is instead peaked around 1, we may see more overestimation, as the true value of \(p\) is very small:

# adjust data
# only Population = TRUE values are used and must be changed 
peakedp1 <- treedata
peakedp1[2,3] <- 149
peakedp1[2,4] <- 150
peakedp1[3,3] <- 149
peakedp1[3,4] <- 150
peakedp1
# calculate estimates from A,B using 5000 simulated p,q values
tree <- FromDataFrameNetwork(peakedp1)
weightedPopTree(tree,sample_length=5000,method='weightedEstimate')
## [1] "using weighted mean estimates"
tree$Do(function(node) {
  if(node$isRoot){
    node$uncertainty <-pop.conf.int(tree)
    }
  else{
    node$uncertainty <- prettyConfidenceIntervals(node$overdose_samples)
    }}
  )

# create dataframe of N estimates from each leaf
x <- tree$Get('overdose_samples', filterFun = isLeaf, traversal = 'post-order')
# Ignore NAs and then also choose leaves with population=TRUE values only (these are the data)
if(is.list(x)){    
  mat.x <- NULL
  for (i in 1:length(x)) {
    if (length(x[[i]] > 0)) {
      mat.x <- cbind(mat.x, x[[i]])
      colnames(mat.x)[dim(mat.x)[2]] <- rownames(x)[[i]]
    }
    }
  x <- mat.x
  }
x <- as.data.frame(x)
  
# rename column to be of the correct leaves (with population=TRUE values)
getleaves <- which(tree$Get('Population', filterFun = isLeaf, traversal = 'post-order'))
pop.x <- x %>% 
  select(all_of(getleaves))

pop.x <- log(pop.x)
peakedp1.pop.x <- exp(pop.x)

# generate values \hat{N} to plot
weights <- as.matrix(pop.ko.weights(tree))
peakedp1weights <- weights
peakedp1Ns <- as.data.frame(as.matrix(pop.x) %*% t(weights))
peakedp1Ns <- round(exp(peakedp1Ns))
peakedp1Ns.restrict <- peakedp1Ns[peakedp1Ns$V1>=30000 & peakedp1Ns$V1<=60000, , drop = FALSE]
p <- ggplot(peakedp1Ns.restrict, aes(x = V1)) + geom_density(alpha=.2, fill="#FF6666") + 
   ggtitle("MM estimates for peaked q, peaked (1) for p", subtitle = "One MM pass and set of weights, 5000 p,q estimates")
p

summary(peakedp1Ns)
##        V1       
##  Min.   :33492  
##  1st Qu.:33813  
##  Median :33970  
##  Mean   :34036  
##  3rd Qu.:34192  
##  Max.   :35926
peakedp1weights
##                          
## [1,] 0.04029163 0.9597084

Again, we don’t see the over-estimation of \(N\) predicted. This indicates that this method, like the posterior, is less affected by the prior knowledge of the distribution of \(p\). The distribution here is similar to that of untransformed data.

Bayesian Model

The following plots result using the Bayesian model for distribution of \(p\) peaked around 0 and then 1, respectively: Posterior of N, peaked distribution of p (0), q

Posterior of N, peaked distribution of p (1), q

As in the densities, we see both agree on a mode close to the true value. Again, although larger (inaccurate) estimates result in larger uncertainty in the case of an incorrect choice of \(f(p)\), an estimate based on the mode is largely unaffected by the choice of distribution for \(p\), and instead this results only in larger uncertainty surrounding our estimate \(\hat{N}\).

Peaked p (correct),q (incorrect)

To explore this a little further, we check to see how the mode-based estimate is affected if \(p\) is mostly correct, but \(q\) is not. For this, we choose \(p\) peaked close to 0 and choose \(q\) peaked close to 0 also.

Distribution

As in the case of uniform \(f(p)\), a better choice of this distribution does not significantly improve estiamtes of \(N\) if \(f(q)\) is poorly chosen:

dens_plot(1, 150, 1, 150, 33640, 448, 36000, 40000, 100)

Multiplier Method

For a good choice of distribution of \(f(p)\), we see that a poor choice of \(f(q)\) results in large overestimation using the weighted MM, much like the case for incorrect \(f(q)\) and uniform \(f(p)\):

# adjust data
# only Population = TRUE values are used and must be changed 
peakedp0q0 <- treedata
peakedp0q0[2,3] <- 0
peakedp0q0[2,4] <- 149
peakedp0q0[3,3] <- 0
peakedp0q0[3,4] <- 149
peakedp0q0
# calculate estimates from A,B using 5000 simulated p,q values
tree <- FromDataFrameNetwork(peakedp0q0)
weightedPopTree(tree,sample_length=5000,method='weightedEstimate')
## [1] "using weighted mean estimates"
tree$Do(function(node) {
  if(node$isRoot){
    node$uncertainty <-pop.conf.int(tree)
    }
  else{
    node$uncertainty <- prettyConfidenceIntervals(node$overdose_samples)
    }}
  )

# create dataframe of N estimates from each leaf
x <- tree$Get('overdose_samples', filterFun = isLeaf, traversal = 'post-order')
# Ignore NAs and then also choose leaves with population=TRUE values only (these are the data)
if(is.list(x)){    
  mat.x <- NULL
  for (i in 1:length(x)) {
    if (length(x[[i]] > 0)) {
      mat.x <- cbind(mat.x, x[[i]])
      colnames(mat.x)[dim(mat.x)[2]] <- rownames(x)[[i]]
    }
    }
  x <- mat.x
  }
x <- as.data.frame(x)
  
# rename column to be of the correct leaves (with population=TRUE values)
getleaves <- which(tree$Get('Population', filterFun = isLeaf, traversal = 'post-order'))
pop.x <- x %>% 
  select(all_of(getleaves))

pop.x <- round(pop.x)
peakedp0q0.pop.x <- pop.x

# generate values \hat{N} to plot
weights <- as.matrix(pop.ko.weights(tree))
peakedp0q0weights <- weights
peakedp0q0Ns <- as.data.frame(as.matrix(pop.x) %*% t(weights))
peakedp0q0Ns <- round(peakedp0q0Ns)
peakedp0q0Ns.restrict <- peakedp0q0Ns[peakedp0q0Ns$V1>=30000 & peakedp0q0Ns$V1<=6000000, , drop = FALSE]
p <- ggplot(peakedp0q0Ns.restrict, aes(x = V1)) + geom_density(alpha=.2, fill="#FF6666") + 
   ggtitle("MM estimates for peaked (0) q, peaked (0) for p", subtitle = "One MM pass and set of weights, 5000 p,q estimates")
p

summary(peakedp0q0Ns)
##        V1           
##  Min.   :1.067e+06  
##  1st Qu.:5.305e+06  
##  Median :9.720e+06  
##  Mean   :6.677e+07  
##  3rd Qu.:2.154e+07  
##  Max.   :3.892e+10
peakedp0q0weights
##                         
## [1,] 0.4964911 0.5035089

log-WMM

Again with the log-transformed case, we see that a poor choice of \(f(q)\) results in large overestimation using the weighted MM. The distribution is similar to that of untransformed data.

# adjust data
# only Population = TRUE values are used and must be changed 
peakedp0q0 <- treedata
peakedp0q0[2,3] <- 0
peakedp0q0[2,4] <- 149
peakedp0q0[3,3] <- 0
peakedp0q0[3,4] <- 149
peakedp0q0
# calculate estimates from A,B using 5000 simulated p,q values
tree <- FromDataFrameNetwork(peakedp0q0)
weightedPopTree(tree,sample_length=5000,method='weightedEstimate')
## [1] "using weighted mean estimates"
tree$Do(function(node) {
  if(node$isRoot){
    node$uncertainty <-pop.conf.int(tree)
    }
  else{
    node$uncertainty <- prettyConfidenceIntervals(node$overdose_samples)
    }}
  )

# create dataframe of N estimates from each leaf
x <- tree$Get('overdose_samples', filterFun = isLeaf, traversal = 'post-order')
# Ignore NAs and then also choose leaves with population=TRUE values only (these are the data)
if(is.list(x)){    
  mat.x <- NULL
  for (i in 1:length(x)) {
    if (length(x[[i]] > 0)) {
      mat.x <- cbind(mat.x, x[[i]])
      colnames(mat.x)[dim(mat.x)[2]] <- rownames(x)[[i]]
    }
    }
  x <- mat.x
  }
x <- as.data.frame(x)
  
# rename column to be of the correct leaves (with population=TRUE values)
getleaves <- which(tree$Get('Population', filterFun = isLeaf, traversal = 'post-order'))
pop.x <- x %>% 
  select(all_of(getleaves))

pop.x <- log(pop.x)
peakedp0q0.pop.x <- exp(pop.x)

# generate values \hat{N} to plot
weights <- as.matrix(pop.ko.weights(tree))
peakedp0q0weights <- weights
peakedp0q0Ns <- as.data.frame(as.matrix(pop.x) %*% t(weights))
peakedp0q0Ns <- round(exp(peakedp0q0Ns))
peakedp0q0Ns.restrict <- peakedp0q0Ns[peakedp0q0Ns$V1>=30000 & peakedp0q0Ns$V1<=6000000, , drop = FALSE]
p <- ggplot(peakedp0q0Ns.restrict, aes(x = V1)) + geom_density(alpha=.2, fill="#FF6666") + 
   ggtitle("MM estimates for peaked (0) q, peaked (0) for p", subtitle = "One MM pass and set of weights, 5000 p,q estimates")
p

summary(peakedp0q0Ns)
##        V1           
##  Min.   :1.007e+06  
##  1st Qu.:4.590e+06  
##  Median :7.588e+06  
##  Mean   :1.543e+07  
##  3rd Qu.:1.475e+07  
##  Max.   :3.505e+09
peakedp0q0weights
##                         
## [1,] 0.5066745 0.4933255

Bayesian Model

Finally, the Bayesian model also overestimates \(N\) when \(f(p)\) is well specified by \(f(q)\) is poorly chosen: Posterior of N, peaked distribution of p (0), q (0) While there is overestimation in the value of node \(b\) when \(f(q)\) is misspecified, these distributions show clear improvement on the estimates generated with the same choice \(f(q)\) but uniform \(f(p)\).

It should also be noted that while the distribution and the Bayesian model do not show the same level of overestimation, these estimates are constrained by the choice of range for the discrete uniform prior on \(N\).