Here we use the opioid tree to test the three methods - distribution plotting, weighted multiplier method (MM), and Bayesian model. First, we import the opioid data and delete the healthcare unattended branch entirely.

treedata <- read_csv("diagram_data_hc_nohc.csv")

## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   from = col_character(),
##   to = col_character(),
##   Estimate = col_double(),
##   Total = col_double(),
##   Population = col_logical(),
##   Description = col_character()
## )

# adjust nodes g, g1, g2 to become one node (currently ignoring 
# underreporting here); sum all estimate from g nodes and delete g1, g2 from tree
treedata$Estimate[3] <- sum(treedata$Estimate[3],treedata$Estimate[4],
                         treedata$Estimate[5])
treedata$Total[6:9] <- treedata$Estimate[3]
treedata <- treedata[-c(4,5),]
# delete healthcare unattended branch
treedata <- treedata[-c(16,17,18,19,20),]

# adjust tree so that it looks like small subtree used in intuitive notebooks - 
# here we combine "Admitted to ED" and "attended by EMS but not transported".
treedata$Estimate[3] <- sum(treedata$Estimate[3],treedata$Estimate[2])
treedata$Total[4:7] <- treedata$Estimate[3]
treedata$Description[3] <- "Admitted to ED or not transported by EMS"
treedata <- treedata[-2,]


# adjust population values so only certain leaves are considered as data
treedata$Population <- c(FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE)

# show data and tree
treedata %>%
  knitr::kable() %>%
  kable_styling()

from	to	Estimate	Total	Population	Description
b	e	473	34113	FALSE	Attended by EMS but dies(in transport, on scene)
b	g	33640	34113	TRUE	Admitted to ED or not transported by EMS
g	m	45	33640	FALSE	Died in ED or after LWBS
g	n	12907	33640	FALSE	Released (assume survival)
g	o	2376	33640	FALSE	Admitted to hospital
g	p	NA	33640	FALSE	Underdiagnosed and under-reported hospitalizations
o	q	106	2376	FALSE	Died in hospital
o	r	2270	2376	FALSE	Released (assumed survived)
e	w	448	473	TRUE	Reported by BCCS
e	x	25	473	FALSE	Under-reported as opioid OD death1
m	y	43	45	FALSE	Reported by BCCS
m	z	2	45	FALSE	Under-diagnosed and under-notified
q	ab	98	106	FALSE	Reported by BCCS
q	ac	8	106	FALSE	Under-diagnosed and under-notified
xo	b	1309	2350	FALSE	Healthcare attended

treedata$TrueEst <- treedata$Estimate
tree <- FromDataFrameNetwork(treedata)
tree$Do(function(node){
  node$probability <- round(node$Estimate/node$Total,4)
})

SetGraphStyle(tree,scale=2)
SetEdgeStyle(tree,penwidth=3)
SetEdgeStyle(tree, arrowhead = "vee", color = "grey35", penwidth = 2,
             label = function(node) node$probability)
SetNodeStyle(tree, fontsize=25, penwidth=3,width=1,
             label = function(node) node$Description) #try with node$From to compare to data
plot(tree)

We adjust the data so that the tree resembles our simple subtree as in the intuition notebooks. We do so by assuming an estimate of the node on the right after the first binary split, “Admitted to ED or not transported by EMS”, is known. We eliminate all nodes below this one for this exercise. The other node assumed to be known is the endpoint of the left-most path, “Reported by BCCS”, from those “Admitted by EMS but dies”. The cropped tree is as follows, where the root is now node \(b\), “Healthcare attended overdoses”:

# cropped nodes below "admitted to ED or not transported by EMS"
treedata <- treedata[-c(3:8,11:15),]

# show tree and data
treedata %>%
  knitr::kable() %>%
  kable_styling()

from	to	Estimate	Total	Population	Description	TrueEst
b	e	473	34113	FALSE	Attended by EMS but dies(in transport, on scene)	473
b	g	33640	34113	TRUE	Admitted to ED or not transported by EMS	33640
e	w	448	473	TRUE	Reported by BCCS	448
e	x	25	473	FALSE	Under-reported as opioid OD death1	25

treedata$TrueEst <- treedata$Estimate
tree <- FromDataFrameNetwork(treedata)
tree$Do(function(node){
  node$probability <- round(node$Estimate/node$Total,4)
})

SetGraphStyle(tree,scale=2)
SetEdgeStyle(tree,penwidth=3)
SetEdgeStyle(tree, arrowhead = "vee", color = "grey35", penwidth = 2,
             label = function(node) node$probability)
SetNodeStyle(tree, fontsize=25, penwidth=3,width=1,
             label = function(node) node$Description) 
plot(tree)

Now we generate a root estimate of \(b\) assuming only the values at the left- and right-most leaves. Maintaining the convention that the path probability to the right-most leaf is \(p\) and that to the left-most leaf is \(q\), we have a subtree analogous to the simulation, but this time we have a cohort estimate of the root with which to compare the estimates given by each method. In this case, we know the cohort value at \(b\) to be 34113. We test each method’s sensitivity to the distributions of \(p,q\) in estimating node \(b\). In each case, we begin by using the actual values generating \(p,q\) to establish adequate estimation of \(b\) when the distributions of \(p,q\) are accurate with respect to the data. Next, we proceed much like in the “intuition notebook 1” - estimate \(b\) using uniform \(p,q\), followed by changing to the peaked distributions for \(q\), then a middle case distribution of \(q\), and finally we will try a peaked distributions of both \(p\) and \(q\), since this closely resembles observed branching structure of this subtree.

Estimation using data values

Distribution

For these high values of \(N\), we must plot the log density for computational reasons, then revert. The density when using the calculated posterior results in the following plot, with mode around \(\hat{N} = 34100\), as expected:

source("intuitiveexercise.R")
dens_plot(473 + 1, 34113 - 473 + 1, 448 + 1, 473 - 448 + 1, 33640, 448, 34075, 34150, 1)

Multiplier Method

We now use the weighted multiplier method to plot a comparable distribution. We use 5000 samples of \(p,q\) to generate weights, and then calculate and plot the 5000 corresponding estimates of \(N\) generated with paired samples \(p,q\):

# calculate estimates from A,B using 5000 simulated p,q values
tree <- FromDataFrameNetwork(treedata)
weightedPopTree(tree,sample_length=5000,method='weightedEstimate')

## [1] "using weighted mean estimates"

tree$Do(function(node) {
  if(node$isRoot){
    node$uncertainty <-pop.conf.int(tree)
    }
  else{
    node$uncertainty <- prettyConfidenceIntervals(node$overdose_samples)
    }}
  )

# create dataframe of N estimates from each leaf
x <- tree$Get('overdose_samples', filterFun = isLeaf, traversal = 'post-order')
# Ignore NAs and then also choose leaves with population=TRUE values only (these are the data)
if(is.list(x)){    
  mat.x <- NULL
  for (i in 1:length(x)) {
    if (length(x[[i]] > 0)) {
      mat.x <- cbind(mat.x, x[[i]])
      colnames(mat.x)[dim(mat.x)[2]] <- rownames(x)[[i]]
    }
    }
  x <- mat.x
  }
x <- as.data.frame(x)
  
# rename column to be of the correct leaves (with population=TRUE values)
getleaves <- which(tree$Get('Population', filterFun = isLeaf, traversal = 'post-order'))
pop.x <- x %>% 
  select(all_of(getleaves))

pop.x <- round(pop.x)
data.pop.x <- pop.x

# generate values \hat{N} to plot
weights <- as.matrix(pop.ko.weights(tree))
dataweights <- weights
dataNs <- as.data.frame(as.matrix(pop.x) %*% t(weights))
dataNs <- round(dataNs)
dataNs.restrict <- dataNs[dataNs$V1>=30000 & dataNs$V1<=60000, , drop = FALSE]
p <- ggplot(dataNs.restrict, aes(x = V1)) + geom_density(alpha=.2, fill="#FF6666") + 
   ggtitle("MM estimates for data-informed distribution for p,q", subtitle = "One MM pass and set of weights, 5000 p,q estimates")
p

We see the mode of the distribution accurately falls around \(N=34120\).

Bayesian Model

Using the Bayesian model, we get the following posterior plot of \(N\), the population at node \(b\): Posterior of N using data to inform p,q

Here we see the mode clearly sits around the true value given by the data.

Uniform p,q

Distribution

When we assume information about \(p,q\) is not readily available or reliable, we may be tempted to use a uniform distribution for these branching probabilities. The following density plot of the estimates \(\hat{N}\) results:

dens_plot(1, 1, 1, 1, 33640, 448, 33750, 45000, 250)

With no additional information about \(p,q\), we have a slight overestimation of the actual value \(N\) at node \(b\), with a mode around \(N=36500\).

Multiplier Method

# adjust data
# only Popultaion = TRUE values are used and must be changed 
uniformdata <- treedata
uniformdata[2,3] <- 0
uniformdata[2,4] <- 0
uniformdata[3,3] <- 0
uniformdata[3,4] <- 0 
uniformdata

# calculate estimates from A,B using 5000 simulated p,q values
tree <- FromDataFrameNetwork(uniformdata)
weightedPopTree(tree,sample_length=5000,method='weightedEstimate')

## [1] "using weighted mean estimates"

tree$Do(function(node) {
  if(node$isRoot){
    node$uncertainty <-pop.conf.int(tree)
    }
  else{
    node$uncertainty <- prettyConfidenceIntervals(node$overdose_samples)
    }}
  )

# create dataframe of N estimates from each leaf
x <- tree$Get('overdose_samples', filterFun = isLeaf, traversal = 'post-order')
# Ignore NAs and then also choose leaves with population=TRUE values only (these are the data)
if(is.list(x)){    
  mat.x <- NULL
  for (i in 1:length(x)) {
    if (length(x[[i]] > 0)) {
      mat.x <- cbind(mat.x, x[[i]])
      colnames(mat.x)[dim(mat.x)[2]] <- rownames(x)[[i]]
    }
    }
  x <- mat.x
  }
x <- as.data.frame(x)
  
# rename column to be of the correct leaves (with population=TRUE values)
getleaves <- which(tree$Get('Population', filterFun = isLeaf, traversal = 'post-order'))
pop.x <- x %>% 
  select(all_of(getleaves))

pop.x <- round(pop.x)
uniform.pop.x <- pop.x

# generate values \hat{N} to plot
weights <- as.matrix(pop.ko.weights(tree))
uniformweights <- weights
uniformNs <- as.data.frame(as.matrix(pop.x) %*% t(weights))
uniformNs <- round(uniformNs)
uniformNs.restrict <- uniformNs[uniformNs$V1>=30000 & uniformNs$V1<=60000, , drop = FALSE]
p <- ggplot(uniformNs.restrict, aes(x = V1)) + geom_density(alpha=.2, fill="#FF6666") + 
   ggtitle("MM estimates for uniform distributions for p,q", subtitle = "One MM pass and set of weights, 5000 p,q estimates")
p

summary(uniformNs)

##        V1           
##  Min.   :    32918  
##  1st Qu.:    55138  
##  Median :    85767  
##  Mean   :   411383  
##  3rd Qu.:   162066  
##  Max.   :496625385

uniformweights

##                       
## [1,] 0.500372 0.499628

The multiplier method does not predict a mode near the true value when the distributions of \(p,q\) are uniform. With uniform \(p,q\), backcalculating from the bottom left node will result in a high estimate of \(b\), which the same process will result in lower estimates of \(b\) from the rightmost leaf. The multiplier method is estimating a balance of these, weighted slightly more heavily in favor of the smaller estimates as these are the result of a shorter path to known data.

Bayesian Model

Using the uniform distributions with the Bayesian model, we get the following posterior plot, where the mode is providing a good estimate of the true value of node \(b\): Posterior of N, uniform p,q

Peaked q

Distribution

For the distribution with \(q\) peaked near 1, we expect to see a better estimate of the true value than with the uniform choice of \(q\), since the cohort data informs a branching probability of \(\approx 94\%\). This is indeed the case, as we see a mode around \(N=34100\):

dens_plot(1, 1, 150, 1, 33640, 448, 34000, 34200, 5)

When \(q\) is peaked “incorrectly”, with mass around 0:

dens_plot(1, 1, 1, 150, 33640, 448, 34000, 60000, 1000)

We can see in this case the mass of the distribution is moved to high values of \(N\), and we don’t have much hope of recovering an estimate close to the true value.

Multiplier Method

We now use the weighted MM with \(q\) peaked near 0:

# adjust data
# only Population = TRUE values are used and must be changed 
peakedq0 <- treedata
peakedq0[2,3] <- 0
peakedq0[2,4] <- 0
peakedq0[3,3] <- 0
peakedq0[3,4] <- 149 
peakedq0

# calculate estimates from A,B using 5000 simulated p,q values
tree <- FromDataFrameNetwork(peakedq0)
weightedPopTree(tree,sample_length=5000,method='weightedEstimate')

## [1] "using weighted mean estimates"

tree$Do(function(node) {
  if(node$isRoot){
    node$uncertainty <-pop.conf.int(tree)
    }
  else{
    node$uncertainty <- prettyConfidenceIntervals(node$overdose_samples)
    }}
  )

# create dataframe of N estimates from each leaf
x <- tree$Get('overdose_samples', filterFun = isLeaf, traversal = 'post-order')
# Ignore NAs and then also choose leaves with population=TRUE values only (these are the data)
if(is.list(x)){    
  mat.x <- NULL
  for (i in 1:length(x)) {
    if (length(x[[i]] > 0)) {
      mat.x <- cbind(mat.x, x[[i]])
      colnames(mat.x)[dim(mat.x)[2]] <- rownames(x)[[i]]
    }
    }
  x <- mat.x
  }
x <- as.data.frame(x)
  
# rename column to be of the correct leaves (with population=TRUE values)
getleaves <- which(tree$Get('Population', filterFun = isLeaf, traversal = 'post-order'))
pop.x <- x %>% 
  select(all_of(getleaves))

pop.x <- round(pop.x)
peakedq0.pop.x <- pop.x

# generate values \hat{N} to plot
weights <- as.matrix(pop.ko.weights(tree))
peakedq0weights <- weights
peakedq0Ns <- as.data.frame(as.matrix(pop.x) %*% t(weights))
peakedq0Ns <- round(peakedq0Ns)
peakedq0Ns.restrict <- peakedq0Ns[peakedq0Ns$V1>=30000 & peakedq0Ns$V1<=900000, , drop = FALSE]
p <- ggplot(peakedq0Ns.restrict, aes(x = V1)) + geom_density(alpha=.2, fill="#FF6666") + 
   ggtitle("MM estimates for peaked (0) distribution for q, uniform for p", subtitle = "One MM pass and set of weights, 5000 p,q estimates")
p

summary(peakedq0Ns)

##        V1           
##  Min.   :2.408e+05  
##  1st Qu.:1.448e+06  
##  Median :2.805e+06  
##  Mean   :1.514e+07  
##  3rd Qu.:6.654e+06  
##  Max.   :4.436e+09

peakedq0weights

##                         
## [1,] 0.3809853 0.6190147

As expected, we expect the mode of this distribution to be overestimated. In this case, it is around \(N=80000\).

Now when the distribution of \(q\) is peaked near 1, we expect to see good estimation of node \(b\):

# adjust data
# only Population = TRUE values are used and must be changed 
peakedq1 <- treedata
peakedq1[2,3] <- 0
peakedq1[2,4] <- 0
peakedq1[3,3] <- 149
peakedq1[3,4] <- 150 
peakedq1

# calculate estimates from A,B using 5000 simulated p,q values
tree <- FromDataFrameNetwork(peakedq1)
weightedPopTree(tree,sample_length=5000,method='weightedEstimate')

## [1] "using weighted mean estimates"

tree$Do(function(node) {
  if(node$isRoot){
    node$uncertainty <-pop.conf.int(tree)
    }
  else{
    node$uncertainty <- prettyConfidenceIntervals(node$overdose_samples)
    }}
  )

# create dataframe of N estimates from each leaf
x <- tree$Get('overdose_samples', filterFun = isLeaf, traversal = 'post-order')
# Ignore NAs and then also choose leaves with population=TRUE values only (these are the data)
if(is.list(x)){    
  mat.x <- NULL
  for (i in 1:length(x)) {
    if (length(x[[i]] > 0)) {
      mat.x <- cbind(mat.x, x[[i]])
      colnames(mat.x)[dim(mat.x)[2]] <- rownames(x)[[i]]
    }
    }
  x <- mat.x
  }
x <- as.data.frame(x)
  
# rename column to be of the correct leaves (with population=TRUE values)
getleaves <- which(tree$Get('Population', filterFun = isLeaf, traversal = 'post-order'))
pop.x <- x %>% 
  select(all_of(getleaves))

pop.x <- round(pop.x)
peakedq1.pop.x <- pop.x

# generate values \hat{N} to plot
weights <- as.matrix(pop.ko.weights(tree))
peakedq1weights <- weights
peakedq1Ns <- as.data.frame(as.matrix(pop.x) %*% t(weights))
peakedq1Ns <- round(peakedq1Ns)
peakedq1Ns.restrict <- peakedq1Ns[peakedq1Ns$V1>=30000 & peakedq1Ns$V1<=50000, , drop = FALSE]
p <- ggplot(peakedq1Ns.restrict, aes(x = V1)) + geom_density(alpha=.2, fill="#FF6666") + 
   ggtitle("MM estimates for uniform peaked (1) for q, uniform for p", subtitle = "One MM pass and set of weights, 5000 p,q estimates")
p

summary(peakedq1Ns)

##        V1        
##  Min.   : 27748  
##  1st Qu.: 31821  
##  Median : 32868  
##  Mean   : 33333  
##  3rd Qu.: 33963  
##  Max.   :987819

peakedq1weights

##                           
## [1,] 0.9986495 0.001350467

In this case, we see good estimation of the population at node \(b\), with \(N \approx 33000\).

Bayesian Model

In the Bayesian model case, with \(q\) ‘incorrectly’ peaked near 0, we have the following plot: Posterior of N, peaked q (0)

For \(q\) peaked near 1, we have very good estimation, as expected, with mode in line with what is predicted using the density: Posterior of N, peaked q (1)

Middle case

Distribution

In the middle case, recall we use \(q \sim Beta(15,15)\). This distribution is bell-shaped and centered around 0.5, with standard deviation just under 0.1. In this case, we don’t do too badly; the mode sits around \(N=34600\):

dens_plot(1, 1, 15, 15, 33640, 448, 34000, 35500, 10)

Multiplier Method

# adjust data
# only Population = TRUE values are used and must be changed 
middle <- treedata
middle[2,3] <- 0
middle[2,4] <- 0
middle[3,3] <- 14
middle[3,4] <- 28 
middle

# calculate estimates from A,B using 5000 simulated p,q values
tree <- FromDataFrameNetwork(middle)
weightedPopTree(tree,sample_length=5000,method='weightedEstimate')

## [1] "using weighted mean estimates"

tree$Do(function(node) {
  if(node$isRoot){
    node$uncertainty <-pop.conf.int(tree)
    }
  else{
    node$uncertainty <- prettyConfidenceIntervals(node$overdose_samples)
    }}
  )

# create dataframe of N estimates from each leaf
x <- tree$Get('overdose_samples', filterFun = isLeaf, traversal = 'post-order')
# Ignore NAs and then also choose leaves with population=TRUE values only (these are the data)
if(is.list(x)){    
  mat.x <- NULL
  for (i in 1:length(x)) {
    if (length(x[[i]] > 0)) {
      mat.x <- cbind(mat.x, x[[i]])
      colnames(mat.x)[dim(mat.x)[2]] <- rownames(x)[[i]]
    }
    }
  x <- mat.x
  }
x <- as.data.frame(x)
  
# rename column to be of the correct leaves (with population=TRUE values)
getleaves <- which(tree$Get('Population', filterFun = isLeaf, traversal = 'post-order'))
pop.x <- x %>% 
  select(all_of(getleaves))

pop.x <- round(pop.x)
middle.pop.x <- pop.x

# generate values \hat{N} to plot
weights <- as.matrix(pop.ko.weights(tree))
middleweights <- weights
middleNs <- as.data.frame(as.matrix(pop.x) %*% t(weights))
middleNs <- round(middleNs)
middleNs.restrict <- middleNs[middleNs$V1>=30000 & middleNs$V1<=60000, , drop = FALSE]
p <- ggplot(middleNs.restrict, aes(x = V1)) + geom_density(alpha=.2, fill="#FF6666") + 
   ggtitle("MM estimates for middle case q, uniform for p", subtitle = "One MM pass and set of weights, 5000 p,q estimates")
p

summary(middleNs)

##        V1         
##  Min.   :  40138  
##  1st Qu.:  59178  
##  Median :  67377  
##  Mean   :  75884  
##  3rd Qu.:  78372  
##  Max.   :4184778

middleweights

##                         
## [1,] 0.9613264 0.0386736

While the posterior distribution withstands this choice of distribution for \(q\), the multiplier method is estimating a population around \(N=57500\), a large overestimate of the true value.

Bayesian Model

For the middle case distribution of \(q\), we have the following posterior using the Bayesian model in blang: Posterior of N, middle case distribution of q

We don’t yet have a clear trend here. However, rerunning the simulation longer to promote better convergence we do see more tendancy towards a mode, but it is not yet in agreeance with the mode in the density: Posterior of N, middle case distribution of q

Peaked p (variable),q (correct - static)

Distribution

Lastly, we estimate using a “correctly” peaked distribution of \(q\), and also adjust the distribution of \(p\) to be peaked around 0, which is close to the values suggested by the data, and also to be peaked around 1, to see the effect of poor priors on \(p\) on estimation of \(N\). The first case shows good estimation, with the mode around \(N=34100\):

dens_plot(1, 150, 150, 1, 33640, 448, 34000, 34200, 5)

In the second case, where the distribution of \(p\) is pushed mostly towards 1 but the actual \(p\) value from the cohort is \(\approx 0.01\), we interesting still see quite good estimation, with the mode of \(N = 34300\):

dens_plot(150, 1, 150, 1, 33640, 448, 34000, 36000, 10)

Essentially, we only see an increase in uncertainty when \(f(p)\) is poorly chosen, and not a significant change in a mode-based estimate \(\hat{N}\).

Multiplier Method

We begin by peaking the \(p\) distribution around 0, which should result in accurate estimates of node \(b\):

# adjust data
# only Population = TRUE values are used and must be changed 
peakedp0 <- treedata
peakedp0[2,3] <- 0
peakedp0[2,4] <- 149
peakedp0[3,3] <- 149
peakedp0[3,4] <- 150
peakedp0

# calculate estimates from A,B using 5000 simulated p,q values
tree <- FromDataFrameNetwork(peakedp0)
weightedPopTree(tree,sample_length=5000,method='weightedEstimate')

## [1] "using weighted mean estimates"

tree$Do(function(node) {
  if(node$isRoot){
    node$uncertainty <-pop.conf.int(tree)
    }
  else{
    node$uncertainty <- prettyConfidenceIntervals(node$overdose_samples)
    }}
  )

# create dataframe of N estimates from each leaf
x <- tree$Get('overdose_samples', filterFun = isLeaf, traversal = 'post-order')
# Ignore NAs and then also choose leaves with population=TRUE values only (these are the data)
if(is.list(x)){    
  mat.x <- NULL
  for (i in 1:length(x)) {
    if (length(x[[i]] > 0)) {
      mat.x <- cbind(mat.x, x[[i]])
      colnames(mat.x)[dim(mat.x)[2]] <- rownames(x)[[i]]
    }
    }
  x <- mat.x
  }
x <- as.data.frame(x)
  
# rename column to be of the correct leaves (with population=TRUE values)
getleaves <- which(tree$Get('Population', filterFun = isLeaf, traversal = 'post-order'))
pop.x <- x %>% 
  select(all_of(getleaves))

pop.x <- round(pop.x)
peakedp0.pop.x <- pop.x

# generate values \hat{N} to plot
weights <- as.matrix(pop.ko.weights(tree))
peakedp0weights <- weights
peakedp0Ns <- as.data.frame(as.matrix(pop.x) %*% t(weights))
peakedp0Ns <- round(peakedp0Ns)
peakedp0Ns.restrict <- peakedp0Ns[peakedp0Ns$V1>=30000 & peakedp0Ns$V1<=60000, , drop = FALSE]
p <- ggplot(peakedp0Ns.restrict, aes(x = V1)) + geom_density(alpha=.2, fill="#FF6666") + 
   ggtitle("MM estimates for peaked q, peaked (0) p", subtitle = "One MM pass and set of weights, 5000 p,q estimates")
p

summary(peakedp0Ns)

##        V1           
##  Min.   :3.061e+04  
##  1st Qu.:3.862e+04  
##  Median :4.448e+04  
##  Mean   :3.186e+05  
##  3rd Qu.:6.046e+04  
##  Max.   :1.106e+09

peakedp0weights

##                           
## [1,] 0.9984049 0.001595073

We do see an overestimate in this case, with a mode close to \(N=37000\).
When the distribution of \(p\) is instead peaked around 1, we may see more overestimation, as the true value of \(p\) is very small:

# adjust data
# only Population = TRUE values are used and must be changed 
peakedp1 <- treedata
peakedp1[2,3] <- 149
peakedp1[2,4] <- 150
peakedp1[3,3] <- 149
peakedp1[3,4] <- 150
peakedp1

# calculate estimates from A,B using 5000 simulated p,q values
tree <- FromDataFrameNetwork(peakedp1)
weightedPopTree(tree,sample_length=5000,method='weightedEstimate')

## [1] "using weighted mean estimates"

tree$Do(function(node) {
  if(node$isRoot){
    node$uncertainty <-pop.conf.int(tree)
    }
  else{
    node$uncertainty <- prettyConfidenceIntervals(node$overdose_samples)
    }}
  )

# create dataframe of N estimates from each leaf
x <- tree$Get('overdose_samples', filterFun = isLeaf, traversal = 'post-order')
# Ignore NAs and then also choose leaves with population=TRUE values only (these are the data)
if(is.list(x)){    
  mat.x <- NULL
  for (i in 1:length(x)) {
    if (length(x[[i]] > 0)) {
      mat.x <- cbind(mat.x, x[[i]])
      colnames(mat.x)[dim(mat.x)[2]] <- rownames(x)[[i]]
    }
    }
  x <- mat.x
  }
x <- as.data.frame(x)
  
# rename column to be of the correct leaves (with population=TRUE values)
getleaves <- which(tree$Get('Population', filterFun = isLeaf, traversal = 'post-order'))
pop.x <- x %>% 
  select(all_of(getleaves))

pop.x <- round(pop.x)
peakedp1.pop.x <- pop.x

# generate values \hat{N} to plot
weights <- as.matrix(pop.ko.weights(tree))
peakedp1weights <- weights
peakedp1Ns <- as.data.frame(as.matrix(pop.x) %*% t(weights))
peakedp1Ns <- round(peakedp1Ns)
peakedp1Ns.restrict <- peakedp1Ns[peakedp1Ns$V1>=30000 & peakedp1Ns$V1<=60000, , drop = FALSE]
p <- ggplot(peakedp1Ns.restrict, aes(x = V1)) + geom_density(alpha=.2, fill="#FF6666") + 
   ggtitle("MM estimates for peaked q, peaked (1) for p", subtitle = "One MM pass and set of weights, 5000 p,q estimates")
p

summary(peakedp1Ns)

##        V1       
##  Min.   :33476  
##  1st Qu.:33808  
##  Median :33973  
##  Mean   :34037  
##  3rd Qu.:34187  
##  Max.   :36084

peakedp1weights

##                         
## [1,] 0.0380689 0.9619311

Interestingly, we don’t see the over-estimation of \(N\) predicted. This indicates that this method, like the posterior, is less affected by the prior knowledge of the distribution of \(p\).

Bayesian Model

The following plots result using the Bayesian model for distribution of \(p\) peaked around 0 and then 1, respectively: Posterior of N, peaked distribution of p (0), q

Posterior of N, peaked distribution of p (1), q

As in the densities, we see both agree on a mode close to the true value. Again, although larger (inaccurate) estimates result in larger uncertainty in the case of an incorrect choice of \(f(p)\), an estimate based on the mode is largely unaffected by the choice of distribution for \(p\), and instead this results only in larger uncertainty surrounding our estimate \(\hat{N}\).

Peaked p (correct),q (incorrect)

To explore this a little further, we check to see how the mode-based estimate is affected if \(p\) is mostly correct, but \(q\) is not. For this, we choose \(p\) peaked close to 0 and choose \(q\) peaked close to 0 also.

Distribution

As in the case of uniform \(f(p)\), a better choice of this distribution does not significantly improve estiamtes of \(N\) if \(f(q)\) is poorly chosen:

dens_plot(1, 150, 1, 150, 33640, 448, 36000, 40000, 100)

Multiplier Method

For a good choice of distribution of \(f(p)\), we see that a poor choice of \(f(q)\) results in large overestimation using the weighted MM, much like the case for incorrect \(f(q)\) and uniform \(f(p)\):

# adjust data
# only Population = TRUE values are used and must be changed 
peakedp0q0 <- treedata
peakedp0q0[2,3] <- 0
peakedp0q0[2,4] <- 149
peakedp0q0[3,3] <- 0
peakedp0q0[3,4] <- 149
peakedp0q0

# calculate estimates from A,B using 5000 simulated p,q values
tree <- FromDataFrameNetwork(peakedp0q0)
weightedPopTree(tree,sample_length=5000,method='weightedEstimate')

## [1] "using weighted mean estimates"

tree$Do(function(node) {
  if(node$isRoot){
    node$uncertainty <-pop.conf.int(tree)
    }
  else{
    node$uncertainty <- prettyConfidenceIntervals(node$overdose_samples)
    }}
  )

# create dataframe of N estimates from each leaf
x <- tree$Get('overdose_samples', filterFun = isLeaf, traversal = 'post-order')
# Ignore NAs and then also choose leaves with population=TRUE values only (these are the data)
if(is.list(x)){    
  mat.x <- NULL
  for (i in 1:length(x)) {
    if (length(x[[i]] > 0)) {
      mat.x <- cbind(mat.x, x[[i]])
      colnames(mat.x)[dim(mat.x)[2]] <- rownames(x)[[i]]
    }
    }
  x <- mat.x
  }
x <- as.data.frame(x)
  
# rename column to be of the correct leaves (with population=TRUE values)
getleaves <- which(tree$Get('Population', filterFun = isLeaf, traversal = 'post-order'))
pop.x <- x %>% 
  select(all_of(getleaves))

pop.x <- round(pop.x)
peakedp0q0.pop.x <- pop.x

# generate values \hat{N} to plot
weights <- as.matrix(pop.ko.weights(tree))
peakedp0q0weights <- weights
peakedp0q0Ns <- as.data.frame(as.matrix(pop.x) %*% t(weights))
peakedp0q0Ns <- round(peakedp0q0Ns)
peakedp0q0Ns.restrict <- peakedp0q0Ns[peakedp0q0Ns$V1>=30000 & peakedp0q0Ns$V1<=6000000, , drop = FALSE]
p <- ggplot(peakedp0q0Ns.restrict, aes(x = V1)) + geom_density(alpha=.2, fill="#FF6666") + 
   ggtitle("MM estimates for peaked (0) q, peaked (0) for p", subtitle = "One MM pass and set of weights, 5000 p,q estimates")
p

summary(peakedp0q0Ns)

##        V1           
##  Min.   :1.267e+06  
##  1st Qu.:5.457e+06  
##  Median :9.792e+06  
##  Mean   :3.974e+07  
##  3rd Qu.:2.107e+07  
##  Max.   :9.509e+09

peakedp0q0weights

##                         
## [1,] 0.4960676 0.5039324

Bayesian Model

Finally, the Bayesian model also overestimates \(N\) when \(f(p)\) is well specified by \(f(q)\) is poorly chosen: Posterior of N, peaked distribution of p (0), q (0) While there is overestimation in the value of node \(b\) when \(f(q)\) is misspecified, these distributions show clear improvement on the estimates generated with the same choice \(f(q)\) but uniform \(f(p)\).

It should also be noted that while the distribution and the Bayesian model do not show the same level of overestimation, these estimates are constrained by the choice of range for the discrete uniform prior on \(N\).

Methods Testing on Lower Tree

24/09/2021

Estimation using data values

Distribution

Multiplier Method

Bayesian Model

Uniform p,q

Distribution

Multiplier Method

Bayesian Model

Peaked q

Distribution

Multiplier Method

Bayesian Model

Middle case

Distribution

Multiplier Method

Bayesian Model

Peaked p (variable),q (correct - static)

Distribution

Multiplier Method

Bayesian Model

Peaked p (correct),q (incorrect)

Distribution

Multiplier Method

Bayesian Model