Here we use the opioid tree to test the three methods - distribution plotting, weighted multiplier method (MM), and Bayesian model. First, we import the opioid data and delete the healthcare unattended branch entirely.
treedata <- read_csv("diagram_data_hc_nohc.csv")
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## from = col_character(),
## to = col_character(),
## Estimate = col_double(),
## Total = col_double(),
## Population = col_logical(),
## Description = col_character()
## )
# adjust nodes g, g1, g2 to become one node (currently ignoring
# underreporting here); sum all estimate from g nodes and delete g1, g2 from tree
treedata$Estimate[3] <- sum(treedata$Estimate[3],treedata$Estimate[4],
treedata$Estimate[5])
treedata$Total[6:9] <- treedata$Estimate[3]
treedata <- treedata[-c(4,5),]
# delete healthcare unattended branch
treedata <- treedata[-c(16,17,18,19,20),]
# adjust tree so that it looks like small subtree used in intuitive notebooks -
# here we combine "Admitted to ED" and "attended by EMS but not transported".
treedata$Estimate[3] <- sum(treedata$Estimate[3],treedata$Estimate[2])
treedata$Total[4:7] <- treedata$Estimate[3]
treedata$Description[3] <- "Admitted to ED or not transported by EMS"
treedata <- treedata[-2,]
# adjust population values so only certain leaves are considered as data
treedata$Population <- c(FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE)
# show data and tree
treedata %>%
knitr::kable() %>%
kable_styling()
from | to | Estimate | Total | Population | Description |
---|---|---|---|---|---|
b | e | 473 | 34113 | FALSE | Attended by EMS but dies(in transport, on scene) |
b | g | 33640 | 34113 | TRUE | Admitted to ED or not transported by EMS |
g | m | 45 | 33640 | FALSE | Died in ED or after LWBS |
g | n | 12907 | 33640 | FALSE | Released (assume survival) |
g | o | 2376 | 33640 | FALSE | Admitted to hospital |
g | p | NA | 33640 | FALSE | Underdiagnosed and under-reported hospitalizations |
o | q | 106 | 2376 | FALSE | Died in hospital |
o | r | 2270 | 2376 | FALSE | Released (assumed survived) |
e | w | 448 | 473 | TRUE | Reported by BCCS |
e | x | 25 | 473 | FALSE | Under-reported as opioid OD death1 |
m | y | 43 | 45 | FALSE | Reported by BCCS |
m | z | 2 | 45 | FALSE | Under-diagnosed and under-notified |
q | ab | 98 | 106 | FALSE | Reported by BCCS |
q | ac | 8 | 106 | FALSE | Under-diagnosed and under-notified |
xo | b | 1309 | 2350 | FALSE | Healthcare attended |
treedata$TrueEst <- treedata$Estimate
tree <- FromDataFrameNetwork(treedata)
tree$Do(function(node){
node$probability <- round(node$Estimate/node$Total,4)
})
SetGraphStyle(tree,scale=2)
SetEdgeStyle(tree,penwidth=3)
SetEdgeStyle(tree, arrowhead = "vee", color = "grey35", penwidth = 2,
label = function(node) node$probability)
SetNodeStyle(tree, fontsize=25, penwidth=3,width=1,
label = function(node) node$Description) #try with node$From to compare to data
plot(tree)
We adjust the data so that the tree resembles our simple subtree as in the intuition
notebooks. We do so by assuming an estimate of the node on the right after the first binary split, “Admitted to ED or not transported by EMS”, is known. We eliminate all nodes below this one for this exercise. The other node assumed to be known is the endpoint of the left-most path, “Reported by BCCS”, from those “Admitted by EMS but dies”. The cropped tree is as follows, where the root is now node \(b\), “Healthcare attended overdoses”:
# cropped nodes below "admitted to ED or not transported by EMS"
treedata <- treedata[-c(3:8,11:15),]
# show tree and data
treedata %>%
knitr::kable() %>%
kable_styling()
from | to | Estimate | Total | Population | Description | TrueEst |
---|---|---|---|---|---|---|
b | e | 473 | 34113 | FALSE | Attended by EMS but dies(in transport, on scene) | 473 |
b | g | 33640 | 34113 | TRUE | Admitted to ED or not transported by EMS | 33640 |
e | w | 448 | 473 | TRUE | Reported by BCCS | 448 |
e | x | 25 | 473 | FALSE | Under-reported as opioid OD death1 | 25 |
treedata$TrueEst <- treedata$Estimate
tree <- FromDataFrameNetwork(treedata)
tree$Do(function(node){
node$probability <- round(node$Estimate/node$Total,4)
})
SetGraphStyle(tree,scale=2)
SetEdgeStyle(tree,penwidth=3)
SetEdgeStyle(tree, arrowhead = "vee", color = "grey35", penwidth = 2,
label = function(node) node$probability)
SetNodeStyle(tree, fontsize=25, penwidth=3,width=1,
label = function(node) node$Description)
plot(tree)
Now we generate a root estimate of \(b\) assuming only the values at the left- and right-most leaves. Maintaining the convention that the path probability to the right-most leaf is \(p\) and that to the left-most leaf is \(q\), we have a subtree analogous to the simulation, but this time we have a cohort estimate of the root with which to compare the estimates given by each method. In this case, we know the cohort value at \(b\) to be 34113. We test each method’s sensitivity to the distributions of \(p,q\) in estimating node \(b\). In each case, we begin by using the actual values generating \(p,q\) to establish adequate estimation of \(b\) when the distributions of \(p,q\) are accurate with respect to the data. Next, we proceed much like in the “intuition notebook 1” - estimate \(b\) using uniform \(p,q\), followed by changing to the peaked distributions for \(q\), then a middle case distribution of \(q\), and finally we will try a peaked distributions of both \(p\) and \(q\), since this closely resembles observed branching structure of this subtree.
For these high values of \(N\), we must plot the log density for computational reasons, then revert. The density when using the calculated posterior results in the following plot, with mode around \(\hat{N} = 34100\), as expected:
source("intuitiveexercise.R")
dens_plot(473 + 1, 34113 - 473 + 1, 448 + 1, 473 - 448 + 1, 33640, 448, 34075, 34150, 1)
We now use the weighted multiplier method to plot a comparable distribution. We use 5000 samples of \(p,q\) to generate weights, and then calculate and plot the 5000 corresponding estimates of \(N\) generated with paired samples \(p,q\):
# calculate estimates from A,B using 5000 simulated p,q values
tree <- FromDataFrameNetwork(treedata)
weightedPopTree(tree,sample_length=5000,method='weightedEstimate')
## [1] "using weighted mean estimates"
tree$Do(function(node) {
if(node$isRoot){
node$uncertainty <-pop.conf.int(tree)
}
else{
node$uncertainty <- prettyConfidenceIntervals(node$overdose_samples)
}}
)
# create dataframe of N estimates from each leaf
x <- tree$Get('overdose_samples', filterFun = isLeaf, traversal = 'post-order')
# Ignore NAs and then also choose leaves with population=TRUE values only (these are the data)
if(is.list(x)){
mat.x <- NULL
for (i in 1:length(x)) {
if (length(x[[i]] > 0)) {
mat.x <- cbind(mat.x, x[[i]])
colnames(mat.x)[dim(mat.x)[2]] <- rownames(x)[[i]]
}
}
x <- mat.x
}
x <- as.data.frame(x)
# rename column to be of the correct leaves (with population=TRUE values)
getleaves <- which(tree$Get('Population', filterFun = isLeaf, traversal = 'post-order'))
pop.x <- x %>%
select(all_of(getleaves))
pop.x <- round(pop.x)
data.pop.x <- pop.x
# generate values \hat{N} to plot
weights <- as.matrix(pop.ko.weights(tree))
dataweights <- weights
dataNs <- as.data.frame(as.matrix(pop.x) %*% t(weights))
dataNs <- round(dataNs)
dataNs.restrict <- dataNs[dataNs$V1>=30000 & dataNs$V1<=60000, , drop = FALSE]
p <- ggplot(dataNs.restrict, aes(x = V1)) + geom_density(alpha=.2, fill="#FF6666") +
ggtitle("MM estimates for data-informed distribution for p,q", subtitle = "One MM pass and set of weights, 5000 p,q estimates")
p
We see the mode of the distribution accurately falls around \(N=34120\).
Using the Bayesian model, we get the following posterior plot of \(N\), the population at node \(b\):
Here we see the mode clearly sits around the true value given by the data.
When we assume information about \(p,q\) is not readily available or reliable, we may be tempted to use a uniform distribution for these branching probabilities. The following density plot of the estimates \(\hat{N}\) results:
dens_plot(1, 1, 1, 1, 33640, 448, 33750, 45000, 250)
With no additional information about \(p,q\), we have a slight overestimation of the actual value \(N\) at node \(b\), with a mode around \(N=36500\).
# adjust data
# only Popultaion = TRUE values are used and must be changed
uniformdata <- treedata
uniformdata[2,3] <- 0
uniformdata[2,4] <- 0
uniformdata[3,3] <- 0
uniformdata[3,4] <- 0
uniformdata
# calculate estimates from A,B using 5000 simulated p,q values
tree <- FromDataFrameNetwork(uniformdata)
weightedPopTree(tree,sample_length=5000,method='weightedEstimate')
## [1] "using weighted mean estimates"
tree$Do(function(node) {
if(node$isRoot){
node$uncertainty <-pop.conf.int(tree)
}
else{
node$uncertainty <- prettyConfidenceIntervals(node$overdose_samples)
}}
)
# create dataframe of N estimates from each leaf
x <- tree$Get('overdose_samples', filterFun = isLeaf, traversal = 'post-order')
# Ignore NAs and then also choose leaves with population=TRUE values only (these are the data)
if(is.list(x)){
mat.x <- NULL
for (i in 1:length(x)) {
if (length(x[[i]] > 0)) {
mat.x <- cbind(mat.x, x[[i]])
colnames(mat.x)[dim(mat.x)[2]] <- rownames(x)[[i]]
}
}
x <- mat.x
}
x <- as.data.frame(x)
# rename column to be of the correct leaves (with population=TRUE values)
getleaves <- which(tree$Get('Population', filterFun = isLeaf, traversal = 'post-order'))
pop.x <- x %>%
select(all_of(getleaves))
pop.x <- round(pop.x)
uniform.pop.x <- pop.x
# generate values \hat{N} to plot
weights <- as.matrix(pop.ko.weights(tree))
uniformweights <- weights
uniformNs <- as.data.frame(as.matrix(pop.x) %*% t(weights))
uniformNs <- round(uniformNs)
uniformNs.restrict <- uniformNs[uniformNs$V1>=30000 & uniformNs$V1<=60000, , drop = FALSE]
p <- ggplot(uniformNs.restrict, aes(x = V1)) + geom_density(alpha=.2, fill="#FF6666") +
ggtitle("MM estimates for uniform distributions for p,q", subtitle = "One MM pass and set of weights, 5000 p,q estimates")
p
summary(uniformNs)
## V1
## Min. : 32918
## 1st Qu.: 55138
## Median : 85767
## Mean : 411383
## 3rd Qu.: 162066
## Max. :496625385
uniformweights
##
## [1,] 0.500372 0.499628
The multiplier method does not predict a mode near the true value when the distributions of \(p,q\) are uniform. With uniform \(p,q\), backcalculating from the bottom left node will result in a high estimate of \(b\), which the same process will result in lower estimates of \(b\) from the rightmost leaf. The multiplier method is estimating a balance of these, weighted slightly more heavily in favor of the smaller estimates as these are the result of a shorter path to known data.
Using the uniform distributions with the Bayesian model, we get the following posterior plot, where the mode is providing a good estimate of the true value of node \(b\):
For the distribution with \(q\) peaked near 1, we expect to see a better estimate of the true value than with the uniform choice of \(q\), since the cohort data informs a branching probability of \(\approx 94\%\). This is indeed the case, as we see a mode around \(N=34100\):
dens_plot(1, 1, 150, 1, 33640, 448, 34000, 34200, 5)
When \(q\) is peaked “incorrectly”, with mass around 0:
dens_plot(1, 1, 1, 150, 33640, 448, 34000, 60000, 1000)
We can see in this case the mass of the distribution is moved to high values of \(N\), and we don’t have much hope of recovering an estimate close to the true value.
We now use the weighted MM with \(q\) peaked near 0:
# adjust data
# only Population = TRUE values are used and must be changed
peakedq0 <- treedata
peakedq0[2,3] <- 0
peakedq0[2,4] <- 0
peakedq0[3,3] <- 0
peakedq0[3,4] <- 149
peakedq0
# calculate estimates from A,B using 5000 simulated p,q values
tree <- FromDataFrameNetwork(peakedq0)
weightedPopTree(tree,sample_length=5000,method='weightedEstimate')
## [1] "using weighted mean estimates"
tree$Do(function(node) {
if(node$isRoot){
node$uncertainty <-pop.conf.int(tree)
}
else{
node$uncertainty <- prettyConfidenceIntervals(node$overdose_samples)
}}
)
# create dataframe of N estimates from each leaf
x <- tree$Get('overdose_samples', filterFun = isLeaf, traversal = 'post-order')
# Ignore NAs and then also choose leaves with population=TRUE values only (these are the data)
if(is.list(x)){
mat.x <- NULL
for (i in 1:length(x)) {
if (length(x[[i]] > 0)) {
mat.x <- cbind(mat.x, x[[i]])
colnames(mat.x)[dim(mat.x)[2]] <- rownames(x)[[i]]
}
}
x <- mat.x
}
x <- as.data.frame(x)
# rename column to be of the correct leaves (with population=TRUE values)
getleaves <- which(tree$Get('Population', filterFun = isLeaf, traversal = 'post-order'))
pop.x <- x %>%
select(all_of(getleaves))
pop.x <- round(pop.x)
peakedq0.pop.x <- pop.x
# generate values \hat{N} to plot
weights <- as.matrix(pop.ko.weights(tree))
peakedq0weights <- weights
peakedq0Ns <- as.data.frame(as.matrix(pop.x) %*% t(weights))
peakedq0Ns <- round(peakedq0Ns)
peakedq0Ns.restrict <- peakedq0Ns[peakedq0Ns$V1>=30000 & peakedq0Ns$V1<=900000, , drop = FALSE]
p <- ggplot(peakedq0Ns.restrict, aes(x = V1)) + geom_density(alpha=.2, fill="#FF6666") +
ggtitle("MM estimates for peaked (0) distribution for q, uniform for p", subtitle = "One MM pass and set of weights, 5000 p,q estimates")
p
summary(peakedq0Ns)
## V1
## Min. :2.408e+05
## 1st Qu.:1.448e+06
## Median :2.805e+06
## Mean :1.514e+07
## 3rd Qu.:6.654e+06
## Max. :4.436e+09
peakedq0weights
##
## [1,] 0.3809853 0.6190147
As expected, we expect the mode of this distribution to be overestimated. In this case, it is around \(N=80000\).
Now when the distribution of \(q\) is peaked near 1, we expect to see good estimation of node \(b\):
# adjust data
# only Population = TRUE values are used and must be changed
peakedq1 <- treedata
peakedq1[2,3] <- 0
peakedq1[2,4] <- 0
peakedq1[3,3] <- 149
peakedq1[3,4] <- 150
peakedq1
# calculate estimates from A,B using 5000 simulated p,q values
tree <- FromDataFrameNetwork(peakedq1)
weightedPopTree(tree,sample_length=5000,method='weightedEstimate')
## [1] "using weighted mean estimates"
tree$Do(function(node) {
if(node$isRoot){
node$uncertainty <-pop.conf.int(tree)
}
else{
node$uncertainty <- prettyConfidenceIntervals(node$overdose_samples)
}}
)
# create dataframe of N estimates from each leaf
x <- tree$Get('overdose_samples', filterFun = isLeaf, traversal = 'post-order')
# Ignore NAs and then also choose leaves with population=TRUE values only (these are the data)
if(is.list(x)){
mat.x <- NULL
for (i in 1:length(x)) {
if (length(x[[i]] > 0)) {
mat.x <- cbind(mat.x, x[[i]])
colnames(mat.x)[dim(mat.x)[2]] <- rownames(x)[[i]]
}
}
x <- mat.x
}
x <- as.data.frame(x)
# rename column to be of the correct leaves (with population=TRUE values)
getleaves <- which(tree$Get('Population', filterFun = isLeaf, traversal = 'post-order'))
pop.x <- x %>%
select(all_of(getleaves))
pop.x <- round(pop.x)
peakedq1.pop.x <- pop.x
# generate values \hat{N} to plot
weights <- as.matrix(pop.ko.weights(tree))
peakedq1weights <- weights
peakedq1Ns <- as.data.frame(as.matrix(pop.x) %*% t(weights))
peakedq1Ns <- round(peakedq1Ns)
peakedq1Ns.restrict <- peakedq1Ns[peakedq1Ns$V1>=30000 & peakedq1Ns$V1<=50000, , drop = FALSE]
p <- ggplot(peakedq1Ns.restrict, aes(x = V1)) + geom_density(alpha=.2, fill="#FF6666") +
ggtitle("MM estimates for uniform peaked (1) for q, uniform for p", subtitle = "One MM pass and set of weights, 5000 p,q estimates")
p
summary(peakedq1Ns)
## V1
## Min. : 27748
## 1st Qu.: 31821
## Median : 32868
## Mean : 33333
## 3rd Qu.: 33963
## Max. :987819
peakedq1weights
##
## [1,] 0.9986495 0.001350467
In this case, we see good estimation of the population at node \(b\), with \(N \approx 33000\).
In the Bayesian model case, with \(q\) ‘incorrectly’ peaked near 0, we have the following plot:
For \(q\) peaked near 1, we have very good estimation, as expected, with mode in line with what is predicted using the density:
In the middle case, recall we use \(q \sim Beta(15,15)\). This distribution is bell-shaped and centered around 0.5, with standard deviation just under 0.1. In this case, we don’t do too badly; the mode sits around \(N=34600\):
dens_plot(1, 1, 15, 15, 33640, 448, 34000, 35500, 10)
# adjust data
# only Population = TRUE values are used and must be changed
middle <- treedata
middle[2,3] <- 0
middle[2,4] <- 0
middle[3,3] <- 14
middle[3,4] <- 28
middle
# calculate estimates from A,B using 5000 simulated p,q values
tree <- FromDataFrameNetwork(middle)
weightedPopTree(tree,sample_length=5000,method='weightedEstimate')
## [1] "using weighted mean estimates"
tree$Do(function(node) {
if(node$isRoot){
node$uncertainty <-pop.conf.int(tree)
}
else{
node$uncertainty <- prettyConfidenceIntervals(node$overdose_samples)
}}
)
# create dataframe of N estimates from each leaf
x <- tree$Get('overdose_samples', filterFun = isLeaf, traversal = 'post-order')
# Ignore NAs and then also choose leaves with population=TRUE values only (these are the data)
if(is.list(x)){
mat.x <- NULL
for (i in 1:length(x)) {
if (length(x[[i]] > 0)) {
mat.x <- cbind(mat.x, x[[i]])
colnames(mat.x)[dim(mat.x)[2]] <- rownames(x)[[i]]
}
}
x <- mat.x
}
x <- as.data.frame(x)
# rename column to be of the correct leaves (with population=TRUE values)
getleaves <- which(tree$Get('Population', filterFun = isLeaf, traversal = 'post-order'))
pop.x <- x %>%
select(all_of(getleaves))
pop.x <- round(pop.x)
middle.pop.x <- pop.x
# generate values \hat{N} to plot
weights <- as.matrix(pop.ko.weights(tree))
middleweights <- weights
middleNs <- as.data.frame(as.matrix(pop.x) %*% t(weights))
middleNs <- round(middleNs)
middleNs.restrict <- middleNs[middleNs$V1>=30000 & middleNs$V1<=60000, , drop = FALSE]
p <- ggplot(middleNs.restrict, aes(x = V1)) + geom_density(alpha=.2, fill="#FF6666") +
ggtitle("MM estimates for middle case q, uniform for p", subtitle = "One MM pass and set of weights, 5000 p,q estimates")
p
summary(middleNs)
## V1
## Min. : 40138
## 1st Qu.: 59178
## Median : 67377
## Mean : 75884
## 3rd Qu.: 78372
## Max. :4184778
middleweights
##
## [1,] 0.9613264 0.0386736
While the posterior distribution withstands this choice of distribution for \(q\), the multiplier method is estimating a population around \(N=57500\), a large overestimate of the true value.
For the middle case distribution of \(q\), we have the following posterior using the Bayesian model in blang
:
We don’t yet have a clear trend here. However, rerunning the simulation longer to promote better convergence we do see more tendancy towards a mode, but it is not yet in agreeance with the mode in the density:
Lastly, we estimate using a “correctly” peaked distribution of \(q\), and also adjust the distribution of \(p\) to be peaked around 0, which is close to the values suggested by the data, and also to be peaked around 1, to see the effect of poor priors on \(p\) on estimation of \(N\). The first case shows good estimation, with the mode around \(N=34100\):
dens_plot(1, 150, 150, 1, 33640, 448, 34000, 34200, 5)
In the second case, where the distribution of \(p\) is pushed mostly towards 1 but the actual \(p\) value from the cohort is \(\approx 0.01\), we interesting still see quite good estimation, with the mode of \(N = 34300\):
dens_plot(150, 1, 150, 1, 33640, 448, 34000, 36000, 10)
Essentially, we only see an increase in uncertainty when \(f(p)\) is poorly chosen, and not a significant change in a mode-based estimate \(\hat{N}\).
We begin by peaking the \(p\) distribution around 0, which should result in accurate estimates of node \(b\):
# adjust data
# only Population = TRUE values are used and must be changed
peakedp0 <- treedata
peakedp0[2,3] <- 0
peakedp0[2,4] <- 149
peakedp0[3,3] <- 149
peakedp0[3,4] <- 150
peakedp0
# calculate estimates from A,B using 5000 simulated p,q values
tree <- FromDataFrameNetwork(peakedp0)
weightedPopTree(tree,sample_length=5000,method='weightedEstimate')
## [1] "using weighted mean estimates"
tree$Do(function(node) {
if(node$isRoot){
node$uncertainty <-pop.conf.int(tree)
}
else{
node$uncertainty <- prettyConfidenceIntervals(node$overdose_samples)
}}
)
# create dataframe of N estimates from each leaf
x <- tree$Get('overdose_samples', filterFun = isLeaf, traversal = 'post-order')
# Ignore NAs and then also choose leaves with population=TRUE values only (these are the data)
if(is.list(x)){
mat.x <- NULL
for (i in 1:length(x)) {
if (length(x[[i]] > 0)) {
mat.x <- cbind(mat.x, x[[i]])
colnames(mat.x)[dim(mat.x)[2]] <- rownames(x)[[i]]
}
}
x <- mat.x
}
x <- as.data.frame(x)
# rename column to be of the correct leaves (with population=TRUE values)
getleaves <- which(tree$Get('Population', filterFun = isLeaf, traversal = 'post-order'))
pop.x <- x %>%
select(all_of(getleaves))
pop.x <- round(pop.x)
peakedp0.pop.x <- pop.x
# generate values \hat{N} to plot
weights <- as.matrix(pop.ko.weights(tree))
peakedp0weights <- weights
peakedp0Ns <- as.data.frame(as.matrix(pop.x) %*% t(weights))
peakedp0Ns <- round(peakedp0Ns)
peakedp0Ns.restrict <- peakedp0Ns[peakedp0Ns$V1>=30000 & peakedp0Ns$V1<=60000, , drop = FALSE]
p <- ggplot(peakedp0Ns.restrict, aes(x = V1)) + geom_density(alpha=.2, fill="#FF6666") +
ggtitle("MM estimates for peaked q, peaked (0) p", subtitle = "One MM pass and set of weights, 5000 p,q estimates")
p
summary(peakedp0Ns)
## V1
## Min. :3.061e+04
## 1st Qu.:3.862e+04
## Median :4.448e+04
## Mean :3.186e+05
## 3rd Qu.:6.046e+04
## Max. :1.106e+09
peakedp0weights
##
## [1,] 0.9984049 0.001595073
We do see an overestimate in this case, with a mode close to \(N=37000\).
When the distribution of \(p\) is instead peaked around 1, we may see more overestimation, as the true value of \(p\) is very small:
# adjust data
# only Population = TRUE values are used and must be changed
peakedp1 <- treedata
peakedp1[2,3] <- 149
peakedp1[2,4] <- 150
peakedp1[3,3] <- 149
peakedp1[3,4] <- 150
peakedp1
# calculate estimates from A,B using 5000 simulated p,q values
tree <- FromDataFrameNetwork(peakedp1)
weightedPopTree(tree,sample_length=5000,method='weightedEstimate')
## [1] "using weighted mean estimates"
tree$Do(function(node) {
if(node$isRoot){
node$uncertainty <-pop.conf.int(tree)
}
else{
node$uncertainty <- prettyConfidenceIntervals(node$overdose_samples)
}}
)
# create dataframe of N estimates from each leaf
x <- tree$Get('overdose_samples', filterFun = isLeaf, traversal = 'post-order')
# Ignore NAs and then also choose leaves with population=TRUE values only (these are the data)
if(is.list(x)){
mat.x <- NULL
for (i in 1:length(x)) {
if (length(x[[i]] > 0)) {
mat.x <- cbind(mat.x, x[[i]])
colnames(mat.x)[dim(mat.x)[2]] <- rownames(x)[[i]]
}
}
x <- mat.x
}
x <- as.data.frame(x)
# rename column to be of the correct leaves (with population=TRUE values)
getleaves <- which(tree$Get('Population', filterFun = isLeaf, traversal = 'post-order'))
pop.x <- x %>%
select(all_of(getleaves))
pop.x <- round(pop.x)
peakedp1.pop.x <- pop.x
# generate values \hat{N} to plot
weights <- as.matrix(pop.ko.weights(tree))
peakedp1weights <- weights
peakedp1Ns <- as.data.frame(as.matrix(pop.x) %*% t(weights))
peakedp1Ns <- round(peakedp1Ns)
peakedp1Ns.restrict <- peakedp1Ns[peakedp1Ns$V1>=30000 & peakedp1Ns$V1<=60000, , drop = FALSE]
p <- ggplot(peakedp1Ns.restrict, aes(x = V1)) + geom_density(alpha=.2, fill="#FF6666") +
ggtitle("MM estimates for peaked q, peaked (1) for p", subtitle = "One MM pass and set of weights, 5000 p,q estimates")
p
summary(peakedp1Ns)
## V1
## Min. :33476
## 1st Qu.:33808
## Median :33973
## Mean :34037
## 3rd Qu.:34187
## Max. :36084
peakedp1weights
##
## [1,] 0.0380689 0.9619311
Interestingly, we don’t see the over-estimation of \(N\) predicted. This indicates that this method, like the posterior, is less affected by the prior knowledge of the distribution of \(p\).
The following plots result using the Bayesian model for distribution of \(p\) peaked around 0 and then 1, respectively:
Posterior of N, peaked distribution of p (1), q
As in the densities, we see both agree on a mode close to the true value. Again, although larger (inaccurate) estimates result in larger uncertainty in the case of an incorrect choice of \(f(p)\), an estimate based on the mode is largely unaffected by the choice of distribution for \(p\), and instead this results only in larger uncertainty surrounding our estimate \(\hat{N}\).
To explore this a little further, we check to see how the mode-based estimate is affected if \(p\) is mostly correct, but \(q\) is not. For this, we choose \(p\) peaked close to 0 and choose \(q\) peaked close to 0 also.
As in the case of uniform \(f(p)\), a better choice of this distribution does not significantly improve estiamtes of \(N\) if \(f(q)\) is poorly chosen:
dens_plot(1, 150, 1, 150, 33640, 448, 36000, 40000, 100)
For a good choice of distribution of \(f(p)\), we see that a poor choice of \(f(q)\) results in large overestimation using the weighted MM, much like the case for incorrect \(f(q)\) and uniform \(f(p)\):
# adjust data
# only Population = TRUE values are used and must be changed
peakedp0q0 <- treedata
peakedp0q0[2,3] <- 0
peakedp0q0[2,4] <- 149
peakedp0q0[3,3] <- 0
peakedp0q0[3,4] <- 149
peakedp0q0
# calculate estimates from A,B using 5000 simulated p,q values
tree <- FromDataFrameNetwork(peakedp0q0)
weightedPopTree(tree,sample_length=5000,method='weightedEstimate')
## [1] "using weighted mean estimates"
tree$Do(function(node) {
if(node$isRoot){
node$uncertainty <-pop.conf.int(tree)
}
else{
node$uncertainty <- prettyConfidenceIntervals(node$overdose_samples)
}}
)
# create dataframe of N estimates from each leaf
x <- tree$Get('overdose_samples', filterFun = isLeaf, traversal = 'post-order')
# Ignore NAs and then also choose leaves with population=TRUE values only (these are the data)
if(is.list(x)){
mat.x <- NULL
for (i in 1:length(x)) {
if (length(x[[i]] > 0)) {
mat.x <- cbind(mat.x, x[[i]])
colnames(mat.x)[dim(mat.x)[2]] <- rownames(x)[[i]]
}
}
x <- mat.x
}
x <- as.data.frame(x)
# rename column to be of the correct leaves (with population=TRUE values)
getleaves <- which(tree$Get('Population', filterFun = isLeaf, traversal = 'post-order'))
pop.x <- x %>%
select(all_of(getleaves))
pop.x <- round(pop.x)
peakedp0q0.pop.x <- pop.x
# generate values \hat{N} to plot
weights <- as.matrix(pop.ko.weights(tree))
peakedp0q0weights <- weights
peakedp0q0Ns <- as.data.frame(as.matrix(pop.x) %*% t(weights))
peakedp0q0Ns <- round(peakedp0q0Ns)
peakedp0q0Ns.restrict <- peakedp0q0Ns[peakedp0q0Ns$V1>=30000 & peakedp0q0Ns$V1<=6000000, , drop = FALSE]
p <- ggplot(peakedp0q0Ns.restrict, aes(x = V1)) + geom_density(alpha=.2, fill="#FF6666") +
ggtitle("MM estimates for peaked (0) q, peaked (0) for p", subtitle = "One MM pass and set of weights, 5000 p,q estimates")
p
summary(peakedp0q0Ns)
## V1
## Min. :1.267e+06
## 1st Qu.:5.457e+06
## Median :9.792e+06
## Mean :3.974e+07
## 3rd Qu.:2.107e+07
## Max. :9.509e+09
peakedp0q0weights
##
## [1,] 0.4960676 0.5039324
Finally, the Bayesian model also overestimates \(N\) when \(f(p)\) is well specified by \(f(q)\) is poorly chosen: While there is overestimation in the value of node \(b\) when \(f(q)\) is misspecified, these distributions show clear improvement on the estimates generated with the same choice \(f(q)\) but uniform \(f(p)\).
It should also be noted that while the distribution and the Bayesian model do not show the same level of overestimation, these estimates are constrained by the choice of range for the discrete uniform prior on \(N\).