Here we use the opioid tree to test the three methods - distribution plotting, weighted multiplier method (MM), and Bayesian model. First, we import the opioid data and delete the healthcare unattended branch entirely.
treedata <- read_csv("diagram_data_hc_nohc.csv")
## Rows: 23 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): from, to, Description
## dbl (2): Estimate, Total
## lgl (1): Population
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# adjust nodes g, g1, g2 to become one node (currently ignoring
# underreporting here); sum all estimate from g nodes and delete g1, g2 from tree
treedata$Estimate[3] <- sum(treedata$Estimate[3],treedata$Estimate[4],
treedata$Estimate[5])
treedata$Total[6:9] <- treedata$Estimate[3]
treedata <- treedata[-c(4,5),]
# delete healthcare unattended branch
treedata <- treedata[-c(16,17,18,19,20),]
# adjust tree so that it looks like small subtree used in intuitive notebooks -
# here we combine "Admitted to ED" and "attended by EMS but not transported".
treedata$Estimate[3] <- sum(treedata$Estimate[3],treedata$Estimate[2])
treedata$Total[4:7] <- treedata$Estimate[3]
treedata$Description[3] <- "Admitted to ED or not transported by EMS"
treedata <- treedata[-2,]
# adjust population values so only certain leaves are considered as data
treedata$Population <- c(FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE)
# show data and tree
treedata %>%
knitr::kable() %>%
kable_styling()
from | to | Estimate | Total | Population | Description |
---|---|---|---|---|---|
b | e | 473 | 34113 | FALSE | Attended by EMS but dies(in transport, on scene) |
b | g | 33640 | 34113 | TRUE | Admitted to ED or not transported by EMS |
g | m | 45 | 33640 | FALSE | Died in ED or after LWBS |
g | n | 12907 | 33640 | FALSE | Released (assume survival) |
g | o | 2376 | 33640 | FALSE | Admitted to hospital |
g | p | NA | 33640 | FALSE | Underdiagnosed and under-reported hospitalizations |
o | q | 106 | 2376 | FALSE | Died in hospital |
o | r | 2270 | 2376 | FALSE | Released (assumed survived) |
e | w | 448 | 473 | TRUE | Reported by BCCS |
e | x | 25 | 473 | FALSE | Under-reported as opioid OD death1 |
m | y | 43 | 45 | FALSE | Reported by BCCS |
m | z | 2 | 45 | FALSE | Under-diagnosed and under-notified |
q | ab | 98 | 106 | FALSE | Reported by BCCS |
q | ac | 8 | 106 | FALSE | Under-diagnosed and under-notified |
xo | b | 1309 | 2350 | FALSE | Healthcare attended |
treedata$TrueEst <- treedata$Estimate
tree <- FromDataFrameNetwork(treedata)
tree$Do(function(node){
node$probability <- round(node$Estimate/node$Total,4)
})
SetGraphStyle(tree,scale=2)
SetEdgeStyle(tree,penwidth=3)
SetEdgeStyle(tree, arrowhead = "vee", color = "grey35", penwidth = 2,
label = function(node) node$probability)
SetNodeStyle(tree, fontsize=25, penwidth=3,width=1,
label = function(node) node$Description) #try with node$From to compare to data
plot(tree)
We adjust the data so that the tree resembles our simple subtree as in the intuition
notebooks. We do so by assuming an estimate of the node on the right after the first binary split, “Admitted to ED or not transported by EMS”, is known. We eliminate all nodes below this one for this exercise. The other node assumed to be known is the endpoint of the left-most path, “Reported by BCCS”, from those “Admitted by EMS but dies”. The cropped tree is as follows, where the root is now node \(b\), “Healthcare attended overdoses”:
# cropped nodes below "admitted to ED or not transported by EMS"
treedata <- treedata[-c(3:8,11:15),]
# show tree and data
treedata %>%
knitr::kable() %>%
kable_styling()
from | to | Estimate | Total | Population | Description | TrueEst |
---|---|---|---|---|---|---|
b | e | 473 | 34113 | FALSE | Attended by EMS but dies(in transport, on scene) | 473 |
b | g | 33640 | 34113 | TRUE | Admitted to ED or not transported by EMS | 33640 |
e | w | 448 | 473 | TRUE | Reported by BCCS | 448 |
e | x | 25 | 473 | FALSE | Under-reported as opioid OD death1 | 25 |
treedata$TrueEst <- treedata$Estimate
tree <- FromDataFrameNetwork(treedata)
tree$Do(function(node){
node$probability <- round(node$Estimate/node$Total,4)
})
SetGraphStyle(tree,scale=2)
SetEdgeStyle(tree,penwidth=3)
SetEdgeStyle(tree, arrowhead = "vee", color = "grey35", penwidth = 2,
label = function(node) node$probability)
SetNodeStyle(tree, fontsize=25, penwidth=3,width=1,
label = function(node) node$Description)
plot(tree)
Now we generate a root estimate of \(b\) assuming only the values at the left- and right-most leaves. Maintaining the convention that the path probability to the right-most leaf is \(p\) and that to the left-most leaf is \(q\), we have a subtree analogous to the simulation, but this time we have a cohort estimate of the root with which to compare the estimates given by each method. In this case, we know the cohort value at \(b\) to be 34113. We test each method’s sensitivity to the distributions of \(p,q\) in estimating node \(b\). In each case, we begin by using the actual values generating \(p,q\) to establish adequate estimation of \(b\) when the distributions of \(p,q\) are accurate with respect to the data. Next, we proceed much like in the “intuition notebook 1” - estimate \(b\) using uniform \(p,q\), followed by changing to the peaked distributions for \(q\), then a middle case distribution of \(q\), and finally we will try a peaked distributions of both \(p\) and \(q\), since this closely resembles observed branching structure of this subtree.
For these high values of \(N\), we must plot the log density for computational reasons, then revert. The density when using the calculated posterior results in the following plot, with mode around \(\hat{N} = 34100\), as expected:
source("intuitiveexercise.R")
dens_plot(473 + 1, 34113 - 473 + 1, 448 + 1, 473 - 448 + 1, 33640, 448, 34075, 34150, 1)
We now use the weighted multiplier method to plot a comparable distribution. We use 5000 samples of \(p,q\) to generate weights, and then calculate and plot the 5000 corresponding estimates of \(N\) generated with paired samples \(p,q\):
# calculate estimates from A,B using 5000 simulated p,q values
tree <- FromDataFrameNetwork(treedata)
weightedPopTree(tree,sample_length=5000,method='weightedEstimate')
## [1] "using weighted mean estimates"
tree$Do(function(node) {
if(node$isRoot){
node$uncertainty <-pop.conf.int(tree)
}
else{
node$uncertainty <- prettyConfidenceIntervals(node$overdose_samples)
}}
)
# create dataframe of N estimates from each leaf
x <- tree$Get('overdose_samples', filterFun = isLeaf, traversal = 'post-order')
# Ignore NAs and then also choose leaves with population=TRUE values only (these are the data)
if(is.list(x)){
mat.x <- NULL
for (i in 1:length(x)) {
if (length(x[[i]] > 0)) {
mat.x <- cbind(mat.x, x[[i]])
colnames(mat.x)[dim(mat.x)[2]] <- rownames(x)[[i]]
}
}
x <- mat.x
}
x <- as.data.frame(x)
# rename column to be of the correct leaves (with population=TRUE values)
getleaves <- which(tree$Get('Population', filterFun = isLeaf, traversal = 'post-order'))
pop.x <- x %>%
select(all_of(getleaves))
pop.x <- round(pop.x)
data.pop.x <- pop.x
# generate values \hat{N} to plot
weights <- as.matrix(pop.ko.weights(tree))
dataweights <- weights
dataNs <- as.data.frame(as.matrix(pop.x) %*% t(weights))
dataNs <- round(dataNs)
dataNs.restrict <- dataNs[dataNs$V1>=30000 & dataNs$V1<=60000, , drop = FALSE]
p <- ggplot(dataNs.restrict, aes(x = V1)) + geom_density(alpha=.2, fill="#FF6666") +
ggtitle("MM estimates for data-informed distribution for p,q", subtitle = "One MM pass and set of weights, 5000 p,q estimates")
p
We see the mode of the distribution accurately falls around \(N=34120\).
# calculate estimates from A,B using 5000 simulated p,q values
tree <- FromDataFrameNetwork(treedata)
weightedPopTree(tree,sample_length=5000,method='weightedEstimate')
## [1] "using weighted mean estimates"
tree$Do(function(node) {
if(node$isRoot){
node$uncertainty <-pop.conf.int(tree)
}
else{
node$uncertainty <- prettyConfidenceIntervals(node$overdose_samples)
}}
)
# create dataframe of N estimates from each leaf
x <- tree$Get('overdose_samples', filterFun = isLeaf, traversal = 'post-order')
# Ignore NAs and then also choose leaves with population=TRUE values only (these are the data)
if(is.list(x)){
mat.x <- NULL
for (i in 1:length(x)) {
if (length(x[[i]] > 0)) {
mat.x <- cbind(mat.x, x[[i]])
colnames(mat.x)[dim(mat.x)[2]] <- rownames(x)[[i]]
}
}
x <- mat.x
}
x <- as.data.frame(x)
# rename column to be of the correct leaves (with population=TRUE values)
getleaves <- which(tree$Get('Population', filterFun = isLeaf, traversal = 'post-order'))
pop.x <- x %>%
select(all_of(getleaves))
pop.x <- log(pop.x)
data.pop.x <- exp(pop.x)
# generate values \hat{N} to plot
weights <- as.matrix(pop.ko.weights(tree))
dataweights <- weights
dataNs <- as.data.frame(as.matrix(pop.x) %*% t(weights))
dataNs <- round(exp(dataNs))
dataNs.restrict <- dataNs[dataNs$V1>=30000 & dataNs$V1<=60000, , drop = FALSE]
p <- ggplot(dataNs.restrict, aes(x = V1)) + geom_density(alpha=.2, fill="#FF6666") +
ggtitle("MM estimates for data-informed distribution for p,q", subtitle = "One MM pass and set of weights, 5000 p,q estimates")
p
Again, the mode accurately falls at approximately \(N = 34113\).
Using the Bayesian model, we get the following posterior plot of \(N\), the population at node \(b\):
Here we see the mode clearly sits around the true value given by the data.
When we assume information about \(p,q\) is not readily available or reliable, we may be tempted to use a uniform distribution for these branching probabilities. The following density plot of the estimates \(\hat{N}\) results:
dens_plot(1, 1, 1, 1, 33640, 448, 33750, 45000, 250)
With no additional information about \(p,q\), we have a slight overestimation of the actual value \(N\) at node \(b\), with a mode around \(N=36500\).
# adjust data
# only Popultaion = TRUE values are used and must be changed
uniformdata <- treedata
uniformdata[2,3] <- 0
uniformdata[2,4] <- 0
uniformdata[3,3] <- 0
uniformdata[3,4] <- 0
uniformdata
# calculate estimates from A,B using 5000 simulated p,q values
tree <- FromDataFrameNetwork(uniformdata)
weightedPopTree(tree,sample_length=5000,method='weightedEstimate')
## [1] "using weighted mean estimates"
tree$Do(function(node) {
if(node$isRoot){
node$uncertainty <-pop.conf.int(tree)
}
else{
node$uncertainty <- prettyConfidenceIntervals(node$overdose_samples)
}}
)
# create dataframe of N estimates from each leaf
x <- tree$Get('overdose_samples', filterFun = isLeaf, traversal = 'post-order')
# Ignore NAs and then also choose leaves with population=TRUE values only (these are the data)
if(is.list(x)){
mat.x <- NULL
for (i in 1:length(x)) {
if (length(x[[i]] > 0)) {
mat.x <- cbind(mat.x, x[[i]])
colnames(mat.x)[dim(mat.x)[2]] <- rownames(x)[[i]]
}
}
x <- mat.x
}
x <- as.data.frame(x)
# rename column to be of the correct leaves (with population=TRUE values)
getleaves <- which(tree$Get('Population', filterFun = isLeaf, traversal = 'post-order'))
pop.x <- x %>%
select(all_of(getleaves))
pop.x <- round(pop.x)
uniform.pop.x <- pop.x
# generate values \hat{N} to plot
weights <- as.matrix(pop.ko.weights(tree))
uniformweights <- weights
uniformNs <- as.data.frame(as.matrix(pop.x) %*% t(weights))
uniformNs <- round(uniformNs)
uniformNs.restrict <- uniformNs[uniformNs$V1>=30000 & uniformNs$V1<=60000, , drop = FALSE]
p <- ggplot(uniformNs.restrict, aes(x = V1)) + geom_density(alpha=.2, fill="#FF6666") +
ggtitle("MM estimates for uniform distributions for p,q", subtitle = "One MM pass and set of weights, 5000 p,q estimates")
p
summary(uniformNs)
## V1
## Min. : 32471
## 1st Qu.: 54471
## Median : 81426
## Mean : 266708
## 3rd Qu.: 155378
## Max. :73689673
uniformweights
##
## [1,] 0.5027943 0.4972057
The multiplier method does not predict a mode near the true value when the distributions of \(p,q\) are uniform. With uniform \(p,q\), backcalculating from the bottom left node will result in a high estimate of \(b\), which the same process will result in lower estimates of \(b\) from the rightmost leaf. The multiplier method is estimating a balance of these, weighted slightly more heavily in favor of the smaller estimates as these are the result of a shorter path to known data.
# adjust data
# only Popultaion = TRUE values are used and must be changed
uniformdata <- treedata
uniformdata[2,3] <- 0
uniformdata[2,4] <- 0
uniformdata[3,3] <- 0
uniformdata[3,4] <- 0
uniformdata
# calculate estimates from A,B using 5000 simulated p,q values
tree <- FromDataFrameNetwork(uniformdata)
weightedPopTree(tree,sample_length=5000,method='weightedEstimate')
## [1] "using weighted mean estimates"
tree$Do(function(node) {
if(node$isRoot){
node$uncertainty <-pop.conf.int(tree)
}
else{
node$uncertainty <- prettyConfidenceIntervals(node$overdose_samples)
}}
)
# create dataframe of N estimates from each leaf
x <- tree$Get('overdose_samples', filterFun = isLeaf, traversal = 'post-order')
# Ignore NAs and then also choose leaves with population=TRUE values only (these are the data)
if(is.list(x)){
mat.x <- NULL
for (i in 1:length(x)) {
if (length(x[[i]] > 0)) {
mat.x <- cbind(mat.x, x[[i]])
colnames(mat.x)[dim(mat.x)[2]] <- rownames(x)[[i]]
}
}
x <- mat.x
}
x <- as.data.frame(x)
# rename column to be of the correct leaves (with population=TRUE values)
getleaves <- which(tree$Get('Population', filterFun = isLeaf, traversal = 'post-order'))
pop.x <- x %>%
select(all_of(getleaves))
pop.x <- log(pop.x)
uniform.pop.x <- exp(pop.x)
# generate values \hat{N} to plot
weights <- as.matrix(pop.ko.weights(tree))
uniformweights <- weights
uniformNs <- as.data.frame(as.matrix(pop.x) %*% t(weights))
uniformNs <- round(exp(uniformNs))
uniformNs.restrict <- uniformNs[uniformNs$V1>=30000 & uniformNs$V1<=60000, , drop = FALSE]
p <- ggplot(uniformNs.restrict, aes(x = V1)) + geom_density(alpha=.2, fill="#FF6666") +
ggtitle("MM estimates for uniform distributions for p,q", subtitle = "One MM pass and set of weights, 5000 p,q estimates")
p
summary(uniformNs)
## V1
## Min. : 32191
## 1st Qu.: 53172
## Median : 76238
## Mean : 137772
## 3rd Qu.: 123921
## Max. :31132806
uniformweights
##
## [1,] 0.5039039 0.4960961
The density is similar to the case of non-transformed data, but more uniformly distributed near the mode, which reamains around approximately \(N=43500\).
Using the uniform distributions with the Bayesian model, we get the following posterior plot, where the mode is providing a good estimate of the true value of node \(b\):
For the distribution with \(q\) peaked near 1, we expect to see a better estimate of the true value than with the uniform choice of \(q\), since the cohort data informs a branching probability of \(\approx 94\%\). This is indeed the case, as we see a mode around \(N=34100\):
dens_plot(1, 1, 150, 1, 33640, 448, 34000, 34200, 5)
When \(q\) is peaked “incorrectly”, with mass around 0:
dens_plot(1, 1, 1, 150, 33640, 448, 34000, 60000, 1000)
We can see in this case the mass of the distribution is moved to high values of \(N\), and we don’t have much hope of recovering an estimate close to the true value.
We now use the weighted MM with \(q\) peaked near 0:
# adjust data
# only Population = TRUE values are used and must be changed
peakedq0 <- treedata
peakedq0[2,3] <- 0
peakedq0[2,4] <- 0
peakedq0[3,3] <- 0
peakedq0[3,4] <- 149
peakedq0
# calculate estimates from A,B using 5000 simulated p,q values
tree <- FromDataFrameNetwork(peakedq0)
weightedPopTree(tree,sample_length=5000,method='weightedEstimate')
## [1] "using weighted mean estimates"
tree$Do(function(node) {
if(node$isRoot){
node$uncertainty <-pop.conf.int(tree)
}
else{
node$uncertainty <- prettyConfidenceIntervals(node$overdose_samples)
}}
)
# create dataframe of N estimates from each leaf
x <- tree$Get('overdose_samples', filterFun = isLeaf, traversal = 'post-order')
# Ignore NAs and then also choose leaves with population=TRUE values only (these are the data)
if(is.list(x)){
mat.x <- NULL
for (i in 1:length(x)) {
if (length(x[[i]] > 0)) {
mat.x <- cbind(mat.x, x[[i]])
colnames(mat.x)[dim(mat.x)[2]] <- rownames(x)[[i]]
}
}
x <- mat.x
}
x <- as.data.frame(x)
# rename column to be of the correct leaves (with population=TRUE values)
getleaves <- which(tree$Get('Population', filterFun = isLeaf, traversal = 'post-order'))
pop.x <- x %>%
select(all_of(getleaves))
pop.x <- round(pop.x)
peakedq0.pop.x <- pop.x
# generate values \hat{N} to plot
weights <- as.matrix(pop.ko.weights(tree))
peakedq0weights <- weights
peakedq0Ns <- as.data.frame(as.matrix(pop.x) %*% t(weights))
peakedq0Ns <- round(peakedq0Ns)
peakedq0Ns.restrict <- peakedq0Ns[peakedq0Ns$V1>=30000 & peakedq0Ns$V1<=900000, , drop = FALSE]
p <- ggplot(peakedq0Ns.restrict, aes(x = V1)) + geom_density(alpha=.2, fill="#FF6666") +
ggtitle("MM estimates for peaked (0) distribution for q, uniform for p", subtitle = "One MM pass and set of weights, 5000 p,q estimates")
p
summary(peakedq0Ns)
## V1
## Min. :2.322e+05
## 1st Qu.:1.415e+06
## Median :2.736e+06
## Mean :2.108e+07
## 3rd Qu.:6.483e+06
## Max. :1.289e+10
peakedq0weights
##
## [1,] 0.3708844 0.6291156
As expected, we expect the mode of this distribution to be overestimated. In this case, it is around \(N=800000\).
Now when the distribution of \(q\) is peaked near 1, we expect to see good estimation of node \(b\):
# adjust data
# only Population = TRUE values are used and must be changed
peakedq1 <- treedata
peakedq1[2,3] <- 0
peakedq1[2,4] <- 0
peakedq1[3,3] <- 149
peakedq1[3,4] <- 150
peakedq1
# calculate estimates from A,B using 5000 simulated p,q values
tree <- FromDataFrameNetwork(peakedq1)
weightedPopTree(tree,sample_length=5000,method='weightedEstimate')
## [1] "using weighted mean estimates"
tree$Do(function(node) {
if(node$isRoot){
node$uncertainty <-pop.conf.int(tree)
}
else{
node$uncertainty <- prettyConfidenceIntervals(node$overdose_samples)
}}
)
# create dataframe of N estimates from each leaf
x <- tree$Get('overdose_samples', filterFun = isLeaf, traversal = 'post-order')
# Ignore NAs and then also choose leaves with population=TRUE values only (these are the data)
if(is.list(x)){
mat.x <- NULL
for (i in 1:length(x)) {
if (length(x[[i]] > 0)) {
mat.x <- cbind(mat.x, x[[i]])
colnames(mat.x)[dim(mat.x)[2]] <- rownames(x)[[i]]
}
}
x <- mat.x
}
x <- as.data.frame(x)
# rename column to be of the correct leaves (with population=TRUE values)
getleaves <- which(tree$Get('Population', filterFun = isLeaf, traversal = 'post-order'))
pop.x <- x %>%
select(all_of(getleaves))
pop.x <- round(pop.x)
peakedq1.pop.x <- pop.x
# generate values \hat{N} to plot
weights <- as.matrix(pop.ko.weights(tree))
peakedq1weights <- weights
peakedq1Ns <- as.data.frame(as.matrix(pop.x) %*% t(weights))
peakedq1Ns <- round(peakedq1Ns)
peakedq1Ns.restrict <- peakedq1Ns[peakedq1Ns$V1>=30000 & peakedq1Ns$V1<=50000, , drop = FALSE]
p <- ggplot(peakedq1Ns.restrict, aes(x = V1)) + geom_density(alpha=.2, fill="#FF6666") +
ggtitle("MM estimates for uniform peaked (1) for q, uniform for p", subtitle = "One MM pass and set of weights, 5000 p,q estimates")
p
summary(peakedq1Ns)
## V1
## Min. : 27185
## 1st Qu.: 31972
## Median : 33033
## Mean : 33849
## 3rd Qu.: 34147
## Max. :2441167
peakedq1weights
##
## [1,] 0.9971794 0.002820579
In this case, we see good estimation of the population at node \(b\), with \(N \approx 33000\).
We now use the log data with weighted MM with \(q\) peaked near 0:
# adjust data
# only Population = TRUE values are used and must be changed
peakedq0 <- treedata
peakedq0[2,3] <- 0
peakedq0[2,4] <- 0
peakedq0[3,3] <- 0
peakedq0[3,4] <- 149
peakedq0
# calculate estimates from A,B using 5000 simulated p,q values
tree <- FromDataFrameNetwork(peakedq0)
weightedPopTree(tree,sample_length=5000,method='weightedEstimate')
## [1] "using weighted mean estimates"
tree$Do(function(node) {
if(node$isRoot){
node$uncertainty <-pop.conf.int(tree)
}
else{
node$uncertainty <- prettyConfidenceIntervals(node$overdose_samples)
}}
)
# create dataframe of N estimates from each leaf
x <- tree$Get('overdose_samples', filterFun = isLeaf, traversal = 'post-order')
# Ignore NAs and then also choose leaves with population=TRUE values only (these are the data)
if(is.list(x)){
mat.x <- NULL
for (i in 1:length(x)) {
if (length(x[[i]] > 0)) {
mat.x <- cbind(mat.x, x[[i]])
colnames(mat.x)[dim(mat.x)[2]] <- rownames(x)[[i]]
}
}
x <- mat.x
}
x <- as.data.frame(x)
# rename column to be of the correct leaves (with population=TRUE values)
getleaves <- which(tree$Get('Population', filterFun = isLeaf, traversal = 'post-order'))
pop.x <- x %>%
select(all_of(getleaves))
pop.x <- log(pop.x)
peakedq0.pop.x <- exp(pop.x)
# generate values \hat{N} to plot
weights <- as.matrix(pop.ko.weights(tree))
peakedq0weights <- weights
peakedq0Ns <- as.data.frame(as.matrix(pop.x) %*% t(weights))
peakedq0Ns <- round(exp(peakedq0Ns))
peakedq0Ns.restrict <- peakedq0Ns[peakedq0Ns$V1>=30000 & peakedq0Ns$V1<=900000, , drop = FALSE]
p <- ggplot(peakedq0Ns.restrict, aes(x = V1)) + geom_density(alpha=.2, fill="#FF6666") +
ggtitle("MM estimates for peaked (0) distribution for q, uniform for p", subtitle = "One MM pass and set of weights, 5000 p,q estimates")
p
summary(peakedq0Ns)
## V1
## Min. : 108316
## 1st Qu.: 278081
## Median : 420187
## Mean : 754352
## 3rd Qu.: 719924
## Max. :61724052
peakedq0weights
##
## [1,] 0.3668003 0.6331997
As expected, we expect the mode of this distribution to be overestimated. In this case, it is around \(N=250000\). While this is still a massive overestimate, this is an improvement over the mode estimate of \(N=800000\) given when using the WMM without log-transformed data.
Now when the distribution of \(q\) is peaked near 1, we expect to see good estimation of node \(b\):
# adjust data
# only Population = TRUE values are used and must be changed
peakedq1 <- treedata
peakedq1[2,3] <- 0
peakedq1[2,4] <- 0
peakedq1[3,3] <- 149
peakedq1[3,4] <- 150
peakedq1
# calculate estimates from A,B using 5000 simulated p,q values
tree <- FromDataFrameNetwork(peakedq1)
weightedPopTree(tree,sample_length=5000,method='weightedEstimate')
## [1] "using weighted mean estimates"
tree$Do(function(node) {
if(node$isRoot){
node$uncertainty <-pop.conf.int(tree)
}
else{
node$uncertainty <- prettyConfidenceIntervals(node$overdose_samples)
}}
)
# create dataframe of N estimates from each leaf
x <- tree$Get('overdose_samples', filterFun = isLeaf, traversal = 'post-order')
# Ignore NAs and then also choose leaves with population=TRUE values only (these are the data)
if(is.list(x)){
mat.x <- NULL
for (i in 1:length(x)) {
if (length(x[[i]] > 0)) {
mat.x <- cbind(mat.x, x[[i]])
colnames(mat.x)[dim(mat.x)[2]] <- rownames(x)[[i]]
}
}
x <- mat.x
}
x <- as.data.frame(x)
# rename column to be of the correct leaves (with population=TRUE values)
getleaves <- which(tree$Get('Population', filterFun = isLeaf, traversal = 'post-order'))
pop.x <- x %>%
select(all_of(getleaves))
pop.x <- log(pop.x)
peakedq1.pop.x <- exp(pop.x)
# generate values \hat{N} to plot
weights <- as.matrix(pop.ko.weights(tree))
peakedq1weights <- weights
peakedq1Ns <- as.data.frame(as.matrix(pop.x) %*% t(weights))
peakedq1Ns <- round(exp(peakedq1Ns))
peakedq1Ns.restrict <- peakedq1Ns[peakedq1Ns$V1>=30000 & peakedq1Ns$V1<=50000, , drop = FALSE]
p <- ggplot(peakedq1Ns.restrict, aes(x = V1)) + geom_density(alpha=.2, fill="#FF6666") +
ggtitle("MM estimates for uniform peaked (1) for q, uniform for p", subtitle = "One MM pass and set of weights, 5000 p,q estimates")
p
summary(peakedq1Ns)
## V1
## Min. :28029
## 1st Qu.:31735
## Median :32738
## Mean :32813
## 3rd Qu.:33872
## Max. :38694
peakedq1weights
##
## [1,] 0.9975978 0.002402227
In this case, we see better estimation of the population at node \(b\), with \(N \approx 33000\), as in the untransformed case. The main difference here the lack of estimates beyond \(N \sim 40000\), whereas the untransformed case is producing estimates beyond this.
In the Bayesian model case, with \(q\) ‘incorrectly’ peaked near 0, we have the following plot:
For \(q\) peaked near 1, we have very good estimation, as expected, with mode in line with what is predicted using the density:
In the middle case, recall we use \(q \sim Beta(15,15)\). This distribution is bell-shaped and centered around 0.5, with standard deviation just under 0.1. In this case, we don’t do too badly; the mode sits around \(N=34600\):
dens_plot(1, 1, 15, 15, 33640, 448, 34000, 35500, 10)
# adjust data
# only Population = TRUE values are used and must be changed
middle <- treedata
middle[2,3] <- 0
middle[2,4] <- 0
middle[3,3] <- 14
middle[3,4] <- 28
middle
# calculate estimates from A,B using 5000 simulated p,q values
tree <- FromDataFrameNetwork(middle)
weightedPopTree(tree,sample_length=5000,method='weightedEstimate')
## [1] "using weighted mean estimates"
tree$Do(function(node) {
if(node$isRoot){
node$uncertainty <-pop.conf.int(tree)
}
else{
node$uncertainty <- prettyConfidenceIntervals(node$overdose_samples)
}}
)
# create dataframe of N estimates from each leaf
x <- tree$Get('overdose_samples', filterFun = isLeaf, traversal = 'post-order')
# Ignore NAs and then also choose leaves with population=TRUE values only (these are the data)
if(is.list(x)){
mat.x <- NULL
for (i in 1:length(x)) {
if (length(x[[i]] > 0)) {
mat.x <- cbind(mat.x, x[[i]])
colnames(mat.x)[dim(mat.x)[2]] <- rownames(x)[[i]]
}
}
x <- mat.x
}
x <- as.data.frame(x)
# rename column to be of the correct leaves (with population=TRUE values)
getleaves <- which(tree$Get('Population', filterFun = isLeaf, traversal = 'post-order'))
pop.x <- x %>%
select(all_of(getleaves))
pop.x <- round(pop.x)
middle.pop.x <- pop.x
# generate values \hat{N} to plot
weights <- as.matrix(pop.ko.weights(tree))
middleweights <- weights
middleNs <- as.data.frame(as.matrix(pop.x) %*% t(weights))
middleNs <- round(middleNs)
middleNs.restrict <- middleNs[middleNs$V1>=30000 & middleNs$V1<=60000, , drop = FALSE]
p <- ggplot(middleNs.restrict, aes(x = V1)) + geom_density(alpha=.2, fill="#FF6666") +
ggtitle("MM estimates for middle case q, uniform for p", subtitle = "One MM pass and set of weights, 5000 p,q estimates")
p
summary(middleNs)
## V1
## Min. : 39147
## 1st Qu.: 58891
## Median : 66860
## Mean : 85082
## 3rd Qu.: 77427
## Max. :47540386
middleweights
##
## [1,] 0.9677737 0.03222626
While the posterior distribution withstands this choice of distribution for \(q\), the multiplier method is estimating a population around \(N=57500\), a large overestimate of the true value.
# adjust data
# only Population = TRUE values are used and must be changed
middle <- treedata
middle[2,3] <- 0
middle[2,4] <- 0
middle[3,3] <- 14
middle[3,4] <- 28
middle
# calculate estimates from A,B using 5000 simulated p,q values
tree <- FromDataFrameNetwork(middle)
weightedPopTree(tree,sample_length=5000,method='weightedEstimate')
## [1] "using weighted mean estimates"
tree$Do(function(node) {
if(node$isRoot){
node$uncertainty <-pop.conf.int(tree)
}
else{
node$uncertainty <- prettyConfidenceIntervals(node$overdose_samples)
}}
)
# create dataframe of N estimates from each leaf
x <- tree$Get('overdose_samples', filterFun = isLeaf, traversal = 'post-order')
# Ignore NAs and then also choose leaves with population=TRUE values only (these are the data)
if(is.list(x)){
mat.x <- NULL
for (i in 1:length(x)) {
if (length(x[[i]] > 0)) {
mat.x <- cbind(mat.x, x[[i]])
colnames(mat.x)[dim(mat.x)[2]] <- rownames(x)[[i]]
}
}
x <- mat.x
}
x <- as.data.frame(x)
# rename column to be of the correct leaves (with population=TRUE values)
getleaves <- which(tree$Get('Population', filterFun = isLeaf, traversal = 'post-order'))
pop.x <- x %>%
select(all_of(getleaves))
pop.x <- log(pop.x)
middle.pop.x <- exp(pop.x)
# generate values \hat{N} to plot
weights <- as.matrix(pop.ko.weights(tree))
middleweights <- weights
middleNs <- as.data.frame(as.matrix(pop.x) %*% t(weights))
middleNs <- round(exp(middleNs))
middleNs.restrict <- middleNs[middleNs$V1>=30000 & middleNs$V1<=60000, , drop = FALSE]
p <- ggplot(middleNs.restrict, aes(x = V1)) + geom_density(alpha=.2, fill="#FF6666") +
ggtitle("MM estimates for middle case q, uniform for p", subtitle = "One MM pass and set of weights, 5000 p,q estimates")
p
summary(middleNs)
## V1
## Min. : 38247
## 1st Qu.: 57990
## Median : 65018
## Mean : 67406
## 3rd Qu.: 74137
## Max. :160704
middleweights
##
## [1,] 0.9641484 0.03585157
Unfortunately we do not have an improvement in the mode with the transformed data. The distribution actually suggests more confidence in values around the incorrect mode.
For the middle case distribution of \(q\), we have the following posterior using the Bayesian model in blang
:
We don’t yet have a clear trend here. However, rerunning the simulation longer to promote better convergence we do see more tendancy towards a mode, but it is not yet in agreeance with the mode in the density:
Lastly, we estimate using a “correctly” peaked distribution of \(q\), and also adjust the distribution of \(p\) to be peaked around 0, which is close to the values suggested by the data, and also to be peaked around 1, to see the effect of poor priors on \(p\) on estimation of \(N\). The first case shows good estimation, with the mode around \(N=34100\):
dens_plot(1, 150, 150, 1, 33640, 448, 34000, 34200, 5)
In the second case, where the distribution of \(p\) is pushed mostly towards 1 but the actual \(p\) value from the cohort is \(\approx 0.01\), we interesting still see quite good estimation, with the mode of \(N = 34300\):
dens_plot(150, 1, 150, 1, 33640, 448, 34000, 36000, 10)
Essentially, we only see an increase in uncertainty when \(f(p)\) is poorly chosen, and not a significant change in a mode-based estimate \(\hat{N}\).
We begin by peaking the \(p\) distribution around 0, which should result in accurate estimates of node \(b\):
# adjust data
# only Population = TRUE values are used and must be changed
peakedp0 <- treedata
peakedp0[2,3] <- 0
peakedp0[2,4] <- 149
peakedp0[3,3] <- 149
peakedp0[3,4] <- 150
peakedp0
# calculate estimates from A,B using 5000 simulated p,q values
tree <- FromDataFrameNetwork(peakedp0)
weightedPopTree(tree,sample_length=5000,method='weightedEstimate')
## [1] "using weighted mean estimates"
tree$Do(function(node) {
if(node$isRoot){
node$uncertainty <-pop.conf.int(tree)
}
else{
node$uncertainty <- prettyConfidenceIntervals(node$overdose_samples)
}}
)
# create dataframe of N estimates from each leaf
x <- tree$Get('overdose_samples', filterFun = isLeaf, traversal = 'post-order')
# Ignore NAs and then also choose leaves with population=TRUE values only (these are the data)
if(is.list(x)){
mat.x <- NULL
for (i in 1:length(x)) {
if (length(x[[i]] > 0)) {
mat.x <- cbind(mat.x, x[[i]])
colnames(mat.x)[dim(mat.x)[2]] <- rownames(x)[[i]]
}
}
x <- mat.x
}
x <- as.data.frame(x)
# rename column to be of the correct leaves (with population=TRUE values)
getleaves <- which(tree$Get('Population', filterFun = isLeaf, traversal = 'post-order'))
pop.x <- x %>%
select(all_of(getleaves))
pop.x <- round(pop.x)
peakedp0.pop.x <- pop.x
# generate values \hat{N} to plot
weights <- as.matrix(pop.ko.weights(tree))
peakedp0weights <- weights
peakedp0Ns <- as.data.frame(as.matrix(pop.x) %*% t(weights))
peakedp0Ns <- round(peakedp0Ns)
peakedp0Ns.restrict <- peakedp0Ns[peakedp0Ns$V1>=30000 & peakedp0Ns$V1<=60000, , drop = FALSE]
p <- ggplot(peakedp0Ns.restrict, aes(x = V1)) + geom_density(alpha=.2, fill="#FF6666") +
ggtitle("MM estimates for peaked q, peaked (0) p", subtitle = "One MM pass and set of weights, 5000 p,q estimates")
p
summary(peakedp0Ns)
## V1
## Min. : 30235
## 1st Qu.: 38094
## Median : 43025
## Mean : 176299
## 3rd Qu.: 56844
## Max. :230836733
peakedp0weights
##
## [1,] 0.9985963 0.001403686
We do see an overestimate in this case, with a mode close to \(N=37000\).
When the distribution of \(p\) is instead peaked around 1, we may see more overestimation, as the true value of \(p\) is very small:
# adjust data
# only Population = TRUE values are used and must be changed
peakedp1 <- treedata
peakedp1[2,3] <- 149
peakedp1[2,4] <- 150
peakedp1[3,3] <- 149
peakedp1[3,4] <- 150
peakedp1
# calculate estimates from A,B using 5000 simulated p,q values
tree <- FromDataFrameNetwork(peakedp1)
weightedPopTree(tree,sample_length=5000,method='weightedEstimate')
## [1] "using weighted mean estimates"
tree$Do(function(node) {
if(node$isRoot){
node$uncertainty <-pop.conf.int(tree)
}
else{
node$uncertainty <- prettyConfidenceIntervals(node$overdose_samples)
}}
)
# create dataframe of N estimates from each leaf
x <- tree$Get('overdose_samples', filterFun = isLeaf, traversal = 'post-order')
# Ignore NAs and then also choose leaves with population=TRUE values only (these are the data)
if(is.list(x)){
mat.x <- NULL
for (i in 1:length(x)) {
if (length(x[[i]] > 0)) {
mat.x <- cbind(mat.x, x[[i]])
colnames(mat.x)[dim(mat.x)[2]] <- rownames(x)[[i]]
}
}
x <- mat.x
}
x <- as.data.frame(x)
# rename column to be of the correct leaves (with population=TRUE values)
getleaves <- which(tree$Get('Population', filterFun = isLeaf, traversal = 'post-order'))
pop.x <- x %>%
select(all_of(getleaves))
pop.x <- round(pop.x)
peakedp1.pop.x <- pop.x
# generate values \hat{N} to plot
weights <- as.matrix(pop.ko.weights(tree))
peakedp1weights <- weights
peakedp1Ns <- as.data.frame(as.matrix(pop.x) %*% t(weights))
peakedp1Ns <- round(peakedp1Ns)
peakedp1Ns.restrict <- peakedp1Ns[peakedp1Ns$V1>=30000 & peakedp1Ns$V1<=60000, , drop = FALSE]
p <- ggplot(peakedp1Ns.restrict, aes(x = V1)) + geom_density(alpha=.2, fill="#FF6666") +
ggtitle("MM estimates for peaked q, peaked (1) for p", subtitle = "One MM pass and set of weights, 5000 p,q estimates")
p
summary(peakedp1Ns)
## V1
## Min. :33508
## 1st Qu.:33814
## Median :33978
## Mean :34041
## 3rd Qu.:34195
## Max. :36365
peakedp1weights
##
## [1,] 0.03740866 0.9625913
Interestingly, we don’t see the over-estimation of \(N\) predicted. This indicates that this method, like the posterior, is less affected by the prior knowledge of the distribution of \(p\).
We start again by peaking the \(p\) distribution around 0, which should result in accurate estimates of node \(b\):
# adjust data
# only Population = TRUE values are used and must be changed
peakedp0 <- treedata
peakedp0[2,3] <- 0
peakedp0[2,4] <- 149
peakedp0[3,3] <- 149
peakedp0[3,4] <- 150
peakedp0
# calculate estimates from A,B using 5000 simulated p,q values
tree <- FromDataFrameNetwork(peakedp0)
weightedPopTree(tree,sample_length=5000,method='weightedEstimate')
## [1] "using weighted mean estimates"
tree$Do(function(node) {
if(node$isRoot){
node$uncertainty <-pop.conf.int(tree)
}
else{
node$uncertainty <- prettyConfidenceIntervals(node$overdose_samples)
}}
)
# create dataframe of N estimates from each leaf
x <- tree$Get('overdose_samples', filterFun = isLeaf, traversal = 'post-order')
# Ignore NAs and then also choose leaves with population=TRUE values only (these are the data)
if(is.list(x)){
mat.x <- NULL
for (i in 1:length(x)) {
if (length(x[[i]] > 0)) {
mat.x <- cbind(mat.x, x[[i]])
colnames(mat.x)[dim(mat.x)[2]] <- rownames(x)[[i]]
}
}
x <- mat.x
}
x <- as.data.frame(x)
# rename column to be of the correct leaves (with population=TRUE values)
getleaves <- which(tree$Get('Population', filterFun = isLeaf, traversal = 'post-order'))
pop.x <- x %>%
select(all_of(getleaves))
pop.x <- log(pop.x)
peakedp0.pop.x <- exp(pop.x)
# generate values \hat{N} to plot
weights <- as.matrix(pop.ko.weights(tree))
peakedp0weights <- weights
peakedp0Ns <- as.data.frame(as.matrix(pop.x) %*% t(weights))
peakedp0Ns <- round(exp(peakedp0Ns))
peakedp0Ns.restrict <- peakedp0Ns[peakedp0Ns$V1>=30000 & peakedp0Ns$V1<=60000, , drop = FALSE]
p <- ggplot(peakedp0Ns.restrict, aes(x = V1)) + geom_density(alpha=.2, fill="#FF6666") +
ggtitle("MM estimates for peaked q, peaked (0) p", subtitle = "One MM pass and set of weights, 5000 p,q estimates")
p
summary(peakedp0Ns)
## V1
## Min. :27441
## 1st Qu.:31847
## Median :32823
## Mean :32861
## 3rd Qu.:33855
## Max. :39185
peakedp0weights
##
## [1,] 0.9992484 0.0007516463
We no longer see an overestimate in this case, with a mode close to \(N=33500\).
When the distribution of \(p\) is instead peaked around 1, we may see more overestimation, as the true value of \(p\) is very small:
# adjust data
# only Population = TRUE values are used and must be changed
peakedp1 <- treedata
peakedp1[2,3] <- 149
peakedp1[2,4] <- 150
peakedp1[3,3] <- 149
peakedp1[3,4] <- 150
peakedp1
# calculate estimates from A,B using 5000 simulated p,q values
tree <- FromDataFrameNetwork(peakedp1)
weightedPopTree(tree,sample_length=5000,method='weightedEstimate')
## [1] "using weighted mean estimates"
tree$Do(function(node) {
if(node$isRoot){
node$uncertainty <-pop.conf.int(tree)
}
else{
node$uncertainty <- prettyConfidenceIntervals(node$overdose_samples)
}}
)
# create dataframe of N estimates from each leaf
x <- tree$Get('overdose_samples', filterFun = isLeaf, traversal = 'post-order')
# Ignore NAs and then also choose leaves with population=TRUE values only (these are the data)
if(is.list(x)){
mat.x <- NULL
for (i in 1:length(x)) {
if (length(x[[i]] > 0)) {
mat.x <- cbind(mat.x, x[[i]])
colnames(mat.x)[dim(mat.x)[2]] <- rownames(x)[[i]]
}
}
x <- mat.x
}
x <- as.data.frame(x)
# rename column to be of the correct leaves (with population=TRUE values)
getleaves <- which(tree$Get('Population', filterFun = isLeaf, traversal = 'post-order'))
pop.x <- x %>%
select(all_of(getleaves))
pop.x <- log(pop.x)
peakedp1.pop.x <- exp(pop.x)
# generate values \hat{N} to plot
weights <- as.matrix(pop.ko.weights(tree))
peakedp1weights <- weights
peakedp1Ns <- as.data.frame(as.matrix(pop.x) %*% t(weights))
peakedp1Ns <- round(exp(peakedp1Ns))
peakedp1Ns.restrict <- peakedp1Ns[peakedp1Ns$V1>=30000 & peakedp1Ns$V1<=60000, , drop = FALSE]
p <- ggplot(peakedp1Ns.restrict, aes(x = V1)) + geom_density(alpha=.2, fill="#FF6666") +
ggtitle("MM estimates for peaked q, peaked (1) for p", subtitle = "One MM pass and set of weights, 5000 p,q estimates")
p
summary(peakedp1Ns)
## V1
## Min. :33492
## 1st Qu.:33813
## Median :33970
## Mean :34036
## 3rd Qu.:34192
## Max. :35926
peakedp1weights
##
## [1,] 0.04029163 0.9597084
Again, we don’t see the over-estimation of \(N\) predicted. This indicates that this method, like the posterior, is less affected by the prior knowledge of the distribution of \(p\). The distribution here is similar to that of untransformed data.
The following plots result using the Bayesian model for distribution of \(p\) peaked around 0 and then 1, respectively:
Posterior of N, peaked distribution of p (1), q
As in the densities, we see both agree on a mode close to the true value. Again, although larger (inaccurate) estimates result in larger uncertainty in the case of an incorrect choice of \(f(p)\), an estimate based on the mode is largely unaffected by the choice of distribution for \(p\), and instead this results only in larger uncertainty surrounding our estimate \(\hat{N}\).
To explore this a little further, we check to see how the mode-based estimate is affected if \(p\) is mostly correct, but \(q\) is not. For this, we choose \(p\) peaked close to 0 and choose \(q\) peaked close to 0 also.
As in the case of uniform \(f(p)\), a better choice of this distribution does not significantly improve estiamtes of \(N\) if \(f(q)\) is poorly chosen:
dens_plot(1, 150, 1, 150, 33640, 448, 36000, 40000, 100)
For a good choice of distribution of \(f(p)\), we see that a poor choice of \(f(q)\) results in large overestimation using the weighted MM, much like the case for incorrect \(f(q)\) and uniform \(f(p)\):
# adjust data
# only Population = TRUE values are used and must be changed
peakedp0q0 <- treedata
peakedp0q0[2,3] <- 0
peakedp0q0[2,4] <- 149
peakedp0q0[3,3] <- 0
peakedp0q0[3,4] <- 149
peakedp0q0
# calculate estimates from A,B using 5000 simulated p,q values
tree <- FromDataFrameNetwork(peakedp0q0)
weightedPopTree(tree,sample_length=5000,method='weightedEstimate')
## [1] "using weighted mean estimates"
tree$Do(function(node) {
if(node$isRoot){
node$uncertainty <-pop.conf.int(tree)
}
else{
node$uncertainty <- prettyConfidenceIntervals(node$overdose_samples)
}}
)
# create dataframe of N estimates from each leaf
x <- tree$Get('overdose_samples', filterFun = isLeaf, traversal = 'post-order')
# Ignore NAs and then also choose leaves with population=TRUE values only (these are the data)
if(is.list(x)){
mat.x <- NULL
for (i in 1:length(x)) {
if (length(x[[i]] > 0)) {
mat.x <- cbind(mat.x, x[[i]])
colnames(mat.x)[dim(mat.x)[2]] <- rownames(x)[[i]]
}
}
x <- mat.x
}
x <- as.data.frame(x)
# rename column to be of the correct leaves (with population=TRUE values)
getleaves <- which(tree$Get('Population', filterFun = isLeaf, traversal = 'post-order'))
pop.x <- x %>%
select(all_of(getleaves))
pop.x <- round(pop.x)
peakedp0q0.pop.x <- pop.x
# generate values \hat{N} to plot
weights <- as.matrix(pop.ko.weights(tree))
peakedp0q0weights <- weights
peakedp0q0Ns <- as.data.frame(as.matrix(pop.x) %*% t(weights))
peakedp0q0Ns <- round(peakedp0q0Ns)
peakedp0q0Ns.restrict <- peakedp0q0Ns[peakedp0q0Ns$V1>=30000 & peakedp0q0Ns$V1<=6000000, , drop = FALSE]
p <- ggplot(peakedp0q0Ns.restrict, aes(x = V1)) + geom_density(alpha=.2, fill="#FF6666") +
ggtitle("MM estimates for peaked (0) q, peaked (0) for p", subtitle = "One MM pass and set of weights, 5000 p,q estimates")
p
summary(peakedp0q0Ns)
## V1
## Min. :1.067e+06
## 1st Qu.:5.305e+06
## Median :9.720e+06
## Mean :6.677e+07
## 3rd Qu.:2.154e+07
## Max. :3.892e+10
peakedp0q0weights
##
## [1,] 0.4964911 0.5035089
Again with the log-transformed case, we see that a poor choice of \(f(q)\) results in large overestimation using the weighted MM. The distribution is similar to that of untransformed data.
# adjust data
# only Population = TRUE values are used and must be changed
peakedp0q0 <- treedata
peakedp0q0[2,3] <- 0
peakedp0q0[2,4] <- 149
peakedp0q0[3,3] <- 0
peakedp0q0[3,4] <- 149
peakedp0q0
# calculate estimates from A,B using 5000 simulated p,q values
tree <- FromDataFrameNetwork(peakedp0q0)
weightedPopTree(tree,sample_length=5000,method='weightedEstimate')
## [1] "using weighted mean estimates"
tree$Do(function(node) {
if(node$isRoot){
node$uncertainty <-pop.conf.int(tree)
}
else{
node$uncertainty <- prettyConfidenceIntervals(node$overdose_samples)
}}
)
# create dataframe of N estimates from each leaf
x <- tree$Get('overdose_samples', filterFun = isLeaf, traversal = 'post-order')
# Ignore NAs and then also choose leaves with population=TRUE values only (these are the data)
if(is.list(x)){
mat.x <- NULL
for (i in 1:length(x)) {
if (length(x[[i]] > 0)) {
mat.x <- cbind(mat.x, x[[i]])
colnames(mat.x)[dim(mat.x)[2]] <- rownames(x)[[i]]
}
}
x <- mat.x
}
x <- as.data.frame(x)
# rename column to be of the correct leaves (with population=TRUE values)
getleaves <- which(tree$Get('Population', filterFun = isLeaf, traversal = 'post-order'))
pop.x <- x %>%
select(all_of(getleaves))
pop.x <- log(pop.x)
peakedp0q0.pop.x <- exp(pop.x)
# generate values \hat{N} to plot
weights <- as.matrix(pop.ko.weights(tree))
peakedp0q0weights <- weights
peakedp0q0Ns <- as.data.frame(as.matrix(pop.x) %*% t(weights))
peakedp0q0Ns <- round(exp(peakedp0q0Ns))
peakedp0q0Ns.restrict <- peakedp0q0Ns[peakedp0q0Ns$V1>=30000 & peakedp0q0Ns$V1<=6000000, , drop = FALSE]
p <- ggplot(peakedp0q0Ns.restrict, aes(x = V1)) + geom_density(alpha=.2, fill="#FF6666") +
ggtitle("MM estimates for peaked (0) q, peaked (0) for p", subtitle = "One MM pass and set of weights, 5000 p,q estimates")
p
summary(peakedp0q0Ns)
## V1
## Min. :1.007e+06
## 1st Qu.:4.590e+06
## Median :7.588e+06
## Mean :1.543e+07
## 3rd Qu.:1.475e+07
## Max. :3.505e+09
peakedp0q0weights
##
## [1,] 0.5066745 0.4933255
Finally, the Bayesian model also overestimates \(N\) when \(f(p)\) is well specified by \(f(q)\) is poorly chosen: While there is overestimation in the value of node \(b\) when \(f(q)\) is misspecified, these distributions show clear improvement on the estimates generated with the same choice \(f(q)\) but uniform \(f(p)\).
It should also be noted that while the distribution and the Bayesian model do not show the same level of overestimation, these estimates are constrained by the choice of range for the discrete uniform prior on \(N\).