Now that we’ve made a bunch of simulated data, we need to interpret
it. The output from the model (the input to this document) is a list
with names that follow the following schema:
nPC_nARU_psi_p11_p_aru11_p_aru01, where all of the values
are numeric.
Here are a few of the questions I want to ask of the results:
mean_psi
parameter asymptote?A key component of this document is finding when the precision hits a limit and will not increase.
This normally happens at around 5 to 10 ARU recordings, and adding PCs will make it slightly earlier and make the asymptote a little higher.
knitr::include_graphics(here("models/output/asymptote_example.png"))
Instead of having to graph every single combination of parameters, it would be nice to be able to systematically find the asymptote given the points.
Here is a function that will fit linear models to the data until it finds one that it sufficiently flat. It records the nARU needed to get a sufficiently flat line.
find_asymptote <- function(df, nPCs) {
asymptote <- NULL
for(i in 1:24) {
model <- lm(precision ~ nARU, data = filter(df, nPC == nPCs, nARU > i)) %>%
summary()
slope <- model$coefficients[2,1] # the slope for the linear model
if (slope < 0.2) {
asymptote = i
break()
}
}
return(asymptote)
}
We will make a tibble with the names of from the output of the simulation and then split the names into different columns. We can then group by everything except nARU and get a single value (when it hits the asymptote) for every combination of parameters.
First, we need to make a function that will get the precision from the trialResults list given the list element name.
precision <- function(values) {
1 / var(values)
}
get_prec <- function(trialResults, element) {
singleResult <- trialResults[[element]]
singleResult$sims.list$mean_psi %>%
precision()
}
Then, we can use that function to get the precision and add it to the tibble.
indices <- names(trialResults) %>%
tibble() %>%
rename(params = 1) %>%
separate(params, c("nPC", "nARU", "psi", "p11", "p_aru11", "p_aru01"),
sep = "_", remove = F) %>%
mutate(nPC = as.numeric(nPC), nARU = as.numeric(nARU), psi = as.numeric(psi),
p11 = as.numeric(p11), p_aru11 = as.numeric(p_aru11),
p_aru01 = as.numeric(p_aru01))
prec_values <- map_df(indices$params,
~ get_prec(trialResults, .x) %>%
tibble(params = .x)) %>%
rename(precision = 1) %>%
left_join(indices, by = "params")
result <- prec_values %>%
group_by(nPC, psi, p11, p_aru11, p_aru01) %>%
summarize(nARU_for_asymp = find_asymptote(., max(nPC)))
## `summarise()` has grouped output by 'nPC', 'psi', 'p11', 'p_aru11'. You can
## override using the `.groups` argument.
head(result)
## # A tibble: 6 × 6
## # Groups: nPC, psi, p11, p_aru11 [6]
## nPC psi p11 p_aru11 p_aru01 nARU_for_asymp
## <dbl> <dbl> <dbl> <dbl> <dbl> <int>
## 1 1 0.1 0 0 0.025 7
## 2 1 0.1 0 0.1 0.025 7
## 3 1 0.1 0 0.2 0.025 7
## 4 1 0.1 0 0.3 0.025 7
## 5 1 0.1 0.15 0 0.025 7
## 6 1 0.1 0.15 0.1 0.025 7
Wow, that’s pretty cool. So, basically, it only takes 6-7 nARU to get to the asymptote. Maybe we can visualize these asymptotes to see if more PC means a higher precision.
ggplot(prec_values, aes(nARU, precision, color = as.factor(nPC))) +
geom_jitter() +
geom_smooth()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
Okay, so it seems that nPC does not affect the precision of the posterior. But there must be something to explain the 4 “levels” of values we see in the figure.
So we can get the precision that each combination asymptotes at and see if we can find correlations between certain parameters and the precision.
asymptote <- prec_values %>%
filter(nARU > 10) %>% # 10 just to be sure it's at the asymptote
group_by(nPC, psi, p11, p_aru11, p_aru01) %>%
summarize(asym_prec = mean(precision))
## `summarise()` has grouped output by 'nPC', 'psi', 'p11', 'p_aru11'. You can
## override using the `.groups` argument.
head(asymptote)
## # A tibble: 6 × 6
## # Groups: nPC, psi, p11, p_aru11 [6]
## nPC psi p11 p_aru11 p_aru01 asym_prec
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 0.1 0 0 0.025 779.
## 2 1 0.1 0 0.1 0.025 775.
## 3 1 0.1 0 0.2 0.025 778.
## 4 1 0.1 0 0.3 0.025 775.
## 5 1 0.1 0.15 0 0.025 776.
## 6 1 0.1 0.15 0.1 0.025 775.
So it seems that the model only struggles when psi is high and p11 is low. This makes sense since it would have very little to judge the high probability of occupancy on. It actually still does a decent job with just a tiny bit of data!
ggplot(asymptote, aes(psi, asym_prec, color = as.factor(p11))) +
geom_jitter() +
labs(title = "psi vs the Precision once the model is saturated with data")
ggplot(asymptote, aes(p11, asym_prec, color = as.factor(psi))) +
geom_jitter() +
labs(title = "p11 vs the Precision once the model is saturated with data")
ggplot(asymptote, aes(nPC, asym_prec)) +
geom_jitter() +
geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'
Psi values that are near the middle (0.5ish) tend to have lower precision values than those closer to the edge of the spectrum. Of course, the times when p11 is 0, the model struggles and falls outside of the norm.
This makes sense because the variance for the binomial distribution is described by \(\sigma ^2 = npq\). So when p is close to 0.5, the variance is the highest (and therefore precision is low).
Now that we know that we don’t need a ton of data to hit the asymtote and that psi values close to the extremes change the variance, let’s see how well the model did at “recovering” the parameters I inputted into the simulation.
To do this, we’ll get the means from the trialResults
and compare them to the names of the trialResults (the names correspond
to the parameters inputted).
get_means <- function(trialResults, element) {
singleResult <- trialResults[[element]]
singleResult$mean %>% unlist()
}
indices <- names(trialResults) %>%
tibble() %>%
rename(params = 1) %>%
separate(params, c("nPC", "nARU", "psi", "p11", "p_aru11", "p_aru01"),
sep = "_", remove = F) %>%
mutate(nPC = as.numeric(nPC), nARU = as.numeric(nARU), psi = as.numeric(psi),
p11 = as.numeric(p11), p_aru11 = as.numeric(p_aru11),
p_aru01 = as.numeric(p_aru01))
mean_values <- map_df(indices$params,
~ get_means(trialResults, .x) %>%
t() %>%
as_tibble() %>%
tibble(params = .x)) %>%
left_join(indices, by = "params", suffix = c(".model", ".sim"))
Now we can plot some of the values to see if they are related (they should be).
ggplot(mean_values, aes(psi, mean_psi)) +
geom_jitter() +
geom_smooth(method = "lm") +
ylab("model") +
xlab("simulated parameter") +
labs(title = "psi") +
xlim(c(0,1)) +
ylim(c(0,1))
## `geom_smooth()` using formula 'y ~ x'
ggplot(mean_values, aes(p11.sim, p11.model)) +
geom_jitter() +
geom_smooth(method = "lm") +
ylab("model") +
xlab("simulated parameter") +
labs(title = "p11") +
xlim(c(0,1)) +
ylim(c(0,1))
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 1314 rows containing missing values (geom_point).
ggplot(mean_values, aes(p_aru11.sim, p_aru11.model)) +
geom_jitter() +
geom_smooth(method = "lm") +
ylab("model") +
xlab("simulated parameter") +
labs(title = "p_aru11") +
xlim(c(0,1)) +
ylim(c(0,1))
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 1925 rows containing missing values (geom_point).
ggplot(mean_values, aes(p_aru01.sim, p_aru01.model)) +
geom_jitter() +
geom_smooth(method = "lm") +
ylab("model") +
xlab("simulated parameter") +
labs(title = "p_aru01") +
xlim(c(0,0.25)) +
ylim(c(0,0.25))
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 10657 rows containing missing values (geom_point).