Overview

Now that we’ve made a bunch of simulated data, we need to interpret it. The output from the model (the input to this document) is a list with names that follow the following schema: nPC_nARU_psi_p11_p_aru11_p_aru01, where all of the values are numeric.

Here are a few of the questions I want to ask of the results:

After how much data does the precision of the mean_psi parameter asymptote?
Do the other parameters, such as psi, affect how well the model is able to estimate the parameters? In other words, does the model struggle on the extreme ends of these parameters? Does more data help the model estimate these extreme parameters?

Finding the asymptote

A key component of this document is finding when the precision hits a limit and will not increase.

This normally happens at around 5 to 10 ARU recordings, and adding PCs will make it slightly earlier and make the asymptote a little higher.

knitr::include_graphics(here("models/output/asymptote_example.png"))

Instead of having to graph every single combination of parameters, it would be nice to be able to systematically find the asymptote given the points.

Here is a function that will fit linear models to the data until it finds one that it sufficiently flat. It records the nARU needed to get a sufficiently flat line.

find_asymptote <- function(df, nPCs) {
  asymptote <- NULL
  for(i in 1:24) {
    model <- lm(precision ~ nARU, data = filter(df, nPC == nPCs, nARU > i)) %>% 
      summary()
    slope <- model$coefficients[2,1] # the slope for the linear model
    if (slope < 0.2) {
      asymptote = i
      break()
    }
  }
  return(asymptote)
}

Using the Asymptote

We will make a tibble with the names of from the output of the simulation and then split the names into different columns. We can then group by everything except nARU and get a single value (when it hits the asymptote) for every combination of parameters.

First, we need to make a function that will get the precision from the trialResults list given the list element name.

precision <- function(values) {
  1 / var(values)
}
get_prec <- function(trialResults, element) {
  singleResult <- trialResults[[element]]
  singleResult$sims.list$mean_psi %>% 
    precision()
}

Then, we can use that function to get the precision and add it to the tibble.

indices <- names(trialResults) %>% 
  tibble() %>% 
  rename(params = 1) %>% 
  separate(params, c("nPC", "nARU", "psi", "p11", "p_aru11", "p_aru01"), 
           sep = "_", remove = F) %>% 
  mutate(nPC = as.numeric(nPC), nARU = as.numeric(nARU), psi = as.numeric(psi), 
         p11 = as.numeric(p11), p_aru11 = as.numeric(p_aru11), 
         p_aru01 = as.numeric(p_aru01))

prec_values <- map_df(indices$params, 
              ~ get_prec(trialResults, .x) %>%
                  tibble(params = .x)) %>% 
                  rename(precision = 1) %>% 
                  left_join(indices, by = "params")

result <- prec_values %>% 
  group_by(nPC, psi, p11, p_aru11, p_aru01) %>% 
  summarize(nARU_for_asymp = find_asymptote(., max(nPC)))

## `summarise()` has grouped output by 'nPC', 'psi', 'p11', 'p_aru11'. You can
## override using the `.groups` argument.

head(result)

## # A tibble: 6 × 6
## # Groups:   nPC, psi, p11, p_aru11 [6]
##     nPC   psi   p11 p_aru11 p_aru01 nARU_for_asymp
##   <dbl> <dbl> <dbl>   <dbl>   <dbl>          <int>
## 1     1   0.1  0        0     0.025              7
## 2     1   0.1  0        0.1   0.025              7
## 3     1   0.1  0        0.2   0.025              7
## 4     1   0.1  0        0.3   0.025              7
## 5     1   0.1  0.15     0     0.025              7
## 6     1   0.1  0.15     0.1   0.025              7

Wow, that’s pretty cool. So, basically, it only takes 6-7 nARU to get to the asymptote. Maybe we can visualize these asymptotes to see if more PC means a higher precision.

Understanding the Precision

ggplot(prec_values, aes(nARU, precision, color = as.factor(nPC))) +
  geom_jitter() +
  geom_smooth()

## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

Okay, so it seems that nPC does not affect the precision of the posterior. But there must be something to explain the 4 “levels” of values we see in the figure.

So we can get the precision that each combination asymptotes at and see if we can find correlations between certain parameters and the precision.

asymptote <- prec_values %>% 
  filter(nARU > 10) %>% # 10 just to be sure it's at the asymptote
  group_by(nPC, psi, p11, p_aru11, p_aru01) %>% 
  summarize(asym_prec = mean(precision))

## `summarise()` has grouped output by 'nPC', 'psi', 'p11', 'p_aru11'. You can
## override using the `.groups` argument.

head(asymptote)

## # A tibble: 6 × 6
## # Groups:   nPC, psi, p11, p_aru11 [6]
##     nPC   psi   p11 p_aru11 p_aru01 asym_prec
##   <dbl> <dbl> <dbl>   <dbl>   <dbl>     <dbl>
## 1     1   0.1  0        0     0.025      779.
## 2     1   0.1  0        0.1   0.025      775.
## 3     1   0.1  0        0.2   0.025      778.
## 4     1   0.1  0        0.3   0.025      775.
## 5     1   0.1  0.15     0     0.025      776.
## 6     1   0.1  0.15     0.1   0.025      775.

So it seems that the model only struggles when psi is high and p11 is low. This makes sense since it would have very little to judge the high probability of occupancy on. It actually still does a decent job with just a tiny bit of data!

ggplot(asymptote, aes(psi, asym_prec, color = as.factor(p11))) +
  geom_jitter() +
  labs(title = "psi vs the Precision once the model is saturated with data")

ggplot(asymptote, aes(p11, asym_prec, color = as.factor(psi))) +
  geom_jitter() +
  labs(title = "p11 vs the Precision once the model is saturated with data")

ggplot(asymptote, aes(nPC, asym_prec)) +
  geom_jitter() +
  geom_smooth(method = "lm")

## `geom_smooth()` using formula 'y ~ x'

Psi values that are near the middle (0.5ish) tend to have lower precision values than those closer to the edge of the spectrum. Of course, the times when p11 is 0, the model struggles and falls outside of the norm.

This makes sense because the variance for the binomial distribution is described by \(\sigma ^2 = npq\). So when p is close to 0.5, the variance is the highest (and therefore precision is low).

Testing the Parameters

Now that we know that we don’t need a ton of data to hit the asymtote and that psi values close to the extremes change the variance, let’s see how well the model did at “recovering” the parameters I inputted into the simulation.

To do this, we’ll get the means from the trialResults and compare them to the names of the trialResults (the names correspond to the parameters inputted).

get_means <- function(trialResults, element) {
  singleResult <- trialResults[[element]]
  singleResult$mean %>% unlist()
}

indices <- names(trialResults) %>% 
  tibble() %>% 
  rename(params = 1) %>% 
  separate(params, c("nPC", "nARU", "psi", "p11", "p_aru11", "p_aru01"), 
           sep = "_", remove = F) %>% 
  mutate(nPC = as.numeric(nPC), nARU = as.numeric(nARU), psi = as.numeric(psi), 
         p11 = as.numeric(p11), p_aru11 = as.numeric(p_aru11), 
         p_aru01 = as.numeric(p_aru01))

mean_values <- map_df(indices$params, 
              ~ get_means(trialResults, .x) %>%
                  t() %>% 
                  as_tibble() %>% 
                  tibble(params = .x)) %>%
                  left_join(indices, by = "params", suffix = c(".model", ".sim"))

Now we can plot some of the values to see if they are related (they should be).

ggplot(mean_values, aes(psi, mean_psi)) +
  geom_jitter() +
  geom_smooth(method = "lm") +
  ylab("model") +
  xlab("simulated parameter") +
  labs(title = "psi") +
  xlim(c(0,1)) +
  ylim(c(0,1))

## `geom_smooth()` using formula 'y ~ x'

ggplot(mean_values, aes(p11.sim, p11.model)) +
  geom_jitter() +
  geom_smooth(method = "lm") +
  ylab("model") +
  xlab("simulated parameter") +
  labs(title = "p11") +
  xlim(c(0,1)) +
  ylim(c(0,1))

## `geom_smooth()` using formula 'y ~ x'

## Warning: Removed 1314 rows containing missing values (geom_point).

ggplot(mean_values, aes(p_aru11.sim, p_aru11.model)) +
  geom_jitter() +
  geom_smooth(method = "lm") +
  ylab("model") +
  xlab("simulated parameter") +
  labs(title = "p_aru11") +
  xlim(c(0,1)) +
  ylim(c(0,1))

## `geom_smooth()` using formula 'y ~ x'

## Warning: Removed 1925 rows containing missing values (geom_point).

ggplot(mean_values, aes(p_aru01.sim, p_aru01.model)) +
  geom_jitter() +
  geom_smooth(method = "lm") +
  ylab("model") +
  xlab("simulated parameter") +
  labs(title = "p_aru01") +
  xlim(c(0,0.25)) +
  ylim(c(0,0.25))

## `geom_smooth()` using formula 'y ~ x'

## Warning: Removed 10657 rows containing missing values (geom_point).

Investigating the Simulation Results

Mark Schulist

Overview

Finding the asymptote

Using the Asymptote

Understanding the Precision

Testing the Parameters