mb6-power-simulations

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.2
✔ ggplot2   4.0.0     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(broom)
library(lme4)

Loading required package: Matrix

Attaching package: 'Matrix'

The following objects are masked from 'package:tidyr':

    expand, pack, unpack

library(broom.mixed)
theme_set(theme_bw())
library(here)

here() starts at /home/vboyce/Research/mb6-power-analysis

library(haven)
library(ggthemes)
library(viridis)

Loading required package: viridisLite

Summary

The approach here was to use real data from Oostenbroek to get actual distributions of baby behavior and then modify it in various ways to approximate what MB6 data might look like under different options. In particular, I added different possible effect sizes of imitation across MO and TP behavior and across younger and older babies. Additionally, I considered a few different “baseline” options for what more alert babies over a longer time period might do.

For each data-modification, I ran 100 simulations (re-creating and re-sampling the data) each for different sample sizes to get estimates of how often we would detect significant effects (in the predicted direction). I looked at power for interaction effects as well as MO-MO match and TP-TP match, all both in the overall sample and in each age group (younger and older).

I considered effect sizes from cohen’s d of 0 to .7 (given meta-analysis of d=.68) for how much more an infant does behavior X when adult does X. Of note, we don’t know that TP and MO will have the same size “match” effect, and distinguishing between some hypotheses hinge on whether both show a match effect. There is also uncertainty around whether match affects will occur evenly or unevenly across age groups.

To summarize the results,

Assuming a certain size effect for MO-MO match and TP-TP match, and that these are equal, then we would have 80% power on the overall interaction effect to

detect a .5 effect with 50 younger and 50 older babies
detect a .3 effect with 100 younger and 100 older babies
detect a .2 effect with 200 younger and 200 older babies

To confirm a match-to-adult-sample effect, we would additionally like power to detect a significant increase of MO for MO-demo versus TP-demo and a significant increase of TP for TP-demo versus MO-demo. We would have 80% power for each of these for effect sizes of

.6 in 50 younger and 50 older babies
.4 in 100 younger and 100 older babies
.3 in 200 younger and 200 older babies

If the effect sizes are uneven, the smaller effect size is the limiting factor on confirming that there is a match to target for each behavior. The overall interaction depends on both.

If the effect is only present in one age group, then approximately double the numbers for detecting the effect in that age group. (Or see analyses below, considering effects only in one age group.)

It would be useful to get pilot data using the proposed paradigm to better estimate distributional factors than could get better power estimates because we’d better know what a baseline distribution of TP and MO behavior in alert babies over 90 seconds is. Shown above is when we would have 80% power under any of the options I considered, but this is a worst case, so piloting could narrow down the distributional uncertainty and determine that we actually have more power.

Most of the analyses below were done focus on a specific “middle-of-the-road” distribution (and thus show better power), but they are still useful for comparing across different effect sizes or effect size distributions.

Another note, given that there may be a preference to do more behavior overall to one adult demo compared to the other – an overall preference for adult_TP displays will impede our ability to detect an MO-MO match effect, depending on the relevant sizes of the preferences.

Real data

We have data for Oostenbroek which in raw form has the number of behaviors averaged over 4 15 seond periods.

We use the data from 1 and 3 week old babies doing MO and TP behaviors were used for the younger category, and the data from 6 and 9 week old babies doing MO and TP behaviors were used for the older category. We use data regardless of the adult demonstrator behavior. We randomly reassign the data to be for the adult demos and TP and MO.

week_1 <- read_sav(here("sample_data/Cross-sectional analyses (SPSS)/1 week responses (cross-sectional data).sav")) |> mutate(source = "week1")

week_3 <- read_sav(here("sample_data/Cross-sectional analyses (SPSS)/3 week responses (cross-sectional data).sav")) |> mutate(source = "week3")

week_6 <- read_sav(here("sample_data/Cross-sectional analyses (SPSS)/6 week responses (cross-sectional data).sav")) |> mutate(source = "week6")

week_9 <- read_sav(here("sample_data/Cross-sectional analyses (SPSS)/9 week responses (cross-sectional data).sav")) |> mutate(source = "week9")

raw_data <- week_1 |>
  bind_rows(week_3, week_6, week_9) |>
  select(Subject, source, starts_with("TP"), starts_with("MO")) |>
  pivot_longer(!c(source), names_to = "type", values_to = "count") |>
  mutate(
    infant_behavior = str_split_i(type, 1, pattern = "[a-z0-9]+"),
    adult_demo = str_split_i(type, 2, pattern = "[a-z0-9]+")
  ) |>
  filter(infant_behavior %in% c("TP", "MO")) |>
  filter(!is.na(count)) |>
  filter(adult_demo %in% c(
    "TP", "MO", "HAP", "SAD", "TUBE", "BOX",
    "IP", "GRA", "EEE", "MMM", "CLK"
  )) |>
  mutate(count = count * 4) |>
  select(-type, -adult_demo) |>
  mutate(age = ifelse(source %in% c("week3", "week1"), "younger", "older")) |>
  rowwise() |>
  mutate(adult_demo = ifelse(runif(1) > .5, "TP", "MO")) |>
  select(-source) |>
  write_csv(here("raw.csv"))

Simulation process

This is done in run_power_sim.R which was run on a cluster. ## data simulation for each run of the simulation:

the data were at baseline juiced with:

a multiplier of 1.5 to go from 60 seconds to 90 seconds
an addition of a rnbinom with mean of 1, dispersion of 1 (to boost towards less sleepy baby behavior)

then depending on the condition, the data in certain conditions was boosted.

we sampled effect values of

(younger| older) (TP-match | MO-match) each was a value on {0, .1, .2, .3, .4,.5, .6, .7}.

These were multiplied by the sd of that condition (after baseline juicing) and then sampled with a rnbinom.

For equal effect sizes in the 4 options (younger/older and TP/MO) we also considered some other parameter settings including

dispersion values: .3, 1, 3 (applied to all added rnbinoms)
overall add of rnbinom with mean of 0, 1
overall multiplier 1.5 or 2 to the raw data
optionally add another rnbinom with mean .2 std deviations to just the adult_demo=TP conditions (this represents if babies are more aroused by adult_TP than adult_MO and do more stuff across the board) (a larger value of .6 was also tried, not shown because that swamps nearly all MO-match effects)

sampling

We then sampled 50, 100, 200 or 400 trials from each condition (age x infant x adult). So the sample sizes are per age group.

We then ran glm.nb(count ~ age * adult_demo * infant_behavior, data = data) and used emmeans to extract the p-values on the contrasts of interest.

Of note, this procedure does not include mixed effects both so that model will run faster and because this data juicing/sampling procedure breaks infant-level correlations that might exist in the source data (we just sampled trials, not babies). We also don’t introduce simulated lab-level correlations. This choice could be revisited in the future.

terms

We consider the significance of a number of terms

(age: all, younger, older) x (interaction, MO-match, TP-match)

Specifically, we count as “positive” the cases where p<.05 and the coefficient is in the “correct” direction.

This both lets us see where we’d have power and also see what level of false-positive/non-interpretable we’d get for conditions where the measured effect is not actually there (ex: if we set the effect to 0 for younger babies, we should not see effects for younger babies in the outcome).

main_results <- readRDS("output.rds") |> bind_rows(readRDS("output2.rds")) |> bind_rows(readRDS("output3.rds")) |> 
  filter(younger_TP_preference%in%c(0), overall_multiplier==1.5, overall_add==1, dispersion==1) |> 
  mutate(baseline_factors=interaction(overall_multiplier, overall_add, dispersion, younger_TP_preference, older_TP_preference) |> as.factor(),
         baseline_TP_preference=older_TP_preference |> as.factor())

plot_opts <- list(
geom_point(aes(x=samples, y=sig, color=as.factor(older_TP_match),
                                 group = interaction(older_TP_match, baseline_TP_preference))),
                  geom_line(aes(x=samples, y=sig, color=as.factor(older_TP_match),
                                 group = interaction(older_TP_match, baseline_TP_preference))),
  scale_color_viridis(discrete = T),
  geom_hline(yintercept = .8, lty = "dotted"),
    facet_grid(str_c("age:",age)~str_c("term:",type)),
    theme(legend.position = "bottom"), labs(y="power")
)

Results: no age effects

We first consider simulations where the effects are consistent across the two ages.

effects are equal in all conditions

Here we consider effects (of TP-TP match preference) of different sizes from 0 to .7. Here whatever effect size we set for TP-TP match is also the effect size for MO-MO match and is equal across both age groups.

main_results |>  filter( older_TP_match == younger_TP_match, older_MO_match == younger_MO_match, older_TP_match == older_MO_match) |>
  ggplot()+plot_opts+labs(color="effect size", title="All effects equal")

Note that it takes about twice as many babies to have the same power if you go from the interaction to MO-MO match and TP-TP match (half the data) or if you go from all babies to only one age group (half the data).

effects are only present for TP

If there’s no MO match. Note that on the TP and interaction terms, this is difficult to distinguish from a small MO effect (see next).

main_results |> filter(overall_multiplier==1.5, overall_add==1, dispersion==1) |> filter( older_TP_match == younger_TP_match, older_MO_match==0, younger_MO_match==0) |> 
  ggplot()+plot_opts+labs(color="effect size (TP)", title="No MO effect, equal TP effect across age")

effects are small for MO

Here we consider if MO-match has an effect size of .2, how does variation in TP effect size play out in our ability to detect an interaction.

main_results |> filter(overall_multiplier==1.5, overall_add==1, dispersion==1) |> filter( older_TP_match == younger_TP_match, older_MO_match==.2, younger_MO_match==.2) |>
  ggplot()+plot_opts+labs(color="effect size (TP)", title="Small MO effect, equal TP across age")

Results: Effects are only in older babies

What if the effects are only in one age group? Here we consider effects for older babies, but similarish patterns should exist for effects in only younger babies.

equal effects

main_results |> filter(overall_multiplier==1.5, overall_add==1, dispersion==1) |> filter(younger_TP_match ==0,  older_TP_match==older_MO_match, younger_MO_match == 0) |>
  ggplot()+plot_opts+labs(color="effect size", title="Equal effects in older babies only")

Only TP

main_results |> filter(younger_TP_match ==0,  older_MO_match==0, younger_MO_match == 0) |>
  filter(overall_multiplier==1.5, overall_add==1, dispersion==1) |> 
  ggplot()+plot_opts+labs(color="effect size (TP)", title="Only TP effect, in older babies only")

small MO

main_results |> filter(younger_TP_match ==0,  older_MO_match==.2, younger_MO_match == 0) |>
  filter(overall_multiplier==1.5, overall_add==1, dispersion==1) |> 
  ggplot()+plot_opts+labs(color="effect size (TP)", title="Older babies only, small MO effect, varying TP effect")

Results: robustness to distributional factors

In the above analyses, we looked only at 1 set of distributional/background settings. Here we explore the wider range, but only for the case of equal effects. Shown in red is the results that correspond to the condition used above.

to detect interaction effect

robustness_results <-  readRDS("output.rds") |> bind_rows(readRDS("output2.rds")) |> bind_rows(readRDS("output3.rds")) |> 
  filter(younger_TP_preference%in%c(0, .2), older_TP_match == younger_TP_match, older_MO_match == younger_MO_match, older_TP_match == older_MO_match) |> 
  mutate(baseline_factors=interaction(overall_multiplier, overall_add, dispersion, younger_TP_preference, older_TP_preference) |> as.factor(),
         baseline_TP_preference=older_TP_preference |> as.factor())

prev <- main_results |>filter(age=="all") |>  filter(younger_TP_preference%in%c(0, .2), older_TP_match == younger_TP_match, older_MO_match == younger_MO_match, older_TP_match == older_MO_match)

robustness_results |> filter(type=="interaction", age=="all") |>ggplot()+
geom_point(aes(x=samples, y=sig, color=as.factor(older_TP_match),
               shape=baseline_TP_preference,
                                 group = interaction(baseline_factors, older_TP_match, baseline_TP_preference)))+
                  geom_line(aes(x=samples, y=sig, lty=baseline_TP_preference, color=as.factor(older_TP_match),
                                 group = interaction(baseline_factors, older_TP_match, baseline_TP_preference)))+
  scale_color_viridis(discrete = T)+
  geom_point(data=prev |> filter(type=="interaction"), aes(x=samples, y=sig,
               shape=baseline_TP_preference,
                                 group = interaction(baseline_factors, older_TP_match, baseline_TP_preference)), color="red")+
  geom_hline(yintercept = .8, lty = "dotted")+
    facet_wrap(~older_TP_match)+
    theme(legend.position = "bottom")+ labs(y="power")+labs(color="effect size", title="All effects equal, interaction")

on MO-MO match effect

robustness_results |> filter(type=="MO", age=="all") |>ggplot()+
geom_point(aes(x=samples, y=sig, color=as.factor(older_TP_match),
               shape=baseline_TP_preference,
                                 group = interaction(baseline_factors, older_TP_match, baseline_TP_preference)))+
                  geom_line(aes(x=samples, y=sig, lty=baseline_TP_preference, color=as.factor(older_TP_match),
                                 group = interaction(baseline_factors, older_TP_match, baseline_TP_preference)))+
   geom_point(data=prev |> filter(type=="MO"), aes(x=samples, y=sig,
               shape=baseline_TP_preference,
                                 group = interaction(baseline_factors, older_TP_match, baseline_TP_preference)), color="red")+
  scale_color_viridis(discrete = T)+
  geom_hline(yintercept = .8, lty = "dotted")+
    facet_wrap(~older_TP_match)+
    theme(legend.position = "bottom")+ labs(y="power")+labs(color="effect size", title="All effects equal, MO effect")

An overall preference for TP (more likely to do more in response to adult TP demo) makes it harder to detect the MO-MO match because it raises the MO-TP rate.

on TP-TP match effect

robustness_results |> filter(type=="TP", age=="all")  |>ggplot()+
geom_point(aes(x=samples, y=sig, color=as.factor(older_TP_match),
               shape=baseline_TP_preference,
                                 group = interaction(baseline_factors, older_TP_match, baseline_TP_preference)))+
                  geom_line(aes(x=samples, y=sig, lty=baseline_TP_preference, color=as.factor(older_TP_match),
                                 group = interaction(baseline_factors, older_TP_match, baseline_TP_preference)))+
  scale_color_viridis(discrete = T)+
   geom_point(data=prev |> filter(type=="MO"), aes(x=samples, y=sig,
               shape=baseline_TP_preference,
                                 group = interaction(baseline_factors, older_TP_match, baseline_TP_preference)), color="red")+
  geom_hline(yintercept = .8, lty = "dotted")+
    facet_wrap(~older_TP_match)+
    theme(legend.position = "bottom")+ labs(y="power")+labs(color="effect size", title="All effects equal, TP effect")

Conversely, an overall preference for TP (more likely to do more in response to adult TP demo) makes it easier to detect the TP-TP effect because it raises the TP-TP rate.