Intro

What are ref games?

What are ref games used for / what have people found

Why aggregate?

Note: discuss both that other similar aggregation schemes (CHILDES, Wordbank, Peekbank, etc.) have been helpful within their own topics

makes data reuse easier – make some things possible that otherwise wouldn’t have been

and that the aggregation makes it more helpful for further fields – an important first step for CS/AI benchmarking/applications or for task corpora in linguistics

Here we …

“Methods”

TODO see also how other papers do this like peekbank dataset paper etc

Data description

(format and processing)

Schema

Processing

dataset normalization / transcription / target coding / etc (where are the places where we had to make guesses)

In here could discuss that this makes it not human subjects data b/c it doesn’t have identifiers

criteria for inclusion / choices to record exclusions but not apply them

Metadata coding

Derivative measures

Things like word count, SBERT embeddings, PoS tags, etc. Discuss that processing measures that involve comparison across trials need our own light exclusions.

Current datasets

(+ how solicitation happened) this is where a table of included datasets & vague properties could be helpful!

Using refbank

how to access etc Shiny app / API / redivis versioning etc

Analyses

Note: As a pilot to make sure we got the model specifications right (and that the models would run) we tested these on boyce2024 & hawkins2020.

All of these analyses are run only on “stage 1” data

Reduction

There’s a question of what the correct functional form is for the words ~ rep_num relationship. We fit 4 options – the full 2x2 of log or raw words x log or raw rep num.
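For reference, the four specifications correspond to the forms below for the expected trend. This is a sketch that ignores the random effects and assumes log_words is the natural log of the word count (an assumption; the exact transformation is not spelled out here). The symbols are placeholders: w is the word count, r is rep_num, and α and β are the intercept and slope.

```latex
\begin{aligned}
\text{log-log:} \quad & \log w = \alpha + \beta \log r
  \;\Longleftrightarrow\; w = e^{\alpha}\, r^{\beta} \\
\text{log-lin:} \quad & \log w = \alpha + \beta r
  \;\Longleftrightarrow\; w = e^{\alpha}\, e^{\beta r} \\
\text{lin-log:} \quad & w = \alpha + \beta \log r \\
\text{lin-lin:} \quad & w = \alpha + \beta r
\end{aligned}
```

Under these assumptions, log-log amounts to a power law in rep_num and log-lin to exponential change in rep_num.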

p_beta <- prior_string("normal(0,.5)", class = "b")
p_sd <- prior_string("normal(0,.5)", class = "sd")

p_intercept_logscale <- prior_string("normal(2,.5)", class = "Intercept")
p_intercept_linear <- prior_string("normal(10,10)", class = "Intercept")
p_beta_linear <- prior_string("normal(0,5)", class = "b")
p_sd_linear <- prior_string("normal(0,5)", class = "sd")

log_dv_priors <- c(p_intercept_logscale, p_beta, p_sd)
linear_dv_priors <- c(p_intercept_linear, p_beta_linear, p_sd_linear)

red_mod_log_log <- brm(log_words ~ log_rep_num + (log_rep_num || dataset_id / condition_id),
  prior = log_dv_priors)

red_mod_log_lin <- brm(log_words ~ rep_num + (rep_num || dataset_id / condition_id),
  prior = log_dv_priors
)

red_mod_lin_log <- brm(words ~ log_rep_num + (log_rep_num || dataset_id / condition_id),
  prior = linear_dv_priors
)

red_mod_lin_lin <- brm(words ~ rep_num + (rep_num || dataset_id / condition_id),
  prior = linear_dv_priors
)

Log-log

Note: here and elsewhere, the panels and colors are just different conditions. We want to make sure that the random effects can capture condition differences, and we spread them across panels for viewability.

Note that for log words the actual data are spiky because the number of words is discrete: a value can be 0 or log(2) and so on, but nothing in between.

Log-lin

Lin-log

Lin-lin

Loo

##         elpd_diff se_diff  elpd_loo se_elpd_loo p_loo se_p_loo    looic se_looic
## log_log       0.0     0.0  -91826.1       177.2  24.7      0.3 183652.1    354.3
## log_lin     -66.7    25.2  -91892.7       177.2  25.5      0.3 183785.4    354.4
## lin_log  -11092.3   232.8 -102918.3       274.5  40.1      1.2 205836.6    548.9
## lin_lin  -11415.3   233.2 -103241.3       271.8  37.2      1.1 206482.6    543.6

##         elpd_diff se_diff
## log_log   0.0       0.0  
## log_lin -66.7      25.2

Log words fits far better and seems right. Whether the relationship is exponential in rep_num (log-lin) or a power law (log-log) is hard to tell.
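As a quick check on how close the two log-DV models are, the gap between them can be expressed in standard-error units using the values printed in the comparison output above. This is plain arithmetic on the reported numbers, not a new model fit, and the variable names are only for illustration.

```r
# Values copied from the loo comparison above (log_lin relative to log_log)
elpd_diff <- -66.7
se_diff <- 25.2

# Size of the difference in units of its standard error
diff_in_se <- elpd_diff / se_diff
round(diff_in_se, 2)
```

The log-log model is favored by roughly 2.6 standard errors, which is small next to the gap of more than 11,000 elpd units separating either log-DV model from the linear-DV models.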

Reduction Moderators

Two possible approaches to moderators.

First approach: take the well-performing models above (log-log and log-lin), then predict each condition’s slope from all the predictor variables (runs fast).
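A toy sketch of this two-stage idea is below. It uses per-condition lm() fits as a simple stand-in for slopes extracted from the multilevel brms model, so it is not the pipeline used here. The data are simulated, and the condition labels and effect sizes are made up for illustration.

```r
set.seed(1)

# Hypothetical toy data: 6 conditions, 6 repetitions, 20 descriptions per cell
toy <- expand.grid(condition_id = factor(1:6), rep_num = 1:6, obs = 1:20)
toy$backchannel <- ifelse(as.integer(toy$condition_id) <= 3, "full", "limited")

# Made-up generating slopes: steeper reduction with full backchannel
true_slope <- ifelse(toy$backchannel == "limited", -0.1, -0.3)
toy$log_words <- 3 + true_slope * toy$rep_num + rnorm(nrow(toy), sd = 0.3)

# Stage 1: estimate one reduction slope per condition
slopes <- do.call(rbind, lapply(split(toy, toy$condition_id), function(d) {
  data.frame(
    condition_id = d$condition_id[1],
    backchannel = d$backchannel[1],
    slope = unname(coef(lm(log_words ~ rep_num, data = d))["rep_num"])
  )
}))

# Stage 2: predict the per-condition slopes from a condition-level predictor
stage2 <- lm(slope ~ backchannel, data = slopes)
coef(stage2)
```

In this sketch a positive coefficient on the backchannel term means a shallower (less negative) reduction slope in that condition. One limitation of any two-stage approach like this is that the second stage treats the first-stage slopes as known values and ignores their uncertainty.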

p_beta_linear <- prior_string("normal(0,.2)", class = "b")

log_lin_pred_mod <- brm(
  slope ~ n_players +
    # option_size +
    # image_type +
    # partner_constancy +
    role_constancy +
    # population +
    # modality +
    feedback +
    backchannel,
  prior = c(p_beta_linear),
  data = log_lin_preds
)

(and same for log-log model)

For the full model, we’d have all the predictors, but we don’t have variation on many of them with just the two pilot datasets.

Second approach: run full models with groups of predictors.

The code below shows the log-log versions; the log-lin versions are analogous. For the full analysis we’d have all the predictors and all three models, but we don’t have variation on many of the predictors with just the two pilot datasets.

p_intercept_logscale <- prior_string("normal(2,.5)", class = "Intercept")
p_intercept_linear <- prior_string("normal(10,10)", class = "Intercept")
p_beta_linear <- prior_string("normal(0,5)", class = "b")
p_sd_linear <- prior_string("normal(0,5)", class = "sd")

log_dv_priors <- c(p_intercept_logscale, p_beta, p_sd)
linear_dv_priors <- c(p_intercept_linear, p_beta_linear, p_sd_linear)

red_mod_log_log_participants <- brm(
  log_words ~ log_rep_num *
    # population*
    n_players + (log_rep_num || dataset_id / condition_id),
  prior = log_dv_priors,
)

# not run because no variation in pilot set
red_mod_log_log_images <- brm(
  log_words ~ log_rep_num * (option_size + image_type) +
    (log_rep_num || dataset_id / condition_id),
  prior = log_dv_priors,
)

red_mod_log_log_channel <- brm(
  log_words ~ log_rep_num * (role_constancy +
    # modality+
    feedback + backchannel) +
    (log_rep_num || dataset_id / condition_id),
  prior = log_dv_priors,
)

Moderation by participant structure

n-players, age-group (on stage 1 only, using n-players who are active at this point)

So, comparing for n-players

log-log slope model: -.01 [-.06, .04]
log-lin slope model: -.01 [-.02, .01]
log-log full model: -.01 [-.04, .02]
log-lin full model: -.01 [-.02, -.00]

The pilot samples don’t have age-group variation.

Moderation by expt design / communication channel

thickness, modality

For role-constancy (yes):

log-log slope model: -.03 [-.23, .18]
log-lin slope model: .00 [-.08, .09]
log-log full model: -.05 [-.31, .18]
log-lin full model: .00 [-.09, .09]

For feedback (limited):

log-log slope model: .01 [-.19, .21]
log-lin slope model: -.01 [-.09, .08]
log-log full model: -.05 [-.30, .20]
log-lin full model: -.01 [-.10, .08]

For backchannel (limited):

log-log slope model: .31 [.08, .50]
log-lin slope model: .14 [.06, .21]
log-log full model: .38 [.15, .60]
log-lin full model: .15 [.06, .23]

Stim sets

type of stims x n targets

no variation to compare in this set

So which to choose?

Estimates between slope and full model with the same functional form are quite consistent – so which one should we use?

PoS

We have PoS data for all monolingual corpora.

After much model wrangling, Alvin and I got multinomial models for PoS working.

The best-fitting functional form uses log(rep_num).

p_beta_pos <- prior_string("normal(0,1.5)", class = "b", dpar = c("muDET", "muFUNCTION", "muMODIFIER", "muNOUN", "muVERB"))
p_sd_pos <- prior_string("normal(0,1.5)", class = "sd", dpar = c("muDET", "muFUNCTION", "muMODIFIER", "muNOUN", "muVERB"))
p_intercept_pos <- prior_string("normal(0, 1.5)", class = "Intercept", dpar = c("muDET", "muFUNCTION", "muMODIFIER", "muNOUN", "muVERB"))

logistic_pos_priors <- c(p_beta_pos, p_sd_pos, p_intercept_pos)

per_describer_for_model <- read_rds(here("cached_model_files/data_for_mods/per_describer_for_model.rds")) |>
  mutate(
    condition_id = as.factor(condition_id),
    total = NOUN + VERB + MODIFIER + FUNCTION + DET + PRON,
    w = 1 / total
  ) |>
  filter(total != 0)

pos_mod_log <- brm(
  cbind(NOUN, VERB, MODIFIER, FUNCTION, DET, PRON) | trials(total) + weights(w) ~ log_rep_num +
    (log_rep_num || dataset_id / condition_id),
  family = multinomial(refcat = "PRON"),
  prior = logistic_pos_priors,
)

Semantic embedding similarities

We now have multilingual embeddings for all corpora. We look at the similarity to the next rep (same game, same target) as a measure of within-game similarity, and at the cross-game (same condition, same target, same rep) dissimilarity.
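The two comparisons can be sketched on toy data as follows. This assumes cosine similarity between embedding vectors, which is an assumption on our part since the similarity metric is not stated here. The data frame, matrix, and helper names are hypothetical, and the 4-dimensional vectors are made up.

```r
# Cosine similarity between two numeric vectors
cosine_sim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

# Hypothetical toy data: 2 games, 1 target, 3 repetitions, 4-dim embeddings
meta <- data.frame(
  game_id = rep(c("g1", "g2"), each = 3),
  rep_num = rep(1:3, times = 2)
)
emb <- rbind(
  c(1, 0, 0, 0), c(1, 1, 0, 0), c(1, 1, 0, 0), # game g1, reps 1-3
  c(0, 0, 1, 0), c(0, 0, 1, 1), c(0, 0, 0, 1)  # game g2, reps 1-3
)

# Within game: similarity of each description to the next repetition
to_next <- function(game) {
  idx <- which(meta$game_id == game)
  idx <- idx[order(meta$rep_num[idx])]
  sapply(seq_len(length(idx) - 1), function(i) {
    cosine_sim(emb[idx[i], ], emb[idx[i + 1], ])
  })
}

# Across games: similarity between the two games at the same repetition
cross_game <- sapply(1:3, function(r) {
  cosine_sim(
    emb[meta$game_id == "g1" & meta$rep_num == r, ],
    emb[meta$game_id == "g2" & meta$rep_num == r, ]
  )
})

to_next("g1")
cross_game
```

With more than two games per condition, the cross-game measure would need some aggregation over game pairs (for example a mean over all pairs); that choice is left open in this sketch.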

The fit shown here isn’t great and we are currently running with log_rep_num instead. If that looks better, we will switch to that.

# Priors now on logit scale (ordbeta uses logit link)
# logit(0.72) ≈ 0.94, so intercept around 1
p_beta_sim <- prior_string("normal(0, 0.5)", class = "b")
p_sd_sim <- prior_string("normal(0, 0.5)", class = "sd")
p_intercept_sim <- prior_string("normal(1, 1.5)", class = "Intercept")

sim_priors <- c(p_intercept_sim, p_beta_sim, p_sd_sim)

sims_for_model <- read_rds(here("cached_model_files/data_for_mods/sims_for_model.rds"))

to_next_mod <- ordbetareg(sim ~ rep_num + (rep_num || dataset_id / condition_id),
  manual_prior = sim_priors,
  file = here("cached_model_files/mods/to_next_mod.rds"),
  data = sims_for_model |> filter(sim_type == "to_next")
)

diverge_mod <- ordbetareg(sim ~ rep_num + (rep_num || dataset_id / condition_id),
  manual_prior = sim_priors,
  file = here("cached_model_files/mods/diverge_mod.rds"),
  data = sims_for_model |> filter(sim_type == "diverge")
)

Discussion

summary of what we did and results

Commentary on the fact that this data can be shared once it’s transcribed b/c it’s not human subjects data, and that data reuse is one of our favorite things (so, y’know, give us more datasets!). (Could crib from the Peekbank behavioral methods paper for how they framed this!)

Useful for exploration and for future directions for new data collection (ex. multi-lingual, or cleanly comparable modality or whatever we want to point towards)

Refbank the dataset the paper

Veronica Boyce

Alvin W. M. Tan

MORE AUTHORS GO HERE

Michael C. Frank