This markdown documents a new way of thinking about modeling variation in looking-while-listening (LWL) data. The idea is to:

  1. extract a summary statistic for each trial
  2. model these summaries with LMMs of various types

The trouble is, what’s the right summary statistic? There may not be just one! But let’s assume there is one and that we just need to compute it the right way.

So we’ll start by trying to figure out what the best measure is. We’re going to make these decisions so as to maximize within-experiment reliability, measured via intraclass correlation coefficients (ICCs).

We’re focused on familiar words here.

First step: develop a summary measure

Get data

all_aoi_timepoints <- get_aoi_timepoints()
all_stimuli <- get_stimuli()
all_administrations <- get_administrations()
all_subjects <- get_subjects()
all_trial_types <- get_trial_types() 
all_trials <- get_trials() 

aoi_data_joined <- all_aoi_timepoints |>
  right_join(all_administrations) |>
  right_join(all_subjects) |>
  right_join(all_trials) |>
  right_join(all_trial_types) |>
  mutate(stimulus_id = target_id) |>
  right_join(all_stimuli) |>
  select(dataset_name, subject_id, administration_id, trial_id, dataset_id, stimulus_id, t_norm, age, aoi, english_stimulus_label, stimulus_novelty)
# TO DO: looks like a very small number of rows get added in the join, probably a sign of an issue/ambiguity somewhere.

save(aoi_data_joined, file = here("explorations","data","aoi_data_joined.Rds"))
load(file = here("explorations","data","aoi_data_joined.Rds"))

Take only familiar-word data. (Note: we retain both monolingual and bilingual data; only the Tseltal dataset is excluded.)

d_trial <- aoi_data_joined |>
  filter(dataset_name != "casillas_tseltal_2015") |>
  filter(age > 12, age <= 60, 
         stimulus_novelty == "familiar") |> 
  select(dataset_name, subject_id, administration_id, trial_id, 
         dataset_id, stimulus_id, t_norm, age, aoi, english_stimulus_label) |>
  mutate(correct = ifelse(aoi == "target", 1, 
                          ifelse(aoi == "distractor", 0, NA)))
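(As an aside, the nested ifelse could equivalently be written with dplyr::case_when, which reads a bit more clearly; a sketch of the same recoding:)

# Equivalent recoding of aoi into correct with case_when
d_trial <- d_trial |>
  mutate(correct = case_when(aoi == "target" ~ 1,
                             aoi == "distractor" ~ 0,
                             TRUE ~ NA_real_))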

Curve summaries

Now think about what the basic curve is and how to get out various measures.

ggplot(d_trial, aes(x = t_norm, y = correct)) + 
  xlim(-2000,3500)+
  geom_smooth()

Seems like we want something that captures (1) the overall accuracy and (2) the rise coming soon after zero (the reaction time, RT).

(Comment MZ: this summary is a little atypical in that the common convention is to summarize average accuracy for each trial first, and then average across items/within subjects. That trial-first convention deals with missing data, e.g. the fact that different trials will have different amounts of (non-missing) looking data. Also note that we are summarizing by administrations, not subjects.)

d_summary <- d_trial |>
  group_by(dataset_name, dataset_id, subject_id, administration_id, stimulus_id) |>
  summarise(accuracy = mean(correct[t_norm > 0], na.rm=TRUE),
            prop_data = mean(!is.na(correct[t_norm > 0])))
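(For comparison, a minimal sketch of the conventional two-step aggregation mentioned in MZ’s comment: average within each trial first, then across trials. This weights each trial equally rather than each timepoint.)

d_summary_conventional <- d_trial |>
  group_by(dataset_name, dataset_id, subject_id, administration_id,
           stimulus_id, trial_id) |>
  summarise(trial_accuracy = mean(correct[t_norm > 0], na.rm = TRUE),
            .groups = "drop_last") |>
  summarise(accuracy = mean(trial_accuracy, na.rm = TRUE))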

ggplot(d_summary, aes(x = prop_data, y = accuracy)) +
  geom_point(alpha = .05)

There’s a lot of missing data and a lot of “zoners” (kids who look at only one side). The zoners are not simply the kids with missing data.
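(To put rough numbers on this, a sketch; here I count an administration-by-stimulus row with accuracy of exactly 0 or 1 as “zoning”:)

d_summary |>
  ungroup() |>
  summarise(prop_zoners = mean(accuracy %in% c(0, 1), na.rm = TRUE),
            prop_missing = mean(is.nan(accuracy)))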

ggplot(filter(d_summary, prop_data > .75),
       aes(x = accuracy)) + 
  geom_histogram()

Exclusions

Should we exclude data to get a more reliable measure? Here are two different decisions we could optimize:

  1. whether to exclude zoners
  2. whether to exclude based on prop_data

Let’s try to figure those out.

We’re going to use ICCs, following McGraw & Wong (1996). It seems like we want a two-way random effects model with no interaction (both subjects and items are meaningful sources of variance). This is type “2A.” We want average agreement across units.

One big decision is whether to look across stimulus items, rather than across kids. Across stimulus items returns much higher values. This is in part because we typically have more kids than items, and kids are sort of like “raters.”
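For reference, the average-measures agreement ICC we’re after is, if I’m reading McGraw & Wong right:

\[ ICC(A,k) = \frac{MS_R - MS_E}{MS_R + \frac{MS_C - MS_E}{n}} \]

where \(MS_R\), \(MS_C\), and \(MS_E\) are the mean squares for objects (rows), raters (columns), and error, and \(n\) is the number of objects.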

#devtools::install_github("jmgirard/agreement")
library(agreement)

get_icc <- function (x, column = "accuracy", object = "stimulus") {
  if (object == "stimulus") {
    iccs <- dim_icc(x, 
                    model = "2A", 
                    type = "agreement", 
                    unit = "average",
                    object = stimulus_id, 
                    rater = administration_id,
                    score = {{column}}, 
                    bootstrap = 0)
  } else {
    iccs <- dim_icc(x, 
                    model = "2A", 
                    type = "agreement", 
                    unit = "average",
                    object = administration_id, 
                    rater = stimulus_id,
                    score = {{column}}, 
                    bootstrap = 0)
  }
  
  return(iccs$Inter_ICC)
}

iccs <- d_summary |>
  group_by(dataset_name) |> 
  nest() |>
  mutate(icc_acc = unlist(map(data, get_icc))) |>
  select(-data) |>
  unnest(cols = c())

knitr::kable(iccs, digits = 2)
| dataset_name            | icc_acc |
|:------------------------|--------:|
| adams_marchman_2018     |    0.91 |
| attword_processed       |     NaN |
| byers-heinlein_2017     |    0.85 |
| fmw_2013                |    0.88 |
| frank_tablet_2016       |    0.86 |
| garrison_bergelson_2020 |    0.12 |
| hurtado_2008            |    0.78 |
| mahr_coartic            |    0.85 |
| perry_cowpig            |    0.63 |
| pomper_saffran_2016     |    0.88 |
| pomper_salientme        |    0.70 |
| potter_canine           |    0.71 |
| potter_remix            |    0.41 |
| reflook_socword         |     NaN |
| reflook_v4              |     NaN |
| ronfard_2021            |    0.65 |
| swingley_aslin_2002     |    0.65 |
| weisleder_stl           |    0.78 |
| xsectional_2007         |    0.39 |

Let’s look at one dataset. Here are the stimulus and administration ICCs for Swingley & Aslin (2002).

sa <- d_summary |> 
  filter(dataset_name == "swingley_aslin_2002")

get_icc(sa, object = "stimulus")
## [1] 0.6534529
get_icc(sa, object = "administration")
## [1] 0
# ggplot(sa, aes(x = administration_id, y = accuracy, col = factor(stimulus_id))) +
  # geom_jitter(alpha = .2, width = .5) + 
  # geom_smooth(method = "lm", formula = y ~ 1, se = FALSE)

I don’t understand the zero.
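One way to sanity-check the zero (a sketch, assuming lme4 is loaded): decompose the variance directly and see whether the administration component really is ~0.

library(lme4)
vc_mod <- lmer(accuracy ~ 1 + (1 | administration_id) + (1 | stimulus_id),
               data = sa)
VarCorr(vc_mod)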

Now try to do this programmatically across all datasets.

icc_sim <- function (zoners_included, exclude_less_than, object) 
{
  df <- d_summary |>
    filter(prop_data > exclude_less_than)
  
  # drop zoners
  if (zoners_included == FALSE) { 
    df <- filter(df, accuracy > 0, accuracy < 1) 
  }
  
  # compute ICCs
  df |> 
    group_by(dataset_name) |> 
    nest() |>
    mutate(icc = unlist(map(data, ~get_icc(., "accuracy", object)))) |>
    select(-data) |>
    unnest(cols = c()) 
}

excl_params <- expand_grid(zoners_included = c(FALSE, TRUE),
                           exclude_less_than = seq(.1, .9, .1), 
                           object = c("stimulus", "administration")) |>
  mutate(icc = pmap(list(zoners_included, exclude_less_than, object), icc_sim)) |>
  unnest(col = icc)
ggplot(excl_params,
       aes(x = exclude_less_than, y = icc, col = zoners_included)) + 
  geom_jitter(width = .01, alpha = .5) + 
  geom_smooth(method = "lm") + 
  facet_wrap(~object)

Looks to me like excluding zoners isn’t a clear win (and it’s an outright loss for stimulus ICC). Further, excluding based on the amount of data doesn’t seem to gain us reliability.

I find this surprising and want to double check from other perspectives.

Window size

These simulations use ICCs as a way to understand how we summarize accuracy data. In particular, we’re going to look at how ICCs change as a function of window size.

icc_window_sim <- function (t_start = 0, t_end = 4000, object) 
{
  df <- d_trial |>
    filter(t_norm > t_start, t_norm < t_end) |>
    group_by(dataset_name, dataset_id, administration_id, stimulus_id) |>
    summarise(accuracy = mean(correct[t_norm > 0], na.rm=TRUE),
              prop_data = mean(!is.na(correct[t_norm > 0])))
  
  # compute ICCs
  df |> 
    group_by(dataset_name) |> 
    nest() |>
    mutate(icc = unlist(map(data, ~get_icc(., "accuracy", object)))) |>
    select(-data) |>
    unnest(cols = c()) 
}

window_params <- expand_grid(t_start = seq(0,1750,250),
                             t_end = seq(2000,4000,250),
                             object = c("stimulus", "administration")) |>
  mutate(icc = pmap(list(t_start, t_end, object), icc_window_sim)) |>
  unnest(col = icc)
ggplot(window_params, aes(x = t_start, y = icc, col = as.factor(t_end))) + 
  geom_jitter() + 
  facet_wrap(~object) + 
  geom_smooth(aes(group = as.factor(t_end)), se = FALSE)

Looks like for both stimulus and administration you get consistent but modest gains from taking the longest window. BUT for stimuli, the early part of the trial adds reliability (probably because of bias due to stimulus-level preferences?). In contrast, for administrations, the early part of the trial is less informative. Starting at 500 ms seems like a pretty good compromise.

Summarising and modeling the data

Let’s move forward with ALL THE DATA (tm). The window will be something like 500–4000 ms based on the above analysis (which is at best incomplete and at worst…).

df <- d_trial |>
    filter(t_norm > 500, t_norm < 4000) |>
    group_by(dataset_name, dataset_id, administration_id, 
             age, stimulus_id, english_stimulus_label) |>
    summarise(accuracy = mean(correct[t_norm > 0], na.rm=TRUE),
              prop_data = mean(!is.na(correct[t_norm > 0]))) 
df <- df[complete.cases(df),]

Plot all trials by all participants.

ggplot(df, aes(x = age/12, y = accuracy)) + 
  geom_point(alpha = .01) + 
  geom_smooth() + 
  geom_hline(lty = 2, yintercept = .5)

Try breaking this down by word.

hf_words <- df |>
  group_by(english_stimulus_label) |> 
  count() |>
  filter(n > 500)

ggplot(filter(df, english_stimulus_label %in% hf_words$english_stimulus_label, 
              age <= 36),
       aes(x = age/12, y = accuracy, col = english_stimulus_label)) + 
  geom_point(alpha = .05) + 
  geom_smooth(se=FALSE) + 
  geom_hline(lty = 2, yintercept = .5)

We see lots of wiggliness here. Let’s try to use a model to smooth this out.

Trying to get in some random effect structure without blowing everything up.

df$age_scaled <- scale(df$age)
mod <- lmer(accuracy ~ age_scaled + (1 | administration_id) + 
       (age_scaled | english_stimulus_label) + 
       (1 | dataset_name), 
     data = df)

summary(mod)
## Linear mixed model fit by REML ['lmerMod']
## Formula: accuracy ~ age_scaled + (1 | administration_id) + (age_scaled |  
##     english_stimulus_label) + (1 | dataset_name)
##    Data: df
## 
## REML criterion at convergence: 8021.4
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.0078 -0.6181  0.1363  0.7362  2.3438 
## 
## Random effects:
##  Groups                 Name        Variance  Std.Dev. Corr 
##  administration_id      (Intercept) 0.0052859 0.07270       
##  english_stimulus_label (Intercept) 0.0027322 0.05227       
##                         age_scaled  0.0002781 0.01668  -0.19
##  dataset_name           (Intercept) 0.0021459 0.04632       
##  Residual                           0.0771708 0.27780       
## Number of obs: 24234, groups:  
## administration_id, 1908; english_stimulus_label, 151; dataset_name, 19
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept) 0.625429   0.012956   48.27
## age_scaled  0.072917   0.005183   14.07
## 
## Correlation of Fixed Effects:
##            (Intr)
## age_scaled -0.107

Let’s extract and plot the random effects.

ranefs <- ranef(mod)$english_stimulus_label |>
  tibble() |>
  mutate(english_stimulus_label = rownames(ranef(mod)$english_stimulus_label)) |>
  mutate(english_stimulus_label = fct_reorder(english_stimulus_label, 
                                              `(Intercept)`)) |>
  pivot_longer(cols = -english_stimulus_label, 
               names_to = "coefficient", 
               values_to = "value")
  

ggplot(filter(ranefs, english_stimulus_label %in% hf_words$english_stimulus_label),
       aes(x = english_stimulus_label, y = value)) +
  facet_wrap(~coefficient) + 
  geom_point() + 
  coord_flip()

One question we might want to ask is: what’s the best measure of word difficulty? I’m not sure! I would have guessed that baby and car would be easy, but not necessarily frog or cookie! I also wouldn’t have said that apple would be hard.

If we do the predictions, though, we do see there are some very big study effects for the reflook studies, which have many kids and much older kids. This is tough variance to deal with.
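(To eyeball those study effects directly, a quick sketch pulling out the dataset-level intercepts:)

ranef(mod)$dataset_name |>
  tibble::rownames_to_column("dataset_name") |>
  dplyr::arrange(`(Intercept)`)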

newdata <- expand_grid(age_scaled = seq(min(df$age_scaled), max(df$age_scaled), .1), 
                       english_stimulus_label = hf_words$english_stimulus_label)
newdata$pred <- predict(mod, 
                        newdata = newdata, 
                        re.form = ~  (age_scaled | english_stimulus_label), 
                        type = "response")
newdata$age <- (newdata$age_scaled * sd(df$age)) + mean(df$age)

ggplot(filter(df, english_stimulus_label %in% hf_words$english_stimulus_label),
       aes(x = age/12, y = accuracy, col = english_stimulus_label)) + 
  geom_point(alpha = .1) + 
  geom_line(data = newdata, 
            aes(x = age/12, y = pred, col = english_stimulus_label)) + 
  ggrepel::geom_text_repel(data = filter(newdata, 
                                         age == max(age)), 
                           aes(label = english_stimulus_label, 
                               y = pred)) + 
  geom_hline(lty = 2, yintercept = .5) + 
  xlab("Age (years)") + 
  xlim(1, 6)

More generally, we can ask: what are our big goals for this analysis?

  1. Look at the “average” word recognition trajectory, independent of items, and the functional form of the developmental change.
  2. Predict item-wise variation, e.g. in slope or intercept.

Let’s try to sketch each of these.

Goal 1: Average recognition trajectory

Let’s start by noting that there appears to be some functional form to these curves. Maybe capturing this form will help us summarize. Descriptively, let’s take a look at it word by word.

Exponentials

These asymptotic forms might be captured well by an exponential (following Kail, 1990):

\[ y \sim \alpha + \beta e^{-\gamma x} \]

where \(x\) is age.
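(Note that the lmer version below fixes \(\gamma = 1\) on the scaled age and fits \(\alpha\) and \(\beta\) linearly. Fitting all three parameters would need a nonlinear fit; a sketch with nls, ignoring the random-effect structure, with guessed starting values:)

exp_nls <- nls(accuracy ~ alpha + beta * exp(-gamma * as.numeric(age_scaled)),
               data = df,
               start = list(alpha = .7, beta = -.1, gamma = 1))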

filter(df, english_stimulus_label %in% hf_words$english_stimulus_label) |>
ggplot(aes(x = age/12, y = accuracy)) + 
  geom_point(alpha = .1) + 
  geom_smooth() + 
  geom_smooth(col = "red", method = "lm", formula = y ~ I((exp(1)**(-x)))) + 
  geom_hline(lty = 2, yintercept = .5) + 
  facet_wrap(~english_stimulus_label)

# df$age_scaled <- scale(df$age)
exp_mod <- lmer(accuracy ~ I((exp(1)**(-age_scaled))) + (1 | administration_id) + 
       (I((exp(1)**(-age_scaled))) | english_stimulus_label) + 
       (1 | dataset_name), 
     data = df)

anova(mod,exp_mod)
## Data: df
## Models:
## mod: accuracy ~ age_scaled + (1 | administration_id) + (age_scaled | english_stimulus_label) + (1 | dataset_name)
## exp_mod: accuracy ~ I((exp(1)^(-age_scaled))) + (1 | administration_id) + (I((exp(1)^(-age_scaled))) | english_stimulus_label) + (1 | dataset_name)
##         npar    AIC    BIC  logLik deviance Chisq Df Pr(>Chisq)
## mod        8 8021.8 8086.6 -4002.9   8005.8                    
## exp_mod    8 8024.5 8089.3 -4004.3   8008.5     0  0

So the exponential model doesn’t fit better, at least by AIC.

We can still visualize the fits. But it’s clear that, e.g., “car” is simply increasing much more steeply than the others.

newdata$pred <- predict(exp_mod, 
                        newdata = newdata, 
                        re.form = ~ (I((exp(1)**(-age_scaled))) | english_stimulus_label), 
                        type = "response")

ggplot(filter(df, english_stimulus_label %in% hf_words$english_stimulus_label),
       aes(x = age/12, y = accuracy, col = english_stimulus_label)) + 
  geom_point(alpha = .1) + 
  geom_line(data = newdata, 
            aes(x = age/12, y = pred, col = english_stimulus_label)) + 
  ggrepel::geom_text_repel(data = filter(newdata, 
                                         age == max(age)), 
                           aes(label = english_stimulus_label, 
                               y = pred)) + 
  geom_hline(lty = 2, yintercept = .5) + 
  xlab("Age (years)") + 
  xlim(1, 6)

Logits

For one more try at finding a functional form, let’s use a half-logit. That would be:

\[ y \sim .5 + .5\frac{1}{1+e^{\alpha + \beta x}} \]

# df$age_scaled <- scale(df$age)
hl_mod <- lmer(accuracy ~ I(.5 + .5 * (1 / (1 + exp(1)^(age_scaled)))) + (1 | administration_id) + 
       (I(.5 + .5 * (1 / (1 + exp(1)^(age_scaled))))| english_stimulus_label) + 
       (1 | dataset_name), 
     data = df)

summary(hl_mod)
## Linear mixed model fit by REML ['lmerMod']
## Formula: accuracy ~ I(0.5 + 0.5 * (1/(1 + exp(1)^(age_scaled)))) + (1 |  
##     administration_id) + (I(0.5 + 0.5 * (1/(1 + exp(1)^(age_scaled)))) |  
##     english_stimulus_label) + (1 | dataset_name)
##    Data: df
## 
## REML criterion at convergence: 7981.5
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.0776 -0.6184  0.1387  0.7338  2.3626 
## 
## Random effects:
##  Groups                 Name                                         Variance
##  administration_id      (Intercept)                                  0.005218
##  english_stimulus_label (Intercept)                                  0.022429
##                         I(0.5 + 0.5 * (1/(1 + exp(1)^(age_scaled)))) 0.038737
##  dataset_name           (Intercept)                                  0.002209
##  Residual                                                            0.077067
##  Std.Dev. Corr 
##  0.07223       
##  0.14976       
##  0.19682  -0.94
##  0.04700       
##  0.27761       
## Number of obs: 24234, groups:  
## administration_id, 1908; english_stimulus_label, 151; dataset_name, 19
## 
## Fixed effects:
##                                              Estimate Std. Error t value
## (Intercept)                                   1.26132    0.04425   28.50
## I(0.5 + 0.5 * (1/(1 + exp(1)^(age_scaled)))) -0.84056    0.05762  -14.59
## 
## Correlation of Fixed Effects:
##             (Intr)
## I(0.5+0*(+e -0.956
anova(mod,hl_mod)
## Data: df
## Models:
## mod: accuracy ~ age_scaled + (1 | administration_id) + (age_scaled | english_stimulus_label) + (1 | dataset_name)
## hl_mod: accuracy ~ I(0.5 + 0.5 * (1/(1 + exp(1)^(age_scaled)))) + (1 | administration_id) + (I(0.5 + 0.5 * (1/(1 + exp(1)^(age_scaled)))) | english_stimulus_label) + (1 | dataset_name)
##        npar    AIC    BIC  logLik deviance  Chisq Df Pr(>Chisq)
## mod       8 8021.8 8086.6 -4002.9   8005.8                     
## hl_mod    8 7986.8 8051.5 -3985.4   7970.8 35.065  0

Somewhat surprisingly, the half-logit approach works and fits better by quite a bit!

newdata$pred <- predict(hl_mod, 
                        newdata = newdata, 
                        re.form = ~ (I(.5 + .5 * (1 / (1 + exp(1)^(age_scaled)))) | english_stimulus_label), 
                        type = "response")

ggplot(filter(df, english_stimulus_label %in% hf_words$english_stimulus_label),
       aes(x = age/12, y = accuracy, col = english_stimulus_label)) + 
  geom_point(alpha = .1) + 
  geom_line(data = newdata, 
            aes(x = age/12, y = pred, col = english_stimulus_label)) + 
  ggrepel::geom_text_repel(data = filter(newdata, 
                                         age == max(age)), 
                           aes(label = english_stimulus_label, 
                               y = pred)) + 
  geom_hline(lty = 2, yintercept = .5) + 
  scale_color_discrete(guide = "none") + 
  xlab("Age (years)") + 
  xlim(1, 6)

Ok, so this is kind of entertaining, but it’s not working perfectly: the curves are all being shifted by an intercept, rather than by the parameters inside the logit. Really we want to push the intercept inside the half-logit so that each word’s slope gets moved around. Not sure how to do that right now.
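One direction (just a sketch, not run here): fit the nonlinear form directly so that \(\alpha\) and \(\beta\) sit inside the logit. nls can do this for the fixed effects only; lme4::nlmer could in principle add per-word random effects, but it requires gradients.

half_logit <- function(age_scaled, alpha, beta) {
  .5 + .5 / (1 + exp(-(alpha + beta * age_scaled)))
}
hl_nls <- nls(accuracy ~ half_logit(as.numeric(age_scaled), alpha, beta),
              data = df,
              start = list(alpha = 0, beta = 1))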

Onward.

Average developmental trajectory

newdata$pred <- predict(hl_mod, 
                        newdata = newdata, 
                        re.form = NA, 
                        type = "response")

ggplot(filter(df, english_stimulus_label %in% hf_words$english_stimulus_label),
       aes(x = age/12, y = accuracy, col = english_stimulus_label)) + 
  geom_point(alpha = .1) + 
  geom_line(data = newdata, 
            aes(x = age/12, y = pred), col = "black", size = 1) + 
  ggrepel::geom_text_repel(data = filter(newdata, 
                                         age == max(age)), 
                           aes(label = english_stimulus_label, 
                               y = pred)) + 
  geom_hline(lty = 2, yintercept = .5) + 
  scale_color_discrete(guide = "none") + 
  xlab("Age (years)") + 
  xlim(1, 6)

Doesn’t look half bad.

Goal 2: Compare to Wordbank AoAs

Load AoAs - this is from aoa_prediction.R, which I assume I wrote at some point but can’t remember.

items <- wordbankr::get_item_data(language = "English (American)") 

ws_data <- wordbankr::get_instrument_data(language = "English (American)", 
                                          form = "WS", 
                                          administrations = TRUE, 
                                          items = items$item_id[items$form == "WS"]) %>%
  right_join(items)
wg_data <- wordbankr::get_instrument_data(language = "English (American)", 
                                          form = "WG", 
                                          administrations = TRUE, 
                                          items = items$item_id[items$form == "WG"]) %>%
  right_join(items)

wordbank_data <- bind_rows(ws_data, wg_data) %>%
  mutate(produces = value == "produces", 
         form = "both", 
         num_item_id = definition, # stupid stuff to make fit_aoa work on joint data
         item_id = definition) %>%
  filter(type == "word")

aoas <- wordbankr::fit_aoa(wordbank_data, 
                           measure = "produces", 
                           method = "glmrob",
                           age_min = 8, 
                           age_max = 36)
saveRDS(aoas, here("explorations","data","aoas.rds"))

Load these from cache since it’s time-consuming.

aoas <- readRDS(here("explorations","data","aoas.rds"))

Now let’s look at the relationship between AoAs and accuracies.

Naive AoA-accuracy

What’s the right metric to compare? Here are some ideas:

  1. average accuracy (confounded with age and study)
  2. accuracy random intercepts
  3. predicted point at which accuracy is greater than XYZ%

Let’s try each. First let’s do average accuracy.

mdf <- df |>
  group_by(english_stimulus_label) |>
  summarise(avg_accuracy = mean(accuracy),
            n_trials = n()) |>
  inner_join(aoas |>
              ungroup() |>
              select(definition, aoa) |>
              rename(english_stimulus_label = definition))

ggplot(mdf, aes(x = aoa, y = avg_accuracy)) + 
  geom_point(alpha = .5) + 
  geom_smooth(method = "lm") + 
  ggpmisc::stat_correlation() + 
  ggrepel::geom_label_repel(aes(label = english_stimulus_label))

There’s a correlation, but it goes in the WRONG direction! That’s probably because studies for older kids have harder words.
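(A quick check on that story, as a sketch: if the confound is real, each word’s raw accuracy should track the mean age of the kids it was tested on.)

age_by_word <- df |>
  group_by(english_stimulus_label) |>
  summarise(mean_age = mean(age),
            avg_accuracy = mean(accuracy))
cor(age_by_word$mean_age, age_by_word$avg_accuracy)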

Let’s try methods 2 and 3.

hl_labels <- tibble(ranef(hl_mod)$english_stimulus_label) |>
  rename(hl_intercept = `(Intercept)`,
         hl_exponent = `I(0.5 + 0.5 * (1/(1 + exp(1)^(age_scaled))))`)
hl_labels$english_stimulus_label <- rownames(ranef(hl_mod)$english_stimulus_label) 

mdf <- left_join(mdf, hl_labels)

Extract “AoAs” from predicted curves.

mod_aoas <- expand_grid(age_scaled = seq(min(df$age_scaled), max(df$age_scaled), .01), 
                       english_stimulus_label = mdf$english_stimulus_label)
mod_aoas$pred <- predict(hl_mod, 
                        newdata = mod_aoas, 
                        re.form = ~(I(.5 + .5 * (1 / (1 + exp(1)^(age_scaled)))) | english_stimulus_label), 
                        type = "response")
mod_aoas$age <- (mod_aoas$age_scaled * sd(df$age)) + mean(df$age)

mdf <- left_join(mdf, 
                 mod_aoas |>
                   group_by(english_stimulus_label) |>
                   summarise(aoa_60 = age[pred > .60][1],
                             aoa_65 = age[pred > .65][1],
                             aoa_70 = age[pred > .70][1],
                             aoa_75 = age[pred > .75][1]))

Now plot all of these.

mdf_long <- mdf |> 
  pivot_longer(cols = -c("english_stimulus_label", "aoa", "n_trials"), 
               names_to = "predictor", 
               values_to = "value")
ggplot(mdf_long, 
       aes(x = aoa, y = value)) + 
  geom_point(alpha = .5) + 
  geom_smooth(method = "lm") + 
  ggpmisc::stat_correlation() + 
  ggrepel::geom_label_repel(aes(label = english_stimulus_label)) + 
  facet_wrap(~predictor, scales="free_y")

Ugh. Surprising that these are not more related. Let’s zoom in on the one that seems to have ANY downward signal.

ggplot(filter(mdf_long, predictor == "hl_exponent"),
       aes(x = aoa, y = value)) + 
  geom_point(alpha = .5, aes(size = n_trials)) + 
  geom_smooth(method = "lm") + 
  ggpmisc::stat_correlation() + 
  ggrepel::geom_label_repel(aes(label = english_stimulus_label)) 

We see that we have very little data for most of the words here, so there is a huge missing-data issue.

Let’s zoom in on the words we have the most data on. Surprisingly, this shows that both the 70% point of the predicted accuracy curves and the HL exponent do correlate negatively with AoA (the expected direction).

ggplot(filter(mdf_long, predictor %in% c("aoa_70", "hl_exponent"),
              n_trials > 100),
       aes(x = aoa, y = value)) + 
  geom_point(alpha = .5, aes(size = n_trials)) + 
  geom_smooth(method = "lm") + 
  ggpmisc::stat_correlation() + 
  ggrepel::geom_label_repel(aes(label = english_stimulus_label)) + 
  facet_wrap(~predictor, scales = "free_y")

Tentative conclusion is that we have a problem, which is that lots of harder words are shown to small samples of older children (who are better at recognizing them). This means the random effects are not very useful here.

But I’d still like to zoom in more on why the estimates are so low here for book, shoe, juice.