This markdown documents a new way of thinking about modeling variation in LWL (looking-while-listening) data. The idea is to find a good summary of each child's performance.
The trouble is, what's the right summary statistic? It might be that there isn't just one! But let's assume there is one and we just need to compute it right.
So we’ll start by trying to figure out what the best measure is. We’re going to make decisions to maximize within-experiment reliability via ICCs.
We’re focused on familiar words here.
# pull all the relevant tables from the database
all_aoi_timepoints <- get_aoi_timepoints()
all_stimuli <- get_stimuli()
all_administrations <- get_administrations()
all_subjects <- get_subjects()
all_trial_types <- get_trial_types()
all_trials <- get_trials()
# join into one tidy dataframe of AOI timepoints plus trial and stimulus metadata
aoi_data_joined <- all_aoi_timepoints |>
right_join(all_administrations) |>
right_join(all_subjects) |>
right_join(all_trials) |>
right_join(all_trial_types) |>
mutate(stimulus_id = target_id) |> # stimuli are joined by the target item
right_join(all_stimuli) |>
select(dataset_name, subject_id, administration_id, trial_id, dataset_id, stimulus_id, t_norm, age, aoi, english_stimulus_label, stimulus_novelty)
# TO DO: looks like a very small number of rows get added in the join, probably a sign of an issue/ambiguity somewhere.
save(aoi_data_joined, file = here("explorations","data","aoi_data_joined.Rds"))
load(file = here("explorations","data","aoi_data_joined.Rds"))
Take only familiar-word data. (Note: we retain both monolingual and bilingual data; only the Tseltal dataset is excluded.)
d_trial <- aoi_data_joined |>
filter(dataset_name != "casillas_tseltal_2015") |>
filter(age > 12, age <= 60,
stimulus_novelty == "familiar") |>
select(dataset_name, subject_id, administration_id, trial_id,
dataset_id, stimulus_id, t_norm, age, aoi, english_stimulus_label) |>
mutate(correct = ifelse(aoi == "target", 1,
ifelse(aoi == "distractor", 0, NA)))
Now let's think about what the basic curve looks like and how to extract various measures from it.
ggplot(d_trial, aes(x = t_norm, y = correct)) +
xlim(-2000,3500)+
geom_smooth()
Seems like we want something that captures 1) the accuracy and 2) the rise coming soon after zero (RT).
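As a placeholder for the RT piece, here's a rough sketch of one possible operationalization (hypothetical and simplified; the conventional LWL RT is computed only on trials that start on the distractor, which this ignores):
# first post-onset time point at which looking lands on the target
d_rt <- d_trial |>
  filter(t_norm >= 0) |>
  group_by(dataset_name, administration_id, trial_id) |>
  summarise(rt = t_norm[which(correct == 1)[1]])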
(Comment MZ: this summary is a little atypical; the more common convention is to compute average accuracy for each trial first, and then average across items within subjects. The approach here is meant to deal with missing data, i.e., the fact that different trials will have different amounts of non-missing looking data. Also note that we are summarizing by administrations, not subjects.)
d_summary <- d_trial |>
group_by(dataset_name, dataset_id, subject_id, administration_id, stimulus_id) |>
summarise(accuracy = mean(correct[t_norm > 0], na.rm=TRUE),
prop_data = mean(!is.na(correct[t_norm > 0])))
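For reference, here's a sketch of the more conventional trial-first summary MZ mentions (not used below): average within each trial, then across trials within an administration.
d_summary_trialwise <- d_trial |>
  group_by(dataset_name, administration_id, trial_id) |>
  summarise(trial_acc = mean(correct[t_norm > 0], na.rm = TRUE),
            .groups = "drop_last") |>
  summarise(accuracy = mean(trial_acc, na.rm = TRUE))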
ggplot(d_summary, aes(x = prop_data, y = accuracy)) +
geom_point(alpha = .05)
There's a lot of missing data and a lot of "zoners" (kids who look only at one side for an entire trial). Zoners are not just kids with missing data.
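A quick sketch to check that: among administration-stimulus combinations with plenty of data, how many are still all-or-none?
d_summary |>
  ungroup() |>
  filter(prop_data > .75) |>
  summarise(prop_zoner = mean(accuracy %in% c(0, 1)))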
ggplot(filter(d_summary, prop_data > .75),
aes(x = accuracy)) +
geom_histogram()
Should we exclude data to get a more reliable measure? There are two different exclusion decisions we could optimize: 1) whether to drop "zoners," and 2) how much missing data to tolerate within a trial.
Let's try to figure those out.
We're going to use ICCs, following McGraw & Wong (1996). It seems like we want a two-way random effects model with no interaction (both subjects and items are treated as meaningful random samples); this is type "2A." We want average agreement across units.
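For reference, the mean-squares form of this ICC (McGraw & Wong's ICC(A,k); writing from memory, so treat as a gloss) is:
\[ ICC(A,k) = \frac{MS_R - MS_E}{MS_R + \frac{MS_C - MS_E}{n}} \]
where \(MS_R\), \(MS_C\), and \(MS_E\) are the object (row), rater (column), and error mean squares, and \(n\) is the number of objects.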
One big decision is whether to look across stimulus items, rather than across kids. Across stimulus items returns much higher values. This is in part because we typically have more kids than items, and kids are sort of like “raters.”
# devtools::install_github("jmgirard/agreement")
library(agreement)
# Compute the inter-unit ICC (model 2A, average agreement), treating either
# stimuli as objects and administrations as raters, or the reverse.
get_icc <- function (x, column = "accuracy", object = "stimulus") {
if (object == "stimulus") {
iccs <- dim_icc(x,
model = "2A",
type = "agreement",
unit = "average",
object = stimulus_id,
rater = administration_id,
score = {{column}},
bootstrap = 0)
} else {
iccs <- dim_icc(x,
model = "2A",
type = "agreement",
unit = "average",
object = administration_id,
rater = stimulus_id,
score = {{column}},
bootstrap = 0)
}
return(iccs$Inter_ICC)
}
iccs <- d_summary |>
group_by(dataset_name) |>
nest() |>
mutate(icc_acc = unlist(map(data, get_icc))) |>
select(-data) |>
unnest(cols = c())
knitr::kable(iccs, digits = 2)
| dataset_name | icc_acc |
|---|---|
| adams_marchman_2018 | 0.91 |
| attword_processed | NaN |
| byers-heinlein_2017 | 0.85 |
| fmw_2013 | 0.88 |
| frank_tablet_2016 | 0.86 |
| garrison_bergelson_2020 | 0.12 |
| hurtado_2008 | 0.78 |
| mahr_coartic | 0.85 |
| perry_cowpig | 0.63 |
| pomper_saffran_2016 | 0.88 |
| pomper_salientme | 0.70 |
| potter_canine | 0.71 |
| potter_remix | 0.41 |
| reflook_socword | NaN |
| reflook_v4 | NaN |
| ronfard_2021 | 0.65 |
| swingley_aslin_2002 | 0.65 |
| weisleder_stl | 0.78 |
| xsectional_2007 | 0.39 |
Let’s look at one dataset. Here are the stimulus and administration ICCs for Swingley & Aslin (2002).
sa <- d_summary |>
filter(dataset_name == "swingley_aslin_2002")
get_icc(sa, object = "stimulus")
## [1] 0.6534529
get_icc(sa, object = "administration")
## [1] 0
# ggplot(sa, aes(x = administration_id, y = accuracy, col = factor(stimulus_id))) +
# geom_jitter(alpha = .2, width = .5) +
# geom_smooth(method = "lm", formula = y ~ 1, se = FALSE)
I don't understand the zero for administrations.
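One way to chase this down (a sketch): fit a simple crossed random-effects model and check whether the between-administration variance component is estimated at (or near) zero, which would push the administration ICC to zero.
library(lme4)
VarCorr(lmer(accuracy ~ 1 + (1 | administration_id) + (1 | stimulus_id),
             data = sa))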
Now try to do this programmatically across all datasets.
icc_sim <- function (zoners_included, exclude_less_than, object)
{
df <- d_summary |>
filter(prop_data > exclude_less_than)
# drop zoners
if (zoners_included == FALSE) {
df <- filter(df, accuracy > 0, accuracy < 1)
}
# compute ICCs
df |>
group_by(dataset_name) |>
nest() |>
mutate(icc = unlist(map(data, ~get_icc(., "accuracy", object)))) |>
select(-data) |>
unnest(cols = c())
}
excl_params <- expand_grid(zoners_included = c(FALSE, TRUE),
exclude_less_than = seq(.1, .9, .1),
object = c("stimulus", "administration")) |>
mutate(icc = pmap(list(zoners_included, exclude_less_than, object), icc_sim)) |>
unnest(cols = icc)
ggplot(excl_params,
aes(x = exclude_less_than, y = icc, col = zoners_included)) +
geom_jitter(width = .01, alpha = .5) +
geom_smooth(method = "lm") +
facet_wrap(~object)
Looks to me like excluding zoners isn't a clear win (and it's a loss for the stimulus ICC). Further, excluding trials based on the amount of data doesn't seem to gain us reliability.
I find this surprising and want to double check from other perspectives.
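One such perspective, as a sketch: split-half reliability, correlating mean accuracy on odd- vs. even-indexed trials within each administration (the row-based split is arbitrary, and there's no Spearman-Brown correction).
d_summary |>
  group_by(dataset_name, administration_id) |>
  mutate(half = ifelse(row_number() %% 2 == 0, "even", "odd")) |>
  group_by(dataset_name, administration_id, half) |>
  summarise(acc = mean(accuracy, na.rm = TRUE), .groups = "drop") |>
  pivot_wider(names_from = half, values_from = acc) |>
  group_by(dataset_name) |>
  summarise(split_half_r = cor(odd, even, use = "pairwise.complete.obs"))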
These simulations use ICCs as a way to understand how we summarize accuracy data. In particular, we’re going to look at how ICCs change as a function of window size.
icc_window_sim <- function (t_start = 0, t_end = 4000, object)
{
df <- d_trial |>
filter(t_norm > t_start, t_norm < t_end) |>
group_by(dataset_name, dataset_id, administration_id, stimulus_id) |>
summarise(accuracy = mean(correct[t_norm > 0], na.rm=TRUE),
prop_data = mean(!is.na(correct[t_norm > 0])))
# compute ICCs
df |>
group_by(dataset_name) |>
nest() |>
mutate(icc = unlist(map(data, ~get_icc(., "accuracy", object)))) |>
select(-data) |>
unnest(cols = c())
}
window_params <- expand_grid(t_start = seq(0,1750,250),
t_end = seq(2000,4000,250),
object = c("stimulus", "administration")) |>
mutate(icc = pmap(list(t_start, t_end, object), icc_window_sim)) |>
unnest(cols = icc)
ggplot(window_params, aes(x = t_start, y = icc, col = as.factor(t_end))) +
geom_jitter() +
facet_wrap(~object) +
geom_smooth(aes(group = as.factor(t_end)), se = FALSE)
Looks like for both stimuli and administrations you get consistent but modest gains by taking the longest window. BUT for stimuli, the early part of the trial adds reliability (probably because of bias due to stimulus-level preferences?). In contrast, for administrations, the early part of the trial is less informative. A window starting around 500 ms seems like a pretty good compromise.
Let's move forward with ALL THE DATA (tm). The window will be 500–4000 ms, based on the above analysis (which is at best incomplete and at worst…).
df <- d_trial |>
filter(t_norm > 500, t_norm < 4000) |>
group_by(dataset_name, dataset_id, administration_id,
age, stimulus_id, english_stimulus_label) |>
summarise(accuracy = mean(correct[t_norm > 0], na.rm=TRUE),
prop_data = mean(!is.na(correct[t_norm > 0])))
df <- df[complete.cases(df),]
Plot all trials by all participants.
ggplot(df, aes(x = age/12, y = accuracy)) +
geom_point(alpha = .01) +
geom_smooth() +
geom_hline(lty = 2, yintercept = .5)
Try breaking this down by word.
hf_words <- df |>
group_by(english_stimulus_label) |>
count() |>
filter(n > 500)
ggplot(filter(df, english_stimulus_label %in% hf_words$english_stimulus_label,
age <= 36),
aes(x = age/12, y = accuracy, col = english_stimulus_label)) +
geom_point(alpha = .05) +
geom_smooth(se=FALSE) +
geom_hline(lty = 2, yintercept = .5)
We see lots of wiggliness here. Let's try using a model to smooth this out, trying to get in some random-effect structure without blowing everything up.
df$age_scaled <- scale(df$age)
mod <- lmer(accuracy ~ age_scaled + (1 | administration_id) +
(age_scaled | english_stimulus_label) +
(1 | dataset_name),
data = df)
summary(mod)
## Linear mixed model fit by REML ['lmerMod']
## Formula: accuracy ~ age_scaled + (1 | administration_id) + (age_scaled |
## english_stimulus_label) + (1 | dataset_name)
## Data: df
##
## REML criterion at convergence: 8021.4
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.0078 -0.6181 0.1363 0.7362 2.3438
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## administration_id (Intercept) 0.0052859 0.07270
## english_stimulus_label (Intercept) 0.0027322 0.05227
## age_scaled 0.0002781 0.01668 -0.19
## dataset_name (Intercept) 0.0021459 0.04632
## Residual 0.0771708 0.27780
## Number of obs: 24234, groups:
## administration_id, 1908; english_stimulus_label, 151; dataset_name, 19
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 0.625429 0.012956 48.27
## age_scaled 0.072917 0.005183 14.07
##
## Correlation of Fixed Effects:
## (Intr)
## age_scaled -0.107
Let’s extract and plot the random effects.
ranefs <- ranef(mod)$english_stimulus_label |>
tibble() |>
mutate(english_stimulus_label = rownames(ranef(mod)$english_stimulus_label)) |>
mutate(english_stimulus_label = fct_reorder(english_stimulus_label,
`(Intercept)`)) |>
pivot_longer(cols = -english_stimulus_label,
names_to = "coefficient",
values_to = "value")
ggplot(filter(ranefs, english_stimulus_label %in% hf_words$english_stimulus_label),
aes(x = english_stimulus_label, y = value)) +
facet_wrap(~coefficient) +
geom_point() +
coord_flip()
One question we might want to ask is: what's the best measure of word difficulty? I'm not sure! I would have guessed that "baby" and "car" would be easy, but not necessarily "frog" or "cookie"! I also wouldn't have said that "apple" would be hard.
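As a quick sketch, we can rank words by their intercept random effect (higher = easier, on this measure):
ranef(mod)$english_stimulus_label |>
  tibble::rownames_to_column("english_stimulus_label") |>
  arrange(desc(`(Intercept)`)) |>
  head(10)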
If we do the predictions, though, we do see there are some very big study effects for the reflook studies, which have many kids and much older kids. This is tough variance to deal with.
newdata <- expand_grid(age_scaled = seq(min(df$age_scaled), max(df$age_scaled), .1),
english_stimulus_label = hf_words$english_stimulus_label)
newdata$pred <- predict(mod,
newdata = newdata,
re.form = ~ (age_scaled | english_stimulus_label),
type = "response")
newdata$age <- (newdata$age_scaled * sd(df$age)) + mean(df$age)
ggplot(filter(df, english_stimulus_label %in% hf_words$english_stimulus_label),
aes(x = age/12, y = accuracy, col = english_stimulus_label)) +
geom_point(alpha = .1) +
geom_line(data = newdata,
aes(x = age/12, y = pred, col = english_stimulus_label)) +
ggrepel::geom_text_repel(data = filter(newdata,
age == max(age)),
aes(label = english_stimulus_label,
y = pred)) +
geom_hline(lty = 2, yintercept = .5) +
xlab("Age (years)") +
xlim(1, 6)
More generally, we can ask: what are our big goals for this analysis?
Let's try to sketch some of them.
Let’s start by noting that there appears to be some functional form to these curves. Maybe capturing this form will help us summarize. Descriptively, let’s take a look at it word by word.
These asymptotic forms might be captured well by an exponential (following Kail 1990): \[ y \sim \alpha + \beta e^{-\gamma x} \]
where \(x\) is age.
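Note that because lm()/lmer() are linear in their parameters, the fits below effectively fix \(\gamma = 1\). A sketch of estimating \(\gamma\) freely with nls() (starting values are rough guesses; as.numeric() is needed because scale() returns a one-column matrix):
exp_nls <- nls(accuracy ~ alpha + beta * exp(-gamma * as.numeric(age_scaled)),
               data = df,
               start = list(alpha = .7, beta = -.2, gamma = 1))
summary(exp_nls)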
filter(df, english_stimulus_label %in% hf_words$english_stimulus_label) |>
ggplot(aes(x = age/12, y = accuracy)) +
geom_point(alpha = .1) +
geom_smooth() +
geom_smooth(col = "red", method = "lm", formula = y ~ I((exp(1)**(-x)))) +
geom_hline(lty = 2, yintercept = .5) +
facet_wrap(~english_stimulus_label)
# df$age_scaled <- scale(df$age)
exp_mod <- lmer(accuracy ~ I((exp(1)**(-age_scaled))) + (1 | administration_id) +
(I((exp(1)**(-age_scaled))) | english_stimulus_label) +
(1 | dataset_name),
data = df)
anova(mod,exp_mod)
## Data: df
## Models:
## mod: accuracy ~ age_scaled + (1 | administration_id) + (age_scaled | english_stimulus_label) + (1 | dataset_name)
## exp_mod: accuracy ~ I((exp(1)^(-age_scaled))) + (1 | administration_id) + (I((exp(1)^(-age_scaled))) | english_stimulus_label) + (1 | dataset_name)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## mod 8 8021.8 8086.6 -4002.9 8005.8
## exp_mod 8 8024.5 8089.3 -4004.3 8008.5 0 0
So the exponential model doesn’t fit better, at least by AIC.
We can still visualize the fits. But it's clear that, e.g., "car" is just increasing much more than the others.
newdata$pred <- predict(exp_mod,
newdata = newdata,
re.form = ~ (I((exp(1)**(-age_scaled))) | english_stimulus_label),
type = "response")
ggplot(filter(df, english_stimulus_label %in% hf_words$english_stimulus_label),
aes(x = age/12, y = accuracy, col = english_stimulus_label)) +
geom_point(alpha = .1) +
geom_line(data = newdata,
aes(x = age/12, y = pred, col = english_stimulus_label)) +
ggrepel::geom_text_repel(data = filter(newdata,
age == max(age)),
aes(label = english_stimulus_label,
y = pred)) +
geom_hline(lty = 2, yintercept = .5) +
xlab("Age (years)") +
xlim(1, 6)
For one more try at finding a functional form, let's use a half-logit. That would be:
\[ y \sim .5 + .5\frac{1}{1+e^{\alpha + \beta x}} \]
# df$age_scaled <- scale(df$age)
hl_mod <- lmer(accuracy ~ I(.5 + .5 * (1 / (1 + exp(1)^(age_scaled)))) + (1 | administration_id) +
(I(.5 + .5 * (1 / (1 + exp(1)^(age_scaled))))| english_stimulus_label) +
(1 | dataset_name),
data = df)
summary(hl_mod)
## Linear mixed model fit by REML ['lmerMod']
## Formula: accuracy ~ I(0.5 + 0.5 * (1/(1 + exp(1)^(age_scaled)))) + (1 |
## administration_id) + (I(0.5 + 0.5 * (1/(1 + exp(1)^(age_scaled)))) |
## english_stimulus_label) + (1 | dataset_name)
## Data: df
##
## REML criterion at convergence: 7981.5
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.0776 -0.6184 0.1387 0.7338 2.3626
##
## Random effects:
## Groups Name Variance
## administration_id (Intercept) 0.005218
## english_stimulus_label (Intercept) 0.022429
## I(0.5 + 0.5 * (1/(1 + exp(1)^(age_scaled)))) 0.038737
## dataset_name (Intercept) 0.002209
## Residual 0.077067
## Std.Dev. Corr
## 0.07223
## 0.14976
## 0.19682 -0.94
## 0.04700
## 0.27761
## Number of obs: 24234, groups:
## administration_id, 1908; english_stimulus_label, 151; dataset_name, 19
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 1.26132 0.04425 28.50
## I(0.5 + 0.5 * (1/(1 + exp(1)^(age_scaled)))) -0.84056 0.05762 -14.59
##
## Correlation of Fixed Effects:
## (Intr)
## I(0.5+0*(+e -0.956
anova(mod,hl_mod)
## Data: df
## Models:
## mod: accuracy ~ age_scaled + (1 | administration_id) + (age_scaled | english_stimulus_label) + (1 | dataset_name)
## hl_mod: accuracy ~ I(0.5 + 0.5 * (1/(1 + exp(1)^(age_scaled)))) + (1 | administration_id) + (I(0.5 + 0.5 * (1/(1 + exp(1)^(age_scaled)))) | english_stimulus_label) + (1 | dataset_name)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## mod 8 8021.8 8086.6 -4002.9 8005.8
## hl_mod 8 7986.8 8051.5 -3985.4 7970.8 35.065 0
Somewhat surprisingly, the half logit approach works and fits better by quite a bit!
newdata$pred <- predict(hl_mod,
newdata = newdata,
re.form = ~ (I(.5 + .5 * (1 / (1 + exp(1)^(age_scaled)))) | english_stimulus_label),
type = "response")
ggplot(filter(df, english_stimulus_label %in% hf_words$english_stimulus_label),
aes(x = age/12, y = accuracy, col = english_stimulus_label)) +
geom_point(alpha = .1) +
geom_line(data = newdata,
aes(x = age/12, y = pred, col = english_stimulus_label)) +
ggrepel::geom_text_repel(data = filter(newdata,
age == max(age)),
aes(label = english_stimulus_label,
y = pred)) +
geom_hline(lty = 2, yintercept = .5) +
scale_color_discrete(guide = "none") +
xlab("Age (years)") +
xlim(1, 6)
OK, so this is kind of entertaining, but it's not working perfectly: the curves are all being shifted by an intercept, rather than the parameters of the logit being shifted. Really we want to push the intercept inside the half-logit so that its slope gets moved around. I'm not sure how to do that with lmer right now.
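For what it's worth, here's a sketch of fitting that with the parameters inside the link using nls(). There are no random effects here, so it's only the population curve; a per-word version would need something like nlmer. Starting values are guesses.
hl_nls <- nls(accuracy ~ 0.5 + 0.5 / (1 + exp(alpha + beta * as.numeric(age_scaled))),
              data = df,
              start = list(alpha = 0, beta = -1))
summary(hl_nls)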
Onward.
newdata$pred <- predict(hl_mod,
newdata = newdata,
re.form = NA,
type = "response")
ggplot(filter(df, english_stimulus_label %in% hf_words$english_stimulus_label),
aes(x = age/12, y = accuracy, col = english_stimulus_label)) +
geom_point(alpha = .1) +
geom_line(data = newdata,
aes(x = age/12, y = pred), col = "black", size = 1) +
ggrepel::geom_text_repel(data = filter(newdata,
age == max(age)),
aes(label = english_stimulus_label,
y = pred)) +
geom_hline(lty = 2, yintercept = .5) +
scale_color_discrete(guide = "none") +
xlab("Age (years)") +
xlim(1, 6)
Doesn’t look half bad.
Load AoAs. This is from aoa_prediction.R, which I assume I wrote at some point but can't remember.
items <- wordbankr::get_item_data(language = "English (American)")
ws_data <- wordbankr::get_instrument_data(language = "English (American)",
form = "WS",
administrations = TRUE,
items = items$item_id[items$form == "WS"]) %>%
right_join(items)
wg_data <- wordbankr::get_instrument_data(language = "English (American)",
form = "WG",
administrations = TRUE,
items = items$item_id[items$form == "WG"]) %>%
right_join(items)
wordbank_data <- bind_rows(ws_data, wg_data) %>%
mutate(produces = value == "produces",
form = "both",
num_item_id = definition, # stupid stuff to make fit_aoa work on joint data
item_id = definition) %>%
filter(type == "word")
aoas <- wordbankr::fit_aoa(wordbank_data,
measure = "produces",
method = "glmrob",
age_min = 8,
age_max = 36)
saveRDS(aoas, here("explorations","data","aoas.rds"))
Load these from cache since it’s time-consuming.
aoas <- readRDS(here("explorations","data","aoas.rds"))
Now let’s look at the relationship between AoAs and accuracies.
What's the right metric to compare? Here are some ideas: 1) average accuracy for each word, 2) the word-level random effects from the half-logit model, and 3) "AoAs" extracted from the model's predicted accuracy curves.
Let's try each. First, let's do average accuracy.
mdf <- df |>
group_by(english_stimulus_label) |>
summarise(avg_accuracy = mean(accuracy),
n_trials = n()) |>
inner_join(aoas |>
ungroup() |>
select(definition, aoa) |>
rename(english_stimulus_label = definition))
ggplot(mdf, aes(x = aoa, y = avg_accuracy)) +
geom_point(alpha = .5) +
geom_smooth(method = "lm") +
ggpmisc::stat_correlation() +
ggrepel::geom_label_repel(aes(label = english_stimulus_label))
There’s a correlation, but it goes in the WRONG direction! That’s probably because studies for older kids have harder words.
Let’s try methods 2 and 3.
hl_labels <- tibble(ranef(hl_mod)$english_stimulus_label) |>
rename(hl_intercept = `(Intercept)`,
hl_exponent = `I(0.5 + 0.5 * (1/(1 + exp(1)^(age_scaled))))`)
hl_labels$english_stimulus_label <- rownames(ranef(hl_mod)$english_stimulus_label)
mdf <- left_join(mdf, hl_labels)
Extract “AoAs” from predicted curves.
mod_aoas <- expand_grid(age_scaled = seq(min(df$age_scaled), max(df$age_scaled), .01),
english_stimulus_label = mdf$english_stimulus_label)
mod_aoas$pred <- predict(hl_mod,
newdata = mod_aoas,
re.form = ~(I(.5 + .5 * (1 / (1 + exp(1)^(age_scaled)))) | english_stimulus_label),
type = "response")
mod_aoas$age <- (mod_aoas$age_scaled * sd(df$age)) + mean(df$age)
mdf <- left_join(mdf,
mod_aoas |>
group_by(english_stimulus_label) |>
summarise(aoa_60 = age[pred > .60][1],
aoa_65 = age[pred > .65][1],
aoa_70 = age[pred > .70][1],
aoa_75 = age[pred > .75][1]))
Now plot all of these.
mdf_long <- mdf |>
pivot_longer(cols = -c("english_stimulus_label", "aoa", "n_trials"),
names_to = "predictor",
values_to = "value")
ggplot(mdf_long,
aes(x = aoa, y = value)) +
geom_point(alpha = .5) +
geom_smooth(method = "lm") +
ggpmisc::stat_correlation() +
ggrepel::geom_label_repel(aes(label = english_stimulus_label)) +
facet_wrap(~predictor, scales="free_y")
Ugh. Surprising that these are not more related. Let's zoom in on the one that seems to have ANY downward signal.
ggplot(filter(mdf_long, predictor == "hl_exponent"),
aes(x = aoa, y = value)) +
geom_point(alpha = .5, aes(size = n_trials)) +
geom_smooth(method = "lm") +
ggpmisc::stat_correlation() +
ggrepel::geom_label_repel(aes(label = english_stimulus_label))
We see that we have very little data for most of the words here, so there is a huge missing data issue.
Let's zoom in on the ones we have the most data on. Surprisingly, this shows that both the 70% point of the AoA curves and the half-logit exponent do correlate negatively (the expected direction).
ggplot(filter(mdf_long, predictor %in% c("aoa_70", "hl_exponent"),
n_trials > 100),
aes(x = aoa, y = value)) +
geom_point(alpha = .5, aes(size = n_trials)) +
geom_smooth(method = "lm") +
ggpmisc::stat_correlation() +
ggrepel::geom_label_repel(aes(label = english_stimulus_label)) +
facet_wrap(~predictor, scales = "free_y")
Our tentative conclusion is that we have a problem: lots of harder words are shown to small samples of older children (who are better at recognizing them). This means the random effects are not very useful here.
But I'd still like to zoom in more on why the estimates are so low for book, shoe, and juice.
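As a first step, here's a sketch to eyeball the raw data for those words:
filter(df, english_stimulus_label %in% c("book", "shoe", "juice")) |>
  ggplot(aes(x = age/12, y = accuracy)) +
  geom_point(alpha = .1) +
  geom_smooth() +
  geom_hline(lty = 2, yintercept = .5) +
  facet_wrap(~english_stimulus_label)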