This document shows an analysis of MB1 pilot data for purposes of procedural and analytic decision-making. It includes analysis of demographics and looking times, as well as condition differences. Condition differences (i.e., analyses of IDS preference) are included for purposes of analysis planning and should not be used for modification of procedure.

Please forgive errors, omissions, lack of clearly labeled axes, etc. This document was produced quickly to provide an initial guide for decision-making.

There are many decision-points that need to be discussed throughout. Among them:

options(dplyr.width = Inf)
knitr::opts_chunk$set(message = FALSE, warning = FALSE, cache=FALSE)

library(tidyverse)
library(eyetrackingR)
library(stringr)
library(lubridate)
library(bit64) # necessary because of times from SMI > max integer size
library(langcog)
library(knitr)

source("et_helper.R")

theme_set(theme_bw())

1 Load data

Loading process to standardize pilot data from each lab. Current desideratum is a long-form data sheet with columns:

In general, we probably want to have three different spreadsheets:

Then these can be merged appropriately.

Questions:

1.1 Frank data

Read in eye-tracker pilot data.

raw_data_path <- "pilot/frank/"
info_path <- "info/"
processed_data_path <- "processed_data/frank/"

all_data <- dir(raw_data_path, pattern="*.txt") %>%
  paste0(raw_data_path, .) %>%
  map_df(get_smi_header) %>% 
  split(.$file_name) %>%
  map_df(read_smi_idf) %>%
  split(.$file_name) %>%
  map_df(preprocess_data) 

Now extract trial numbers and match with stimuli.

Some items to be fixed:

  • Currently we’re not actually matching stimuli to trials (this will require parsing eye-tracker XML output)
  • We’re missing trial ADS-7; that’s a (major) oversight from the experimental setup.
  • Right now I clip trial < 5, which trims out the two training trials.
frank_data <- all_data %>% 
  group_by(file_name, trial, stimulus) %>%
  summarise(looking_time = max(t_stim)) %>%
  mutate(trial_cat = ifelse(str_detect(stimulus, ".jpg"), "speech","other")) %>%
  filter(trial_cat == "speech") %>%
  group_by(file_name) %>%
  filter(trial > 5) %>%
  mutate(trial_num = 1:n(), 
         subid = str_replace(str_replace(file_name,raw_data_path,""),
                           ".txt","")) 

Now merge in demographic information.

info <- read_csv("info/frank_demo.csv")

frank_data <- info %>% 
  select(subid, age, order) %>%
  left_join(frank_data)

Now merge in orders.

orders <- read_csv("info/orders.csv") %>%
  gather(marker, stimulus, 2:19) %>%
  rename(order = Order) %>%
  filter(!str_detect(stimulus, "Train")) %>% 
  group_by(order) %>%
  mutate(trial_num = 1:n()) %>%
  separate(stimulus, into = c("trial_type", "stim_num"), sep = -2) %>%
  select(-marker, -stim_num)

frank_data <- left_join(frank_data, orders) %>%
  mutate(trial_num = ceiling(trial_num  / 2)) %>%
  mutate(age_days = as.numeric(age), 
         lab = "stanford", 
         method = "eye-tracking") %>%
  select(lab, method, subid, age_days, trial_type, trial_num, looking_time)

1.2 Floccia data

floccia_data <- read_csv("pilot/floccia/pilot data.csv") %>%
  rename(age_days = age, 
         looking_time = LT) %>%
  mutate(subid = as.character(id), 
         method = "HPP",
         stimulus = str_replace(str_replace(stimulus, ".wav", ""), 
                                "Manybabies\\\\", "")) %>%
  separate(stimulus, into = c("trial_type", "stim_num"), sep = "-") %>%
  mutate(trial_num = ceiling(trial/2)) %>%
  select(lab, method, subid, age_days, trial_type, trial_num, looking_time)

1.3 Hamlin data

hamlin_path <- "pilot/hamlin/"
hamlin_data <- dir(hamlin_path, pattern="*.csv") %>%
  paste0(hamlin_path, .) %>%
  map_df(function(x) {read_csv(x) %>% mutate(order = x)}) %>%
  mutate(order = as.numeric(str_replace(str_replace(order, ".csv",""),
                                        "pilot/hamlin/order",""))) %>%
  gather(trial, looking_time, 
         starts_with("Train"), starts_with("IDS"), starts_with("ADS")) %>%
  separate(trial, into = c("trial_type","trial_num"), sep = -2) %>%
  mutate(lab = "ubc",
         method = "single-screen",
         trial_num = as.numeric(trial_num), 
         age_days = str_split(age, ";") %>% 
           map_dbl(function(x) as.numeric(x[1]) * 30.3 + as.numeric(x[2]))) %>%
  rename(subid = subnum) %>%
  select(lab, method, subid, age_days, trial_type, trial_num, looking_time)

1.4 Merge all data

This is what the eventual data frame looks like:

d <- bind_rows(floccia_data, hamlin_data, frank_data)
kable(head(d))
lab method subid age_days trial_type trial_num looking_time
plymouth HPP 1 331 IDS 1 20.194
plymouth HPP 1 331 ADS 1 20.199
plymouth HPP 1 331 IDS 2 2.715
plymouth HPP 1 331 ADS 2 16.488
plymouth HPP 1 331 ADS 3 18.237
plymouth HPP 1 331 IDS 3 13.777

1.5 Descriptives

d %>%
  group_by(lab, age_days, subid) %>%
  distinct %>%
  group_by(lab) %>%
  summarise(n = n(), 
            age_months = mean(age_days)/30.3) %>%
  kable(digits = 1)
lab n age_months
plymouth 9 10.0
stanford 14 13.9
ubc 9 5.2

2 Plots

2.1 Demographics

What’s our participant distribution?

subs <- d %>%
  group_by(lab, subid, age_days) %>%
  distinct

qplot(age_days, fill = lab, data=subs)

2.2 Looking time dynamics

First, the overall distribution of looking times.

qplot(looking_time, fill = lab, facets = ~ lab, binwidth = 2, data = d)

Stanford has a large number of 2s looking times because that’s the lookaway from the tracker. So when a child isn’t looking at all, they get a 2s. How should we deal with this?

Next, are children making it through the experiment? Looks like essentially everyone does.

final_trial <- d %>%
  group_by(lab, subid) %>%
  summarize(max_trial = max(trial_num[looking_time > 2]))
                            
qplot(max_trial, fill = lab, data = final_trial)

Now, histogram of looking time by trial number. Looks like looking times are staying pretty long.

ggplot(d, aes(x = looking_time, fill = lab)) + 
  geom_histogram(binwidth = 2) + 
  facet_wrap(~trial_num)

We can look at this by age, too.

qplot(age_days, looking_time, col = lab, facets = ~ trial_num, data = d) + 
  geom_smooth(aes(group = 1), method = "lm", col = "black")

Plot means.

ms <- d %>%
  group_by(lab, trial_num) %>%
  multi_boot_standard(col = "looking_time", na.rm=TRUE)

ggplot(ms, aes(x = trial_num, y = mean, col = lab)) + 
  geom_line() + 
  geom_linerange(aes(ymin = ci_lower, ymax = ci_upper), 
                 position = position_dodge(width = .1)) 

2.3 Condition differences

2.3.1 Between-subjects analysis

Doesn’t compute difference scores - treats IDS and ADS observations as independent. These analyses should likely be lower-powered. (Interested to hear people’s thoughts as to whether they should be included).

ms <- d %>%
  filter(trial_type != "Train", 
         looking_time != 0, !is.na(looking_time)) %>%  
  group_by(trial_num, trial_type) %>%
  multi_boot_standard(col = "looking_time", na.rm=TRUE)

ggplot(ms, aes(x = trial_num, y = mean, col = trial_type)) + 
  geom_line() + 
  geom_linerange(aes(ymin = ci_lower, ymax = ci_upper), 
                 position = position_dodge(width= .1)) +
  ylab("Looking time (s)")

Split by lab (which is really age now).

ms <- d %>%
  filter(trial_type != "Train", 
         looking_time != 0, !is.na(looking_time)) %>%  
  group_by(lab, trial_num, trial_type) %>%
  multi_boot_standard(col = "looking_time", na.rm=TRUE)

ggplot(ms, aes(x = trial_num, y = mean, col = trial_type)) + 
  geom_smooth(se = FALSE, span = 2) + 
  facet_wrap(~lab) + 
  geom_linerange(aes(ymin = ci_lower, ymax = ci_upper), 
                 position = position_dodge(width= .1))+ 
  ylab("Looking time (s)")

Take a look at this in log space as well, following Csibra et al. (2015), “Statistical Treatment of Looking-Time Data.” Doesn’t change much, but is likely better.

ms <- d %>%
  filter(trial_type != "Train", 
         looking_time != 0, !is.na(looking_time)) %>%
  group_by(lab, trial_num, trial_type) %>%
  mutate(log10_looking_time = log10(looking_time)) %>%
  multi_boot_standard(col = "log10_looking_time", na.rm=TRUE)

ggplot(ms, aes(x = trial_num, y = mean, col = trial_type)) + 
  geom_smooth(se = FALSE, span = 2) + 
  facet_wrap(~lab) + 
  geom_linerange(aes(ymin = ci_lower, ymax = ci_upper), 
                 position = position_dodge(width= .1)) +
  ylab("Log looking time (s)")

2.3.2 Within-subjects

Now do (perhaps) the more appropriate analysis: For each pair of trials, subtract to get the difference score. Again following Csibra, we do a difference of logs.

diffs <- d %>%
  filter(trial_type != "Train", 
         looking_time != 0, !is.na(looking_time)) %>%
  group_by(lab, subid, age_days, trial_num) %>%
  filter(n() == 2) %>% # take only pairs that are complete
  summarise(idspref = log10(looking_time[trial_type=="IDS"]) - 
              log10(looking_time[trial_type=="ADS"]))

What’s the distributional form of these data?

qplot(idspref, data = diffs)

How do they change with trials?

ms_diff <- diffs %>%
  group_by(lab, trial_num) %>%
  multi_boot_standard(col = "idspref", na.rm=TRUE)

ggplot(ms_diff, aes(x = trial_num, y = mean)) +
         geom_smooth(se = FALSE, span = 2) + 
  facet_wrap(~lab) + 
  geom_pointrange(aes(ymin = ci_lower, ymax = ci_upper), 
                 position = position_dodge(width= .1)) +
  ylab("IDS preference (log10 s)") + 
  geom_hline(yintercept = 0, lty = 2)

Or with age?

qplot(age_days, idspref, col = lab, group = 1, data = diffs) + 
  geom_smooth(method = "lm") + 
  geom_hline(yintercept = 0, lty = 2) + 
  ylab("IDS preference (s)") 

By age and by trial.

qplot(age_days, idspref, col = lab, group = 1, data = diffs) + 
  geom_smooth(method = "lm") + 
  facet_wrap(~trial_num) + 
  geom_hline(yintercept = 0, lty = 2) + 
  ylab("IDS preference (s)") 

3 Conclusions

Practical recommendations:

Conclusions: It looks like we’re seeing some IDS preference for each group, albeit at a different part of the experiment for each age/lab combo.