Overview

One hundred participants read naturalistic stories from the Natural Stories corpus. Each participant read one story.

We exclude, in this order:

participants who do not report English as a native language (95 participants remaining)
participants who get fewer than 80% of the words correct (63 participants remaining)
practice items (64714 words remaining)
words that were answered incorrectly or that fall within two words after a mistake (58388 words remaining)
the first word of every sentence, which has no real distractor and whose RT is measured slightly differently (55458 words remaining)
words with RTs below 100 ms or above 5000 ms (55384 words remaining); we think RTs below 100 ms are likely recording errors, or at least not real reading, and RTs above 5000 ms likely reflect distraction
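
The word-level exclusions can be sketched roughly as follows. This is a minimal illustration on toy data, not the actual pipeline: the column names (`participant`, `sentence`, `word_num`, `correct`, `rt`) and the helper `flag_near_mistake` are hypothetical, and RTs are assumed to be in milliseconds.

```r
# Toy trial-level data; every column name here is a hypothetical stand-in.
trials <- data.frame(
  participant = 1,
  sentence    = rep(1:2, each = 6),
  word_num    = rep(0:5, times = 2),  # 0 marks the first word of a sentence
  correct     = c(TRUE, TRUE, FALSE, TRUE, TRUE, TRUE,
                  TRUE, TRUE, TRUE,  TRUE, TRUE, TRUE),
  rt          = c(450, 600, 1200, 900, 700, 650,
                  500,  80,  620, 5400, 580, 610)
)

# Flag a word if it, or either of the two words before it, was a mistake.
flag_near_mistake <- function(correct) {
  wrong <- !correct
  n <- seq_along(wrong)
  wrong | c(FALSE, wrong)[n] | c(FALSE, FALSE, wrong)[n]
}

# Apply the flag within each participant-by-sentence sequence
# (rows are assumed to be in reading order).
trials$near_mistake <- ave(trials$correct, trials$participant, trials$sentence,
                           FUN = flag_near_mistake)

# Drop flagged words, sentence-initial words, and out-of-range RTs.
kept <- subset(trials, !near_mistake & word_num > 0 & rt >= 100 & rt <= 5000)
nrow(kept)
```

Flagging within each participant-by-sentence sequence keeps a mistake at the end of one sentence from spilling over into the next; whether the real analysis resets at sentence boundaries is an assumption here.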

Within the filtered data, each story was read between 3 and 8 times, for an average of 6.3.

We also run the analyses on only the words that come before the first mistake in each sentence (40809 words).
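
One way to select the pre-mistake words, as a sketch on toy data: a running count of mistakes within each sentence is zero exactly for the words read before the first mistake. The column names are hypothetical, and the real data would also need to be grouped by participant.

```r
# Toy data: sentence 1 has a mistake on its third word, sentence 2 has none.
trials <- data.frame(
  sentence = rep(1:2, each = 5),
  correct  = c(TRUE, TRUE, FALSE, TRUE, TRUE,
               TRUE, TRUE, TRUE,  TRUE, TRUE)
)

# TRUE while no mistake has occurred so far in the sentence.
trials$before_mistake <- ave(trials$correct, trials$sentence,
                             FUN = function(ok) cumsum(!ok) == 0)

pre_mistake <- trials[trials$before_mistake, ]
nrow(pre_mistake)
```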

On the modelling side (after earlier attempts without this filtering), we include only words that are a single token and are in the vocabulary of every model. We also include only words for which we have a frequency. This is roughly equivalent to excluding words with punctuation.
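
A rough sketch of this filter, with made-up vocabularies and frequencies (the object names `vocabs` and `freq` are hypothetical). It only checks vocabulary membership of the whole word form, which approximates the single-token requirement but is not the same as running each model's tokenizer.

```r
words <- c("The", "cat", "sat,", "on", "zyzzyva")

# Hypothetical vocabularies, one per language model.
vocabs <- list(
  ngram = c("The", "cat", "sat", "on", "zyzzyva"),
  grnn  = c("The", "cat", "sat", "on"),
  txl   = c("The", "cat", "on", "zyzzyva")
)

# Hypothetical frequency table, named by word form.
freq <- c(The = 25.1, cat = 16.3, on = 23.8, sat = 15.0)

# Keep a word only if every vocabulary contains it and it has a frequency.
in_all_vocabs <- Reduce(`&`, lapply(vocabs, function(v) words %in% v))
has_freq      <- words %in% names(freq)
keep          <- in_all_vocabs & has_freq

words[keep]
```

In this toy example the punctuated form "sat," drops out because no vocabulary contains it, which is the sense in which the filter behaves like excluding words with punctuation.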

We use as predictors:

length of the stripped word, in characters
unigram frequency of the word. Word frequencies are calculated by running word_tokenize on the Gulordava training data and counting instances. (This tends to split off punctuation, but it is case-sensitive.) Frequencies are represented as log2 of the expected number of occurrences per 1 billion words.
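
As a worked example of the frequency scale, here is a minimal sketch with invented counts and an invented corpus size (`counts` and `total_tokens` are placeholders, not values from the real training data):

```r
# Invented counts; "the" and "The" are counted separately because
# the counting is case-sensitive.
counts       <- c(the = 50000, The = 6000, cat = 120)
total_tokens <- 1e6  # invented corpus size

# Expected occurrences per 1 billion words, then log2 of that.
per_billion <- counts / total_tokens * 1e9
log2_freq   <- log2(per_billion)

round(log2_freq, 2)
```

On this scale a difference of 1 corresponds to a doubling of frequency.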

Surprisals are measured in bits. They come from the following language models:

n-gram (5-gram, Kneser-Ney smoothed)
GRNN
Transformer-XL
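
For reference, surprisal in bits is the negative base-2 log of a word's probability in context. The sketch below uses invented probabilities; the conversion from natural-log probabilities is included only as an assumption about how a model might report its output.

```r
# Invented conditional probabilities for three words.
p <- c(0.5, 0.125, 0.01)

# Surprisal in bits.
surprisal_bits <- -log2(p)

# If a model reports natural-log probabilities, divide by log(2) to get bits.
log_p_nats <- log(p)
from_nats  <- -log_p_nats / log(2)

surprisal_bits
```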

For the GAM models, we center length and frequency but not surprisal. We want surprisal to stay interpretable on its original scale, but we will also plot its effect (at least for the bootstrapping) with length and frequency set to 0, so those two need to be centered for 0 to correspond to their means. (Not sure whether this last point is actually true or matters.)
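
Centering here just means subtracting the mean, so that 0 is the average word. A minimal sketch with invented values (the column names are illustrative):

```r
# Invented by-word predictors.
words <- data.frame(
  freq      = c(12.1, 18.4, 21.0, 15.5),
  length    = c(9, 3, 2, 6),
  surprisal = c(11.2, 4.0, 1.5, 7.3)
)

# Center length and frequency; surprisal stays in raw bits.
words$freq_center   <- words$freq - mean(words$freq)
words$length_center <- words$length - mean(words$length)

colMeans(words[, c("freq_center", "length_center")])
```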


GAMs

# No hierarchical structure; includes the freq x len interaction.
# The ti() terms keep the main effects separate from the interaction.
formula_interact <- rt ~ s(surprisal, bs = "cr", k = 20) +
  ti(freq, bs = "cr") +
  ti(len, bs = "cr") +
  ti(freq, len, bs = "cr") +
  s(prev_surp, bs = "cr", k = 20) +
  ti(prev_freq, bs = "cr") +
  ti(prev_len, bs = "cr") +
  ti(prev_freq, prev_len, bs = "cr")

# No hierarchical structure; NO freq x len interaction.
formula_no_interact <- rt ~ s(surprisal, bs = "cr", k = 20) +
  ti(freq, bs = "cr") +
  ti(len, bs = "cr") +
  s(prev_surp, bs = "cr", k = 20) +
  ti(prev_freq, bs = "cr") +
  ti(prev_len, bs = "cr")

All of these models are fit to by-item mean RTs.
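
To make the workflow concrete, here is a sketch on simulated data of averaging RTs by item, fitting both formulas, and comparing the fits. It assumes the models are fit with `gam()` from the mgcv package (the `s()` and `ti()` syntax suggests this) using the default Gaussian family. The data, effect sizes, and object names are all invented.

```r
library(mgcv)

set.seed(1)
n_items <- 600

# Simulated item-level predictors (invented distributions).
items <- data.frame(
  item      = seq_len(n_items),
  surprisal = rgamma(n_items, shape = 2, scale = 4),
  freq      = rnorm(n_items, sd = 3),
  len       = sample(1:12, n_items, replace = TRUE) - 6.5
)

# Previous-word predictors: shift each column down by one item.
lag1 <- function(x) c(NA, head(x, -1))
items$prev_surp <- lag1(items$surprisal)
items$prev_freq <- lag1(items$freq)
items$prev_len  <- lag1(items$len)
items <- items[-1, ]  # the first item has no previous word

# Simulated trial-level RTs: six readers per item.
trials <- items[rep(seq_len(nrow(items)), each = 6), ]
trials$rt <- 700 + 15 * trials$surprisal + 10 * trials$len -
  8 * trials$freq + 5 * trials$prev_surp + rnorm(nrow(trials), sd = 150)

# By-item mean RTs, joined back to the item-level predictors.
item_means <- aggregate(rt ~ item, data = trials, FUN = mean)
by_item    <- merge(items, item_means, by = "item")

formula_interact <- rt ~ s(surprisal, bs = "cr", k = 20) +
  ti(freq, bs = "cr") + ti(len, bs = "cr") + ti(freq, len, bs = "cr") +
  s(prev_surp, bs = "cr", k = 20) +
  ti(prev_freq, bs = "cr") + ti(prev_len, bs = "cr") +
  ti(prev_freq, prev_len, bs = "cr")

formula_no_interact <- rt ~ s(surprisal, bs = "cr", k = 20) +
  ti(freq, bs = "cr") + ti(len, bs = "cr") +
  s(prev_surp, bs = "cr", k = 20) +
  ti(prev_freq, bs = "cr") + ti(prev_len, bs = "cr")

m_interact    <- gam(formula_interact, data = by_item)
m_no_interact <- gam(formula_no_interact, data = by_item)

# Analysis of deviance table comparing the two fits.
comparison <- anova(m_interact, m_no_interact)
comparison
```

The simulated RTs contain no frequency-by-length interaction, so in this toy setting the interaction terms should buy little reduction in deviance.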

## Analysis of Deviance Table
## 
## Model 1: rt ~ s(surprisal, bs = "cr", k = 20) + ti(freq, bs = "cr") + 
##     ti(len, bs = "cr") + ti(freq, len, bs = "cr") + s(prev_surp, 
##     bs = "cr", k = 20) + ti(prev_freq, bs = "cr") + ti(prev_len, 
##     bs = "cr") + ti(prev_freq, prev_len, bs = "cr")
## Model 2: rt ~ s(surprisal, bs = "cr", k = 20) + ti(freq, bs = "cr") + 
##     ti(len, bs = "cr") + s(prev_surp, bs = "cr", k = 20) + ti(prev_freq, 
##     bs = "cr") + ti(prev_len, bs = "cr")
##   Resid. Df Resid. Dev      Df Deviance
## 1    6324.8  355096280                 
## 2    6337.0  357451934 -12.278 -2355654

## Analysis of Deviance Table
## 
## Model 1: rt ~ s(surprisal, bs = "cr", k = 20) + ti(freq, bs = "cr") + 
##     ti(len, bs = "cr") + ti(freq, len, bs = "cr") + s(prev_surp, 
##     bs = "cr", k = 20) + ti(prev_freq, bs = "cr") + ti(prev_len, 
##     bs = "cr") + ti(prev_freq, prev_len, bs = "cr")
## Model 2: rt ~ s(surprisal, bs = "cr", k = 20) + ti(freq, bs = "cr") + 
##     ti(len, bs = "cr") + s(prev_surp, bs = "cr", k = 20) + ti(prev_freq, 
##     bs = "cr") + ti(prev_len, bs = "cr")
##   Resid. Df Resid. Dev      Df Deviance
## 1    6328.9  326754835                 
## 2    6339.6  328514957 -10.785 -1760122

## Analysis of Deviance Table
## 
## Model 1: rt ~ s(surprisal, bs = "cr", k = 20) + ti(freq, bs = "cr") + 
##     ti(len, bs = "cr") + ti(freq, len, bs = "cr") + s(prev_surp, 
##     bs = "cr", k = 20) + ti(prev_freq, bs = "cr") + ti(prev_len, 
##     bs = "cr") + ti(prev_freq, prev_len, bs = "cr")
## Model 2: rt ~ s(surprisal, bs = "cr", k = 20) + ti(freq, bs = "cr") + 
##     ti(len, bs = "cr") + s(prev_surp, bs = "cr", k = 20) + ti(prev_freq, 
##     bs = "cr") + ti(prev_len, bs = "cr")
##   Resid. Df Resid. Dev      Df Deviance
## 1    6326.2  338211260                 
## 2    6340.0  340297053 -13.794 -2085793

## Analysis of Deviance Table
## 
## Model 1: rt ~ s(surprisal, bs = "cr", k = 20) + ti(freq, bs = "cr") + 
##     ti(len, bs = "cr") + ti(freq, len, bs = "cr") + s(prev_surp, 
##     bs = "cr", k = 20) + ti(prev_freq, bs = "cr") + ti(prev_len, 
##     bs = "cr") + ti(prev_freq, prev_len, bs = "cr")
## Model 2: rt ~ s(surprisal, bs = "cr", k = 20) + ti(freq, bs = "cr") + 
##     ti(len, bs = "cr") + s(prev_surp, bs = "cr", k = 20) + ti(prev_freq, 
##     bs = "cr") + ti(prev_len, bs = "cr")
##   Resid. Df Resid. Dev      Df Deviance
## 1    6324.7  316288721                 
## 2    6335.7  318041282 -10.989 -1752561

I don’t know how to interpret the above.
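
One possible reading, offered as a rough sketch rather than a settled interpretation: in each table, Model 1 is the formula with the freq x len interaction and Model 2 is the one without it. The second row gives the differences between the two fits, so in the first table the interaction model lowers the residual deviance by 2355654 while using about 12.3 more effective degrees of freedom. The arithmetic below redoes that from the printed numbers and forms an approximate F statistic by hand. It assumes a Gaussian model, and any p-value from comparing GAMs this way is only approximate. The variable names are invented; the numbers are copied from the first table.

```r
# Values copied from the first table above.
resid_df  <- c(6324.8, 6337.0)        # Model 1 (interaction), Model 2 (none)
resid_dev <- c(355096280, 357451934)
d_df      <- 12.278                   # magnitude of the reported Df

# Extra deviance left unexplained when the interaction terms are dropped.
d_dev <- resid_dev[2] - resid_dev[1]

# Residual variance estimate from the larger model; its square root is
# roughly the residual SD in RT units.
scale_est <- resid_dev[1] / resid_df[1]
resid_sd  <- sqrt(scale_est)

# Approximate F statistic and p-value for the interaction terms.
f_stat <- (d_dev / d_df) / scale_est
p_val  <- pf(f_stat, d_df, resid_df[1], lower.tail = FALSE)

c(d_dev = d_dev, resid_sd = resid_sd, f_stat = f_stat, p_val = p_val)
```

If the models were fit with mgcv, calling `anova()` on the two fits with `test = "F"` should print a comparable test directly.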

I also don’t know how to interpret the plots.

Natural Stories GAM analysis
