Overview

One hundred participants read naturalistic stories from the Natural Stories corpus. Each participant read one story.

We exclude

Within the filtered data, each story was read between 3 and 8 times, for an average of 6.3.

We also run the analyses on only the words that come before any mistake in their sentence (40809 words).
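A minimal sketch of one way to select the pre-mistake words, assuming "before mistakes" means every word that precedes the first mistake in its sentence. The data frame and the `correct` column are hypothetical; only the idea of filtering per sentence comes from the text.

```r
# Toy data: one row per word, sorted by position within each sentence.
# `correct` (hypothetical column) marks whether the response to that word was right.
d <- data.frame(
  sentence = c(1, 1, 1, 1, 2, 2, 2),
  word_num = c(1, 2, 3, 4, 1, 2, 3),
  correct  = c(TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, TRUE)
)

# Running count of mistakes within each sentence (rows must already be in word order).
d$errors_so_far <- ave(as.integer(!d$correct), d$sentence, FUN = cumsum)

# Keep words seen before any mistake; the mistaken word itself is dropped too.
pre_mistake <- d[d$errors_so_far == 0, ]
print(pre_mistake)
```

In this toy case, sentence 1 keeps its first two words and sentence 2 keeps all three.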

On the modelling side (after earlier attempts without this filtering), we include only words that are a single token and are in the vocabulary of every model. We also include only words that have a frequency estimate. This is roughly equivalent to excluding words with punctuation.
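The filtering rule can be sketched as a single logical mask. Everything in the toy data below is hypothetical (the column names `gpt_ntok` and `grnn_known`, and the values); it only illustrates the three conditions: single token, in vocabulary, and has a frequency.

```r
# Hypothetical per-word table: token count under one tokenizer, whether another
# model knows the word, and a corpus frequency (NA when none is available).
words <- data.frame(
  word       = c("the", "dog,", "barked", "zyzzyva"),
  gpt_ntok   = c(1, 2, 1, 3),
  grnn_known = c(TRUE, FALSE, TRUE, FALSE),
  freq       = c(5e6, NA, 1200, NA),
  stringsAsFactors = FALSE
)

# A word is kept only if it passes every condition.
keep <- words$gpt_ntok == 1 & words$grnn_known & !is.na(words$freq)
filtered <- words[keep, ]
print(filtered$word)
```

With more models, each would contribute its own single-token and in-vocabulary condition to the mask.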

We use as predictors:

Surprisals are measured in bits.
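For reference, surprisal in bits is the negative base-2 log probability of a word given its context. The helper names below are illustrative, and the conversion from natural-log probabilities is an assumption about what a model might output, not something stated in this document.

```r
# Surprisal in bits from a probability.
surprisal_bits <- function(p) -log2(p)

# Conversion for a model that reports natural-log probabilities (assumed case).
nats_to_bits <- function(logprob_nats) -logprob_nats / log(2)

surprisal_bits(0.5)      # 1 bit
surprisal_bits(0.125)    # 3 bits
nats_to_bits(log(0.25))  # 2 bits, up to floating-point error
```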

For the GAMs, we center length and frequency but not surprisal. We want surprisal to stay interpretable on its raw scale, but we will also be plotting its effect (at least for the bootstrapping) with length and frequency set to 0, so those two need to be centered for 0 to correspond to their means. (Not sure this last piece is actually true or matters.)
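A small sketch of the centering step on made-up values. The `_center` suffix follows the column names that appear in the output below; the numbers themselves are invented.

```r
# Toy predictors (values are made up).
d <- data.frame(
  freq      = c(2, 4, 9),
  length    = c(3, 5, 7),
  surprisal = c(1.5, 6, 10)
)

# Center frequency and length; leave surprisal on its raw scale (bits).
d$freq_center   <- d$freq - mean(d$freq)
d$length_center <- d$length - mean(d$length)
print(d)
```

After centering, a prediction made at `freq_center = 0` and `length_center = 0` is a prediction for a word of average frequency and average length, which is why centering matters if the surprisal curve is plotted with the other predictors fixed at 0.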

## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   Story_Num = col_double(),
##   Sentence_Num = col_double(),
##   Sentence = col_character()
## )
## Joining, by = c("Story_Num", "Sentence_Num")
## `summarise()` has grouped output by 'word_num', 'word', 'sentence', 'type', 'Story_Num', 'Sentence_Num', 'Word_In_Story_Num', 'txl_surp', 'ngram_surp', 'grnn_surp', 'gpt_surp', 'freq', 'length', 'txl_center', 'ngram_center', 'grnn_center', 'freq_center', 'length_center', 'gpt_center', 'past_txl_surp', 'past_ngram_surp', 'past_grnn_surp', 'past_gpt_surp', 'past_freq', 'past_length', 'past_txl_center', 'past_ngram_center', 'past_grnn_center', 'past_freq_center', 'past_length_center', 'past_gpt_center'. You can override using the `.groups` argument.

GAMs

# Non-hierarchical model WITH the freq x len interaction.
# ti() terms give the main effects and the interaction as separate smooths.
formula_interact <- rt ~ s(surprisal, bs = "cr", k = 20) +
                     ti(freq, bs = "cr") +
                     ti(len, bs = "cr") +
                     ti(freq, len, bs = "cr") +
                     s(prev_surp, bs = "cr", k = 20) +
                     ti(prev_freq, bs = "cr") +
                     ti(prev_len, bs = "cr") +
                     ti(prev_freq, prev_len, bs = "cr")

# Non-hierarchical model WITHOUT the freq x len interaction.
formula_no_interact <- rt ~ s(surprisal, bs = "cr", k = 20) +
                     ti(freq, bs = "cr") +
                     ti(len, bs = "cr") +
                     s(prev_surp, bs = "cr", k = 20) +
                     ti(prev_freq, bs = "cr") +
                     ti(prev_len, bs = "cr")
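As a rough illustration of how a formula like these is fit, the sketch below runs `mgcv::gam()` on simulated data. The simulated data, the effect sizes, and the choice of `method = "REML"` are all assumptions made for the example; the document does not say which settings were used.

```r
library(mgcv)

# Simulated stand-in data (all values and effect sizes are invented).
set.seed(1)
n <- 500
sim <- data.frame(
  surprisal = runif(n, 0, 20),
  freq      = rnorm(n),
  len       = rnorm(n),
  prev_surp = runif(n, 0, 20),
  prev_freq = rnorm(n),
  prev_len  = rnorm(n)
)
sim$rt <- 800 + 10 * sim$surprisal + 5 * sim$prev_surp +
  15 * sim$len + rnorm(n, sd = 50)

# Copy of the no-interaction formula so the example runs on its own.
f_no_interact <- rt ~ s(surprisal, bs = "cr", k = 20) +
  ti(freq, bs = "cr") + ti(len, bs = "cr") +
  s(prev_surp, bs = "cr", k = 20) +
  ti(prev_freq, bs = "cr") + ti(prev_len, bs = "cr")

# REML smoothing-parameter estimation is an assumption, not the document's setting.
fit <- gam(f_no_interact, data = sim, method = "REML")
print(summary(fit)$s.table)  # effective df and approximate tests per smooth
```

Calling `plot(fit, select = 1)` on such a fit draws the estimated surprisal smooth as a centered partial effect, so the y-axis shows the change in `rt` relative to the average, not absolute reading times.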

All of these models are fit to by-item mean data (reading times averaged over participants for each word).
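A minimal sketch of computing by-item means with base R. `Story_Num` and `Word_In_Story_Num` are column names that appear in the output above; the `participant` column and all values are hypothetical.

```r
# Toy trial-level data: several participants read the same words.
trials <- data.frame(
  Story_Num         = 1,
  Word_In_Story_Num = c(1, 1, 1, 2, 2),
  participant       = c("a", "b", "c", "a", "b"),
  rt                = c(600, 700, 800, 500, 700)
)

# One row per item (word position within a story), with rt averaged over participants.
by_item <- aggregate(rt ~ Story_Num + Word_In_Story_Num, data = trials, FUN = mean)
print(by_item)
```

Averaging first means each word contributes one observation to the GAM, so between-participant variability is not modelled.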

## Analysis of Deviance Table
## 
## Model 1: rt ~ s(surprisal, bs = "cr", k = 20) + ti(freq, bs = "cr") + 
##     ti(len, bs = "cr") + ti(freq, len, bs = "cr") + s(prev_surp, 
##     bs = "cr", k = 20) + ti(prev_freq, bs = "cr") + ti(prev_len, 
##     bs = "cr") + ti(prev_freq, prev_len, bs = "cr")
## Model 2: rt ~ s(surprisal, bs = "cr", k = 20) + ti(freq, bs = "cr") + 
##     ti(len, bs = "cr") + s(prev_surp, bs = "cr", k = 20) + ti(prev_freq, 
##     bs = "cr") + ti(prev_len, bs = "cr")
##   Resid. Df Resid. Dev      Df Deviance
## 1    6324.8  355096280                 
## 2    6337.0  357451934 -12.278 -2355654
## Analysis of Deviance Table
## 
## Model 1: rt ~ s(surprisal, bs = "cr", k = 20) + ti(freq, bs = "cr") + 
##     ti(len, bs = "cr") + ti(freq, len, bs = "cr") + s(prev_surp, 
##     bs = "cr", k = 20) + ti(prev_freq, bs = "cr") + ti(prev_len, 
##     bs = "cr") + ti(prev_freq, prev_len, bs = "cr")
## Model 2: rt ~ s(surprisal, bs = "cr", k = 20) + ti(freq, bs = "cr") + 
##     ti(len, bs = "cr") + s(prev_surp, bs = "cr", k = 20) + ti(prev_freq, 
##     bs = "cr") + ti(prev_len, bs = "cr")
##   Resid. Df Resid. Dev      Df Deviance
## 1    6328.9  326754835                 
## 2    6339.6  328514957 -10.785 -1760122
## Analysis of Deviance Table
## 
## Model 1: rt ~ s(surprisal, bs = "cr", k = 20) + ti(freq, bs = "cr") + 
##     ti(len, bs = "cr") + ti(freq, len, bs = "cr") + s(prev_surp, 
##     bs = "cr", k = 20) + ti(prev_freq, bs = "cr") + ti(prev_len, 
##     bs = "cr") + ti(prev_freq, prev_len, bs = "cr")
## Model 2: rt ~ s(surprisal, bs = "cr", k = 20) + ti(freq, bs = "cr") + 
##     ti(len, bs = "cr") + s(prev_surp, bs = "cr", k = 20) + ti(prev_freq, 
##     bs = "cr") + ti(prev_len, bs = "cr")
##   Resid. Df Resid. Dev      Df Deviance
## 1    6326.2  338211260                 
## 2    6340.0  340297053 -13.794 -2085793
## Analysis of Deviance Table
## 
## Model 1: rt ~ s(surprisal, bs = "cr", k = 20) + ti(freq, bs = "cr") + 
##     ti(len, bs = "cr") + ti(freq, len, bs = "cr") + s(prev_surp, 
##     bs = "cr", k = 20) + ti(prev_freq, bs = "cr") + ti(prev_len, 
##     bs = "cr") + ti(prev_freq, prev_len, bs = "cr")
## Model 2: rt ~ s(surprisal, bs = "cr", k = 20) + ti(freq, bs = "cr") + 
##     ti(len, bs = "cr") + s(prev_surp, bs = "cr", k = 20) + ti(prev_freq, 
##     bs = "cr") + ti(prev_len, bs = "cr")
##   Resid. Df Resid. Dev      Df Deviance
## 1    6324.7  316288721                 
## 2    6335.7  318041282 -10.989 -1752561

I don’t know how to interpret the above.
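One hedged way to read the first table: Model 1 (with the freq x len interactions) has a lower residual deviance than Model 2, at a cost of about 12.3 effective degrees of freedom. No test statistic is printed, most likely because `anova()` was called without a `test` argument. The sketch below redoes the arithmetic by hand from the printed numbers, assuming a Gaussian model whose scale is estimated from the larger model. The resulting F and p-value are approximations only, since tests comparing smooth terms are themselves approximate.

```r
# Values copied from the first Analysis of Deviance table above.
resid_df_interact     <- 6324.8
resid_dev_interact    <- 355096280
resid_dev_no_interact <- 357451934
delta_df              <- 12.278  # extra effective df used by the interaction model

# Drop in deviance gained by adding the interactions.
delta_dev <- resid_dev_no_interact - resid_dev_interact

# Scale (residual variance) estimate from the larger model (assumed Gaussian).
scale_hat <- resid_dev_interact / resid_df_interact

# Approximate F statistic and p-value for the comparison.
f_stat   <- (delta_dev / delta_df) / scale_hat
p_approx <- pf(f_stat, delta_df, resid_df_interact, lower.tail = FALSE)

# Share of the smaller model's residual deviance removed by the interactions.
prop_reduction <- delta_dev / resid_dev_no_interact

print(c(F = f_stat, p = p_approx, prop_reduction = prop_reduction))
```

Passing `test = "F"` to `anova()` on the two fitted `gam` objects should print a comparable statistic directly. The proportional reduction is worth reading alongside any p-value: with several thousand items, a small improvement in fit can still come out as statistically reliable.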

I also don’t know how to interpret the plots.