In case you weren’t around on Monday

Why generalised linear models?

  • Unified framework of advanced statistical techniques that covers a wide range of data-analysis problems.
    • Linear regression and General linear models
    • Generalised linear models: a transformation (link function) allows modelling of further data types, including categorical, ordinal, and count data.
    • Multi-level generalised linear models: data with hierarchical structure (repeated measures).
  • Backbone of other advanced models: machine learning, time series models, path analysis, structural equation models, factor analysis, signal detection models.

Normal linear model (formal description)

\[ \begin{aligned} y_i &\sim \mathcal{N}(\mu, \sigma^2) \end{aligned} \]

  • Model of how the distribution of one variable, known as the outcome, varies as a function of other variables, known as predictors.
  • Each observation \(y\) is a sample (indexed by \(i\)) from (“\(\sim\)”) a univariate normally distributed probability model \(\mathcal{N}\).
  • Greek letters are the parameters that describe this distribution: mean \(\mu\) and variance \(\sigma^2\).

Normal linear model (formal description)

\[ \begin{aligned} y_i &\sim \mathcal{N}(\mu_i, \sigma^2)\\ \mu_i &= \beta_0 + \beta_1 \cdot x_{1i} \\ \end{aligned} \]

  • Linear regression equation allows decomposition: the value of \(\mu_i\) is a linear function of the values of one (or more) predictor variables \(x\).
  • Intercept \(\beta_0\) is “constant”.
  • Slope \(\beta_1\) is weighted by the predictor \(x_{1i}\).
  • If \(x_{1i}\) takes on the value 0, \(\mu_i = \beta_0\).
  • If \(x_{1i}\) takes on the value 1, \(\mu_i = \beta_0 + \beta_1\).
  • \(\beta_1\) is therefore the difference in \(\mu_i\) between \(x_{1i}=1\) and \(x_{1i}=0\).

Simulation of normally distributed data

We can use lm to implement such a model.

model <- lm(y ~ x, data = data)

We can simulate data that exactly meet model assumptions (normal distribution, equal variance, independence etc.) with known model parameters.

Simulation of normally distributed data

  • rnorm generates unimodal, symmetrically distributed values.
  • r in rnorm = random; norm = normal
# Generate data
y <- rnorm(n = 1000, mean = 550, sd = 10)

Simulation of normally distributed data

  • rnorm generates unimodal, symmetrically distributed values.
  • r in rnorm = random; norm = normal
# Decompose the mean
beta_0 <- 500
beta_1 <- 50
# Generate data
y <- rnorm(n = 1000, mean = beta_0 + beta_1, sd = 10)

Simulation of normally distributed data

# Load the tidyverse for tibble, glimpse and pivot_longer
library(tidyverse)

# Set parameter values
n <- 1000
beta_0 <- 500 
beta_1 <- 50
sigma <- 100
# Random data for group 1
group_1 <- rnorm(n = n/2, mean = beta_0, sd = sigma)
# Random data for group 2
group_2 <- rnorm(n = n/2, mean = beta_0 + beta_1, sd = sigma)
# Generate data
sim_data <- tibble(group_1 = group_1,
                   group_2 = group_2) 

# Preview data
glimpse(sim_data)
Rows: 500
Columns: 2
$ group_1 <dbl> 560, 495, 485, 546, 291, 402, 397, 450, 355, 522, 415, 347, 41…
$ group_2 <dbl> 615, 640, 514, 592, 564, 496, 632, 678, 510, 565, 532, 677, 63…
# Change data format
sim_data <- pivot_longer(sim_data, cols = c(group_1, group_2), names_to = "x", values_to = "y")

# Preview data
glimpse(sim_data)
Rows: 1,000
Columns: 2
$ x <chr> "group_1", "group_2", "group_1", "group_2", "group_1", "group_2", "g…
$ y <dbl> 560, 615, 495, 640, 485, 514, 546, 592, 291, 564, 402, 496, 397, 632…

Simulation of normally distributed data

# As a reminder, these are the parameter values
beta_0 <- 500 
beta_1 <- 50
sigma <- 100

lm uses maximum likelihood estimation to determine the set of parameter values under which the data are most likely.

# Normal linear model of simulated data
model <- lm(y ~ x, data = sim_data)

coef(model) # Model coefficients
(Intercept)    xgroup_2 
      501.0        45.1 
sigma(model) # Standard deviation
[1] 96.9

Exercise: complete script normal_model_simulation.R

Real data

  • By using linear models to estimate parameter values from data we make assumptions about the process that generates the data.
  • For simulated data we know the process that generated the data:
    • normal distribution, because rnorm samples normally distributed data
    • both groups have the same variance, because sigma was the same.
  • Even when we know that the assumptions are met, the parameter estimates obtained are similar, but not identical, to the true parameter values (see the sketch below).
  • In real life we don’t know the true parameter values or the underlying process that generates the data.
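As a sketch of this point (hypothetical code reusing the parameter values from the simulation above), re-running the two-group simulation many times shows how the estimated slope scatters around the true \(\beta_1 = 50\):

# Hypothetical sketch: re-run the two-group simulation many times
# and inspect how the estimated slope varies around beta_1 = 50
estimates <- replicate(100, {
  y <- c(rnorm(500, mean = 500, sd = 100),
         rnorm(500, mean = 550, sd = 100))
  x <- rep(c("group_1", "group_2"), each = 500)
  coef(lm(y ~ x))[2]
})
summary(estimates) # centred near 50, but varying from sample to sample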

Real (hierarchical) data: Martin et al. (2010)

Martin et al. (2010) data (Experiment 3a)

data <- read_csv("../data/martin-etal-2010-exp3a.csv")
# A tibble: 1,036 × 4
   item   ppt    rt nptype   
  <dbl> <dbl> <dbl> <chr>    
1    11    10  1055 conjoined
2    11     9  2010 conjoined
3    11    11   461 conjoined
4    11     6   977 conjoined
5    11     7  1152 conjoined
# ℹ 1,031 more rows
summarise(data, across(c(ppt, item), list(n = ~length(unique(.))), .names = "{col}"))
# A tibble: 1 × 2
    ppt  item
  <int> <int>
1    12    48

Aggregation by participants (across items)

In standard repeated measures ANOVAs only one source of variance can be modelled at a time.

For example, one would aggregate across items within each participant, neglecting the by-items variance.

data_ppt <- summarise(data, rt = mean(rt), .by = c(nptype, ppt))
# A tibble: 24 × 3
  nptype      ppt    rt
  <chr>     <dbl> <dbl>
1 conjoined     1 1382.
2 simple        1 1323.
3 conjoined     2 1077.
4 simple        2 1108.
5 conjoined     3 1162.
# ℹ 19 more rows

Conditions nested within participants

Participant variation (by-participant means)

Model description

Each rt observation \(i \in 1 \dots N\) comes from a normal distribution with mean \(\mu_i\) and variance \(\sigma^2\).

\[ \text{rt}_i \sim \mathcal{N}(\mu_i, \sigma^2) \]

Using the linear regression equation, the mean \(\mu_i\) can be decomposed into

\[ \mu_i = \beta_0 + \beta_1 \cdot \text{nptype}_i + \text{ppt}_i \]

where \(\text{nptype}_i\) is dummy coded (by default as treatment contrast in alphabetical order)

\[ \text{nptype}_i = \left\{ \begin{array}{ll} 0, \text{ if nptype}_i = \texttt{conjoined}\\ 1, \text{ if nptype}_i = \texttt{simple}\\ \end{array} \right. \]

\(\beta_0\) is the average rt for conjoined NPs because

\(\beta_0 + \beta_1 \cdot 0 = \beta_0\),

the average rt for simple NPs is

\(\beta_0 + \beta_1 \cdot 1 = \beta_0 + \beta_1\),

and therefore \(\beta_1\) is the difference in rts between simple and conjoined NPs.

Finally \(\text{ppt}_i \sim \mathcal{N}(0, \sigma^2_\text{ppt})\).
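A minimal simulation sketch of this generative process (the parameter values are assumptions for illustration, not estimates from the data):

# Simulate the model above: each participant contributes
# a random offset drawn from N(0, sigma_ppt)
n_ppt <- 12
beta_0 <- 1100; beta_1 <- -35  # assumed fixed effects
sigma_ppt <- 150; sigma <- 100 # assumed variability
ppt_effect <- rnorm(n_ppt, mean = 0, sd = sigma_ppt)
sim <- expand.grid(ppt = 1:n_ppt, nptype = c(0, 1))
sim$rt <- rnorm(nrow(sim),
                mean = beta_0 + beta_1 * sim$nptype + ppt_effect[sim$ppt],
                sd = sigma)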

Model implementation in R

Repeated measures ANOVA:

library(afex)
m <- aov_car(rt ~ Error(ppt/nptype), data = data_ppt)
summary(m)
Univariate Type III Repeated-Measures ANOVA Assuming Sphericity

              Sum Sq num Df Error SS den Df F value  Pr(>F)    
(Intercept) 28783750      1   587304     11  539.11 1.1e-10 ***
nptype          6217      1    13043     11    5.24   0.043 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Linear mixed-effects model:

library(lmerTest)
m <- lmer(rt ~ nptype + (1|ppt), data = data_ppt)
anova(m)
Type III Analysis of Variance Table with Satterthwaite's method
       Sum Sq Mean Sq NumDF DenDF F value Pr(>F)  
nptype   6217    6217     1    11    5.24  0.043 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Exercises: complete script repeated_measures_model.R

and after that mixed_effects_model.R.

Item variation (by-image sets means)

Crossed participants and items

What’s a mixed-effects model?

  • Combination of fixed and random effects
  • Allows us to address dependencies in the data: hierarchical (nested, multi-level) and crossed factors
  • Fixed effects: systematic / deterministic effects (e.g. age, trial id, grouping variables)
  • Random effects: non-systematic effects (e.g. ppts, items, class, country)
  • Random effects can be nested: people within country; students within classroom within school within area etc.
  • Sometimes the distinction can be blurry: e.g. comparing differences between two test sites rather than treating test site as a repeated-measures variable.

Crossed participants and items

  • Traditional repeated-measures ANOVA can only handle one source of variance (participant or item).
  • Linear mixed-effects models can capture one or more (Baayen, Davidson, and Bates 2008).
m <- lmer(rt ~ nptype + ( 1 | ppt ) + ( 1 | item ), data = data)

Crossed participants and items

summary(m)$varcor
 Groups   Name        Std.Dev.
 item     (Intercept)  40.6   
 ppt      (Intercept) 161.2   
 Residual             233.5   
sigma(m)
[1] 234

Random intercepts

Predicted outcomes

summary(m)$coef
             Estimate Std. Error t value
(Intercept)    1111.8       48.0    23.2
nptypesimple    -33.4       14.5    -2.3
(coefs <- m@beta) # extract estimates
[1] 1111.8  -33.4
# Predicted mean for np conjoined
coefs[1] + coefs[2] * 0
[1] 1112
# Predicted mean for np simple
coefs[1] + coefs[2] * 1
[1] 1078
# Check predictor contrasts (default: treatment)
contrasts(factor(data$nptype))
          simple
conjoined      0
simple         1

Null-hypothesis test for lmer models?

t-values with \(|t| > 2\) correspond roughly to \(\alpha = .05\) (Baayen 2008):

summary(m)$coef
             Estimate Std. Error t value
(Intercept)    1111.8       48.0    23.2
nptypesimple    -33.4       14.5    -2.3

Confidence intervals: the 95% CI shows the range of hypotheses (parameter values) that can’t be ruled out at \(\alpha = .05\):

confint(m, "nptypesimple")
             2.5 % 97.5 %
nptypesimple -61.8  -4.93

Satterthwaite approximation for degrees of freedom (lmerTest):

m <- lmerTest::lmer(rt ~ nptype + ( 1 | ppt ) + ( 1 | item ), data = data)
summary(m)$coef
             Estimate Std. Error    df t value Pr(>|t|)
(Intercept)    1111.8       48.0  11.9    23.2 3.11e-11
nptypesimple    -33.4       14.5 977.9    -2.3 2.16e-02

Likelihood-ratio test:

m_null <- lmer(rt ~ 1 + ( 1 | ppt ) + ( 1 | item ), data = data)
anova(m_null, m)
Data: data
Models:
m_null: rt ~ 1 + (1 | ppt) + (1 | item)
m: rt ~ nptype + (1 | ppt) + (1 | item)
       npar   AIC   BIC logLik deviance Chisq Df Pr(>Chisq)  
m_null    4 14319 14339  -7155    14311                      
m         5 14316 14340  -7153    14306  5.28  1      0.022 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Predicted effects by participants and items

Random intercepts and random slopes

  • Random intercepts: size of np-type effect is constant.
  • Random slopes: size of np-type effect varies by participants and items.
  • Correlation between random intercepts and random slopes: e.g. np-type effect might be weaker for participants / items with longer rts.
  • Maximal random effects structure (Barr et al. 2013): for repeated-measures designs with crossed participants and items, we need random intercepts for participants and items with by-participants and by-items random slopes.
# Random intercepts model
m <- lmer(rt ~ nptype + ( 1 | ppt ) + ( 1 | item ), data = data)

# Maximal random effects structure 
m_max <- lmer(rt ~ nptype + ( nptype | ppt ) + ( nptype | item ), data = data)

Random intercepts and random slopes

Model summary

# Maximal random effects model 
summary(m_max)
Linear mixed model fit by REML. t-tests use Satterthwaite's method [
lmerModLmerTest]
Formula: rt ~ nptype + (nptype | ppt) + (nptype | item)
   Data: data

REML criterion at convergence: 14288

Scaled residuals: 
   Min     1Q Median     3Q    Max 
-4.505 -0.596 -0.147  0.449  5.002 

Random effects:
 Groups   Name         Variance Std.Dev. Corr 
 item     (Intercept)   2749.8   52.44        
          nptypesimple   522.1   22.85   -1.00
 ppt      (Intercept)  27159.7  164.80        
          nptypesimple    52.4    7.24   -1.00
 Residual              54359.2  233.15        
Number of obs: 1036, groups:  item, 48; ppt, 12

Fixed effects:
             Estimate Std. Error     df t value Pr(>|t|)    
(Intercept)    1111.9       49.3   11.5   22.58    7e-11 ***
nptypesimple    -33.6       15.0  125.7   -2.24    0.027 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation of Fixed Effects:
            (Intr)
nptypesimpl -0.310
optimizer (nloptwrap) convergence code: 0 (OK)
boundary (singular) fit: see help('isSingular')

Hypothetical examples

Task: Pair designs with their maximal random-effects structure!

  • Design 1 (between participants / within items): both NP type conditions were presented for each item set but half of the participants saw simple NPs and the other half saw conjoined NPs.
  • Design 2 (within participants / within items): each participant saw both conditions and each item was presented in both conditions.
  • Design 3 (within participants / between items): half of all items were conjoined and the other half were simple NP type; each participant saw both conditions (simple NPs and conjoined NPs).
  • Design 4 (between participants / between items): half of all items were conjoined and half were simple NP type; half of the participants saw simple NPs and the other half saw conjoined NPs.
Model_A <- lmer(rt ~ nptype + ( 1 | ppt ) + ( nptype | item ), data = data) 
Model_B <- lmer(rt ~ nptype + ( 1 | ppt ) + ( 1 | item ), data = data)
Model_C <- lmer(rt ~ nptype + ( nptype | ppt ) + ( nptype | item ), data = data) 
Model_D <- lmer(rt ~ nptype + ( nptype | ppt ) + ( 1 | item ), data = data) 

Exercise: complete exercise script mixed_effects_model_rirs.R

Generalised linear models

Type of outcome variable

Continuous:

library(lme4)
model <- lmer(dv ~ iv + ( random slopes | grouping factor ), data = data)

Binary:

library(lme4)
model <- glmer(dv ~ iv + ( random slopes | grouping factor ), data = data, family = binomial())

Count:

library(lme4)
model <- glmer(dv ~ iv + ( random slopes | grouping factor ), data = data, family = poisson())

Also negative-binomial lme4::glmer.nb and zero-inflated count models NBZIMM::glmm.zinb.

Ordinal:

library(ordinal)
model <- clmm(dv ~ iv + ( random slopes | grouping factor ), data = data)

Frequentist mixed-effects models require familiarity with different packages.

Convergence failures

  • Maximal random effects structures were introduced as the gold standard for repeated-measures experiments (Barr et al. 2013; Baayen, Davidson, and Bates 2008).
  • Easy to fit for simple designs with balanced samples and in simulations (Barr et al. 2013).
  • Convergence failures are well documented in the literature (Bates et al. 2015; Eager and Roy 2017; Kimball et al. 2016).


  • Overparameterisation: unidentifiable parameter estimates (Bates et al. 2015).
  • lme4 author Bates proposed the use of “parsimonious” random-effects structures (Bates et al. 2015).
  • Model selection based on the failure to fit a more complex model.

Alternative

  • Convergence failures (and other problems like floor and ceiling effects) can be (easily) addressed in Bayesian models.
  • Bayesian models converge by definition (as the number of iterations approaches \(\infty\)).

Fitting mixed-effects models in brms

(Bürkner 2017, 2018)

brms (Bürkner 2017)

  • R package for Bayesian models
  • (Almost) no more complicated to fit than lme4 models.
  • Syntax deliberately mimics lme4.
  • More probability models than other packages: gaussian, lognormal, bernoulli, poisson, zero_inflated_poisson, skew_normal, shifted_lognormal, exgaussian, et cetera
  • Allows mixture models, nonlinear syntax, (un)equal variance signal-detection theory, multivariate models


  • More flexibility is available only in Stan itself (Annis, Miller, and Palmeri 2017; Carpenter et al. 2017).
  • brms creates Stan code to compile a probabilistic MCMC (Markov Chain Monte Carlo) sampler.
  • Compiling the sampler and obtaining the full posterior can take time.

Exercise: complete script mixed_effects_model_brms.R

R syntax

Frequentist mixed-effects models

library(lme4)
fit_lmer <- lmer(rt ~ nptype + ( nptype | ppt ) + ( nptype | item ), data = data)

Bayesian mixed-effects models

library(brms)
fit_brm <- brm(rt ~ nptype + ( nptype | ppt ) + ( nptype | item ), data = data)

Compare the fitted models

coef(summary(fit_lmer)) # Coefficients of frequentist model
             Estimate Std. Error    df t value Pr(>|t|)
(Intercept)    1111.9       49.3  11.5   22.58 6.96e-11
nptypesimple    -33.6       15.0 125.7   -2.24 2.70e-02
fixef(fit_brm) # Coefficients of Bayesian model
             Estimate Est.Error   Q2.5    Q97.5
Intercept      1108.0      50.2 1010.1 1206.889
nptypesimple    -33.2      17.2  -66.3    0.749

Why Bayesian?

see e.g. Kruschke (2014)

Two universes

Two different philosophical frameworks for generalising from the sample to the population: what do the data tell us about “the truth”?

  • Null-hypothesis significance testing
  • Bayesian inference

Null-hypothesis significance testing

  • Classical / frequentist statistics; p-value based statistics
  • p-value: How plausible are the data (or something more extreme) if we assume that there’s no effect?
  • Evaluating the probability of data (e.g. t-value, F-statistic) assuming that the null hypothesis \(H_0\) is true.
  • If the data are extreme enough, \(H_0\) is concluded to be implausible (rejected).
  • If \(H_0\) is implausible, we assume \(H_1\) (alternative hypothesis).
  • Statistically significant means data are unexpected under \(H_0\).

\[ Pr(\text{data} \mid H_0 = \text{TRUE}) \]

Its problems

  • Focus on significance: small effects can become significant when the sample is large enough.
  • People think of statistical significance as continuous, not binary.
  • How often do we actually believe in \(H_0\)?
  • What do you do if \(p=0.051\)?


  • Systematic misinterpretations of p-values (Colquhoun 2017; Greenland et al. 2016) and CIs (Hoekstra et al. 2014; Morey et al. 2016)
  • Optional stopping (Kruschke 2018)
  • An \(\alpha\)-level of 0.05 is commonly misread as a 5% chance that:
    • our effect isn’t real.
    • we’ll miss an effect that is real.

Bayesian inference

  • Uncertainty about a hypothesis is expressed in probability distributions.
  • One’s belief (the prior; example later) about a hypothesis (e.g. a difference between conditions, an interaction, etc.) is updated as new information becomes available.

\[ Pr(H \mid \text{data}) \]

Bayes’ Theorem

\[ \underbrace{Pr(\theta \mid \text{data})}_{\text{posterior}} \propto \underbrace{Pr(\theta)}_{\text{prior}} \cdot \underbrace{Pr(\text{data} \mid \theta)}_{\text{likelihood}} \]

  • Likelihood: probability of data given the model parameter value(s)
  • Prior: what do we already know about model parameter(s)
  • Posterior: our updated belief in the parameter value(s) after seeing the data

Why is Bayesian inference important?

Instead of concluding


\[ Pr(\text{data} \mid H_0 = \text{TRUE}) \]

we can answer


\[ Pr(H \mid \text{data}) \]

Why is Bayesian inference important?

It tells us what we want to know!


  • What’s the probability of an effect given the data (and what we already knew before)?
  • Summary stats of a Bayesian model often match what people mistakenly think frequentist quantities (p-values, CIs) mean (Nicenboim and Vasishth 2016):
  • What’s the relative evidence for one model (e.g., \(H_0\)) vs. another (e.g., \(H_1\))?
  • What interval contains an unknown parameter value with .95 probability?

Why is Bayesian inference important?

  • Estimate the posterior uncertainty over parameter values based on the observed data (Kruschke 2014).
  • For this we need
  1. a probability model (e.g. a normal distribution and a set of predictors)
  2. priors (on e.g. means, difference, variability)
  • brms uses probabilistic sampling to determine the posterior uncertainty about the parameter value.
  • This involves calculating the weighted probability of the likelihood of the data given a proposed parameter value in the (potentially high-dimensional) parameter space.
  • This can take time.

Convergence and model diagnostics

see Lambert (2018) for HMC

So what about this?

Progress of probabilistic sampler.

By default:

  • 2,000 iterations per chain for parameter estimation
  • the first half of each chain’s iterations are warm-up samples, which are discarded
  • 4 chains to establish convergence
  • How many posterior samples have we got for inference (= chains * (iterations - warm-up))? See the sketch below.
  • How many iterations / chains do you need?
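A quick sanity check of that arithmetic (a sketch using the brms defaults listed above); the result matches the 4,000 posterior draws we extract later:

chains <- 4
iterations <- 2000
warmup <- iterations / 2
chains * (iterations - warmup) # 4000 posterior samples for inference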

Parameter estimation

  • Hamiltonian Monte Carlo: a Markov chain Monte Carlo method for obtaining random samples.
  • The likelihood of each random sample is estimated given the data and the prior.
  • The direction changes if the probability of the proposed sample is lower than that of the previous one.

Parameter estimation (traceplot)

Model diagnostics and convergence checks

  • Traceplots: we want fat hairy caterpillars
  • \(\hat{R}\) convergence statistic; should be \(<1.1\) (Gelman and Rubin 1992)
  • Posterior predictive checks: compare data \(y\) and predictions \(y_{rep}\) (see the sketch after this list)
  • Exercise: complete script model_diagnostics.R
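A minimal sketch of these checks in brms (assuming the fit_brm object fitted earlier):

plot(fit_brm)     # traceplots and posterior densities per parameter
summary(fit_brm)  # includes the Rhat statistic for every parameter
pp_check(fit_brm) # posterior predictive check: data y vs predictions y_rep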

Where are the priors?

see e.g. chapter 7 in Lee and Wagenmakers (2014)

Priors

  • Prior knowledge about plausible parameter values.
  • This knowledge is expressed as probability distributions (e.g. normal distributions).
  • Priors help the probabilistic sampler by limiting the parameter space.
  • Small data samples are sensitive to prior information, which makes intuitive sense.
  • Otherwise the data typically overwhelm the prior (automatic Ockham’s razor).
  • Less common: testing the data against a (prior) effect suggested by the literature.

Priors: intercept

  • A priori, each value in the parameter space is equally possible.
  • Let’s think about the parameter space for rts.

Priors: intercept

  • Can rts range between -\(\infty\) and \(\infty\)?
  • What are plausible lower and upper ends?
  • Plausible probability distribution for rts:

\[ \beta_0 \sim \mathcal{N}(1000 \text{ msecs}, ???) \]

Priors: intercept

  • Can rts range between -\(\infty\) and \(\infty\)?
  • What are plausible lower and upper ends? (see the sketch below)
  • Plausible probability distribution for rts:

\[ \beta_0 \sim \mathcal{N}(1000 \text{ msecs}, 150\text{ msecs}) \]
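As a quick check of what this prior implies (a sketch, not part of the exercise scripts), the central 95% of \(\mathcal{N}(1000, 150)\) spans roughly 700 to 1300 msecs:

# Implied central 95% range of the intercept prior
qnorm(c(0.025, 0.975), mean = 1000, sd = 150) # ~706 and ~1294 msecs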

Priors: slope

  • What do we know about the difference between simple and conjoined NPs?
  • Let’s pretend we don’t know much: so a priori there might be a mean difference of 0 msecs.
  • How large are rt differences usually? Anything larger than 200 msecs would be massive, so let’s propose a humble standard deviation of 100 msecs.

\[ \beta_1 \sim \mathcal{N}(0\text{ msecs}, 100\text{ msecs}) \]

Priors

  • brms uses default priors that can be changed as appropriate.
  • Some defaults are sensible; the (flat) priors are worth specifying yourself (see the sketch below).

Check defaults used earlier:

prior_summary(fit_brm)
                     prior     class         coef group
                    (flat)         b                   
                    (flat)         b nptypesimple      
                    lkj(1)       cor                   
                    lkj(1)       cor               item
                    lkj(1)       cor                ppt
 student_t(3, 1039.5, 235) Intercept                   
      student_t(3, 0, 235)        sd                   
      student_t(3, 0, 235)        sd               item
      student_t(3, 0, 235)        sd    Intercept  item
      student_t(3, 0, 235)        sd nptypesimple  item
      student_t(3, 0, 235)        sd                ppt
      student_t(3, 0, 235)        sd    Intercept   ppt
      student_t(3, 0, 235)        sd nptypesimple   ppt
      student_t(3, 0, 235)     sigma                   
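A sketch of how these defaults could be replaced with the priors motivated above (the object name fit_brm_priors is an assumption; the exercise script covers the details):

# Refit with the intercept and slope priors from the earlier slides
fit_brm_priors <- brm(rt ~ nptype + (nptype | ppt) + (nptype | item),
                      data = data,
                      prior = c(prior(normal(1000, 150), class = Intercept),
                                prior(normal(0, 100), class = b)))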


Exercise: complete script mixed_effects_model_brms_with_priors.R (this time with priors)

Hypothesis testing

see e.g. Nicenboim and Vasishth (2016) for a tutorial

What results do I report?

  • To test a hypothesis about parameter values (e.g. difference between groups, interactions, change in outcome):
    • Summary of posterior parameter value
    • Region of practical equivalence (ROPE)
    • Savage-Dickey Bayes Factor: for nested models
  • Cross-validation (leave-one-out): comparison of different probability models

Posterior probability distribution

Get posterior of slope \(\beta\) (difference between conditions)

beta <- as_draws_df(fit_brm) %>% pull(b_nptypesimple)
length(beta)
[1] 4000
beta[1:5]
[1] -34.0 -45.2 -33.0 -37.6 -23.7

Posterior probability distribution

Posterior mean

mean(beta)
[1] -33.2

Posterior probability distribution

95% probability interval (conceptually different from confidence interval)

quantile(beta, probs = c(0.025, 0.975))
   2.5%   97.5% 
-66.281   0.749 

89% probability interval (McElreath 2020)

quantile(beta, probs = c(0.055, 0.945))
  5.5%  94.5% 
-60.83  -6.53 

Posterior probability distribution

Probability that slope \(\beta\) is negative: \(P(\hat{\beta}<0)\)

mean(beta < 0)
[1] 0.973

Posterior probability distribution

Probability that slope \(\beta\) is below \(-10\): \(P(\hat{\beta} < -10)\)

mean(beta < -10)
[1] 0.917

See exercises: hypothesis_testing_post.R

ROPE

e.g. Kruschke (2010)

ROPE

mean(beta < 0)
[1] 0.973
mean(beta < -5)
[1] 0.954
mean(beta < -10)
[1] 0.917

There is nothing special about 0.

ROPE

  • Point values don’t represent our beliefs about the null hypothesis: \(H_0 = 0\)
  • Region of practical equivalence: range of values that are equivalent to a null effect (Kruschke 2010, 2011).
  • Define region that is equivalent to no effect: values in interval of [-10, 10] msecs

library(bayestestR)
rope(beta, range = c(-10, 10)) 
# Proportion of samples inside the ROPE [-10.00, 10.00]:

inside ROPE
-----------
6.05 %     

ROPE

  • Point values don’t represent our beliefs about the null hypothesis: \(H_0 = 0\)
  • Region of practical equivalence: range of values that are equivalent to a null effect (Kruschke 2010, 2011).
  • A one-sided region is also possible: values in the interval [-10, Inf]
library(bayestestR)
rope(beta, range = c(-10, Inf))
# Proportion of samples inside the ROPE [-10.00, Inf]:

inside ROPE
-----------
6.05 %     

ROPE

  • Determine a ROPE to assess the uncertainty of the effect size (Makowski, Ben-Shachar, and Lüdecke 2019)
  • Calculate the standardised effect size \(\delta\) from the posterior: \(\delta = \frac{\beta}{\sigma}\)
  • A ROPE of [-0.1, 0.1] is the range of negligible effect sizes (Cohen 1988; Kruschke 2018).

sigma <- as_draws_df(fit_brm) %>% pull(sigma)
delta <- beta / sigma
abs(mean(delta)) # very small effect (see Cohen's d)
[1] 0.142
rope(delta, range = c(-0.1, 0.1)) 
# Proportion of samples inside the ROPE [-0.10, 0.10]:

inside ROPE
-----------
26.34 %    

ROPE

  • The ROPE value indicates the extent to which the posterior cannot rule out a negligible effect.
  • A meaningful effect size should have a small proportion of posterior samples within the ROPE.
rope(delta, range = c(-0.1, 0.1)) 
# Proportion of samples inside the ROPE [-0.10, 0.10]:

inside ROPE
-----------
26.34 %    

Exercise: complete script hypothesis_testing_rope.R

Bayes Factor

e.g. chapter 7.6 in Lee and Wagenmakers (2014)

Bayes Factor

p-values have two possible outcomes:

  1. reject the null (i.e. \(p<0.05\))
  2. inconclusive (i.e. \(p>0.05\) – go back to start)

BFs lie on a continuum that covers three possible conclusions (Dienes 2014, 2016; Dienes and Mclatchie 2018):

  1. Reject the null / alternative hypothesis
  2. Inconclusive (get more data!)
  3. Accept the null / alternative hypothesis

Bayes Factor

\[ \text{BF} = \frac{p(H_1 \mid y)}{p(H_0 \mid y)} \]

  • How much more probable is one model (hypothesis) over the other?
  • How convinced should we be about the evidence for our hypothesis \(H_1\) as opposed to, say, a null hypothesis \(H_0\)?
  • Savage-Dickey density ratio for nested models (Jeffreys 1961; Dickey, Lientz, et al. 1970):
  • Height of the posterior density at zero compared to the height of the prior density at zero.

Bayes Factor: Savage-Dickey density ratio

  • Posterior: \(\mathcal{N}(-33.2, 17.2)\)
  • Prior: \(\mathcal{N}(0, 100)\)
dnorm(0, 0, 100)
[1] 0.00399
dnorm(0, mean(beta), sd(beta))
[1] 0.00361
dnorm(0, 0, 100) / dnorm(0, mean(beta), sd(beta))
[1] 1.11

Bayes Factor: Savage-Dickey density ratio

  • Posterior: \(\mathcal{N}(-33.2, 17.2)\)
  • Prior: \(\mathcal{N}(0, 25)\)
dnorm(0, 0, 25)
[1] 0.016
dnorm(0, mean(beta), sd(beta))
[1] 0.00361
dnorm(0, 0, 25) / dnorm(0, mean(beta), sd(beta))
[1] 4.42

Bayes Factor: Savage-Dickey density ratio

  • Posterior: \(\mathcal{N}(-33.2, 17.2)\)
  • Prior: \(\mathcal{N}(0, 2000)\)
dnorm(0, 0, 2000)
[1] 0.000199
dnorm(0, mean(beta), sd(beta))
[1] 0.00361
dnorm(0, 0, 2000) / dnorm(0, mean(beta), sd(beta))
[1] 0.0553

Bayes Factor: Savage-Dickey density ratio

  • Regularising (skeptical) priors prevent the model from getting overexcited by the sample (his words: McElreath 2020).
  • If in doubt, use weakly informative priors like Student-t distributions that allow for fatter tails, e.g. student_t(3, 0, 100)

Student t-distribution: student_t(df, mean, sd)

  • Symmetric continuous distribution with fat tails, assigning more probability to extreme values (see the sketch below).
  • The \(\nu\) parameter controls the degrees of freedom.
  • \(\nu = 1\) is a Cauchy distribution.
  • As \(\nu \rightarrow \infty\) the t-distribution becomes Gaussian.
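For instance (a sketch using base R’s location-scale relation for the t-distribution), the probability mass beyond ±300 msecs is roughly twenty times larger under student_t(3, 0, 100) than under \(\mathcal{N}(0, 100)\):

# Tail probability beyond +/- 300 msecs
2 * pt(-300 / 100, df = 3) # ~0.058 under student_t(3, 0, 100)
2 * pnorm(-300 / 100)      # ~0.003 under normal(0, 100)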

Exercise

  • Explore hypothesis() in the exercise: hypothesis_testing_bf_1.R
  • Bonus: explore the Savage-Dickey method in hypothesis_testing_bf_2.R

Bayes Factor with hypothesis()

H: there is no difference between conjoined and simple NPs

hypothesis(fit_brm, "nptypesimple = 0")
Hypothesis Tests for class b:
          Hypothesis Estimate Est.Error CI.Lower CI.Upper Evid.Ratio Post.Prob
1 (nptypesimple) = 0    -33.2      17.2    -66.3     0.75         NA        NA
  Star
1     
---
'CI': 90%-CI for one-sided and 95%-CI for two-sided hypotheses.
'*': For one-sided hypotheses, the posterior probability exceeds 95%;
for two-sided hypotheses, the value tested against lies outside the 95%-CI.
Posterior probabilities of point hypotheses assume equal prior probabilities.

H: simple NPs are faster than conjoined NPs

hypothesis(fit_brm, "nptypesimple < 0")
Hypothesis Tests for class b:
          Hypothesis Estimate Est.Error CI.Lower CI.Upper Evid.Ratio Post.Prob
1 (nptypesimple) < 0    -33.2      17.2    -61.4    -5.88         36      0.97
  Star
1    *
---
'CI': 90%-CI for one-sided and 95%-CI for two-sided hypotheses.
'*': For one-sided hypotheses, the posterior probability exceeds 95%;
for two-sided hypotheses, the value tested against lies outside the 95%-CI.
Posterior probabilities of point hypotheses assume equal prior probabilities.

H: simple NPs are more than 10 msecs faster than conjoined NPs

hypothesis(fit_brm, "nptypesimple < -10")
Hypothesis Tests for class b:
                Hypothesis Estimate Est.Error CI.Lower CI.Upper Evid.Ratio
1 (nptypesimple)-(-10) < 0    -23.2      17.2    -51.4     4.12       11.1
  Post.Prob Star
1      0.92     
---
'CI': 90%-CI for one-sided and 95%-CI for two-sided hypotheses.
'*': For one-sided hypotheses, the posterior probability exceeds 95%;
for two-sided hypotheses, the value tested against lies outside the 95%-CI.
Posterior probabilities of point hypotheses assume equal prior probabilities.

Families of distributions

Families of distributions


“Models are devices that connect theories to data. A model is an instantiation of a theory […]” (Rouder, Morey, and Wagenmakers 2016)


  • Our models describe how we understand reality.
  • Models allow us to describe reality qualitatively (with their parameters) and quantitatively (with parameter-value estimates).
  • The interpretation of parameters depends on the data-modelling context.
  • At a minimum, the distribution family (probability model) depends on the data type.

Families of distributions (some important ones)

  • Gaussian: data come from normal distribution
fit <- brm(outcome ~ predictor + (1|participant), data = data, family = gaussian())

Families of distributions (some important ones)

  • Gaussian: data come from normal distribution
  • Bernoulli: binomial outcomes (yes / no; correct / incorrect)
fit <- brm(outcome ~ predictor + (1|participant), data = data, family = bernoulli())

Families of distributions (some important ones)

  • Gaussian: data come from normal distribution
  • Bernoulli: binomial outcomes (yes / no; correct / incorrect)
  • Poisson: count data (number of words / mistakes) (Winter and Bürkner 2021)
fit <- brm(outcome ~ predictor + (1|participant), data = data, family = poisson())

Families of distributions (some important ones)

  • Gaussian: data come from normal distribution
  • Bernoulli: binomial outcomes (yes / no; correct / incorrect)
  • Poisson: count data (number of words / mistakes) (Winter and Bürkner 2021)
  • Zero-inflated Poisson: count data with lots of zeros (number of typos per word)
fit <- brm(outcome ~ predictor + (1|participant), data = data, family = zero_inflated_poisson())

Families of distributions (some important ones)

  • Gaussian: data come from normal distribution
  • Bernoulli: binomial outcomes (yes / no; correct / incorrect)
  • Poisson: count data (number of words / mistakes) (Winter and Bürkner 2021)
  • Zero-inflated Poisson: count data with lots of zeros (number of typos per word)
  • Cumulative: ordinal (ordered categorical) data (Bürkner and Vuorre 2019) (Likert scales)
fit <- brm(outcome ~ predictor + (1|participant), data = data, family = cumulative()) # or acat, sratio

Families of distributions (some important ones)

  • Gaussian: data come from normal distribution
  • Bernoulli: binomial outcomes (yes / no; correct / incorrect)
  • Poisson: count data (number of words / mistakes) (Winter and Bürkner 2021)
  • Zero-inflated Poisson: count data with lots of zeros (number of typos per word)
  • Cumulative: ordinal (ordered categorical) data (Bürkner and Vuorre 2019) (Likert scales)
  • Lognormal: zero-bounded data with positive skew; also skewed / shifted (log)normal, Wiener diffusion models etc. for rt data (Matzke and Wagenmakers 2009)
fit <- brm(outcome ~ predictor + (1|participant), data = data, family = lognormal())

Compare probability models of rt data

see e.g. Matzke and Wagenmakers (2009); Roeser et al. (2024)

Probability models of rt data

So far we used a Gaussian model which showed a poor fit to the data.

Alternative models: skewed normal, shifted (log)normal, Wiener Diffusion models, ex-Gaussian, mixture of lognormal etc. (see Matzke and Wagenmakers 2009; Roeser et al. 2024).

Gaussian

\[y \sim \mathcal{N}(\mu, \sigma^2)\]

  • Data generating process can be described as a normal distribution with mean \(\mu\) and standard deviation \(\sigma\).

Lognormal

Lognormal

\[ \log(y) \sim \mathcal{N}(\mu, \sigma^2) \]

  • Often used to address positive skew, e.g. in response times (Baayen 2008)
  • Log values are only defined for \(y > 0\)
  • De-emphasizes large values relative to small values.
  • Models percentage change rather than absolute differences (see the sketch below):
    • Rather than saying “keystroke intervals are 40 msecs slower”, you say “keystrokes are 7% slower”.
    • A difference of 40 msecs can be huge for fast typing but negligible for pauses around 10 secs.
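As a quick illustration of the multiplicative interpretation (the slope value 0.07 is hypothetical):

# A slope beta on the log scale implies a multiplicative change of exp(beta)
(exp(0.07) - 1) * 100 # ~7.3% slower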

Shifted lognormal

\[ \log(y - \exp(\text{ndt})) \sim \mathcal{N}(\mu, \sigma^2) \]

  • Variation of the lognormal distribution: with a shift of zero it reduces to the lognormal.
  • Includes a horizontal shift.
  • Useful for modelling rt data because rts are typically not just
    • right-skewed
    • non-negative
  • but also
    • positively shifted: there is a minimum time required for any response.

The shift parameter represents the minimum possible rt, accounting for the fact that no response can be instantaneous.
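A minimal simulation sketch (with assumed parameter values) showing that no simulated rt falls below the shift:

# Shifted lognormal: a constant shift ndt plus a lognormal component
n <- 1000
ndt <- 300 # assumed minimum possible rt in msecs
rt <- ndt + rlnorm(n, meanlog = 6, sdlog = 0.5)
min(rt) # always above ndt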

Shifted lognormal (simulated data)

Exercise

Fit two other probability models, namely a lognormal and a shifted lognormal model:

  • Complete script mixed_effects_models_brms_lognormal.R
  • Complete script mixed_effects_models_brms_shifted_lognormal.R

Model comparisons

chapter 11 in Gelman, Hill, and Vehtari (2020); chapter 6 in McElreath (2020)

The Gaussian model showed a poor fit to the data

Model comparisons

  • How accurately does the model predict new data?
  • \(R^2\) and friends: more complex models are generally better (\(R^2\) can’t go down).
  • Overfitting: models with more parameters can make unreasonable predictions.
  • Generalisations to the data that were used to fit the model are overoptimistic (Gelman, Hill, and Vehtari 2020).
  • We will look at both fit and predictive performance of the model.

Compare probability models of rt data

Leave-One-Out (LOO) cross validation

  • Cross-validation: performance of a model on new data to remove the overfitting problem.
  • Leave-One-Out (LOO) Information Criterion (LOO-IC):
    • Train the model on \(N-1\) observations.
    • Predict the remaining data point from the training model.
    • Repeat the process \(N\) times to predict every observation from a model of the remaining data.
  • Adding up the prediction results gives an estimate of the expected log predictive density (\(elpd\)), i.e. an approximation of the results that would be expected for new data.
  • loo() approximates LOO-IC using Pareto-smoothed importance sampling (Vehtari, Gelman, and Gabry 2015, 2017).

Leave-One-Out (LOO) cross validation

  • The approximation in loo() uses the log posterior predictive densities: how likely is each data point given the distribution parameter estimates?

Leave-One-Out (LOO) cross validation

loo(fit_brm)
Computed from 4000 by 1036 log-likelihood matrix.

         Estimate   SE
elpd_loo  -7141.4 40.8
p_loo        41.6  3.6
looic     14282.8 81.6
------
MCSE of elpd_loo is 0.1.
MCSE and ESS estimates assume MCMC draws (r_eff in [0.6, 1.5]).

All Pareto k estimates are good (k < 0.7).
See help('pareto-k-diagnostic') for details.

Leave-One-Out (LOO) cross validation

  • elpd_loo: sum of means (expected log predictive density)
  • p_loo: sum of variances
  • looic: \(-2 \cdot (\text{elpd\_loo} - \text{p\_loo})\) (for the deviance scale)
log_lik(fit_brm) %>% as_tibble() %>% 
  mutate(sample = 1:n()) %>%
  pivot_longer(-sample, names_to = "obs", 
               values_to = "loglik") %>%
  summarise(mean_loglik = mean(exp(loglik)),
            var_loglik = var(loglik),
            .by = obs) %>% 
  summarise(elpd_loo = sum(log(mean_loglik)), 
            p_loo = sum(var_loglik),
            looic = -2 * (elpd_loo - p_loo))
# A tibble: 1 × 3
  elpd_loo p_loo  looic
     <dbl> <dbl>  <dbl>
1   -7100.  41.4 14282.

Leave-One-Out (LOO) cross validation

  • elpd_loo and looic rarely have a direct interpretation (as opposed to loo_R2()): what matters are the differences between models (see the sketch below).
  • p_loo is the effective number of parameters: how flexible is the model fit.
  • Exercise: complete the script model_comparison.R
  • This script assumes that you have stored the posteriors of these models: Gaussian, lognormal, and shifted lognormal.
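A sketch of the comparison itself, assuming the three fitted objects are named fit_gaussian, fit_lognormal and fit_shifted_lognormal (as in the table below):

# Compare expected log predictive densities across the three models
loo_compare(loo(fit_gaussian),
            loo(fit_lognormal),
            loo(fit_shifted_lognormal))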

Model comparison

comparison                              elpd_diff   SE elpd_ratio
fit_gaussian vs fit_lognormal               -77.3 22.2        3.5
fit_gaussian vs fit_shifted_lognormal       -75.6 23.2        3.3
fit_lognormal vs fit_shifted_lognormal        1.7  1.2        1.4

Note: elpd_diff is often reported as e.g. \(\Delta\widehat{elpd}\).

The end

Summary

  • brms isn’t much more complicated than lme4 and comes with more flexibility.
  • Priors require some thought.
  • Bayesian models allow a straightforward interpretation of parameter values without an arbitrary threshold for significance.

Recommended reading

References

Annis, Jeffrey, Brent J Miller, and Thomas J Palmeri. 2017. “Bayesian Inference with Stan: A Tutorial on Adding Custom Distributions.” Behavior Research Methods 49: 863–86.

Baayen, R. Harald. 2008. Analyzing Linguistic Data. A Practical Introduction to Statistics Using R. Cambridge: Cambridge University Press.

Baayen, R. Harald, Douglas J. Davidson, and Douglas M. Bates. 2008. “Mixed-Effects Modeling with Crossed Random Effects for Subjects and Items.” Journal of Memory and Language 59 (4): 390–412.

Barr, Dale J, Roger Levy, Christoph Scheepers, and Harry J Tily. 2013. “Random Effects Structure for Confirmatory Hypothesis Testing: Keep It Maximal.” Journal of Memory and Language 68 (3): 255–78.

Bates, Douglas, Reinhold Kliegl, Shravan Vasishth, and Harald Baayen. 2015. “Parsimonious Mixed Models.” arXiv Preprint arXiv:1506.04967.

Bürkner, Paul-Christian. 2017. “Brms: An R Package for Bayesian Multilevel Models Using Stan.” Journal of Statistical Software 80: 1–28. https://doi.org/10.18637/jss.v080.i01.

———. 2018. “Advanced Bayesian Multilevel Modeling with the R Package brms.” The R Journal 10 (1): 395–411. https://doi.org/10.32614/RJ-2018-017.

———. 2019. “Bayesian Item Response Modeling in R with Brms and Stan.” arXiv Preprint arXiv:1905.09501.

Bürkner, Paul-Christian, and Matti Vuorre. 2019. “Ordinal Regression Models in Psychology: A Tutorial.” Advances in Methods and Practices in Psychological Science 2 (1): 77–101.

Carpenter, Bob, Andrew Gelman, Matthew D Hoffman, Daniel Lee, Ben Goodrich, Michael Betancourt, Marcus Brubaker, Jiqiang Guo, Peter Li, and Allen Riddell. 2017. “Stan: A Probabilistic Programming Language.” Journal of Statistical Software 76: 1–32.

Clayton, Aubrey. 2021. Bernoulli’s Fallacy: Statistical Illogic and the Crisis of Modern Science. Columbia University Press.

Cohen, Jacob. 1988. Statistical Power Analysis for the Behavioral Sciences. 2nd ed. Hillsdale, NJ: Lawrence Erlbaum Associates.

Colquhoun, David. 2017. “The Reproducibility of Research and the Misinterpretation of p-Values.” Royal Society Open Science 4 (12): 171085.

Dickey, James M., B. P. Lientz, et al. 1970. “The Weighted Likelihood Ratio, Sharp Hypotheses about Chances, the Order of a Markov Chain.” The Annals of Mathematical Statistics 41 (1): 214–26.

Dienes, Zoltan. 2014. “Using Bayes to Get the Most Out of Non-Significant Results.” Frontiers in Psychology 5 (781): 1–17.

———. 2016. “How Bayes Factors Change Scientific Practice.” Journal of Mathematical Psychology 72: 78–89.

Dienes, Zoltan, and Neil Mclatchie. 2018. “Four Reasons to Prefer Bayesian Analyses over Significance Testing.” Psychonomic Bulletin & Review 25 (1): 207–18.

Eager, Christopher D, and Joseph Roy. 2017. “Mixed Models Are Sometimes Terrible.” Linguistic Society of America, Austin, TX.

Gelman, Andrew, Jennifer Hill, and Aki Vehtari. 2020. Regression and Other Stories. Cambridge University Press.

Gelman, Andrew, and Donald B. Rubin. 1992. “Inference from Iterative Simulation Using Multiple Sequences.” Statistical Science 7 (4): 457–72.

Greenland, Sander, Stephen J. Senn, Kenneth J. Rothman, John B. Carlin, Charles Poole, Steven N. Goodman, and Douglas G. Altman. 2016. “Statistical Tests, p Values, Confidence Intervals, and Power: A Guide to Misinterpretations.” European Journal of Epidemiology 31 (4): 337–50.

Hoekstra, Rink, Richard D. Morey, Jeffrey N. Rouder, and Eric-Jan Wagenmakers. 2014. “Robust Misinterpretation of Confidence Intervals.” Psychonomic Bulletin & Review 21 (5): 1157–64.

Jeffreys, Harold. 1961. The Theory of Probability. 3rd ed. Oxford: Oxford University Press, Clarendon Press.

Kimball, A, Kailen Shantz, C Eager, and Joseph Roy. 2016. “Beyond Maximal Random Effects for Logistic Regression: Moving Past Convergence Problems.” ArXiv e-Prints.

Kruschke, John K. 2010. “What to Believe: Bayesian Methods for Data Analysis.” Trends in Cognitive Sciences 14 (7): 293–300.

———. 2011. “Bayesian Assessment of Null Values via Parameter Estimation and Model Comparison.” Perspectives on Psychological Science 6 (3): 299–312.

———. 2014. Doing Bayesian Data Analysis: A Tutorial with R, JAGS, and Stan. 2nd ed. London: Academic Press.

———. 2018. “Rejecting or Accepting Parameter Values in Bayesian Estimation.” Advances in Methods and Practices in Psychological Science 1 (2): 270–80.

Lambert, Ben. 2018. A Student’s Guide to Bayesian Statistics. Sage.

Lee, Michael D., and Eric-Jan Wagenmakers. 2014. Bayesian Cognitive Modeling: A Practical Course. Cambridge University Press.

Makowski, Dominique, Mattan S Ben-Shachar, and Daniel Lüdecke. 2019. “bayestestR: Describing Effects and Their Uncertainty, Existence and Significance Within the Bayesian Framework.” Journal of Open Source Software 4 (40): 1541.

Martin, Randi C, Jason E Crowther, Meredith Knight, Franklin P Tamborello II, and Chin-Lung Yang. 2010. “Planning in Sentence Production: Evidence for the Phrase as a Default Planning Scope.” Cognition 116 (2): 177–92.

Matzke, Dora, and Eric-Jan Wagenmakers. 2009. “Psychological Interpretation of the Ex-Gaussian and Shifted Wald Parameters: A Diffusion Model Analysis.” Psychonomic Bulletin & Review 16 (5): 798–817.

McElreath, Richard. 2020. Statistical Rethinking: A Bayesian Course with Examples in R and Stan. 2nd ed. Boca Raton, FL: CRC Press.

Morey, Richard D., Rink Hoekstra, Jeffrey N Rouder, Michael D. Lee, and Eric-Jan Wagenmakers. 2016. “The Fallacy of Placing Confidence in Confidence Intervals.” Psychonomic Bulletin & Review 23 (1): 103–23.

Nicenboim, Bruno, and Shravan Vasishth. 2016. “Statistical Methods for Linguistic Research: Foundational Ideas – Part II.” Language and Linguistics Compass 10 (11): 591–613.

Roeser, J., M. Torrance, M. Andrews, and T. Baguley. 2024. “No Default Syntactic Scope for Advance Planning in Sentence Production: Evidence from Finite Mixture Models,” December. https://doi.org/10.31219/osf.io/u7v36.

Rouder, Jeffrey N., Richard D. Morey, and Eric-Jan Wagenmakers. 2016. “The Interplay Between Subjectivity, Statistical Practice, and Psychological Science.” Collabra 2 (1): 1–12.

Sorensen, Tanner, S. Hohenstein, and Shravan Vasishth. 2016. “Bayesian Linear Mixed Models Using Stan: A Tutorial for Psychologists, Linguists, and Cognitive Scientists.” Quantitative Methods for Psychology 12 (3): 175–200.

Vasishth, Shravan, and Bruno Nicenboim. 2016. “Statistical Methods for Linguistic Research: Foundational Ideas – Part I.” arXiv Preprint arXiv:1601.01126.

Vehtari, Aki, Andrew Gelman, and Jonah Gabry. 2015. “Pareto Smoothed Importance Sampling.” arXiv Preprint arXiv:1507.02646.

———. 2017. “Practical Bayesian Model Evaluation Using Leave-One-Out Cross-Validation and WAIC.” Statistics and Computing 27 (5): 1413–32.

Winter, Bodo, and Paul-Christian Bürkner. 2021. “Poisson Regression for Linguists: A Tutorial Introduction to Modelling Count Data with Brms.” Language and Linguistics Compass 15 (11): e12439.