Tutorial: Fitting log-Gaussian mixed-effects mixture models for keystroke data

This is a step-by-step tutorial for fitting mixed-effects mixture models with two log-Gaussian distributions (MacLahlan & Peel, 2000) using R (R Core Team, 2016) and the rstan package to interface with the probabilistic programming language Stan (Carpenter et al., 2016; Hoffman & Gelman, 2014; Stan Development Team, 2015b, 2015a). We will compare the mixture-model analysis to standard Gaussian and log-Gaussian mixed-effect models fitted using brms (Bürkner, 2017, 2018).¹

In this tutorial we show how to fit a mixture model written in Stan to keystroke data (similar to Roeser et al., 2021) and how to compare this to model mainstream mixed-effects models using leave-one-out cross-validation. Then we demonstrate how the posterior of this model can be used to calculate differences between transition locations in the text. This tutorial is largely self-contained. To run the code below, the reader only needs to install the required packages. Data and the Stan programme can be loaded using the code below.

Three packages need to be installed before (see footnotes for instructions): (1) rstan to interface with Stan²; (2) tidyverse for general data wrangling and visualisation.³; (3) brms to for Bayesian models without directly working in Stan.⁴

library(rstan)
library(tidyverse)
library(brms)

1 Preparing the data

First we load a sample of keystroke data. Data are loaded from our OSF repository using the URL as below. For simplicity we use a random subset of data from participants who completed an essay-writing task in the context of the ProWrite project (e.g. Dux Speltz et al., 2022).

# Load data
data <- read_csv("https://osf.io/tjms9/download")

We reduced the data to 10 participants and three variables: a participant identifier participant, the interkey intervals iki, and the transition location location (levels: within word, before word, before sentence).

data

# A tibble: 600 × 3
   participant   iki location       
         <dbl> <dbl> <chr>          
 1         116  2881 before-sentence
 2         116   465 before-word    
 3         116   144 before-sentence
 4         116   121 before-sentence
 5         116   129 before-word    
 6         116    81 before-sentence
 7         116   165 within-word    
 8         116   114 before-word    
 9         116    63 before-sentence
10         116  1072 before-sentence
# ℹ 590 more rows

We reduced the number of interkey intervals to 20 observations per participant and per transition location by randomly sampling interkey intervals. See for example:

count(data, participant, location)

# A tibble: 30 × 3
   participant location            n
         <dbl> <chr>           <int>
 1          76 before-sentence    20
 2          76 before-word        20
 3          76 within-word        20
 4         101 before-sentence    20
 5         101 before-word        20
 6         101 within-word        20
 7         116 before-sentence    20
 8         116 before-word        20
 9         116 within-word        20
10         130 before-sentence    20
# ℹ 20 more rows

2 Mixed-effects models

We start with standard single distribution mixed-effects models (as in e.g. Baayen et al., 2008; Pinheiro et al., 2007), that the data come from a single underlying process that, the first model, can be described as Gaussian and log-Gaussian for the other single distribution model. We compare these models below.

2.1 Gaussian mixed-effects model

2.1.1 Model

We start with a Gaussian mixed-effects model. The process that generates the $i^\text{th}$ interkey interval (iki below), where $i \in 1\dots N$ and $N$ is the total number of keystroke, is a single Gaussian (normal) distribution $\mathcal{N}()$ which is characterised by a mean $\mu$ and a standard deviation $\sigma_\text{e}^2$. This model can be expressed mathematically using the following equation:

\[ \begin{align} \text{iki}_i \sim & \mathcal{N}(\mu_i, \sigma_\text{e}^2)\\ \mu_i = & \beta_\text{location[i]} + u_\text{participant[i]}\\ u_\text{participant[i]} \sim & \mathcal{N}(0, \sigma_\text{p}^2) \end{align} \] The mean can be decomposed into $\beta$ and $u_\text{participant[i]}$. $\beta$ was allowed to take on a different value for each transition location associated with the $i^\text{th}$ interkey interval. By-participant interkey intervals are assumed to vary around the average which is achieved by placing a normal distribution on the difference between average and by-participant interkey intervals, indicated as $u_\text{participant[i]}$, with a mean of 0 and a standard deviation $\sigma_\text{p}^2$.

This model can be described in R using the following formula wrapped in the brms formula function bf. The family argument can be used to specify distributions where gaussian is the default can could be omitted below.

formula <- bf(iki ~ 0 + location + (1|participant), family = gaussian())

The syntax used in brms is generally similar to lme4 syntax that many readers will be familiar with. Like in lm, glm, glmer, lmer and other modelling functions 0 + can be used to drop the intercept parameter which means that the model is parametrised to return cell means and standard errors for each level of location.

2.1.2 Priors

As we are modelling data in the Bayesian framework, we need to define priors before model fit. brms is using sensible regulating default priors for most model parameters which can be inspected using get_prior which needs the model formula defined above and the data as input.

get_prior(formula = formula, data = data)

                  prior class                    coef       group resp dpar nlpar lb ub       source
                 (flat)     b                                                                default
                 (flat)     b locationbeforeMsentence                                   (vectorized)
                 (flat)     b     locationbeforeMword                                   (vectorized)
                 (flat)     b     locationwithinMword                                   (vectorized)
 student_t(3, 0, 187.5)    sd                                                      0         default
 student_t(3, 0, 187.5)    sd                         participant                  0    (vectorized)
 student_t(3, 0, 187.5)    sd               Intercept participant                  0    (vectorized)
 student_t(3, 0, 187.5) sigma                                                      0         default

We want to make sure that (flat) is replaced by more (i.e. weakly or regulating) informative priors (see Kruschke, 2010a; Lambert, 2018; McElreath, 2016). For example, we might consider a Student-t distribution as prior over the slope parameters (class b) which in our model are parameterised as the cell means of the three transition locations (levels: within word, before word, before sentence) with a mean of 350 msecs and a standard deviation of 75 msecs.

prior <- set_prior('student_t(3, 350, 75)', class = 'b')

Student-t distributions, in comparison to normal distributions, have a degrees of freedom parameter (first argument) in addition to the mean and the standard deviation (second and third parameter, respectively). The degrees of freedom control the thickness of the tails of the distribution with smaller values assigning more probability mass to the tails; the shape of the Student-t distribution is moving closer to a normal distribution when the degrees of freedom approach $\infty$. This is illustrated in Figure 2.1 for the mean and the standard deviation of the prior above.

Figure 2.1: Student-t prior with varying degrees of freedom.

These priors are weakly informative. The intention is to use general knowledge that we have about how long interkey intervals usually are to point the model towards the right area in the parameter space by assigning low probabilities to unlikely or even implausible values. Even if we know very little about the duration of interkey intervals, we do know that they shouldn’t be longer than a day, or an hour, or even several minutes, because such long intervals are too extreme to reflect active writing.

2.1.3 Sampling parameters

For the model to converge we need to run a sufficient number of iterations. 10,000 iterations, as below, are a lot but does not guarantee convergence. To be able to test whether converged has reached, we need to run the model several time (i.e. different chains). This can be done at the same time (in parallel) when using more than one core of your computer (three cores below).⁵ There is no need to use more cores than chains; using less cores would mean that at least one process has to wait until after another processes has been completed. When the model has settled on a distribution of parameter values, we should observe, after the model has finished sampling, that all chains mixed nicely. Running long chains (many iterations) is useful to get more accurate parameter estimates but will take more time.

iterations <- 10000      # No. of iterations
warmup <- iterations / 2 # Warmup samples to be discarded
nchains <- ncores <- 3   # No. of chains / cores used (one chain per core)

2.1.4 Fitting the model

The following code combines the model, the data, and the prior and returns the posterior. This will take a few of minutes to complete sampling from the posterior after brms has compiled the model code.

fit_gaus <- brm(formula = formula, 
                 data = data,
                 prior = prior,
                 chains = nchains,
                 cores = ncores,
                 iter = iterations,
                 warmup = warmup,
                 sample_prior = TRUE,
                 inits = 0,
                 seed = 365)

Because Bayesian models tend to take more time to run, it makes sense to save the results using saveRDS for re-use. The function requires the name of the fitted model fit_gaus. The model is stored as compressed “.rda” file.

Before you save the posterior, create a new directory called “stan” where we will save all Stan outputs in this tutorial.

# Create an empty directory called "stan"
dir.create("stan")

saveRDS(object = fit_gaus,
        file = "stan/gaussian.rda", 
        compress = "xz")

To read the fitted model back into R’s working environment use readRDS.

fit_gaus <- readRDS("stan/gaussian.rda")

The function fixef provide a summary of the posterior showing parameter value estimates as average and 95% probability interval by transition location in msecs.

fixef(fit_gaus) %>% round()

                        Estimate Est.Error Q2.5 Q97.5
locationbeforeMsentence     2840       427 1977  3657
locationbeforeMword          387       134  149   697
locationwithinMword          343       130   67   600

2.2 Log-Gaussian mixed-effects model

2.2.1 Model

This model is largely identical to the Gaussian mixed-effect model in the previous section, except instead of a normal distribution this model assumes the interkey intervals come from a single underlying process that follows a log-normal distribution indicated as $\text{log}\mathcal{N}()$.

\[ \begin{align} \text{iki}_i \sim & \text{log}\mathcal{N}(\mu_i, \sigma_\text{e}^2)\\ \mu_i = & \beta_\text{location[i]} + u_\text{participant[i]}\\ u_\text{participant[i]} \sim & \mathcal{N}(0, \sigma_\text{p}^2) \end{align} \]

Assuming a log-normal distribution is useful as interkey intervals and other response times are zero-bound (interkey intervals cannot be shorter than 0). A log-normal distribution forces responses to be positive. Also the log scale is not linear, meaning the distance between 100 msecs and 110 msecs is not the same as the distance between 500 and 510 msecs; instead the distance between adjacent values is reduced for larger values while the distances between adjacent values on the lower end is increased. For example, for a logarithm with the base 10, the distance between 10 msecs and 100 msecs is identical to the distance between 100 msecs and 1000 msecs.

For the model formula we need to change the distribution family from gaussian (default) to lognormal.

formula <- bf(iki ~ 0 + location + (1|participant), family = lognormal())

2.2.2 Priors

The default priors and the priors that we need to specify are now no longer on the msecs scale but on the log scale.

get_prior(formula = formula, data = data)

                prior class                    coef       group resp dpar nlpar lb ub       source
               (flat)     b                                                                default
               (flat)     b locationbeforeMsentence                                   (vectorized)
               (flat)     b     locationbeforeMword                                   (vectorized)
               (flat)     b     locationwithinMword                                   (vectorized)
 student_t(3, 0, 2.5)    sd                                                      0         default
 student_t(3, 0, 2.5)    sd                         participant                  0    (vectorized)
 student_t(3, 0, 2.5)    sd               Intercept participant                  0    (vectorized)
 student_t(3, 0, 2.5) sigma                                                      0         default

For example, a mean of 5 on the log scale are exp(5.85) $\approx$ 347 msecs on a linear scale. This prior with varying degrees of freedom is visualised in Figure 2.2.

prior <- set_prior('student_t(3, 5.85, .5)', class = 'b')

Here is again an example for different degrees of freedom for the prior above.

Figure 2.2: Student-t prior with varying degrees of freedom on log scale.

2.2.3 Fitting the model

We fit the model as before with the same sampling parameters.

fit_lgaus <- brm(formula = formula, 
                 data = data,
                 prior = prior,
                 chains = nchains,
                 cores = ncores,
                 iter = iterations,
                 warmup = warmup,
                 sample_prior = TRUE,
                 inits = 0,
                 seed = 365)

Save the results using saveRDS.

saveRDS(object = fit_lgaus,
        file = "stan/loggaussian.rda", 
        compress = "xz")

The fixef function provides a summary of the inferred posterior estimates as average and 95% probability interval by transition location in log msecs.

fixef(fit_lgaus) %>% round(1)

                        Estimate Est.Error Q2.5 Q97.5
locationbeforeMsentence      6.5       0.1  6.3   6.7
locationbeforeMword          5.8       0.1  5.6   6.0
locationwithinMword          5.1       0.1  4.9   5.3

These can be converted into msecs using the exponential function exp.

fixef(fit_lgaus)[,-2] %>% 
  exp() %>% round(0)

                        Estimate Q2.5 Q97.5
locationbeforeMsentence      691  557   845
locationbeforeMword          330  266   402
locationwithinMword          162  132   197

The posterior estimates obtained from the log-Gaussian model are radically different from the posterior estimates of the Gaussian model. This difference is likely due to the missing lower bound at zero in the Gaussian model as we as its inability to de-emphasize values in the right tail.

fixef(fit_gaus)[,-2] %>% 
  round(0)

                        Estimate Q2.5 Q97.5
locationbeforeMsentence     2840 1977  3657
locationbeforeMword          387  149   697
locationwithinMword          343   67   600

3 Model comparison I

To evaluate the out-of-sample predictive performance of the models we can use leave-one-out cross-validation via Pareto smoothed importance-sampling (Vehtari et al., 2015, 2017). Cross-validation is an alternative to more conventional model-comparison metrics such as $R^2$ as it penalises models with more parameters to prevent overfitting (see Lambert, 2018; McElreath, 2016). Overfitting happens when a model captures nuances of a sample, rather than those patterns that one would expect to observe in new samples.

In contrast, cross-validation is measuring the performance of a model, meaning its ability to generalise to new data. The leave-one-out information criterion (LOO-IC) is determined as follows: train a model on $N-1$ observations, predict the remaining data points from the training model, repeat process $N$ times to predict every observation from a model of th remaining data. Adding up the prediction results gives an estimate of the expected log-predictive density (elpd_loo; $\widehat{elpd}$), i.e. an approximation of the results that would be expected for new data.

The function loo can be used to access the expected log-predictive density which uses probability calculations to approximate the LOO-IC via Pareto smoothed importance sampling (Vehtari et al., 2015, 2017). LOO can be approximated with the log posterior predictive densities as a measure of how likely a data point is, given the parameter estimates. This is shown in Figure 3.1 for each data point in the sample. Extremely small values indicate that an observation is very unlike to repeat in a new sample under the given model and the estimated posterior parameter values.

log_lik(fit_gaus) %>% 
  as_tibble() %>% 
  pivot_longer(cols = everything(),
               names_to = "obs", 
               values_to = "loglik") %>%
  summarise(mean = mean(loglik), var = var(loglik), .by = obs) %>%
  filter(mean > -10) %>% # there are some terrible ones for this model 
  mutate(obs = as.numeric(str_remove(obs, "V"))) %>%
  ggplot(aes(x = obs, y = mean, ymin = mean - var, ymax = mean + var)) +
  geom_pointrange(fatten = .5) +
  labs(x = "observations", 
       y = "log-posterior predictive density")

Figure 3.1: Log-posterior densities of each data point (errorbars show variance) of the Gaussian model.

If we sum over the means shown in Figure 3.1, we obtain $\widehat{elpd}$, the expected log predictive density, indicated as elpd_loo. $p_\text{loo}$ (p_loo) is the sum of the variances and indicates the flexibility of the model fit (also used as the effective number of parameters). looic, or LOO-IC, corresponds to $-2 \times ($elpd_loo$-$p_loo$)$ (for deviance scale). To illustrate these, we can calculate them manually

log_lik(fit_gaus) %>% 
  as_tibble() %>% 
  pivot_longer(cols = everything(),
               names_to = "obs", 
               values_to = "loglik") %>%
  summarise(mean = mean(loglik), var = var(loglik), .by = obs) %>% 
  summarise(elpd_loo = sum(mean), # sum of mean log lik
            p_loo = sum(var), # effective number of parameters
            looic = -2 * (elpd_loo - p_loo)) %>% # Bayesian deviance
  pivot_longer(everything()) %>% 
  mutate(across(value, round, 1))

# A tibble: 3 × 2
  name       value
  <chr>      <dbl>
1 elpd_loo -6003. 
2 p_loo       80.8
3 looic    12168.

or directly using loo (a small difference isn’t unexpected).

loo(fit_gaus)


Computed from 15000 by 600 log-likelihood matrix.

         Estimate    SE
elpd_loo  -6037.9 175.2
p_loo        62.4  47.6
looic     12075.8 350.3
------
MCSE of elpd_loo is NA.
MCSE and ESS estimates assume MCMC draws (r_eff in [0.5, 1.2]).

Pareto k diagnostic values:
                         Count Pct.    Min. ESS
(-Inf, 0.7]   (good)     597   99.5%   4818    
   (0.7, 1]   (bad)        1    0.2%   <NA>    
   (1, Inf)   (very bad)   2    0.3%   <NA>    
See help('pareto-k-diagnostic') for details.

The Pareto k diagnostic indicates that there are some observations that are very unusual under the single distribution Gaussian model and the obtained parameter estimates.

The difference in elpd_loo for the two single distribution models is returned by loo and is shown in the elpd_diff column ($\Delta\widehat{elpd}$) with standard error shown in the se_diff column.

loo_fit <- loo(fit_gaus, fit_lgaus)
loo_fit$diffs %>% 
  as.data.frame() %>% 
  round(1)

          elpd_diff se_diff elpd_loo se_elpd_loo p_loo se_p_loo   looic se_looic
fit_lgaus       0.0       0  -4444.2        51.2  10.9      1.2  8888.4    102.3
fit_gaus    -1593.7     160  -6037.9       175.2  62.4     47.6 12075.8    350.3

The model comparison shows that the model fit_lgaus (the log-Gaussian model) is approximately 10 standard errors better than fit_gaus. This can be calculated by abs(loo_fit$diffs[2,1] / loo_fit$diffs[2,2]) by normalising the difference over its standard error $\mid\frac{\Delta\widehat{elpd}}{\text{SE}}\mid$, which returns the difference between models expressed in standard errors (Sivula et al., 2020). A difference of 10 standard errors is substantial. By comparison, a t-value more extreme than $\mid2\mid$ corresponds to an effect that is significant at an $\alpha$-level of 0.05 (i.e. $p<0.05$) (e.g. Baayen, 2008).

Aside, unlike $R^2$, elpd_loo and looic have no direct interpretation. This is because the size of elpd_loo increases as a function of the number of observations. Generally elpd_loo and looic can be understood as a quantification of how many possible futures the model can rule out. However, what is of importance is the difference between models.

If a measure of variance explained is desired, loo_R2 provides a Bayesian $R^2$. This quantity is not ideal for model comparisons because, similar to the frequentist $R^2$, it is subject to overfitting: $R^2$ will inevitabily increase when more parameters are added to the model. A large $R^2$ means that a model captures the data well but might mean that the model makes predictions that do not generalise out-of-sample (see chapter 11 in Gelman et al., 2020; also chapter 6 in McElreath, 2016). The is an example for Bayesian $R^2$ for the single-distribution log-Gaussian model.

loo_R2(fit_lgaus)

     Estimate  Est.Error        Q2.5      Q97.5
R2 0.01960149 0.00890381 0.004511295 0.03963095

4 Two-distributions log-Gaussian mixed-effects mixture model

The mixture model captures the writing process as a combination of three parameters: the average typing speed or fluent interkey intervals ($\beta$), the slowdown for hesitant interkey intervals ($\delta$) and the probability of hesitant interkey intervals ($\theta$). In the model below, the latter two are estimated separately for each level of the categorical prediction location (i.e. transition location) for the $i^\text{th}$ observation.

\[ \begin{align} \text{iki}_{i} \sim & \theta_\text{location[i],participant[i]} \times \text{log}\mathcal{N}(\beta + \delta_\text{location[i]} + u_\text{participant[i]}, \sigma_{e'_\text{location[i]}}^2) + \\ & (1 - \theta_\text{location[i],participant[i]}) \times \text{log}\mathcal{N}(\beta+ u_\text{participant[i]}, \sigma_{e_\text{location[i]}}^2)\\ &\text{where}\\ &\delta \sim \mathcal{N}(0,1)\\ &\text{constraint: } \delta > 0 \end{align} \]

This model takes into account two sources of random error: (1) each participant has an individual typing speed modelled as deviations from the average $u_\text{participant} \sim \mathcal{N}(0, \sigma^2_p)$ as above; (2) each participant has in individual hesitation probability $\theta_\text{participant}$ that differs across levels of the categorical predictor location.

The parameters of this equation are visualised in Figure 4.1. This illustration shows the dissociation of two underlying processes that the model is estimating on the basis of the data. There are two distributions of which the left one in grey is the distribution of fluent interkey intervals; the right distribution in green is wider and captures longer interkey intervals (with some overlap between fluent and disfluent interkey intervals).

Figure 4.1: What’s a mixture model for interkey intervals? The parameters of the mixture model explained visually.

Importantly Figure 4.1 illustrates the average typing speed of fluent interkey intervals captured by the $\beta$ parameter; the distance between fluent transitions and disfluent transitions captured by $\delta$ (i.e. the distance between the grey and the green distribution); the hesitation probability or the “pausing frequency” which is the height of the green distribution captured by $\theta$.

Most interkey intervals are, typically, fluent and hence part of the grey distribution. The frequency of the minority of disfluent interkey intervals is captured by the height of the green distribution. As this model assumes a finite mixture of two distributions, the probability of fluent interkey intervals is the inverse $1-\theta$; in other words, a pausing probability of $\theta=.4$ means that the probability of fluent interkey intervals is $.6$. If the pausing probability decreases to $.3$ the proportion of fluent interkey intervals had increased to $.7$.

4.1 Model input values

The data must be transformed into a list to feed them into Stan. The information required from the data is determined in the Stan code and has to do with what we need to estimate the model parameters and how the model is implemented. The exert from the Stan code below shows the information that is expected as input, which data type is expected (e.g. int means integer), their names, and what the smallest (i.e. lower) and largest possible value (i.e. upper) are. The full Stan code can be found at the end of this tutorial.

// Do not run
// Data chunk of Stan code
data {
    int<lower=1> N;               // Number of observations
  int<lower=1> nS;              // Number of subjects
  int<lower=1, upper=nS> ppt[N];// Participant identifiers
  int<lower=1> K;               // Number of conditions
  int<lower=1> condition[N];    // Condition identifiers
  vector[N] y;                  // Outcome: interkey intervals
}

Participants and conditions are expected to represent indices in the Stan code. In the code below we use factor in combination with as.numeric to ensure that there are no empty indices for participants and conditions. We create vectors with participant identifiers ppts (numbers $1$ through $I$ where $I$ is the number of participants) and a numeric identifier condition for each level of the transition location (levels: before sentence = 1, before word = 2, within word = 3). The returned values are indices for the parameters in the model: For example, for the location identifier, delta[1] is the population estimate for the slowdown for hesitant interkey intervals at before sentence locations, delta[2] for before word locations; theta[1] and theta[2] are the hesitation probability (on the logit scale) for before-sentence and before-word transitions, respectively.

ppts <- pull(data, participant) %>% factor() %>% as.numeric()
condition <- pull(data, location) %>% factor() %>% as.numeric()

In the below code, the names on the left side of the arrow must correspond to the names expected in the Stan code (above); names on the right side do not. We assigned the participant and location identifiers created above and their maximum values nS and K, respectively. The interkey interval data iki were assigned to y and the total number of observations (number of rows in the data nrow(data)) was assigned to N.

data_list <- within( list(), {
  ppt <- ppts  
  nS <- max(ppts) # max no of ppts
  condition <- condition
  K <- max(condition) # no of conditions
  y <- data$iki
  N <- nrow(data)
} )

The information in the data list can be viewed using the glimpse function.

glimpse(data_list)

List of 6
 $ N        : int 600
 $ y        : num [1:600] 2881 465 144 121 129 ...
 $ K        : num 3
 $ condition: num [1:600] 1 2 1 1 2 1 3 2 1 1 ...
 $ nS       : num 10
 $ ppt      : num [1:600] 3 3 3 3 3 3 3 3 3 3 ...

4.2 Loading the model

The Stan code can be downloaded from our OSF repository using readLines and saved locally using writeLines as “molg.stan” (mixture of log-Gaussians) in the directory “stan”.

# Download Stan code
file <- readLines("https://osf.io/r7aqx/download")
# Save Stan code in "stan" directory
writeLines(file, "stan/molg.stan")

The Stan model can be loaded and translated into an S4 class object using the stan_model function and assigned to molg. The “molg.stan” file needs to be available in the directory “stan” located inside the current working directory.

molg <- stan_model(file = "stan/molg.stan")

This code can fit the posterior of a categorical predictor with any number of levels. In other words, the posterior that the model returns can be used to calculate simple differences, main effects and interactions of factorial designs. We demonstrate below how to calculate a simple difference. The same logic can be used to calculate main effects and interactions. For more predictors, or removing or adding, e.g., random error terms, or adjusting priors, the Stan code can be opened and changed in R or any text editor. The Stan code used is largely based on Vasishth et al. (2017). Sorensen et al. (2016) presents a detailed tutorial on how to write models in Stan (see also Lambert, 2018).

This model estimates a number of parameters of which not all are interesting for inference. To reduce the physical size of the model object that contains the posterior we can determine which parameters we want to save.

# Parameters to keep in output
pars <- c("beta",                        # fluent typing
          "delta",                       # disfluency slowdown
          "theta",                       # mixing proportion on logit scale
          "prob",                        # mixing proportion as probability
          "prob_s",                      # by-participant disfluency probability
          "sigma",                       # variance component
          "sigmap_e", "sigma_e",         # variance by mixture component
          "sigma_diff",                  # difference between variances
          "sigma_u",                     # variance for random ppt 
          "log_lik",                     # log likelihood (for model comparison) 
          "y_tilde")                     # predicted data

The sampling function applies the model to the data using the information introduced before (iterations, warmup, chains, cores, parameters pars to be saved). save_warmup is set to FALSE to discard the warmup samples (which are not used for inference anyway) to reduce the size of the model object. To allow reproducibility (and because Bayesian models involves random number generation) we set the seed using an arbitrary number. Lastly, the control argument was specified with higher values for adapt_delta and max_treedepth: using higher values here means the model runs slower but means more careful parameter estimation. The model is assigned to fit_molg.

# Fit model
fit_molg <- sampling(molg, 
                     data = data_list,
                     iter = iterations,
                     warmup = warmup,
                     chains = nchains, 
                     cores = ncores,
                     pars = pars, # Parameters to keep.
                     save_warmup = FALSE, # Don't save the warmup samples.
                     seed = 365, 
                     # Improve sampling efficiency
                     control = list(adapt_delta = 0.96,  # default: 0.9
                                    max_treedepth = 14)) # default: 10

As above we save the results using saveRDS.

saveRDS(object = fit_molg,
        file = "stan/molg.rda", 
        compress = "xz")

Running this model will take a moment to complete sampling depending on your hardware specifications. The time it took my laptop to complete this job can be viewed using get_elapsed_time.

get_elapsed_time(fit_molg) / 60 # in mins

           warmup    sample
chain:1 0.8758833 0.7950167
chain:2 0.9136667 0.5474167
chain:3 0.8951833 0.8892333

Inferred posterior estimates shown as averages and 95% probability intervals can be extracted using the print function for each transition location and the three parameter values of interest. prob is the $\theta$ parameter on the probability scale rather than on the logit scale. $\beta$ and $\delta$ are on the log scale.

print(fit_molg, 
      pars = c("beta", "delta", "prob", "theta"), 
      probs = c(.025, .975))

Inference for Stan model: anon_model.
3 chains, each with iter=10000; warmup=5000; thin=1; 
post-warmup draws per chain=5000, total post-warmup draws=15000.

          mean se_mean   sd  2.5% 97.5% n_eff Rhat
beta      5.08    0.00 0.08  4.92  5.24  3668    1
delta[1]  1.97    0.00 0.15  1.69  2.27  8978    1
delta[2]  1.24    0.00 0.15  0.95  1.54 11735    1
delta[3]  0.67    0.00 0.29  0.16  1.28 11620    1
prob[1]   0.73    0.00 0.08  0.56  0.87  7466    1
prob[2]   0.55    0.00 0.10  0.36  0.73  6966    1
prob[3]   0.12    0.00 0.06  0.04  0.26  5506    1
theta[1]  1.03    0.00 0.42  0.24  1.88  7001    1
theta[2]  0.20    0.00 0.40 -0.58  1.01  6943    1
theta[3] -2.06    0.01 0.53 -3.12 -1.06  5160    1

Samples were drawn using NUTS(diag_e) at Sun May 19 04:55:42 2024.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

We return to the interpretation of these estimates below.

4.3 Model diagnostics

4.3.1 Convergence

Before we interpret these results we need to ensure that the model reached convergence. Convergence can be established using two simple methods. First, we can inspect the chains in trace plots which show the parameter estimate across iterations (here after warmup) for each chain in different colours. We will look at the population-level parameters saved in pars.

pars <- c("beta", "delta", "theta")

To create trace plots, we can apply the stan_trace function to the model fit_molg and extract the MCMC chains for the parameters in pars. The alpha argument makes the colours slightly more transparent otherwise it is difficult to see the chains hiding in the back. If the chains overlap and look like “fat hairy caterpillars”, chains have converged on the same target distribution. The chains below look good (i.e. well mixed).

# Check convergence
stan_trace(fit_molg, pars = pars, alpha = .5)

Second, we can calculate the $\hat{R}$ statistic. Successful convergence is reflected in $\hat{R}$ values smaller than 1.05 (Gelman & Rubin, 1992). $\hat{R}$ is similar to the F statistic in standard analysis of variance: it tells us how much bigger the variability between chains is compared to the variability within chains. A value of $\approx 1$ indicates that the variability is essentially identical signifying convergence. Larger $\hat{R}$ values indicate that the chains are diverging. The rhat function applied to the model and the parameters of interest extracts the $\hat{R}$ statistic. There is no evidence for convergence problems.

summary(fit_molg, pars=pars)$summary %>% 
  as.data.frame() %>% 
  rownames_to_column("parameter") %>% 
  mutate(across(Rhat, round, 3)) %>% 
  select(parameter, Rhat)

  parameter  Rhat
1      beta 1.002
2  delta[1] 1.000
3  delta[2] 1.000
4  delta[3] 1.000
5  theta[1] 1.000
6  theta[2] 1.000
7  theta[3] 1.000

Convergence problems can have many reasons and therefore many solutions: for example, running longer chains (in particular longer warmups), increasing the maximum treedepth and the average acceptance probability (adapt_delta), specifying starting values, using sensible regulating priors, constraining parameters, or changing the parametrisation of the model. Severe convergence problems indicate model misspecification.

4.3.2 Posterior predictive check

The model is simulating hypothetical datasets on the basis of the estimated posterior parameter values. This happened for every iteration for every chains. In other words there is a total of 15,000 simulated data sets (and samples per parameter) because

(total_samples <- (iterations - warmup) * nchains)

[1] 15000

We can draw predicted data from the model output and compare these to the observed data. A model that makes reasonable predictions should fit the observed data.

Using the as.matrix function we extract a matrix of y_tilde, the simulated data. This returns a matrix of the size total_samples $\times$ nrow(data) (so 15,000 $\times$ 600). We use the sample function to draw N randomly sampled hypothetical data sets.

y_tilde <- as.matrix(fit_molg, pars = "y_tilde")   # extract simulated data sets
N <- 50                                            # number of simulations to use
rnd_sims <- sample(total_samples, N)               # created random indices
y_tilde_sample <- y_tilde[rnd_sims,]               # draw N random simulations

The ppc_dens_oberlay function from the bayesplot package is then mapping the observed data ($y$; thick blue line) to the simulated data ($y_{rep}$; thin light blue lines). We find that the simulated data sets closely align with the observed data.

library(bayesplot)
ppc_dens_overlay(data_list$y, y_tilde_sample) +
  scale_x_continuous(limits = c(0, 3000))

For comparison we also visualised the fit for the simulated data of the Gaussian and the log-Gaussian model in Figure 4.2. Both models show a substantially poorer fit to the keystroke data than the mixture-model based simulations. Note that the Gaussian model does not rule out negative values which are conceptually implausible.

Figure 4.2: Fit of simulated datasets compared to observed data for the Gaussian and the log-Gaussian model. x-axes were truncated for visability.

5 Model comparsion II

Similar to the section “Model comparison I” we use leave-one-out cross-validation to compare now all three models, the mixture of log-Gaussians, the single distribution Gaussian and log-Gaussian models. Because the mixture model was fitted directly in Stan and the single-distribution models were fitted in brms, we need to extract LOO statistics for which we need the log-likelihood and the relative effective sample size for the Stan model. We can then compare the models using functions provided by the loo package.

library(loo)
# Mixture of log-Gaussians
log_lik <- extract_log_lik(fit_molg, merge_chains = F) 
r_eff <- relative_eff(exp(log_lik)) # relative effective sample size
loo_molg <- loo(log_lik, r_eff = r_eff)

# Single distribution log-Gaussian
loo_lgaus <- loo(fit_lgaus)

# Single distribution Gaussian
loo_gaus <- loo(fit_gaus)

In “Model comparison I” we used loo to compare two models directly. This time we need the loo_compare function because we had to extract the LOO statistics for comparison.

loo_comp <- loo_compare(loo_gaus, loo_lgaus, loo_molg)

We can then extract the model comparisons and fit statistics:

loo_comp %>% 
  as.data.frame() %>% 
  rownames_to_column("models") %>% 
  mutate(across(models, recode, model3 = "fit_molg"),
         across(where(is.numeric), round, 1))

     models elpd_diff se_diff elpd_loo se_elpd_loo p_loo se_p_loo   looic se_looic
1  fit_molg       0.0     0.0  -4311.0        51.5  31.5      2.0  8621.9    102.9
2 fit_lgaus    -133.2    14.6  -4444.2        51.2  10.9      1.2  8888.4    102.3
3  fit_gaus   -1726.9   164.6  -6037.9       175.2  62.4     47.6 12075.8    350.3

The difference between the model in the top row and the two other models is shown in elpd_diff ($\Delta\widehat{elpd}$) with standard error shown in the se_diff column. Models are ordered from the model with the highest predictive performance to the lowest (i.e. elpd_loo). The model comparison shows that the model fit_molg (the mixture of log-Gaussian) is approximately 9 standard error better than the single distribution log-Gaussian fit_lgaus (try abs(loo_comp[2,1] / loo_comp[2,2])). This constitutes substantial support for the mixture model over the single distribution log-Gaussian.

6 Posterior probability distribution

At the core of Bayesian inference is the posterior probability distribution. For each model parameter we have 15,000 posterior samples which constitute the posterior probability distribution. These samples are the result of the posterior samples taken from the target distribution after warmup collapsed across all chains. If the target distribution was not identified by the end of the warmup, inference will not be reliable. We know that the target distribution was determined because chains converged.

The distribution of parameter values represents the uncertainty about parameter values given the data. There is a large range of things one can do with a posterior. Below we focus on summarising parameter estimates, comparing conditions, and extracting by-participant estimates.

The names function can be used to check which parameters are available in the model (we omitted the log-likelihood log_lik, and the stimulated datasets y_tilde). The meaning of the parameters is described above and is determined in the Stan code. Indices refer to the transition locations (1 = before sentence; 2 = before word; 3 = within word) and to the participant identifier for by-participant parameters (indicated by _s).

names(fit_molg)[!str_detect(names(fit_molg), "log_lik|y_tilde")]

 [1] "beta"          "delta[1]"      "delta[2]"      "delta[3]"      "theta[1]"      "theta[2]"      "theta[3]"      "prob[1]"       "prob[2]"       "prob[3]"       "prob_s[1,1]"   "prob_s[2,1]"  
[13] "prob_s[3,1]"   "prob_s[1,2]"   "prob_s[2,2]"   "prob_s[3,2]"   "prob_s[1,3]"   "prob_s[2,3]"   "prob_s[3,3]"   "prob_s[1,4]"   "prob_s[2,4]"   "prob_s[3,4]"   "prob_s[1,5]"   "prob_s[2,5]"  
[25] "prob_s[3,5]"   "prob_s[1,6]"   "prob_s[2,6]"   "prob_s[3,6]"   "prob_s[1,7]"   "prob_s[2,7]"   "prob_s[3,7]"   "prob_s[1,8]"   "prob_s[2,8]"   "prob_s[3,8]"   "prob_s[1,9]"   "prob_s[2,9]"  
[37] "prob_s[3,9]"   "prob_s[1,10]"  "prob_s[2,10]"  "prob_s[3,10]"  "sigma"         "sigmap_e[1]"   "sigmap_e[2]"   "sigmap_e[3]"   "sigma_e[1]"    "sigma_e[2]"    "sigma_e[3]"    "sigma_diff[1]"
[49] "sigma_diff[2]" "sigma_diff[3]" "sigma_u"       "lp__"

The posterior probability distribution of the parameter estimates can be visualised using the stan_hist function applied to the model fit_molg and the to-be-visualised parameters pars as arguments. The function returns a histogram for each parameter. Each histogram is a probability distribution indicating which values are more (or less) probable estimates for the true parameter value and the uncertaintly associated with it.

stan_hist(fit_molg, pars = pars)

The posterior samples of the parameter values can be summarised using the print function. The probs argument can be used to extract the lower and upper bound of the probability interval. A lower bound of .025 and an upper bound of .975 gives the 95% probability interval (PI; also called “credible intervals”), i.e. the range that contains the true parameter value with a 95% probability given the data.

print(fit_molg, pars = pars, probs = c(.025, .975))

Inference for Stan model: anon_model.
3 chains, each with iter=10000; warmup=5000; thin=1; 
post-warmup draws per chain=5000, total post-warmup draws=15000.

          mean se_mean   sd  2.5% 97.5% n_eff Rhat
beta      5.08    0.00 0.08  4.92  5.24  3668    1
delta[1]  1.97    0.00 0.15  1.69  2.27  8978    1
delta[2]  1.24    0.00 0.15  0.95  1.54 11735    1
delta[3]  0.67    0.00 0.29  0.16  1.28 11620    1
theta[1]  1.03    0.00 0.42  0.24  1.88  7001    1
theta[2]  0.20    0.00 0.40 -0.58  1.01  6943    1
theta[3] -2.06    0.01 0.53 -3.12 -1.06  5160    1

Samples were drawn using NUTS(diag_e) at Sun May 19 04:55:42 2024.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

This output summarises the most probable parameter value as mean and its standard error (se_mean), standard deviation (sd), the effective sample size (n_eff) indicating sampling efficiency and the convergence metric $\hat{R}$ (Rhat).

For the following steps, we focus on the three parameters that have conceptually interesting interpretations: (1) the typing speed $\beta$, (2) the hesitation slowdown $\delta$, and (3) the hesitation probability $\theta$ (shown on the logit scale).

The plot function shows the posterior probability distribution of the three parameters for before-sentence transitions (indicated as 1), before-word transitions (indicated as 2), and within-word transitions summarised as median and 95% PI.

plot(fit_molg, pars = pars, ci_level = .95) # ci = credible interval

The values for $\beta$ and $\delta$ are shown on a log-scale. To transform their values back to msecs we can extract the posterior samples using the as.data.frame function.

posterior <- as.data.frame(fit_molg, pars) %>% as_tibble() %>% rownames_to_column("idx")
posterior

# A tibble: 15,000 × 8
   idx    beta `delta[1]` `delta[2]` `delta[3]` `theta[1]` `theta[2]` `theta[3]`
   <chr> <dbl>      <dbl>      <dbl>      <dbl>      <dbl>      <dbl>      <dbl>
 1 1      5.12       1.91      1.20     0.0271       0.405     0.303      -0.640
 2 2      4.98       1.82      1.29     0.00434      1.66      0.409      -2.35 
 3 3      5.12       1.84      1.33     0.107        1.08     -0.250      -2.52 
 4 4      5.11       2.17      1.27     0.252        0.389    -0.442      -1.44 
 5 5      5.11       2.00      1.11     0.336        0.699     0.967      -2.35 
 6 6      5.17       1.93      0.950    1.33         0.561    -0.0950     -2.70 
 7 7      5.05       2.09      1.52     0.286        1.53      0.186      -3.05 
 8 8      5.20       1.87      1.24     0.154        1.68     -0.416      -2.08 
 9 9      5.04       1.98      1.27     0.257        0.324    -0.202      -2.09 
10 10     4.96       2.26      1.17     1.33         0.974     0.115      -2.21 
# ℹ 14,990 more rows

The pivot_longer function transforms the data to a long format with an additional column for location and the delta and theta model parameters as columns. The names_pattern argument is using a regular expression to extract the number in the squared brackets.

posterior_long <- pivot_longer(posterior, 
                               cols = -c(idx, beta), 
                               names_to = c(".value", "location"),
                               names_pattern = "(.*)\\[(.)\\]") 
posterior_long

# A tibble: 45,000 × 5
   idx    beta location   delta  theta
   <chr> <dbl> <chr>      <dbl>  <dbl>
 1 1      5.12 1        1.91     0.405
 2 1      5.12 2        1.20     0.303
 3 1      5.12 3        0.0271  -0.640
 4 2      4.98 1        1.82     1.66 
 5 2      4.98 2        1.29     0.409
 6 2      4.98 3        0.00434 -2.35 
 7 3      5.12 1        1.84     1.08 
 8 3      5.12 2        1.33    -0.250
 9 3      5.12 3        0.107   -2.52 
10 4      5.11 1        2.17     0.389
# ℹ 44,990 more rows

Note that the values for beta are now repeated for each location a a consequence. We had to do this for the next step.

The next code transforms beta and delta to msecs using the exponential function exp. In order to transform the slowdown delta into msecs we had to add beta before using the exponential function; we can then subtract beta again. theta samples are currently on the logit scale and can be transformed to probabilities using brms::inv_logit_scaled (the model also contained a prob parameter which is theta but on the probability scale). The recode function changes the location indices from 1 to “before sentence”, 2 to “before word”, and 3 to “within word”.

posterior_in_msecs <- mutate(posterior_long,
                             delta = exp(beta + delta) - exp(beta),
                             across(beta, exp),
                             across(theta, inv_logit_scaled),
                             across(location, recode, `1` = "before sentence",
                                                      `2` = "before word",
                                                      `3` = "within word"))
posterior_in_msecs

# A tibble: 45,000 × 5
   idx    beta location           delta  theta
   <chr> <dbl> <chr>              <dbl>  <dbl>
 1 1      168. before sentence  970.    0.600 
 2 1      168. before word      390.    0.575 
 3 1      168. within word        4.60  0.345 
 4 2      146. before sentence  759.    0.840 
 5 2      146. before word      387.    0.601 
 6 2      146. within word        0.636 0.0873
 7 3      167. before sentence  882.    0.747 
 8 3      167. before word      463.    0.438 
 9 3      167. within word       18.8   0.0742
10 4      165. before sentence 1284.    0.596 
# ℹ 44,990 more rows

The posterior distribution of the parameter estimates can then be visualised in, for example, histograms. The histograms show the probability distribution of each parameter with the hesitation slowdown $\delta$ and the hesitation probability $\theta$ shown for each transition location. The latter two illustrate that their posterior parameter values are sensitive to the interkey intervals location in the text with more and longer hesitations before sentences than before words than within words which we evaluate in the next section.

pivot_longer(posterior_in_msecs, c(beta, delta, theta)) %>%
  mutate(across(location, ~case_when(str_detect(name, "beta") ~ "overall",
                                     TRUE ~ .)), # correct location for beta
         across(location, ~factor(.,             # order locations for plotting
                                  levels = c("before sentence", 
                                             "before word", 
                                             "within word",
                                             "overall"), 
                                  ordered = TRUE))) %>% 
  unique() %>%  # remove duplicated values for beta 
  ggplot(aes(x = value, colour = location, fill = location)) +
  geom_histogram(position = "identity", alpha = .25) +
  facet_wrap(~name, scales = "free", labeller = label_parsed) +
  scale_fill_brewer("Transition\nlocation", palette = "Dark2") +
  scale_color_brewer("Transition\nlocation", palette = "Dark2") +
  labs(x = "Posterior parameter estimate") +
  theme(legend.position = "top")

To illustrate differences for the estimated cells means compared to the single distribution models, here are the means for all parameters and their location. The central tendencies of the single distribution models are in principle a combination of beta and beta + delta weighted by prob.

pivot_longer(posterior_in_msecs, c(beta, delta, theta)) %>% 
  mutate(across(location, ~case_when(str_detect(name, "beta") ~ "overall",
                                     TRUE ~ .))) %>% 
  unique() %>% 
  summarise(across(value, list(mean = mean), 
                   .names = "{.col}"), 
            .by = c(name, location)) %>% 
  arrange(name)

# A tibble: 7 × 3
  name  location           value
  <chr> <chr>              <dbl>
1 beta  overall          161.   
2 delta before sentence 1013.   
3 delta before word      402.   
4 delta within word      169.   
5 theta before sentence    0.730
6 theta before word        0.548
7 theta within word        0.124

These are the estimated cell means from the log-Gaussian model

fixef(fit_lgaus)[,1] %>% exp() %>% round(0)

locationbeforeMsentence     locationbeforeMword     locationwithinMword 
                    691                     330                     162

and the estimated cell means of the Gaussian model

fixef(fit_gaus)[,1] %>% round(0)

locationbeforeMsentence     locationbeforeMword     locationwithinMword 
                   2840                     387                     343

Differences are especially noticeable for before-sentence transitions.

7 Hypothesis testing

7.1 Differences between transition locations

We calculate the differences between transition locations for both hesitation parameters to determine to what extent each of the parameters are sensitive to transition locations. This shows us to what extent the size of the hesitation slowdown $\delta$, and the hesitation probability $\theta$ are sensitive to the linguistic location of a key pair.

To determine these differences we change the data format above and create a column that indicates the parameter name and a column for the corresponding values for each transition location. The last line calculates the difference between (1) transitions before sentences and words, and (2) transitions before and within words for each of the two hesitation parameters.

posterior_diffs <- posterior_in_msecs %>%
  select(-idx) %>% 
  pivot_longer(delta:theta, names_to = "parameter") %>%
  mutate(id = row_number(), .by = location) %>%
  pivot_wider(names_from = location, values_from = value) %>%
  transmute(parameter = str_c(parameter, "[diff]"),
            diff_sent = `before sentence` - `before word`, # compare sentences to word
            diff_word = `before word` - `within word`)     # compare within to before word

posterior_diffs

# A tibble: 30,000 × 3
   parameter   diff_sent diff_word
   <chr>           <dbl>     <dbl>
 1 delta[diff]  580.       385.   
 2 theta[diff]    0.0246     0.230
 3 delta[diff]  372.       386.   
 4 theta[diff]    0.239      0.514
 5 delta[diff]  420.       444.   
 6 theta[diff]    0.309      0.364
 7 delta[diff]  859.       378.   
 8 theta[diff]    0.205      0.199
 9 delta[diff]  727.       271.   
10 theta[diff]   -0.0567     0.637
# ℹ 29,990 more rows

The differences between locations can be summarised as the mean, the 95% PIs and the probability that the difference between transition locations is smaller than 0 (notated as e.g. $P(\hat{\theta}_\text{diff} < 0)$ for the parameter $\theta$; the hat symbol $\hat{.}$ indicates that the value is an estimate of the population parameter value). This summary tells us to what extent we can rule out a null difference between transition locations.

# Create summary functions for convenience
lower <- function(x) quantile(x, .025) # lower bound of 95% PI
upper <- function(x) quantile(x, .975) # upper bound of 95% PI
p_diff <- function(x) mean(x < 0)      # prob that difference is negative

# Summarise differences
posterior_diffs %>%
  pivot_longer(starts_with("diff")) %>% 
  summarise(across(value, 
              list(mean = mean,        # mean of difference
                   lower = lower,      # 2.5% lower bound of difference
                   upper = upper,      # 97.5% upper bound of difference
                   p_diff = p_diff),   # prob that difference is negative
              .names = "{.fn}"),
            .by = c(parameter, name)) %>%   # group by parameters, comparison 
  mutate(across(where(is.numeric), round, 2)) # round after 2nd decimal place

# A tibble: 4 × 6
  parameter   name        mean  lower   upper p_diff
  <chr>       <chr>      <dbl>  <dbl>   <dbl>  <dbl>
1 delta[diff] diff_sent 611.   251.   1052.     0   
2 delta[diff] diff_word 233.   -50.6   483.     0.05
3 theta[diff] diff_sent   0.18  -0.05    0.42   0.06
4 theta[diff] diff_word   0.42   0.21    0.63   0

Form these results we can infer that there is a $P(\hat{\theta}_\text{diff} < 0)$ = 0.06 chance that hesitations do not occur more frequently before sentence; the results show that hesitations are 0.18 more probable to occur sentence initially than word initially (95% PI: -0.05, 0.42); before-word transitions show an increased hesitation probability of 0.42 (95% PI: 0.21, 0.63) compared to within-word transitions.

Also, the slowdown associated with hesitations is 611 msecs longer for sentences (95% PI: 251, 1052) with a probability of $P(\hat{\delta}_\text{diff} < 0)$ = 0 (not exactly 0 but smaller than 0.01) to be smaller than zero. Hesitations before words were 233 msecs longer than within words, although the probability interval (95% PI: -51, 483) did not rule out a negative difference, or no difference, which however had a low probability $P(\hat{\delta}_\text{diff} < 0)$ = 0.05.

These differences can be viewed in histograms. The vertical dotted lines indicates a zero difference between transition locations. The area of the histogram to the left of this line corresponds to the probability p_diff to observe a negative difference. Sentence boundaries seem to have a stronger effect on the hesitation slowdown while both sentence and word boundaries increase the hesitation probability.

posterior_diffs %>% 
  pivot_longer(starts_with("diff")) %>% 
  mutate(across(name, recode, diff_sent = "before sentence vs before word",
                              diff_word = "before word vs within word")) %>% 
  ggplot(aes(x = value, colour = name, fill = name)) +
  geom_vline(xintercept = 0, colour = "black", linetype = "dotted", size = .5) +
  geom_histogram(position = "identity", alpha = .25) +
  facet_wrap(~parameter, scales = "free", labeller = label_parsed) +
  scale_fill_brewer("Difference", palette = "Dark2") +
  scale_color_brewer("Difference", palette = "Dark2") +
  labs(x = "Difference between transition locations") +
  theme(legend.position = "top")

7.2 ROPE

One problem with null-hypothesis significance testing it that, in practice, we rarely believe that only a difference of exactly null is indicating that there is no difference between groups. Instead there are values that are non-different from null; for example a difference in keystroke intervals of 1 msecs is meaningless in most, if not all, contexts. A hesitation probability increase of 0.001 is too small to be meaningful. To take this into account we can determine a region of practical equivalence (ROPE, Kruschke, 2010b, 2011, 2014).

The ROPE can be defined as the range of values that are a priori considered non-different from a null effect. The result can be understood as the extent to which the posterior cannot rule out a negligible effect. A meaningful difference between groups should have a small proportion of posterior samples within the ROPE.

To calculate the ROPE for differences between transition locations, we use the rope function from the bayestestR package.⁶ The rope function requires as input a vector with the posterior difference. We can create this from the posterior differences calculated above. Let’s focus on $\delta_\text{diff}$, the slowdown difference for hesitant transitions.

# Extract vector of differences
# For sentence vs word
delta_diff_sent <- filter(posterior_diffs, str_detect(parameter, "delta")) %>% pull(diff_sent)

# For before word vs within word
delta_diff_word <- filter(posterior_diffs, str_detect(parameter, "delta")) %>% pull(diff_word)

Also, we need to define the range argument which specifies the range of values that are considered indifferent from null. Of course this range will depend on what the parameter represents and researcher’s intuition as well as prior knowledge of the domain. For example, for fluent transitions we would consider small differences, of say 10 msecs, more meaningful than for hesitations. If we consider differences of 10 msecs or smaller equivalent to a null difference, we can define the range argument with a lower bound of -10 and the upper bound of 10.

library(bayestestR)
rope(delta_diff_word, range = c(-10, 10))

# Proportion of samples inside the ROPE [-10.00, 10.00]:

inside ROPE
-----------
1.19 %

The ROPE for the hesitation duration of the difference between before-word and within-word transitions contains 1.19% of the posterior samples. In other words, there is a 1.19% chance that the true difference $\delta_\text{diff}$ is between -10 and 10 msecs and therefore negligible.

Let’s see what happens if we determine a wider ROPE between -50 through 50 msecs which can be understood as a more conservative test for a non-zero difference.

rope(delta_diff_word, range = c(-50, 50))

# Proportion of samples inside the ROPE [-50.00, 50.00]:

inside ROPE
-----------
6.20 %

The ROPE contains 6.2% of the posterior samples. In other words, there is a 6.2% probability that the true difference $\delta_\text{diff}$ is between -50 and 50 msecs which would therefore be a negligible effect.

7.3 Bayes Factor

Frequentist null-hypothesis testing is based on the p-value. Conceptually this probability represents how likely the data are if there was no effect and we can, if this probability is low enough (typically smaller than .05) reject the null hypothesis (i.e. the true parameter value being zero). In other words, if the data are very unlikely, the null-hypothesis is implausible. More specifically it tests whether the probability of the obtained test statistic (e.g. t, F, $\chi^2$) is as or more extreme than what one would expect if the null hypothesis is true. There are two possible outcomes for p-values: reject the null (i.e. $p<.05$); inconclusive evidence (i.e. $p>.05$). And that’s it. p-values do neither provide direct evidence for the alternative hypothesis (but evidence against the null), nor do they provide evidence for the null hypothesis (that data are not inconsistent with the null hypothesis is not evidence that there is no difference; or absence of evidence is not evidence of absence).

In contrast, Bayes Factors (BFs) provide a continuum of three possible outcomes (Dienes, 2014, 2016; Dienes & Mclatchie, 2018): (1) there is evidence to reject the null / alternative hypothesis; (2) the evidence is inconclusive (get more data!); (3) accept the null / alternative hypothesis.

For example, if the BF is the probability of the alternative hypothesis $H_1$ over the null hypothesis $H_0$, given the data $y$, values $>1$ indicate that there is more evidence for $H_1$ than for $H_0$, given the data, and $<1$ indicates that there is more evidence for $H_0$. This can be turned around by swapping nominator and denominator

\[ \text{BF}_{10} = \frac{p(H_1 \mid y)}{p(H_0 \mid y)} \] or by taking the inverse of $\text{BF}_{10}$ which is $\text{BF}_{01} = \frac{1}{\text{BF}_{10}}$.

The strength of evidence for $H_0$ and $H_1$ is continuous rather than binary (i.e. evidence vs no evidence). Authors have proposed different soft thresholds for meaningful evidence (see, e.g., Baguley, 2012; Jeffreys, 1961; Lee & Wagenmakers, 2014) but generally a BF larger than 3 is weak evidence, larger than 5 / 6 indicate moderate evidence and 10 / 32 is strong evidence for a statistically meaningful effect. For example BF$_{10}$ = 2 reflects that the alternative hypothesis is two times more likely than the null hypothesis given the evidence. BF$_{10}$ = 0.33 is weak evidence in favour of the null hypothesis because $\frac{1}{0.33} \approx 3$. Notably, BFs can be used to obtain evidence in favour of the null; for discussion see Dienes (2014), Dienes & Mclatchie (2018), Dienes (2016) and for a practical application in writing research see Spilling et al. (2022) and Spilling et al. (2023).

BFs can be calculated using the Savage-Dickey method (see, e.g., Dickey et al., 1970; Jeffreys, 1961; Wagenmakers et al., 2010) where the BF is determined on the basis of the height of the posterior density at zero compared to the height of the prior density at zero. Important to note here is that the BF is sensitive to the prior especially in situations when the sample was small and / or prior information played a big role. It is therefore a good idea to carefully think about the prior in advance of analysis (hence “prior”) and to be mindful of the role of priors in the calculation of BFs.

Figure 7.1 shows an example for the calculation of BFs based on a posterior with an effect difference that is normally distributed with a mean of 2.5 and a SD of 1, expressed as $\mathcal{N}(2.5, 1)$, compared against three priors with (1) a wide SD $\mathcal{N}(0, 10)$, (2) a SD equal to the posterior $\mathcal{N}(0, 2.5)$, and (3) a relatively small SD $\mathcal{N}(0, 1)$. The resulting BFs are shown in the boxes as well as their calculation. The denominator shows the height of the posterior (black line) at zero and the nominator shows the height of the prior at zero (indicates as black dots). The BF was larger for more informative priors. Of course in practice, the shape of the posterior density is not independent of the prior; for more informative priors the posterior will move closer to the prior.

Figure 7.1: Example for the BF calculated using the Savage-Dickey method for three differently wide priors.

To calculate the BF on the basis of posterior samples, the Savage-Dickey method can be implemented as below (taken from Nicenboim & Vasishth, 2016) which requires an appropriate prior distribution.

BF <- function(posterior_difference, prior_sd = 1){
  library(polspline)
  fit <- logspline(posterior_difference) 
  posterior <- dlogspline(0, fit) # Height of the posterior at 0 
  prior <- dnorm(0, 0, prior_sd)  # Height of the prior at 0
  return(prior/posterior)
}

The logspline function estimates the density of the posterior and the dlogspline and the dnorm functions return the height of the density at 0. We fixed the mean of the prior at 0 but allow to change the SD as function argument. Aside, the mean of the prior does not need to be zero but can take on any value that can be justified on the basis of domain knowledge. We illustrate here how BFs can be used to test the evidence again the null hypothesis.

For demonstration purposes we calculate the BF with different prior SDs for the hesitation probability $\theta$, or in particular the posterior difference in hesitation $\theta_\text{diff}$ for before-sentence vs before-word transitions and before-word vs within-word transitions.

First, we extract vectors of differences

# For sentence vs word
theta_diff_sent <- filter(posterior_diffs, str_detect(parameter, "theta")) %>% pull(diff_sent)

# For before word vs within word
theta_diff_word <- filter(posterior_diffs, str_detect(parameter, "theta")) %>% pull(diff_word)

and then apply the BF function from above first to the difference between before-sentence and before-word transitions. For interpretation it is important to remember that $\theta$ is a probability (which we transformed from logit to probability above mostly for ease of explanation; in practice it is a better idea to work with the logit scaled values); hence the difference between two probabilities can only range between -1 and 1 (in the most extreme case). In other words, a prior SD of 1 is saying anything goes (which is not helpful). Similar to the calculations of the ROPE above, it is useful to contemplate which prior distribution appropriately reflects the null hypothesis for the scale the difference is on (e.g. logit, probability, log msecs, msecs). SDs of 1 and .5 render very weakly informative prior distributions (flat, really) as they assign about the same prior probability to differences across the board. An SD of .1 is a more realistic representation for the distribution of null differences.⁷

# Prior with sd of 1
BF(theta_diff_sent, prior_sd = 1)

[1] 0.4004183

# Prior with sd of .5
BF(theta_diff_sent, prior_sd = .5)

[1] 0.8008365

# Prior with sd of .1
BF(theta_diff_sent, prior_sd = .1)

[1] 4.004183

7.4 Note on using more than a single grouping variable

The Stan code presented in this tutorial can be used to estimate the posterior for two and more conditions represented in one predictor variable (called condition in the Stan code): We can calculate the difference between conditions from the posterior as well as main effects and interactions for more complex factorial designs as any factorial design can be reduced to a single vector. For example, say we have a 2 $\times$ 2 factorial design with factor 1 having two levels AB and CD and factor 2 having the corresponding levels AC and BD. These two factors render four conditions A, B, C, D. From posterior samples for each level A through D we can calculate main effect 1 as $\text{ME1}=(A+B) - (C+D)$, main effect 2 as $\text{ME2}=(A+C) - (B+D)$, and their interaction as $\text{Interaction}=(A-B) - (C-D)$ (or $\text{Interaction}=(A-C) - (B-D)$) and summary statistics as shown above.

8 By-participant estimates

In some contexts we would like to obtain by-participant estimates of the probability of a participant to exhibit hesitations (i.e. $\theta_s$) reflecting how often a participant is pausing. As before we can extract the posterior from the model fit_molg. This time we use the parameter that stored by-participant estimates prob_s (which is the inverse-logit of the theta_s which we didn’t store in the model fit) corresponding to the population estimate theta.

posterior_ppts <- as.data.frame(fit_molg, pars = 'prob_s') %>% as_tibble()
names(posterior_ppts)[1:10]

 [1] "prob_s[1,1]" "prob_s[2,1]" "prob_s[3,1]" "prob_s[1,2]" "prob_s[2,2]" "prob_s[3,2]" "prob_s[1,3]" "prob_s[2,3]" "prob_s[3,3]" "prob_s[1,4]"

The column names have the format <parameter>_s[<location>, <participant>] where indices indicate the transition-location id and the participant identifier of the parameters. The following code creates a long format with columns location and participant and a value column with their respective values (value is pivot_longer default). The previous column names were separated into location (levels: 1, 2, 3), and participant (an index for each participant) using “,” to separate the two variables and extracting the numbers inside the squared brackets (the remainder of the name was discarded).

posterior_ppts_long <- posterior_ppts %>%
  pivot_longer(everything(), 
               names_to = c("location", "participant"), 
               # extract indice for location and ppt using "," as separator
               names_pattern = "\\[(.+),(.+)\\]") %>%
  mutate(across(location, recode, 
                `1` = "before sentence",
                `2` = "before word",
                `3` = "within word")) %>%
  # create different columns for location
  mutate(id = row_number(), 
         .by = location) %>%
  pivot_wider(names_from = location, 
              values_from = value) %>% 
  select(-id) # drop id

posterior_ppts_long

# A tibble: 150,000 × 4
   participant `before sentence` `before word` `within word`
   <chr>                   <dbl>         <dbl>         <dbl>
 1 1                       0.401         0.285         0.303
 2 2                       0.704         0.596         0.167
 3 3                       0.286         0.686         0.211
 4 4                       0.549         0.315         0.210
 5 5                       0.573         0.541         0.313
 6 6                       0.488         0.422         0.439
 7 7                       0.591         0.738         0.110
 8 8                       0.701         0.479         0.302
 9 9                       0.670         0.725         0.202
10 10                      0.821         0.360         0.403
# ℹ 149,990 more rows

We can then summarise by-participant estimates for each transition location with the most probable parameter value and the 95% PI as before.

posterior_ppts_long_summary <- posterior_ppts_long %>%
  pivot_longer(-participant, names_to = "location") %>%
  summarise(across(value, list(mean = mean, 
                               lower = lower, 
                               upper = upper),
                   .names = '{.fn}'),
            .by = c(location, participant)) 

posterior_ppts_long_summary

# A tibble: 30 × 5
   location        participant   mean  lower upper
   <chr>           <chr>        <dbl>  <dbl> <dbl>
 1 before sentence 1           0.642  0.364  0.895
 2 before word     1           0.630  0.365  0.888
 3 within word     1           0.190  0.0393 0.464
 4 before sentence 2           0.816  0.600  0.967
 5 before word     2           0.574  0.313  0.838
 6 within word     2           0.0974 0.0124 0.272
 7 before sentence 3           0.506  0.261  0.767
 8 before word     3           0.697  0.420  0.933
 9 within word     3           0.128  0.0237 0.333
10 before sentence 4           0.631  0.389  0.856
# ℹ 20 more rows

For all 10 participants we see the estimate probability of typing hesitations by transition location. We observe in general that hesitations are more frequent before sentences and words than within words. However, participants vary to the extend they hesitate before words and sentences.

posterior_ppts_long_summary %>%
  mutate(across(participant, as.numeric)) %>% 
  ggplot(aes(x = participant, 
             y = mean, 
             ymin = lower, 
             ymax = upper,
             colour = location,
             shape = location)) +
  geom_pointrange(position = position_dodge(.75)) +
  labs(x = "Participant id", 
       y = "Hesitation probability with 95% PIs") +
  scale_colour_brewer("Transition location", palette = "Dark2") +
  scale_shape_manual("Transition location", values = c(21, 3, 8)) +
  scale_x_continuous(breaks = 1:10) +
  theme(legend.position = 'top')

Finally, we calculated and visualised the by-participant differences.

posterior_ppts_long_diff_summary <- posterior_ppts_long %>% 
  mutate(diff_sent = `before sentence` - `before word`,
         diff_word = `before word` - `within word`) %>% 
  select(participant, starts_with('diff')) %>%
  pivot_longer(-participant, 
               names_to = 'difference', 
               values_to = 'diff') %>% 
  summarise(across(diff, list(mean = mean, 
                              lower = lower, 
                              upper = upper),
                   .names = '{.fn}'),
            .by = c(participant, difference)) 

posterior_ppts_long_diff_summary

# A tibble: 20 × 5
   participant difference    mean   lower upper
   <chr>       <chr>        <dbl>   <dbl> <dbl>
 1 1           diff_sent   0.0116 -0.362  0.371
 2 1           diff_word   0.440   0.0916 0.743
 3 2           diff_sent   0.242  -0.0812 0.554
 4 2           diff_word   0.477   0.179  0.769
 5 3           diff_sent  -0.192  -0.553  0.186
 6 3           diff_word   0.569   0.254  0.845
 7 4           diff_sent   0.246  -0.104  0.566
 8 4           diff_word   0.273  -0.0469 0.565
 9 5           diff_sent   0.157  -0.130  0.462
10 5           diff_word   0.590   0.287  0.866
11 6           diff_sent   0.214  -0.138  0.539
12 6           diff_word   0.228  -0.0816 0.538
13 7           diff_sent   0.356  -0.0154 0.698
14 7           diff_word   0.354   0.0643 0.698
15 8           diff_sent   0.367   0.0568 0.656
16 8           diff_word   0.395   0.0661 0.708
17 9           diff_sent  -0.119  -0.423  0.188
18 9           diff_word   0.637   0.295  0.885
19 10          diff_sent   0.468   0.106  0.791
20 10          diff_word   0.210  -0.142  0.549

These allow us to identify participants that do or do not differ in terms of hesitation frequency depending on transition locations.

posterior_ppts_long_diff_summary %>%
  mutate(across(difference, recode, 
                diff_sent = "Comparison: before sentence vs before word",
                diff_word = "Comparison: before word vs within word"),
         across(participant, as.numeric)) %>% 
  ggplot(aes(x = participant, 
             y = mean, 
             ymin = lower, 
             ymax = upper)) +
  geom_hline(yintercept = 0, colour = "black", linetype = "dotted") +
  geom_pointrange() +
  facet_wrap(~difference) +
  coord_flip() +
  scale_x_continuous(breaks = 1:10) +
  labs(x = "Participant id", 
       y = "Difference in hesitation probabilities with 95% PIs")

9 References

Baayen, R. H. (2008). Analyzing linguistic data. A practical introduction to statistics using R. Cambridge University Press.

Baayen, R. H., Davidson, D. J., & Bates, D. M. (2008). Mixed-effects modeling with crossed random effects for subjects and items. Journal of Memory and Language, 59(4), 390–412.

Baguley, T. (2012). Serious stats: A guide to advanced statistics for the behavioral sciences. Palgrave Macmillan.

Bürkner, P.-C. (2017). brms: An R package for Bayesian multilevel models using Stan. Journal of Statistical Software, 80(1), 1–28.

Bürkner, P.-C. (2018). Advanced Bayesian multilevel modeling with the R package brms. The R Journal, 10(1), 395–411. https://doi.org/10.32614/RJ-2018-017

Carpenter, B., Gelman, A., Hoffman, M. D., Lee, D., Goodrich, B., Betancourt, M., Brubaker, M. A., Guo, J., Li, P., & Riddell, A. (2016). Stan: A probabilistic programming language. Journal of Statistical Software, 20.

Dickey, J. M., Lientz, B. P., et al. (1970). The weighted likelihood ratio, sharp hypotheses about chances, the order of a markov chain. The Annals of Mathematical Statistics, 41(1), 214–226.

Dienes, Z. (2014). Using Bayes to get the most out of non-significant results. Frontiers in Psychology, 5(781), 1–17.

Dienes, Z. (2016). How Bayes factors change scientific practice. Journal of Mathematical Psychology, 72, 78–89.

Dienes, Z., & Mclatchie, N. (2018). Four reasons to prefer Bayesian analyses over significance testing. Psychonomic Bulletin & Review, 25(1), 207–218.

Dux Speltz, E., Roeser, J., & Chukharev-Hudilainen, E. (2022). Automating individualized, process-focused writing instruction: A design-based research study. Frontiers in Communication: Emerging Technologies and Writing: Pedagogy and Research, 7, 933878. https://doi.org/0.3389/fcomm.2022.933878

Gelman, A., Hill, J., & Vehtari, A. (2020). Regression and other stories. Cambridge University Press.

Gelman, A., & Rubin, D. B. (1992). Inference from iterative simulation using multiple sequences. Statistical Science, 7(4), 457–472.

Hoffman, M. D., & Gelman, A. (2014). The No-U-Turn sampler: Adaptively setting path lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research, 15(1), 1593–1623.

Jeffreys, H. (1961). The theory of probability (Vol. 3). Oxford University Press, Clarendon Press.

Kruschke, J. K. (2010a). Doing Bayesian data analysis: A tutorial introduction with R. Academic Press.

Kruschke, J. K. (2010b). What to believe: Bayesian methods for data analysis. Trends in Cognitive Sciences, 14(7), 293–300.

Kruschke, J. K. (2011). Bayesian assessment of null values via parameter estimation and model comparison. Perspectives on Psychological Science, 6(3), 299–312.

Kruschke, J. K. (2014). Doing bayesian data analysis: A tutorial with R, JAGS, and Stan (2nd ed.). Academic Press.

Lambert, B. (2018). A student’s guide to Bayesian statistics. Sage.

Lee, M. D., & Wagenmakers, E.-J. (2014). Bayesian cognitive modeling: A practical course. Cambridge University Press.

MacLahlan, G., & Peel, D. (2000). Finite mixture models. John & Sons.

Matzke, D., & Wagenmakers, E.-J. (2009). Psychological interpretation of the ex-Gaussian and shifted Wald parameters: A diffusion model analysis. Psychonomic Bulletin & Review, 16(5), 798–817.

McElreath, R. (2016). Statistical rethinking: A Bayesian course with examples in R and Stan. CRC Press.

Nicenboim, B., & Vasishth, S. (2016). Statistical methods for linguistic research: Foundational ideas – Part II. arXiv Preprint arXiv:1602.00245.

Pinheiro, J., Bates, D. M., DebRoy, S., & Sarkar, D. (2007). Linear and nonlinear mixed effects models. R Package Version, 3, 57.

R Core Team. (2016). R: A language and environment for statistical computing. R Foundation for Statistical Computing. https://www.R-project.org/

Roeser, J., De Maeyer, S., Leijten, M., & Van Waes, L. (2021). Modelling typing disfluencies as finite mixture process. Reading and Writing. https://osf.io/y3p4d/

Sivula, T., Magnusson, M., Matamoros, A. A., & Vehtari, A. (2020). Uncertainty in Bayesian leave-one-out cross-validation based model comparison. arXiv Preprint arXiv:2008.10296.

Sorensen, T., Hohenstein, S., & Vasishth, S. (2016). Bayesian linear mixed models using Stan: A tutorial for psychologists, linguists, and cognitive scientists. Quantitative Methods for Psychology, 12(3), 175–200.

Spilling, E. F., Rønneberg, V., Rogne, W. M., Roeser, J., & Torrance, M. (2022). Handwriting versus keyboarding: Does writing modality affect quality of narratives written by beginning writers? Reading and Writing, 35(1), 129–153.

Spilling, E. F., Rønneberg, V., Rogne, W. M., Roeser, J., & Torrance, M. (2023). Writing by hand or digitally in first grade: Effects on rate of learning to compose text. Computers & Education, 198, 104755.

Stan Development Team. (2015a). Stan modeling language user’s guide and reference manual. http://mc-stan.org/.

Stan Development Team. (2015b). Stan: A C++ library for probability and sampling. http://mc-stan.org/.

Vasishth, S., Chopin, N., Ryder, R., & Nicenboim, B. (2017). Modelling dependency completion in sentence comprehension as a Bayesian hierarchical mixture process: A case study involving Chinese relative clauses. ArXiv e-Prints.

Vehtari, A., Gelman, A., & Gabry, J. (2015). Pareto smoothed importance sampling. arXiv Preprint arXiv:1507.02646.

Vehtari, A., Gelman, A., & Gabry, J. (2017). Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC. Statistics and Computing, 27(5), 1413–1432.

Wagenmakers, E.-J., Lodewyckx, T., Kuriyal, H., & Grasman, R. (2010). Bayesian hypothesis testing for psychologists: A tutorial on the savage–dickey method. Cognitive Psychology, 60(3), 158–189.

10 Stan code

/*
  Mixture model for pauses
  With by-ppt mixing proportion
  Random intercepts for participants 
*/

data {
  int<lower=1> N;               // Number of observations
  int<lower=1> nS;              // Number of subjects
  int<lower=1, upper=nS> ppt[N];// Participant identifiers
  int<lower=1> K;               // Number of conditions
  int<lower=1> condition[N];    // Condition identifiers
  vector[N] y;                  // Outcome: interkey intervals
}


parameters {
  // fixed effects
  real beta; // fluent interkey intervals
  vector<lower=0>[K] delta; // slowdown for long interkey intervals
  vector[K] theta; // hesitation probability

  real<lower = 0> tau; // error for hesitation probability
  real<lower=0> sigma; // residual sd
  vector<lower=0>[K] sigma_diff; 

  // For random effects
  vector[nS] u; // participant intercepts
  real<lower=0> sigma_u; // participant sd
  matrix[K, nS] theta_s; // participant hesitations

}

transformed parameters{
  vector<lower=0>[K] sigmap_e = sigma + sigma_diff;
  vector<lower=0>[K] sigma_e = sigma - sigma_diff;
  matrix[K, nS] log_theta_s_1 = log_inv_logit(theta_s);
  matrix[K, nS] log_theta_s_2 = log1m_inv_logit(theta_s);
  vector[K] prob = inv_logit(theta); 
  matrix[K, nS] prob_s = inv_logit(theta_s); 

}

model {
  vector[2] lp_parts;

  // Priors
  beta ~ normal(5, .5);
  sigma ~ student_t(7, 0, 1);
  sigma_diff ~ normal(0, 1);
  delta ~ normal(0, .5);
  theta ~ normal(0, 1);
  tau ~ student_t(7, 0, .5);
    
  for(s in 1:nS){
    for(k in 1:K){
     theta_s[k,s] ~ normal(theta[k], tau);  
    }
  }
  
  // REs priors
  sigma_u ~ normal(0,2.5);
  u ~ normal(0, sigma_u); //subj random effects

  // likelihood
  for(n in 1:N){
    real mu = beta + u[ppt[n]];
    lp_parts[1] = log_theta_s_1[condition[n], ppt[n]] + lognormal_lpdf(y[n] | mu + delta[condition[n]], sigmap_e[condition[n]]); 
    lp_parts[2] = log_theta_s_2[condition[n], ppt[n]] + lognormal_lpdf(y[n] | mu, sigma_e[condition[n]]); 
    target += log_sum_exp(lp_parts);
  }
}

generated quantities{
  vector[N] log_lik;
  vector[N] y_tilde;
  vector[2] lp_parts;
  real<lower=0,upper=1> prob_tilde; 

  for(n in 1:N){
    real mu = beta + u[ppt[n]];
    lp_parts[1] = log_theta_s_1[condition[n], ppt[n]] + lognormal_lpdf(y[n] | mu + delta[condition[n]], sigmap_e[condition[n]]); 
    lp_parts[2] = log_theta_s_2[condition[n], ppt[n]] + lognormal_lpdf(y[n] | mu, sigma_e[condition[n]]); 
    log_lik[n] = log_sum_exp(lp_parts);
    prob_tilde = bernoulli_rng(prob_s[condition[n], ppt[n]]); 
    if(prob_tilde) { 
      y_tilde[n] = lognormal_rng(mu + delta[condition[n]], sigmap_e[condition[n]]);
    }
    else{
      y_tilde[n] = lognormal_rng(mu, sigma_e[condition[n]]);
    }
  }
}

11 Session info

sessionInfo()

R version 4.3.3 (2024-02-29)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.6 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3 
LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.2.20.so;  LAPACK version 3.7.1

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C               LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8     LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8    LC_PAPER=en_GB.UTF-8      
 [8] LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

time zone: Europe/London
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] polspline_1.1.25   bayestestR_0.13.2  loo_2.7.0          bayesplot_1.11.1   brms_2.21.0        Rcpp_1.0.12        rstan_2.32.6       StanHeaders_2.32.8 lubridate_1.9.3    forcats_1.0.0     
[11] stringr_1.5.1      dplyr_1.1.4        purrr_1.0.2        readr_2.1.5        tidyr_1.3.1        tibble_3.2.1       ggplot2_3.5.1      tidyverse_2.0.0    patchwork_1.2.0    knitr_1.46        
[21] rmdformats_1.0.4  

loaded via a namespace (and not attached):
 [1] tidyselect_1.2.1     viridisLite_0.4.2    farver_2.1.2         fastmap_1.2.0        tensorA_0.36.2.1     digest_0.6.35        timechange_0.3.0     estimability_1.5.1   lifecycle_1.0.4     
[10] magrittr_2.0.3       posterior_1.5.0      compiler_4.3.3       rlang_1.1.3          sass_0.4.9           tools_4.3.3          utf8_1.2.4           yaml_2.3.8           labeling_0.4.3      
[19] bridgesampling_1.1-2 bit_4.0.5            pkgbuild_1.4.4       curl_5.2.1           RColorBrewer_1.1-3   plyr_1.8.9           abind_1.4-5          withr_3.0.0          grid_4.3.3          
[28] stats4_4.3.3         fansi_1.0.6          xtable_1.8-4         colorspace_2.1-0     inline_0.3.19        emmeans_1.10.2       scales_1.3.0         insight_0.19.11      cli_3.6.2           
[37] mvtnorm_1.2-5        rmarkdown_2.27       crayon_1.5.2         generics_0.1.3       RcppParallel_5.1.7   rstudioapi_0.16.0    reshape2_1.4.4       tzdb_0.4.0           cachem_1.1.0        
[46] ggthemes_5.1.0       parallel_4.3.3       matrixStats_1.3.0    vctrs_0.6.5          Matrix_1.6-5         jsonlite_1.8.8       bookdown_0.39        hms_1.1.3            bit64_4.0.5         
[55] jquerylib_0.1.4      glue_1.7.0           codetools_0.2-20     distributional_0.4.0 stringi_1.8.4        gtable_0.3.5         QuickJSR_1.1.3       munsell_0.5.1        pillar_1.9.0        
[64] htmltools_0.5.8.1    Brobdingnag_1.2-9    R6_2.5.1             vroom_1.6.5          evaluate_0.23        lattice_0.22-6       highr_0.11           backports_1.5.0      bslib_0.7.0         
[73] rstantools_2.4.0     coda_0.19-4.1        gridExtra_2.3        nlme_3.1-164         checkmate_2.3.1      xfun_0.44            pkgconfig_2.0.3

The R package brms provides a flexible framework to implement mixture models and many other types of probability models (Bürkner, 2017, 2018). brms has a mixture function to specify mixtures of various types of distributions. These could be Gaussian, skewed-Normal, shifted-Normal, ex-Gaussian etc. and combinations thereof. There is a large number of probability models for continuous data that are plausible candidates (for reaction time data see e.g. Matzke & Wagenmakers, 2009). In this tutorial we use Stan for mixture modelling to demonstrate the process and because the publication associated with this tutorial is fully using Stan instead of brms.↩︎
Instructions on how to install rstan can be found here.↩︎
Run install.packages("tidyverse") for installation.↩︎
Run install.packages("brms") for installation.↩︎
If you want to check how many cores are available on your machine, run parallel::detectCores() (you might need to install the package parallel).↩︎
To install bayestestR run install.packages("bayestestR").↩︎
For brms fits the hypothesis function calculates the BFs indicated as evidence ratio in the output. For a model such as y ~ 1 + condition, hypothesis can calculate the evidence for the hypothesis that the effect for condition is, for example, equal to 0. hypothesis can then be applied to the model with a statement that describes the hypothesis that should be evaluated, for example hypothesis(brm_model, "conditionb = 0") for a null hypothesis. To calculate the evidence ratio the argument sample_prior = TRUE (or sample_prior = 'yes') needs to be specified in brm.↩︎