Walkthrough: fitting a mixture model on copy-task data

This is a step-by-step walk-through that describes how to fit a finite mixture model of two log-Normal distributions using the statistical program R and the rstan package to interface with the probabilistic programming language Stan (Carpenter et al., 2016; Hoffman & Gelman, 2014; Stan Development Team, 2015a, 2015b).1 This walkthrough relates to Roeser et al. (2021).

The model returns population estimates for the average typing speed of fluent keystroke transitions (\(\beta\)), the slowdown for hesitant transitions (\(\delta\)) and the frequency of hesitant transitions (\(\theta\)) for each level of a categorical predictor after taking into account three sources of random error: (1) each participant has an individual fluent typing speed that differs across levels of the categorical predictor; (2) each participant has in individual hesitation frequency that differs across levels of the categorical predictor; (3) keystroke intervals for some bigrams are slower than for others.

This guide shows how to fit a mixture model written in Stan to copy-task data to then calculate the differences between two copy-task components for three parameters of interest and by-participant estimates. This walkthrough is largely self-contained. To run the code below, the reader only needs to install the required packages. Data and Stan code can be loaded from the code presented below.

We require two packages: (1) rstan to use R to interface with Stan for fitting Bayesian models2; (2) tidyverse for data processing and visualisation.3

library(rstan)
library(tidyverse)

Fitting the model

Preparing the data

First we load the keystroke data used in the manuscript. Data can be downloaded from OSF using the URL as below and the read_csv function. For brevity we use a subset of the sample. Data are from participants that each completed the consonants and LF-bigrams component of the copy task (Van Waes et al., 2019, 2021).

# Load data
data <- read_csv("https://osf.io/e5gy6/download") %>%
  # take first 50 ppts and only IKIs larger than 10 msecs
  filter(ppts %in% 1:50, IKI > 10) 

We reduced the data to four variables: a participant identifier ppts, a bigram identifier bigram, the inter-keystroke interval IKI, and the copy task component (levels: Consonants, LF).

data
# A tibble: 2,071 x 4
    ppts bigram   IKI component 
   <dbl> <chr>  <dbl> <chr>     
 1     1 tj       330 Consonants
 2     1 jx       278 Consonants
 3     1 xg       246 Consonants
 4     1 gf       297 Consonants
 5     1 fl       168 Consonants
 6     1 pg       270 Consonants
 7     1 gk       236 Consonants
 8     1 kf       440 Consonants
 9     1 fk      1784 Consonants
10     1 kq       663 Consonants
# … with 2,061 more rows

The data must be transformed into a list to feed them into Stan. The information required from the data is determined in the Stan code and has to do with what we need to estimate the model parameters and how the model is implemented. The exert from the Stan code below shows which information is expected as input (e.g. int is integer), how they are named, and what the smallest (i.e. lower) and largest possible value (i.e. upper) are. The full Stan code can be found at the end of this document.

// Do not run
// Data chunk of Stan code
data {
  int<lower=1> N;                    // number of observations
  int<lower=1> nS;                   // number of ppts
  int<lower=1, upper=nS> ppts[N];    // ppts identifier
  int<lower=1> nB;                   // number of bigrams
  int<lower=1, upper=nB> bigrams[N]; // ppts identifier
  int<lower=1> K;                    // number of conditions
  int condition[N];                  // condition identifiers
  vector[N] y;                       // outcome
}

Depending on the data we might have different numbers of participants, bigrams, and conditions. The Stan code is using indices to be able to fit varying numbers for each. In the code below we use factor in combination with as.numeric to ensure that there are no empty indices. We create vectors with participant identifiers ppts (numbers \(1\) through \(I\) where \(I\) is the number of participants), bigram identifiers bigrams (numbers 1 through \(J\) where \(J\) is the maximum number of bigrams), and a numeric identifier components for each copy task component (levels: consonants = 1, LF = 2). The returned values are indices for the parameters in the model: For example, for the components identifier, beta[1] is the population estimate for non-hesitant typing in the consonants task, beta[2] is same for LF-bigram; theta[1] and theta[2] are the disfluency probability for the consonants and LF-bigrams task, respectively.

ppts <- as.numeric(factor(data$ppts))
bigrams <- as.numeric(factor(data$bigram))
condition <- as.numeric(factor(data$component))

In the below code, the names on the left side of the arrow must correspond to the names expected in the Stan code (above); names on the right side do not. We assign the participant, bigram, and component identifiers created above and their maximum values nS, nB, K, respectively. The keystroke data IKI are assigned to y and the total number of observations (number of participants \(\times\) number of bigrams produced by participant) is assigned to N (number of rows in the data nrow(data).

data_list <- within( list(), {
  ppts <- ppts  
  nS <- max(ppts) # max no of ppts
  bigrams <- bigrams 
  nB <- max(bigrams) # max no of bigrams
  condition <- condition
  K <- max(condition) # no of conditions
  y <- data$IKI
  N <- nrow(data)
} )

The information in the data list can be viewed using the glimpse function.

glimpse(data_list)
List of 8
 $ N        : int 2071
 $ y        : num [1:2071] 330 278 246 297 168 ...
 $ K        : num 2
 $ condition: num [1:2071] 1 1 1 1 1 1 1 1 1 1 ...
 $ nB       : num 179
 $ bigrams  : num [1:2071] 139 79 170 48 42 110 52 82 41 84 ...
 $ nS       : num 50
 $ ppts     : num [1:2071] 1 1 1 1 1 1 1 1 1 1 ...

Load model

The Stan code described in the manuscript was extended to fit more than one copy-task component (i.e. a categorical predictor with \(K\) levels). The code is stored in the file “MoG.stan” (i.e. mixture of Gaussians) and can be downloaded from the “walkthrough” folder on OSF. Alternatively, readLines and writeLines can be used to download the Stan code from OSF so it will be saved in R’s current working directory.

# Download Stan code
file <- readLines("https://osf.io/j6dzt/download")
# Save Stan code to current working directory
writeLines(file, "MoG.stan")

The Stan model can be loaded for data fitting using the stan_model function and assign it to the object mog. The “MoG.stan” file needs to be available in the current working directory of R or the file path needs to be indicated.

mog <- stan_model(file = "MoG.stan")

This code fits the posterior of a categorical predictor with any number of levels. In other words, the posterior that model returns can be used to calculate simple effects, main effects and interactions of a factorial design. We demonstrate below how to calculate a simple effect. The same logic can be used to calculate main effects and interactions. Also, participant estimates are calculated for both mixture components and the mixing proportions by condition. For including more predictors, or removing, e.g., random error terms, or adjusting priors, the user will have to work directly in the Stan code.

This Stan code used is largely based on Sorensen et al. (2016) and Vasishth et al. (2017). Sorensen et al. (2016) presents a detailed tutorial on how to fit Stan models (see also Lambert, 2018).

Initiate sampling

For the model to converge we need to run a sufficient number of iterations. 30,000 iterations, as below, are a lot but does not guarantee convergence. We need to test convergence using the model’s posterior (see blow). To test whether the model has converged, we need to run the model more than one time (i.e. different chains). These can be run at the same time (in parallel) using more than one core of your computer (three cores below).4 There is no need to use more cores and chains and using less cores would mean that at least one process has to wait until after other processes are completed. When the model has settled on a parameter values, we should observe that all three streams (chains) overlap. Running long chains (many iterations) is useful to get more accurate parameter estimates.

iterations <- 30000       # No of iterations
warmup <- 15000           # Warmup samples to be discarded
n_chains <- n_cores <- 3  # No of chains / cores used (one chain per core)

The model is fitting a number of parameters that are not relevant for our inference. To reduce the size of the R object that containts the posterior we can select the parameters of interest and disregard all paremters that are not relevant for our inference.

# Parameters to keep in output
pars <- c("beta",                        # fluent typing
          "delta",                       # disfluency slowdown
          "theta",                       # mixing proportion
          "beta_s",                      # by-participant typing speed
          "theta_s",                     # by-participant disfluency probability
          "sigma",                       # variance component
          "sigma_diff", "sigmap_e", "sigma_e",  # variance by mixture component
          "sigma_u", "sigma_w",          # variance for random ppt and bigram intercepts
          "log_lik",                     # log likelihood (for model comparison) 
          "y_tilde")                     # predicted data 

The sampling function applies the model to the data using the information specified above (iterations, warmup, chains, cores, parameters pars to be saved). save_warmup is set to FALSE to discard the warmup samples (which are not used for inference anyway) to reduce the size of the posterior. To allow reproducibility (and because Bayesian models involves random number generation) we set the seed. The seed can be any number but using the same number ensures the use of the same random number. Lastly, the control argument was specified with higher values for adapt_delta and max_treedepth: using higher values here mean the model runs slower but supports more careful parameter estimation. The model is assigned to the variable m.

# Fit model
m <- sampling(mog, 
              data = data_list,
              iter = iterations,
              warmup = warmup,
              chains = n_chains, 
              cores = n_cores,
              pars = pars, # Parameters to keep.
              save_warmup = FALSE, # Don't save the warmup samples.
              seed = 365, 
              control = list(adapt_delta = .96, # default: .9
                             max_treedepth = 12)) # default: 10

Running this model will take a while to complete sampling depending on your hardware specifications. The time it took my machine to complete this job can me viewed using get_elapsed_time. Therefore it is worth to not use all cores for this to run or to use a dedicated high performance machine.

get_elapsed_time(m)/60 # in mins
          warmup   sample
chain:1 180.4650 344.6883
chain:2 180.6383 186.8117
chain:3 178.6567  97.3185

Reusing model output

The output of the model (i.e. the posterior) can be saved, so we don’t need to run the model again (which admittedly can take a while). The function requires the name of the fitted model m. We prefer to keep the name of the output similar to the name of the Stan code used. The model is stored as compressed “.rda” file.

saveRDS(m, "MoG.rda", compress = "xz")

The model can be load into the environment of R using readRDS.

m <- readRDS("MoG.rda")

Model diagnostics

Convergence

Model convergence can be established in two practically simple techniques. First, we can inspect the chains in trace plots which show the parameter estimate across iterations (here after warmup) for each chain. We will look at the population-level parameters (names are determined by the model) corresponding to the Greek symbols used in the manuscript and assign those to pars.

pars <- c("beta", "delta", "theta", "sigma", "sigmap_e", "sigma_e")

To create trace plots, we can apply the stan_trace function to the model m and extract the MCMC chains for the parameters in pars. The alpha argument makes the colours slightly more transparent. If the chains overlap and look like “fat hairy caterpillars,” chains have converged on the same target distribution.

# Check convergence
stan_trace(m, pars = pars, alpha = .5)

Second, we can calculate the \(\hat{R}\) statistic. Successful convergence is reflected in \(\hat{R}\) values smaller than 1.05 (Gelman & Rubin, 1992). \(\hat{R}\) is similar to the F statistic: it tells us how much bigger the variability between chains is compared to the variability within chains. A value of \(\approx 1\) indicates that the variability is essentially identical signifying convergence. For this we can use the rhat function applied to the model and the population parameters.

summary(m, pars=pars)$summary %>% as.data.frame() %>% 
  rownames_to_column("parameter") %>% 
  select(parameter, Rhat)
     parameter     Rhat
1      beta[1] 1.000328
2      beta[2] 1.001846
3     delta[1] 1.000258
4     delta[2] 1.000046
5     theta[1] 1.000845
6     theta[2] 1.000266
7     sigma[1] 1.000010
8     sigma[2] 1.000468
9  sigmap_e[1] 1.000094
10 sigmap_e[2] 1.000404
11  sigma_e[1] 1.000283
12  sigma_e[2] 1.000340

Convergence problems can have many reasons and therefore many solutions: running longer chains (in particular longer warmups), increasing the maximum treedepth and the average acceptance probability (adapt_delta), specifying starting values, adjusting priors, constraining parameters, or changing the parametrisation of the model. Severe convergence problems are indicative of a misspecification of the model.

Posterior predictive check

The model used the posterior parameter values to simulate hypothetical data sets. This happened for every iteration for every chains.

We can draw predicted data from the model output and compare these to the observed data. A model that makes reasonable predictions should fit the observed data.

Using the as.matrix function we extract a matrix of y_tilde, the simulated data. This returns a matrix with the size total_samples \(\times\) nrow(data) (so 45,000 \(\times\) 2,071). We use the sample function to draw N randomly sampled hypothetical data sets. The ppc_dens_oberlay function from the bayesplot package is then mapping the observed data (\(y\); thick blue line) to the simulated data (\(y_{rep}\); thin lightblue lines).

y_tilde <- as.matrix(m, pars = "y_tilde")          # extract simulated data sets
N <- 50                                            # number of simulations to use
total_samples <- (iterations - warmup) * n_chains  # total number of samples
rnd_sims <- sample(total_samples, N)               # created random indices
y_tilde_sample <- y_tilde[rnd_sims,]               # draw N random simulations 
library(bayesplot)
ppc_dens_overlay(data_list$y, y_tilde_sample) +
  scale_x_continuous(limits = c(0, 3000))

Posterior probability distribution

At the core of Bayesian inference is the posterior probability distribution. For each model parameter we have 45,000 posterior samples that form a posterior probability distribution. These samples are the results of the number of iterations after warmup for all chains.

(total_samples <- (iterations - warmup) * n_chains)
[1] 45000

The distribution of parameter values represents the uncertainty about parameter values given the data. There is a large range of things one can do with a posterior. Below we will focus on summarising parameter estimates, comparing conditions, and extracting by-participant estimates.

The names function can be used to remind us of the parameter names used in the model (we reduced the output to the first 15 parameter names). The meaning of the parameters is described in the manuscript and can be obtained from the Stan code. Indices refer to the copy-task component (1 = consonants; 2 = LF bigrams) and to the participant identifier for by-participant parameters (indicated by _s).

names(m)[1:15] 
 [1] "beta[1]"     "beta[2]"     "delta[1]"    "delta[2]"    "theta[1]"   
 [6] "theta[2]"    "beta_s[1,1]" "beta_s[2,1]" "beta_s[1,2]" "beta_s[2,2]"
[11] "beta_s[1,3]" "beta_s[2,3]" "beta_s[1,4]" "beta_s[2,4]" "beta_s[1,5]"

Parameter-value estimates

The parameter estimates can be visualised using the stan_hist function with the model variable m and the to-be visualised parameters pars as arguments. The function returns a histogram for each other parameter values. Each histogram is the probability distribution of the parameter values indicating which values are more or less probable to be the true parameter value.

stan_hist(m, pars = pars)

The posterior samples of the parameter values can be summarised using the print function. The probs argument requires the lower and upper bound of the probability interval that we are interested in. A lower bound of .025 and an upper bound of .975 gives the 95% probability interval (PI; also called “credible intervals”), i.e. the range that contains the true parameter value with a 95% probability given the data.

The output summaries the most probable parameter value as mean with its standard error (se_mean) and standard deviation (sd), the effective sample size (n_eff) indicating sampling efficiency and the convergence metric \(\hat{R}\) (Rhat) we introduced above.

print(m, pars = pars, probs = c(.025,.975))
Inference for Stan model: anon_model.
3 chains, each with iter=30000; warmup=15000; thin=1; 
post-warmup draws per chain=15000, total post-warmup draws=45000.

            mean se_mean   sd 2.5% 97.5% n_eff Rhat
beta[1]     5.79       0 0.10 5.60  6.00  3526    1
beta[2]     5.58       0 0.08 5.41  5.74  4228    1
delta[1]    1.05       0 0.10 0.80  1.20  2861    1
delta[2]    0.32       0 0.08 0.17  0.48 16644    1
theta[1]    0.47       0 0.09 0.31  0.65  1949    1
theta[2]    0.27       0 0.06 0.16  0.40  5993    1
sigma[1]    0.45       0 0.02 0.41  0.50  4462    1
sigma[2]    0.55       0 0.04 0.49  0.64  5828    1
sigmap_e[1] 0.56       0 0.06 0.46  0.68  3287    1
sigmap_e[2] 0.83       0 0.06 0.73  0.97  6869    1
sigma_e[1]  0.34       0 0.03 0.28  0.39  4526    1
sigma_e[2]  0.27       0 0.02 0.23  0.31  6235    1

Samples were drawn using NUTS(diag_e) at Mon May 10 07:43:47 2021.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

For the following steps, we will focus on the three parameters that have conceptually interesting interpretations: (1) the average fluent typing speed \(\beta\), (2) the disfluency slowdown \(\delta\), and (3) the disfluency probability \(\theta\).

pars <- c("beta", "delta", "theta")

The plot function shows the posterior probability distribution of the three parameters for the consonants task (indicated as 1) and the LF-bigrams task (indicated as 2) summarised as median and 95% PI.

plot(m, pars = pars, ci_level = .95) # ci = credible interval

The values for \(\beta\) and \(\delta\) are shown on a log-scale. To transform their values back to msecs we can extract the posterior samples using the as.data.frame function. We prefer the use of tibble objects.

posterior <- as.data.frame(m, pars) %>% as_tibble() 
posterior
# A tibble: 45,000 x 6
   `beta[1]` `beta[2]` `delta[1]` `delta[2]` `theta[1]` `theta[2]`
       <dbl>     <dbl>      <dbl>      <dbl>      <dbl>      <dbl>
 1      5.81      5.63       1.09      0.240      0.523      0.377
 2      5.88      5.62       1.07      0.335      0.400      0.297
 3      5.83      5.58       1.10      0.283      0.440      0.375
 4      5.92      5.67       1.05      0.271      0.356      0.313
 5      5.90      5.65       1.06      0.269      0.390      0.359
 6      5.85      5.66       1.09      0.281      0.472      0.267
 7      5.87      5.55       1.15      0.278      0.420      0.304
 8      5.74      5.51       1.04      0.387      0.391      0.311
 9      5.75      5.54       1.12      0.353      0.417      0.287
10      5.63      5.52       1.06      0.265      0.415      0.280
# … with 44,990 more rows

The pivot_longer function is transforming the data above to a long format with an additional column for component and the model parameters kept as columns. The names_pattern argument is using a regular expression to extract the number in the squared brackets.

posterior_long <- pivot_longer(posterior, everything(), 
             names_to = c(".value", "component"), 
             names_pattern = "(.*)\\[(.)\\]") 

posterior_long
# A tibble: 90,000 x 4
   component  beta delta theta
   <chr>     <dbl> <dbl> <dbl>
 1 1          5.81 1.09  0.523
 2 2          5.63 0.240 0.377
 3 1          5.88 1.07  0.400
 4 2          5.62 0.335 0.297
 5 1          5.83 1.10  0.440
 6 2          5.58 0.283 0.375
 7 1          5.92 1.05  0.356
 8 2          5.67 0.271 0.313
 9 1          5.90 1.06  0.390
10 2          5.65 0.269 0.359
# … with 89,990 more rows

This code is then transforming beta and delta to msecs using the exponential function exp to un-log the values. In order to transform the slowdown delta into msecs we need to add beta before using the exponential function; we can then subtract beta again. The recode function changes the component indices from 1 to “Consonants” and 2 to “LF bigrams.”

posterior_in_msecs <- mutate(posterior_long,
                             delta = exp(beta + delta) - exp(beta),
                             beta = exp(beta),
                             component = recode(component, `1` = "Consonants",
                                                           `2` = "LF bigrams"))
posterior_in_msecs
# A tibble: 90,000 x 4
   component   beta delta theta
   <chr>      <dbl> <dbl> <dbl>
 1 Consonants  334. 656.  0.523
 2 LF bigrams  279.  75.8 0.377
 3 Consonants  357. 678.  0.400
 4 LF bigrams  275. 109.  0.297
 5 Consonants  340. 687.  0.440
 6 LF bigrams  266.  87.1 0.375
 7 Consonants  373. 692.  0.356
 8 LF bigrams  290.  90.0 0.313
 9 Consonants  364. 688.  0.390
10 LF bigrams  283.  87.3 0.359
# … with 89,990 more rows

The posterior distribution of the parameter estimates can then be visualised in, for example, histograms.

pivot_longer(posterior_in_msecs, beta:theta, names_to = "parameter") %>%
  ggplot(aes(x = value, colour = component, fill = component)) +
  geom_histogram(position = "identity", alpha = .25) +
  facet_wrap(~parameter, scales = "free", labeller = label_parsed) +
  scale_fill_brewer("Copy-task\ncomponents", palette = "Dark2") +
  scale_color_brewer("Copy-task\ncomponents", palette = "Dark2") +
  labs(x = "Posterior parameter estimate")

Difference between copy-task components

We can calculate the differences between the copy-task components for each parameter value. This is giving an indication of whether fluent typing (\(\beta\)) is influenced by the copy task, maybe the size of the disfluency slowdown (\(\delta\)) is impacted by the task, or the disfluency probability (\(\theta\)).

To determine these differences we change the data format above and create a variable that indicates the parameter and one column for the corresponding values for each copy-task component. The last line calculates the difference between the consonants task and the LF-bigrams task for each of the three parameter values.

posterior_by_component <- posterior_in_msecs %>%
  pivot_longer(beta:theta, names_to = "parameter") %>%
  group_by(component) %>% mutate(id = row_number()) %>%
  pivot_wider(names_from = component, values_from = value) %>%
  select(-id) %>% # drop id column
  mutate(diff = Consonants - `LF bigrams`,  # calculate the difference between copy-task components
         parameter = paste0(parameter, "[diff]")) 

posterior_by_component
# A tibble: 135,000 x 4
   parameter   Consonants `LF bigrams`     diff
   <chr>            <dbl>        <dbl>    <dbl>
 1 beta[diff]     334.         279.     54.6   
 2 delta[diff]    656.          75.8   580.    
 3 theta[diff]      0.523        0.377   0.146 
 4 beta[diff]     357.         275.     81.9   
 5 delta[diff]    678.         109.    569.    
 6 theta[diff]      0.400        0.297   0.103 
 7 beta[diff]     340.         266.     74.2   
 8 delta[diff]    687.          87.1   600.    
 9 theta[diff]      0.440        0.375   0.0647
10 beta[diff]     373.         290.     83.4   
# … with 134,990 more rows

The difference between the conditions can be summarised using the mean, the 95% PIs and the probability that the difference between the components is small 0 (indicated as e.g. \(P(\hat{\beta} <0)\) for the parameter \(\beta\); the hat \(\hat{.}\) symbol indicates that the value is an estimate of the population parameter value). This summary tells us whether the difference between the consonants task and the LF-bigrams task is different from zero and the most probable value for the difference between tasks.

posterior_by_component %>%
  group_by(parameter) %>%                      # group by parameters
  summarise(mean = mean(diff),                 # mean difference
            lower = quantile(diff, .025),      # 2.5% lower bound of difference
            upper = quantile(diff, .975),      # 97.5% upper bound of difference
            p = mean(diff < 0)) %>%            # prob that difference is negative
  mutate(across(where(is.numeric), round, 2))  # round after 2nd decimal place
# A tibble: 3 x 5
  parameter    mean  lower  upper     p
  <chr>       <dbl>  <dbl>  <dbl> <dbl>
1 beta[diff]   64.4  -5.64 144.    0.04
2 delta[diff] 513.  282.   745.    0   
3 theta[diff]   0.2  -0.01   0.41  0.03

These differences can be viewed in histograms. The vertical line indicates a difference between tasks corresponding to a value of zero. The area of the histogram to the left of this line corresponds to the probability p to observe a value smaller than zero as in the output above. We observe difference between copy-task components for all three parameters. Fluent typing is 64 msecs (95% PI: -6, 144) slower for the consonants task, and hesitations are 513 msecs (95% PI: 282, 745) longer and 0.2 (\(\approx\) 20%; 95% PI: -0.01, 0.41) times more likely. Interestingly fluent typing (i.e. planning and executing motor plans) is slower in the consonant task even after modeling hesitations as a separate process.

ggplot(posterior_by_component, aes(x = diff)) +
  geom_histogram(alpha = .5) +
  geom_vline(xintercept = 0, colour = "red", linetype = "dashed", size = .5) +
  facet_wrap(~parameter, scales = "free", labeller = label_parsed) +
  labs(x = "Difference between copy-task components")

In practice we rarely believe that only an effect of exactly null is indicating that there is no difference between our groups. Instead there are values that are non-different from null; for example a slowdown of 1 msecs is meaningless in most, if not all, contexts. To take this into account we can determine the region of practical equivalence (ROPE, Kruschke, 2010, 2011, 2014). We can define the ROPE as the range of values that is a priori considered non-different from a null effect. The ROPE value indicates to what extent the posterior can not rule out a negligible effect. A meaningful difference between groups (or effects more generally) should have a small proportion of posterior samples within the ROPE.

To calculate the ROPE for out differences between the two copy-task components we can use the rope function from the bayestestR package.5 The rope function requires as input a vector with the posterior difference. We can calculate a vector with posterior difference for each of the three parameters, indicated as \(\beta_\text{diff}\), \(\delta_\text{diff}\), and \(\theta_\text{diff}\).

# Fluent typing
beta_diff <- filter(posterior_by_component, parameter == "beta[diff]") %>% pull(diff)

# Slowdown for hesitant typing
delta_diff <- filter(posterior_by_component, parameter == "delta[diff]") %>% pull(diff)

# Proportion of hesitations
theta_diff <- filter(posterior_by_component, parameter == "theta[diff]") %>% pull(diff)

Also, we need to define the range argument which specifies the range of values are not considerably different from null. Of course this range will depend on what the parameter represents and researcher intuition. For example, for fluent typing we would consider small differences, of say 10 msecs, more meaningful than for disfluencies. If we consider differences of 10 msecs or smaller equivalent to a null difference, we can define the range argument with a lower bound of -10 and the upper bound of 10.

library(bayestestR)
rope(beta_diff, range = c(-10, 10))
# Proportion of samples inside the ROPE [-10.00, 10.00]:

inside ROPE
-----------
6.72 %     

The ROPE for fluent typing speed contains 6.72% of the posterior samples. In other words, there is a 6.72% probability that the posterior difference \(\beta_\text{diff}\) is between -10 and 10 msecs which we considered negligible.

For disfluencies we might want to consider a larger range of values as negligible differences between two tasks (also because the variability in disfluencies is larger than the variability for fluent key transitions). Even if we define a ROPE of -250 through 250 msecs, there are no posterior samples inside the ROPE indicating that the difference in typing hesitations are related to difficulty that arise prior to processing on the motor level.

rope(delta_diff, range = c(-250, 250))
# Proportion of samples inside the ROPE [-250.00, 250.00]:

inside ROPE
-----------
0.00 %     

As for the disfluency probably, we need to bear in mind that estimates for \(\theta\) are bound between 0 and 1. Hence, the difference \(\theta_\text{diff}\) cannot ever exceed -1 and 1. If we consider a difference of 5% or smaller as negligible, we can define the ROPE as ranging from -.05 through .05. We observe that 6.37% of the posterior difference \(\theta_\text{diff}\) is within the ROPE. In other words, a large proportion of posterior samples is supporting a meaningful difference between the two copy-task components. We observe that hesitations are largely more frequent in the consonants task.

rope(theta_diff, range = c(-.05, .05))
# Proportion of samples inside the ROPE [-0.05, 0.05]:

inside ROPE
-----------
6.37 %     

The Stan code presented in this walk through can be used to estimate the posterior for two and more conditions: We can calculate the difference between conditions from the posterior as well as main effects and interactions for more complex factorial designs as any factorial design can be reduced to a single variable.6

By-participant estimates

In some contexts we would like to obtain by-participant estimates of the average fluent typing-speed of a participant after accounting for disfluencies (i.e. \(\beta_s\)); by-participant disfluency information can be estimated the proportion of disfluencies (i.e. \(\theta_s\)). These estimates inform us about the fluent typing speed but also about the prevalence of typing hesitations.

As before we can extract the posterior from the model m. This time we use the parameters that stored by-participant estimates beta_s and theta_s corresponding to the population estimates beta and theta.

pars <- c("beta_s", "theta_s")
posterior_ppts <- as.data.frame(m, pars = pars) %>% as_tibble()
names(posterior_ppts)[1:10]
 [1] "beta_s[1,1]" "beta_s[2,1]" "beta_s[1,2]" "beta_s[2,2]" "beta_s[1,3]"
 [6] "beta_s[2,3]" "beta_s[1,4]" "beta_s[2,4]" "beta_s[1,5]" "beta_s[2,5]"

This posterior tibble has the format “\(<\text{parameter name}>\text{_s}[<\text{component id}>, <\text{participant id}>]\)” where indices indicate the copy-task component and the participant identifier of the parameters. The following code is converting the betas to msecs, creates a long format with columns parameter with the names above as levels and a value column with their respective values. The parameter column is then separated into parameter (levels: beta, theta), component (levels: 1, 2), and participant (an index for each participant) using “,” to separate the three variables. To use “,” as separator we replace "_s" with “,” in the line before. Copy-task component names were changed as before. A preview is below.

posterior_ppts_long <- posterior_ppts %>%
  # use exp() on everything starting with "beta"
  mutate(across(starts_with("beta"), exp)) %>%             
  pivot_longer(everything(), names_to = "parameter") %>%
  # replace "_s" with "," 
  mutate(parameter = gsub("_s", ",", parameter)) %>%
  # separate into parameter, component, ppt using "," as separator
  separate(parameter, into = c("parameter", "component", "participant")) %>% 
  mutate(component = recode(component, `1` = "Consonants", `2` = "LF bigrams")) %>%
  group_by(parameter) %>% mutate(id = row_number()) %>%
  pivot_wider(names_from = parameter, values_from = value) %>% 
  select(-id) # drop id

posterior_ppts_long
# A tibble: 4,500,000 x 4
   component  participant  beta theta
   <chr>      <chr>       <dbl> <dbl>
 1 Consonants 1            323. 0.309
 2 LF bigrams 1            295. 0.292
 3 Consonants 2            299. 0.514
 4 LF bigrams 2            305. 0.223
 5 Consonants 3            381. 0.632
 6 LF bigrams 3            348. 0.332
 7 Consonants 4            469. 0.509
 8 LF bigrams 4            301. 0.281
 9 Consonants 5            357. 0.301
10 LF bigrams 5            254. 0.244
# … with 4,499,990 more rows

We can then summarise by-participant estimates for each component and parameter with the most probable parameter value and the 95% PI as before.

posterior_ppts_long_summary <- posterior_ppts_long %>%
  pivot_longer(beta:theta, names_to = "parameter", values_to = "value") %>%
  group_by(parameter, component, participant) %>%
  summarise(mean = mean(value),
            lower = quantile(value, .025),
            upper = quantile(value, .975))

posterior_ppts_long_summary
# A tibble: 200 x 6
# Groups:   parameter, component [4]
   parameter component  participant  mean lower upper
   <chr>     <chr>      <chr>       <dbl> <dbl> <dbl>
 1 beta      Consonants 1            320.  246.  408.
 2 beta      Consonants 10           268.  192.  348.
 3 beta      Consonants 11           269.  197.  358.
 4 beta      Consonants 12           290.  230.  362.
 5 beta      Consonants 13           493.  300.  644.
 6 beta      Consonants 14           427.  318.  562.
 7 beta      Consonants 15           338.  268.  421.
 8 beta      Consonants 16           295.  222.  380.
 9 beta      Consonants 17           263.  204.  333.
10 beta      Consonants 18           303.  223.  411.
# … with 190 more rows

To get an idea how the by-participant estimates look like, we extracted participants 1 through 5 for visualisation. From this subset we see that, although we observed overall larger population estimates for the consonants task, some participants do not show a difference in fluent typing speed and a similar frequency of hesitations (e.g. participant 2) or even slightly more hesitations in the LF-bigrams task (e.g. participant 1).

posterior_ppts_long_summary %>%
  filter(participant %in% 1:5) %>% # only participants 1 through 5
  ggplot(aes(x = participant, 
             y = mean, ymin = lower, ymax = upper,
            colour = component)) +
  geom_pointrange(position = position_dodge(.75)) +
  facet_wrap(~parameter, scales = "free", labeller = label_parsed) +
  coord_flip() +
  labs(x = "Participant id", y = "Parameter estimates") +
  scale_colour_brewer("Copy-task\ncomponent", palette = "Dark2")

Finally, we can calculate and visualise the by-participant differences.

posterior_ppts_long_diff_summary <- posterior_ppts_long %>% group_by(component) %>% 
  mutate(id = row_number()) %>%
  pivot_wider(names_from = component, 
              values_from = c(beta, theta)) %>%
  pivot_longer(-c(participant, id),
             names_to = c("parameter", ".value"), 
             names_pattern = "(.*)_(.*)") %>%
  mutate(diff = Consonants - `LF bigrams`,
         parameter = paste0(parameter, "[diff]")) %>% 
  select(participant, parameter, diff) %>%
  group_by(parameter, participant) %>%
  summarise(mean = mean(diff),
            lower = quantile(diff, .025),
            upper = quantile(diff, .975))

posterior_ppts_long_diff_summary
# A tibble: 100 x 5
# Groups:   parameter [2]
   parameter  participant  mean    lower upper
   <chr>      <chr>       <dbl>    <dbl> <dbl>
 1 beta[diff] 1            72.0  -21.5   172. 
 2 beta[diff] 10           58.4  -26.0   146. 
 3 beta[diff] 11          -16.3 -122.     88.7
 4 beta[diff] 12           47.4  -33.8   133. 
 5 beta[diff] 13          138.   -51.9   301. 
 6 beta[diff] 14           38.5 -105.    192. 
 7 beta[diff] 15          118.    33.0   209. 
 8 beta[diff] 16           55.6  -35.5   152. 
 9 beta[diff] 17           70.1   -0.104 147. 
10 beta[diff] 18           78.3  -11.5   189. 
# … with 90 more rows

For simplicity we show the differences for the first 10 participants only. This calculation and visualisation allows researchers to identify participants that do or do not differ in terms of fluent typing speed or hesitation frequency, in our example for tasks that do and do not involve lexical information.

posterior_ppts_long_diff_summary %>%
  filter(participant %in% 1:10) %>% # only participants 1 through 10
  ggplot(aes(x = factor(as.numeric(participant)), 
             y = mean, ymin = lower, ymax = upper)) +
  geom_hline(yintercept = 0, colour = "red", linetype = "dashed") +
  geom_pointrange(position = position_dodge(.75)) +
  facet_wrap(~parameter, scales = "free", labeller = label_parsed) +
  coord_flip() +
  labs(x = "Participant id", y = "Difference between copy-task components") +
  scale_colour_brewer("Copy-task\ncomponent", palette = "Dark2")

References

Bürkner, P.-C. (2017). brms: An R package for Bayesian multilevel models using Stan. Journal of Statistical Software, 80(1), 1–28.
Bürkner, P.-C. (2018). Advanced Bayesian multilevel modeling with the R package brms. The R Journal, 10(1), 395–411. https://doi.org/10.32614/RJ-2018-017
Carpenter, B., Gelman, A., Hoffman, M. D., Lee, D., Goodrich, B., Betancourt, M., Brubaker, M. A., Guo, J., Li, P., & Riddell, A. (2016). Stan: A probabilistic programming language. Journal of Statistical Software, 20.
Gelman, A., & Rubin, D. B. (1992). Inference from iterative simulation using multiple sequences. Statistical Science, 7(4), 457–472.
Hoffman, M. D., & Gelman, A. (2014). The No-U-Turn sampler: Adaptively setting path lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research, 15(1), 1593–1623.
Kruschke, J. K. (2010). What to believe: Bayesian methods for data analysis. Trends in Cognitive Sciences, 14(7), 293–300.
Kruschke, J. K. (2011). Bayesian assessment of null values via parameter estimation and model comparison. Perspectives on Psychological Science, 6(3), 299–312.
Kruschke, J. K. (2014). Doing bayesian data analysis: A tutorial with R, JAGS, and Stan (2nd ed.). Academic Press.
Lambert, B. (2018). A student’s guide to Bayesian statistics. Sage.
Matzke, D., & Wagenmakers, E.-J. (2009). Psychological interpretation of the ex-Gaussian and shifted Wald parameters: A diffusion model analysis. Psychonomic Bulletin & Review, 16(5), 798–817.
Roeser, J., De Maeyer, S., Leijten, M., & Van Waes, L. (2021). Modelling typing disfluencies as finite mixture process. Reading and Writing. https://osf.io/y3p4d/
Sorensen, T., Hohenstein, S., & Vasishth, S. (2016). Bayesian linear mixed models using Stan: A tutorial for psychologists, linguists, and cognitive scientists. Quantitative Methods for Psychology, 12(3), 175–200.
Stan Development Team. (2015a). Stan: A C++ library for probability and sampling. http://mc-stan.org/.
Stan Development Team. (2015b). Stan modeling language user’s guide and reference manual. http://mc-stan.org/.
Van Waes, L., Leijten, M., Pauwaert, T., & Van Horenbeeck, E. (2019). A multilingual copy task: Measuring typing and motor skills in writing with inputlog. Journal of Open Research Software, 7(30), 1–8.
Van Waes, L., Leijten, M., Roeser, J., Olive, T., & Grabowski, J. (2021). Measuring and assessing typing skills in writing research. Journal of Writing Research, 13(1), 107–153. https://doi.org/10.17239/jowr-2021.13.01.04
Vasishth, S., Chopin, N., Ryder, R., & Nicenboim, B. (2017). Modelling dependency completion in sentence comprehension as a Bayesian hierarchical mixture process: A case study involving Chinese relative clauses. ArXiv e-Prints.

Stan code

/*
  Mixture model for disfluencies
  With by-ppt mixing proportion
  Random intercepts for ppts
  Random intercepts for bigrams
  By-ppts slope adjustments
*/

data {
  int<lower=1> N;                    // number of observations
  int<lower=1> nS;                   // number of ppts
  int<lower=1, upper=nS> ppts[N];    // ppts identifier
  int<lower=1> nB;                   // number of bigrams
  int<lower=1, upper=nB> bigrams[N]; // ppts identifier
  int<lower=1> K;                    // number of conditions
  int condition[N];                  // vector with condition levels
  vector[N] y;                       // outcome
}


parameters {
  real beta_mu;                   // fluent typing
  vector[K] beta_raw;
  real<lower=0> beta_sigma;

  vector<lower = 0>[K] delta;    // slowdown for hesitations

  vector[K] alpha;               // mixing proportion
  real tau;
  matrix[K,nS] alpha_s;          // by-ppt mixing proportion

  vector<lower = 0>[K] sigma;    // residual error
  vector<lower=0>[K] sigma_diff; // difference between mixture components

  // For random effects
  vector<lower=0>[K] sigma_u;   // ppts slopes
  cholesky_factor_corr[K] L_u;  // variance-covariance matrix
  matrix[K,nS] z_u;             // random effects for participants
  
  vector[nB] w;                 // bigram intercepts
  real<lower=0> sigma_w;        // bigram sd
}

transformed parameters{
  vector[K] theta = 1 - inv_logit(alpha);         // convert mixing proportion to proportion of second component
  matrix[K,nS] theta_s = 1 - inv_logit(alpha_s);
  matrix[K,nS] log_alpha_s_1 = log_inv_logit(alpha_s);
  matrix[K,nS] log_alpha_s_2 = log1m_inv_logit(alpha_s);
  
  vector[K] beta = beta_mu + beta_sigma * beta_raw;

  vector<lower=0>[K] sigmap_e = sigma + sigma_diff; // variance for second mixture component is larger
  vector<lower=0>[K] sigma_e = sigma - sigma_diff;  // variance for first mixture component is smaller      

  matrix[K,nS] u = diag_pre_multiply(sigma_u, L_u) * z_u; // ppts random effects

  vector[N] mu;
  
  for(n in 1:N){
    mu[n] = beta[condition[n]] + u[condition[n], ppts[n]] + w[bigrams[n]];
  }
}

model {
  vector[2] lp_parts;

  // Priors
  beta_mu ~ normal(5, 2.5);
  beta_sigma ~ cauchy(0, 2.5);
  beta_raw ~ normal(0, 1);

  delta ~ normal(0, 1);

  for(s in 1:nS){
    for(k in 1:K){
     alpha_s[k,s] ~ normal(alpha[k], tau);  
    }
  }

  alpha ~ normal(0, 1);
  tau ~ cauchy(0, 1);

  sigma ~ cauchy(0, 2.5);
  sigma_diff ~ normal(0, 1);

  // REs priors
  L_u ~ lkj_corr_cholesky(2.0);
  to_vector(z_u) ~ normal(0,1);
  sigma_u ~ normal(0,2.5);      // ppts random effects
  
  sigma_w ~ normal(0,2.5);
  w ~ normal(0, sigma_w);       // bigram random effects

  // likelihood
  for(n in 1:N){
    lp_parts[1] = log_alpha_s_1[condition[n], ppts[n]] + lognormal_lpdf(y[n] | mu[n], sigma_e[condition[n]]); // fluent typing
    lp_parts[2] = log_alpha_s_2[condition[n], ppts[n]] + lognormal_lpdf(y[n] | mu[n] + delta[condition[n]], sigmap_e[condition[n]]); // hesitatant typing
    target += log_sum_exp(lp_parts);
  }
}

generated quantities{
  vector[2] lp_parts;
  real<lower=0,upper=1> prob_tilde; 

  vector[N] log_lik; // log likelihood
  vector[N] y_tilde; // predicted data
  vector[K] beta2 = beta + delta;    // second mixture component
  matrix[K,nS] beta_s;               // by-ppt fluent typing estimates

  for(s in 1:nS){
    for(k in 1:K){
      beta_s[k, s] = beta[k] + u[k, s];
    }
  }

  for(n in 1:N){
    // calculate log likelihood
    lp_parts[1] = log_alpha_s_1[condition[n], ppts[n]] + lognormal_lpdf(y[n] | mu[n], sigma_e[condition[n]]); // fluent typing
    lp_parts[2] = log_alpha_s_2[condition[n], ppts[n]] + lognormal_lpdf(y[n] | mu[n] + delta[condition[n]], sigmap_e[condition[n]]); // hesitant typing
    log_lik[n] = log_sum_exp(lp_parts);
    prob_tilde = bernoulli_rng(theta_s[condition[n], ppts[n]]); 
    // predict data
    if(prob_tilde) { 
      y_tilde[n] = lognormal_rng(mu[n] + delta[condition[n]], sigmap_e[condition[n]]);
    }
    else{
      y_tilde[n] = lognormal_rng(mu[n], sigma_e[condition[n]]);
    }
  }
}

Session info

sessionInfo()
R version 4.0.3 (2020-10-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.5 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.2.20.so

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
 [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] bayestestR_0.9.0   bayesplot_1.8.0    rstan_2.26.1       StanHeaders_2.26.1
 [5] forcats_0.5.1      stringr_1.4.0      dplyr_1.0.6        purrr_0.3.4       
 [9] readr_1.4.0        tidyr_1.1.3        tibble_3.1.2       ggplot2_3.3.4     
[13] tidyverse_1.3.0    knitr_1.33         rmdformats_1.0.1  

loaded via a namespace (and not attached):
 [1] httr_1.4.2         sass_0.3.1         jsonlite_1.7.2     modelr_0.1.8      
 [5] bslib_0.2.4        RcppParallel_5.0.3 assertthat_0.2.1   highr_0.9         
 [9] stats4_4.0.3       cellranger_1.1.0   yaml_2.2.1         pillar_1.6.1      
[13] backports_1.2.1    glue_1.4.2         digest_0.6.27      RColorBrewer_1.1-2
[17] rvest_1.0.0        colorspace_2.0-1   plyr_1.8.6         htmltools_0.5.1.1 
[21] pkgconfig_2.0.3    broom_0.7.7        haven_2.4.1        bookdown_0.21     
[25] scales_1.1.1       processx_3.4.5     farver_2.1.0       generics_0.1.0    
[29] ellipsis_0.3.2     withr_2.4.2        cli_2.5.0          magrittr_2.0.1    
[33] crayon_1.4.1       readxl_1.3.1       evaluate_0.14      ps_1.6.0          
[37] fs_1.5.0           fansi_0.5.0        xml2_1.3.2         pkgbuild_1.2.0    
[41] tools_4.0.3        loo_2.4.1          prettyunits_1.1.1  hms_1.0.0         
[45] lifecycle_1.0.0    matrixStats_0.59.0 V8_3.4.0           munsell_0.5.0     
[49] reprex_1.0.0       callr_3.5.1        compiler_4.0.3     jquerylib_0.1.3   
[53] rlang_0.4.11       ggridges_0.5.3     grid_4.0.3         rstudioapi_0.13   
[57] labeling_0.4.2     rmarkdown_2.7      gtable_0.3.0       codetools_0.2-16  
[61] inline_0.3.17      DBI_1.1.1          curl_4.3.1         reshape2_1.4.4    
[65] R6_2.5.0           gridExtra_2.3      lubridate_1.7.10   utf8_1.2.1        
[69] insight_0.14.0     stringi_1.6.2      parallel_4.0.3     Rcpp_1.0.6        
[73] vctrs_0.3.8        dbplyr_2.1.0       tidyselect_1.1.1   xfun_0.24         

  1. An alternative to writing models in Stan is the R package brms which provides a flexible framework to implement mixture models and many other types of probability models (Bürkner, 2017, 2018). In particular, brms has a mixture function to specify mixtures of various types of distributions. These could be Gaussian, skewed-Normal, shifted-Normal, ex-Gaussian etc. and combinations thereof. There is a large number of probability models for continuous data that are plausible candidates (for reaction time data see e.g. Matzke & Wagenmakers, 2009).↩︎

  2. Instructions on how to install rstan can be found here.↩︎

  3. Run install.packages("tidyverse") to install tidyverse on your computer.↩︎

  4. If you want to use R to check how many cores are available on your machine, run parallel::detectCores() (you might need to install the package parallel).↩︎

  5. To install bayestestR run install.packages("bayestestR").↩︎

  6. For example, say we have a 2 \(\times\) 2 factorial design with factor 1 having two levels AB and CD and factor 2 having the corresponding levels AC and BD. These two factors render four conditions A, B, C, D. From posterior samples for each level A through D we can calculate main effect 1 as \(\text{ME1}=(A+B) - (C+D)\), main effect 2 as \(\text{ME2}=(A+C) - (B+D)\), and their interaction as \(\text{Interaction}=(A-B) - (C-D)\) (or \(\text{Interaction}=(A-C) - (B-D)\)) and summary statistics as shown above.↩︎