Walkthrough: fitting a mixture model on copy-task data
This is a step-by-step walk-through that describes how to fit a finite mixture model of two log-Normal distributions using the statistical program R and the rstan
package to interface with the probabilistic programming language Stan (Carpenter et al., 2016; Hoffman & Gelman, 2014; Stan Development Team, 2015a, 2015b).1 This walkthrough relates to Roeser et al. (2021).
The model returns population estimates for the average typing speed of fluent keystroke transitions (\(\beta\)), the slowdown for hesitant transitions (\(\delta\)) and the frequency of hesitant transitions (\(\theta\)) for each level of a categorical predictor after taking into account three sources of random error: (1) each participant has an individual fluent typing speed that differs across levels of the categorical predictor; (2) each participant has in individual hesitation frequency that differs across levels of the categorical predictor; (3) keystroke intervals for some bigrams are slower than for others.
This guide shows how to fit a mixture model written in Stan to copy-task data to then calculate the differences between two copy-task components for three parameters of interest and by-participant estimates. This walkthrough is largely self-contained. To run the code below, the reader only needs to install the required packages. Data and Stan code can be loaded from the code presented below.
We require two packages: (1) rstan
to use R to interface with Stan for fitting Bayesian models2; (2) tidyverse
for data processing and visualisation.3
library(rstan)
library(tidyverse)
Fitting the model
Preparing the data
First we load the keystroke data used in the manuscript. Data can be downloaded from OSF using the URL as below and the read_csv
function. For brevity we use a subset of the sample. Data are from participants that each completed the consonants and LF-bigrams component of the copy task (Van Waes et al., 2019, 2021).
# Load data
<- read_csv("https://osf.io/e5gy6/download") %>%
data # take first 50 ppts and only IKIs larger than 10 msecs
filter(ppts %in% 1:50, IKI > 10)
We reduced the data to four variables: a participant identifier ppts
, a bigram identifier bigram
, the inter-keystroke interval IKI
, and the copy task component
(levels: Consonants, LF).
data
# A tibble: 2,071 x 4
ppts bigram IKI component
<dbl> <chr> <dbl> <chr>
1 1 tj 330 Consonants
2 1 jx 278 Consonants
3 1 xg 246 Consonants
4 1 gf 297 Consonants
5 1 fl 168 Consonants
6 1 pg 270 Consonants
7 1 gk 236 Consonants
8 1 kf 440 Consonants
9 1 fk 1784 Consonants
10 1 kq 663 Consonants
# … with 2,061 more rows
The data must be transformed into a list to feed them into Stan. The information required from the data is determined in the Stan code and has to do with what we need to estimate the model parameters and how the model is implemented. The exert from the Stan code below shows which information is expected as input (e.g. int is integer), how they are named, and what the smallest (i.e. lower) and largest possible value (i.e. upper) are. The full Stan code can be found at the end of this document.
// Do not run
// Data chunk of Stan code
data {int<lower=1> N; // number of observations
int<lower=1> nS; // number of ppts
int<lower=1, upper=nS> ppts[N]; // ppts identifier
int<lower=1> nB; // number of bigrams
int<lower=1, upper=nB> bigrams[N]; // ppts identifier
int<lower=1> K; // number of conditions
int condition[N]; // condition identifiers
// outcome
vector[N] y; }
Depending on the data we might have different numbers of participants, bigrams, and conditions. The Stan code is using indices to be able to fit varying numbers for each. In the code below we use factor
in combination with as.numeric
to ensure that there are no empty indices. We create vectors with participant identifiers ppts
(numbers \(1\) through \(I\) where \(I\) is the number of participants), bigram identifiers bigrams
(numbers 1 through \(J\) where \(J\) is the maximum number of bigrams), and a numeric identifier components
for each copy task component (levels: consonants = 1, LF = 2). The returned values are indices for the parameters in the model: For example, for the components identifier, beta[1]
is the population estimate for non-hesitant typing in the consonants task, beta[2]
is same for LF-bigram; theta[1]
and theta[2]
are the disfluency probability for the consonants and LF-bigrams task, respectively.
<- as.numeric(factor(data$ppts))
ppts <- as.numeric(factor(data$bigram))
bigrams <- as.numeric(factor(data$component)) condition
In the below code, the names on the left side of the arrow must correspond to the names expected in the Stan code (above); names on the right side do not. We assign the participant, bigram, and component identifiers created above and their maximum values nS
, nB
, K
, respectively. The keystroke data IKI
are assigned to y
and the total number of observations (number of participants \(\times\) number of bigrams produced by participant) is assigned to N
(number of rows in the data nrow(data
).
<- within( list(), {
data_list <- ppts
ppts <- max(ppts) # max no of ppts
nS <- bigrams
bigrams <- max(bigrams) # max no of bigrams
nB <- condition
condition <- max(condition) # no of conditions
K <- data$IKI
y <- nrow(data)
N } )
The information in the data list can be viewed using the glimpse
function.
glimpse(data_list)
List of 8
$ N : int 2071
$ y : num [1:2071] 330 278 246 297 168 ...
$ K : num 2
$ condition: num [1:2071] 1 1 1 1 1 1 1 1 1 1 ...
$ nB : num 179
$ bigrams : num [1:2071] 139 79 170 48 42 110 52 82 41 84 ...
$ nS : num 50
$ ppts : num [1:2071] 1 1 1 1 1 1 1 1 1 1 ...
Load model
The Stan code described in the manuscript was extended to fit more than one copy-task component (i.e. a categorical predictor with \(K\) levels). The code is stored in the file “MoG.stan” (i.e. mixture of Gaussians) and can be downloaded from the “walkthrough” folder on OSF. Alternatively, readLines
and writeLines
can be used to download the Stan code from OSF so it will be saved in R’s current working directory.
# Download Stan code
<- readLines("https://osf.io/j6dzt/download")
file # Save Stan code to current working directory
writeLines(file, "MoG.stan")
The Stan model can be loaded for data fitting using the stan_model
function and assign it to the object mog
. The “MoG.stan” file needs to be available in the current working directory of R or the file path needs to be indicated.
<- stan_model(file = "MoG.stan") mog
This code fits the posterior of a categorical predictor with any number of levels. In other words, the posterior that model returns can be used to calculate simple effects, main effects and interactions of a factorial design. We demonstrate below how to calculate a simple effect. The same logic can be used to calculate main effects and interactions. Also, participant estimates are calculated for both mixture components and the mixing proportions by condition. For including more predictors, or removing, e.g., random error terms, or adjusting priors, the user will have to work directly in the Stan code.
This Stan code used is largely based on Sorensen et al. (2016) and Vasishth et al. (2017). Sorensen et al. (2016) presents a detailed tutorial on how to fit Stan models (see also Lambert, 2018).
Initiate sampling
For the model to converge we need to run a sufficient number of iterations. 30,000 iterations, as below, are a lot but does not guarantee convergence. We need to test convergence using the model’s posterior (see blow). To test whether the model has converged, we need to run the model more than one time (i.e. different chains). These can be run at the same time (in parallel) using more than one core of your computer (three cores below).4 There is no need to use more cores and chains and using less cores would mean that at least one process has to wait until after other processes are completed. When the model has settled on a parameter values, we should observe that all three streams (chains) overlap. Running long chains (many iterations) is useful to get more accurate parameter estimates.
<- 30000 # No of iterations
iterations <- 15000 # Warmup samples to be discarded
warmup <- n_cores <- 3 # No of chains / cores used (one chain per core) n_chains
The model is fitting a number of parameters that are not relevant for our inference. To reduce the size of the R object that containts the posterior we can select the parameters of interest and disregard all paremters that are not relevant for our inference.
# Parameters to keep in output
<- c("beta", # fluent typing
pars "delta", # disfluency slowdown
"theta", # mixing proportion
"beta_s", # by-participant typing speed
"theta_s", # by-participant disfluency probability
"sigma", # variance component
"sigma_diff", "sigmap_e", "sigma_e", # variance by mixture component
"sigma_u", "sigma_w", # variance for random ppt and bigram intercepts
"log_lik", # log likelihood (for model comparison)
"y_tilde") # predicted data
The sampling
function applies the model to the data using the information specified above (iterations, warmup, chains, cores, parameters pars
to be saved). save_warmup
is set to FALSE
to discard the warmup samples (which are not used for inference anyway) to reduce the size of the posterior. To allow reproducibility (and because Bayesian models involves random number generation) we set the seed. The seed can be any number but using the same number ensures the use of the same random number. Lastly, the control argument was specified with higher values for adapt_delta
and max_treedepth
: using higher values here mean the model runs slower but supports more careful parameter estimation. The model is assigned to the variable m
.
# Fit model
<- sampling(mog,
m data = data_list,
iter = iterations,
warmup = warmup,
chains = n_chains,
cores = n_cores,
pars = pars, # Parameters to keep.
save_warmup = FALSE, # Don't save the warmup samples.
seed = 365,
control = list(adapt_delta = .96, # default: .9
max_treedepth = 12)) # default: 10
Running this model will take a while to complete sampling depending on your hardware specifications. The time it took my machine to complete this job can me viewed using get_elapsed_time
. Therefore it is worth to not use all cores for this to run or to use a dedicated high performance machine.
get_elapsed_time(m)/60 # in mins
warmup sample
chain:1 180.4650 344.6883
chain:2 180.6383 186.8117
chain:3 178.6567 97.3185
Reusing model output
The output of the model (i.e. the posterior) can be saved, so we don’t need to run the model again (which admittedly can take a while). The function requires the name of the fitted model m
. We prefer to keep the name of the output similar to the name of the Stan code used. The model is stored as compressed “.rda” file.
saveRDS(m, "MoG.rda", compress = "xz")
The model can be load into the environment of R using readRDS
.
<- readRDS("MoG.rda") m
Model diagnostics
Convergence
Model convergence can be established in two practically simple techniques. First, we can inspect the chains in trace plots which show the parameter estimate across iterations (here after warmup) for each chain. We will look at the population-level parameters (names are determined by the model) corresponding to the Greek symbols used in the manuscript and assign those to pars
.
<- c("beta", "delta", "theta", "sigma", "sigmap_e", "sigma_e") pars
To create trace plots, we can apply the stan_trace
function to the model m
and extract the MCMC chains for the parameters in pars
. The alpha
argument makes the colours slightly more transparent. If the chains overlap and look like “fat hairy caterpillars,” chains have converged on the same target distribution.
# Check convergence
stan_trace(m, pars = pars, alpha = .5)
Second, we can calculate the \(\hat{R}\) statistic. Successful convergence is reflected in \(\hat{R}\) values smaller than 1.05 (Gelman & Rubin, 1992). \(\hat{R}\) is similar to the F statistic: it tells us how much bigger the variability between chains is compared to the variability within chains. A value of \(\approx 1\) indicates that the variability is essentially identical signifying convergence. For this we can use the rhat
function applied to the model and the population parameters.
summary(m, pars=pars)$summary %>% as.data.frame() %>%
rownames_to_column("parameter") %>%
select(parameter, Rhat)
parameter Rhat
1 beta[1] 1.000328
2 beta[2] 1.001846
3 delta[1] 1.000258
4 delta[2] 1.000046
5 theta[1] 1.000845
6 theta[2] 1.000266
7 sigma[1] 1.000010
8 sigma[2] 1.000468
9 sigmap_e[1] 1.000094
10 sigmap_e[2] 1.000404
11 sigma_e[1] 1.000283
12 sigma_e[2] 1.000340
Convergence problems can have many reasons and therefore many solutions: running longer chains (in particular longer warmups), increasing the maximum treedepth and the average acceptance probability (adapt_delta
), specifying starting values, adjusting priors, constraining parameters, or changing the parametrisation of the model. Severe convergence problems are indicative of a misspecification of the model.
Posterior predictive check
The model used the posterior parameter values to simulate hypothetical data sets. This happened for every iteration for every chains.
We can draw predicted data from the model output and compare these to the observed data. A model that makes reasonable predictions should fit the observed data.
Using the as.matrix
function we extract a matrix of y_tilde
, the simulated data. This returns a matrix with the size total_samples
\(\times\) nrow(data)
(so 45,000 \(\times\) 2,071). We use the sample
function to draw N
randomly sampled hypothetical data sets. The ppc_dens_oberlay
function from the bayesplot
package is then mapping the observed data (\(y\); thick blue line) to the simulated data (\(y_{rep}\); thin lightblue lines).
<- as.matrix(m, pars = "y_tilde") # extract simulated data sets
y_tilde <- 50 # number of simulations to use
N <- (iterations - warmup) * n_chains # total number of samples
total_samples <- sample(total_samples, N) # created random indices
rnd_sims <- y_tilde[rnd_sims,] # draw N random simulations y_tilde_sample
library(bayesplot)
ppc_dens_overlay(data_list$y, y_tilde_sample) +
scale_x_continuous(limits = c(0, 3000))
Posterior probability distribution
At the core of Bayesian inference is the posterior probability distribution. For each model parameter we have 45,000 posterior samples that form a posterior probability distribution. These samples are the results of the number of iterations after warmup for all chains.
<- (iterations - warmup) * n_chains) (total_samples
[1] 45000
The distribution of parameter values represents the uncertainty about parameter values given the data. There is a large range of things one can do with a posterior. Below we will focus on summarising parameter estimates, comparing conditions, and extracting by-participant estimates.
The names
function can be used to remind us of the parameter names used in the model (we reduced the output to the first 15 parameter names). The meaning of the parameters is described in the manuscript and can be obtained from the Stan code. Indices refer to the copy-task component (1 = consonants; 2 = LF bigrams) and to the participant identifier for by-participant parameters (indicated by _s
).
names(m)[1:15]
[1] "beta[1]" "beta[2]" "delta[1]" "delta[2]" "theta[1]"
[6] "theta[2]" "beta_s[1,1]" "beta_s[2,1]" "beta_s[1,2]" "beta_s[2,2]"
[11] "beta_s[1,3]" "beta_s[2,3]" "beta_s[1,4]" "beta_s[2,4]" "beta_s[1,5]"
Parameter-value estimates
The parameter estimates can be visualised using the stan_hist
function with the model variable m
and the to-be visualised parameters pars
as arguments. The function returns a histogram for each other parameter values. Each histogram is the probability distribution of the parameter values indicating which values are more or less probable to be the true parameter value.
stan_hist(m, pars = pars)
The posterior samples of the parameter values can be summarised using the print
function. The probs
argument requires the lower and upper bound of the probability interval that we are interested in. A lower bound of .025 and an upper bound of .975 gives the 95% probability interval (PI; also called “credible intervals”), i.e. the range that contains the true parameter value with a 95% probability given the data.
The output summaries the most probable parameter value as mean with its standard error (se_mean
) and standard deviation (sd
), the effective sample size (n_eff
) indicating sampling efficiency and the convergence metric \(\hat{R}\) (Rhat
) we introduced above.
print(m, pars = pars, probs = c(.025,.975))
Inference for Stan model: anon_model.
3 chains, each with iter=30000; warmup=15000; thin=1;
post-warmup draws per chain=15000, total post-warmup draws=45000.
mean se_mean sd 2.5% 97.5% n_eff Rhat
beta[1] 5.79 0 0.10 5.60 6.00 3526 1
beta[2] 5.58 0 0.08 5.41 5.74 4228 1
delta[1] 1.05 0 0.10 0.80 1.20 2861 1
delta[2] 0.32 0 0.08 0.17 0.48 16644 1
theta[1] 0.47 0 0.09 0.31 0.65 1949 1
theta[2] 0.27 0 0.06 0.16 0.40 5993 1
sigma[1] 0.45 0 0.02 0.41 0.50 4462 1
sigma[2] 0.55 0 0.04 0.49 0.64 5828 1
sigmap_e[1] 0.56 0 0.06 0.46 0.68 3287 1
sigmap_e[2] 0.83 0 0.06 0.73 0.97 6869 1
sigma_e[1] 0.34 0 0.03 0.28 0.39 4526 1
sigma_e[2] 0.27 0 0.02 0.23 0.31 6235 1
Samples were drawn using NUTS(diag_e) at Mon May 10 07:43:47 2021.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
For the following steps, we will focus on the three parameters that have conceptually interesting interpretations: (1) the average fluent typing speed \(\beta\), (2) the disfluency slowdown \(\delta\), and (3) the disfluency probability \(\theta\).
<- c("beta", "delta", "theta") pars
The plot
function shows the posterior probability distribution of the three parameters for the consonants task (indicated as 1) and the LF-bigrams task (indicated as 2) summarised as median and 95% PI.
plot(m, pars = pars, ci_level = .95) # ci = credible interval
The values for \(\beta\) and \(\delta\) are shown on a log-scale. To transform their values back to msecs we can extract the posterior samples using the as.data.frame
function. We prefer the use of tibble objects.
<- as.data.frame(m, pars) %>% as_tibble()
posterior posterior
# A tibble: 45,000 x 6
`beta[1]` `beta[2]` `delta[1]` `delta[2]` `theta[1]` `theta[2]`
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 5.81 5.63 1.09 0.240 0.523 0.377
2 5.88 5.62 1.07 0.335 0.400 0.297
3 5.83 5.58 1.10 0.283 0.440 0.375
4 5.92 5.67 1.05 0.271 0.356 0.313
5 5.90 5.65 1.06 0.269 0.390 0.359
6 5.85 5.66 1.09 0.281 0.472 0.267
7 5.87 5.55 1.15 0.278 0.420 0.304
8 5.74 5.51 1.04 0.387 0.391 0.311
9 5.75 5.54 1.12 0.353 0.417 0.287
10 5.63 5.52 1.06 0.265 0.415 0.280
# … with 44,990 more rows
The pivot_longer
function is transforming the data above to a long format with an additional column for component and the model parameters kept as columns. The names_pattern
argument is using a regular expression to extract the number in the squared brackets.
<- pivot_longer(posterior, everything(),
posterior_long names_to = c(".value", "component"),
names_pattern = "(.*)\\[(.)\\]")
posterior_long
# A tibble: 90,000 x 4
component beta delta theta
<chr> <dbl> <dbl> <dbl>
1 1 5.81 1.09 0.523
2 2 5.63 0.240 0.377
3 1 5.88 1.07 0.400
4 2 5.62 0.335 0.297
5 1 5.83 1.10 0.440
6 2 5.58 0.283 0.375
7 1 5.92 1.05 0.356
8 2 5.67 0.271 0.313
9 1 5.90 1.06 0.390
10 2 5.65 0.269 0.359
# … with 89,990 more rows
This code is then transforming beta
and delta
to msecs using the exponential function exp
to un-log the values. In order to transform the slowdown delta
into msecs we need to add beta
before using the exponential function; we can then subtract beta again. The recode
function changes the component indices from 1 to “Consonants” and 2 to “LF bigrams.”
<- mutate(posterior_long,
posterior_in_msecs delta = exp(beta + delta) - exp(beta),
beta = exp(beta),
component = recode(component, `1` = "Consonants",
`2` = "LF bigrams"))
posterior_in_msecs
# A tibble: 90,000 x 4
component beta delta theta
<chr> <dbl> <dbl> <dbl>
1 Consonants 334. 656. 0.523
2 LF bigrams 279. 75.8 0.377
3 Consonants 357. 678. 0.400
4 LF bigrams 275. 109. 0.297
5 Consonants 340. 687. 0.440
6 LF bigrams 266. 87.1 0.375
7 Consonants 373. 692. 0.356
8 LF bigrams 290. 90.0 0.313
9 Consonants 364. 688. 0.390
10 LF bigrams 283. 87.3 0.359
# … with 89,990 more rows
The posterior distribution of the parameter estimates can then be visualised in, for example, histograms.
pivot_longer(posterior_in_msecs, beta:theta, names_to = "parameter") %>%
ggplot(aes(x = value, colour = component, fill = component)) +
geom_histogram(position = "identity", alpha = .25) +
facet_wrap(~parameter, scales = "free", labeller = label_parsed) +
scale_fill_brewer("Copy-task\ncomponents", palette = "Dark2") +
scale_color_brewer("Copy-task\ncomponents", palette = "Dark2") +
labs(x = "Posterior parameter estimate")
Difference between copy-task components
We can calculate the differences between the copy-task components for each parameter value. This is giving an indication of whether fluent typing (\(\beta\)) is influenced by the copy task, maybe the size of the disfluency slowdown (\(\delta\)) is impacted by the task, or the disfluency probability (\(\theta\)).
To determine these differences we change the data format above and create a variable that indicates the parameter and one column for the corresponding values for each copy-task component. The last line calculates the difference between the consonants task and the LF-bigrams task for each of the three parameter values.
<- posterior_in_msecs %>%
posterior_by_component pivot_longer(beta:theta, names_to = "parameter") %>%
group_by(component) %>% mutate(id = row_number()) %>%
pivot_wider(names_from = component, values_from = value) %>%
select(-id) %>% # drop id column
mutate(diff = Consonants - `LF bigrams`, # calculate the difference between copy-task components
parameter = paste0(parameter, "[diff]"))
posterior_by_component
# A tibble: 135,000 x 4
parameter Consonants `LF bigrams` diff
<chr> <dbl> <dbl> <dbl>
1 beta[diff] 334. 279. 54.6
2 delta[diff] 656. 75.8 580.
3 theta[diff] 0.523 0.377 0.146
4 beta[diff] 357. 275. 81.9
5 delta[diff] 678. 109. 569.
6 theta[diff] 0.400 0.297 0.103
7 beta[diff] 340. 266. 74.2
8 delta[diff] 687. 87.1 600.
9 theta[diff] 0.440 0.375 0.0647
10 beta[diff] 373. 290. 83.4
# … with 134,990 more rows
The difference between the conditions can be summarised using the mean, the 95% PIs and the probability that the difference between the components is small 0 (indicated as e.g. \(P(\hat{\beta} <0)\) for the parameter \(\beta\); the hat \(\hat{.}\) symbol indicates that the value is an estimate of the population parameter value). This summary tells us whether the difference between the consonants task and the LF-bigrams task is different from zero and the most probable value for the difference between tasks.
%>%
posterior_by_component group_by(parameter) %>% # group by parameters
summarise(mean = mean(diff), # mean difference
lower = quantile(diff, .025), # 2.5% lower bound of difference
upper = quantile(diff, .975), # 97.5% upper bound of difference
p = mean(diff < 0)) %>% # prob that difference is negative
mutate(across(where(is.numeric), round, 2)) # round after 2nd decimal place
# A tibble: 3 x 5
parameter mean lower upper p
<chr> <dbl> <dbl> <dbl> <dbl>
1 beta[diff] 64.4 -5.64 144. 0.04
2 delta[diff] 513. 282. 745. 0
3 theta[diff] 0.2 -0.01 0.41 0.03
These differences can be viewed in histograms. The vertical line indicates a difference between tasks corresponding to a value of zero. The area of the histogram to the left of this line corresponds to the probability p
to observe a value smaller than zero as in the output above. We observe difference between copy-task components for all three parameters. Fluent typing is 64 msecs (95% PI: -6, 144) slower for the consonants task, and hesitations are 513 msecs (95% PI: 282, 745) longer and 0.2 (\(\approx\) 20%; 95% PI: -0.01, 0.41) times more likely. Interestingly fluent typing (i.e. planning and executing motor plans) is slower in the consonant task even after modeling hesitations as a separate process.
ggplot(posterior_by_component, aes(x = diff)) +
geom_histogram(alpha = .5) +
geom_vline(xintercept = 0, colour = "red", linetype = "dashed", size = .5) +
facet_wrap(~parameter, scales = "free", labeller = label_parsed) +
labs(x = "Difference between copy-task components")
In practice we rarely believe that only an effect of exactly null is indicating that there is no difference between our groups. Instead there are values that are non-different from null; for example a slowdown of 1 msecs is meaningless in most, if not all, contexts. To take this into account we can determine the region of practical equivalence (ROPE, Kruschke, 2010, 2011, 2014). We can define the ROPE as the range of values that is a priori considered non-different from a null effect. The ROPE value indicates to what extent the posterior can not rule out a negligible effect. A meaningful difference between groups (or effects more generally) should have a small proportion of posterior samples within the ROPE.
To calculate the ROPE for out differences between the two copy-task components we can use the rope
function from the bayestestR
package.5 The rope
function requires as input a vector with the posterior difference. We can calculate a vector with posterior difference for each of the three parameters, indicated as \(\beta_\text{diff}\), \(\delta_\text{diff}\), and \(\theta_\text{diff}\).
# Fluent typing
<- filter(posterior_by_component, parameter == "beta[diff]") %>% pull(diff)
beta_diff
# Slowdown for hesitant typing
<- filter(posterior_by_component, parameter == "delta[diff]") %>% pull(diff)
delta_diff
# Proportion of hesitations
<- filter(posterior_by_component, parameter == "theta[diff]") %>% pull(diff) theta_diff
Also, we need to define the range
argument which specifies the range of values are not considerably different from null. Of course this range will depend on what the parameter represents and researcher intuition. For example, for fluent typing we would consider small differences, of say 10 msecs, more meaningful than for disfluencies. If we consider differences of 10 msecs or smaller equivalent to a null difference, we can define the range
argument with a lower bound of -10 and the upper bound of 10.
library(bayestestR)
rope(beta_diff, range = c(-10, 10))
# Proportion of samples inside the ROPE [-10.00, 10.00]:
inside ROPE
-----------
6.72 %
The ROPE for fluent typing speed contains 6.72% of the posterior samples. In other words, there is a 6.72% probability that the posterior difference \(\beta_\text{diff}\) is between -10 and 10 msecs which we considered negligible.
For disfluencies we might want to consider a larger range of values as negligible differences between two tasks (also because the variability in disfluencies is larger than the variability for fluent key transitions). Even if we define a ROPE of -250 through 250 msecs, there are no posterior samples inside the ROPE indicating that the difference in typing hesitations are related to difficulty that arise prior to processing on the motor level.
rope(delta_diff, range = c(-250, 250))
# Proportion of samples inside the ROPE [-250.00, 250.00]:
inside ROPE
-----------
0.00 %
As for the disfluency probably, we need to bear in mind that estimates for \(\theta\) are bound between 0 and 1. Hence, the difference \(\theta_\text{diff}\) cannot ever exceed -1 and 1. If we consider a difference of 5% or smaller as negligible, we can define the ROPE as ranging from -.05 through .05. We observe that 6.37% of the posterior difference \(\theta_\text{diff}\) is within the ROPE. In other words, a large proportion of posterior samples is supporting a meaningful difference between the two copy-task components. We observe that hesitations are largely more frequent in the consonants task.
rope(theta_diff, range = c(-.05, .05))
# Proportion of samples inside the ROPE [-0.05, 0.05]:
inside ROPE
-----------
6.37 %
The Stan code presented in this walk through can be used to estimate the posterior for two and more conditions: We can calculate the difference between conditions from the posterior as well as main effects and interactions for more complex factorial designs as any factorial design can be reduced to a single variable.6
By-participant estimates
In some contexts we would like to obtain by-participant estimates of the average fluent typing-speed of a participant after accounting for disfluencies (i.e. \(\beta_s\)); by-participant disfluency information can be estimated the proportion of disfluencies (i.e. \(\theta_s\)). These estimates inform us about the fluent typing speed but also about the prevalence of typing hesitations.
As before we can extract the posterior from the model m
. This time we use the parameters that stored by-participant estimates beta_s
and theta_s
corresponding to the population estimates beta
and theta
.
<- c("beta_s", "theta_s")
pars <- as.data.frame(m, pars = pars) %>% as_tibble()
posterior_ppts names(posterior_ppts)[1:10]
[1] "beta_s[1,1]" "beta_s[2,1]" "beta_s[1,2]" "beta_s[2,2]" "beta_s[1,3]"
[6] "beta_s[2,3]" "beta_s[1,4]" "beta_s[2,4]" "beta_s[1,5]" "beta_s[2,5]"
This posterior tibble has the format “\(<\text{parameter name}>\text{_s}[<\text{component id}>, <\text{participant id}>]\)” where indices indicate the copy-task component and the participant identifier of the parameters. The following code is converting the beta
s to msecs, creates a long format with columns parameter
with the names above as levels and a value
column with their respective values. The parameter
column is then separated into parameter
(levels: beta
, theta
), component
(levels: 1
, 2
), and participant (an index for each participant) using “,” to separate the three variables. To use “,” as separator we replace "_s" with “,” in the line before. Copy-task component names were changed as before. A preview is below.
<- posterior_ppts %>%
posterior_ppts_long # use exp() on everything starting with "beta"
mutate(across(starts_with("beta"), exp)) %>%
pivot_longer(everything(), names_to = "parameter") %>%
# replace "_s" with ","
mutate(parameter = gsub("_s", ",", parameter)) %>%
# separate into parameter, component, ppt using "," as separator
separate(parameter, into = c("parameter", "component", "participant")) %>%
mutate(component = recode(component, `1` = "Consonants", `2` = "LF bigrams")) %>%
group_by(parameter) %>% mutate(id = row_number()) %>%
pivot_wider(names_from = parameter, values_from = value) %>%
select(-id) # drop id
posterior_ppts_long
# A tibble: 4,500,000 x 4
component participant beta theta
<chr> <chr> <dbl> <dbl>
1 Consonants 1 323. 0.309
2 LF bigrams 1 295. 0.292
3 Consonants 2 299. 0.514
4 LF bigrams 2 305. 0.223
5 Consonants 3 381. 0.632
6 LF bigrams 3 348. 0.332
7 Consonants 4 469. 0.509
8 LF bigrams 4 301. 0.281
9 Consonants 5 357. 0.301
10 LF bigrams 5 254. 0.244
# … with 4,499,990 more rows
We can then summarise by-participant estimates for each component and parameter with the most probable parameter value and the 95% PI as before.
<- posterior_ppts_long %>%
posterior_ppts_long_summary pivot_longer(beta:theta, names_to = "parameter", values_to = "value") %>%
group_by(parameter, component, participant) %>%
summarise(mean = mean(value),
lower = quantile(value, .025),
upper = quantile(value, .975))
posterior_ppts_long_summary
# A tibble: 200 x 6
# Groups: parameter, component [4]
parameter component participant mean lower upper
<chr> <chr> <chr> <dbl> <dbl> <dbl>
1 beta Consonants 1 320. 246. 408.
2 beta Consonants 10 268. 192. 348.
3 beta Consonants 11 269. 197. 358.
4 beta Consonants 12 290. 230. 362.
5 beta Consonants 13 493. 300. 644.
6 beta Consonants 14 427. 318. 562.
7 beta Consonants 15 338. 268. 421.
8 beta Consonants 16 295. 222. 380.
9 beta Consonants 17 263. 204. 333.
10 beta Consonants 18 303. 223. 411.
# … with 190 more rows
To get an idea how the by-participant estimates look like, we extracted participants 1 through 5 for visualisation. From this subset we see that, although we observed overall larger population estimates for the consonants task, some participants do not show a difference in fluent typing speed and a similar frequency of hesitations (e.g. participant 2) or even slightly more hesitations in the LF-bigrams task (e.g. participant 1).
%>%
posterior_ppts_long_summary filter(participant %in% 1:5) %>% # only participants 1 through 5
ggplot(aes(x = participant,
y = mean, ymin = lower, ymax = upper,
colour = component)) +
geom_pointrange(position = position_dodge(.75)) +
facet_wrap(~parameter, scales = "free", labeller = label_parsed) +
coord_flip() +
labs(x = "Participant id", y = "Parameter estimates") +
scale_colour_brewer("Copy-task\ncomponent", palette = "Dark2")
Finally, we can calculate and visualise the by-participant differences.
<- posterior_ppts_long %>% group_by(component) %>%
posterior_ppts_long_diff_summary mutate(id = row_number()) %>%
pivot_wider(names_from = component,
values_from = c(beta, theta)) %>%
pivot_longer(-c(participant, id),
names_to = c("parameter", ".value"),
names_pattern = "(.*)_(.*)") %>%
mutate(diff = Consonants - `LF bigrams`,
parameter = paste0(parameter, "[diff]")) %>%
select(participant, parameter, diff) %>%
group_by(parameter, participant) %>%
summarise(mean = mean(diff),
lower = quantile(diff, .025),
upper = quantile(diff, .975))
posterior_ppts_long_diff_summary
# A tibble: 100 x 5
# Groups: parameter [2]
parameter participant mean lower upper
<chr> <chr> <dbl> <dbl> <dbl>
1 beta[diff] 1 72.0 -21.5 172.
2 beta[diff] 10 58.4 -26.0 146.
3 beta[diff] 11 -16.3 -122. 88.7
4 beta[diff] 12 47.4 -33.8 133.
5 beta[diff] 13 138. -51.9 301.
6 beta[diff] 14 38.5 -105. 192.
7 beta[diff] 15 118. 33.0 209.
8 beta[diff] 16 55.6 -35.5 152.
9 beta[diff] 17 70.1 -0.104 147.
10 beta[diff] 18 78.3 -11.5 189.
# … with 90 more rows
For simplicity we show the differences for the first 10 participants only. This calculation and visualisation allows researchers to identify participants that do or do not differ in terms of fluent typing speed or hesitation frequency, in our example for tasks that do and do not involve lexical information.
%>%
posterior_ppts_long_diff_summary filter(participant %in% 1:10) %>% # only participants 1 through 10
ggplot(aes(x = factor(as.numeric(participant)),
y = mean, ymin = lower, ymax = upper)) +
geom_hline(yintercept = 0, colour = "red", linetype = "dashed") +
geom_pointrange(position = position_dodge(.75)) +
facet_wrap(~parameter, scales = "free", labeller = label_parsed) +
coord_flip() +
labs(x = "Participant id", y = "Difference between copy-task components") +
scale_colour_brewer("Copy-task\ncomponent", palette = "Dark2")
References
Stan code
/*
Mixture model for disfluencies
With by-ppt mixing proportion
Random intercepts for ppts
Random intercepts for bigrams
By-ppts slope adjustments
*/
data {
int<lower=1> N; // number of observations
int<lower=1> nS; // number of ppts
int<lower=1, upper=nS> ppts[N]; // ppts identifier
int<lower=1> nB; // number of bigrams
int<lower=1, upper=nB> bigrams[N]; // ppts identifier
int<lower=1> K; // number of conditions
int condition[N]; // vector with condition levels
vector[N] y; // outcome
}
parameters {
real beta_mu; // fluent typing
vector[K] beta_raw;
real<lower=0> beta_sigma;
vector<lower = 0>[K] delta; // slowdown for hesitations
vector[K] alpha; // mixing proportion
real tau;
matrix[K,nS] alpha_s; // by-ppt mixing proportion
vector<lower = 0>[K] sigma; // residual error
vector<lower=0>[K] sigma_diff; // difference between mixture components
// For random effects
vector<lower=0>[K] sigma_u; // ppts slopes
cholesky_factor_corr[K] L_u; // variance-covariance matrix
matrix[K,nS] z_u; // random effects for participants
vector[nB] w; // bigram intercepts
real<lower=0> sigma_w; // bigram sd
}
transformed parameters{
vector[K] theta = 1 - inv_logit(alpha); // convert mixing proportion to proportion of second component
matrix[K,nS] theta_s = 1 - inv_logit(alpha_s);
matrix[K,nS] log_alpha_s_1 = log_inv_logit(alpha_s);
matrix[K,nS] log_alpha_s_2 = log1m_inv_logit(alpha_s);
vector[K] beta = beta_mu + beta_sigma * beta_raw;
vector<lower=0>[K] sigmap_e = sigma + sigma_diff; // variance for second mixture component is larger
vector<lower=0>[K] sigma_e = sigma - sigma_diff; // variance for first mixture component is smaller
matrix[K,nS] u = diag_pre_multiply(sigma_u, L_u) * z_u; // ppts random effects
vector[N] mu;
for(n in 1:N){
mu[n] = beta[condition[n]] + u[condition[n], ppts[n]] + w[bigrams[n]];
}
}
model {
vector[2] lp_parts;
// Priors
beta_mu ~ normal(5, 2.5);
beta_sigma ~ cauchy(0, 2.5);
beta_raw ~ normal(0, 1);
delta ~ normal(0, 1);
for(s in 1:nS){
for(k in 1:K){
alpha_s[k,s] ~ normal(alpha[k], tau);
}
}
alpha ~ normal(0, 1);
tau ~ cauchy(0, 1);
sigma ~ cauchy(0, 2.5);
sigma_diff ~ normal(0, 1);
// REs priors
L_u ~ lkj_corr_cholesky(2.0);
to_vector(z_u) ~ normal(0,1);
sigma_u ~ normal(0,2.5); // ppts random effects
sigma_w ~ normal(0,2.5);
w ~ normal(0, sigma_w); // bigram random effects
// likelihood
for(n in 1:N){
lp_parts[1] = log_alpha_s_1[condition[n], ppts[n]] + lognormal_lpdf(y[n] | mu[n], sigma_e[condition[n]]); // fluent typing
lp_parts[2] = log_alpha_s_2[condition[n], ppts[n]] + lognormal_lpdf(y[n] | mu[n] + delta[condition[n]], sigmap_e[condition[n]]); // hesitatant typing
target += log_sum_exp(lp_parts);
}
}
generated quantities{
vector[2] lp_parts;
real<lower=0,upper=1> prob_tilde;
vector[N] log_lik; // log likelihood
vector[N] y_tilde; // predicted data
vector[K] beta2 = beta + delta; // second mixture component
matrix[K,nS] beta_s; // by-ppt fluent typing estimates
for(s in 1:nS){
for(k in 1:K){
beta_s[k, s] = beta[k] + u[k, s];
}
}
for(n in 1:N){
// calculate log likelihood
lp_parts[1] = log_alpha_s_1[condition[n], ppts[n]] + lognormal_lpdf(y[n] | mu[n], sigma_e[condition[n]]); // fluent typing
lp_parts[2] = log_alpha_s_2[condition[n], ppts[n]] + lognormal_lpdf(y[n] | mu[n] + delta[condition[n]], sigmap_e[condition[n]]); // hesitant typing
log_lik[n] = log_sum_exp(lp_parts);
prob_tilde = bernoulli_rng(theta_s[condition[n], ppts[n]]);
// predict data
if(prob_tilde) {
y_tilde[n] = lognormal_rng(mu[n] + delta[condition[n]], sigmap_e[condition[n]]);
}
else{
y_tilde[n] = lognormal_rng(mu[n], sigma_e[condition[n]]);
}
}
}
Session info
sessionInfo()
R version 4.0.3 (2020-10-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.5 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.2.20.so
locale:
[1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8
[5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8
[7] LC_PAPER=en_GB.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] bayestestR_0.9.0 bayesplot_1.8.0 rstan_2.26.1 StanHeaders_2.26.1
[5] forcats_0.5.1 stringr_1.4.0 dplyr_1.0.6 purrr_0.3.4
[9] readr_1.4.0 tidyr_1.1.3 tibble_3.1.2 ggplot2_3.3.4
[13] tidyverse_1.3.0 knitr_1.33 rmdformats_1.0.1
loaded via a namespace (and not attached):
[1] httr_1.4.2 sass_0.3.1 jsonlite_1.7.2 modelr_0.1.8
[5] bslib_0.2.4 RcppParallel_5.0.3 assertthat_0.2.1 highr_0.9
[9] stats4_4.0.3 cellranger_1.1.0 yaml_2.2.1 pillar_1.6.1
[13] backports_1.2.1 glue_1.4.2 digest_0.6.27 RColorBrewer_1.1-2
[17] rvest_1.0.0 colorspace_2.0-1 plyr_1.8.6 htmltools_0.5.1.1
[21] pkgconfig_2.0.3 broom_0.7.7 haven_2.4.1 bookdown_0.21
[25] scales_1.1.1 processx_3.4.5 farver_2.1.0 generics_0.1.0
[29] ellipsis_0.3.2 withr_2.4.2 cli_2.5.0 magrittr_2.0.1
[33] crayon_1.4.1 readxl_1.3.1 evaluate_0.14 ps_1.6.0
[37] fs_1.5.0 fansi_0.5.0 xml2_1.3.2 pkgbuild_1.2.0
[41] tools_4.0.3 loo_2.4.1 prettyunits_1.1.1 hms_1.0.0
[45] lifecycle_1.0.0 matrixStats_0.59.0 V8_3.4.0 munsell_0.5.0
[49] reprex_1.0.0 callr_3.5.1 compiler_4.0.3 jquerylib_0.1.3
[53] rlang_0.4.11 ggridges_0.5.3 grid_4.0.3 rstudioapi_0.13
[57] labeling_0.4.2 rmarkdown_2.7 gtable_0.3.0 codetools_0.2-16
[61] inline_0.3.17 DBI_1.1.1 curl_4.3.1 reshape2_1.4.4
[65] R6_2.5.0 gridExtra_2.3 lubridate_1.7.10 utf8_1.2.1
[69] insight_0.14.0 stringi_1.6.2 parallel_4.0.3 Rcpp_1.0.6
[73] vctrs_0.3.8 dbplyr_2.1.0 tidyselect_1.1.1 xfun_0.24
An alternative to writing models in Stan is the R package
brms
which provides a flexible framework to implement mixture models and many other types of probability models (Bürkner, 2017, 2018). In particular,brms
has amixture
function to specify mixtures of various types of distributions. These could be Gaussian, skewed-Normal, shifted-Normal, ex-Gaussian etc. and combinations thereof. There is a large number of probability models for continuous data that are plausible candidates (for reaction time data see e.g. Matzke & Wagenmakers, 2009).↩︎Run
install.packages("tidyverse")
to installtidyverse
on your computer.↩︎If you want to use R to check how many cores are available on your machine, run
parallel::detectCores()
(you might need to install the packageparallel
).↩︎To install
bayestestR
runinstall.packages("bayestestR")
.↩︎For example, say we have a 2 \(\times\) 2 factorial design with factor 1 having two levels AB and CD and factor 2 having the corresponding levels AC and BD. These two factors render four conditions A, B, C, D. From posterior samples for each level A through D we can calculate main effect 1 as \(\text{ME1}=(A+B) - (C+D)\), main effect 2 as \(\text{ME2}=(A+C) - (B+D)\), and their interaction as \(\text{Interaction}=(A-B) - (C-D)\) (or \(\text{Interaction}=(A-C) - (B-D)\)) and summary statistics as shown above.↩︎