Tutorial: Fitting log-Gaussian mixed-effects mixture models for keystroke data
This is a step-by-step tutorial for fitting mixed-effects mixture models with two log-Gaussian distributions (MacLahlan & Peel, 2000) using R (R Core Team, 2016) and the rstan
package to interface with the probabilistic programming language Stan (Carpenter et al., 2016; Hoffman & Gelman, 2014; Stan Development Team, 2015b, 2015a). We will compare the mixture-model analysis to standard Gaussian and log-Gaussian mixed-effect models fitted using brms
(Bürkner, 2017, 2018).1
In this tutorial we show how to fit a mixture model written in Stan to keystroke data (similar to Roeser et al., 2021) and how to compare this to model mainstream mixed-effects models using leave-one-out cross-validation. Then we demonstrate how the posterior of this model can be used to calculate differences between transition locations in the text. This tutorial is largely self-contained. To run the code below, the reader only needs to install the required packages. Data and the Stan programme can be loaded using the code below.
Three packages need to be installed before (see footnotes for instructions): (1) rstan
to interface with Stan2; (2) tidyverse
for general data wrangling and visualisation.3; (3) brms
to for Bayesian models without directly working in Stan.4
1 Preparing the data
First we load a sample of keystroke data. Data are loaded from our OSF repository using the URL as below. For simplicity we use a random subset of data from participants who completed an essay-writing task in the context of the ProWrite project (e.g. Dux Speltz et al., 2022).
We reduced the data to 10 participants and three variables: a participant identifier participant
, the interkey intervals iki
, and the transition location location
(levels: within word, before word, before sentence).
# A tibble: 600 × 3
participant iki location
<dbl> <dbl> <chr>
1 116 2881 before-sentence
2 116 465 before-word
3 116 144 before-sentence
4 116 121 before-sentence
5 116 129 before-word
6 116 81 before-sentence
7 116 165 within-word
8 116 114 before-word
9 116 63 before-sentence
10 116 1072 before-sentence
# ℹ 590 more rows
We reduced the number of interkey intervals to 20 observations per participant and per transition location by randomly sampling interkey intervals. See for example:
# A tibble: 30 × 3
participant location n
<dbl> <chr> <int>
1 76 before-sentence 20
2 76 before-word 20
3 76 within-word 20
4 101 before-sentence 20
5 101 before-word 20
6 101 within-word 20
7 116 before-sentence 20
8 116 before-word 20
9 116 within-word 20
10 130 before-sentence 20
# ℹ 20 more rows
2 Mixed-effects models
We start with standard single distribution mixed-effects models (as in e.g. Baayen et al., 2008; Pinheiro et al., 2007), that the data come from a single underlying process that, the first model, can be described as Gaussian and log-Gaussian for the other single distribution model. We compare these models below.
2.1 Gaussian mixed-effects model
2.1.1 Model
We start with a Gaussian mixed-effects model. The process that generates the \(i^\text{th}\) interkey interval (iki below), where \(i \in 1\dots N\) and \(N\) is the total number of keystroke, is a single Gaussian (normal) distribution \(\mathcal{N}()\) which is characterised by a mean \(\mu\) and a standard deviation \(\sigma_\text{e}^2\). This model can be expressed mathematically using the following equation:
\[ \begin{align} \text{iki}_i \sim & \mathcal{N}(\mu_i, \sigma_\text{e}^2)\\ \mu_i = & \beta_\text{location[i]} + u_\text{participant[i]}\\ u_\text{participant[i]} \sim & \mathcal{N}(0, \sigma_\text{p}^2) \end{align} \] The mean can be decomposed into \(\beta\) and \(u_\text{participant[i]}\). \(\beta\) was allowed to take on a different value for each transition location associated with the \(i^\text{th}\) interkey interval. By-participant interkey intervals are assumed to vary around the average which is achieved by placing a normal distribution on the difference between average and by-participant interkey intervals, indicated as \(u_\text{participant[i]}\), with a mean of 0 and a standard deviation \(\sigma_\text{p}^2\).
This model can be described in R using the following formula wrapped in the brms
formula function bf
. The family
argument can be used to specify distributions where gaussian
is the default can could be omitted below.
The syntax used in brms
is generally similar to lme4
syntax that many readers will be familiar with. Like in lm
, glm
, glmer
, lmer
and other modelling functions 0 +
can be used to drop the intercept parameter which means that the model is parametrised to return cell means and standard errors for each level of location
.
2.1.2 Priors
As we are modelling data in the Bayesian framework, we need to define priors before model fit. brms
is using sensible regulating default priors for most model parameters which can be inspected using get_prior
which needs the model formula defined above and the data as input.
prior class coef group resp dpar nlpar lb ub source
(flat) b default
(flat) b locationbeforeMsentence (vectorized)
(flat) b locationbeforeMword (vectorized)
(flat) b locationwithinMword (vectorized)
student_t(3, 0, 187.5) sd 0 default
student_t(3, 0, 187.5) sd participant 0 (vectorized)
student_t(3, 0, 187.5) sd Intercept participant 0 (vectorized)
student_t(3, 0, 187.5) sigma 0 default
We want to make sure that (flat)
is replaced by more (i.e. weakly or regulating) informative priors (see Kruschke, 2010a; Lambert, 2018; McElreath, 2016). For example, we might consider a Student-t distribution as prior over the slope parameters (class b
) which in our model are parameterised as the cell means of the three transition locations (levels: within word, before word, before sentence) with a mean of 350 msecs and a standard deviation of 75 msecs.
Student-t distributions, in comparison to normal distributions, have a degrees of freedom parameter (first argument) in addition to the mean and the standard deviation (second and third parameter, respectively). The degrees of freedom control the thickness of the tails of the distribution with smaller values assigning more probability mass to the tails; the shape of the Student-t distribution is moving closer to a normal distribution when the degrees of freedom approach \(\infty\). This is illustrated in Figure 2.1 for the mean and the standard deviation of the prior above.
Figure 2.1: Student-t prior with varying degrees of freedom.
These priors are weakly informative. The intention is to use general knowledge that we have about how long interkey intervals usually are to point the model towards the right area in the parameter space by assigning low probabilities to unlikely or even implausible values. Even if we know very little about the duration of interkey intervals, we do know that they shouldn’t be longer than a day, or an hour, or even several minutes, because such long intervals are too extreme to reflect active writing.
2.1.3 Sampling parameters
For the model to converge we need to run a sufficient number of iterations. 10,000 iterations, as below, are a lot but does not guarantee convergence. To be able to test whether converged has reached, we need to run the model several time (i.e. different chains). This can be done at the same time (in parallel) when using more than one core of your computer (three cores below).5 There is no need to use more cores than chains; using less cores would mean that at least one process has to wait until after another processes has been completed. When the model has settled on a distribution of parameter values, we should observe, after the model has finished sampling, that all chains mixed nicely. Running long chains (many iterations) is useful to get more accurate parameter estimates but will take more time.
2.1.4 Fitting the model
The following code combines the model, the data, and the prior and returns the posterior. This will take a few of minutes to complete sampling from the posterior after brms
has compiled the model code.
fit_gaus <- brm(formula = formula,
data = data,
prior = prior,
chains = nchains,
cores = ncores,
iter = iterations,
warmup = warmup,
sample_prior = TRUE,
inits = 0,
seed = 365)
Because Bayesian models tend to take more time to run, it makes sense to save the results using saveRDS
for re-use. The function requires the name of the fitted model fit_gaus
. The model is stored as compressed “.rda” file.
Before you save the posterior, create a new directory called “stan” where we will save all Stan outputs in this tutorial.
To read the fitted model back into R’s working environment use readRDS
.
The function fixef
provide a summary of the posterior showing parameter value estimates as average and 95% probability interval by transition location in msecs.
Estimate Est.Error Q2.5 Q97.5
locationbeforeMsentence 2840 427 1977 3657
locationbeforeMword 387 134 149 697
locationwithinMword 343 130 67 600
2.2 Log-Gaussian mixed-effects model
2.2.1 Model
This model is largely identical to the Gaussian mixed-effect model in the previous section, except instead of a normal distribution this model assumes the interkey intervals come from a single underlying process that follows a log-normal distribution indicated as \(\text{log}\mathcal{N}()\).
\[ \begin{align} \text{iki}_i \sim & \text{log}\mathcal{N}(\mu_i, \sigma_\text{e}^2)\\ \mu_i = & \beta_\text{location[i]} + u_\text{participant[i]}\\ u_\text{participant[i]} \sim & \mathcal{N}(0, \sigma_\text{p}^2) \end{align} \]
Assuming a log-normal distribution is useful as interkey intervals and other response times are zero-bound (interkey intervals cannot be shorter than 0). A log-normal distribution forces responses to be positive. Also the log scale is not linear, meaning the distance between 100 msecs and 110 msecs is not the same as the distance between 500 and 510 msecs; instead the distance between adjacent values is reduced for larger values while the distances between adjacent values on the lower end is increased. For example, for a logarithm with the base 10, the distance between 10 msecs and 100 msecs is identical to the distance between 100 msecs and 1000 msecs.
For the model formula we need to change the distribution family from gaussian
(default) to lognormal
.
2.2.2 Priors
The default priors and the priors that we need to specify are now no longer on the msecs scale but on the log scale.
prior class coef group resp dpar nlpar lb ub source
(flat) b default
(flat) b locationbeforeMsentence (vectorized)
(flat) b locationbeforeMword (vectorized)
(flat) b locationwithinMword (vectorized)
student_t(3, 0, 2.5) sd 0 default
student_t(3, 0, 2.5) sd participant 0 (vectorized)
student_t(3, 0, 2.5) sd Intercept participant 0 (vectorized)
student_t(3, 0, 2.5) sigma 0 default
For example, a mean of 5 on the log scale are exp(5.85)
\(\approx\) 347 msecs on a linear scale. This prior with varying degrees of freedom is visualised in Figure 2.2.
Here is again an example for different degrees of freedom for the prior above.
Figure 2.2: Student-t prior with varying degrees of freedom on log scale.
2.2.3 Fitting the model
We fit the model as before with the same sampling parameters.
fit_lgaus <- brm(formula = formula,
data = data,
prior = prior,
chains = nchains,
cores = ncores,
iter = iterations,
warmup = warmup,
sample_prior = TRUE,
inits = 0,
seed = 365)
Save the results using saveRDS
.
The fixef
function provides a summary of the inferred posterior estimates as average and 95% probability interval by transition location in log msecs.
Estimate Est.Error Q2.5 Q97.5
locationbeforeMsentence 6.5 0.1 6.3 6.7
locationbeforeMword 5.8 0.1 5.6 6.0
locationwithinMword 5.1 0.1 4.9 5.3
These can be converted into msecs using the exponential function exp
.
Estimate Q2.5 Q97.5
locationbeforeMsentence 691 557 845
locationbeforeMword 330 266 402
locationwithinMword 162 132 197
The posterior estimates obtained from the log-Gaussian model are radically different from the posterior estimates of the Gaussian model. This difference is likely due to the missing lower bound at zero in the Gaussian model as we as its inability to de-emphasize values in the right tail.
Estimate Q2.5 Q97.5
locationbeforeMsentence 2840 1977 3657
locationbeforeMword 387 149 697
locationwithinMword 343 67 600
3 Model comparison I
To evaluate the out-of-sample predictive performance of the models we can use leave-one-out cross-validation via Pareto smoothed importance-sampling (Vehtari et al., 2015, 2017). Cross-validation is an alternative to more conventional model-comparison metrics such as \(R^2\) as it penalises models with more parameters to prevent overfitting (see Lambert, 2018; McElreath, 2016). Overfitting happens when a model captures nuances of a sample, rather than those patterns that one would expect to observe in new samples.
In contrast, cross-validation is measuring the performance of a model, meaning its ability to generalise to new data. The leave-one-out information criterion (LOO-IC) is determined as follows: train a model on \(N-1\) observations, predict the remaining data points from the training model, repeat process \(N\) times to predict every observation from a model of th remaining data. Adding up the prediction results gives an estimate of the expected log-predictive density (elpd_loo
; \(\widehat{elpd}\)), i.e. an approximation of the results that would be expected for new data.
The function loo
can be used to access the expected log-predictive density which uses probability calculations to approximate the LOO-IC via Pareto smoothed importance sampling (Vehtari et al., 2015, 2017). LOO can be approximated with the log posterior predictive densities as a measure of how likely a data point is, given the parameter estimates. This is shown in Figure 3.1 for each data point in the sample. Extremely small values indicate that an observation is very unlike to repeat in a new sample under the given model and the estimated posterior parameter values.
log_lik(fit_gaus) %>%
as_tibble() %>%
pivot_longer(cols = everything(),
names_to = "obs",
values_to = "loglik") %>%
summarise(mean = mean(loglik), var = var(loglik), .by = obs) %>%
filter(mean > -10) %>% # there are some terrible ones for this model
mutate(obs = as.numeric(str_remove(obs, "V"))) %>%
ggplot(aes(x = obs, y = mean, ymin = mean - var, ymax = mean + var)) +
geom_pointrange(fatten = .5) +
labs(x = "observations",
y = "log-posterior predictive density")
Figure 3.1: Log-posterior densities of each data point (errorbars show variance) of the Gaussian model.
If we sum over the means shown in Figure 3.1, we obtain \(\widehat{elpd}\), the expected log predictive density, indicated as elpd_loo
. \(p_\text{loo}\) (p_loo
) is the sum of the variances and indicates the flexibility of the model fit (also used as the effective number of parameters). looic
, or LOO-IC, corresponds to \(-2 \times (\)elpd_loo
\(-\)p_loo
\()\) (for deviance scale). To illustrate these, we can calculate them manually
log_lik(fit_gaus) %>%
as_tibble() %>%
pivot_longer(cols = everything(),
names_to = "obs",
values_to = "loglik") %>%
summarise(mean = mean(loglik), var = var(loglik), .by = obs) %>%
summarise(elpd_loo = sum(mean), # sum of mean log lik
p_loo = sum(var), # effective number of parameters
looic = -2 * (elpd_loo - p_loo)) %>% # Bayesian deviance
pivot_longer(everything()) %>%
mutate(across(value, round, 1))
# A tibble: 3 × 2
name value
<chr> <dbl>
1 elpd_loo -6003.
2 p_loo 80.8
3 looic 12168.
or directly using loo
(a small difference isn’t unexpected).
Computed from 15000 by 600 log-likelihood matrix.
Estimate SE
elpd_loo -6037.9 175.2
p_loo 62.4 47.6
looic 12075.8 350.3
------
MCSE of elpd_loo is NA.
MCSE and ESS estimates assume MCMC draws (r_eff in [0.5, 1.2]).
Pareto k diagnostic values:
Count Pct. Min. ESS
(-Inf, 0.7] (good) 597 99.5% 4818
(0.7, 1] (bad) 1 0.2% <NA>
(1, Inf) (very bad) 2 0.3% <NA>
See help('pareto-k-diagnostic') for details.
The Pareto k diagnostic indicates that there are some observations that are very unusual under the single distribution Gaussian model and the obtained parameter estimates.
The difference in elpd_loo
for the two single distribution models is returned by loo
and is shown in the elpd_diff
column (\(\Delta\widehat{elpd}\)) with standard error shown in the se_diff
column.
elpd_diff se_diff elpd_loo se_elpd_loo p_loo se_p_loo looic se_looic
fit_lgaus 0.0 0 -4444.2 51.2 10.9 1.2 8888.4 102.3
fit_gaus -1593.7 160 -6037.9 175.2 62.4 47.6 12075.8 350.3
The model comparison shows that the model fit_lgaus
(the log-Gaussian model) is approximately 10 standard errors better than fit_gaus
. This can be calculated by abs(loo_fit$diffs[2,1] / loo_fit$diffs[2,2])
by normalising the difference over its standard error \(\mid\frac{\Delta\widehat{elpd}}{\text{SE}}\mid\), which returns the difference between models expressed in standard errors (Sivula et al., 2020). A difference of 10 standard errors is substantial. By comparison, a t-value more extreme than \(\mid2\mid\) corresponds to an effect that is significant at an \(\alpha\)-level of 0.05 (i.e. \(p<0.05\)) (e.g. Baayen, 2008).
Aside, unlike \(R^2\), elpd_loo
and looic
have no direct interpretation. This is because the size of elpd_loo
increases as a function of the number of observations. Generally elpd_loo
and looic
can be understood as a quantification of how many possible futures the model can rule out. However, what is of importance is the difference between models.
If a measure of variance explained is desired, loo_R2
provides a Bayesian \(R^2\). This quantity is not ideal for model comparisons because, similar to the frequentist \(R^2\), it is subject to overfitting: \(R^2\) will inevitabily increase when more parameters are added to the model. A large \(R^2\) means that a model captures the data well but might mean that the model makes predictions that do not generalise out-of-sample (see chapter 11 in Gelman et al., 2020; also chapter 6 in McElreath, 2016). The is an example for Bayesian \(R^2\) for the single-distribution log-Gaussian model.
Estimate Est.Error Q2.5 Q97.5
R2 0.01960149 0.00890381 0.004511295 0.03963095
4 Two-distributions log-Gaussian mixed-effects mixture model
The mixture model captures the writing process as a combination of three parameters: the average typing speed or fluent interkey intervals (\(\beta\)), the slowdown for hesitant interkey intervals (\(\delta\)) and the probability of hesitant interkey intervals (\(\theta\)). In the model below, the latter two are estimated separately for each level of the categorical prediction location
(i.e. transition location) for the \(i^\text{th}\) observation.
\[ \begin{align} \text{iki}_{i} \sim & \theta_\text{location[i],participant[i]} \times \text{log}\mathcal{N}(\beta + \delta_\text{location[i]} + u_\text{participant[i]}, \sigma_{e'_\text{location[i]}}^2) + \\ & (1 - \theta_\text{location[i],participant[i]}) \times \text{log}\mathcal{N}(\beta+ u_\text{participant[i]}, \sigma_{e_\text{location[i]}}^2)\\ &\text{where}\\ &\delta \sim \mathcal{N}(0,1)\\ &\text{constraint: } \delta > 0 \end{align} \]
This model takes into account two sources of random error: (1) each participant has an individual typing speed modelled as deviations from the average \(u_\text{participant} \sim \mathcal{N}(0, \sigma^2_p)\) as above; (2) each participant has in individual hesitation probability \(\theta_\text{participant}\) that differs across levels of the categorical predictor location
.
The parameters of this equation are visualised in Figure 4.1. This illustration shows the dissociation of two underlying processes that the model is estimating on the basis of the data. There are two distributions of which the left one in grey is the distribution of fluent interkey intervals; the right distribution in green is wider and captures longer interkey intervals (with some overlap between fluent and disfluent interkey intervals).
Figure 4.1: What’s a mixture model for interkey intervals? The parameters of the mixture model explained visually.
Importantly Figure 4.1 illustrates the average typing speed of fluent interkey intervals captured by the \(\beta\) parameter; the distance between fluent transitions and disfluent transitions captured by \(\delta\) (i.e. the distance between the grey and the green distribution); the hesitation probability or the “pausing frequency” which is the height of the green distribution captured by \(\theta\).
Most interkey intervals are, typically, fluent and hence part of the grey distribution. The frequency of the minority of disfluent interkey intervals is captured by the height of the green distribution. As this model assumes a finite mixture of two distributions, the probability of fluent interkey intervals is the inverse \(1-\theta\); in other words, a pausing probability of \(\theta=.4\) means that the probability of fluent interkey intervals is \(.6\). If the pausing probability decreases to \(.3\) the proportion of fluent interkey intervals had increased to \(.7\).
4.1 Model input values
The data must be transformed into a list to feed them into Stan. The information required from the data is determined in the Stan code and has to do with what we need to estimate the model parameters and how the model is implemented. The exert from the Stan code below shows the information that is expected as input, which data type is expected (e.g. int
means integer), their names, and what the smallest (i.e. lower
) and largest possible value (i.e. upper
) are. The full Stan code can be found at the end of this tutorial.
// Do not run
// Data chunk of Stan code
data {
int<lower=1> N; // Number of observations
int<lower=1> nS; // Number of subjects
int<lower=1, upper=nS> ppt[N];// Participant identifiers
int<lower=1> K; // Number of conditions
int<lower=1> condition[N]; // Condition identifiers
vector[N] y; // Outcome: interkey intervals
}
Participants and conditions are expected to represent indices in the Stan code. In the code below we use factor
in combination with as.numeric
to ensure that there are no empty indices for participants and conditions. We create vectors with participant identifiers ppts
(numbers \(1\) through \(I\) where \(I\) is the number of participants) and a numeric identifier condition
for each level of the transition location (levels: before sentence = 1, before word = 2, within word = 3). The returned values are indices for the parameters in the model: For example, for the location identifier, delta[1]
is the population estimate for the slowdown for hesitant interkey intervals at before sentence locations, delta[2]
for before word locations; theta[1]
and theta[2]
are the hesitation probability (on the logit scale) for before-sentence and before-word transitions, respectively.
ppts <- pull(data, participant) %>% factor() %>% as.numeric()
condition <- pull(data, location) %>% factor() %>% as.numeric()
In the below code, the names on the left side of the arrow must correspond to the names expected in the Stan code (above); names on the right side do not. We assigned the participant and location identifiers created above and their maximum values nS
and K
, respectively. The interkey interval data iki
were assigned to y
and the total number of observations (number of rows in the data nrow(data)
) was assigned to N
.
data_list <- within( list(), {
ppt <- ppts
nS <- max(ppts) # max no of ppts
condition <- condition
K <- max(condition) # no of conditions
y <- data$iki
N <- nrow(data)
} )
The information in the data list can be viewed using the glimpse
function.
List of 6
$ N : int 600
$ y : num [1:600] 2881 465 144 121 129 ...
$ K : num 3
$ condition: num [1:600] 1 2 1 1 2 1 3 2 1 1 ...
$ nS : num 10
$ ppt : num [1:600] 3 3 3 3 3 3 3 3 3 3 ...
4.2 Loading the model
The Stan code can be downloaded from our OSF repository using readLines
and saved locally using writeLines
as “molg.stan” (mixture of log-Gaussians) in the directory “stan”.
# Download Stan code
file <- readLines("https://osf.io/r7aqx/download")
# Save Stan code in "stan" directory
writeLines(file, "stan/molg.stan")
The Stan model can be loaded and translated into an S4 class object using the stan_model
function and assigned to molg
. The “molg.stan” file needs to be available in the directory “stan” located inside the current working directory.
This code can fit the posterior of a categorical predictor with any number of levels. In other words, the posterior that the model returns can be used to calculate simple differences, main effects and interactions of factorial designs. We demonstrate below how to calculate a simple difference. The same logic can be used to calculate main effects and interactions. For more predictors, or removing or adding, e.g., random error terms, or adjusting priors, the Stan code can be opened and changed in R or any text editor. The Stan code used is largely based on Vasishth et al. (2017). Sorensen et al. (2016) presents a detailed tutorial on how to write models in Stan (see also Lambert, 2018).
This model estimates a number of parameters of which not all are interesting for inference. To reduce the physical size of the model object that contains the posterior we can determine which parameters we want to save.
# Parameters to keep in output
pars <- c("beta", # fluent typing
"delta", # disfluency slowdown
"theta", # mixing proportion on logit scale
"prob", # mixing proportion as probability
"prob_s", # by-participant disfluency probability
"sigma", # variance component
"sigmap_e", "sigma_e", # variance by mixture component
"sigma_diff", # difference between variances
"sigma_u", # variance for random ppt
"log_lik", # log likelihood (for model comparison)
"y_tilde") # predicted data
The sampling
function applies the model to the data using the information introduced before (iterations, warmup, chains, cores, parameters pars
to be saved). save_warmup
is set to FALSE
to discard the warmup samples (which are not used for inference anyway) to reduce the size of the model object. To allow reproducibility (and because Bayesian models involves random number generation) we set the seed using an arbitrary number. Lastly, the control argument was specified with higher values for adapt_delta
and max_treedepth
: using higher values here means the model runs slower but means more careful parameter estimation. The model is assigned to fit_molg
.
# Fit model
fit_molg <- sampling(molg,
data = data_list,
iter = iterations,
warmup = warmup,
chains = nchains,
cores = ncores,
pars = pars, # Parameters to keep.
save_warmup = FALSE, # Don't save the warmup samples.
seed = 365,
# Improve sampling efficiency
control = list(adapt_delta = 0.96, # default: 0.9
max_treedepth = 14)) # default: 10
As above we save the results using saveRDS
.
Running this model will take a moment to complete sampling depending on your hardware specifications. The time it took my laptop to complete this job can be viewed using get_elapsed_time
.
warmup sample
chain:1 0.8758833 0.7950167
chain:2 0.9136667 0.5474167
chain:3 0.8951833 0.8892333
Inferred posterior estimates shown as averages and 95% probability intervals can be extracted using the print
function for each transition location and the three parameter values of interest. prob
is the \(\theta\) parameter on the probability scale rather than on the logit scale. \(\beta\) and \(\delta\) are on the log scale.
Inference for Stan model: anon_model.
3 chains, each with iter=10000; warmup=5000; thin=1;
post-warmup draws per chain=5000, total post-warmup draws=15000.
mean se_mean sd 2.5% 97.5% n_eff Rhat
beta 5.08 0.00 0.08 4.92 5.24 3668 1
delta[1] 1.97 0.00 0.15 1.69 2.27 8978 1
delta[2] 1.24 0.00 0.15 0.95 1.54 11735 1
delta[3] 0.67 0.00 0.29 0.16 1.28 11620 1
prob[1] 0.73 0.00 0.08 0.56 0.87 7466 1
prob[2] 0.55 0.00 0.10 0.36 0.73 6966 1
prob[3] 0.12 0.00 0.06 0.04 0.26 5506 1
theta[1] 1.03 0.00 0.42 0.24 1.88 7001 1
theta[2] 0.20 0.00 0.40 -0.58 1.01 6943 1
theta[3] -2.06 0.01 0.53 -3.12 -1.06 5160 1
Samples were drawn using NUTS(diag_e) at Sun May 19 04:55:42 2024.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
We return to the interpretation of these estimates below.
4.3 Model diagnostics
4.3.1 Convergence
Before we interpret these results we need to ensure that the model reached convergence. Convergence can be established using two simple methods. First, we can inspect the chains in trace plots which show the parameter estimate across iterations (here after warmup) for each chain in different colours. We will look at the population-level parameters saved in pars
.
To create trace plots, we can apply the stan_trace
function to the model fit_molg
and extract the MCMC chains for the parameters in pars
. The alpha
argument makes the colours slightly more transparent otherwise it is difficult to see the chains hiding in the back. If the chains overlap and look like “fat hairy caterpillars”, chains have converged on the same target distribution. The chains below look good (i.e. well mixed).
Second, we can calculate the \(\hat{R}\) statistic. Successful convergence is reflected in \(\hat{R}\) values smaller than 1.05 (Gelman & Rubin, 1992). \(\hat{R}\) is similar to the F statistic in standard analysis of variance: it tells us how much bigger the variability between chains is compared to the variability within chains. A value of \(\approx 1\) indicates that the variability is essentially identical signifying convergence. Larger \(\hat{R}\) values indicate that the chains are diverging. The rhat
function applied to the model and the parameters of interest extracts the \(\hat{R}\) statistic. There is no evidence for convergence problems.
summary(fit_molg, pars=pars)$summary %>%
as.data.frame() %>%
rownames_to_column("parameter") %>%
mutate(across(Rhat, round, 3)) %>%
select(parameter, Rhat)
parameter Rhat
1 beta 1.002
2 delta[1] 1.000
3 delta[2] 1.000
4 delta[3] 1.000
5 theta[1] 1.000
6 theta[2] 1.000
7 theta[3] 1.000
Convergence problems can have many reasons and therefore many solutions: for example, running longer chains (in particular longer warmups), increasing the maximum treedepth and the average acceptance probability (adapt_delta
), specifying starting values, using sensible regulating priors, constraining parameters, or changing the parametrisation of the model. Severe convergence problems indicate model misspecification.
4.3.2 Posterior predictive check
The model is simulating hypothetical datasets on the basis of the estimated posterior parameter values. This happened for every iteration for every chains. In other words there is a total of 15,000 simulated data sets (and samples per parameter) because
[1] 15000
We can draw predicted data from the model output and compare these to the observed data. A model that makes reasonable predictions should fit the observed data.
Using the as.matrix
function we extract a matrix of y_tilde
, the simulated data. This returns a matrix of the size total_samples
\(\times\) nrow(data)
(so 15,000 \(\times\) 600). We use the sample
function to draw N
randomly sampled hypothetical data sets.
y_tilde <- as.matrix(fit_molg, pars = "y_tilde") # extract simulated data sets
N <- 50 # number of simulations to use
rnd_sims <- sample(total_samples, N) # created random indices
y_tilde_sample <- y_tilde[rnd_sims,] # draw N random simulations
The ppc_dens_oberlay
function from the bayesplot
package is then mapping the observed data (\(y\); thick blue line) to the simulated data (\(y_{rep}\); thin light blue lines). We find that the simulated data sets closely align with the observed data.
library(bayesplot)
ppc_dens_overlay(data_list$y, y_tilde_sample) +
scale_x_continuous(limits = c(0, 3000))
For comparison we also visualised the fit for the simulated data of the Gaussian and the log-Gaussian model in Figure 4.2. Both models show a substantially poorer fit to the keystroke data than the mixture-model based simulations. Note that the Gaussian model does not rule out negative values which are conceptually implausible.
Figure 4.2: Fit of simulated datasets compared to observed data for the Gaussian and the log-Gaussian model. x-axes were truncated for visability.
5 Model comparsion II
Similar to the section “Model comparison I” we use leave-one-out cross-validation to compare now all three models, the mixture of log-Gaussians, the single distribution Gaussian and log-Gaussian models. Because the mixture model was fitted directly in Stan and the single-distribution models were fitted in brms
, we need to extract LOO statistics for which we need the log-likelihood and the relative effective sample size for the Stan model. We can then compare the models using functions provided by the loo
package.
library(loo)
# Mixture of log-Gaussians
log_lik <- extract_log_lik(fit_molg, merge_chains = F)
r_eff <- relative_eff(exp(log_lik)) # relative effective sample size
loo_molg <- loo(log_lik, r_eff = r_eff)
# Single distribution log-Gaussian
loo_lgaus <- loo(fit_lgaus)
# Single distribution Gaussian
loo_gaus <- loo(fit_gaus)
In “Model comparison I” we used loo
to compare two models directly. This time we need the loo_compare
function because we had to extract the LOO statistics for comparison.
We can then extract the model comparisons and fit statistics:
loo_comp %>%
as.data.frame() %>%
rownames_to_column("models") %>%
mutate(across(models, recode, model3 = "fit_molg"),
across(where(is.numeric), round, 1))
models elpd_diff se_diff elpd_loo se_elpd_loo p_loo se_p_loo looic se_looic
1 fit_molg 0.0 0.0 -4311.0 51.5 31.5 2.0 8621.9 102.9
2 fit_lgaus -133.2 14.6 -4444.2 51.2 10.9 1.2 8888.4 102.3
3 fit_gaus -1726.9 164.6 -6037.9 175.2 62.4 47.6 12075.8 350.3
The difference between the model in the top row and the two other models is shown in elpd_diff
(\(\Delta\widehat{elpd}\)) with standard error shown in the se_diff
column. Models are ordered from the model with the highest predictive performance to the lowest (i.e. elpd_loo
). The model comparison shows that the model fit_molg
(the mixture of log-Gaussian) is approximately 9 standard error better than the single distribution log-Gaussian fit_lgaus
(try abs(loo_comp[2,1] / loo_comp[2,2])
). This constitutes substantial support for the mixture model over the single distribution log-Gaussian.
6 Posterior probability distribution
At the core of Bayesian inference is the posterior probability distribution. For each model parameter we have 15,000 posterior samples which constitute the posterior probability distribution. These samples are the result of the posterior samples taken from the target distribution after warmup collapsed across all chains. If the target distribution was not identified by the end of the warmup, inference will not be reliable. We know that the target distribution was determined because chains converged.
The distribution of parameter values represents the uncertainty about parameter values given the data. There is a large range of things one can do with a posterior. Below we focus on summarising parameter estimates, comparing conditions, and extracting by-participant estimates.
The names
function can be used to check which parameters are available in the model (we omitted the log-likelihood log_lik
, and the stimulated datasets y_tilde
). The meaning of the parameters is described above and is determined in the Stan code. Indices refer to the transition locations (1 = before sentence; 2 = before word; 3 = within word) and to the participant identifier for by-participant parameters (indicated by _s
).
[1] "beta" "delta[1]" "delta[2]" "delta[3]" "theta[1]" "theta[2]" "theta[3]" "prob[1]" "prob[2]" "prob[3]" "prob_s[1,1]" "prob_s[2,1]"
[13] "prob_s[3,1]" "prob_s[1,2]" "prob_s[2,2]" "prob_s[3,2]" "prob_s[1,3]" "prob_s[2,3]" "prob_s[3,3]" "prob_s[1,4]" "prob_s[2,4]" "prob_s[3,4]" "prob_s[1,5]" "prob_s[2,5]"
[25] "prob_s[3,5]" "prob_s[1,6]" "prob_s[2,6]" "prob_s[3,6]" "prob_s[1,7]" "prob_s[2,7]" "prob_s[3,7]" "prob_s[1,8]" "prob_s[2,8]" "prob_s[3,8]" "prob_s[1,9]" "prob_s[2,9]"
[37] "prob_s[3,9]" "prob_s[1,10]" "prob_s[2,10]" "prob_s[3,10]" "sigma" "sigmap_e[1]" "sigmap_e[2]" "sigmap_e[3]" "sigma_e[1]" "sigma_e[2]" "sigma_e[3]" "sigma_diff[1]"
[49] "sigma_diff[2]" "sigma_diff[3]" "sigma_u" "lp__"
The posterior probability distribution of the parameter estimates can be visualised using the stan_hist
function applied to the model fit_molg
and the to-be-visualised parameters pars
as arguments. The function returns a histogram for each parameter. Each histogram is a probability distribution indicating which values are more (or less) probable estimates for the true parameter value and the uncertaintly associated with it.
The posterior samples of the parameter values can be summarised using the print
function. The probs
argument can be used to extract the lower and upper bound of the probability interval. A lower bound of .025 and an upper bound of .975 gives the 95% probability interval (PI; also called “credible intervals”), i.e. the range that contains the true parameter value with a 95% probability given the data.
Inference for Stan model: anon_model.
3 chains, each with iter=10000; warmup=5000; thin=1;
post-warmup draws per chain=5000, total post-warmup draws=15000.
mean se_mean sd 2.5% 97.5% n_eff Rhat
beta 5.08 0.00 0.08 4.92 5.24 3668 1
delta[1] 1.97 0.00 0.15 1.69 2.27 8978 1
delta[2] 1.24 0.00 0.15 0.95 1.54 11735 1
delta[3] 0.67 0.00 0.29 0.16 1.28 11620 1
theta[1] 1.03 0.00 0.42 0.24 1.88 7001 1
theta[2] 0.20 0.00 0.40 -0.58 1.01 6943 1
theta[3] -2.06 0.01 0.53 -3.12 -1.06 5160 1
Samples were drawn using NUTS(diag_e) at Sun May 19 04:55:42 2024.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
This output summarises the most probable parameter value as mean and its standard error (se_mean
), standard deviation (sd
), the effective sample size (n_eff
) indicating sampling efficiency and the convergence metric \(\hat{R}\) (Rhat
).
For the following steps, we focus on the three parameters that have conceptually interesting interpretations: (1) the typing speed \(\beta\), (2) the hesitation slowdown \(\delta\), and (3) the hesitation probability \(\theta\) (shown on the logit scale).
The plot
function shows the posterior probability distribution of the three parameters for before-sentence transitions (indicated as 1), before-word transitions (indicated as 2), and within-word transitions summarised as median and 95% PI.
The values for \(\beta\) and \(\delta\) are shown on a log-scale. To transform their values back to msecs we can extract the posterior samples using the as.data.frame
function.
# A tibble: 15,000 × 8
idx beta `delta[1]` `delta[2]` `delta[3]` `theta[1]` `theta[2]` `theta[3]`
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 5.12 1.91 1.20 0.0271 0.405 0.303 -0.640
2 2 4.98 1.82 1.29 0.00434 1.66 0.409 -2.35
3 3 5.12 1.84 1.33 0.107 1.08 -0.250 -2.52
4 4 5.11 2.17 1.27 0.252 0.389 -0.442 -1.44
5 5 5.11 2.00 1.11 0.336 0.699 0.967 -2.35
6 6 5.17 1.93 0.950 1.33 0.561 -0.0950 -2.70
7 7 5.05 2.09 1.52 0.286 1.53 0.186 -3.05
8 8 5.20 1.87 1.24 0.154 1.68 -0.416 -2.08
9 9 5.04 1.98 1.27 0.257 0.324 -0.202 -2.09
10 10 4.96 2.26 1.17 1.33 0.974 0.115 -2.21
# ℹ 14,990 more rows
The pivot_longer
function transforms the data to a long format with an additional column for location and the delta
and theta
model parameters as columns. The names_pattern
argument is using a regular expression to extract the number in the squared brackets.
posterior_long <- pivot_longer(posterior,
cols = -c(idx, beta),
names_to = c(".value", "location"),
names_pattern = "(.*)\\[(.)\\]")
posterior_long
# A tibble: 45,000 × 5
idx beta location delta theta
<chr> <dbl> <chr> <dbl> <dbl>
1 1 5.12 1 1.91 0.405
2 1 5.12 2 1.20 0.303
3 1 5.12 3 0.0271 -0.640
4 2 4.98 1 1.82 1.66
5 2 4.98 2 1.29 0.409
6 2 4.98 3 0.00434 -2.35
7 3 5.12 1 1.84 1.08
8 3 5.12 2 1.33 -0.250
9 3 5.12 3 0.107 -2.52
10 4 5.11 1 2.17 0.389
# ℹ 44,990 more rows
Note that the values for beta
are now repeated for each location a a consequence. We had to do this for the next step.
The next code transforms beta
and delta
to msecs using the exponential function exp
. In order to transform the slowdown delta
into msecs we had to add beta
before using the exponential function; we can then subtract beta
again. theta
samples are currently on the logit scale and can be transformed to probabilities using brms::inv_logit_scaled
(the model also contained a prob
parameter which is theta
but on the probability scale). The recode
function changes the location indices from 1 to “before sentence”, 2 to “before word”, and 3 to “within word”.
posterior_in_msecs <- mutate(posterior_long,
delta = exp(beta + delta) - exp(beta),
across(beta, exp),
across(theta, inv_logit_scaled),
across(location, recode, `1` = "before sentence",
`2` = "before word",
`3` = "within word"))
posterior_in_msecs
# A tibble: 45,000 × 5
idx beta location delta theta
<chr> <dbl> <chr> <dbl> <dbl>
1 1 168. before sentence 970. 0.600
2 1 168. before word 390. 0.575
3 1 168. within word 4.60 0.345
4 2 146. before sentence 759. 0.840
5 2 146. before word 387. 0.601
6 2 146. within word 0.636 0.0873
7 3 167. before sentence 882. 0.747
8 3 167. before word 463. 0.438
9 3 167. within word 18.8 0.0742
10 4 165. before sentence 1284. 0.596
# ℹ 44,990 more rows
The posterior distribution of the parameter estimates can then be visualised in, for example, histograms. The histograms show the probability distribution of each parameter with the hesitation slowdown \(\delta\) and the hesitation probability \(\theta\) shown for each transition location. The latter two illustrate that their posterior parameter values are sensitive to the interkey intervals location in the text with more and longer hesitations before sentences than before words than within words which we evaluate in the next section.
pivot_longer(posterior_in_msecs, c(beta, delta, theta)) %>%
mutate(across(location, ~case_when(str_detect(name, "beta") ~ "overall",
TRUE ~ .)), # correct location for beta
across(location, ~factor(., # order locations for plotting
levels = c("before sentence",
"before word",
"within word",
"overall"),
ordered = TRUE))) %>%
unique() %>% # remove duplicated values for beta
ggplot(aes(x = value, colour = location, fill = location)) +
geom_histogram(position = "identity", alpha = .25) +
facet_wrap(~name, scales = "free", labeller = label_parsed) +
scale_fill_brewer("Transition\nlocation", palette = "Dark2") +
scale_color_brewer("Transition\nlocation", palette = "Dark2") +
labs(x = "Posterior parameter estimate") +
theme(legend.position = "top")
To illustrate differences for the estimated cells means compared to the single distribution models, here are the means for all parameters and their location. The central tendencies of the single distribution models are in principle a combination of beta
and beta + delta
weighted by prob
.
pivot_longer(posterior_in_msecs, c(beta, delta, theta)) %>%
mutate(across(location, ~case_when(str_detect(name, "beta") ~ "overall",
TRUE ~ .))) %>%
unique() %>%
summarise(across(value, list(mean = mean),
.names = "{.col}"),
.by = c(name, location)) %>%
arrange(name)
# A tibble: 7 × 3
name location value
<chr> <chr> <dbl>
1 beta overall 161.
2 delta before sentence 1013.
3 delta before word 402.
4 delta within word 169.
5 theta before sentence 0.730
6 theta before word 0.548
7 theta within word 0.124
These are the estimated cell means from the log-Gaussian model
locationbeforeMsentence locationbeforeMword locationwithinMword
691 330 162
and the estimated cell means of the Gaussian model
locationbeforeMsentence locationbeforeMword locationwithinMword
2840 387 343
Differences are especially noticeable for before-sentence transitions.
7 Hypothesis testing
7.1 Differences between transition locations
We calculate the differences between transition locations for both hesitation parameters to determine to what extent each of the parameters are sensitive to transition locations. This shows us to what extent the size of the hesitation slowdown \(\delta\), and the hesitation probability \(\theta\) are sensitive to the linguistic location of a key pair.
To determine these differences we change the data format above and create a column that indicates the parameter name and a column for the corresponding values for each transition location. The last line calculates the difference between (1) transitions before sentences and words, and (2) transitions before and within words for each of the two hesitation parameters.
posterior_diffs <- posterior_in_msecs %>%
select(-idx) %>%
pivot_longer(delta:theta, names_to = "parameter") %>%
mutate(id = row_number(), .by = location) %>%
pivot_wider(names_from = location, values_from = value) %>%
transmute(parameter = str_c(parameter, "[diff]"),
diff_sent = `before sentence` - `before word`, # compare sentences to word
diff_word = `before word` - `within word`) # compare within to before word
posterior_diffs
# A tibble: 30,000 × 3
parameter diff_sent diff_word
<chr> <dbl> <dbl>
1 delta[diff] 580. 385.
2 theta[diff] 0.0246 0.230
3 delta[diff] 372. 386.
4 theta[diff] 0.239 0.514
5 delta[diff] 420. 444.
6 theta[diff] 0.309 0.364
7 delta[diff] 859. 378.
8 theta[diff] 0.205 0.199
9 delta[diff] 727. 271.
10 theta[diff] -0.0567 0.637
# ℹ 29,990 more rows
The differences between locations can be summarised as the mean, the 95% PIs and the probability that the difference between transition locations is smaller than 0 (notated as e.g. \(P(\hat{\theta}_\text{diff} < 0)\) for the parameter \(\theta\); the hat symbol \(\hat{.}\) indicates that the value is an estimate of the population parameter value). This summary tells us to what extent we can rule out a null difference between transition locations.
# Create summary functions for convenience
lower <- function(x) quantile(x, .025) # lower bound of 95% PI
upper <- function(x) quantile(x, .975) # upper bound of 95% PI
p_diff <- function(x) mean(x < 0) # prob that difference is negative
# Summarise differences
posterior_diffs %>%
pivot_longer(starts_with("diff")) %>%
summarise(across(value,
list(mean = mean, # mean of difference
lower = lower, # 2.5% lower bound of difference
upper = upper, # 97.5% upper bound of difference
p_diff = p_diff), # prob that difference is negative
.names = "{.fn}"),
.by = c(parameter, name)) %>% # group by parameters, comparison
mutate(across(where(is.numeric), round, 2)) # round after 2nd decimal place
# A tibble: 4 × 6
parameter name mean lower upper p_diff
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 delta[diff] diff_sent 611. 251. 1052. 0
2 delta[diff] diff_word 233. -50.6 483. 0.05
3 theta[diff] diff_sent 0.18 -0.05 0.42 0.06
4 theta[diff] diff_word 0.42 0.21 0.63 0
Form these results we can infer that there is a \(P(\hat{\theta}_\text{diff} < 0)\) = 0.06 chance that hesitations do not occur more frequently before sentence; the results show that hesitations are 0.18 more probable to occur sentence initially than word initially (95% PI: -0.05, 0.42); before-word transitions show an increased hesitation probability of 0.42 (95% PI: 0.21, 0.63) compared to within-word transitions.
Also, the slowdown associated with hesitations is 611 msecs longer for sentences (95% PI: 251, 1052) with a probability of \(P(\hat{\delta}_\text{diff} < 0)\) = 0 (not exactly 0 but smaller than 0.01) to be smaller than zero. Hesitations before words were 233 msecs longer than within words, although the probability interval (95% PI: -51, 483) did not rule out a negative difference, or no difference, which however had a low probability \(P(\hat{\delta}_\text{diff} < 0)\) = 0.05.
These differences can be viewed in histograms. The vertical dotted lines indicates a zero difference between transition locations. The area of the histogram to the left of this line corresponds to the probability p_diff
to observe a negative difference. Sentence boundaries seem to have a stronger effect on the hesitation slowdown while both sentence and word boundaries increase the hesitation probability.
posterior_diffs %>%
pivot_longer(starts_with("diff")) %>%
mutate(across(name, recode, diff_sent = "before sentence vs before word",
diff_word = "before word vs within word")) %>%
ggplot(aes(x = value, colour = name, fill = name)) +
geom_vline(xintercept = 0, colour = "black", linetype = "dotted", size = .5) +
geom_histogram(position = "identity", alpha = .25) +
facet_wrap(~parameter, scales = "free", labeller = label_parsed) +
scale_fill_brewer("Difference", palette = "Dark2") +
scale_color_brewer("Difference", palette = "Dark2") +
labs(x = "Difference between transition locations") +
theme(legend.position = "top")
7.2 ROPE
One problem with null-hypothesis significance testing it that, in practice, we rarely believe that only a difference of exactly null is indicating that there is no difference between groups. Instead there are values that are non-different from null; for example a difference in keystroke intervals of 1 msecs is meaningless in most, if not all, contexts. A hesitation probability increase of 0.001 is too small to be meaningful. To take this into account we can determine a region of practical equivalence (ROPE, Kruschke, 2010b, 2011, 2014).
The ROPE can be defined as the range of values that are a priori considered non-different from a null effect. The result can be understood as the extent to which the posterior cannot rule out a negligible effect. A meaningful difference between groups should have a small proportion of posterior samples within the ROPE.
To calculate the ROPE for differences between transition locations, we use the rope
function from the bayestestR
package.6 The rope
function requires as input a vector with the posterior difference. We can create this from the posterior differences calculated above. Let’s focus on \(\delta_\text{diff}\), the slowdown difference for hesitant transitions.
# Extract vector of differences
# For sentence vs word
delta_diff_sent <- filter(posterior_diffs, str_detect(parameter, "delta")) %>% pull(diff_sent)
# For before word vs within word
delta_diff_word <- filter(posterior_diffs, str_detect(parameter, "delta")) %>% pull(diff_word)
Also, we need to define the range
argument which specifies the range of values that are considered indifferent from null. Of course this range will depend on what the parameter represents and researcher’s intuition as well as prior knowledge of the domain. For example, for fluent transitions we would consider small differences, of say 10 msecs, more meaningful than for hesitations. If we consider differences of 10 msecs or smaller equivalent to a null difference, we can define the range
argument with a lower bound of -10 and the upper bound of 10.
# Proportion of samples inside the ROPE [-10.00, 10.00]:
inside ROPE
-----------
1.19 %
The ROPE for the hesitation duration of the difference between before-word and within-word transitions contains 1.19% of the posterior samples. In other words, there is a 1.19% chance that the true difference \(\delta_\text{diff}\) is between -10 and 10 msecs and therefore negligible.
Let’s see what happens if we determine a wider ROPE between -50 through 50 msecs which can be understood as a more conservative test for a non-zero difference.
# Proportion of samples inside the ROPE [-50.00, 50.00]:
inside ROPE
-----------
6.20 %
The ROPE contains 6.2% of the posterior samples. In other words, there is a 6.2% probability that the true difference \(\delta_\text{diff}\) is between -50 and 50 msecs which would therefore be a negligible effect.
7.3 Bayes Factor
Frequentist null-hypothesis testing is based on the p-value. Conceptually this probability represents how likely the data are if there was no effect and we can, if this probability is low enough (typically smaller than .05) reject the null hypothesis (i.e. the true parameter value being zero). In other words, if the data are very unlikely, the null-hypothesis is implausible. More specifically it tests whether the probability of the obtained test statistic (e.g. t, F, \(\chi^2\)) is as or more extreme than what one would expect if the null hypothesis is true. There are two possible outcomes for p-values: reject the null (i.e. \(p<.05\)); inconclusive evidence (i.e. \(p>.05\)). And that’s it. p-values do neither provide direct evidence for the alternative hypothesis (but evidence against the null), nor do they provide evidence for the null hypothesis (that data are not inconsistent with the null hypothesis is not evidence that there is no difference; or absence of evidence is not evidence of absence).
In contrast, Bayes Factors (BFs) provide a continuum of three possible outcomes (Dienes, 2014, 2016; Dienes & Mclatchie, 2018): (1) there is evidence to reject the null / alternative hypothesis; (2) the evidence is inconclusive (get more data!); (3) accept the null / alternative hypothesis.
For example, if the BF is the probability of the alternative hypothesis \(H_1\) over the null hypothesis \(H_0\), given the data \(y\), values \(>1\) indicate that there is more evidence for \(H_1\) than for \(H_0\), given the data, and \(<1\) indicates that there is more evidence for \(H_0\). This can be turned around by swapping nominator and denominator
\[ \text{BF}_{10} = \frac{p(H_1 \mid y)}{p(H_0 \mid y)} \] or by taking the inverse of \(\text{BF}_{10}\) which is \(\text{BF}_{01} = \frac{1}{\text{BF}_{10}}\).
The strength of evidence for \(H_0\) and \(H_1\) is continuous rather than binary (i.e. evidence vs no evidence). Authors have proposed different soft thresholds for meaningful evidence (see, e.g., Baguley, 2012; Jeffreys, 1961; Lee & Wagenmakers, 2014) but generally a BF larger than 3 is weak evidence, larger than 5 / 6 indicate moderate evidence and 10 / 32 is strong evidence for a statistically meaningful effect. For example BF\(_{10}\) = 2 reflects that the alternative hypothesis is two times more likely than the null hypothesis given the evidence. BF\(_{10}\) = 0.33 is weak evidence in favour of the null hypothesis because \(\frac{1}{0.33} \approx 3\). Notably, BFs can be used to obtain evidence in favour of the null; for discussion see Dienes (2014), Dienes & Mclatchie (2018), Dienes (2016) and for a practical application in writing research see Spilling et al. (2022) and Spilling et al. (2023).
BFs can be calculated using the Savage-Dickey method (see, e.g., Dickey et al., 1970; Jeffreys, 1961; Wagenmakers et al., 2010) where the BF is determined on the basis of the height of the posterior density at zero compared to the height of the prior density at zero. Important to note here is that the BF is sensitive to the prior especially in situations when the sample was small and / or prior information played a big role. It is therefore a good idea to carefully think about the prior in advance of analysis (hence “prior”) and to be mindful of the role of priors in the calculation of BFs.
Figure 7.1 shows an example for the calculation of BFs based on a posterior with an effect difference that is normally distributed with a mean of 2.5 and a SD of 1, expressed as \(\mathcal{N}(2.5, 1)\), compared against three priors with (1) a wide SD \(\mathcal{N}(0, 10)\), (2) a SD equal to the posterior \(\mathcal{N}(0, 2.5)\), and (3) a relatively small SD \(\mathcal{N}(0, 1)\). The resulting BFs are shown in the boxes as well as their calculation. The denominator shows the height of the posterior (black line) at zero and the nominator shows the height of the prior at zero (indicates as black dots). The BF was larger for more informative priors. Of course in practice, the shape of the posterior density is not independent of the prior; for more informative priors the posterior will move closer to the prior.
Figure 7.1: Example for the BF calculated using the Savage-Dickey method for three differently wide priors.
To calculate the BF on the basis of posterior samples, the Savage-Dickey method can be implemented as below (taken from Nicenboim & Vasishth, 2016) which requires an appropriate prior distribution.
BF <- function(posterior_difference, prior_sd = 1){
library(polspline)
fit <- logspline(posterior_difference)
posterior <- dlogspline(0, fit) # Height of the posterior at 0
prior <- dnorm(0, 0, prior_sd) # Height of the prior at 0
return(prior/posterior)
}
The logspline
function estimates the density of the posterior and the dlogspline
and the dnorm
functions return the height of the density at 0. We fixed the mean of the prior at 0 but allow to change the SD as function argument. Aside, the mean of the prior does not need to be zero but can take on any value that can be justified on the basis of domain knowledge. We illustrate here how BFs can be used to test the evidence again the null hypothesis.
For demonstration purposes we calculate the BF with different prior SDs for the hesitation probability \(\theta\), or in particular the posterior difference in hesitation \(\theta_\text{diff}\) for before-sentence vs before-word transitions and before-word vs within-word transitions.
First, we extract vectors of differences
# For sentence vs word
theta_diff_sent <- filter(posterior_diffs, str_detect(parameter, "theta")) %>% pull(diff_sent)
# For before word vs within word
theta_diff_word <- filter(posterior_diffs, str_detect(parameter, "theta")) %>% pull(diff_word)
and then apply the BF
function from above first to the difference between before-sentence and before-word transitions. For interpretation it is important to remember that \(\theta\) is a probability (which we transformed from logit to probability above mostly for ease of explanation; in practice it is a better idea to work with the logit scaled values); hence the difference between two probabilities can only range between -1 and 1 (in the most extreme case). In other words, a prior SD of 1 is saying anything goes (which is not helpful). Similar to the calculations of the ROPE above, it is useful to contemplate which prior distribution appropriately reflects the null hypothesis for the scale the difference is on (e.g. logit, probability, log msecs, msecs). SDs of 1 and .5 render very weakly informative prior distributions (flat, really) as they assign about the same prior probability to differences across the board. An SD of .1 is a more realistic representation for the distribution of null differences.7
[1] 0.4004183
[1] 0.8008365
[1] 4.004183
7.4 Note on using more than a single grouping variable
The Stan code presented in this tutorial can be used to estimate the posterior for two and more conditions represented in one predictor variable (called condition
in the Stan code): We can calculate the difference between conditions from the posterior as well as main effects and interactions for more complex factorial designs as any factorial design can be reduced to a single vector. For example, say we have a 2 \(\times\) 2 factorial design with factor 1 having two levels AB and CD and factor 2 having the corresponding levels AC and BD. These two factors render four conditions A, B, C, D. From posterior samples for each level A through D we can calculate main effect 1 as \(\text{ME1}=(A+B) - (C+D)\), main effect 2 as \(\text{ME2}=(A+C) - (B+D)\), and their interaction as \(\text{Interaction}=(A-B) - (C-D)\) (or \(\text{Interaction}=(A-C) - (B-D)\)) and summary statistics as shown above.
8 By-participant estimates
In some contexts we would like to obtain by-participant estimates of the probability of a participant to exhibit hesitations (i.e. \(\theta_s\)) reflecting how often a participant is pausing. As before we can extract the posterior from the model fit_molg
. This time we use the parameter that stored by-participant estimates prob_s
(which is the inverse-logit of the theta_s
which we didn’t store in the model fit) corresponding to the population estimate theta
.
posterior_ppts <- as.data.frame(fit_molg, pars = 'prob_s') %>% as_tibble()
names(posterior_ppts)[1:10]
[1] "prob_s[1,1]" "prob_s[2,1]" "prob_s[3,1]" "prob_s[1,2]" "prob_s[2,2]" "prob_s[3,2]" "prob_s[1,3]" "prob_s[2,3]" "prob_s[3,3]" "prob_s[1,4]"
The column names have the format <parameter>_s[<location>, <participant>]
where indices indicate the transition-location id and the participant identifier of the parameters. The following code creates a long format with columns location
and participant
and a value
column with their respective values (value
is pivot_longer
default). The previous column names were separated into location
(levels: 1
, 2
, 3
), and participant
(an index for each participant) using “,” to separate the two variables and extracting the numbers inside the squared brackets (the remainder of the name was discarded).
posterior_ppts_long <- posterior_ppts %>%
pivot_longer(everything(),
names_to = c("location", "participant"),
# extract indice for location and ppt using "," as separator
names_pattern = "\\[(.+),(.+)\\]") %>%
mutate(across(location, recode,
`1` = "before sentence",
`2` = "before word",
`3` = "within word")) %>%
# create different columns for location
mutate(id = row_number(),
.by = location) %>%
pivot_wider(names_from = location,
values_from = value) %>%
select(-id) # drop id
posterior_ppts_long
# A tibble: 150,000 × 4
participant `before sentence` `before word` `within word`
<chr> <dbl> <dbl> <dbl>
1 1 0.401 0.285 0.303
2 2 0.704 0.596 0.167
3 3 0.286 0.686 0.211
4 4 0.549 0.315 0.210
5 5 0.573 0.541 0.313
6 6 0.488 0.422 0.439
7 7 0.591 0.738 0.110
8 8 0.701 0.479 0.302
9 9 0.670 0.725 0.202
10 10 0.821 0.360 0.403
# ℹ 149,990 more rows
We can then summarise by-participant estimates for each transition location with the most probable parameter value and the 95% PI as before.
posterior_ppts_long_summary <- posterior_ppts_long %>%
pivot_longer(-participant, names_to = "location") %>%
summarise(across(value, list(mean = mean,
lower = lower,
upper = upper),
.names = '{.fn}'),
.by = c(location, participant))
posterior_ppts_long_summary
# A tibble: 30 × 5
location participant mean lower upper
<chr> <chr> <dbl> <dbl> <dbl>
1 before sentence 1 0.642 0.364 0.895
2 before word 1 0.630 0.365 0.888
3 within word 1 0.190 0.0393 0.464
4 before sentence 2 0.816 0.600 0.967
5 before word 2 0.574 0.313 0.838
6 within word 2 0.0974 0.0124 0.272
7 before sentence 3 0.506 0.261 0.767
8 before word 3 0.697 0.420 0.933
9 within word 3 0.128 0.0237 0.333
10 before sentence 4 0.631 0.389 0.856
# ℹ 20 more rows
For all 10 participants we see the estimate probability of typing hesitations by transition location. We observe in general that hesitations are more frequent before sentences and words than within words. However, participants vary to the extend they hesitate before words and sentences.
posterior_ppts_long_summary %>%
mutate(across(participant, as.numeric)) %>%
ggplot(aes(x = participant,
y = mean,
ymin = lower,
ymax = upper,
colour = location,
shape = location)) +
geom_pointrange(position = position_dodge(.75)) +
labs(x = "Participant id",
y = "Hesitation probability with 95% PIs") +
scale_colour_brewer("Transition location", palette = "Dark2") +
scale_shape_manual("Transition location", values = c(21, 3, 8)) +
scale_x_continuous(breaks = 1:10) +
theme(legend.position = 'top')
Finally, we calculated and visualised the by-participant differences.
posterior_ppts_long_diff_summary <- posterior_ppts_long %>%
mutate(diff_sent = `before sentence` - `before word`,
diff_word = `before word` - `within word`) %>%
select(participant, starts_with('diff')) %>%
pivot_longer(-participant,
names_to = 'difference',
values_to = 'diff') %>%
summarise(across(diff, list(mean = mean,
lower = lower,
upper = upper),
.names = '{.fn}'),
.by = c(participant, difference))
posterior_ppts_long_diff_summary
# A tibble: 20 × 5
participant difference mean lower upper
<chr> <chr> <dbl> <dbl> <dbl>
1 1 diff_sent 0.0116 -0.362 0.371
2 1 diff_word 0.440 0.0916 0.743
3 2 diff_sent 0.242 -0.0812 0.554
4 2 diff_word 0.477 0.179 0.769
5 3 diff_sent -0.192 -0.553 0.186
6 3 diff_word 0.569 0.254 0.845
7 4 diff_sent 0.246 -0.104 0.566
8 4 diff_word 0.273 -0.0469 0.565
9 5 diff_sent 0.157 -0.130 0.462
10 5 diff_word 0.590 0.287 0.866
11 6 diff_sent 0.214 -0.138 0.539
12 6 diff_word 0.228 -0.0816 0.538
13 7 diff_sent 0.356 -0.0154 0.698
14 7 diff_word 0.354 0.0643 0.698
15 8 diff_sent 0.367 0.0568 0.656
16 8 diff_word 0.395 0.0661 0.708
17 9 diff_sent -0.119 -0.423 0.188
18 9 diff_word 0.637 0.295 0.885
19 10 diff_sent 0.468 0.106 0.791
20 10 diff_word 0.210 -0.142 0.549
These allow us to identify participants that do or do not differ in terms of hesitation frequency depending on transition locations.
posterior_ppts_long_diff_summary %>%
mutate(across(difference, recode,
diff_sent = "Comparison: before sentence vs before word",
diff_word = "Comparison: before word vs within word"),
across(participant, as.numeric)) %>%
ggplot(aes(x = participant,
y = mean,
ymin = lower,
ymax = upper)) +
geom_hline(yintercept = 0, colour = "black", linetype = "dotted") +
geom_pointrange() +
facet_wrap(~difference) +
coord_flip() +
scale_x_continuous(breaks = 1:10) +
labs(x = "Participant id",
y = "Difference in hesitation probabilities with 95% PIs")
9 References
10 Stan code
/*
Mixture model for pauses
With by-ppt mixing proportion
Random intercepts for participants
*/
data {
int<lower=1> N; // Number of observations
int<lower=1> nS; // Number of subjects
int<lower=1, upper=nS> ppt[N];// Participant identifiers
int<lower=1> K; // Number of conditions
int<lower=1> condition[N]; // Condition identifiers
vector[N] y; // Outcome: interkey intervals
}
parameters {
// fixed effects
real beta; // fluent interkey intervals
vector<lower=0>[K] delta; // slowdown for long interkey intervals
vector[K] theta; // hesitation probability
real<lower = 0> tau; // error for hesitation probability
real<lower=0> sigma; // residual sd
vector<lower=0>[K] sigma_diff;
// For random effects
vector[nS] u; // participant intercepts
real<lower=0> sigma_u; // participant sd
matrix[K, nS] theta_s; // participant hesitations
}
transformed parameters{
vector<lower=0>[K] sigmap_e = sigma + sigma_diff;
vector<lower=0>[K] sigma_e = sigma - sigma_diff;
matrix[K, nS] log_theta_s_1 = log_inv_logit(theta_s);
matrix[K, nS] log_theta_s_2 = log1m_inv_logit(theta_s);
vector[K] prob = inv_logit(theta);
matrix[K, nS] prob_s = inv_logit(theta_s);
}
model {
vector[2] lp_parts;
// Priors
beta ~ normal(5, .5);
sigma ~ student_t(7, 0, 1);
sigma_diff ~ normal(0, 1);
delta ~ normal(0, .5);
theta ~ normal(0, 1);
tau ~ student_t(7, 0, .5);
for(s in 1:nS){
for(k in 1:K){
theta_s[k,s] ~ normal(theta[k], tau);
}
}
// REs priors
sigma_u ~ normal(0,2.5);
u ~ normal(0, sigma_u); //subj random effects
// likelihood
for(n in 1:N){
real mu = beta + u[ppt[n]];
lp_parts[1] = log_theta_s_1[condition[n], ppt[n]] + lognormal_lpdf(y[n] | mu + delta[condition[n]], sigmap_e[condition[n]]);
lp_parts[2] = log_theta_s_2[condition[n], ppt[n]] + lognormal_lpdf(y[n] | mu, sigma_e[condition[n]]);
target += log_sum_exp(lp_parts);
}
}
generated quantities{
vector[N] log_lik;
vector[N] y_tilde;
vector[2] lp_parts;
real<lower=0,upper=1> prob_tilde;
for(n in 1:N){
real mu = beta + u[ppt[n]];
lp_parts[1] = log_theta_s_1[condition[n], ppt[n]] + lognormal_lpdf(y[n] | mu + delta[condition[n]], sigmap_e[condition[n]]);
lp_parts[2] = log_theta_s_2[condition[n], ppt[n]] + lognormal_lpdf(y[n] | mu, sigma_e[condition[n]]);
log_lik[n] = log_sum_exp(lp_parts);
prob_tilde = bernoulli_rng(prob_s[condition[n], ppt[n]]);
if(prob_tilde) {
y_tilde[n] = lognormal_rng(mu + delta[condition[n]], sigmap_e[condition[n]]);
}
else{
y_tilde[n] = lognormal_rng(mu, sigma_e[condition[n]]);
}
}
}
11 Session info
R version 4.3.3 (2024-02-29)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.6 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.2.20.so; LAPACK version 3.7.1
locale:
[1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8 LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8 LC_PAPER=en_GB.UTF-8
[8] LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
time zone: Europe/London
tzcode source: system (glibc)
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] polspline_1.1.25 bayestestR_0.13.2 loo_2.7.0 bayesplot_1.11.1 brms_2.21.0 Rcpp_1.0.12 rstan_2.32.6 StanHeaders_2.32.8 lubridate_1.9.3 forcats_1.0.0
[11] stringr_1.5.1 dplyr_1.1.4 purrr_1.0.2 readr_2.1.5 tidyr_1.3.1 tibble_3.2.1 ggplot2_3.5.1 tidyverse_2.0.0 patchwork_1.2.0 knitr_1.46
[21] rmdformats_1.0.4
loaded via a namespace (and not attached):
[1] tidyselect_1.2.1 viridisLite_0.4.2 farver_2.1.2 fastmap_1.2.0 tensorA_0.36.2.1 digest_0.6.35 timechange_0.3.0 estimability_1.5.1 lifecycle_1.0.4
[10] magrittr_2.0.3 posterior_1.5.0 compiler_4.3.3 rlang_1.1.3 sass_0.4.9 tools_4.3.3 utf8_1.2.4 yaml_2.3.8 labeling_0.4.3
[19] bridgesampling_1.1-2 bit_4.0.5 pkgbuild_1.4.4 curl_5.2.1 RColorBrewer_1.1-3 plyr_1.8.9 abind_1.4-5 withr_3.0.0 grid_4.3.3
[28] stats4_4.3.3 fansi_1.0.6 xtable_1.8-4 colorspace_2.1-0 inline_0.3.19 emmeans_1.10.2 scales_1.3.0 insight_0.19.11 cli_3.6.2
[37] mvtnorm_1.2-5 rmarkdown_2.27 crayon_1.5.2 generics_0.1.3 RcppParallel_5.1.7 rstudioapi_0.16.0 reshape2_1.4.4 tzdb_0.4.0 cachem_1.1.0
[46] ggthemes_5.1.0 parallel_4.3.3 matrixStats_1.3.0 vctrs_0.6.5 Matrix_1.6-5 jsonlite_1.8.8 bookdown_0.39 hms_1.1.3 bit64_4.0.5
[55] jquerylib_0.1.4 glue_1.7.0 codetools_0.2-20 distributional_0.4.0 stringi_1.8.4 gtable_0.3.5 QuickJSR_1.1.3 munsell_0.5.1 pillar_1.9.0
[64] htmltools_0.5.8.1 Brobdingnag_1.2-9 R6_2.5.1 vroom_1.6.5 evaluate_0.23 lattice_0.22-6 highr_0.11 backports_1.5.0 bslib_0.7.0
[73] rstantools_2.4.0 coda_0.19-4.1 gridExtra_2.3 nlme_3.1-164 checkmate_2.3.1 xfun_0.44 pkgconfig_2.0.3
The R package
brms
provides a flexible framework to implement mixture models and many other types of probability models (Bürkner, 2017, 2018).brms
has amixture
function to specify mixtures of various types of distributions. These could be Gaussian, skewed-Normal, shifted-Normal, ex-Gaussian etc. and combinations thereof. There is a large number of probability models for continuous data that are plausible candidates (for reaction time data see e.g. Matzke & Wagenmakers, 2009). In this tutorial we use Stan for mixture modelling to demonstrate the process and because the publication associated with this tutorial is fully using Stan instead ofbrms
.↩︎Run
install.packages("tidyverse")
for installation.↩︎Run
install.packages("brms")
for installation.↩︎If you want to check how many cores are available on your machine, run
parallel::detectCores()
(you might need to install the packageparallel
).↩︎To install
bayestestR
runinstall.packages("bayestestR")
.↩︎For
brms
fits thehypothesis
function calculates the BFs indicated as evidence ratio in the output. For a model such asy ~ 1 + condition
,hypothesis
can calculate the evidence for the hypothesis that the effect forcondition
is, for example, equal to 0.hypothesis
can then be applied to the model with a statement that describes the hypothesis that should be evaluated, for examplehypothesis(brm_model, "conditionb = 0")
for a null hypothesis. To calculate the evidence ratio the argumentsample_prior = TRUE
(orsample_prior = 'yes'
) needs to be specified inbrm
.↩︎