Modelling writing hesitations in text writing as a finite mixture process

Jens Roeser

Compiled Jul 14 2023

1 Introduction

2 Model definitions

The models presented below fall into two general groups. The first three models are akin to models typically used in the literature: they assume a uni-modal process that generates keystroke data, as incorporated in statistical models such as analysis of variance and linear mixed-effects models. The last two models treat keystroke intervals as a combination of two weighted processes. One component represents a smooth flow of information from mind into fingers; the other, more important, component represents moments at which the information flow was interrupted, leading to longer latencies. The latter two models map directly onto the idea of a cascading model of writing.

2.1 Uni-modal Gaussian

We start with a Gaussian mixed-effects model similar to a standard analysis of variance. We describe the process that generates each inter-keystroke interval (iki) \(i\) as a normal (Gaussian) distribution \(\text{N}()\) characterised by a mean \(\mu\) and a variance \(\sigma_\text{e}^2\). The mean can be decomposed into \(\beta\) and \(\text{participant}_i\). \(\beta\) is allowed to take on a different value for each transition location. Participant-specific average ikis are allowed to deviate from the overall average, which is achieved by assuming that participant deviations are normally distributed around 0 with a variance \(\sigma_\text{p}^2\).

\[ \begin{align} \text{iki}_i & \sim \text{N}(\mu_i, \sigma_\text{e}^2)\\ \text{where: } & \mu_i = \beta_\text{location[i]} + \text{participant}_i\\ & \text{participant} \sim \text{N}(0, \sigma_\text{p}^2) \end{align} \]

2.2 Uni-modal log-Gaussian

This model is largely identical to the previous model, except that we use a log-normal distribution instead of a normal distribution. The advantage of using a log-normal distribution is two-fold. (1) The log-normal distribution has a lower bound of zero. For our data we generally consider negative ikis as mistakes, which occasionally occur when the next key was pressed before the current key. Other than that, keystroke intervals are constrained by a person’s ability to move their fingers and by keyboard polling. (2) The log scale is known to be a better match for data from human behaviour, in particular motor responses. A normal distribution assumes that units are linearly scaled. For example, a 50 msecs difference is the same between 100 msecs and 150 msecs as it is between 5 secs and 5 secs 50 msecs (i.e. 5,050 msecs). This is not necessarily plausible for keystroke data. We would assume that differences that are due to motor activity (typing a high-frequency bigram vs a low-frequency bigram) are smaller than differences that are due to high levels of cognitive activity (retrieving a word in your L1 or L2). Log-normal distributions are a natural way of scaling units so that a 50 msecs difference at the lower end of the iki scale (motor activity) is more meaningful than a 50 msecs difference at the upper end of the iki scale (retrieving words, planning sentences).
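The scaling argument can be made concrete with a small sketch. It compares the same 50 msecs difference at the two ends of the iki scale after a natural-log transform; the function name `log_difference` is ours and purely illustrative.

```python
import math

def log_difference(a_ms: float, b_ms: float) -> float:
    """Difference between two intervals on the natural-log scale."""
    return math.log(b_ms) - math.log(a_ms)

# The same 50 msecs difference at the lower and upper end of the iki scale.
fast = log_difference(100, 150)      # motor-level range
slow = log_difference(5_000, 5_050)  # planning-level range

print(f"100 -> 150 msecs:     {fast:.3f} log units")
print(f"5,000 -> 5,050 msecs: {slow:.3f} log units")
```

On the log scale the difference at the lower end is roughly forty times larger than the same raw difference at the upper end, which is the sense in which the log-normal model weighs short intervals more heavily.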

The model can be described like this:

\[ \begin{align} \text{iki}_i & \sim \text{logN}(\mu_i, \sigma_\text{e}^2) \\ \text{where: } & \mu_i = \beta_\text{location[i]} + \text{participant}_i\\ & \text{participant} \sim \text{N}(0, \sigma_\text{p}^2) \end{align} \]

2.3 Uni-modal unequal-variance log-Gaussian

\[ \begin{align} \text{iki}_i & \sim \text{logN}(\mu_i, \sigma_{e_\text{location[i]}}^2) \\ \text{where: } & \mu_i = \beta_\text{location[i]} + \text{participant}_i\\ & \text{participant} \sim \text{N}(0, \sigma_\text{p}^2) \end{align} \]

2.4 Bi-modal log-Gaussian (constrained)

This model extends the intuition from the previous model that higher levels of activation lead to longer pauses. Instead of assuming that one process underlies the generation of ikis, we assume there are two. (1) Activation can flow into keystrokes without interruption. These fluent keystroke transitions are limited only by a person’s ability to move their fingers and are captured by the parameter \(\beta\). In this model, fluent key transitions are assumed not to differ between transition locations; the next model loosens this assumption (hence, “unconstrained model”). (2) Difficulty in the activation flow leads to pauses when fingers have to catch up with cognitive activity, for example when spelling, words, or contents could not be retrieved in time. The size of these pauses depends on the reason for the delay, which is typically associated with transition location (contents are typically planned before sentences, words are retrieved before they are typed, and spelling difficulty typically occurs while typing a word). Pauses are captured by two model parameters: (a) the slowdown for hesitant transitions is captured by \(\delta\), which is the deviation from normal typing intervals (constrained to be positive); (b) the frequency of hesitant transitions is captured by \(\theta\) for each level of a categorical predictor.

\[ \begin{align} \text{iki}_{i} & \sim \theta_\text{location[i]} \cdot \text{LogN}(\beta + \delta_\text{location[i]} + \text{participant}_i, \sigma_{e'_\text{location[i]}}^2) + \\ & (1 - \theta_\text{location[i]}) \cdot \text{LogN}(\beta + \text{participant}_i, \sigma_{e_\text{location[i]}}^2)\\ \text{where: } &\delta \sim \text{N}(0,1)\\ & \text{participant} \sim \text{N}(0, \sigma_\text{p}^2) \\ \text{constraint: } & \delta > 0\\ & \sigma_{e'}^2 > \sigma_{e}^2 \end{align} \]

This model takes into account two sources of participant-specific error: (1) each participant has an individual fluent typing speed, as in the previous models; (2) each participant has an individual hesitation frequency that differs across levels of the categorical predictor.
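To illustrate how the two weighted components combine for a single observation, the following is a minimal sketch of the mixture log-density, not the Stan implementation used for the analysis. All function names are ours. We assume that the dispersion parameters are passed as standard deviations on the log scale and that the participant deviation is a single number.

```python
import math

def lognormal_lpdf(y, mu, sigma):
    """Log density of a log-normal with log-scale mean mu and sd sigma."""
    z = (math.log(y) - mu) / sigma
    return -math.log(y) - math.log(sigma) - 0.5 * math.log(2 * math.pi) - 0.5 * z * z

def log_sum_exp(a, b):
    """Numerically stable log(exp(a) + exp(b))."""
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def mixture_parts(y, theta, beta, delta, sigma_fluent, sigma_hesitant, participant=0.0):
    """Weighted log densities of the hesitant and the fluent component."""
    hesitant = math.log(theta) + lognormal_lpdf(y, beta + delta + participant, sigma_hesitant)
    fluent = math.log1p(-theta) + lognormal_lpdf(y, beta + participant, sigma_fluent)
    return hesitant, fluent

def mixture_lpdf(y, **params):
    """Log density of one iki under the two-component mixture."""
    return log_sum_exp(*mixture_parts(y, **params))

def hesitation_probability(y, **params):
    """Probability that an observed iki came from the hesitant component."""
    hesitant, fluent = mixture_parts(y, **params)
    return math.exp(hesitant - log_sum_exp(hesitant, fluent))

# Illustrative parameter values only (assumptions, not estimates from our data).
params = dict(theta=0.4, beta=5.0, delta=1.0, sigma_fluent=0.25, sigma_hesitant=0.5)
for iki in (150, 400, 1_000):
    print(iki, round(hesitation_probability(iki, **params), 3))
```

Working on the log scale with `log_sum_exp` avoids numerical underflow for very long intervals; the same idea underlies mixture likelihoods in probabilistic programming languages.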

2.5 Bi-modal log-Gaussian (unconstrained)

This model is identical to the previous model with one exception. In the previous model, the distribution of fluent keystroke transitions captured by \(\beta\) was fixed to be the same across transition locations. In other words, the mean \(\beta\) and its variance \(\sigma_{e}^2\) were the same for before-sentence, before-word, and within-word transitions. Because within-word transitions are naturally far more numerous, the posterior is then dominated by within-word transitions.

In this model we will loosen this constraint and allow \(\beta\) and \(\sigma_{e}^2\) to vary by transition-location.

\[ \begin{align} \text{iki}_{i} &\sim \theta_\text{location[i]} \cdot \text{LogN}(\beta_\text{location[i]} + \delta_\text{location[i]} + \text{participant}_i, \sigma_{e'_\text{location[i]}}^2) + \\ & (1 - \theta_\text{location[i]}) \cdot \text{LogN}(\beta_\text{location[i]} + \text{participant}_i, \sigma_{e_\text{location[i]}}^2)\\ \text{where: } & \delta \sim \text{N}(0,1)\\ & \text{participant} \sim \text{N}(0, \sigma_\text{p}^2) \\ \text{constraint: } & \delta > 0\\ & \sigma_{e'}^2 > \sigma_{e}^2 \end{align} \]

3 Analysis

We reanalysed data sets that include process information from participants writing text. For all data sets we fit a series of five models, each with random effects for participants. The probability functions used were normal and log-normal, in line with typical treatments in the literature, a log-normal distribution with unequal variances across transition locations, and two bimodal mixture models (constrained and unconstrained). Stan code for the mixture models was based on Roeser et al. (2021). Transition location (levels: before sentence, before word, within word) was included as a predictor in all models.

Data were analysed in Bayesian mixed effects models (Gelman et al., 2014; McElreath, 2016). The R (R Core Team, 2020) package rstan (Stan Development Team, n.d.) was used to interface with the probabilistic programming language Stan (Carpenter et al., 2016), which was used to implement all models. Models were fitted with weakly informative priors (see McElreath, 2016) and run with 20,000 iterations on 3 chains, with a warm-up of 10,000 iterations and no thinning. Model convergence was confirmed by the Gelman-Rubin statistic (\(\hat{R}\) = 1) (Gelman & Rubin, 1992) and inspection of the Markov chain Monte Carlo chains.
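For readers unfamiliar with the convergence statistic, the following sketch computes the basic (non-split) version of \(\hat{R}\) for one parameter from several chains. It illustrates the idea only; Stan reports refined variants of this statistic, and the function name `r_hat` and the simulated chains are ours.

```python
import math
import random
import statistics

def r_hat(chains):
    """Basic (non-split) Gelman-Rubin statistic for one parameter.

    chains: list of equally long lists of post-warm-up draws.
    """
    n = len(chains[0])
    chain_means = [statistics.fmean(c) for c in chains]
    # W: mean within-chain variance; B / n: variance of the chain means.
    within = statistics.fmean([statistics.variance(c) for c in chains])
    between_over_n = statistics.variance(chain_means)
    var_plus = (n - 1) / n * within + between_over_n
    return math.sqrt(var_plus / within)

rng = random.Random(2023)
# Three chains sampling the same distribution: R-hat should be close to 1.
mixed = [[rng.gauss(0, 1) for _ in range(1_000)] for _ in range(3)]
# Three chains stuck at different locations: R-hat is clearly above 1.
stuck = [[rng.gauss(shift, 1) for _ in range(1_000)] for shift in (0, 3, 6)]

print(round(r_hat(mixed), 3), round(r_hat(stuck), 3))
```

Values close to 1 indicate that between-chain and within-chain variability agree, which is what the convergence check relies on.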

4 Datasets

4.1 General overview

Six datasets with keystroke data from text production were used for analysis. An overview can be found in Table 4.1.

Table 4.1: Datasets in brief.

| Dataset | Source | Keylogger | Writing task | N (ppts) | Conditions | Mean age | Language |
|---|---|---|---|---|---|---|---|
| C2L1 | Rønneberg et al. (2022) | EyeWrite | Argumentative essays | 126 | | 11.80 | Norwegian |
| CATO | Torrance et al. (2016) | EyeWrite | Expository texts | 52 | weak decoders / control; masked / unmasked | 16.90 | Norwegian |
| SPL2 | Torrance et al. (n.d.) | CyWrite | Argumentative essays | 39 | write in L1 / L2 | 20.60 | English (L1) / Spanish (L2) |
| PLanTra | Rossetti & Van Waes (2022) | InputLog | Text simplification | 47 | pre / post test trained in plain language principles and control | 23.00 | English (L2) |
| LIFT | Vandermeulen et al. (2020) | InputLog | Synthesis | 658 | Various topics and genres | 16.95 | Dutch |
| GUNNEXP2 | | EyeWrite | | 45 | masked / unmasked | NA | Norwegian |

4.1.1 GUNNEXP2

4.1.2 C2L1

The C2L1 data set comprises data from Norwegian 6th graders (N = 126, mean age 11 years 10 months) published in Rønneberg et al. (2022). The children composed argumentative essays in Norwegian, a language with a relatively shallow orthography.

TODO: might need to remove kids that don’t speak Norwegian at home (see github issue).

4.1.3 CATO

Data are published in Torrance et al. (2016). Norwegian upper secondary students with weak decoding skills (N = 26, mean age = 16.9 years) and 26 age-matched controls composed expository texts by keyboard under two conditions: normally, and with letters masked to prevent them from reading what they were writing.

4.1.4 PLanTra

The PLanTra (Plain Language for Financial Content: Assessing the Impact of Training on Students’ Revisions and Readers’ Comprehension) data set (Rossetti & Van Waes, 2022) involved the collection of keystroke data from 47 university students, who were randomly divided into an experimental and a control group. In a pre-test session, all students were assigned an extract of a corporate report dealing with sustainability and were instructed to revise it to make it easier to read for a lay audience. Subsequently, the experimental group received training on how to apply plain language principles to sustainability content, while the control group received training exclusively on the topic of sustainability. During a post-test session, both groups were instructed to revise a second extract of a corporate sustainability report with the same goal (i.e. making it easier to read for a lay audience) by applying what they had learned from their respective training. The texts were in English while the participants were native speakers of other languages (mainly Dutch), so writing took place in a second language. It should be pointed out that, while some students decided to revise the assigned texts, the majority of them opted for rewriting the texts from scratch.

4.1.5 LIFT

LIFT (Improving Pre-university Students’ Performance in Academic Synthesis Tasks with Level-up Instructions and Feedback Tool) (Vandermeulen et al., 2020).

4.1.6 SPL2

Data are going to be published in Torrance et al. (n.d.).

Undergraduate university students (N = 39, 28 female, mean age = 20.6 years, SD = 1.51) wrote two short argumentative essays, one in English (the student’s first language in all cases; L1) and one in Spanish (L2) using CyWrite (Chukharev-Hudilainen et al., 2019). CyWrite provides a writing environment with basic word processing functionality (e.g., Microsoft WordPad), including text selection by mouse action, and copy-and-paste. We recorded the time of each keystroke and mouse action, and tracked writers’ eye movements within their emerging text.

Writing tasks: Participants were given a 40 minute time limit. They wrote essays in response to each of two prompts, with order and L1 / L2 counterbalanced across subjects.

4.2 Transition types

The transition types analysed in this study focus on those locations that previous research found to be psycholinguistically meaningful (e.g. Chukharev-Hudilainen et al., 2019; De Smet et al., 2018; Torrance et al., n.d., 2016) and are detailed in Table 4.2. Key transitions that terminated in an editing operation were excluded from the analysis. Transitions that occurred at the beginning of the text or the beginning of a paragraph were not treated as before-sentence transitions.

Table 4.2: Transition location classification.

| Transition type | Description | Example |
|---|---|---|
| Within word | Transitions between any letter | T^h^e c^a^t m^e^o^w^e^d. T^h^a^t[bsp][bsp]e^n i^t s^l^e^p^t. |
| Before word | Keypress after space followed by any letter | The ^cat ^meowed. That[bsp][bsp]en ^it ^slept. |
| Before sentence | Keypress following a space preceding any letter | The cat meowed. ^That[bsp][bsp]en it slept. |

Note: ‘^’ marks transition location, [bsp] represents backspace. IKIs were timed to the shift keypress.

4.3 Data reduction

For all datasets we only used transitions that were not followed by an editing operation.

We removed participants who did not complete all conditions in studies with within-participant factors (reducing the number of participants to 343 in the LIFT data set and to 41 in the PLanTra data set). We removed participants who produced fewer than 10 sentences (LIFT: 109 participants; PLanTra: 3 participants; SPL2: 1 participant).

We removed keystroke intervals that were extremely short (\(\le\) 50 msecs) or extremely long (\(\ge\) 30 secs). The percentage of removed keystroke data can be found in Table 4.3.

From the remaining data we randomly sampled 100 observations per participant, per condition, and per transition location, with the exception of the LIFT data set. This was done for computational reasons, to reduce the time the Bayesian models need to complete. For the LIFT data set we reduced the number of participants to 100, which is still substantially more than in most of the other data sets in our analysis. Because we included the large number of writing tasks in the LIFT data set as a fixed effect, we sampled 50 observations per condition, location and participant. The percentage of keystroke data that went into the final analysis can be found, by transition location, in Table 4.3.
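The trimming and sampling steps can be sketched as follows. This is an illustration under assumptions, not the code used for the analysis: the record layout (dictionaries with the keys `participant`, `condition`, `location` and `iki`) and the function name `reduce_data` are hypothetical.

```python
import random
from collections import defaultdict

def reduce_data(rows, n_per_cell=100, min_ms=50, max_ms=30_000, seed=1):
    """Drop extreme ikis, then sample up to n_per_cell rows per design cell.

    A design cell is one participant x condition x transition location.
    """
    rng = random.Random(seed)
    cells = defaultdict(list)
    for row in rows:
        # Keep only intervals strictly between the two cut-offs.
        if min_ms < row["iki"] < max_ms:
            key = (row["participant"], row["condition"], row["location"])
            cells[key].append(row)
    reduced = []
    for key in sorted(cells):
        cell = cells[key]
        # Cells with fewer than n_per_cell observations are kept whole.
        reduced.extend(rng.sample(cell, min(n_per_cell, len(cell))))
    return reduced

demo = [
    {"participant": "p1", "condition": "unmasked", "location": "within word", "iki": iki}
    for iki in (30, 50, 120, 140, 180, 30_000)
]
print(len(reduce_data(demo, n_per_cell=2)))
```

Keeping small cells whole would be consistent with sparse locations, such as before-sentence transitions, contributing almost all of their observations to the final analysis.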

Table 4.3: Data reduction. Mean percentage of extreme data removed and the mean percentage of randomly sampled data by transition location. Standard error is shown in parentheses.

| Dataset | Extreme values: \(\le\) 50 msecs | Extreme values: \(\ge\) 30 secs | Randomly sampled data: before word | Randomly sampled data: within word | Randomly sampled data: before sentence |
|---|---|---|---|---|---|
| C2L1 | 0.19% (0.1%) | 0.07% (0.06%) | 84.5% (1.8%) | 35.1% (2.6%) | 100% (0%) |
| CATO | 0.65% (0.15%) | 0.02% (0.02%) | 48.6% (2.2%) | 14.9% (0.9%) | 100% (0%) |
| LIFT | 2.65% (0.16%) | 0% (0%) | 13.1% (0.9%) | 3.2% (0.2%) | 99.4% (0.1%) |
| PLanTra | 2.49% (0.41%) | 0.04% (0.03%) | 36.6% (1.9%) | 9.7% (0.6%) | 100% (0%) |
| SPL2 | 2.29% (0.2%) | 0.03% (0.02%) | 22.6% (1.4%) | 5.7% (0.4%) | 100% (0%) |
| GUNNEXP2 | 2.16% (0.17%) | 0.01% (0.01%) | 22.5% (1.4%) | 6.2% (0.4%) | 100% (0%) |

5 Out-of-sample cross-validation

For model comparisons we used out-of-sample predictions estimated using Pareto smoothed importance-sampling leave-one-out cross-validation (Vehtari et al., 2015, 2017). Predictive performance was estimated as the sum of the expected log predictive density (\(\widehat{elpd}\)) and the difference \(\Delta\widehat{elpd}\) between models. The advantage of using leave-one-out cross-validation is that models with more parameters are penalised, which guards against overfitting.
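The comparison step can be sketched as follows, assuming that pointwise \(\widehat{elpd}\) contributions (one value per observation) are already available for two models. The sketch does not perform the Pareto smoothed importance sampling itself. The function name and the four toy values are ours, and the standard error follows the common convention of scaling the standard deviation of the pointwise differences by the square root of the number of observations.

```python
import math
import statistics

def elpd_difference(pointwise_a, pointwise_b):
    """Difference in elpd (model b minus model a) and its standard error.

    Both inputs hold one expected log predictive density per observation.
    """
    diffs = [b - a for a, b in zip(pointwise_a, pointwise_b)]
    n = len(diffs)
    delta = sum(diffs)
    se = math.sqrt(n * statistics.variance(diffs))
    return delta, se

# Toy pointwise values for four observations (illustrative only).
best_model = [-1.0, -2.0, -1.5, -1.2]
other_model = [-1.5, -2.1, -2.5, -1.4]

delta, se = elpd_difference(best_model, other_model)
print(f"delta elpd = {delta:.2f} (SE = {se:.2f})")
```

A negative difference means that the second model predicts held-out observations less well than the reference model, which is how the \(\Delta\widehat{elpd}\) columns in our tables should be read.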

Results for all data sets are shown in Table 5.1. For all data sets we found the same pattern. The mixture of log-normal distributions provided a substantially better fit than uni-modal distribution models. The unconstrained version of the mixture of log-normal distributions rendered a higher predictive performance than the constrained version that does not allow the distribution of short keystroke-intervals to vary across conditions.

Table 5.1: Model comparisons. The top row shows the models with the highest predictive performance. Standard error is shown in parentheses.

| Model | GUNNEXP2 \(\Delta\widehat{elpd}\) | GUNNEXP2 \(\widehat{elpd}\) | CATO \(\Delta\widehat{elpd}\) | CATO \(\widehat{elpd}\) | C2L1 \(\Delta\widehat{elpd}\) | C2L1 \(\widehat{elpd}\) | LIFT \(\Delta\widehat{elpd}\) | LIFT \(\widehat{elpd}\) | PLanTra \(\Delta\widehat{elpd}\) | PLanTra \(\widehat{elpd}\) | SPL2 \(\Delta\widehat{elpd}\) | SPL2 \(\widehat{elpd}\) | SPL2 (shift + C) \(\Delta\widehat{elpd}\) | SPL2 (shift + C) \(\widehat{elpd}\) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Bimodal log-normal (unconstrained) | | -120,675 (237) | | -139,691 (230) | | -165,434 (231) | | -272,696 (327) | | -107,324 (228) | | -96,671 (217) | | -97,266 (219) |
| Bimodal log-normal (constrained) | -554 (38) | -121,228 (235) | -607 (43) | -140,298 (230) | -546 (51) | -165,980 (237) | -500 (37) | -273,197 (327) | -130 (18) | -107,454 (228) | -564 (41) | -97,234 (216) | -593 (40) | -97,859 (219) |
| Unimodal log-normal (unequal variance) | -2,981 (94) | -123,656 (258) | -2,389 (79) | -142,080 (246) | -2,184 (83) | -167,617 (256) | -5,605 (131) | -278,301 (365) | -2,514 (92) | -109,837 (248) | -1,744 (68) | -98,415 (224) | -1,719 (66) | -98,984 (226) |
| Unimodal log-normal | -5,293 (111) | -125,968 (263) | -3,689 (100) | -143,379 (258) | -2,968 (98) | -168,402 (267) | -7,716 (147) | -280,412 (375) | -3,895 (94) | -111,219 (246) | -3,457 (81) | -100,128 (221) | -3,086 (76) | -100,351 (223) |
| Unimodal normal | -44,136 (660) | -164,811 (713) | -41,362 (927) | -181,052 (992) | -40,485 (856) | -205,919 (935) | -82,567 (1,691) | -355,263 (1,779) | -37,718 (572) | -145,042 (637) | -32,174 (450) | -128,845 (493) | -32,019 (437) | -129,284 (477) |

Note: \(\widehat{elpd}\) = predictive performance indicated as expected log pointwise predictive density; \(\Delta\widehat{elpd}\) = difference in predictive performance relative to the model with the highest predictive performance in the top row.

6 Cross-data set comparisons

6.1 Cross-data set visualisation

The model estimates for the mixture-model with the highest predictive performance are shown in Figure 6.1. In this visualisation we ignore dataset specific conditions that are presented in detail below.

Figure 6.1: Across studies. Posterior parameter distribution

6.2 Effect of transition location

It’s generally believed that pausing is associated with syntactic edges such that more and longer pauses are predicted for key transitions at larger syntactic edges, i.e. before sentence > before word > within word. We have evaluated the differences between transition locations for all data sets. The results are shown in Table 6.1.

Results are largely consistent across data sets (with caveats) but differ, to some extent, from what the literature would predict. In line with the literature, hesitations are more frequent before words than within words. Hesitations are also longer at before-sentence transitions than at before-word transitions (except in dataset C2L1), and longer at before-word transitions than at within-word transitions (except in dataset LIFT). However, our results do not support the claim that writers pause more frequently at before-sentence locations than at before-word locations (except for datasets SPL2 and GUNNEXP2; the SPL2 result also shows that more pauses at before-sentence locations cannot be explained on the basis of multi-key combinations for sentence-initial capitalisation). We also observe that even fluent key transitions are slower at before-word locations than at within-word locations, but there is generally no difference in fluent transitions between before-sentence and before-word transitions (except for datasets SPL2 and GUNNEXP2).

The datasets differ in whether sentence-initial key transitions do (PLanTra, LIFT) or do not (CATO, C2L1, SPL2) include the character following the shift key for capitalisation. In other words, the pause before sentences may sum across two key intervals, namely _^[shift]^C, or involve only one key interval, namely _^[shift]. For the SPL2 dataset, we calculated location effects for sentence-initial transitions that do and do not involve the shift-to-key transition. The results for the transition location effects were the same. A comparison that untangles the effects of the multi-key combination on the mixture-model estimates can be found in Table 8.3. In short, the duration of fluent transitions and the hesitation slowdown are affected but not the hesitation probability.

In conclusion, while pauses tend to be longer before sentences they are not more frequent than before words.

Table 6.1: Effect of transition location on keystroke intervals. Differences between transition locations are shown on log scale (for transition durations) and logit scale for probability of hesitant transitions. 95% PIs in brackets.

| Data set | Difference | Fluent transitions: Est. with 95% PIs | BF | Slowdown for hesitations: Est. with 95% PIs | BF | Probability of hesitations: Est. with 95% PIs | BF |
|---|---|---|---|---|---|---|---|
| C2L1 | before sentence vs word | 0.01 [-0.13, 0.15] | 0.07 | 0.21 [-0.13, 0.54] | 0.35 | 0.54 [-0.31, 1.43] | 0.89 |
| C2L1 | before vs within word | 0.4 [0.38, 0.42] | > 100 | 0.49 [0.39, 0.58] | > 100 | 1.1 [0.76, 1.44] | > 100 |
| CATO (non-dyslexic unmasked) | before sentence vs word | -0.03 [-0.18, 0.11] | 0.08 | 1.19 [0.88, 1.49] | > 100 | 0.41 [-0.44, 1.27] | 0.67 |
| CATO (non-dyslexic unmasked) | before vs within word | 0.35 [0.31, 0.38] | > 100 | 0.35 [0.14, 0.54] | 17.48 | 1.83 [1.19, 2.49] | > 100 |
| GUNNEXP2 (unmasked) | before sentence vs word | 0.22 [0.12, 0.32] | > 100 | 0.93 [0.78, 1.07] | > 100 | 1.32 [0.86, 1.78] | > 100 |
| GUNNEXP2 (unmasked) | before vs within word | 0.23 [0.21, 0.25] | > 100 | 0.38 [0.21, 0.54] | > 100 | 1.97 [1.59, 2.37] | > 100 |
| LIFT | before sentence vs word | -0.04 [-0.13, 0.05] | 0.07 | 0.35 [0.05, 0.72] | 2.14 | -0.49 [-1.03, 0] | 1.63 |
| LIFT | before vs within word | 0.21 [0.13, 0.27] | > 100 | 0.26 [-0.16, 0.55] | 0.71 | 1.56 [1.02, 2.16] | > 100 |
| PLanTra | before sentence vs word | 0.01 [-0.09, 0.11] | 0.07 | 0.65 [0.46, 0.83] | > 100 | 0.13 [-0.25, 0.53] | 0.25 |
| PLanTra | before vs within word | 0.16 [0.1, 0.21] | > 100 | 0.37 [0.2, 0.54] | > 100 | 1.83 [1.48, 2.17] | > 100 |
| SPL2 (L1; shift + C) | before sentence vs word | 0.73 [0.66, 0.81] | > 100 | 0.92 [0.78, 1.07] | > 100 | 1.16 [0.64, 1.69] | > 100 |
| SPL2 (L1; shift + C) | before vs within word | 0.3 [0.27, 0.34] | > 100 | 0.35 [0.15, 0.54] | 22.82 | 1.84 [1.33, 2.35] | > 100 |
| SPL2 (L1) | before sentence vs word | 0.24 [0.17, 0.31] | > 100 | 1.39 [1.25, 1.52] | > 100 | 0.69 [0.17, 1.19] | 8.01 |
| SPL2 (L1) | before vs within word | 0.31 [0.28, 0.34] | > 100 | 0.35 [0.14, 0.54] | 16.25 | 1.94 [1.4, 2.5] | > 100 |

Note: PIs are probability intervals. BF is the evidence in favour of the alternative hypothesis over the null hypothesis.
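Because the estimates in Table 6.1 are reported on the log and logit scales, it can help to translate them into ratios. The sketch below does this for the C2L1 before-word vs within-word estimates (0.4 for fluent transitions, 1.1 for the probability of hesitations). The helper names and the baseline probability of .10 are hypothetical and used for illustration only.

```python
import math

def logit(p):
    return math.log(p / (1 - p))

def inv_logit(x):
    return 1 / (1 + math.exp(-x))

def shift_probability(p, logit_diff):
    """Probability obtained by adding logit_diff to logit(p)."""
    return inv_logit(logit(p) + logit_diff)

# A log-scale difference of 0.4 is a multiplicative effect on duration.
duration_ratio = math.exp(0.40)
# A logit-scale difference of 1.1 is a multiplicative effect on the odds.
odds_ratio = math.exp(1.10)

print(f"duration ratio: {duration_ratio:.2f}")
print(f"odds ratio:     {odds_ratio:.2f}")
# Hypothetical baseline: a hesitation probability of .10 within words.
print(f"shifted probability: {shift_probability(0.10, 1.10):.2f}")
```

Read this way, fluent before-word transitions in C2L1 take roughly one and a half times as long as fluent within-word transitions, and the odds of a hesitation are about three times higher before words.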

7 Model comparison with simulated data

A general concern with mixture models is that, because they have more parameters, they might in principle always provide a better fit, even though cross-validation addresses potential problems with overfitting.

To address this concern we simulated two sets of data. Both data sets have two conditions and 1,000 observations each. The data-generating process underlying the first data set is a mixture model with two mixture components, similar to the process described above. The two conditions differ in that the mixing proportion is larger for condition 2 than for condition 1; hence long observations are more likely in condition 2. This model can be summarised as follows:

\[ \text{y}_i \sim \theta_\text{condition[i]} \cdot \text{logN}(\mu_1, \sigma^2_1) +\\ (1 - \theta_\text{condition[i]}) \cdot \text{logN}(\mu_2, \sigma^2_2) \]

The second data set was generated with an unequal-variance unimodal model as the data-generating process. Condition 2 has a larger mean and standard deviation than condition 1. The model can be summarised as follows:

\[ \text{y}_i \sim \text{logN}(\mu_\text{condition[i]}, \sigma^2_\text{condition[i]}) \]

The true parameter values used for each of the two data simulations can be found in Table 7.1. The simulated data are visualised in Figure 7.1. The data are simulated to be similarly distributed to keystroke transitions.
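The two data-generating processes can be sketched as follows. This is an illustration rather than the simulation code used here. We assume that the mixing proportion weights the longer component, that the dispersion values from Table 7.1 are standard deviations on the log scale, and that 1,000 observations are drawn per condition; function names are ours.

```python
import math
import random
import statistics

def simulate_bimodal(n, theta, beta=5.0, delta=1.0,
                     sigma_short=0.25, sigma_long=0.5, rng=None):
    """Draw n values from a two-component log-normal mixture."""
    rng = rng or random.Random()
    values = []
    for _ in range(n):
        if rng.random() < theta:
            # Long component, assumed to be weighted by theta.
            values.append(rng.lognormvariate(beta + delta, sigma_long))
        else:
            values.append(rng.lognormvariate(beta, sigma_short))
    return values

def simulate_unimodal(n, mu, sigma, rng=None):
    """Draw n values from a single log-normal distribution."""
    rng = rng or random.Random()
    return [rng.lognormvariate(mu, sigma) for _ in range(n)]

rng = random.Random(7)
bimodal = {1: simulate_bimodal(1_000, theta=0.10, rng=rng),
           2: simulate_bimodal(1_000, theta=0.40, rng=rng)}
unimodal = {1: simulate_unimodal(1_000, mu=5.0, sigma=0.25, rng=rng),
            2: simulate_unimodal(1_000, mu=6.0, sigma=0.5, rng=rng)}

for label, data in (("bimodal", bimodal), ("unimodal", unimodal)):
    for condition, values in data.items():
        print(label, condition, round(statistics.median(values)))
```

In the bimodal data the two conditions share the same short and long components and differ only in how often the long component occurs, whereas in the unimodal data the whole distribution shifts and widens.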

Figure 7.1: Data simulated with a bimodal process (left) and a unimodal process (right).

We ran four models: two mixture models, one on the data generated with a mixture process and one on the data generated with the unimodal unequal-variance process, and the same two fits using a unimodal unequal-variance model. Models were run with 3 chains of 6,000 iterations each, of which 3,000 were warm-up. The Stan models uncovered the model parameters of their respective data sets successfully, as shown in Table 7.1, not less so when the model was applied to data generated with the other underlying process.

Table 7.1: Uncovered parameter estimates with 95% probability interval (PI) and true parameter values for each simulated data set and by model and their respective parameters.

| Model | Parameter | True value | Estimate with 95% PI: Bimodal data | Estimate with 95% PI: Unimodal data |
|---|---|---|---|---|
| Bimodal mixture model | \(\beta\) | 5 | 5 [4.98, 5.01] | 5 [4.99, 5.02] |
| Bimodal mixture model | \(\delta\) | 1 | 0.99 [0.9, 1.06] | 1.04 [1, 1.08] |
| Bimodal mixture model | \(\theta_\text{condition=1}\) | .10 | .09 [.07, .12] | .01 [.00, .01] |
| Bimodal mixture model | \(\theta_\text{condition=2}\) | .40 | .42 [.38, .47] | .97 [.95, .99] |
| Bimodal mixture model | \(\sigma^2_1\) | 0.25 | 0.25 [0.24, 0.26] | 0.25 [0.24, 0.26] |
| Bimodal mixture model | \(\sigma^2_2\) | 0.5 | 0.48 [0.43, 0.54] | 0.48 [0.46, 0.51] |
| Unimodal process | \(\beta_\text{condition=1}\) | 5 | 5.1 [5.07, 5.12] | 5 [4.99, 5.02] |
| Unimodal process | \(\beta_\text{condition=2}\) | 6 | 5.41 [5.37, 5.44] | 6.02 [5.99, 6.05] |
| Unimodal process | \(\sigma_\text{condition=1}\) | 0.25 | 0.4 [0.38, 0.42] | 0.25 [0.24, 0.26] |
| Unimodal process | \(\sigma_\text{condition=2}\) | 0.5 | 0.61 [0.59, 0.64] | 0.51 [0.48, 0.53] |

We used LOO-CV to compare the fit of the two models for each data set. The model comparisons can be found for each data-generating process in Table 7.2. The results show that the mixture model does not always lead to higher predictive performance. Indeed, the mixture model showed a lower predictive performance for the data that were generated with a unimodal process. However, for the data generated with a bimodal process, the mixture model shows a higher predictive performance. In fact, the ratio of \(\Delta\widehat{elpd}\) to its SE, as a metric for the strength of evidence, shows that the advantage of the mixture model over the unimodal model is about 11 times its SE for the data generated with a bimodal process. In comparison, for the unimodal data, the advantage of the unimodal model over the bimodal mixture model is only 3 times its SE. Thus, although the mixture model does not necessarily perform better for non-bimodal data, it also does not necessarily perform much worse. This contrast is likely a reflection of the increased number of parameters in the mixture model.

Table 7.2: Model comparisons by data set. The top row shows the models with the highest predictive performance. Standard error is shown in parentheses.

| Data | Model | \(\Delta\widehat{elpd}\) | \(\widehat{elpd}\) |
|---|---|---|---|
| Bimodal mixture process | Bimodal mixture model | 0 (0) | -11,614 (58) |
| Bimodal mixture process | Unimodal unequal-variance model | -325 (30) | -11,939 (66) |
| Unimodal process | Unimodal unequal-variance model | 0 (0) | -11,788 (53) |
| Unimodal process | Bimodal mixture model | -6 (2) | -11,794 (53) |

Note: \(\widehat{elpd}\) = predictive performance indicated as expected log pointwise predictive density; \(\Delta\widehat{elpd}\) = difference in predictive performance relative to the model with the highest predictive performance in the top row.
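The strength-of-evidence ratios discussed for the simulated data follow directly from the \(\Delta\widehat{elpd}\) values and standard errors in Table 7.2. The short sketch below reproduces the arithmetic; the function name `evidence_ratio` is ours.

```python
def evidence_ratio(delta_elpd, se):
    """Absolute difference in elpd expressed in units of its standard error."""
    return abs(delta_elpd) / se

# Delta elpd and SE of the losing model, taken from Table 7.2.
comparisons = {
    "bimodal data (unimodal model loses)": (-325, 30),
    "unimodal data (mixture model loses)": (-6, 2),
}

for label, (delta_elpd, se) in comparisons.items():
    print(f"{label}: {evidence_ratio(delta_elpd, se):.1f}")
```

The ratios of roughly 11 and 3 show the asymmetry: the mixture model gains much more on bimodal data than it loses on unimodal data.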

8 Posterior by data set

8.1 GUNNEXP2

8.1.1 Fit to data

8.1.2 Unconstrained mixture model

8.1.2.1 Posterior parameter estimates

Figure 8.1: GUNNEXP2 (unconstrained model). Posterior parameter distribution

8.1.2.2 Masking effect

Table 8.1: Mixture model estimates for key transitions. Cell means are shown for the masked and unmasked writing task in msecs for fluent key-transitions, the slowdown for long transitions and the probability of disfluent transitions. The effect for masking is shown on log scale (for transition durations) and logit scale for probability of disfluent transitions. 95% PIs in brackets.

| Parameter | Transition location | Unmasked | Masked | Difference | BF |
|---|---|---|---|---|---|
| Fluent transitions | before sentence | 220 [195, 248] | 250 [224, 279] | 0.13 [0.02, 0.23] | 1.01 |
| Fluent transitions | before word | 177 [165, 189] | 185 [173, 198] | 0.05 [0.03, 0.07] | 70.05 |
| Fluent transitions | within word | 140 [131, 150] | 146 [137, 156] | 0.04 [0.03, 0.06] | > 100 |
| Disfluencies | before sentence | 1,602 [1,336, 1,908] | 2,310 [1,968, 2,703] | 0.21 [0.05, 0.38] | 1.97 |
| Disfluencies | before word | 401 [355, 451] | 472 [415, 535] | 0.08 [-0.02, 0.18] | 0.18 |
| Disfluencies | within word | 173 [129, 227] | 172 [124, 230] | -0.03 [-0.24, 0.19] | 0.11 |
| Probability of disfluencies | before sentence | .69 [.61, .77] | .69 [.62, .76] | 0 [-0.45, 0.45] | 0.23 |
| Probability of disfluencies | before word | .38 [.32, .44] | .32 [.27, .38] | -0.24 [-0.57, 0.09] | 0.46 |
| Probability of disfluencies | within word | .08 [.06, .10] | .06 [.05, .09] | -0.22 [-0.65, 0.21] | 0.36 |

Note: PIs are probability intervals. BF is the evidence in favour of the alternative hypothesis over the null hypothesis.

8.1.3 Constrained mixture model

8.1.3.1 Posterior parameter estimates

The posterior of the constrained model is shown in Figure 8.2 showing the posterior slowdown for disfluent keystrokes (left panel) and the probability of disfluent keystrokes (right panel). Fluent keystroke transitions are distributed around a posterior mean of 157 msecs, PI: (147, 168).

Figure 8.2: GUNNEXP2 (constrained model). Posterior parameter distribution.

8.1.3.2 Masking effect

Table 8.2: Mixture model estimates for key transitions. Cell means are shown for the masked and unmasked writing task in msecs for the slowdown for long transitions and the probability of disfluent transitions. The effect for masking is shown on log scale (for transition durations) and logit scale for probability of disfluent transitions. 95% PIs in brackets.

| Parameter | Transition location | Unmasked | Masked | Difference | BF |
|---|---|---|---|---|---|
| Disfluencies | before sentence | 1,124 [954, 1,307] | 1,418 [1,218, 1,634] | 0.21 [0.05, 0.36] | 2.59 |
| Disfluencies | before word | 363 [325, 403] | 379 [341, 420] | 0.03 [-0.05, 0.11] | 0.05 |
| Disfluencies | within word | 182 [105, 279] | 209 [135, 306] | 0.08 [-0.24, 0.4] | 0.19 |
| Probability of disfluencies | before sentence | .88 [.82, .93] | .90 [.85, .94] | 0.16 [-0.46, 0.79] | 0.36 |
| Probability of disfluencies | before word | .47 [.40, .54] | .45 [.38, .52] | -0.06 [-0.46, 0.35] | 0.21 |
| Probability of disfluencies | within word | .05 [.03, .08] | .04 [.03, .06] | -0.29 [-0.88, 0.28] | 0.48 |

Note: PIs are probability intervals. BF is the evidence in favour of the alternative hypothesis over the null hypothesis.

8.2 Fit to data

8.2.1 C2L1

Figure 8.3: C2L1 data. Comparison of 100 simulated (predicted) sets of data to observed data, illustrated by model. For illustration the x-axis was truncated at 2,000 msecs.

8.2.2 CATO

Figure 8.4: CATO data. Comparison of 100 simulated (predicted) sets of data to observed data, illustrated by model. For illustration the x-axis was truncated at 2,000 msecs.

8.2.3 PLanTra

Figure 8.5: PLanTra data. Comparison of 100 simulated (predicted) sets of data to observed data, illustrated by model. For illustration the x-axis was truncated at 2,000 msecs.

8.2.4 LIFT

Figure 8.6: LIFT data. Comparison of 100 simulated (predicted) sets of data to observed data, illustrated by model. For illustration the x-axis was truncated at 2,000 msecs.

8.2.5 SPL2

Figure 8.7: SPL2 data. Comparison of 100 simulated (predicted) sets of data to observed data, illustrated by model. For illustration the x-axis was truncated at 2,000 msecs.

8.3 Posterior parameter estimates of mixture model

8.3.1 C2L1

Figure 8.8: C2L1. Posterior parameter distribution

8.3.2 CATO

Figure 8.9: CATO. Posterior parameter distribution

8.3.3 PLanTra

Figure 8.10: PLanTra. Posterior parameter distribution

8.3.4 LIFT

Figure 8.11: LIFT. Posterior parameter distribution

8.3.5 SPL2

Can slowdowns for sentence pauses be explained by complex keystrokes that were summed across? No

The data sets used differ in whether the keystroke interval before sentences does (PLanTra, LIFT) or does not (CATO, C2L1, SPL2) scope over the character following shift. In other words, the pause before sentences sums across two key intervals in the PLanTra and LIFT data, namely _^[shift]^C, but involves only one key interval, namely _^[shift], in the remaining data sets.

For the SPL2 data set we tested whether the different patterns for sentence pauses can be explained by the key combination. We analysed the SPL2 data including and excluding the keystroke after shift. Model estimates are presented in Figure 8.12, and the results of this comparison can be found in Table 8.3. Overall, whether or not the sentence-initial transition included the interval between shift and the first character affected the fluent transition duration and the hesitation duration, but not the hesitation probability. Fluent key transitions were substantially longer when including the interval following the shift key. The slowdown for hesitations was affected too, but the difference was numerically small. There was no conclusive evidence for an increased hesitation probability. Taken together, including the character following shift affects the duration of fluent transitions more than it affects pause duration and frequency.

Figure 8.12: SPL2. Posterior parameter distribution

Table 8.3: Mixture model estimates for key transitions. Cell means are shown for transitions that do (_^[shift] + C) and do not (_^[shift]) include the transition to the character following shift: in msecs for fluent key transitions and for the slowdown for long transitions, and as a probability for disfluent transitions. The difference for including the transition to the character after shift is shown on the log scale (for transition durations) and on the logit scale (for the probability of disfluent transitions). 95% PIs in brackets.
| Language | Transition location | _^[shift] + C | _^[shift] | Difference | BF |
|---|---|---|---|---|---|
| Fluent transitions | | | | | |
| L1 | before sentence | 390 [350, 434] | 240 [216, 266] | 0.48 [0.34, 0.63] | > 100 |
| | within word | 138 [127, 150] | 138 [127, 150] | 0 [-0.11, 0.11] | 0.06 |
| | before word | 187 [172, 204] | 188 [173, 205] | -0.01 [-0.12, 0.11] | 0.06 |
| L2 | before sentence | 448 [379, 521] | 296 [253, 343] | 0.41 [0.19, 0.63] | 46.08 |
| | within word | 155 [143, 168] | 156 [143, 169] | 0 [-0.12, 0.11] | 0.06 |
| | before word | 259 [236, 282] | 259 [236, 284] | 0 [-0.13, 0.12] | 0.06 |
| Disfluencies | | | | | |
| L1 | before sentence | 2,398 [2,001, 2,836] | 2,469 [2,119, 2,855] | -0.46 [-0.61, -0.3] | > 100 |
| | within word | 140 [96, 195] | 138 [93, 196] | 0.01 [-0.24, 0.25] | 0.13 |
| | before word | 345 [292, 406] | 343 [289, 404] | 0.01 [-0.12, 0.14] | 0.07 |
| L2 | before sentence | 2,859 [2,407, 3,368] | 2,769 [2,348, 3,236] | -0.34 [-0.51, -0.17] | > 100 |
| | within word | 170 [132, 215] | 171 [132, 217] | 0 [-0.17, 0.17] | 0.09 |
| | before word | 764 [673, 867] | 759 [667, 862] | 0.01 [-0.09, 0.11] | 0.05 |
| Probability of disfluencies | | | | | |
| L1 | before sentence | .62 [.53, .71] | .50 [.41, .59] | 0.49 [-0.04, 1.03] | 1.43 |
| | within word | .08 [.05, .11] | .07 [.05, .10] | 0.12 [-0.46, 0.7] | 0.32 |
| | before word | .34 [.27, .42] | .34 [.26, .42] | 0.02 [-0.48, 0.51] | 0.25 |
| L2 | before sentence | .81 [.73, .88] | .72 [.63, .80] | 0.48 [-0.16, 1.13] | 0.94 |
| | within word | .18 [.14, .24] | .18 [.13, .24] | 0.03 [-0.47, 0.52] | 0.25 |
| | before word | .59 [.51, .67] | .59 [.51, .68] | 0.01 [-0.48, 0.5] | 0.25 |
Note:
PIs are probability intervals. BF is the evidence in favour of the alternative hypothesis over the null hypothesis.
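
To make the scales of the Difference column in Table 8.3 concrete, the following Python sketch recomputes two of the differences from the reported cell means: a log-scale difference for the fluent L1 before-sentence transitions and a logit-scale difference for the corresponding probability of disfluencies. The function names are hypothetical and this is not the analysis code. The results land close to, but need not equal, the tabulated values (0.48 and 0.49), presumably because the table summarises the posterior of the difference rather than a difference of rounded cell means.

```python
import math


def log_difference(with_c_ms, without_c_ms):
    """Difference between two durations (msecs) on the log scale."""
    return math.log(with_c_ms) - math.log(without_c_ms)


def logit(p):
    """Log-odds of a probability p with 0 < p < 1."""
    return math.log(p / (1 - p))


def logit_difference(p_with_c, p_without_c):
    """Difference between two probabilities on the logit scale."""
    return logit(p_with_c) - logit(p_without_c)


# Cell means for L1, before sentence, copied from Table 8.3.
fluent_l1 = log_difference(390, 240)    # fluent transitions, msecs
prob_l1 = logit_difference(0.62, 0.50)  # probability of disfluencies

print(f"log-scale difference (fluent, L1):  {fluent_l1:.3f}")
print(f"logit-scale difference (prob., L1): {prob_l1:.3f}")
```

On the log scale a difference corresponds to a ratio: exponentiating the first value gives the factor by which fluent before-sentence transitions are longer when the interval following shift is included.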

References

Carpenter, B., Gelman, A., Hoffman, M., Lee, D., Goodrich, B., Betancourt, M., Brubaker, M. A., Guo, J., Li, P., & Riddell, A. (2016). Stan: A probabilistic programming language. Journal of Statistical Software, 20.
Chukharev-Hudilainen, E., Saricaoglu, A., Torrance, M., & Feng, H.-H. (2019). Combined deployable keystroke logging and eyetracking for investigating L2 writing fluency. Studies in Second Language Acquisition, 41(3), 583–604.
De Smet, M. J., Leijten, M., & Van Waes, L. (2018). Exploring the process of reading during writing using eye tracking and keystroke logging. Written Communication, 35(4), 411–447.
Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., & Rubin, D. B. (2014). Bayesian data analysis (3rd ed.). Chapman and Hall/CRC.
Gelman, A., & Rubin, D. B. (1992). Inference from iterative simulation using multiple sequences. Statistical Science, 7(4), 457–472.
McElreath, R. (2016). Statistical rethinking: A Bayesian course with examples in R and Stan. CRC Press.
R Core Team. (2020). R: A language and environment for statistical computing. R Foundation for Statistical Computing. https://www.R-project.org/
Roeser, J., De Maeyer, S., Leijten, M., & Van Waes, L. (2021). Modelling typing disfluencies as finite mixture process. Reading and Writing, 1–26. https://osf.io/y3p4d/
Rossetti, A., & Van Waes, L. (2022). It’s not just a phase: Investigating text simplification in a second language from a process and product perspective. Frontiers in Artificial Intelligence, 5.
Rønneberg, V., Torrance, M., Uppstad, P. H., & Johansson, C. (2022). The process-disruption hypothesis: How spelling and typing skill affects written composition process and product. Psychological Research, 1–17.
Stan Development Team. (n.d.). RStan: The R interface to Stan. https://mc-stan.org/
Torrance, M., Roeser, J., & Chukharev-Hudilainen, E. (n.d.). Lookback in L1 and L2 writing: An eye movement study.
Torrance, M., Rønneberg, V., Johansson, C., & Uppstad, P. H. (2016). Adolescent weak decoders writing in a shallow orthography: Process and product. Scientific Studies of Reading, 20(5), 375–388.
Vandermeulen, N., Steendam, E. V., & Rijlaarsdam, G. (2020). DATASET - Baseline data LIFT Synthesis Writing project [Data set]. Zenodo. https://doi.org/10.5281/zenodo.3893538
Vehtari, A., Gelman, A., & Gabry, J. (2015). Pareto smoothed importance sampling. arXiv Preprint arXiv:1507.02646.
Vehtari, A., Gelman, A., & Gabry, J. (2017). Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC. Statistics and Computing, 27(5), 1413–1432.