Modelling hesitations in text writing as a finite mixture process

Jens Roeser

Compiled Dec 15 2022

1 Analysis

We reanalysed data sets containing process information from participants writing texts. For each data set we fitted a series of four models, each with random effects for participants: models with normal and log-normal probability functions, in line with typical treatments in the literature; a log-normal model with unequal variances; and a two-component (bimodal) log-normal mixture model. Stan code for the mixture models was based on Roeser et al. (2021). Text location (levels: before sentence, before word, within word) was included as a predictor in all models.
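As an illustration, the two-component mixture treats each keystroke interval \(y_i\) as arising from either a fluent or a hesitant process. The following parametrisation is a sketch in the spirit of Roeser et al. (2021); the notation is chosen here for exposition and is not taken from the original code:

\[
y_i \sim \theta \cdot \mathrm{LogNormal}(\mu + \delta, \sigma'^2) + (1 - \theta) \cdot \mathrm{LogNormal}(\mu, \sigma^2)
\]

where \(\mu\) is the log-scale location of short (fluent) transitions, \(\delta > 0\) is the slowdown for hesitant transitions, \(\theta\) is the mixing proportion (the probability of a hesitant transition), and the variances \(\sigma^2\) and \(\sigma'^2\) may differ between components.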

Data were analysed with Bayesian mixed effects models (Gelman et al., 2014; McElreath, 2016). The R (R Core Team, 2020) package rstan (Stan Development Team, n.d.) was used to interface with the probabilistic programming language Stan (Carpenter et al., 2016), in which all models were implemented. Models were fitted with weakly informative priors (see McElreath, 2016) and run with 10,000 iterations on 3 chains, with a warm-up of 5,000 iterations and no thinning. Model convergence was confirmed by the Gelman-Rubin statistic (\(\hat{R}\) = 1) (Gelman & Rubin, 1992) and by visual inspection of the Markov chain Monte Carlo traces.
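For concreteness, a minimal rstan sketch with these sampler settings might look as follows; the Stan file name and data objects are placeholders rather than the actual analysis code.

```r
library(rstan)

# Hypothetical file and data names; the actual analysis code differs.
fit <- stan(
  file = "mixture_model.stan",  # placeholder Stan program
  data = stan_data,             # list of intervals, text locations, participant IDs
  chains = 3,                   # settings as reported above
  iter = 10000,
  warmup = 5000,
  thin = 1                      # no thinning
)

# Convergence: the Gelman-Rubin statistic should be 1 for all parameters.
print(summary(fit)$summary[, "Rhat"])
```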

For model comparisons we used out-of-sample predictions estimated with Pareto-smoothed importance-sampling leave-one-out cross-validation (PSIS-LOO; Vehtari et al., 2015, 2017). Predictive performance was estimated as the sum of the expected log pointwise predictive density (\(\widehat{elpd}\)); models were compared via the difference \(\Delta\widehat{elpd}\). An advantage of leave-one-out cross-validation is that models with more parameters are penalised, which guards against overfitting.
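A minimal sketch of this comparison with the loo package, assuming each Stan program stores the pointwise log-likelihood in a generated quantity named log_lik (all object names are placeholders):

```r
library(loo)

# Pointwise log-likelihood matrices from two fitted stanfit objects
# (placeholder names; assumes a generated quantity named log_lik).
ll_bimodal  <- extract_log_lik(fit_bimodal)
ll_unimodal <- extract_log_lik(fit_unimodal)

# PSIS-LOO estimates of the expected log predictive density (elpd).
loo_bimodal  <- loo(ll_bimodal)
loo_unimodal <- loo(ll_unimodal)

# Differences in elpd; the model with the highest elpd is listed first.
loo_compare(loo_bimodal, loo_unimodal)
```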

All analyses focus on keystroke intervals that terminate in an insertion. Deletions, cursor movements, and other editing operations were removed from the data. Keystroke transitions shorter than 50 ms or longer than 30 s were also removed.
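These exclusion criteria can be sketched as follows, assuming the keystroke log is a data frame with one row per transition (column names are hypothetical):

```r
library(dplyr)

# Placeholder column names: `type` codes the keystroke operation and
# `iki` gives the transition duration in milliseconds.
keystrokes_clean <- keystrokes %>%
  # Keep only transitions that terminate in an insertion.
  filter(type == "insertion") %>%
  # Remove transitions shorter than 50 ms or longer than 30 s.
  filter(iki >= 50, iki <= 30000)
```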

2 CATO

Data are published in Torrance et al. (2016). Norwegian upper secondary students (N = 26, mean age = 16.9 years) with weak decoding skills and 26 age-matched controls composed expository texts by keyboard under two conditions: normally, and with letters masked to prevent them from reading what they were writing.

2.1 Data processing

2.2 Model comparisons

2.2.1 Fit to data

Figure 2.1: CATO data. Comparison of 100 simulated (predicted) data sets with the observed data, shown separately for each model. For illustration, the x-axis was truncated at 2,000 ms.
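The figure shows a posterior predictive check. Assuming the Stan program saves replicated data in a generated quantity named y_rep and the observed intervals are in iki_observed (both names hypothetical), such a plot could be produced with bayesplot:

```r
library(rstan)
library(bayesplot)

# Posterior predictive draws (assumes a generated quantity named y_rep).
y_rep <- extract(fit_bimodal, pars = "y_rep")$y_rep  # placeholder fit object

# Overlay 100 simulated data sets on the observed interval distribution.
ppc_dens_overlay(y = iki_observed, yrep = y_rep[sample(nrow(y_rep), 100), ])
```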

2.2.2 Out-of-sample cross-validation

Table 2.1: CATO data. Model comparisons. The top row shows the model with the highest predictive performance. Standard errors are shown in parentheses.

| Model | \(\Delta\widehat{elpd}\) (all) | \(\widehat{elpd}\) (all) | \(\Delta\widehat{elpd}\) (before sentence) | \(\widehat{elpd}\) (before sentence) | \(\Delta\widehat{elpd}\) (before word) | \(\widehat{elpd}\) (before word) | \(\Delta\widehat{elpd}\) (within word) | \(\widehat{elpd}\) (within word) |
|---|---|---|---|---|---|---|---|---|
| Bimodal log-normal |  | -72,122 (165) |  | -3,130 (38) |  | -37,183 (118) |  | -31,522 (89) |
| Unimodal log-normal | -1,905 (70) | -74,028 (181) | -51 (9) | -3,181 (35) | -737 (43) | -37,919 (128) | -403 (37) | -31,926 (103) |
| Unimodal log-normal (unequal variance) | -1,118 (53) | -73,240 (173) | -52 (9) | -3,182 (35) | -738 (43) | -37,921 (128) | -396 (36) | -31,918 (102) |
| Unimodal normal | -21,142 (607) | -93,264 (648) | -542 (35) | -3,672 (35) | -8,656 (326) | -45,839 (359) | -4,717 (490) | -36,239 (519) |

Note: \(\widehat{elpd}\) = predictive performance, expressed as the expected log pointwise predictive density; \(\Delta\widehat{elpd}\) = difference in predictive performance relative to the model with the highest predictive performance in the top row. Column groups refer to all transitions and to transitions before sentences, before words, and within words.

2.3 Posterior parameter estimates of mixture model

Figure 2.2: CATO data. Posterior parameter distributions of the mixture model.

3 SPL2

Data will be published in Torrance et al. (n.d.).

3.1 Methods

Undergraduate university students (N = 39, 28 female, mean age = 20.6 years, SD = 1.51) wrote two short argumentative essays, one in English (the students' first language in all cases; L1) and one in Spanish (L2), using CyWrite (Chukharev-Hudilainen et al., 2019). CyWrite provides a writing environment with basic word-processing functionality (comparable to, e.g., Microsoft WordPad), including text selection by mouse action and copy-and-paste. We recorded the time of each keystroke and mouse action, and tracked writers' eye movements within their emerging text.

Writing tasks: Participants were given a 40-minute time limit. They wrote essays in response to each of two prompts, with prompt order and L1/L2 order counterbalanced across participants.

3.2 Data processing

3.3 Model comparisons

3.3.1 Fit to data

Figure 3.1: SPL2 data. Comparison of 100 simulated (predicted) data sets with the observed data, shown separately for each model. For illustration, the x-axis was truncated at 2,000 ms.

3.3.2 Out-of-sample cross-validation

Table 3.1: SPL2 data. Model comparisons. The top row shows the model with the highest predictive performance. Standard errors are shown in parentheses.

| Model | \(\Delta\widehat{elpd}\) (all) | \(\widehat{elpd}\) (all) | \(\Delta\widehat{elpd}\) (before sentence) | \(\widehat{elpd}\) (before sentence) | \(\Delta\widehat{elpd}\) (before word) | \(\widehat{elpd}\) (before word) | \(\Delta\widehat{elpd}\) (within word) | \(\widehat{elpd}\) (within word) |
|---|---|---|---|---|---|---|---|---|
| Bimodal log-normal |  | -103,278 (227) |  | -15,127 (65) |  | -48,526 (142) |  | -39,329 (104) |
| Unimodal log-normal (unequal variance) | -1,703 (71) | -104,981 (238) | -100 (14) | -15,226 (62) | -934 (46) | -49,459 (148) | -583 (45) | -39,912 (123) |
| Unimodal log-normal | -3,378 (81) | -106,656 (231) | -100 (14) | -15,226 (62) | -1,069 (45) | -49,595 (145) | -640 (50) | -39,969 (127) |
| Unimodal normal | -34,837 (440) | -138,115 (482) | -1,698 (59) | -16,825 (64) | -11,945 (330) | -60,471 (363) | -9,197 (1,804) | -48,527 (1,828) |

Note: \(\widehat{elpd}\) = predictive performance, expressed as the expected log pointwise predictive density; \(\Delta\widehat{elpd}\) = difference in predictive performance relative to the model with the highest predictive performance in the top row. Column groups refer to all transitions and to transitions before sentences, before words, and within words.

3.4 Posterior parameter estimates of mixture model

Figure 3.2: SPL2 data. Posterior parameter distributions of the mixture model.

4 PLanTra

4.1 Methods

The PLanTra (Plain Language for Financial Content: Assessing the Impact of Training on Students' Revisions and Readers' Comprehension) data set (Rossetti & Van Waes, 2022) comprises keystroke data from 47 university students, who were randomly divided into an experimental and a control group. In a pre-test session, all students were assigned an extract of a corporate report dealing with sustainability and were instructed to revise it to make it easier to read for a lay audience. Subsequently, the experimental group received training on how to apply plain-language principles to sustainability content, while the control group received training exclusively on the topic of sustainability. During a post-test session, both groups were instructed to revise a second extract of a corporate sustainability report with the same goal (i.e., making it easier to read for a lay audience) by applying what they had learned from their respective training. The texts were in English while the participants were native speakers of other languages (mainly Dutch), so writing took place in a second language. Note that, while some students revised the assigned texts, the majority rewrote them from scratch.

4.2 Data processing

4.3 Model comparisons

4.3.1 Fit to data

Figure 4.1: PLanTra data. Comparison of 100 simulated (predicted) data sets with the observed data, shown separately for each model. For illustration, the x-axis was truncated at 2,000 ms.

4.3.2 Out-of-sample cross-validation

Table 4.1: PLanTra data. Model comparisons. The top row shows the model with the highest predictive performance. Standard errors are shown in parentheses.

| Model | \(\Delta\widehat{elpd}\) (all) | \(\widehat{elpd}\) (all) | \(\Delta\widehat{elpd}\) (before sentence) | \(\widehat{elpd}\) (before sentence) | \(\Delta\widehat{elpd}\) (before word) | \(\widehat{elpd}\) (before word) | \(\Delta\widehat{elpd}\) (within word) | \(\widehat{elpd}\) (within word) |
|---|---|---|---|---|---|---|---|---|
| Bimodal log-normal |  | -52,065 (162) |  | -11,739 (82) |  | -21,917 (104) |  | -18,498 (78) |
| Unimodal log-normal | -1,809 (62) | -53,874 (171) | -312 (25) | -12,051 (77) | -569 (33) | -22,486 (105) | -399 (43) | -18,897 (101) |
| Unimodal log-normal (unequal variance) | -1,201 (61) | -53,265 (173) | -313 (25) | -12,052 (77) | -570 (33) | -22,487 (105) | -401 (44) | -18,899 (101) |
| Unimodal normal | -17,868 (349) | -69,932 (394) | -3,215 (93) | -14,954 (106) | -6,325 (228) | -28,242 (257) | -5,522 (749) | -24,020 (771) |

Note: \(\widehat{elpd}\) = predictive performance, expressed as the expected log pointwise predictive density; \(\Delta\widehat{elpd}\) = difference in predictive performance relative to the model with the highest predictive performance in the top row. Column groups refer to all transitions and to transitions before sentences, before words, and within words.

4.4 Posterior parameter estimates of mixture model

Figure 4.2: PLanTra data. Posterior parameter distributions of the mixture model.

5 LIFT

5.1 Methods

Data are from the LIFT (Improving Pre-university Students' Performance in Academic Synthesis Tasks with Level-up Instructions and Feedback Tool) project (Vandermeulen et al., 2020).

5.2 Data processing

5.3 Model comparisons

5.3.1 Fit to data

Figure 5.1: LIFT data. Comparison of 100 simulated (predicted) data sets with the observed data, shown separately for each model. For illustration, the x-axis was truncated at 2,000 ms.

5.3.2 Out-of-sample cross-validation

Table 5.1: LIFT data. Model comparisons. The top row shows the model with the highest predictive performance. Standard errors are shown in parentheses.

| Model | \(\Delta\widehat{elpd}\) (all) | \(\widehat{elpd}\) (all) | \(\Delta\widehat{elpd}\) (before sentence) | \(\widehat{elpd}\) (before sentence) | \(\Delta\widehat{elpd}\) (before word) | \(\widehat{elpd}\) (before word) | \(\Delta\widehat{elpd}\) (within word) | \(\widehat{elpd}\) (within word) |
|---|---|---|---|---|---|---|---|---|
| Bimodal log-normal |  | -280,027 (331) |  | -228,441 (322) |  | -390,349 (391) |  | -337,088 (302) |
| Unimodal log-normal | -7,519 (146) | -287,545 (377) | -6,252 (123) | -234,693 (342) | -8,573 (142) | -398,922 (417) | -5,237 (145) | -342,325 (368) |
| Unimodal log-normal (unequal variance) | -5,401 (122) | -285,428 (361) | -6,255 (123) | -234,696 (342) | -8,555 (142) | -398,904 (417) | -5,233 (145) | -342,321 (368) |
| Unimodal normal | -83,305 (1,657) | -363,332 (1,745) | -68,376 (977) | -296,817 (1,065) | -96,164 (1,599) | -486,513 (1,700) | -61,217 (2,552) | -398,306 (2,632) |

Note: \(\widehat{elpd}\) = predictive performance, expressed as the expected log pointwise predictive density; \(\Delta\widehat{elpd}\) = difference in predictive performance relative to the model with the highest predictive performance in the top row. Column groups refer to all transitions and to transitions before sentences, before words, and within words.

5.4 Posterior parameter estimates of mixture model

Figure 5.2: LIFT data. Posterior parameter distributions of the mixture model.

6 Cross-task / data set comparisons

Figure 6.1: Across studies. Posterior parameter distributions of the mixture model.

Table 6.1: Mixture model results for transition durations: Bayes factors (BF\(_{10}\)) for predictor estimates (main effects, interactions) on the distribution of short and hesitant transition durations (on the log scale) and on the probability of hesitant transitions (on the logit scale).

| Predictor | Short transition duration (BF\(_{10}\)) | Slowdown for hesitant transitions (BF\(_{10}\)) | Probability of hesitant transitions (BF\(_{10}\)) |
|---|---|---|---|
| Main effects |  |  |  |
| Dataset 1 (LIFT, SPL2) | > 100 | > 100 | > 100 |
| Dataset 2 (LIFT, PLanTra) | 0.1 | > 100 | > 100 |
| Dataset 3 (SPL2, PLanTra) | > 100 | 0.11 | 3.42 |
| Dataset 4 (CATO, SPL2) | > 100 | > 100 | 69.89 |
| Dataset 5 (CATO, PLanTra) | > 100 | 0.06 | 0.15 |
| Dataset 6 (CATO, LIFT) | 0.12 | 0.06 | 6.73 |
| Location 1 (before sentence, before word) | > 100 | > 100 | 88.16 |
| Location 2 (before word / sentence, within word) | > 100 | > 100 | > 100 |
| Two-way interactions |  |  |  |
| Dataset 1 : Location 1 | > 100 | > 100 | > 100 |
| Dataset 1 : Location 2 | > 100 | > 100 | > 100 |
| Dataset 2 : Location 1 | 0.03 | 0.1 | > 100 |
| Dataset 2 : Location 2 | 1.3 | 0.14 | 4.02 |
| Dataset 3 : Location 1 | > 100 | 81.03 | 8.45 |
| Dataset 3 : Location 2 | > 100 | 6.44 | 8.69 |
| Dataset 4 : Location 1 | 0.08 | > 100 | > 100 |
| Dataset 4 : Location 2 | > 100 | > 100 | 4.69 |
| Dataset 5 : Location 1 | 0.06 | > 100 | 0.96 |
| Dataset 5 : Location 2 | > 100 | > 100 | 0.29 |
| Dataset 6 : Location 1 | > 100 | 1.93 | 0.5 |
| Dataset 6 : Location 2 | > 100 | 0.42 | 1.59 |

Note: Colon indicates interactions. BF\(_{10}\) is the evidence in favour of the alternative hypothesis over the null hypothesis.
Table 6.2: By-transition-location differences between data sets in transition duration estimates inferred from the mixture model: Bayes factors (BF\(_{10}\)) for pairwise comparisons. Differences were evaluated on the log scale (for durations) and on the logit scale (for the probability of hesitant transitions).

| Comparison | Short interval durations (BF\(_{10}\)) | Slowdown for hesitant transitions (BF\(_{10}\)) | Probability of hesitant transitions (BF\(_{10}\)) |
|---|---|---|---|
| Before sentence |  |  |  |
| CATO - LIFT | > 100 | > 100 | > 100 |
| CATO - PLanTra | > 100 | > 100 | 0.33 |
| CATO - SPL2 | > 100 | 1.52 | 5.49 |
| LIFT - PLanTra | 0.12 | 78.6 | > 100 |
| LIFT - SPL2 | > 100 | > 100 | > 100 |
| SPL2 - PLanTra | > 100 | 4.81 | 82.58 |
| Before word |  |  |  |
| CATO - LIFT | > 100 | 1.15 | 0.17 |
| CATO - PLanTra | > 100 | 2.76 | 0.55 |
| CATO - SPL2 | 1.27 | 0.06 | 2.42 |
| LIFT - PLanTra | 0.32 | > 100 | 0.38 |
| LIFT - SPL2 | 29.86 | 41.79 | 2.6 |
| SPL2 - PLanTra | > 100 | 1.19 | 0.24 |
| Within word |  |  |  |
| CATO - LIFT | > 100 | 0.32 | 0.19 |
| CATO - PLanTra | > 100 | 16.12 | 0.27 |
| CATO - SPL2 | 47.44 | 0.14 | 0.22 |
| LIFT - PLanTra | 0.04 | 2.91 | 0.31 |
| LIFT - SPL2 | 0.06 | 0.08 | 0.17 |
| SPL2 - PLanTra | 0.08 | 4.94 | 0.22 |

Note: BF\(_{10}\) is the evidence in favour of the alternative hypothesis over the null hypothesis.

References

Carpenter, B., Gelman, A., Hoffman, M., Lee, D., Goodrich, B., Betancourt, M., Brubaker, M. A., Guo, J., Li, P., & Riddell, A. (2016). Stan: A probabilistic programming language. Journal of Statistical Software, 20.
Chukharev-Hudilainen, E., Saricaoglu, A., Torrance, M., & Feng, H.-H. (2019). Combined deployable keystroke logging and eyetracking for investigating L2 writing fluency. Studies in Second Language Acquisition, 41(3), 583–604.
Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., & Rubin, D. B. (2014). Bayesian data analysis (3rd ed.). Chapman; Hall/CRC.
Gelman, A., & Rubin, D. B. (1992). Inference from iterative simulation using multiple sequences. Statistical Science, 7(4), 457–472.
McElreath, R. (2016). Statistical rethinking: A Bayesian course with examples in R and Stan. CRC Press.
R Core Team. (2020). R: A language and environment for statistical computing. R Foundation for Statistical Computing. https://www.R-project.org/
Roeser, J., De Maeyer, S., Leijten, M., & Van Waes, L. (2021). Modelling typing disfluencies as finite mixture process. Reading and Writing, 1–26. https://osf.io/y3p4d/
Rossetti, A., & Van Waes, L. (2022). It’s not just a phase: Investigating text simplification in a second language from a process and product perspective. Frontiers in Artificial Intelligence, 5.
Stan Development Team. (n.d.). RStan: The R interface to Stan. https://mc-stan.org/
Torrance, M., Roeser, J., & Chukharev-Hudilainen, E. (n.d.). Lookback in L1 and L2 writing: An eye movement study.
Torrance, M., Rønneberg, V., Johansson, C., & Uppstad, P. H. (2016). Adolescent weak decoders writing in a shallow orthography: Process and product. Scientific Studies of Reading, 20(5), 375–388.
Vandermeulen, N., Steendam, E. V., & Rijlaarsdam, G. (2020). DATASET - Baseline data LIFT Synthesis Writing project [Data set]. Zenodo. https://doi.org/10.5281/zenodo.3893538
Vehtari, A., Gelman, A., & Gabry, J. (2015). Pareto smoothed importance sampling. arXiv Preprint arXiv:1507.02646.
Vehtari, A., Gelman, A., & Gabry, J. (2017). Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC. Statistics and Computing, 27(5), 1413–1432.