1 Methods

1.1 Design

All participants were exposed to adjacent and nonadjacent dependencies. In a sequence A-B-C, the location of dot B was the target of an adjacent dependency: it always followed the same A location, while the location of dot C was random. In nonadjacent dependencies, the location of dot C always followed the same A location, while the location of dot B was random.

  • Adjacent dependency: \(A_{dependee}\) \(B_{dependant}\) \(C_{random}\)
  • Nonadjacent dependency: \(A_{dependee}\) \(B_{random}\) \(C_{dependant}\)

Each participant was presented with two blocks with different location sets (order of location sets was counterbalanced across participants). Each block contained 4 sequences of three elements. Each of the four sequences was repeated 40 times per block.

There are two main differences between the three experiments. First, participants were either encouraged to predict the next target (directive instruction) or not (non-directive instruction). Second, adjacent and nonadjacent dependencies were either included in the same block, and thus presented concurrently (mixed), or presented in separate blocks (not mixed).

  • Experiment 1: non-directive instruction; dependencies not mixed (4 sequences of one dependency type per block / location set)
  • Experiment 2: directive instruction; dependencies not mixed (4 sequences of one dependency type per block / location set)
  • Experiment 3: directive instruction; dependencies mixed (2 sequences per dependency type in each block / location set)

1.2 Participants

Experiment 1: We tested 32 participants (24 females, 7 males, 1 preferred not to say). The median age of the sample was 20 years (SD = 2.51), with an age range from 18 to 29 years.

Experiment 2: We tested 32 participants (24 females, 8 males). The median age of the sample was 24 years (SD = 14.43) with an age range from 18 to 62 years.

Experiment 3: We tested 32 participants (25 females, 7 males). The median age of the sample was 27 years (SD = 11.82) with an age range from 19 to 57 years.

2 Results

2.1 Model fit

The dependent variable was the number of eye samples on the target dot (C for nonadjacent dependencies and B for adjacent dependencies in the sequence A-B-C) before it illuminated (and 250 ms after the previous target illuminated), accumulated across occurrences. Fixations longer than 100 ms were coded TRUE (successful anticipation of the target); fixations shorter than 100 ms were coded FALSE. This binary variable was used to calculate the cumulative number of successful anticipations of the target dot within each block, so that the value for the last occurrence of a sequence indicates the total number of correctly anticipated dependencies.
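
A minimal sketch of this coding step in R (using dplyr); the data frame `fixations` and its column names are illustrative assumptions, not the authors' variable names:

```r
library(dplyr)

anticipations <- fixations %>%
  # keep only looks to the target dot within the anticipation window
  filter(on_target, in_anticipation_window) %>%
  group_by(participant, block, sequence, occurrence) %>%
  # a fixation longer than 100 ms counts as a successful anticipation
  summarise(anticipated = any(duration_ms > 100), .groups = "drop") %>%
  group_by(participant, block, sequence) %>%
  arrange(occurrence, .by_group = TRUE) %>%
  # cumulative count across the 40 occurrences; the value at the last
  # occurrence is the total number of correctly anticipated dependencies
  mutate(cum_anticipations = cumsum(anticipated))
```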

Data were analysed in Bayesian mixed-effects models assuming a zero-inflated negative binomial distribution (Gelman et al. 2014; McElreath 2016). The R package brms (Bürkner 2017, 2018) was used to model the data using the probabilistic programming language Stan (Carpenter et al. 2016; Hoffman and Gelman 2014). Fixed effects were the main effects and interactions of occurrence id (1 to 40) and dependency type (levels: adjacent, nonadjacent, baseline). Occurrence id was treatment coded and dependency type was sum coded (for contrast coding see Schad et al. 2020), comparing each dependency type to baseline; the baseline consists of transitions to every dot that was essentially random and not part of a dependency (neither dependee nor dependant). Further, to capture the learning curves, occurrence id was modelled as a quadratic function (a second-order orthogonal polynomial).

Models were fitted with a maximal random-effects structure (Barr et al. 2013; Bates et al. 2015), including random participant intercepts and by-participant slope adjustments for the second-order polynomial of occurrence id (a quadratic function) and its interaction with the combination of location set (2 sets), sequence (4 sequences per dependency type and location set), dependency type, and transition (levels: to A for adjacent dependencies, to B for nonadjacent dependencies, and to C), a combination yielding 16 levels.1
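
For concreteness, a sketch of the corresponding brms call, following the formula given in the footnote; the data frame `d`, its variable names, and the contrast assignment are illustrative assumptions rather than the authors' code:

```r
library(brms)

# sum contrasts for dependency type (see Schad et al. 2020)
d$dependency <- factor(d$dependency)
contrasts(d$dependency) <- contr.sum(3)

fit <- brm(
  y ~ poly(occurrence, 2) * dependency +
    (poly(occurrence, 2) * sequence | participant),
  data         = d,
  family       = zero_inflated_negbinomial(),
  chains       = 3,
  iter         = 10000,
  warmup       = 5000,
  sample_prior = "yes"  # retain prior draws for Savage–Dickey Bayes factors
)
```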

We calculated the statistical support for the alternative hypothesis over the null hypothesis. This evidence was obtained using Bayes factors (henceforth, BF) calculated with the Savage–Dickey method (see, e.g., Dickey, Lientz, and others 1970; Wagenmakers et al. 2010). A BF larger than 5 indicates moderate evidence, and a BF larger than 10 strong evidence, for a statistically meaningful effect compared to the null hypothesis (see, e.g., Baguley 2012; Jeffreys 1961; Lee and Wagenmakers 2014). For example, a BF of 2 indicates that the data are twice as likely under the alternative hypothesis as under the null hypothesis. In contrast to traditional null-hypothesis significance testing, the Bayesian framework also allows us to quantify the evidence against the alternative hypothesis, typically corresponding to BFs smaller than 0.33 (for discussion see Dienes 2014, 2016; Dienes and Mclatchie 2018; Schönbrodt et al. 2017; Wagenmakers et al. 2018).
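
In brms, such Savage–Dickey BFs can be obtained with the hypothesis() function, provided prior draws were retained at fitting time (sample_prior = "yes"); the coefficient name dependency1 below is an assumption about how brms labels the first sum contrast:

```r
# Savage–Dickey density ratio for a point null on one coefficient
h <- hypothesis(fit, "dependency1 = 0")

# Evid.Ratio is the evidence for the point null (BF01);
# its inverse is the evidence for H1 over H0 (BF10)
BF10 <- 1 / h$hypothesis$Evid.Ratio
```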

Models were fitted with weakly informative priors (see McElreath 2016) and run with 10,000 iterations on 3 chains with a warm-up of 5,000 iterations and no thinning. Model convergence was confirmed by the Gelman–Rubin statistic (Gelman and Rubin 1992) and inspection of the Markov chain Monte Carlo chains.
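
A sketch of what these settings and checks might look like in brms; the prior scales are assumptions, as the specific priors are not reported here:

```r
# illustrative weakly informative priors, passed to brm() via `prior =`
priors <- c(
  set_prior("normal(0, 1)", class = "b"),
  set_prior("student_t(3, 0, 2.5)", class = "Intercept")
)

# convergence diagnostics for the fitted model
rhat_vals <- rhat(fit)               # Gelman–Rubin statistic per parameter
all(rhat_vals < 1.05, na.rm = TRUE)  # TRUE suggests convergence
plot(fit)                            # trace and density plots of the chains
```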

2.2 Shape of function

We established that a quadratic function fits the data better than a linear (first-order polynomial) model. For model comparisons we used leave-one-out cross-validation to prevent overfitting (Vehtari, Gelman, and Gabry 2015, 2017). The comparisons shown in Table 2.1 revealed higher predictive performance for occurrence id modelled as a quadratic predictor than for a linear predictor or a model with a cubic growth component. We therefore used the quadratic model for modelling the cumulative number of on-target looks.
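
These comparisons can be carried out with brms' loo interface; a sketch, assuming fitted model objects fit_linear, fit_quadratic, and fit_cubic for the three growth specifications:

```r
# leave-one-out cross-validation (PSIS-LOO) for each growth model
loo_lin  <- loo(fit_linear)
loo_quad <- loo(fit_quadratic)
loo_cub  <- loo(fit_cubic)

# elpd differences relative to the best-performing model
loo_compare(loo_lin, loo_quad, loo_cub)
```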

Table 2.1: Model comparisons of the orthogonal polynomial growth function used for occurrence id, modelled as a linear and a quadratic function. The top row shows the models with the highest predictive performance. Standard errors are shown in parentheses. ‘x’ in the model formulas represents the occurrence of a dot in the sequence.

| Model | \(\Delta\widehat{elpd}\) (Exp. 1) | \(\widehat{elpd}\) (Exp. 1) | \(\Delta\widehat{elpd}\) (Exp. 2) | \(\widehat{elpd}\) (Exp. 2) | \(\Delta\widehat{elpd}\) (Exp. 3) | \(\widehat{elpd}\) (Exp. 3) |
|---|---|---|---|---|---|---|
| \(y \sim x + x^2\) | – | -57,204 (138) | – | -57,684 (137) | – | -58,478 (140) |
| \(y \sim x\) | -1,191 (43) | -58,395 (133) | -1,266 (45) | -58,950 (130) | -859 (36) | -59,337 (138) |

Note: \(\widehat{elpd}\) = predictive performance indicated as expected log pointwise predictive density; \(\Delta\widehat{elpd}\) = difference in predictive performance relative to the model with the highest predictive performance in the top row.

2.3 Dependency effect

We compared three models to establish whether adjacent and nonadjacent dependencies are learned differently and with different timecourse trajectories: (1) a model with the quadratic function of occurrence, (2) independent main effects of the quadratic timecourse function and dependency type, and (3) the main effects and interactions of the quadratic timecourse function and dependency type. For model comparisons we used leave-one-out cross-validation (Vehtari, Gelman, and Gabry 2015, 2017).

The comparisons shown in Table 2.2 revealed the highest predictive performance for the model with main effects and interactions. However, this model was only negligibly better than the model without the interaction term, with a small advantage in Experiment 1 and no advantage in Experiments 2 and 3. The independent main-effects model was substantially better than the model without a main effect of dependency type in Experiments 1 and 2, but only marginally better in Experiment 3. The model without the main effect of dependency type showed the weakest predictive performance.

Table 2.2: Model comparisons including the effect of dependency. The top row shows the models with the highest predictive performance. Standard errors are shown in parentheses. ‘x’ in the model formulas represents the occurrence of a dot in the sequence.

| Model | \(\Delta\widehat{elpd}\) (Exp. 1) | \(\widehat{elpd}\) (Exp. 1) | \(\Delta\widehat{elpd}\) (Exp. 2) | \(\widehat{elpd}\) (Exp. 2) | \(\Delta\widehat{elpd}\) (Exp. 3) | \(\widehat{elpd}\) (Exp. 3) |
|---|---|---|---|---|---|---|
| \(y \sim (x + x^2) \cdot \text{dependency}\) | – | -57,033 (138) | – | -57,607 (136) | – | -58,469 (140) |
| \(y \sim x + x^2 + \text{dependency}\) | -6 (2) | -57,039 (138) | -1 (2) | -57,609 (136) | 0 (1) | -58,469 (140) |
| \(y \sim x + x^2\) | -171 (20) | -57,204 (138) | -76 (14) | -57,684 (137) | -9 (4) | -58,478 (140) |

Note: \(\widehat{elpd}\) = predictive performance indicated as expected log pointwise predictive density; \(\Delta\widehat{elpd}\) = difference in predictive performance relative to the model with the highest predictive performance in the top row.

The results in Table 2.2 suggest that some dependency types are associated with a stronger learning effect, and that this effect is relatively stable across time. We summarise the posterior of the interaction models in Figure 2.1 to inspect the differences between dependency types.

Figure 2.1 shows the modelled learning curves for adjacent and nonadjacent dependencies across experiments, compared to baseline. The posterior learning curves highlight that the adjacency learning effect in Experiments 1 and 2 was larger in magnitude than the disadvantage for nonadjacent dependencies. Both dependency effects disappear under the mixed presentation in Experiment 3.

Figure 2.1: Modelled learning curves with posterior mean and 95% PIs indicated as ribbons.

The overall estimates for the dependency types are shown in Figure 2.2. This figure primarily indicates the advantage for adjacent dependencies in Experiments 1 and 2, which is substantially reduced in magnitude in Experiment 3.

Figure 2.2: Modelled learning overall for dependency types with posterior mean and 95% PIs indicated as error bars.

From the posterior shown in Figure 2.2, we calculated the differences between adjacent and nonadjacent dependencies and baseline after controlling for the effect of occurrence id. The posterior is summarised as the mean estimate with 95% probability intervals, and the evidence in support of the alternative hypothesis is indicated as \(H_1\). The results, shown in Table 2.3, provide evidence for learning of adjacent dependencies across all experiments, with a substantially reduced learning effect in Experiment 3. For nonadjacent dependencies, we found evidence for an inhibitory effect in Experiment 1, evidence against such an inhibition in Experiment 2, and negligible evidence for an inhibitory effect in Experiment 3 (although the concurrent presentation of adjacent and nonadjacent dependencies appears to inhibit learning of adjacent dependencies). This finding – that learning adjacent and nonadjacent dependencies simultaneously reduces the advantage for adjacent dependencies – suggests that although nonadjacent dependencies did lead to anticipatory eye movements, participants changed their behaviour for adjacent dependencies when exposed to nonadjacent dependencies at the same time, rather than learning each dependency type independently.
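
The marginal differences in Table 2.3 can be read directly off the posterior draws; a sketch (the coefficient name b_dependency1 is an assumption about brms' parameter naming under the sum contrast):

```r
library(posterior)

draws <- as_draws_df(fit)
adj_vs_baseline <- draws$b_dependency1  # adjacent effect, log scale

# posterior mean and 95% probability interval, as reported in Table 2.3
c(mean = mean(adj_vs_baseline),
  quantile(adj_vs_baseline, probs = c(0.025, 0.975)))
```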

Table 2.3: Marginal learning effects for adjacent and nonadjacent dependency (compared to baseline) after accounting for occurrence id. Differences are on the log scale.

| Comparison | Est. with 95% PI (Exp. 1) | \(H_1\) | Est. with 95% PI (Exp. 2) | \(H_1\) | Est. with 95% PI (Exp. 3) | \(H_1\) |
|---|---|---|---|---|---|---|
| Adjacent vs baseline | 0.3 [0.27 – 0.34] | >100 | 0.18 [0.15 – 0.22] | >100 | 0.09 [0.05 – 0.13] | 32.73 |
| Nonadjacent vs baseline | -0.08 [-0.11 – -0.05] | >100 | -0.01 [-0.04 – 0.03] | 0.02 | -0.06 [-0.1 – -0.02] | 1.59 |

Note: H1 = evidence in favour of the alternative hypothesis over the null hypothesis (Bayes factor); PI = probability interval.

2.4 Pooled analysis

In a pooled analysis of all three experiments, we aimed to find statistical evidence that learning adjacent dependencies fails when people are exposed to adjacent and nonadjacent dependencies at the same time. To answer this question, we added experiment as an interaction term. The factor experiment was coded via a Helmert contrast, comparing Experiment 1 to Experiment 2 and comparing Experiment 3 to Experiments 1 and 2 combined (see Schad et al. 2020). We fitted three models, all with the same specification as above, including in addition (1) the main effect of experiment, (2) experiment and all by-experiment two-way interactions, and (3) additionally the by-experiment three-way interaction.
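
A sketch of this contrast coding in base R (factor levels are assumed):

```r
# Helmert-type contrasts for experiment:
#   column 1: Experiment 1 vs. Experiment 2
#   column 2: Experiments 1 and 2 combined vs. Experiment 3
d$experiment <- factor(d$experiment, levels = c("exp1", "exp2", "exp3"))
contrasts(d$experiment) <- cbind(
  e1_vs_e2  = c(1, -1,  0),
  e12_vs_e3 = c(1,  1, -2)
)
```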

Model comparisons revealed the highest predictive performance for the model that includes all by-experiment two-way interactions (\(\widehat{elpd}\) = -173,542, SE = 237), which was negligibly better than the model with the three-way interaction (\(\Delta\widehat{elpd}\) = -0.75, SE = 0.48; \(\widehat{elpd}\) = -173,543, SE = 237). The model with the highest predictive performance – the model with the by-experiment two-way interactions – was substantially better than the model without by-experiment interactions (\(\Delta\widehat{elpd}\) = -24.65, SE = 7.90; \(\widehat{elpd}\) = -173,566, SE = 237). The model with all two-way interactions is therefore the most parsimonious model, suggesting that dependencies (in particular adjacent dependencies) were learned differently across experiments.

Table 2.4 shows the main effects and interactions of dependency type and experiment after controlling for time course. There is evidence for learning of adjacent dependencies and an inhibitory learning effect for nonadjacent dependencies. Evidence for by-experiment interactions was found for adjacent dependencies, demonstrating a stronger learning effect for adjacent dependencies in Experiments 1 and 2 compared to Experiment 3. Learning of adjacent dependencies was also stronger in Experiment 1 compared to Experiment 2. Evidence for all other predictors was negligible.

Table 2.4: Main effects and interactions of dependency types (compared to baseline) and experiment after accounting for occurrence id. Dependency type is coded as a sum contrast and experiment is coded using a Helmert contrast. Differences are on the log scale.

| Predictor | Est. with 95% PI | \(H_1\) |
|---|---|---|
| Main effects | | |
| Adjacent vs baseline | 0.19 [0.16 – 0.21] | >100 |
| Nonadjacent vs baseline | -0.05 [-0.07 – -0.03] | 53.59 |
| Exp. 1,2 vs 3 | -0.08 [-0.16 – 0] | 0.3 |
| Exp. 1 vs 2 | -0.01 [-0.1 – 0.08] | 0.05 |
| Interactions | | |
| Adjacent : Exp. 1,2 vs 3 | -0.36 [-0.42 – -0.29] | >100 |
| Nonadjacent : Exp. 1,2 vs 3 | 0.04 [-0.02 – 0.1] | 0.07 |
| Adjacent : Exp. 1 vs 2 | -0.09 [-0.14 – -0.04] | 6 |
| Nonadjacent : Exp. 1 vs 2 | 0.08 [0.03 – 0.13] | 2.4 |

Note: H1 = evidence in favour of the alternative hypothesis over the null hypothesis (Bayes factor); PI = probability interval; ‘:’ = interaction.

References

Baguley, Thomas. 2012. Serious Stats: A Guide to Advanced Statistics for the Behavioral Sciences. Basingstoke: Palgrave Macmillan.
Barr, Dale J., Roger Levy, Christoph Scheepers, and Harry J. Tily. 2013. “Random Effects Structure for Confirmatory Hypothesis Testing: Keep It Maximal.” Journal of Memory and Language 68 (3): 255–78.
Bates, Douglas M., Reinhold Kliegl, Shravan Vasishth, and R. Harald Baayen. 2015. “Parsimonious Mixed Models.” arXiv Preprint arXiv:1506.04967.
Bürkner, Paul-Christian. 2017. “brms: An R Package for Bayesian Multilevel Models Using Stan.” Journal of Statistical Software 80 (1): 1–28. https://doi.org/10.18637/jss.v080.i01.
———. 2018. “Advanced Bayesian Multilevel Modeling with the R Package brms.” The R Journal 10 (1): 395–411. https://doi.org/10.32614/RJ-2018-017.
Carpenter, Bob, Andrew Gelman, Matt Hoffman, Daniel Lee, Ben Goodrich, Michael Betancourt, Michael A. Brubaker, Jiqiang Guo, Peter Li, and Allen Riddell. 2016. “Stan: A Probabilistic Programming Language.” Journal of Statistical Software 20.
Dickey, James M., B. P. Lientz, and others. 1970. “The Weighted Likelihood Ratio, Sharp Hypotheses about Chances, the Order of a Markov Chain.” The Annals of Mathematical Statistics 41 (1): 214–26.
Dienes, Zoltan. 2014. “Using Bayes to Get the Most Out of Non-Significant Results.” Frontiers in Psychology 5 (781): 1–17.
———. 2016. “How Bayes Factors Change Scientific Practice.” Journal of Mathematical Psychology 72: 78–89.
Dienes, Zoltan, and Neil Mclatchie. 2018. “Four Reasons to Prefer Bayesian Analyses over Significance Testing.” Psychonomic Bulletin & Review 25 (1): 207–18.
Gelman, Andrew, J. B. Carlin, H. S. Stern, D. B. Dunson, Aki Vehtari, and D. B. Rubin. 2014. Bayesian Data Analysis. 3rd ed. Chapman; Hall/CRC.
Gelman, Andrew, and Donald B. Rubin. 1992. “Inference from Iterative Simulation Using Multiple Sequences.” Statistical Science 7 (4): 457–72.
Hoffman, Matthew D., and Andrew Gelman. 2014. “The No-U-Turn sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo.” Journal of Machine Learning Research 15 (1): 1593–623.
Jeffreys, Harold. 1961. The Theory of Probability. Vol. 3. Oxford: Oxford University Press, Clarendon Press.
Lee, Michael D., and Eric-Jan Wagenmakers. 2014. Bayesian Cognitive Modeling: A Practical Course. Cambridge University Press.
McElreath, Richard. 2016. Statistical Rethinking: A Bayesian Course with Examples in R and Stan. CRC Press.
Schad, Daniel J., Shravan Vasishth, Sven Hohenstein, and Reinhold Kliegl. 2020. “How to Capitalize on a Priori Contrasts in Linear (Mixed) Models: A Tutorial.” Journal of Memory and Language 110: 104038.
Schönbrodt, Felix D., Eric-Jan Wagenmakers, Michael Zehetleitner, and Marco Perugini. 2017. “Sequential Hypothesis Testing with Bayes Factors: Efficiently Testing Mean Differences.” Psychological Methods 22 (2): 322–39.
Vehtari, Aki, Andrew Gelman, and Jonah Gabry. 2015. “Pareto Smoothed Importance Sampling.” arXiv Preprint arXiv:1507.02646.
———. 2017. “Practical Bayesian Model Evaluation Using Leave-One-Out Cross-Validation and WAIC.” Statistics and Computing 27 (5): 1413–32.
Wagenmakers, Eric-Jan, Tom Lodewyckx, Himanshu Kuriyal, and Raoul Grasman. 2010. “Bayesian Hypothesis Testing for Psychologists: A Tutorial on the Savage–Dickey Method.” Cognitive Psychology 60 (3): 158–89.
Wagenmakers, Eric-Jan, Maarten Marsman, Tahira Jamil, Alexander Ly, Josine Verhagen, Jonathon Love, Ravi Selker, et al. 2018. “Bayesian Inference for Psychology. Part I: Theoretical Advantages and Practical Ramifications.” Psychonomic Bulletin & Review 25 (1): 35–57.

  1. In brms syntax: y ~ poly(occurrence, 2) * dependency + (poly(occurrence, 2) * sequence | participant), where sequence has 16 levels per participant comprising location sets, dependency types, transition location, and 4 different sequences per dependency type.