Applying the Levante IRT Model to Spark Datasets

Author

Fionnuala O’Reilly

Published

April 29, 2026

1 Introduction

The Levante Hearts and Flowers (H&F) task is a computerized measure of inhibitory control and cognitive flexibility based on the Hearts and Flowers paradigm (Davidson et al., 2006), modelled after the AMES tablet version which has been validated across diverse global settings (Ahmed et al., 2022; Khan et al., 2024). Participants respond according to stimulus type: pressing a key on the same side as a heart (congruent rule) and on the opposite side for a flower (incongruent rule). The task includes three blocks — congruent (hearts only), incongruent (flowers only), and mixed (hearts and flowers) — with the congruent block serving as a baseline with minimal executive demands. Participants are encouraged to respond as quickly as possible; no hard time limit is imposed per trial, though responses slower than 2000ms are scored as incorrect. The task is scored using a Rasch IRT model calibrated on a cross-national sample of children aged 5–12 years.

The goal of this analysis is to evaluate the generalizability of the Levante H&F IRT model by applying it to two external datasets collected by Spark Lab. The PLUS dataset was collected from students in grades 3–5 (approximately 8–11 years) in a large urban school district in the US, who completed the task on individual digital devices during the school day in a classroom setting. The Kimochis dataset was collected from preschool-aged children (2–5 years) in a one-to-one assessment context. Although both datasets are described in the participants and data processing sections below, the primary generalizability analysis focuses on PLUS only. Kimochis children fall below the Levante calibration age range of 5–12 years, and the Kimochis assessment did not include a mixed block — the component of the task that places the greatest demands on executive function and contributes most of the IRT model’s discriminating power. For these reasons, meaningful generalizability testing is not feasible for Kimochis, and we do not pursue further analysis of that dataset beyond the descriptive sections.

The generalizability analysis of PLUS addresses three research questions:

1. How comparable are item difficulties across datasets? We examine whether the Levante items rank in the same order of difficulty in PLUS as in the original Levante calibration sample, using raw percent correct as a format-neutral index of item difficulty.
2. How well do Levante’s item parameters transfer to PLUS? We apply fixed Levante item parameters to score PLUS children and evaluate model fit, reliability, and the plausibility of the resulting ability estimates.
3. What happens when we re-estimate item parameters from PLUS? We re-estimate item difficulty parameters freely from the PLUS Fall sample using the Levante pipeline, and compare these to the original Levante parameters to quantify how much the parameters shift and what this means for measurement quality.

We additionally examine reaction time profiles across datasets as a descriptive check on construct validity, asking whether the expected patterns of faster RTs with age and greater difficulty for incongruent trials are present in PLUS as in Levante.

2 Participants

Table 1: Sample characteristics by dataset and timepoint. Age M (SD) is reported in years. Retention reflects the number of children appearing at one, two, three, or four timepoints within each dataset.

Dataset	Timepoint	N	Age M (SD)	% Female	Retention
PLUS	PLUS Fall	717	9.89 (0.82)	48.9	1 tp: n = 169 2 tp: n = 634
PLUS	PLUS Spring	720	9.91 (0.82)	48.6	1 tp: n = 169 2 tp: n = 634
Kimochis	2021-2022 Fall	471	4.08 (0.52)	46.2	1 tp: n = 873 2 tp: n = 633 3 tp: n = 119 4 tp: n = 45
	2021-2022 Spring	764	4.14 (0.59)	46.2
	2022-2023 Fall	599	4.32 (0.53)	51.1
	2022-2023 Spring	842	4.26 (0.6)	50.9

2.1 PLUS

The PLUS dataset includes 803 unique students across two timepoints (PLUS Fall and PLUS Spring) drawn from 8 schools. Sample sizes were stable across timepoints (Fall: n = 717, Spring: n = 720) (see Table 1). Mean age was approximately 9.9 years (SD ~0.82) at both timepoints, with students in grades 3, 4, and 5 roughly evenly distributed. The sample was approximately 49% female at both timepoints.

2.2 Kimochis

The Kimochis dataset includes 1670 unique children across four timepoints spanning two academic years (2021-2022 and 2022-2023, Fall and Spring) (see Table 1). Sample sizes ranged from 471 children at the first timepoint to 842 at the last. The sample was approximately evenly split by sex (~46-51% female) across timepoints, with mean age ranging from 4.08 to 4.32 years. Children were enrolled in Pre-K and, from 2021-2022 Spring onwards, Transitional Kindergarten (TK).

Longitudinal retention was limited: the majority of children appeared at only one timepoint (n = 873), with 633 appearing at two, 119 at three, and 45 at all four timepoints.

For the majority of children (n = 1445), age did not vary across timepoints, suggesting it reflects approximate age at enrolment rather than age at time of assessment. Age values also appear to be rounded to the nearest month rather than recorded as a continuous measure.

3 Data processing

3.1 Data preparation and scoring - PLUS

Trial-level data from PLUS were filtered to test blocks only (hearts test, flowers test, mixed test), excluding practice trials. Trials with RT < 100ms were excluded as likely accidental key presses (n = 842). Accuracy was derived from the stimulus-response pairing: trials with a heart stimulus were scored correct if the response matched the stimulus side (congruent rule), and trials with a flower stimulus were scored correct if the response was on the opposite side (incongruent rule). Trials with no response (NA) were scored as incorrect, consistent with the Levante scoring approach. Additionally, consistent with the 750ms response window enforced during PLUS data collection, trials with RT > 750ms were scored as incorrect (19.8% of trials). This yielded a final analytic sample of 803 children (2,874 observations across fall and spring timepoints).
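The scoring rules above can be sketched as a single function. This is a minimal illustration rather than the pipeline's actual code: the function name `score_trial` and the argument names (`stimulus`, `stimulus_side`, `response_side`, `rt_ms`) are assumptions introduced for the example.

```python
from typing import Optional

RT_MIN_MS = 100  # faster trials are excluded as likely accidental key presses
RT_MAX_MS = 750  # PLUS response window; slower trials are scored incorrect


def score_trial(stimulus: str, stimulus_side: str,
                response_side: Optional[str],
                rt_ms: Optional[float]) -> Optional[int]:
    """Return 1 (correct), 0 (incorrect), or None (trial excluded)."""
    # Trials with no response are scored incorrect.
    if response_side is None or rt_ms is None:
        return 0
    # Very fast responses are dropped from the data entirely.
    if rt_ms < RT_MIN_MS:
        return None
    # Responses outside the response window are scored incorrect.
    if rt_ms > RT_MAX_MS:
        return 0
    if stimulus == "heart":   # congruent rule: same side as the stimulus
        return int(response_side == stimulus_side)
    if stimulus == "flower":  # incongruent rule: opposite side
        return int(response_side != stimulus_side)
    raise ValueError(f"unknown stimulus: {stimulus!r}")


print(score_trial("flower", "left", "right", 620))  # correct, within window
print(score_trial("heart", "left", "left", 810))    # right side, but too slow
```

Returning `None` for excluded trials keeps them distinct from trials scored as incorrect, so they can be dropped before any aggregation.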

Each trial was assigned an item label following the Levante naming convention: {block}_{stimulus}_{trial_type}-{trial_number}, where trial type was classified as start (first trial in block), stay (same stimulus type as previous trial), or switch (different stimulus type from previous trial). Repeated trials of the same type (e.g., multiple heart stay trials) were collapsed into a single score per trial type. A child was scored as correct on that trial type if they got more than half of those trials right, and incorrect otherwise, giving one binary score per trial type per child to enter into the IRT model. Item difficulty parameters were taken from the Levante H&F Rasch model, which was estimated under scalar invariance across sites; parameters are therefore identical across all three Levante sites. This yielded 9 IRT items in PLUS out of the 10 analytic units used in Levante scoring (see Table 2). The missing item type (heartsflowers_heart_start) was absent from PLUS because the mixed block always began with a flower stimulus in this assessment (n = 1434 start trials, all flower), whereas the Levante mixed block begins with a heart stimulus. As a result, no child contributed a heartsflowers_heart_start trial and this analytic unit could not be scored.
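The labelling and majority-rule steps described above can be sketched as follows. The helper names `label_trials` and `collapse_to_units` are hypothetical, and the trial number is assumed here to be a running count within each block, stimulus, and trial type combination; the source does not spell out how the number is assigned.

```python
from collections import defaultdict


def label_trials(block, stimuli):
    """Label trials in one block as {block}_{stimulus}_{trial_type}-{n}."""
    counts = defaultdict(int)
    labels = []
    for i, stim in enumerate(stimuli):
        if i == 0:
            trial_type = "start"    # first trial in the block
        elif stim == stimuli[i - 1]:
            trial_type = "stay"     # same stimulus type as previous trial
        else:
            trial_type = "switch"   # different stimulus type from previous
        unit = f"{block}_{stim}_{trial_type}"
        counts[unit] += 1           # assumed numbering: running count per unit
        labels.append(f"{unit}-{counts[unit]}")
    return labels


def collapse_to_units(labels, scores):
    """One binary score per analytic unit: 1 if more than half are correct."""
    by_unit = defaultdict(list)
    for label, score in zip(labels, scores):
        by_unit[label.rsplit("-", 1)[0]].append(score)
    return {unit: int(sum(s) > len(s) / 2) for unit, s in by_unit.items()}


stimuli = ["flower", "flower", "heart", "heart", "heart", "flower"]
labels = label_trials("heartsflowers", stimuli)
units = collapse_to_units(labels, [1, 1, 0, 1, 0, 1])
print(labels)
print(units)
```

Because the rule is a strict majority, a unit with exactly half of its trials correct is scored 0, as `heartsflowers_heart_stay` is in this toy example (one of two trials correct).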

Ability scores (θ) showed an increasing pattern across grades, both under a Rasch model (3rd grade: M = -0.58, SD = 0.81; 4th grade: M = -0.37, SD = 0.81; 5th grade: M = -0.07, SD = 0.81) and under a 2PL model (3rd grade: M = -0.35, SD = 0.65; 4th grade: M = -0.18, SD = 0.66; 5th grade: M = 0.08, SD = 0.69), consistent with the expected developmental trajectory and supporting the construct validity of the scoring approach in this sample. All grade-level means except the 5th-grade 2PL mean are negative relative to the Levante scale origin, likely reflecting the stricter 750ms response window applied in PLUS scoring, which results in a higher proportion of trials scored as incorrect compared to Levante’s 2000ms rule.

Table 2: Item coverage across Levante, PLUS, and Kimochis

	Levante	PLUS	Kimochis
Total raw items	74	73	36
— Hearts block	12	12	12
— Flowers block	16	16	24
— Mixed block	46	45	0
Raw items matched to dataset	74	73	28
Reason(s) for non-overlap	—	Missing unit type: heartsflowers_heart_start	No mixed block (46 items); 8 flowers trials exceed Levante block length (stay-16 to stay-23)
IRT analytic units	10	9	4

3.2 Data preparation and scoring - Kimochis

Trial-level data from Kimochis were filtered to test blocks only (hearts test, flowers test), excluding practice trials. The dataset did not include a mixed block, as the children were considered too young for its task-switching demands. Accuracy was taken from the pre-computed “acc” column rather than derived from stimulus-response pairings. Timeout trials (14.5% of test trials) were scored as incorrect, consistent with the Levante scoring approach. This yielded a final analytic sample of 1670 children (5,352 observations) across two academic year cohorts (2021–2022 and 2022–2023), each with fall and spring timepoints.

Each trial was assigned an item label following the Levante naming convention. Of the 74 items in the Levante H&F model, 28 were present in Kimochis, all from the hearts and flowers blocks; all 46 mixed block items were absent by design. The 8 additional flowers trials in Kimochis (stay-16 through stay-23) exceeded the length of the Levante flowers block and were excluded from scoring. Repeated trials of the same type were aggregated into a single analytic unit scored as correct if the child got more than half right, yielding 4 IRT items (flowers_flower_start, flowers_flower_stay, hearts_heart_start, hearts_heart_stay) (see Table 2).

Ability scores (θ) showed an increasing pattern across age groups (2–3 years: M = -0.88, -1.03, SD = 0.72, 0.75; 3–4 years: M = -0.5, -0.64, SD = 0.82, 0.82; 4–5 years: M = -0.06, -0.21, SD = 0.73, 0.71), consistent with the expected developmental trajectory. All means were negative relative to the Levante scale origin, which is expected given that Kimochis children fall below the Levante calibration age range of 5–12 years and scores are based on only 4 items with no mixed block contribution.

Given these limitations, the structural differences between the two assessments are too great to support meaningful generalizability testing. The remaining sections focus exclusively on PLUS.

4 How does raw performance compare across datasets?

Figure 1: Age-related proportion correct (raw scores) on the Hearts and Flowers task across datasets. Each point represents one participant run; the blue line shows a GAM-smoothed trend with 95% confidence interval. PLUS data are split by timepoint (Fall and Spring). Note that PLUS proportion correct reflects the stricter 750ms response window applied during data collection.

Figure 2: Proportion correct (raw scores) per item type across datasets, computed from trial-level data prior to IRT scoring. Items are ordered by mean proportion correct across Levante sites (easiest at top). Note that heartsflowers_heart_start is shown for Levante only, as this item does not exist in the PLUS assessment design.

We first examined raw proportion correct across datasets as an index of item difficulty, prior to any IRT modelling. Figure 1 shows overall accuracy as a function of age for each dataset. Across all three Levante sites, accuracy increased with age, rising from approximately 50–60% at age 5 to 90–100% by age 12. PLUS Fall showed a similar upward trend for 8–11-year-olds, though at a lower overall level than at comparable Levante ages, consistent with the stricter 750ms response window applied during PLUS data collection.

Figure 2 shows raw proportion correct broken down by item type. The item difficulty ordering was broadly consistent across datasets — hearts and flowers block items were easiest across all datasets, while mixed block items were harder, consistent with the greater executive demands of task-switching. PLUS Fall was systematically lower than all three Levante sites on mixed block items, again likely reflecting the 750ms scoring rule, which disproportionately affects slower mixed block trials. One exception was heartsflowers_flower_start, where PLUS Fall showed substantially lower accuracy (~35%) relative to Levante sites (~75–85%), suggesting this item may function differently in the PLUS assessment context. This item is examined further in the item fit analysis in Section 5.1.

5 How well do Levante’s item parameters transfer to PLUS?

Table 3: Marginal reliability estimates by sample and timepoint. Note: For PLUS, N reflects scored child-timepoint observations; 803 unique children contributed data at one or both timepoints.

Sample	Timepoint	N	Reliability (2PL)	Reliability (Rasch)
Levante (Germany)		336	0.91	0.56
Levante (Colombia)		725	0.94	0.73
Levante (Canada)		225	0.90	0.61
PLUS	Pooled	1,437	0.55	0.38
	PLUS Fall	717	0.53	0.34
	PLUS Spring	720	0.54	0.38

Table 4: Item-level pass rate (proportion of children scoring correct) by timepoint (PLUS)

Item	PLUS Fall	PLUS Spring
flowers_flower_start	0.744	0.818
flowers_flower_stay	0.909	0.967
hearts_heart_start	0.905	0.928
hearts_heart_stay	0.987	0.988
heartsflowers_flower_start	0.355	0.499
heartsflowers_flower_stay	0.541	0.565
heartsflowers_flower_switch	0.352	0.508
heartsflowers_heart_stay	0.662	0.622
heartsflowers_heart_switch	0.327	0.515

Reliability estimates are reported as marginal reliability, computed as 1 minus the ratio of mean error variance to total score variance (see Table 3). Levante reliability was high under the 2PL model (0.90 to 0.94 across the three sites) and moderate under the Rasch model (0.56 to 0.73). PLUS reliability was modest but positive at both timepoints (Fall: 0.528 for 2PL and 0.335 for Rasch; Spring: 0.541 for 2PL and 0.382 for Rasch), and notably lower than Levante under both models. The lower reliability relative to Levante likely reflects the stricter 750ms response window rather than a fundamental problem with the IRT model.
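As a minimal sketch of the marginal reliability definition given above, the function below takes per-child ability estimates and their standard errors. Treating "total score variance" as the sample variance of the θ estimates is an assumption, and the function name and toy values are illustrative only, not taken from the pipeline.

```python
from statistics import mean, variance


def marginal_reliability(thetas, standard_errors):
    """1 - (mean error variance / total variance of the theta estimates)."""
    error_var = mean([se ** 2 for se in standard_errors])
    total_var = variance(thetas)  # sample variance (assumed definition)
    return 1 - error_var / total_var


# Toy values: theta variance is 1.0 and mean error variance is 0.25.
thetas = [-1.0, 0.0, 1.0]
standard_errors = [0.5, 0.5, 0.5]
rel = marginal_reliability(thetas, standard_errors)
print(round(rel, 2))
```

Under this definition, larger standard errors relative to the spread of θ estimates push reliability down. This is the pattern expected when a few near-ceiling or near-floor items carry most of the information.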

The distribution of θ estimates (Figure 3) shows a rightward shift from Fall to Spring, consistent with expected growth over the academic year. The Spring distribution is multimodal, with distinct peaks around θ = −1, 0.3, and 0.9. This likely reflects ceiling effects on the simpler items: by Spring, hearts and flowers block items showed near-perfect pass rates (see Table 4), leaving little variance to separate children at the upper end of the ability range, and producing the clustering visible in the distribution.

Figure 3: Distribution of ability estimates (θ) by timepoint in PLUS

5.1 Check item fit

Table 5: Item-level fit statistics for PLUS Fall (S-X2 test)

Model	Item	S-X2	df	RMSEA	p
2PL	heartsflowers_flower_start	285.090	8	0.220	6.11e-57
2PL	flowers_flower_stay	177.550	7	0.185	6.41e-35
2PL	hearts_heart_stay	173.357	8	0.170	2.55e-33
2PL	heartsflowers_flower_switch	106.396	7	0.141	5.13e-20
2PL	heartsflowers_heart_switch	103.505	7	0.139	2.04e-19
2PL	heartsflowers_heart_stay	61.792	7	0.105	6.62e-11
2PL	flowers_flower_start	40.321	8	0.075	2.79e-06
2PL	hearts_heart_start	39.770	8	0.075	3.54e-06
2PL	heartsflowers_flower_stay	23.203	7	0.057	1.57e-03
Rasch	heartsflowers_flower_start	274.931	7	0.232	1.35e-55
Rasch	hearts_heart_stay	169.911	8	0.168	1.35e-32
Rasch	flowers_flower_stay	149.182	8	0.157	2.90e-28
Rasch	heartsflowers_flower_switch	107.743	7	0.142	2.70e-20
Rasch	heartsflowers_heart_switch	74.289	7	0.116	2.00e-13
Rasch	heartsflowers_heart_stay	61.707	7	0.105	6.88e-11
Rasch	flowers_flower_start	39.317	8	0.074	4.29e-06
Rasch	hearts_heart_start	37.060	8	0.071	1.12e-05
Rasch	heartsflowers_flower_stay	34.717	7	0.075	1.26e-05

Item-level fit statistics are presented in Table 5. Item fit was evaluated using the S-X2 statistic for the PLUS Fall sample (N = 717). All items showed significant misfit (all p < .01, and p < .001 for every item except heartsflowers_flower_stay under the 2PL model), with RMSEA values ranging from 0.071 to 0.232 for Rasch and from 0.057 to 0.220 for 2PL. Most items showed moderate misfit (RMSEA = 0.07–0.17 for Rasch and 0.06–0.18 for 2PL), but heartsflowers_flower_start stood out as the worst-fitting item (RMSEA = 0.232 for Rasch and 0.220 for 2PL). This pattern is consistent with the 750ms response window applied during PLUS data collection. Mixed block items require task-switching, which is cognitively demanding and inherently slower, meaning that correct responses are more likely to exceed the 750ms cutoff and be scored as incorrect. The first trial of the mixed block is particularly affected: regardless of ability, children tend to respond slowly on the initial switch into the mixed context. The Levante model, calibrated with a 2000ms window, expects this item to be easier than PLUS children’s scores suggest, producing the severe misfit observed. By contrast, hearts and flowers block items, where median RTs fall well below 750ms, are less affected by the cutoff, and the start items in these blocks show comparatively better fit (RMSEA = 0.071–0.075). The stay items in these blocks do not follow this pattern: hearts_heart_stay and flowers_flower_stay show RMSEA values between 0.157 and 0.185.
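For readers who want to relate the RMSEA column of Table 5 to the S-X2 statistic and its degrees of freedom, the sketch below uses one common definition of RMSEA. Whether the pipeline uses exactly this form is an assumption; applied to the rows of Table 5 with N = 717, it agrees with the reported values to within about 0.001.

```python
import math


def rmsea_from_chi2(chi2: float, df: int, n: int) -> float:
    """RMSEA = sqrt(max(chi2 - df, 0) / (df * (n - 1)))  (assumed form)."""
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))


N_PLUS_FALL = 717

# Two 2PL rows from Table 5: (S-X2, df, reported RMSEA).
rows = {
    "heartsflowers_flower_start": (285.090, 8, 0.220),
    "heartsflowers_flower_stay": (23.203, 7, 0.057),
}
for item, (chi2, df, reported) in rows.items():
    computed = rmsea_from_chi2(chi2, df, N_PLUS_FALL)
    print(f"{item}: computed {computed:.3f}, reported {reported:.3f}")
```

Because the statistic is scaled by both df and N, an item can have a very small p-value in a sample of this size while still showing a comparatively modest RMSEA, as heartsflowers_flower_stay does.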

Together, the low reliability and level of item misfit suggest the Levante model does not generalize well to the PLUS assessment context. The misfit pattern is interpretable as a measurement artefact of the stricter response window rather than a fundamental failure of the model.

6 What happens when we re-estimate item parameters from PLUS?

Table 6: Re-estimated PLUS Fall item difficulty and discrimination parameters compared to original LEVANTE values. Positive difference indicates the item was estimated as more difficult / more discriminating in PLUS than in LEVANTE; negative difference indicates less difficult / less discriminating.

Item	Model	Parameter	Re-estimated value (PLUS Fall + LEVANTE)	LEVANTE value	Difference (PLUS – LEVANTE)
hearts_heart_start	Rasch	difficulty	-3.083	-2.726	-0.357
hearts_heart_stay	Rasch	difficulty	-2.872	-2.618	-0.254
flowers_flower_start	Rasch	difficulty	-1.972	-1.901	-0.071
flowers_flower_stay	Rasch	difficulty	-2.117	-2.071	-0.046
heartsflowers_heart_stay	Rasch	difficulty	-1.226	-1.208	-0.018
heartsflowers_flower_stay	Rasch	difficulty	-1.159	-1.226	0.067
heartsflowers_heart_switch	Rasch	difficulty	-0.525	-0.634	0.109
heartsflowers_flower_switch	Rasch	difficulty	-0.709	-0.819	0.111
heartsflowers_flower_start	Rasch	difficulty	-0.408	-1.383	0.975
hearts_heart_start	2PL	difficulty	-1.697	-1.478	-0.218
hearts_heart_stay	2PL	difficulty	-1.566	-1.388	-0.177
flowers_flower_stay	2PL	difficulty	-1.030	-0.980	-0.050
flowers_flower_start	2PL	difficulty	-1.006	-0.983	-0.023
heartsflowers_heart_stay	2PL	difficulty	-0.644	-0.641	-0.003
heartsflowers_flower_stay	2PL	difficulty	-0.538	-0.589	0.051
heartsflowers_flower_switch	2PL	difficulty	-0.345	-0.435	0.090
heartsflowers_heart_switch	2PL	difficulty	-0.306	-0.410	0.104
heartsflowers_flower_start	2PL	difficulty	-0.211	-0.835	0.624
flowers_flower_stay	2PL	discrimination	2.116	2.253	-0.137
hearts_heart_stay	2PL	discrimination	1.734	1.842	-0.108
hearts_heart_start	2PL	discrimination	1.678	1.758	-0.080
heartsflowers_heart_stay	2PL	discrimination	1.890	1.896	-0.006
flowers_flower_start	2PL	discrimination	1.962	1.941	0.021
heartsflowers_flower_stay	2PL	discrimination	2.257	2.162	0.094
heartsflowers_heart_switch	2PL	discrimination	1.787	1.623	0.164
heartsflowers_flower_switch	2PL	discrimination	2.084	1.919	0.165
heartsflowers_flower_start	2PL	discrimination	2.442	1.583	0.859

Figure 4: Comparison of re-estimated PLUS Fall item difficulty and discrimination parameters against original LEVANTE parameters. Points falling above the diagonal indicate items estimated as more difficult / more discriminating in PLUS than in LEVANTE; points below indicate items estimated as less difficult / less discriminating. Colour indicates block type.

Item difficulty parameters were re-estimated freely from the PLUS Fall sample (N = 717) combined with the LEVANTE pilot sample using the LEVANTE pipeline, fitting Rasch and 2PL models with scalar invariance across the three LEVANTE sites and PLUS Fall. Parameters were estimated on the 9 items present in all four groups (i.e., excluding heartsflowers_heart_start, which was absent from PLUS by design). Figure 4 shows the re-estimated PLUS parameters plotted against the original LEVANTE values.

Hearts block and flowers block items were estimated as less difficult in PLUS than in LEVANTE (i.e. negative difference; range: -0.36 to -0.05 for Rasch, -0.22 to -0.02 for 2PL), while mixed block items were estimated as more difficult (i.e. positive difference; range: 0.07 to 0.98 for Rasch, 0.05 to 0.62 for 2PL), except for heartsflowers_heart_stay, whose difficulty was essentially unchanged (-0.018 for Rasch, -0.003 for 2PL). This is consistent with the differential impact of the 750ms response window across blocks: hearts and flowers trials are fast enough that the cutoff rarely affects scoring, whereas mixed block trials, which require task-switching and are inherently slower, are more often scored as incorrect under the cutoff and so appear harder than they would under a more lenient time limit.

The largest shift was for heartsflowers_flower_start (LEVANTE d = -1.38 for Rasch, -0.84 for 2PL; PLUS d = -0.41 for Rasch, -0.21 for 2PL; d_diff = 0.98 for Rasch, 0.62 for 2PL), which was estimated as substantially harder in PLUS than in LEVANTE. This is consistent with the severe item misfit observed in Section 5.1 and reinforces the interpretation that this item is disproportionately affected by the strict response window.
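The shift values discussed above are simple differences between the re-estimated and original parameters. The sketch below recomputes them from the rounded Rasch difficulties in Table 6; variable names are illustrative. Because it starts from rounded inputs, individual differences can deviate from the tabled ones by about 0.001.

```python
# Rasch difficulties from Table 6: item -> (re-estimated, original LEVANTE).
rasch_difficulty = {
    "hearts_heart_start": (-3.083, -2.726),
    "hearts_heart_stay": (-2.872, -2.618),
    "flowers_flower_start": (-1.972, -1.901),
    "flowers_flower_stay": (-2.117, -2.071),
    "heartsflowers_heart_stay": (-1.226, -1.208),
    "heartsflowers_flower_stay": (-1.159, -1.226),
    "heartsflowers_heart_switch": (-0.525, -0.634),
    "heartsflowers_flower_switch": (-0.709, -0.819),
    "heartsflowers_flower_start": (-0.408, -1.383),
}

# Positive shift: estimated as more difficult in PLUS than in LEVANTE.
shifts = {item: new - old for item, (new, old) in rasch_difficulty.items()}
largest = max(shifts, key=lambda item: abs(shifts[item]))

for item, shift in sorted(shifts.items(), key=lambda kv: kv[1]):
    print(f"{item:30s} {shift:+.3f}")
print("largest absolute shift:", largest)
```

Sorting by signed shift makes the block pattern visible at a glance: hearts and flowers block items sit at the negative end and mixed block items at the positive end.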

7 How do RTs change with age across datasets?

Table 7: Descriptive statistics for RT and accuracy by dataset and block. RT values are in milliseconds.

Dataset	Block	N Trials	Acc Mean	Acc SD	RT Mean	RT Median	RT SD	RT Min	RT Max	Age M (SD)
Levante (Germany)	flowers test	8,665	0.83	0.38	1,391.29	1,113.00	1,081.99	100	21,110	9.04 (2.21)
	hearts test	6,504	0.86	0.35	1,287.59	1,010.50	1,075.58	114	19,763	9.04 (2.21)
	mixed test	16,975	0.74	0.44	1,630.59	1,398.00	986.29	113	23,444	9.08 (2.2)
Levante (Colombia)	flowers test	12,463	0.88	0.33	948.14	770.00	944.63	100	25,905	9.18 (2.11)
	hearts test	9,357	0.94	0.25	781.68	641.00	810.41	100	24,668	9.18 (2.11)
	mixed test	24,485	0.78	0.42	1,330.87	1,040.00	1,456.32	102	29,341	9.2 (2.1)
Levante (Canada)	flowers test	3,823	0.89	0.32	1,154.48	990.00	841.30	101	15,952	8.8 (2.13)
	hearts test	2,863	0.91	0.28	1,081.74	865.00	1,167.90	115	25,603	8.81 (2.13)
	mixed test	7,577	0.81	0.39	1,480.23	1,262.00	1,185.16	119	29,425	8.82 (2.13)
PLUS Fall	flowers test	8,534	0.83	0.37	577.72	572.00	121.06	105	773	9.88 (0.82)
	hearts test	8,562	0.95	0.22	508.53	493.00	109.50	100	763	9.88 (0.82)
	mixed test	23,110	0.45	0.50	671.91	725.00	104.73	100	776	9.89 (0.82)
PLUS Spring	flowers test	8,562	0.96	0.20	537.00	509.00	156.78	103	1,256	9.91 (0.82)
	hearts test	8,583	0.98	0.13	470.34	447.00	127.86	105	1,252	9.91 (0.82)
	mixed test	23,378	0.86	0.34	733.53	712.00	197.27	100	1,264	9.91 (0.82)
Note: PLUS Fall trials were scored incorrect if RT exceeded 750ms. PLUS Spring appears to have applied a longer cutoff of approximately 1250ms, which is understood to reflect an error in task administration.

Figure 5: RT distributions by block and dataset. X-axis capped at 2000ms, reflecting the analytic RT ceiling applied in Levante; the 750ms cutoff applied in PLUS is marked by a dashed vertical line. Levante distributions show the expected right skew; PLUS distributions are truncated at the cutoff.

Figure 6: GAM-smoothed RT as a function of age by block across datasets, with 95% CI. Y-axis capped at 2000ms to match the Levante response window; a small number of extreme trials above this threshold are excluded from the plot but not the model. Levante sites show the expected developmental decrease in RT across all three blocks. PLUS replicates this trend within the narrower 8–11 year range, though RTs are compressed by the 750ms response window (dashed line).

We examined RT profiles across datasets as a descriptive check on construct validity, asking whether the expected patterns — faster RTs with age and greater difficulty for incongruent and mixed trials — are present in PLUS as in Levante.

Figure 5 shows the RT distributions by block and dataset. Across all three Levante sites, distributions had the expected right-skewed shape, with median RTs ranging from 641ms (hearts, Colombia) to 1398ms (mixed, Germany). In contrast, PLUS distributions were narrow and sharply truncated at the 750ms response window, with median RTs of 493ms (hearts), 572ms (flowers), and 725ms (mixed) at Fall. The practical impact of this cutoff is substantial: across Levante sites, 51.2%, 65.2%, and 87.1% of hearts, flowers, and mixed block trials respectively exceeded 750ms and would have been scored as incorrect under the PLUS rule. The effect is largest for the mixed block, where task-switching demands produce inherently slower responses, and is reflected in RT standard deviations: Levante sites showed a mean SD of 1209ms on the mixed block versus 105ms in PLUS Fall.
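The two summary quantities used in this comparison can be sketched as below. The function name and the toy RT values are illustrative, and averaging the three site-level SDs without weighting is an assumption, although it does reproduce the 1209ms figure from the Table 7 values.

```python
from statistics import mean

CUTOFF_MS = 750  # PLUS response window


def share_over_cutoff(rts_ms, cutoff_ms=CUTOFF_MS):
    """Proportion of trials with RT strictly above the cutoff."""
    rts = list(rts_ms)
    return sum(rt > cutoff_ms for rt in rts) / len(rts)


# Toy RTs (ms), for illustration only.
toy_share = share_over_cutoff([480, 620, 760, 910])

# Mixed-block RT SDs for the three Levante sites, from Table 7 (ms).
levante_mixed_sd = {"Germany": 986.29, "Colombia": 1456.32, "Canada": 1185.16}
mean_sd = mean(list(levante_mixed_sd.values()))  # unweighted mean (assumed)

print(toy_share)
print(round(mean_sd))
```

An unweighted mean gives each site equal influence regardless of its trial count; a trial-weighted mean would pull the figure towards Colombia, which contributes the most mixed block trials.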

Despite this compression, PLUS preserved the expected ordering of block difficulty. As shown in Figure 6, RTs were fastest for the hearts block and slowest for the mixed block in both datasets, consistent with the greater executive demands of task-switching. The developmental decrease in RT with age was clearly visible across all three Levante sites, with RTs declining steeply from age 5 to approximately age 10 before levelling off. PLUS replicated the direction of this trend within its narrower 8–11 year range, though the absolute RT level and developmental slope were both compressed by the response window, with all three block lines falling below the 750ms cutoff. Together these patterns suggest that while the 750ms window substantially alters the RT distribution in PLUS, the ordinal structure of the task — hearts easier than flowers easier than mixed, and faster responding with age — is preserved.