Spark_h&f

Applying the Levante IRT Model to the PLUS and Kimochis datasets

The goal of this analysis is to evaluate the Levante Hearts and Flowers (H&F) IRT model’s performance on an external dataset collected by Spark Lab, as a test of the model’s generalizability. The Spark Lab data consists of two datasets: PLUS (3rd–5th graders, group-based tablet assessment) and Kimochis (preschoolers, 1:1 assessment).

Part 1: Sample Descriptives

Dataset	Timepoint	N	Age M (SD)	% Female	Retention
PLUS	PLUS Fall	719	9.89 (0.82)	48.8	1 tp: n = 170; 2 tp: n = 636
PLUS	PLUS Spring	723	9.92 (0.83)	48.1	1 tp: n = 170; 2 tp: n = 636
Kimochis	2021-2022 Fall	471	4.08 (0.52)	46.1	1 tp: n = 873; 2 tp: n = 633; 3 tp: n = 119; 4 tp: n = 45
	2021-2022 Spring	764	4.13 (0.60)	46.0
	2022-2023 Fall	599	4.31 (0.53)	50.9
	2022-2023 Spring	842	4.25 (0.60)	50.8

PLUS

The PLUS dataset includes 806 unique students from 8 schools across two timepoints (PLUS Fall and PLUS Spring). Sample sizes were stable across timepoints (Fall: n = 719, Spring: n = 723). Mean age was approximately 9.9 years (SD ≈ 0.8) at both timepoints, with students roughly evenly distributed across grades 3, 4, and 5. The sample was 48–49% female at both timepoints.

Kimochis

The Kimochis dataset includes 1,670 unique children across four timepoints spanning two academic years (Fall and Spring of 2021-2022 and 2022-2023). Sample sizes ranged from 471 children at the first timepoint to 842 at the last. The sample was approximately evenly split by sex (46–51% female) across timepoints, with mean age ranging from 4.08 to 4.31 years. Children were enrolled in Pre-K and, from 2021-2022 Spring onwards, Transitional Kindergarten (TK).

Longitudinal retention was limited: the majority of children appeared at only one timepoint (n = 873), with 633 appearing at two, 119 at three, and 45 at all four timepoints.

For the majority of children (n = 1445), age did not vary across timepoints, suggesting it reflects approximate age at enrolment rather than age at time of assessment. Age values also appear to be rounded to the nearest month rather than recorded as a continuous measure.

Trial-level descriptive statistics by dataset and block

Dataset	Block	N Trials	Acc Mean	Acc SD	RT Mean	RT Median	RT SD	RT Min	RT Max	Age M (SD)
Kimochis	flowers test	36,478	0.75	0.43	1,234.40	1,148.00	530.77	200.00	2,984.00	4.26 (0.55)
Kimochis	hearts test	28,281	0.93	0.26	1,069.54	971.00	503.05	200.00	2,968.00	4.25 (0.56)
PLUS	flowers test	17,254	0.89	0.31	553.97	538.00	146.94	1.00	1,256.00	9.90 (0.83)
	hearts test	17,276	0.96	0.19	487.36	468.00	124.14	1.00	1,252.00	9.90 (0.83)
	mixed test	47,269	0.66	0.47	693.66	715.00	177.85	1.00	1,264.00	9.90 (0.83)

Part 2: Fixed Parameter Scoring

Data preparation and scoring - PLUS

Trial-level data from PLUS were filtered to test blocks only (hearts test, flowers test, mixed test), excluding practice trials. Accuracy was derived from the stimulus-response pairing: trials with a heart stimulus were scored correct if the response matched the stimulus side (congruent rule), and trials with a flower stimulus were scored correct if the response was on the opposite side (incongruent rule). Trials with no response (NA) were coded as incorrect. Children who scored below 50% on the hearts block were excluded, treating that block as an attention check. Among the children retained, 18% of trials in the flowers and mixed blocks had no recorded response and were therefore coded as incorrect. This yielded a final analytic sample of 805 children (1,429 observations across fall and spring timepoints).
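As a rough illustration of the scoring and exclusion rules above, the following Python sketch scores a handful of made-up trials. The field names (child_id, stim_side, response, and so on) and the data are assumptions for the example, not the actual PLUS schema or pipeline.

```python
from collections import defaultdict

# Hypothetical trial records; field names and values are illustrative only.
trials = [
    {"child_id": 1, "block": "hearts test", "stimulus": "heart", "stim_side": "left", "response": "left"},
    {"child_id": 1, "block": "hearts test", "stimulus": "heart", "stim_side": "right", "response": "right"},
    {"child_id": 1, "block": "flowers test", "stimulus": "flower", "stim_side": "left", "response": "right"},
    {"child_id": 1, "block": "flowers test", "stimulus": "flower", "stim_side": "right", "response": None},
    {"child_id": 2, "block": "hearts test", "stimulus": "heart", "stim_side": "left", "response": "right"},
    {"child_id": 2, "block": "hearts test", "stimulus": "heart", "stim_side": "right", "response": None},
    {"child_id": 2, "block": "flowers test", "stimulus": "flower", "stim_side": "left", "response": "right"},
    {"child_id": 2, "block": "flowers test", "stimulus": "flower", "stim_side": "right", "response": "left"},
]


def score_trial(stimulus, stim_side, response):
    """Return 1 for a correct response and 0 otherwise; no response counts as 0."""
    if response is None:
        return 0
    same_side = response == stim_side
    # Hearts: respond on the same side. Flowers: respond on the opposite side.
    return int(same_side) if stimulus == "heart" else int(not same_side)


for t in trials:
    t["correct"] = score_trial(t["stimulus"], t["stim_side"], t["response"])

# Attention check: keep children with at least 50% accuracy on the hearts block.
hearts_scores = defaultdict(list)
for t in trials:
    if t["block"] == "hearts test":
        hearts_scores[t["child_id"]].append(t["correct"])

kept_ids = {cid for cid, s in hearts_scores.items() if sum(s) / len(s) >= 0.5}
analytic = [t for t in trials if t["child_id"] in kept_ids]
```

In this toy data, child 2 fails the hearts check (0 of 2 correct) and is dropped, while child 1's unanswered flowers trial stays in the data as an incorrect response.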

Each trial was assigned an item label following the Levante naming convention, {block}_{stimulus}_{trial_type}-{trial_number}, where trial type was classified as start (first trial in the block), stay (same stimulus type as the previous trial), or switch (different stimulus type from the previous trial). Repeated trials of the same type (e.g., multiple heart stay trials) were collapsed into a single analytic unit: a child was scored as correct on a trial type if they answered more than half of those trials correctly, and incorrect otherwise. This gives one binary score per trial type per child to enter into the IRT model.

Item difficulty parameters were taken from the Levante H&F Rasch model, which was estimated under scalar invariance across sites; the parameters are therefore identical across all three Levante sites. PLUS covered 9 of the 10 analytic units used in Levante scoring. The missing unit (heartsflowers_heart_start) could not be scored because the PLUS mixed block always began with a flower stimulus (n = 1,439 start trials, all flower), whereas the Levante mixed block begins with a heart stimulus, so no child contributed a heartsflowers_heart_start trial.
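The start/stay/switch labelling and the majority-rule collapsing described above can be sketched as follows. This is a minimal illustration with invented data: the function names are hypothetical, and the trial-number suffix is omitted because the analytic units aggregate over it.

```python
from collections import defaultdict


def label_trials(block, stimuli):
    """Label each trial in one block as start, stay, or switch."""
    labels = []
    for i, stim in enumerate(stimuli):
        if i == 0:
            trial_type = "start"   # first trial in the block
        elif stim == stimuli[i - 1]:
            trial_type = "stay"    # same stimulus type as the previous trial
        else:
            trial_type = "switch"  # different stimulus type from the previous trial
        labels.append(f"{block}_{stim}_{trial_type}")
    return labels


def collapse(labels, correct):
    """One binary score per unit: 1 if more than half of its trials are correct."""
    by_unit = defaultdict(list)
    for label, c in zip(labels, correct):
        by_unit[label].append(c)
    return {unit: int(sum(cs) > len(cs) / 2) for unit, cs in by_unit.items()}


# Illustrative mixed-block sequence for one child (not real data).
stimuli = ["flower", "flower", "heart", "heart", "flower", "heart"]
correct = [1, 1, 0, 1, 0, 1]

labels = label_trials("heartsflowers", stimuli)
units = collapse(labels, correct)
```

Note that under a strict "more than half" rule, a child who gets exactly half of a unit's trials right (the two heart switch trials here) is scored as incorrect on that unit.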

Ability scores (θ) showed an increasing pattern across grades (3rd grade: M = 0.04, SD = 0.92; 4th grade: M = 0.14, SD = 0.82; 5th grade: M = 0.35, SD = 0.72), consistent with the expected developmental trajectory and supporting the construct validity of the scoring approach in this sample. Grade-level means are close to the Levante scale origin and slightly positive, suggesting that 3rd–5th graders in PLUS perform at a broadly similar level to the Levante sample.
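Fixed-parameter scoring means the item difficulties are held at their calibrated values and only θ is estimated for each child. The report does not state which estimator was used; the sketch below assumes expected a posteriori (EAP) scoring with a standard normal prior, and the difficulty values are invented for illustration rather than taken from the Levante model.

```python
import math


def rasch_p(theta, b):
    """Probability of a correct response under the Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def eap_score(responses, difficulties, n_points=81, lo=-4.0, hi=4.0):
    """EAP estimate of theta and its posterior SD with fixed item difficulties.

    responses: dict mapping item name to 0/1; unscored items are simply omitted.
    difficulties: dict mapping item name to its fixed difficulty b.
    """
    step = (hi - lo) / (n_points - 1)
    grid = [lo + i * step for i in range(n_points)]
    weights = []
    for theta in grid:
        log_w = -0.5 * theta * theta  # standard normal prior, up to a constant
        for item, x in responses.items():
            p = rasch_p(theta, difficulties[item])
            log_w += math.log(p if x == 1 else 1.0 - p)
        weights.append(math.exp(log_w))
    total = sum(weights)
    mean = sum(t * w for t, w in zip(grid, weights)) / total
    var = sum((t - mean) ** 2 * w for t, w in zip(grid, weights)) / total
    return mean, math.sqrt(var)


# Hypothetical difficulties for illustration only; not the Levante estimates.
difficulties = {
    "hearts_heart_stay": -2.0,
    "flowers_flower_stay": -1.0,
    "heartsflowers_heart_switch": 0.5,
}

all_correct = {item: 1 for item in difficulties}
one_correct = {"hearts_heart_stay": 1, "flowers_flower_stay": 0,
               "heartsflowers_heart_switch": 0}

theta_hi, se_hi = eap_score(all_correct, difficulties)
theta_lo, se_lo = eap_score(one_correct, difficulties)
```

Because items a child did not receive are just left out of the likelihood, the same routine would handle a dataset that is missing an analytic unit, at the cost of a larger posterior SD.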

Item coverage across Levante, PLUS, and Kimochis
	Levante	PLUS	Kimochis
Total raw items	74	74	74
— Hearts block	12	12	12
— Flowers block	16	16	16
— Mixed block	46	46	46
Raw items matched to dataset	74	73	28
Reason(s) for non-overlap	—	Missing unit type: heartsflowers_heart_start	No mixed block (46 items); 8 flowers trials exceed Levante block length (stay-16 to stay-23)
IRT analytic units	10	9	4

Data preparation and scoring - Kimochis

Trial-level data from Kimochis were filtered to test blocks only (hearts test, flowers test), excluding practice trials. The Kimochis dataset did not include a mixed block, as the children were considered too young for the task-switching demands it requires. Accuracy was taken from the pre-computed “acc” column rather than derived from stimulus-response pairings. Timeouts were treated as missing rather than incorrect, consistent with the 1:1 assessment context in which no strict response time limit was imposed. Children who scored below 50% on the hearts block were excluded as an attention check. This yielded a final analytic sample of 1,622 children (2,540 observations) across two academic year cohorts (2021–2022 and 2022–2023), each with fall and spring timepoints.

Each trial was assigned an item label following the Levante naming convention. Of the 74 items in the Levante H&F model, 28 were present in Kimochis, all from the hearts and flowers blocks. All 46 mixed block items were absent by design. The 8 additional flowers trials in Kimochis (stay-16 through stay-23) exceeded the length of the Levante flowers block and were excluded from scoring. Repeated trials of the same block-stimulus-trial_type combination were aggregated into a single analytic unit: a child was scored as correct on that trial type if they answered more than half of those trials correctly, and incorrect otherwise. This yielded 4 IRT items (flowers_flower_start, flowers_flower_stay, hearts_heart_start, hearts_heart_stay).

Ability scores (θ) showed an increasing pattern across age groups (2–3 years: M = -0.23, SD = 0.56; 3–4 years: M = -0.06, SD = 0.51; 4–5 years: M = 0.06, SD = 0.47) consistent with the expected developmental trajectory.

The means for the 2–3 and 3–4 year groups are negative relative to the Levante scale origin, and the 4–5 year mean (0.06) sits only marginally above it. Low scores are expected given that Kimochis children (ages 2–5) fall below the Levante calibration age range of 5–12. The θ estimates should also be read with caution, because they are based on only 4 IRT items from the hearts and flowers blocks, with no mixed block items contributing to scoring.

Comparing reliability estimates with Levante

Sample	Timepoint	N	Reliability
Levante (Germany)			0.54
Levante (Colombia)			0.66
Levante (Canada)			0.59
PLUS	Pooled	1,429	0.30
	PLUS Fall	709	0.29
	PLUS Spring	720	-4.37
Kimochis	Pooled	2,540	-2.08
	2021-2022 Fall	433	-1.73
	2021-2022 Spring	730	-2.06
	2022-2023 Fall	570	-1.96
	2022-2023 Spring	807	-2.59

Item-level proportion correct by timepoint (PLUS)
Item	PLUS Fall	PLUS Spring
flowers_flower_start	0.766	0.953
flowers_flower_stay	0.924	0.994
hearts_heart_start	0.922	0.989
hearts_heart_stay	0.997	1.000
heartsflowers_flower_start	0.380	0.875
heartsflowers_flower_stay	0.592	0.992
heartsflowers_flower_switch	0.394	0.975
heartsflowers_heart_stay	0.705	0.974
heartsflowers_heart_switch	0.347	0.958

Reliability estimates are reported as marginal reliability, computed as 1 minus the ratio of mean error variance to total score variance. Levante reliability ranged from 0.54 to 0.66 across the three sites, reflecting moderate reliability.
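The marginal reliability formula stated above can be written out directly. The sketch below is an assumption-laden illustration: it takes each child's θ estimate and standard error as inputs, uses the sample variance of the θ estimates as the "total score variance", and runs on invented numbers rather than the PLUS or Kimochis scores.

```python
from statistics import mean, variance


def marginal_reliability(thetas, standard_errors):
    """1 minus the ratio of mean error variance to the variance of the theta estimates."""
    error_var = mean([se ** 2 for se in standard_errors])
    total_var = variance(thetas)  # sample variance; an assumption of this sketch
    return 1.0 - error_var / total_var


# Scores with reasonable spread relative to their standard errors.
spread = marginal_reliability([-1.0, 0.0, 1.0, 2.0], [0.5, 0.5, 0.5, 0.5])

# Scores bunched together (as at ceiling) with the same standard errors.
bunched = marginal_reliability([1.9, 2.0, 2.1, 2.0], [0.5, 0.5, 0.5, 0.5])
```

With the same error variance, the first call gives a value of about 0.85 and the second a large negative value, because the denominator shrinks towards zero when nearly everyone receives the same score.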

For PLUS Fall, reliability was low (r = 0.289), well below the Levante range, and the PLUS Spring estimate was severely negative (r = -4.366). Negative reliability occurs when the mean error variance exceeds the total score variance, which happens when scores show very little spread. This is consistent with the pronounced ceiling effect observed in PLUS Spring, where item-level proportion correct ranged from 0.875 to 1.000. When nearly all children answer nearly all items correctly, the items no longer meaningfully differentiate between individuals, and the reliability estimate becomes unstable.

Kimochis reliability estimates were consistently negative across all timepoints (range: -2.588 to -1.727). These values should not be interpreted as evidence of poor measurement quality per se; rather, they reflect two fundamental limitations of applying this scoring approach to the Kimochis sample. First, scoring is based on only 4 IRT items, compared with 9 in PLUS and 10 in Levante, which provides very limited information per child. Second, the Levante item parameters were calibrated on children aged 5–12, whereas Kimochis children are aged 2–5. The fixed difficulty parameters are therefore misaligned with the ability range of this sample, meaning the items do not function as intended for this age group.

Distribution of ability estimates (θ) by timepoint in PLUS, illustrating the ceiling effect in Spring

Item fit

Item-level fit statistics for PLUS Fall (S-X2 test)
Item	S-X2	df	RMSEA	p
heartsflowers_flower_start	426.849	7	0.291	4.14e-88
flowers_flower_stay	142.359	7	0.165	1.63e-27
hearts_heart_stay	131.611	8	0.148	1.31e-24
heartsflowers_flower_switch	106.574	7	0.142	4.71e-20
heartsflowers_heart_switch	95.924	7	0.134	7.48e-18
heartsflowers_heart_stay	72.363	7	0.115	4.91e-13
flowers_flower_start	48.528	7	0.092	2.81e-08
hearts_heart_start	49.907	8	0.086	4.26e-08
heartsflowers_flower_stay	42.718	7	0.085	3.78e-07

Item fit was evaluated using the S-X2 statistic for the PLUS Fall sample (N = 709). All items showed significant misfit (all p < .001), with RMSEA values ranging from 0.085 to 0.291. The worst-fitting item was heartsflowers_flower_start, which showed substantially higher misfit than the remaining items.
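For readers who want to see how the RMSEA column relates to the S-X2 statistics, the sketch below applies one common definition, RMSEA = sqrt(max(0, (X2 - df) / (df (N - 1)))), to the values in the table with N = 709. That this is the definition used in the analysis is an assumption; it does, however, agree with the reported RMSEA values to three decimals.

```python
import math


def sx2_rmsea(sx2, df, n):
    """RMSEA from a chi-square-type item fit statistic, its df, and the sample size."""
    return math.sqrt(max(0.0, (sx2 - df) / (df * (n - 1))))


N = 709  # PLUS Fall sample size reported in the text

# (item, S-X2, df, reported RMSEA) copied from the item fit table.
rows = [
    ("heartsflowers_flower_start", 426.849, 7, 0.291),
    ("flowers_flower_stay", 142.359, 7, 0.165),
    ("hearts_heart_stay", 131.611, 8, 0.148),
    ("heartsflowers_flower_switch", 106.574, 7, 0.142),
    ("heartsflowers_heart_switch", 95.924, 7, 0.134),
    ("heartsflowers_heart_stay", 72.363, 7, 0.115),
    ("flowers_flower_start", 48.528, 7, 0.092),
    ("hearts_heart_start", 49.907, 8, 0.086),
    ("heartsflowers_flower_stay", 42.718, 7, 0.085),
]

computed = {item: sx2_rmsea(sx2, df, N) for item, sx2, df, _ in rows}
max_gap = max(abs(computed[item] - reported) for item, _, _, reported in rows)
```

The max(0, ...) term simply floors the index at zero for items whose S-X2 falls below its degrees of freedom, which does not occur for any item here.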