| Dataset | Timepoint | N | Age M (SD) | % Female | Retention |
|---|---|---|---|---|---|
| PLUS | PLUS Fall | 719 | 9.89 (0.82) | 48.8 | 1 tp: n = 170 |
| | PLUS Spring | 723 | 9.92 (0.83) | 48.1 | |
| Kimochis | 2021-2022 Fall | 471 | 4.08 (0.52) | 46.1 | 1 tp: n = 873 |
| | 2021-2022 Spring | 764 | 4.13 (0.6) | 46.0 | |
| | 2022-2023 Fall | 599 | 4.31 (0.53) | 50.9 | |
| | 2022-2023 Spring | 842 | 4.25 (0.6) | 50.8 | |
Spark_h&f
Applying the Levante IRT Model to the PLUS and Kimochis datasets
The goal of this analysis is to evaluate the Levante Hearts and Flowers (H&F) IRT model’s performance on an external dataset collected by Spark Lab, as a test of the model’s generalizability. The Spark Lab data consists of two datasets: PLUS (3rd–5th graders, group-based tablet assessment) and Kimochis (preschoolers, 1:1 assessment).
Part 1: Sample Descriptives
PLUS
The PLUS dataset includes 806 unique students from 8 schools, assessed at two timepoints (PLUS Fall and PLUS Spring). Sample sizes were stable across timepoints (Fall: n = 719, Spring: n = 723). Mean age was approximately 9.9 years (SD ≈ 0.82) at both timepoints, with students roughly evenly distributed across grades 3, 4, and 5. The sample was approximately 48–49% female at both timepoints.
Kimochis
The Kimochis dataset includes 1670 unique children across four timepoints spanning two academic years (2021-2022 and 2022-2023, Fall and Spring). Sample sizes ranged from 471 children at the first timepoint to 842 at the last. The sample was approximately evenly split by sex (~46-51% female) across timepoints, with mean age ranging from 4.08 to 4.31 years. Children were enrolled in Pre-K and, from 2021-2022 Spring onwards, Transitional Kindergarten (TK).
Longitudinal retention was limited: the majority of children appeared at only one timepoint (n = 873), with 633 appearing at two, 119 at three, and 45 at all four timepoints.
For the majority of children (n = 1445), age did not vary across timepoints, suggesting it reflects approximate age at enrolment rather than age at time of assessment. Age values also appear to be rounded to the nearest month rather than recorded as a continuous measure.
Trial-level descriptive statistics by dataset and block
| Dataset | Block | N Trials | Acc Mean | Acc SD | RT Mean | RT Median | RT SD | RT Min | RT Max | Age M (SD) |
|---|---|---|---|---|---|---|---|---|---|---|
| Kimochis | flowers test | 36,478 | 0.75 | 0.43 | 1,234.40 | 1,148.00 | 530.77 | 200.00 | 2,984.00 | 4.26 (0.55) |
| | hearts test | 28,281 | 0.93 | 0.26 | 1,069.54 | 971.00 | 503.05 | 200.00 | 2,968.00 | 4.25 (0.56) |
| PLUS | flowers test | 17,254 | 0.89 | 0.31 | 553.97 | 538.00 | 146.94 | 1.00 | 1,256.00 | 9.9 (0.83) |
| | hearts test | 17,276 | 0.96 | 0.19 | 487.36 | 468.00 | 124.14 | 1.00 | 1,252.00 | 9.9 (0.83) |
| | mixed test | 47,269 | 0.66 | 0.47 | 693.66 | 715.00 | 177.85 | 1.00 | 1,264.00 | 9.9 (0.83) |
Part 2: Fixed Parameter Scoring
Data preparation and scoring - PLUS
Trial-level data from PLUS were filtered to test blocks only (hearts test, flowers test, mixed test), excluding practice trials. Accuracy was derived from the stimulus-response pairing: trials with a heart stimulus were scored correct if the response matched the stimulus side (congruent rule), and trials with a flower stimulus were scored correct if the response was on the opposite side (incongruent rule). Trials with no response (NA) were coded as incorrect. Children were excluded if they scored below 50% on the hearts block, treating this as an attention check. Among children retained after the attention check, 18% of trials in the flowers and mixed blocks had no response recorded and were coded as incorrect. This yielded a final analytic sample of 805 children (1,429 observations across fall and spring timepoints).
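The accuracy rule and attention check described above can be sketched as follows. This is a minimal illustration rather than the analysis code: the field names (`block`, `stimulus`, `stimulus_side`, `response_side`) are hypothetical, and a missing response is represented as `None`.

```python
def score_trial(stimulus, stimulus_side, response_side):
    """Return 1 for a correct response and 0 otherwise."""
    if response_side is None:
        # No response recorded: coded as incorrect (the PLUS rule).
        return 0
    if stimulus == "heart":
        # Congruent rule: respond on the same side as the stimulus.
        return int(response_side == stimulus_side)
    if stimulus == "flower":
        # Incongruent rule: respond on the opposite side.
        return int(response_side != stimulus_side)
    raise ValueError(f"unknown stimulus: {stimulus!r}")


def hearts_accuracy(trials):
    """Proportion correct on the hearts test block, or None if there are no such trials."""
    hearts = [t for t in trials if t["block"] == "hearts test"]
    if not hearts:
        return None
    scores = [
        score_trial(t["stimulus"], t["stimulus_side"], t["response_side"])
        for t in hearts
    ]
    return sum(scores) / len(scores)


def passes_attention_check(trials, threshold=0.5):
    """A child is excluded only if hearts accuracy falls below the threshold."""
    acc = hearts_accuracy(trials)
    return acc is not None and acc >= threshold


# Hypothetical trials for one child.
example_child = [
    {"block": "hearts test", "stimulus": "heart", "stimulus_side": "left", "response_side": "left"},
    {"block": "hearts test", "stimulus": "heart", "stimulus_side": "right", "response_side": None},
    {"block": "flowers test", "stimulus": "flower", "stimulus_side": "left", "response_side": "right"},
]
child_is_retained = passes_attention_check(example_child)
```

Because the exclusion rule is "below 50%", a child at exactly 50% on the hearts block is retained in this sketch.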
Each trial was assigned an item label following the Levante naming convention, {block}_{stimulus}_{trial_type}-{trial_number}, where trial type was classified as start (first trial in block), stay (same stimulus type as the previous trial), or switch (different stimulus type from the previous trial). Repeated trials of the same type (e.g., multiple heart stay trials) were collapsed into a single score per trial type: a child was scored as correct on that trial type if they got more than half of those trials right, and incorrect otherwise. This gave one binary score per trial type per child to enter into the IRT model. Item difficulty parameters were taken from the Levante H&F Rasch model, which was estimated under scalar invariance across sites; parameters are therefore identical across all three Levante sites.
Collapsing yielded 9 IRT items in PLUS out of the 10 analytic units used in Levante scoring. The missing unit (heartsflowers_heart_start) was absent because the PLUS mixed block always began with a flower stimulus (n = 1439 start trials, all flower), whereas the Levante mixed block begins with a heart stimulus. As a result, no child contributed a heartsflowers_heart_start trial and this analytic unit could not be scored.
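The labelling and majority-rule collapsing described above can be illustrated with a short sketch. The function names and the example sequence are hypothetical, and it is an assumption here that a switch or stay unit is named after the stimulus of the current trial.

```python
from collections import defaultdict


def label_trial_types(stimuli):
    """Classify each trial in one block as start, stay, or switch."""
    labels = []
    for i, stim in enumerate(stimuli):
        if i == 0:
            labels.append("start")       # first trial in the block
        elif stim == stimuli[i - 1]:
            labels.append("stay")        # same stimulus type as previous trial
        else:
            labels.append("switch")      # different stimulus type from previous trial
    return labels


def collapse_to_units(block, stimuli, correct):
    """Return one binary score per {block}_{stimulus}_{trial_type} unit.

    A unit is scored 1 only if strictly more than half of its trials are
    correct, so a tie is scored 0.
    """
    grouped = defaultdict(list)
    for stim, trial_type, acc in zip(stimuli, label_trial_types(stimuli), correct):
        grouped[f"{block}_{stim}_{trial_type}"].append(acc)
    return {unit: int(sum(accs) > len(accs) / 2) for unit, accs in grouped.items()}


# Hypothetical mixed block that begins with a flower, as in PLUS.
stimuli = ["flower", "flower", "heart", "heart", "flower", "heart"]
correct = [1, 1, 0, 1, 0, 1]
units = collapse_to_units("heartsflowers", stimuli, correct)
```

In this example the two heart switch trials split one correct and one incorrect, so `heartsflowers_heart_switch` is scored 0 under the strict majority rule. No `heartsflowers_heart_start` unit is produced, because the block opens with a flower.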
Ability scores (θ) showed an increasing pattern across grades (3rd grade: M = 0.04, SD = 0.92; 4th grade: M = 0.14, SD = 0.82; 5th grade: M = 0.35, SD = 0.72), consistent with the expected developmental trajectory and supporting the construct validity of the scoring approach in this sample. Grade-level means are close to the Levante scale origin and slightly positive, suggesting that 3rd–5th graders in PLUS perform at a broadly similar level to the Levante sample.
| | Levante | PLUS | Kimochis |
|---|---|---|---|
| Total raw items | 74 | 74 | 74 |
| - Hearts block | 12 | 12 | 12 |
| - Flowers block | 16 | 16 | 16 |
| - Mixed block | 46 | 46 | 46 |
| Raw items matched to dataset | 74 | 73 | 28 |
| Reason(s) for non-overlap | None | Missing unit type: heartsflowers_heart_start | No mixed block (46 items); 8 flowers trials exceed Levante block length (stay-16 to stay-23) |
| IRT analytic units | 10 | 9 | 4 |
Data preparation and scoring - Kimochis
Trial-level data from Kimochis were filtered to test blocks only (hearts test, flowers test), excluding practice trials. The Kimochis dataset did not include a mixed block, as the children were considered too young for the task-switching demands it requires. Accuracy was taken from the pre-computed “acc” column rather than derived from stimulus-response pairings. Timeouts were treated as missing rather than incorrect, consistent with the 1:1 assessment context in which no strict response time limit was imposed. Children were excluded if they scored below 50% on the hearts block as an attention check. This yielded a final analytic sample of 1622 children (2,540 observations) across two academic year cohorts (2021–2022 and 2022–2023), each with fall and spring timepoints.
Each trial was assigned an item label following the Levante naming convention. Of the 74 items in the Levante H&F model, 28 were present in Kimochis — all from the hearts and flowers blocks. The 8 additional flowers trials in Kimochis (stay-16 through stay-23) exceeded the length of the Levante flowers block and were excluded from scoring. All 46 mixed block items were absent by design. Repeated trials of the same block-stimulus-trial_type combination were aggregated into a single analytic unit: a child was scored as correct on that trial type if they got more than half of those trials right, and incorrect otherwise. This yielded 4 IRT items (flowers_flower_start, flowers_flower_stay, hearts_heart_start, hearts_heart_stay).
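To make the fixed-parameter scoring step concrete, the sketch below scores one child on the four Kimochis analytic units under a Rasch model with fixed difficulties. It rests on several assumptions that the text does not confirm: the estimator is taken to be EAP with a standard-normal prior on a fixed grid, and the difficulty values are placeholders, not the Levante estimates. Missing responses (e.g., timeouts) are skipped in the likelihood rather than scored as incorrect.

```python
import math


def rasch_p(theta, b):
    """Rasch probability of a correct response for ability theta and difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def eap_score(responses, difficulties, n_points=61, lo=-6.0, hi=6.0):
    """Return (EAP theta, posterior SD) under a standard-normal prior.

    responses maps item name to 1, 0, or None; None entries are treated
    as missing and contribute nothing to the likelihood.
    """
    step = (hi - lo) / (n_points - 1)
    grid = [lo + i * step for i in range(n_points)]
    weights = []
    for theta in grid:
        w = math.exp(-0.5 * theta * theta)  # unnormalised N(0, 1) prior
        for item, x in responses.items():
            if x is None:
                continue
            p = rasch_p(theta, difficulties[item])
            w *= p if x == 1 else 1.0 - p
        weights.append(w)
    total = sum(weights)
    mean = sum(t * w for t, w in zip(grid, weights)) / total
    var = sum((t - mean) ** 2 * w for t, w in zip(grid, weights)) / total
    return mean, math.sqrt(var)


# Placeholder difficulties (hypothetical, NOT the Levante parameters).
difficulties = {
    "hearts_heart_start": -2.0,
    "hearts_heart_stay": -2.5,
    "flowers_flower_start": -1.0,
    "flowers_flower_stay": -1.5,
}

all_correct = {item: 1 for item in difficulties}
one_timeout = dict(all_correct, flowers_flower_start=None)

theta_all, sd_all = eap_score(all_correct, difficulties)
theta_missing, sd_missing = eap_score(one_timeout, difficulties)
```

With only four binary units there are at most 16 distinct response patterns, so the resulting θ estimates can take only a handful of values and each carries a wide posterior.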
Ability scores (θ) showed an increasing pattern across age groups (2–3 years: M = -0.23, SD = 0.56; 3–4 years: M = -0.06, SD = 0.51; 4–5 years: M = 0.06, SD = 0.47) consistent with the expected developmental trajectory.
Means for the two younger age groups are negative relative to the Levante scale origin, and the mean for 4–5-year-olds is only marginally above it (M = 0.06). Low values are expected given that Kimochis children (ages 2–5) fall below the Levante calibration age range of 5–12. In addition, θ estimates rest on only 4 IRT items from the hearts and flowers blocks, with no mixed block items contributing to scoring.
Comparing reliability estimates with Levante
| Sample | Timepoint | N | Reliability |
|---|---|---|---|
| Levante (Germany) | | | 0.54 |
| Levante (Colombia) | | | 0.66 |
| Levante (Canada) | | | 0.59 |
| PLUS | Pooled | 1,429 | 0.30 |
| | PLUS Fall | 709 | 0.29 |
| | PLUS Spring | 720 | -4.37 |
| Kimochis | Pooled | 2,540 | -2.08 |
| | 2021-2022 Fall | 433 | -1.73 |
| | 2021-2022 Spring | 730 | -2.06 |
| | 2022-2023 Fall | 570 | -1.96 |
| | 2022-2023 Spring | 807 | -2.59 |
| Item | PLUS Fall | PLUS Spring |
|---|---|---|
| flowers_flower_start | 0.766 | 0.953 |
| flowers_flower_stay | 0.924 | 0.994 |
| hearts_heart_start | 0.922 | 0.989 |
| hearts_heart_stay | 0.997 | 1.000 |
| heartsflowers_flower_start | 0.380 | 0.875 |
| heartsflowers_flower_stay | 0.592 | 0.992 |
| heartsflowers_flower_switch | 0.394 | 0.975 |
| heartsflowers_heart_stay | 0.705 | 0.974 |
| heartsflowers_heart_switch | 0.347 | 0.958 |
Reliability estimates are reported as marginal reliability, computed as 1 minus the ratio of mean error variance to total score variance. Levante reliability ranged from 0.54 to 0.66 across the three sites, reflecting moderate reliability.
For PLUS Fall, reliability was positive but low (0.289), well below the Levante range of 0.54 to 0.66, and the PLUS Spring estimate was severely negative (-4.366). Under the marginal reliability formula, the estimate turns negative whenever mean error variance exceeds total score variance, which happens when scores show very little spread. This is consistent with the pronounced ceiling effect observed in PLUS Spring, where item-level proportion correct ranged from 0.875 to 1. When nearly all children answer nearly all items correctly, the items no longer meaningfully differentiate between individuals, and the reliability estimate becomes unstable.
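The marginal reliability definition, 1 minus the ratio of mean error variance to total score variance, can be written out directly. The sketch below assumes that "total score variance" is the sample variance of the θ estimates and that error variance is the squared standard error of each estimate. The input values are invented purely to show how a ceiling effect drives the estimate below zero.

```python
from statistics import mean, variance


def marginal_reliability(theta, se):
    """1 - mean(SE^2) / var(theta), using the sample variance of theta."""
    mean_error_variance = mean([s * s for s in se])
    return 1.0 - mean_error_variance / variance(theta)


# Hypothetical sample with reasonable spread in the theta estimates.
spread_theta = [-0.5, 0.0, 0.5, 1.0]
spread_se = [0.4, 0.4, 0.4, 0.4]
rel_spread = marginal_reliability(spread_theta, spread_se)

# Hypothetical ceiling sample: nearly identical theta estimates with large
# standard errors, as happens when almost every item is answered correctly.
ceiling_theta = [1.0, 1.1, 1.0, 1.1]
ceiling_se = [0.6, 0.6, 0.6, 0.6]
rel_ceiling = marginal_reliability(ceiling_theta, ceiling_se)
```

In the first case the ratio is below 1 and the estimate lies between 0 and 1. In the second the variance of θ is tiny relative to the error variance, so the ratio far exceeds 1 and the estimate is strongly negative.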
Kimochis reliability estimates were consistently negative across all timepoints (range: -2.588 to -1.727). These values should not be interpreted as evidence of poor measurement quality per se, but rather reflect two fundamental limitations of applying this scoring approach to the Kimochis sample. First, scoring is based on only 4 IRT items — far fewer than the 9–10 used in PLUS and Levante — providing very limited information per child. Second, the Levante item parameters were calibrated on children aged 5–12, whereas Kimochis children are aged 2–5. The fixed difficulty parameters are therefore misaligned with the ability range of this sample, meaning the items do not function as intended for this age group.
Check item fit
| Item | S_X2 | df.S_X2 | RMSEA | p |
|---|---|---|---|---|
| heartsflowers_flower_start | 426.849 | 7 | 0.291 | 4.14e-88 |
| flowers_flower_stay | 142.359 | 7 | 0.165 | 1.63e-27 |
| hearts_heart_stay | 131.611 | 8 | 0.148 | 1.31e-24 |
| heartsflowers_flower_switch | 106.574 | 7 | 0.142 | 4.71e-20 |
| heartsflowers_heart_switch | 95.924 | 7 | 0.134 | 7.48e-18 |
| heartsflowers_heart_stay | 72.363 | 7 | 0.115 | 4.91e-13 |
| flowers_flower_start | 48.528 | 7 | 0.092 | 2.81e-08 |
| hearts_heart_start | 49.907 | 8 | 0.086 | 4.26e-08 |
| heartsflowers_flower_stay | 42.718 | 7 | 0.085 | 3.78e-07 |
Item fit was evaluated using the S-X2 statistic for the PLUS Fall sample (N = 709). All items showed significant misfit (all p < .001), with RMSEA values ranging from 0.085 to 0.291. The worst-fitting item was heartsflowers_flower_start, which showed substantially higher misfit than the remaining items.
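The RMSEA column can be cross-checked against the S-X2 statistics. The sketch below assumes the usual form RMSEA = sqrt(max(X2 - df, 0) / (df × (N - 1))) with N = 709. The text does not state the formula, but this form agrees with the tabulated values to about three decimals.

```python
import math


def sx2_rmsea(x2, df, n):
    """RMSEA implied by an S-X2 statistic, its degrees of freedom, and sample size."""
    return math.sqrt(max(x2 - df, 0.0) / (df * (n - 1)))


N_PLUS_FALL = 709

# (S_X2, df, reported RMSEA) copied from the item fit table.
item_fit = {
    "heartsflowers_flower_start": (426.849, 7, 0.291),
    "flowers_flower_stay": (142.359, 7, 0.165),
    "hearts_heart_stay": (131.611, 8, 0.148),
    "hearts_heart_start": (49.907, 8, 0.086),
    "heartsflowers_flower_stay": (42.718, 7, 0.085),
}

recomputed = {
    item: sx2_rmsea(x2, df, N_PLUS_FALL)
    for item, (x2, df, _reported) in item_fit.items()
}
```

Since df is nearly constant across items (7 or 8), the RMSEA ordering largely tracks the S-X2 ordering. The one visible exception is that hearts_heart_start has a slightly larger S-X2 than flowers_flower_start but a smaller RMSEA, owing to its extra degree of freedom.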