In-Hee Choi, Sookmyung Women’s University, Seoul, South Korea (ichoi@sookmyung.ac.kr)
Background
- Self-report measures of teaching aptitude are vulnerable to social desirability bias and are limited in predicting actual classroom behavior (Lievens et al., 2008).
- Choi, Park, & Lee (2025) developed a scenario-based essay assessment with ChatGPT-4 automated scoring, but found a 4.5-point leniency gap (out of 24) and limited scoring precision.
- Ham et al. (2024) applied many-facet Rasch measurement (MFRM) to integrate GPT-4 scores with human ratings for science inquiry assessments and found severe model degradation: all human rater Infit values exceeded 1.5.
Purpose
This study develops a context-specific AI scoring system and evaluates its comparability with human ratings using MFRM, comparing three rater configurations:
- M1: Human raters only
- M2: Human + AI raters
- M3: Human + AI + Scenario facet
Participants: 535 pre-service teachers at a university in Seoul, in a between-subject design in which each respondent answered one of four scenarios:
| Scenario | Type | N |
|---|---|---|
| A1 (Phone rules) | School discipline | 135 |
| A2 (Slipper rules) | School discipline | 123 |
| B1 (Field trip complaint) | Parental complaint | 146 |
| B2 (Club activity complaint) | Parental complaint | 131 |
Scoring Criteria: 8 items across 3 domains (0–3 scale, max = 24)
- Problem-solving: Problem Definition, Fluency, Validity
- Judgment: Decision-making, Inclusiveness
- Planning: Resource Utilization, Specificity, Effectiveness
Human Scoring
- 16 expert raters in 8 pairs; 2 independent ratings per response
- Linking design: 16 common anchor items scored by all raters
AI Scoring
- Context-specific system built on the ChatGPT-4 API
- Enhanced rubrics with scoring examples and scenario context
- Each response scored twice, then analyzed both as two separate AI raters (AI₁, AI₂) and as a single averaged rater (AI_avg)
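The two ways of entering the repeated AI runs into the analysis can be sketched as follows. This is a minimal illustration, not the study's pipeline: the function names, the tuple layout, and the example scores are hypothetical, and the poster does not say how non-integer averages were handled before model estimation.

```python
# Hypothetical long-format layout: (person_id, rater_id, item, score).
ITEMS = [
    "Problem Definition", "Fluency", "Validity",
    "Decision-making", "Inclusiveness",
    "Resource Utilization", "Specificity", "Effectiveness",
]

def as_two_raters(person_id, run1, run2):
    """Treat each AI scoring run as its own rater (AI1, AI2)."""
    rows = []
    for rater, run in (("AI1", run1), ("AI2", run2)):
        for item, score in zip(ITEMS, run):
            rows.append((person_id, rater, item, score))
    return rows

def as_averaged_rater(person_id, run1, run2):
    """Collapse the two runs into one rater (AI_avg) by item-wise mean."""
    return [
        (person_id, "AI_avg", item, (a + b) / 2)
        for item, a, b in zip(ITEMS, run1, run2)
    ]

# Invented scores on the 0-3 scale for one response, scored twice.
run1 = [2, 3, 3, 2, 1, 2, 1, 2]
run2 = [3, 3, 3, 2, 2, 2, 1, 1]
two_rater_rows = as_two_raters("P001", run1, run2)
avg_rows = as_averaged_rater("P001", run1, run2)
```

The first layout doubles the number of AI rating records per response, while the second keeps one record per response and item.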
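Three-facet partial credit model:
\[\ln\frac{P(X_{nij}=k)}{P(X_{nij}=k-1)} = \theta_n - \delta_i - \alpha_j - \tau_{ik}\]
Here \(\theta_n\) is the ability of person \(n\), \(\delta_i\) the difficulty of item \(i\), \(\alpha_j\) the severity of rater \(j\), and \(\tau_{ik}\) the \(k\)-th step threshold of item \(i\).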
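To make the adjacent-category formulation concrete, the sketch below turns the log-odds equation into category probabilities for a single person-item-rater combination. It is an illustration under stated assumptions: `category_probs` is a hypothetical helper, and all parameter values in the example (including the thresholds) are invented for demonstration rather than taken from the fitted models.

```python
import math

def category_probs(theta, delta, alpha, taus):
    """Return [P(X=0), ..., P(X=K)] for one person, item, and rater.

    taus holds the item's step thresholds tau_1..tau_K, so that
    ln(P(X=k) / P(X=k-1)) = theta - delta - alpha - tau_k.
    """
    eta = theta - delta - alpha
    # Log-numerator of category k is the sum of (eta - tau_h) for h <= k;
    # category 0 has log-numerator 0.
    log_num = [0.0]
    for tau in taus:
        log_num.append(log_num[-1] + eta - tau)
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(log_num)
    exps = [math.exp(v - m) for v in log_num]
    total = sum(exps)
    return [e / total for e in exps]

# Illustrative values only (0-3 scale, so three thresholds).
theta, delta, alpha = 0.5, -0.5, -0.2
taus = [-1.0, 0.0, 1.0]
probs = category_probs(theta, delta, alpha, taus)
```

Raising rater severity `alpha` while holding everything else fixed shifts probability mass toward the lower score categories, which is how the model separates rater effects from person ability.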
| Model | Raters | Facets | Obs. |
|---|---|---|---|
| M1 | 16 human | Person × Rater × Item | 1,294 |
| M2a | 16 human + AI_avg | Person × Rater × Item | 1,829 |
| M2b | 16 human + AI₁ + AI₂ | Person × Rater × Item | 2,364 |
| M3 | 16 human + AI₁ + AI₂ | + Scenario facet | 2,364 |
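The observation counts in this table can be reproduced with simple arithmetic under one reading of the design. The poster does not spell this out, so the sketch assumes that the 16 anchors are responses drawn from the 535 and rated by all 16 human raters instead of 2, and that each AI rater contributes one rating unit per response.

```python
N_RESPONSES = 535                # pre-service teachers, one response each
HUMAN_RATINGS_PER_RESPONSE = 2   # each response scored by one rater pair
N_HUMAN_RATERS = 16
N_ANCHORS = 16                   # assumed: anchor responses rated by all raters

def observation_counts():
    """Rating units per model under the assumptions stated above."""
    # Each anchor gets 16 ratings instead of 2, i.e. 14 extra ratings.
    extra_anchor = N_ANCHORS * (N_HUMAN_RATERS - HUMAN_RATINGS_PER_RESPONSE)
    m1 = N_RESPONSES * HUMAN_RATINGS_PER_RESPONSE + extra_anchor
    m2a = m1 + N_RESPONSES        # one averaged AI rater
    m2b = m1 + 2 * N_RESPONSES    # two separate AI raters
    return {"M1": m1, "M2a": m2a, "M2b": m2b, "M3": m2b}

counts = observation_counts()
```

Under these assumptions the function returns 1,294, 1,829, 2,364, and 2,364, matching the Obs. column.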
[Figure 1: Rater Severity Bar Chart — M2b]
| | Ham et al. (2024) | This Study |
|---|---|---|
| AI severity | −0.52, −0.48 (most lenient) | AI₁ = −0.193, AI₂ = −0.217 |
| AI location | More lenient than ALL humans | Center of human distribution |
| AI₁ − AI₂ difference | — | 0.024 logit (high consistency) |
[Figure 2: Item Difficulty M1 vs M2 Scatterplot]
| Item | M1 | M2a (AI_avg) | M2b (AI×2) |
|---|---|---|---|
| Problem Def. | −0.565 | −0.595 | −0.779 |
| Fluency | −3.034 | −3.206 | −3.545 |
| Validity | −2.927 | −2.842 | −2.844 |
| Decision | −1.538 | −1.375 | −1.221 |
| Inclusiveness | −1.245 | −1.253 | −1.282 |
| Resource | −0.881 | −0.820 | −0.750 |
| Specificity | −0.780 | −0.574 | −0.363 |
| Effectiveness | −0.708 | −0.493 | −0.241 |
| Profile r with M1 | — | .994 | .985 |
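A profile correlation of this kind can be computed from the tabulated difficulties. The poster does not state which coefficient was used, so the sketch below assumes a plain Pearson correlation across the eight items; `pearson_r` is a hypothetical helper, and the example covers only the M1 and M2a columns.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Item difficulties in table order: Problem Def., Fluency, Validity,
# Decision, Inclusiveness, Resource, Specificity, Effectiveness.
m1 = [-0.565, -3.034, -2.927, -1.538, -1.245, -0.881, -0.780, -0.708]
m2a = [-0.595, -3.206, -2.842, -1.375, -1.253, -0.820, -0.574, -0.493]

r_m2a = pearson_r(m1, m2a)
```

A value close to 1 indicates that adding the AI rater leaves the relative ordering and spacing of item difficulties largely intact.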
Item-Level Fit (Infit Mean Square)
| Item | M1 | M2a (AI_avg) | M2b (AI×2) |
|---|---|---|---|
| Problem Def. | 1.136 | 1.279 | 1.399 |
| Fluency | 1.146 | 1.278 | 1.436 |
| Validity | 1.006 | 1.244 | 1.426 |
| Decision | 0.892 | 1.095 | 1.231 |
| Inclusiveness | 1.026 | 1.072 | 1.143 |
| Resource | 0.992 | 1.030 | 1.084 |
| Specificity | 0.889 | 0.966 | 1.052 |
| Effectiveness | 0.934 | 1.022 | 1.090 |
| Items > 1.5 | 0 | 0 | 0 |
→ All items within acceptable range (0.5–1.5) in all models
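For reference, the Infit mean square is the information-weighted ratio of squared residuals to model variance, and the 0.5–1.5 screen is a simple range check. The sketch below shows both; the helper names are hypothetical, and the expected scores and variances would come from the fitted model, which is not reproduced here.

```python
def infit_mnsq(observed, expected, variance):
    """Infit mean square: sum of squared residuals over sum of model variances."""
    num = sum((x - e) ** 2 for x, e in zip(observed, expected))
    den = sum(variance)
    return num / den

def flag_misfit(infit_by_name, low=0.5, high=1.5):
    """Return the names whose Infit falls outside [low, high]."""
    return [name for name, v in infit_by_name.items() if not (low <= v <= high)]

# Item-level Infit values reported for M2b (AI x 2).
m2b_item_infit = {
    "Problem Def.": 1.399, "Fluency": 1.436, "Validity": 1.426,
    "Decision": 1.231, "Inclusiveness": 1.143, "Resource": 1.084,
    "Specificity": 1.052, "Effectiveness": 1.090,
}
flagged_items = flag_misfit(m2b_item_infit)
```

Applied to the M2b column, the check flags no items, consistent with the "Items > 1.5" row.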
Rater-Level Fit (Infit Mean Square)
| | Ham et al. (2024) | M2a (AI_avg) | M2b (AI×2) |
|---|---|---|---|
| Human Infit range | 1.42–2.07 | 0.74–1.37 | 0.92–1.83 |
| Humans > 1.5 | ALL (8/8) | 0/16 | 3/16 (19%) |
| AI Infit | — | 0.66 (overfit) | AI₁=0.64, AI₂=0.63 |
→ Context-specific system preserves model fit substantially better than general-purpose GPT-4
| Index | M1 | M2a (AI_avg) | M2b (AI×2) | M3 (+Scenario) |
|---|---|---|---|---|
| Observations | 1,294 | 1,829 | 2,364 | 2,364 |
| Parameters | 40 | 41 | 42 | 45 |
| EAP Reliability | .784 | .830 | .774 | .772 |
| θ correlation with M1 | — | .973 | .997 | .993 |
| Item profile r with M1 | — | .994 | .985 | .985 |
| Indicator | Ham et al. (2024): general-purpose GPT-4 | This study: context-specific system |
|---|---|---|
| AI severity location | Most lenient of all | Center of human distribution |
| Item difficulty change | Large shifts | r = .985–.994 |
| Human rater Infit in M2 | All > 1.5 | 0/16 (M2a), 3/16 (M2b) |
| AI internal consistency | AI α > Human α | Human α > AI α |
| Total score difference | ~4.5 pts (all lenient) | 0.67 pts (bidirectional) |
System design matters more than AI capability. The context-specific system locates AI at the center of human severity, while general-purpose GPT-4 was more lenient than all humans.
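Averaging AI replications optimizes integration. Using AI_avg (item profile r = .994, 0/16 human raters with Infit > 1.5, EAP reliability .830) yields better stability than entering the two AI runs as separate raters (item profile r = .985, 3/16 human raters with Infit > 1.5, EAP reliability .774), suggesting that mean AI scores effectively reduce stochastic noise.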
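The noise-reduction argument can be made explicit with a short derivation. As an assumption not stated on the poster, suppose each AI run adds an error term \(e_1\) or \(e_2\) with common variance \(\sigma^2\) and between-run correlation \(\rho\). Then the error of the averaged score has variance

\[\operatorname{Var}\!\left(\frac{e_1 + e_2}{2}\right) = \frac{\sigma^2 + \sigma^2 + 2\rho\sigma^2}{4} = \frac{\sigma^2(1+\rho)}{2}\]

which is smaller than \(\sigma^2\) whenever \(\rho < 1\) and reaches \(\sigma^2/2\) when the two runs err independently.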
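AI shows differential behavior by item type. Relative to M1, adding the two AI raters (M2b) lowers the estimated difficulty of Fluency (−3.034 → −3.545) and Problem Definition (−0.565 → −0.779) but raises it for Decision-making (−1.538 → −1.221), Specificity (−0.780 → −0.363), and Effectiveness (−0.708 → −0.241).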
Structural comparability ≠ Construct equivalence. Human scores show weak but significant criterion correlations (.090–.095); AI scores show near-zero correlations with all external criteria.
In-Hee Choi, Ph.D. Department of Education, Sookmyung Women’s University ichoi@sookmyung.ac.kr