IMPS 2026 Poster Draft (A0)

Development and Validation of a Context-Specific Generative AI Scoring System for Scenario-Based Teaching Aptitude Assessments: A Many-Facet Rasch Measurement Approach

In-Hee Choi, Sookmyung Women’s University, Seoul, South Korea


[COLUMN 1]

INTRODUCTION

Background

  • Self-report measures for teaching aptitude are vulnerable to social desirability bias and limited in predicting actual classroom behavior (Lievens et al., 2008).
  • Choi, Park, & Lee (2025) developed a scenario-based essay assessment with ChatGPT-4 automated scoring, but found a 4.5-point leniency gap (out of 24) and limited scoring precision.
  • Ham et al. (2024) applied MFRM to integrate GPT-4 scores with human ratings for science inquiry assessments and found severe model degradation: all human rater Infit values exceeded 1.5.

Purpose

This study develops a context-specific AI scoring system and evaluates its comparability with human ratings using MFRM, comparing three models:

  • M1: Human raters only
  • M2: Human + AI raters
  • M3: Human + AI + Scenario facet

METHOD

Participants: 535 pre-service teachers at a university in Seoul

  • Between-subject design: each respondent answered one of four scenarios

Scenario | Type | N
A1 (Phone rules) | School discipline | 135
A2 (Slipper rules) | School discipline | 123
B1 (Field trip complaint) | Parental complaint | 146
B2 (Club activity complaint) | Parental complaint | 131

Scoring Criteria: 8 items across 3 domains (0–3 scale, max = 24)

  • Problem-solving: Problem Definition, Fluency, Validity
  • Judgment: Decision-making, Inclusiveness
  • Planning: Resource Utilization, Specificity, Effectiveness
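The rubric layout above can be written down as a small data structure, which makes the 8-item, 24-point maximum easy to check. This is a minimal sketch: the names `RUBRIC`, `all_items`, `max_total`, and `total_score` are illustrative, and the example response is hypothetical, not data from the study.

```python
# Rubric layout as described in the poster: 3 domains, 8 items, each scored 0-3.
RUBRIC = {
    "Problem-solving": ["Problem Definition", "Fluency", "Validity"],
    "Judgment": ["Decision-making", "Inclusiveness"],
    "Planning": ["Resource Utilization", "Specificity", "Effectiveness"],
}
MAX_PER_ITEM = 3


def all_items(rubric=RUBRIC):
    """Flatten the domain -> items mapping into one ordered item list."""
    return [item for items in rubric.values() for item in items]


def max_total(rubric=RUBRIC, max_per_item=MAX_PER_ITEM):
    """Highest possible total score for one response."""
    return len(all_items(rubric)) * max_per_item


def total_score(item_scores, rubric=RUBRIC, max_per_item=MAX_PER_ITEM):
    """Sum item scores after checking each item is present and within 0..max."""
    total = 0
    for item in all_items(rubric):
        score = item_scores[item]  # raises KeyError if an item is missing
        if not 0 <= score <= max_per_item:
            raise ValueError(f"{item}: score {score} outside 0-{max_per_item}")
        total += score
    return total


# Hypothetical response: every item scored 2.
example_scores = {item: 2 for item in all_items()}
print(len(all_items()), max_total(), total_score(example_scores))
```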

Human Scoring

  • 16 expert raters in 8 pairs; 2 independent ratings per response
  • Linking design: 16 common anchor items scored by all raters

AI Scoring

  • Context-specific system built on ChatGPT-4 API
  • Enhanced rubrics with scoring examples and scenario context
  • Each response scored twice → analyzed both as 2 separate AI raters (AI₁, AI₂) and as a single averaged rater (AI_avg)
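The two ways of treating the repeated AI scoring can be sketched as follows. This only illustrates the bookkeeping: the function name and the example scores are hypothetical, and no API call is shown. Item-wise averaging can produce half-point values such as 2.5; the poster does not state how these were handled before estimation, so the sketch leaves them unrounded.

```python
def average_replications(ai_1, ai_2):
    """Average two AI scoring passes item by item to form AI_avg."""
    if ai_1.keys() != ai_2.keys():
        raise ValueError("both passes must score the same items")
    return {item: (ai_1[item] + ai_2[item]) / 2 for item in ai_1}


# Hypothetical item scores from two scoring passes over the same response.
pass_1 = {"Fluency": 2, "Validity": 1, "Specificity": 3}
pass_2 = {"Fluency": 3, "Validity": 1, "Specificity": 3}

# M2b-style: keep the passes as two separate raters.
separate_raters = {"AI_1": pass_1, "AI_2": pass_2}

# M2a-style: collapse them into a single averaged rater.
averaged_rater = {"AI_avg": average_replications(pass_1, pass_2)}
print(averaged_rater)
```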

MFRM DESIGN

Three-facet partial credit model:

\[\ln\frac{P(X_{nij}=k)}{P(X_{nij}=k-1)} = \theta_n - \delta_i - \alpha_j - \tau_{ik}\]
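To make the model equation concrete, the sketch below turns the adjacent-category logits into category probabilities for a single person-item-rater combination. It assumes the usual reading of the symbols (θ_n person ability, δ_i item difficulty, α_j rater severity, τ_ik the k-th step of item i), which the poster does not spell out. The function name and the threshold values are illustrative only, not estimates from this study.

```python
import math


def category_probabilities(theta, delta, alpha, taus):
    """Return [P(X=0), ..., P(X=K)] under the three-facet partial credit model.

    Each adjacent-category logit is theta - delta - alpha - tau_k, so the log
    numerator for category k is the running sum of the first k logits.
    """
    eta = theta - delta - alpha
    log_numerators = [0.0]  # category 0 is the reference
    for tau in taus:
        log_numerators.append(log_numerators[-1] + eta - tau)
    peak = max(log_numerators)  # subtract the max for numerical stability
    weights = [math.exp(v - peak) for v in log_numerators]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical person (theta) and step values; delta and alpha are in logits.
probs = category_probabilities(theta=0.5, delta=-0.565, alpha=-0.193,
                               taus=[-1.0, 0.0, 1.0])
for k, p in enumerate(probs):
    print(f"P(X = {k}) = {p:.3f}")
```

A more severe rater (larger α) lowers every adjacent-category logit by the same amount, which shifts probability toward lower score categories.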

Model | Raters | Facets | Obs.
M1 | 16 human | Person × Rater × Item | 1,294
M2a | 16 human + AI_avg | Person × Rater × Item | 1,829
M2b | 16 human + AI₁ + AI₂ | Person × Rater × Item | 2,364
M3 | 16 human + AI₁ + AI₂ | + Scenario facet | 2,364
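The Obs. column can be reproduced under one reading of the design. This is an inference from the counts, not something the poster states: it assumes one observation is one rater's scoring of one response, that the 16 anchors are responses drawn from the 535 and already rated by their assigned pair, and that each AI rater scores every response once. All constant and function names are illustrative.

```python
N_RESPONDENTS = 535
N_HUMAN_RATERS = 16
RATINGS_PER_RESPONSE = 2
N_ANCHOR_RESPONSES = 16  # assumption: anchors are responses, not rubric items


def human_observations():
    """Paired ratings for every response plus the extra anchor ratings."""
    paired = N_RESPONDENTS * RATINGS_PER_RESPONSE
    # Assumption: each anchor is rated by the raters outside its own pair too.
    extra_anchor = N_ANCHOR_RESPONSES * (N_HUMAN_RATERS - RATINGS_PER_RESPONSE)
    return paired + extra_anchor


def observations(n_ai_raters):
    """Total rater-by-response observations with n_ai_raters AI raters added."""
    return human_observations() + n_ai_raters * N_RESPONDENTS


for label, n_ai in [("M1", 0), ("M2a", 1), ("M2b / M3", 2)]:
    print(label, observations(n_ai))
```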

[COLUMN 2]

RESULTS 1: Rater Severity

[Figure 1: Rater Severity Bar Chart — M2b]

Indicator | Ham et al. (2024) | This Study
AI severity | −0.52, −0.48 (most lenient) | AI₁ = −0.193, AI₂ = −0.217
AI location | More lenient than ALL humans | Center of human distribution
AI₁ − AI₂ difference | | 0.024 logit (high consistency)
  • AI₁ and AI₂ located at the center of human rater distribution
  • 6 humans more lenient than AI, 9 more severe

RESULTS 2: Item Difficulty Stability

[Figure 2: Item Difficulty M1 vs M2 Scatterplot]

Item | M1 | M2a (AI_avg) | M2b (AI×2)
Problem Def. | −0.565 | −0.595 | −0.779
Fluency | −3.034 | −3.206 | −3.545
Validity | −2.927 | −2.842 | −2.844
Decision | −1.538 | −1.375 | −1.221
Inclusiveness | −1.245 | −1.253 | −1.282
Resource | −0.881 | −0.820 | −0.750
Specificity | −0.780 | −0.574 | −0.363
Effectiveness | −0.708 | −0.493 | −0.241
Profile r with M1 | | .994 | .985
  • Near-perfect stability in both conditions
  • Averaging AI replications yields higher stability
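A profile correlation of this kind can be computed directly from two columns of item difficulties. The sketch below uses the M1 and M2a (AI_avg) columns and assumes the profile r is a Pearson correlation across the 8 items, which the poster does not state explicitly. The helper `pearson_r` is written out so that the example needs no external packages.

```python
import math

# Item difficulties (logits) in table order: Problem Def., Fluency, Validity,
# Decision, Inclusiveness, Resource, Specificity, Effectiveness.
M1 = [-0.565, -3.034, -2.927, -1.538, -1.245, -0.881, -0.780, -0.708]
M2A = [-0.595, -3.206, -2.842, -1.375, -1.253, -0.820, -0.574, -0.493]


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


print(round(pearson_r(M1, M2A), 3))  # 0.994
```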

RESULTS 3: Fit Statistics

Item-Level Fit (Infit Mean Square)

Item | M1 | M2a (AI_avg) | M2b (AI×2)
Problem Def. | 1.136 | 1.279 | 1.399
Fluency | 1.146 | 1.278 | 1.436
Validity | 1.006 | 1.244 | 1.426
Decision | 0.892 | 1.095 | 1.231
Inclusiveness | 1.026 | 1.072 | 1.143
Resource | 0.992 | 1.030 | 1.084
Specificity | 0.889 | 0.966 | 1.052
Effectiveness | 0.934 | 1.022 | 1.090
Items > 1.5 | 0 | 0 | 0

→ All items within acceptable range (0.5–1.5) in all models
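The range check behind this statement is easy to reproduce from the item-level Infit values. The sketch below applies the 0.5–1.5 bounds used on this poster to each model column; the names `INFIT`, `MODELS`, and `misfitting_items` are illustrative.

```python
# Item-level Infit mean squares, columns ordered as M1, M2a (AI_avg), M2b (AI x 2).
MODELS = ("M1", "M2a", "M2b")
INFIT = {
    "Problem Def.": (1.136, 1.279, 1.399),
    "Fluency": (1.146, 1.278, 1.436),
    "Validity": (1.006, 1.244, 1.426),
    "Decision": (0.892, 1.095, 1.231),
    "Inclusiveness": (1.026, 1.072, 1.143),
    "Resource": (0.992, 1.030, 1.084),
    "Specificity": (0.889, 0.966, 1.052),
    "Effectiveness": (0.934, 1.022, 1.090),
}
LOWER, UPPER = 0.5, 1.5


def misfitting_items(infit, column, lower=LOWER, upper=UPPER):
    """Items whose Infit in the given column falls outside [lower, upper]."""
    return [item for item, values in infit.items()
            if not lower <= values[column] <= upper]


for column, model in enumerate(MODELS):
    values = [v[column] for v in INFIT.values()]
    print(model, "flagged:", misfitting_items(INFIT, column),
          "max Infit:", max(values))
```

Although no item is flagged, the largest Infit rises from M1 to M2b, so the margin to the 1.5 bound narrows as AI raters are added.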

Rater-Level Fit (Infit Mean Square)

Indicator | Ham et al. (2024) | M2a (AI_avg) | M2b (AI×2)
Human Infit range | 1.42–2.07 | 0.74–1.37 | 0.92–1.83
Humans > 1.5 | ALL (8/8) | 0/16 | 3/16 (19%)
AI Infit | | 0.66 (overfit) | AI₁=0.64, AI₂=0.63

→ Context-specific system preserves model fit substantially better than general-purpose GPT-4


[COLUMN 3]

RESULTS 4: Model Comparison Summary

Index | M1 | M2a (AI_avg) | M2b (AI×2) | M3 (+Scenario)
Observations | 1,294 | 1,829 | 2,364 | 2,364
Parameters | 40 | 41 | 42 | 45
EAP Reliability | .784 | .830 | .774 | .772
θ correlation with M1 | | .973 | .997 | .993
Item profile r with M1 | | .994 | .985 | .985
  • M2a (averaged AI) improves EAP reliability without fit degradation
  • M2b (individual AI) maintains very high θ correlation (.997)
  • M3: Scenario difficulty range = 0.251 logit, about 17% of the rater severity range (1.47 logit)

RESULTS 5: Comparison with Ham et al. (2024)

Indicator | Ham et al. (2024): General-purpose GPT-4 | This Study: Context-specific System
AI severity location | Most lenient of all | Center of human distribution
Item difficulty change | Large shifts | r = .985–.994
Human rater Infit in M2 | All > 1.5 | 0/16 (M2a), 3/16 (M2b)
AI internal consistency | AI α > Human α | Human α > AI α
Total score difference | ~4.5 pts (all lenient) | 0.67 pts (bidirectional)

DISCUSSION

  1. System design matters more than AI capability. The context-specific system locates AI at the center of human severity, while general-purpose GPT-4 was more lenient than all humans.

  2. Averaging AI replications optimizes integration. Using AI_avg (item profile r = .994; 0/16 human raters with Infit > 1.5) yields better stability than individual AI scores (r = .985; 3/16 human raters with Infit > 1.5), suggesting that averaging reduces stochastic noise in the AI scores.

  3. AI shows differential behavior by item type.

    • Adequate for observable/countable elements (Inclusiveness, Resource, Specificity)
    • Limited for qualitative judgment items (Fluency, Validity, Decision-making)
    • AI SD as low as 46% of human SD on Validity → range restriction

  4. Structural comparability ≠ Construct equivalence. Human scores show weak but significant criterion correlations (.090–.095); AI scores show near-zero correlations with all external criteria.
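The noise-reduction argument in point 2 can be motivated with a short variance calculation. As a sketch, assume each AI scoring pass equals a stable score plus a zero-mean error, with errors e₁ and e₂ of equal variance σ² and correlation ρ. These symbols are illustrative and do not appear in the poster.

```latex
\[
\operatorname{Var}\!\left(\frac{e_1 + e_2}{2}\right)
  = \frac{\sigma^2 + \sigma^2 + 2\rho\sigma^2}{4}
  = \frac{\sigma^2 (1 + \rho)}{2}
\]
```

If the two passes err independently (ρ = 0), averaging halves the error variance. If they repeat the same errors (ρ close to 1), averaging gains almost nothing, so the benefit depends on how much of the AI error is random rather than systematic.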

CONCLUSIONS

  • Context-specific AI scoring systems can be integrated into MFRM without model degradation, unlike general-purpose LLMs.
  • Practical recommendation: Use averaged AI replications for MFRM integration.
  • Future work: Few-shot anchor responses, chain-of-thought scoring protocols, and refined rubrics for qualitative judgment items.

REFERENCES

  • Choi, I. H., Park, S. Y., & Lee, Y. K. (2025). Korean Journal of Educational Research, 63(6).
  • Eckes, T. (2015). Introduction to many-facet Rasch measurement. Peter Lang.
  • Ham, E. H., et al. (2024). Journal of Educational Information and Media, 30(3), 713–742.
  • Linacre, J. M. (2019). Facets (Version 3.81.2).
  • Robitzsch, A., et al. (2023). TAM: Test Analysis Modules [R package].

CONTACT

In-Hee Choi, Ph.D., Department of Education, Sookmyung Women’s University