IMPS 2026 Poster Draft (A0)

Development and Validation of a Context-Specific Generative AI Scoring System for Scenario-Based Teaching Aptitude Assessments: A Many-Facet Rasch Measurement Approach

In-Hee Choi
Sookmyung Women’s University, Seoul, South Korea
ichoi@sookmyung.ac.kr

[COLUMN 1]

INTRODUCTION

Background
- Self-report measures of teaching aptitude are vulnerable to social desirability bias and limited in predicting actual classroom behavior (Lievens et al., 2008).
- Choi, Park, & Lee (2025) developed a scenario-based essay assessment with ChatGPT-4 automated scoring, but found a 4.5-point leniency gap (out of 24) and limited scoring precision.
- Ham et al. (2024) applied many-facet Rasch measurement (MFRM) to integrate GPT-4 scores with human ratings for science inquiry assessments and found severe model degradation: all human rater Infit values exceeded 1.5.

Purpose
This study develops a context-specific AI scoring system and evaluates its comparability with human ratings using MFRM, comparing:
- M1: Human raters only
- M2: Human + AI raters
- M3: Human + AI + Scenario facet

METHOD

Participants: 535 pre-service teachers at a university in Seoul
- Between-subject design: each respondent answered one of four scenarios

Scenario	Type	N
A1 (Phone rules)	School discipline	135
A2 (Slipper rules)	School discipline	123
B1 (Field trip complaint)	Parental complaint	146
B2 (Club activity complaint)	Parental complaint	131

Scoring Criteria: 8 items across 3 domains (0–3 scale, max = 24)
- Problem-solving: Problem Definition, Fluency, Validity
- Judgment: Decision-making, Inclusiveness
- Planning: Resource Utilization, Specificity, Effectiveness

Human Scoring
- 16 expert raters in 8 pairs; 2 independent ratings per response
- Linking design: 16 common anchor items scored by all raters

AI Scoring
- Context-specific system built on ChatGPT-4 API
- Enhanced rubrics with scoring examples and scenario context
- Each response scored twice → analyzed both as 2 separate AI raters (AI₁, AI₂) and as a single averaged rater (AI_avg)

MFRM DESIGN

Three-facet partial credit model:

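\[\ln\frac{P(X_{nij}=k)}{P(X_{nij}=k-1)} = \theta_n - \delta_i - \alpha_j - \tau_{ik}\]

where θ_n is the ability of person n, δ_i the difficulty of item i, α_j the severity of rater j, and τ_ik the k-th threshold of item i.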
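
As a minimal sketch of what the adjacent-category logit above implies, the category probabilities for a single person-item-rater combination can be obtained by accumulating the logits and normalizing. The function name and all parameter values below are hypothetical illustrations, not estimates from this study or code from Facets or TAM.

```python
import math

def category_probs(theta, delta, alpha, taus):
    """Return P(X = 0..K) for one person-item-rater combination.

    theta: person ability, delta: item difficulty, alpha: rater severity,
    taus: item-specific thresholds tau_1..tau_K (all in logits).
    """
    # Cumulative sums of the adjacent-category logits; category 0 has logit 0.
    cumulative = [0.0]
    for tau in taus:
        cumulative.append(cumulative[-1] + (theta - delta - alpha - tau))
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(cumulative)
    weights = [math.exp(c - peak) for c in cumulative]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical values for illustration only (0-3 scale, so three thresholds).
taus = [-1.0, 0.0, 1.0]
probs = category_probs(theta=0.5, delta=-0.8, alpha=-0.2, taus=taus)
```

By construction, the log-ratio of adjacent probabilities recovers θ − δ − α − τ for each threshold, so a more severe rater (larger α) shifts probability toward lower score categories.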

Model	Raters	Facets	Obs.
M1	16 human	Person × Rater × Item	1,294
M2a	16 human + AI_avg	Person × Rater × Item	1,829
M2b	16 human + AI₁ + AI₂	Person × Rater × Item	2,364
M3	16 human + AI₁ + AI₂	+ Scenario facet	2,364
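
A quick consistency check on the Obs. column can be sketched as follows. It assumes that each AI rater contributes one observation per respondent, which is an interpretation of the counts rather than something stated on the poster; the variable names are illustrative.

```python
# Scenario sample sizes and observation counts as reported in the tables.
scenario_n = {"A1": 135, "A2": 123, "B1": 146, "B2": 131}
obs = {"M1": 1294, "M2a": 1829, "M2b": 2364, "M3": 2364}

n_respondents = sum(scenario_n.values())

# Observations added relative to the human-only model.
added_m2a = obs["M2a"] - obs["M1"]   # one averaged AI rater
added_m2b = obs["M2b"] - obs["M1"]   # two separate AI raters
```

Under that assumption, M2a adds one AI observation per respondent and M2b adds two, and M3 uses the same data as M2b.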

[COLUMN 2]

RESULTS 1: Rater Severity

[Figure 1: Rater Severity Bar Chart — M2b]

Indicator	Ham et al. (2024)	This Study
AI severity	−0.52, −0.48 (most lenient)	AI₁ = −0.193, AI₂ = −0.217
AI location	More lenient than ALL humans	Center of human distribution
AI₁ − AI₂ difference	—	0.024 logit (high consistency)

AI₁ and AI₂ located at the center of human rater distribution
6 humans more lenient than AI, 9 more severe
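
To give a sense of scale for the AI₁ − AI₂ gap, the sketch below converts the severity difference into an odds ratio. It relies only on the model form, in which rater severity enters the adjacent-category logit with a negative sign; the variable names are illustrative.

```python
import math

# Severity estimates from the table (logits); higher means more severe.
ai1_severity = -0.193
ai2_severity = -0.217

gap = ai1_severity - ai2_severity   # AI1 is slightly more severe than AI2

# Odds of the higher of two adjacent categories under AI2 relative to AI1,
# holding person, item, and threshold fixed.
odds_ratio = math.exp(gap)
```

The resulting odds ratio is about 1.024, meaning the two replications are nearly interchangeable in severity.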

RESULTS 2: Item Difficulty Stability

[Figure 2: Item Difficulty M1 vs M2 Scatterplot]

Item	M1	M2a (AI_avg)	M2b (AI×2)
Problem Def.	−0.565	−0.595	−0.779
Fluency	−3.034	−3.206	−3.545
Validity	−2.927	−2.842	−2.844
Decision	−1.538	−1.375	−1.221
Inclusiveness	−1.245	−1.253	−1.282
Resource	−0.881	−0.820	−0.750
Specificity	−0.780	−0.574	−0.363
Effectiveness	−0.708	−0.493	−0.241
Profile r with M1	—	.994	.985

Near-perfect stability in both conditions
Averaging AI replications yields higher stability (profile r = .994 vs. .985)

RESULTS 3: Fit Statistics

Item-Level Fit (Infit Mean Square)

Item	M1	M2a (AI_avg)	M2b (AI×2)
Problem Def.	1.136	1.279	1.399
Fluency	1.146	1.278	1.436
Validity	1.006	1.244	1.426
Decision	0.892	1.095	1.231
Inclusiveness	1.026	1.072	1.143
Resource	0.992	1.030	1.084
Specificity	0.889	0.966	1.052
Effectiveness	0.934	1.022	1.090
Items > 1.5	0	0	0

→ All items within acceptable range (0.5–1.5) in all models
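
The acceptable-range check can be reproduced directly from the tabled values. This is a small sketch; the 0.5–1.5 bounds are the ones stated above, and the variable names are illustrative.

```python
# Item infit mean squares from the table: (M1, M2a, M2b).
infit = {
    "Problem Def.":  (1.136, 1.279, 1.399),
    "Fluency":       (1.146, 1.278, 1.436),
    "Validity":      (1.006, 1.244, 1.426),
    "Decision":      (0.892, 1.095, 1.231),
    "Inclusiveness": (1.026, 1.072, 1.143),
    "Resource":      (0.992, 1.030, 1.084),
    "Specificity":   (0.889, 0.966, 1.052),
    "Effectiveness": (0.934, 1.022, 1.090),
}
models = ("M1", "M2a", "M2b")
LOWER, UPPER = 0.5, 1.5

# Items falling outside the acceptable range, per model.
flagged = {
    m: [item for item, v in infit.items() if not LOWER <= v[i] <= UPPER]
    for i, m in enumerate(models)
}
# Largest infit per model, to show how close each model comes to the bound.
largest = {m: max(v[i] for v in infit.values()) for i, m in enumerate(models)}
```

No item is flagged in any model, although the largest values rise from M1 to M2a to M2b, with Fluency in M2b closest to the upper bound.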

Rater-Level Fit (Infit Mean Square)

Statistic	Ham et al. (2024)	M2a (AI_avg)	M2b (AI×2)
Human Infit range	1.42–2.07	0.74–1.37	0.92–1.83
Humans > 1.5	ALL (8/8)	0/16	3/16 (19%)
AI Infit	—	0.66 (overfit)	AI₁=0.64, AI₂=0.63

→ Context-specific system preserves model fit substantially better than the general-purpose GPT-4 scoring in Ham et al. (2024)
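
For readers less familiar with the statistic, the infit mean square can be sketched as the sum of squared score residuals divided by the sum of model-implied score variances. The code below follows that standard Rasch definition; it is not taken from Facets or TAM, and the function name and example ratings are hypothetical.

```python
def infit_mnsq(observed, prob_rows):
    """Information-weighted mean square for one rater (or item).

    observed:  list of observed scores (0..K)
    prob_rows: list of model category probabilities, one row per observation
    """
    squared_residuals = 0.0
    model_variance = 0.0
    for x, probs in zip(observed, prob_rows):
        expected = sum(k * p for k, p in enumerate(probs))
        variance = sum((k - expected) ** 2 * p for k, p in enumerate(probs))
        squared_residuals += (x - expected) ** 2
        model_variance += variance
    return squared_residuals / model_variance

# Hypothetical case: the model gives equal probability to each 0-3 category.
uniform = [0.25, 0.25, 0.25, 0.25]
as_expected = infit_mnsq([0, 1, 2, 3], [uniform] * 4)      # spread matches model
too_consistent = infit_mnsq([1, 2, 1, 2], [uniform] * 4)   # less spread than model
```

Values near 1 indicate ratings as variable as the model expects. Values well below 1, such as the AI Infit of roughly 0.63–0.66 above, indicate ratings that vary less than expected (overfit), while values above 1.5 indicate excess unpredictability.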

[COLUMN 3]

RESULTS 4: Model Comparison Summary

Index	M1	M2a (AI_avg)	M2b (AI×2)	M3 (+Scenario)
Observations	1,294	1,829	2,364	2,364
Parameters	40	41	42	45
EAP Reliability	.784	.830	.774	.772
θ correlation with M1	—	.973	.997	.993
Item profile r with M1	—	.994	.985	.985

M2a (averaged AI) improves EAP reliability (.784 → .830) without fit degradation
M2b (individual AI) maintains very high θ correlation (.997)
M3: Scenario difficulty range = 0.251 logit (small relative to rater severity range of 1.47 logit)
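
The relative size of the scenario effect can be made explicit with a short calculation using the two ranges reported above; the variable names are illustrative.

```python
# Ranges reported for M3 (logits).
scenario_range = 0.251
rater_severity_range = 1.47

# Scenario spread as a share of the rater severity spread.
ratio = scenario_range / rater_severity_range
```

The scenario range is about 17% of the rater severity range, so differences between raters outweigh differences between scenarios in this design.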

RESULTS 5: Comparison with Ham et al. (2024)

Indicator	Ham et al. (2024): General-purpose GPT-4	This Study: Context-specific System
AI severity location	Most lenient of all	Center of human distribution
Item difficulty change	Large shifts	r = .985–.994
Human rater Infit in M2	All > 1.5	0/16 (M2a), 3/16 (M2b)
AI internal consistency	AI α > Human α	Human α > AI α
Total score difference	~4.5 pts (all lenient)	0.67 pts (bidirectional)

DISCUSSION

System design matters more than AI capability. The context-specific system places the AI raters at the center of the human severity distribution, whereas the general-purpose GPT-4 in Ham et al. (2024) was more lenient than all human raters.
Averaging AI replications optimizes integration. Using AI_avg (r = .994, 0/16 misfit) yields better stability than individual AI scores (r = .985, 3/16 misfit), suggesting that averaging effectively reduces stochastic noise in AI scores.
AI shows differential behavior by item type.
- Adequate for observable/countable elements (Inclusiveness, Resource, Specificity)
- Limited for qualitative judgment items (Fluency, Validity, Decision-making)
- AI SD as low as 46% of human SD on Validity → range restriction
Structural comparability ≠ Construct equivalence. Human scores show weak but significant criterion correlations (.090–.095); AI scores show near-zero correlations with all external criteria.
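
The noise-reduction argument for averaging AI replications can be illustrated with a small simulation. It assumes that the two replications carry independent, equal-variance random errors, an assumption not tested on this poster; the Gaussian noise and its standard deviation are hypothetical.

```python
import random
import statistics

random.seed(0)
n_responses = 20000
noise_sd = 0.5  # hypothetical replication noise, in score units

# Independent scoring noise for two AI replications of the same responses.
first = [random.gauss(0, noise_sd) for _ in range(n_responses)]
second = [random.gauss(0, noise_sd) for _ in range(n_responses)]
averaged = [(a + b) / 2 for a, b in zip(first, second)]

sd_single = statistics.pstdev(first)
sd_averaged = statistics.pstdev(averaged)
ratio = sd_averaged / sd_single
```

Under these assumptions the noise standard deviation of the averaged score is close to 1/√2 (about 0.71) of a single replication's. Correlated errors between replications would shrink this benefit.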

CONCLUSIONS

Context-specific AI scoring systems can be integrated into MFRM without model degradation, unlike general-purpose LLMs.
Practical recommendation: Use averaged AI replications for MFRM integration.
Future work: Few-shot anchor responses, chain-of-thought scoring protocols, and refined rubrics for qualitative judgment items.

REFERENCES

Choi, I. H., Park, S. Y., & Lee, Y. K. (2025). Korean Journal of Educational Research, 63(6).
Eckes, T. (2015). Introduction to many-facet Rasch measurement. Peter Lang.
Ham, E. H., et al. (2024). Journal of Educational Information and Media, 30(3), 713–742.
Linacre, J. M. (2019). Facets (Version 3.81.2).
Robitzsch, A., et al. (2023). TAM: Test Analysis Modules [R package].

CONTACT

In-Hee Choi, Ph.D.
Department of Education, Sookmyung Women’s University
ichoi@sookmyung.ac.kr