Both fields have developed rich but largely separate vocabularies for overlapping concepts. A complete side-by-side reference table — mapping every key psychometric term to its ML/AI equivalent, with notes on where the analogy holds and where it breaks down — is provided in the Supplementary Materials at the end of this document. The brief overview below introduces the translation challenge before the main framework begins.
The two fields at the heart of this dissertation — psychometrics and machine learning / AI — have developed in parallel without much mutual translation. A psychometrician reading an ML benchmark paper will encounter terms like “accuracy,” “benchmark,” and “prompt” used in ways that partially overlap with, but are not identical to, “observed score,” “test instrument,” and “item stimulus.” An AI scientist reading a psychometric validity study will encounter “construct-irrelevant variance,” “nomological network,” and “universe score” — concepts with precise technical meanings but no standard ML equivalents.
The full Terminology Comparison Table covering 20+ term pairs is in the Supplementary Materials. Readers new to psychometrics are encouraged to review it before Pillars I–III. Readers fluent in both fields may proceed directly to the framework.
The systematic application of psychometric methods to the evaluation of artificial intelligence has been formalized as AI Psychometrics — defined as the principled measurement of cognitive and behavioral properties of AI systems using validated psychometric instruments and frameworks (Wang et al., 2023; Li et al., 2025). Li et al. (2025) draw a conceptually important three-way distinction among related but non-identical research programs: AI Psychometrics, which administers established psychometric instruments to AI systems to characterize their behavioral and psychological profiles; Psychometric AI, which applies psychometric principles to improve the design and evaluation of AI benchmarks; and Computational Psychometrics, which uses AI and machine learning methods to advance psychometric methodology itself.
The present dissertation occupies the Psychometric AI branch: its goal is not to describe LLMs’ psychological profiles but to apply and adapt psychometric measurement theory to ensure that LLM benchmark scores are valid, reliable, and comparable. This positioning distinguishes the dissertation from prior AI Psychometrics work (e.g., Pellert et al., 2024) and aligns it with a rigorous measurement-science agenda.
For AI scientists: Psychometric AI means using measurement science to make your benchmarks more trustworthy — not just asking “did the model get it right?” but “does getting it right mean what we claim it means, and does that meaning hold up under different conditions?”
For psychometricians: The “examinees” here are AI models. The “test instrument” is a benchmark dataset. The core challenge is familiar — establishing that scores mean what we claim they mean — but the population, the construct definition, and the administration conditions are all novel.
Audience bridge: This section presents the dissertation’s central argument — that LLM benchmark scores need the same validity infrastructure that educational tests have used for a century. For AI scientists: think of this as applying the engineering discipline of metrology (measurement standards) to your evaluation pipelines. For psychometricians: think of benchmarks as unvalidated tests, and this dissertation as the validation study they never had.
The assessment of large language models is an active and rapidly developing area in which engineers, scientists, and measurement specialists are collectively working to understand what LLMs can do, how best to measure it, and when particular measurements are appropriate for particular decisions. Evaluation matters enormously at this moment — not because current practice is fundamentally broken, but because the stakes of model capability claims are rising faster than the measurement infrastructure needed to substantiate them. Psychometric theory, grounded in over a century of rigorous measurement science, provides precisely this infrastructure: a systematic framework of validity arguments, reliability estimation, and score comparability standards that can bring interpretive rigor to LLM benchmarking as the field matures and as high-stakes deployment decisions increasingly depend on benchmark scores.
The field of AI Psychometrics (Wang et al., 2023; Li et al., 2025) provides the disciplinary anchor for this work. A foundational distinction within this field is between construct-oriented evaluation—which asks whether a benchmark measures a well-defined latent construct with appropriate validity evidence—and task-oriented evaluation—which asks only whether a model succeeds on a given task, without concern for construct interpretation (Wang et al., 2023). Contemporary LLM benchmarking is almost exclusively task-oriented; the present dissertation argues that a construct-oriented framework is necessary for benchmarks to support high-stakes inferences about model capability.
The limitations of current LLM benchmarking practices mirror what Swiecki et al. (2022) termed the Standard Assessment Paradigm (SAP), a bundle of assumptions inherited from large-scale standardized testing that renders assessments onerous, discrete, uniform, inauthentic, and antiquated. Although the SAP was developed to critique human assessments, its five critiques apply with equal force to contemporary LLM benchmarks: they are costly to construct and maintain (onerous), administered in isolated one-shot episodes (discrete), applied uniformly across heterogeneous architectures (uniform), untethered from real-world task contexts (inauthentic), and anchored to static item banks that are rapidly contaminated by training-data overlap (antiquated). The present dissertation treats the psychometric renovation of LLM benchmarking as an applied test of whether formal measurement frameworks—developed over a century of human assessment research—can correct the structural deficiencies the SAP identifies.
The dissertation pursues this renovation through a Three-Pillar architecture that mirrors the lifecycle of a rigorous assessment system. Pillar I establishes the construct definition and validity framework — defining what is being measured, for whom, and under what interpretive assumptions — before any data are collected. Pillar II conducts the core empirical validation: testing the structural integrity of the proposed construct through dimensionality analysis, bifactor modeling, and generalizability estimation across conditions that include the non-normal latent distributions characteristic of heterogeneous LLM populations. Pillar III addresses the operational sustainability of the assessment system over time — how domain specifications are maintained, how item banks are refreshed, and how score comparability is preserved as model generations evolve. Together, the three pillars constitute a complete validity argument (Kane, 2013) rather than a collection of discrete technical analyses, with each methodological choice serving an explicit inferential purpose within the broader argument that psychometric standards are both applicable and necessary for high-stakes LLM evaluation.
The dissertation is organized into three integrated pillars that follow the lifecycle of a high-stakes assessment — from foundational definition through empirical validation to long-term maintenance. This architecture replaces the original five-question sequence with a construct-first logic in which every methodological tool serves an explicit measurement argument rather than standing as a demonstration of technique.
Goal: Define the object of measurement before measuring anything. This pillar establishes the nomological network for LLM capabilities, specifies the intended score interpretations and uses that determine what validity evidence is required, and articulates what “reasoning” or “knowledge” means behaviorally for a non-human agent. Evaluation is not a neutral act; its validity depends entirely on this definitional groundwork. Components: Intended Use Taxonomy (formerly RQ1) + Construct Validity Evidence — theoretical framework (formerly RQ2, §1–§3.2).
Goal: Given the construct definition established in Pillar I, determine whether scores actually capture it and quantify every source of error. The bifactor model answers whether benchmark performance is organized by a general ability dimension (\(G\)) or fragmented domain knowledge (\(S_k\)); Generalizability Theory then quantifies how much prompt wording, format, and stochastic replication contaminate \(G\)-scores. Dimensionality and reliability are addressed in sequence because reliability estimates are interpretable only once the structural model is established. Components: Internal Structure Evidence — bifactor IRT + simulation study (formerly RQ2, §3.3 onward) + Item Format as Construct-Irrelevant Variance and G-Theory (formerly RQ3).
Goal: With a validated, reliable score in hand, address the longitudinal and operational questions: how to maintain a vertical scale as models improve, how to generate fresh parallel forms through automated item generation to prevent contamination-induced obsolescence, and when cross-benchmark score comparison is defensible. AIG is treated as the structural solution to scale decay; equating provides the standard against which growth claims are tested. Components: Domain Specification, Test Blueprints, and AIG (formerly RQ4) + Score Comparability and Longitudinal Linking (formerly RQ5).
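To make the equating component of Pillar III concrete, the sketch below illustrates common-item mean/sigma linking, one of the simplest IRT linking procedures that RQ5 considers. The anchor-item difficulties and the linear relation between forms are simulated for illustration, not taken from any benchmark.

```r
# Hypothetical common-item (anchor) difficulties calibrated separately on two
# benchmark forms; the generating link used for the simulation is b_X = 1.1 * b_Y + 0.3.
set.seed(1)
b_anchor_X <- rnorm(20, 0, 1)                                 # anchors on the Form X scale
b_anchor_Y <- (b_anchor_X - 0.3) / 1.1 + rnorm(20, 0, 0.05)   # same anchors, Form Y calibration

# Mean/sigma linking constants for placing Form Y parameters on the Form X scale
A <- sd(b_anchor_X) / sd(b_anchor_Y)
B <- mean(b_anchor_X) - A * mean(b_anchor_Y)

to_X_scale <- function(b_Y) A * b_Y + B   # difficulties: b* = A * b + B
# (ability estimates transform the same way; discriminations divide by A: a* = a / A)
round(c(A = A, B = B), 3)
```

Characteristic-curve methods such as Stocking–Lord follow the same common-item logic with a different optimization criterion; the mean/sigma version is shown only because it makes the linking constants explicit.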
A condensed map of the dissertation’s architecture, research questions, methods, and deliverables. Use this section as a navigation guide before reading the full treatment of any component.
Figure 1. Three-Pillar architecture showing how the five research questions are distributed across pillars and how outputs feed forward.
Figure 2. End-to-end evidence chain from intended use specification through score comparability, showing how each stage conditions the next.
Figure 3. This dissertation applies psychometric methods (left circle) to AI/LLM performance evaluation (right circle). The intersection — Psychometric AI — is where formal measurement science is used to ensure that AI benchmark scores are valid, reliable, and comparable. The arrow indicates the direction of application: psychometric tools are brought into the AI evaluation domain.
Figure 3 illustrates the directional logic of this dissertation. The left circle is the methodological toolkit — psychometrics: IRT, Generalizability Theory, validity frameworks, and reliability estimation. The right circle is the domain under study — AI/LLM evaluation: benchmark scores, model capability claims, and evaluation datasets. The arrow marks the direction of application: psychometric methods are brought into the AI evaluation domain. The intersection — Psychometric AI — is where this dissertation operates, rigorously measuring and validating AI performance using the tools psychometrics provides.
| | Pillar I | Pillar II | Pillar III |
|---|---|---|---|
| Name | Construct Definition & Validity Framework | Internal Structure & Score Generalizability | Operational Scaling & Score Comparability |
| Core question | What are we measuring, and for whom? | Does the score faithfully capture \(G\)? How much error contaminates it? | Is the scale durable, and can improvement be distinguished from contamination? |
| Medical metaphor | The Diagnosis — intake assessment; defining what to check for before any test is run | The Labs — calibrating the thermometer; measuring signal, not noise | The Patient History — distinguishing genuine recovery from memorizing the eye chart |
| RQs contained | RQ1 (full) · RQ2 §1–3.2 (theory) | RQ2 §3.3 (simulation + empirical illustration) · RQ3 (G-Theory) | RQ4 (domain spec + LLM taxonomy) · RQ5 (equating) |
| Pillar output | Validity argument template + intended-use taxonomy | Simulation-validated methods + convergent dimensionality evidence + G-study error decomposition | Stratified LLM sampling frame + score comparability roadmap |
| Anchor logic | Validity depends on definitional groundwork | Reliability is interpretable only after structure is established | Operational sustainability requires validated, reliable scores first |
| RQ | Pillar | Core Question | Primary Methods / Tools | Key Deliverables |
|---|---|---|---|---|
| RQ1 | I | How does the intended use of benchmark scores determine what validity evidence is required? | Intended-use taxonomy; Kane’s interpretive/use argument; stakeholder × evidence matrix | (1) Taxonomy of 5 intended uses · (2) Stakeholder × use validity evidence matrix · (3) Decision framework: given your use, here is the evidence you need |
| RQ2 | I → II | What construct validity evidence exists (or is absent) for current LLM benchmark score interpretations? | Messick’s unified validity framework; five sources of validity evidence; dimensionality simulation (13 methods); convergent empirical illustration | (1) Adapted five-source validity evidence framework for LLMs · (2) Four-step validity argument template · (3) Simulation-validated dimensionality method recommendations (§3.3.1) · (4) Convergence-table empirical illustration applying validated methods to the 71-LLM × 645-item USMLE matrix (§3.3.2) |
| RQ3 | II | To what extent do item format, prompt design, occasion, and temperature constitute sources of construct-irrelevant variance, and how should these facet effects be quantified and controlled? | Multi-facet G-study / D-study; MTMM; DMF analysis; 3PL IRT for T/F; staged G-study design (Format · Prompt · Occasion · Temperature) | (1) Core G-study: 5 format levels (MCQ · T/F · fill-in-blank · open-ended · dialogue) · (2) H5: T/F guessing inflation + construct-narrowing · (3) Extended G-study: prompt \(\sigma^2\)(M×P), occasion \(\sigma^2\)(M×O), temperature \(\sigma^2_e\) · (4) D-study recommendations (\(n_P\), \(n_{\text{rep}}\), occasion count) per intended use · (5) Format selection guidelines · (6) Scoring meta-problem framework |
| RQ4 | III | How should benchmark item pools be specified for representativeness, and how should LLM families be defined to support stratified model selection? | Test blueprint methodology; content representativeness analysis; architecture-based taxonomy; IRT-based performance clustering; Adjusted Rand Index | (1) Domain specification framework · (2) Content coverage analysis of 1–2 major benchmarks · (3) LLM Family Taxonomy — architecture dimension + performance-cluster dimension · (4) ARI-validated stratified sampling frame |
| RQ5 | III | Under what conditions are cross-benchmark score comparisons psychometrically defensible, and what infrastructure is required? | IRT-based linking (common-item, common-person); equating designs; concordance analysis; Ramsay/Davidian Curve IRT for non-normal distributions | (1) Feasibility taxonomy (equating / linking / concordance / prediction) · (2) Four-stage roadmap toward score comparability · (3) Unique LLM challenges: contamination, versioning, construct evolution |
The simulation (§3.3.1) validates which of the 13 candidate dimensionality methods are reliable under LLM-specific conditions before any method is applied to empirical data. Methods that pass the selection threshold feed directly into the §3.3.2 empirical illustration.
| Simulation Factor | Levels |
|---|---|
| Factor structure (data-generating) | 1D · 2D (correlated, \(\phi=.40\)) · Bifactor · Correlated-factors · Higher-order |
| N (number of LLMs) | 30 · 71 · 150 · 300 |
| Item loading magnitude | Weak (\(a\) = 0.40–0.60) · Moderate (0.60–0.80) · Strong (0.80–1.20) |
| Latent distribution | Normal \(N(0,1)\) · Bimodal · Negatively skewed |
| Replications per cell | 500 |
Methods evaluated (13 total):
| Family | Methods |
|---|---|
| Traditional psychometric (7) | PA · PA-poly · MAP · Hull · DIMTEST · DETECT · Mokken-MSA |
| Machine learning–based (5) | NMF · VAE-dim · SC-tetra · LASSO-EFA · NIRT |
| Distribution-adaptive IRT (1) | RC-IRT (Ramsay / Davidian curve) |
Selection threshold → Correct-detection rate > .80 and Type I error < .10 in the small-N (\(N \leq 71\)) and bimodal-distribution cells. Methods passing this threshold are the only ones applied in §3.3.2.
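As a concrete illustration of how a single cell of this design might be executed, the R sketch below simulates a unidimensional 2PL response matrix for N = 71 examinees drawn from a bimodal latent distribution and applies one candidate method (parallel analysis on tetrachoric correlations). The item counts, loadings, and mixture settings are assumed values; the full study loops this over every factor combination, all 13 methods, and 500 replications per cell.

```r
# One simulation cell, sketched under assumed values: 1-D generating structure,
# N = 71 "LLM examinees", bimodal latent distribution, moderate loadings.
library(mirt)   # simdata()
library(psych)  # fa.parallel()

set.seed(2025)
n_items <- 25; N <- 71
a <- matrix(runif(n_items, 0.6, 0.8), ncol = 1)   # moderate-loading condition
d <- matrix(rnorm(n_items, 0, 1), ncol = 1)       # item intercepts

# Bimodal theta: mixture of a low-capability and a high-capability tier
tier  <- rbinom(N, 1, 0.5)
theta <- matrix(ifelse(tier == 1, rnorm(N, 1, 0.5), rnorm(N, -1, 0.5)), ncol = 1)

dat <- simdata(a = a, d = d, N = N, itemtype = "dich", Theta = theta)

# One candidate method: parallel analysis on tetrachoric correlations.
# Smoothing warnings are expected at this small N and are themselves informative.
pa <- fa.parallel(dat, cor = "tet", fa = "fa", plot = FALSE)
correct_detection <- (pa$nfact == 1)   # generating structure is unidimensional
correct_detection
```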
| Integrating theme | How RQs connect |
|---|---|
| Intended use drives all evidence requirements | RQ1’s taxonomy cascades into validity evidence standards (RQ2), sampling adequacy criteria (RQ4), acceptable error thresholds (RQ3), and required comparability level (RQ5) |
| Structure precedes reliability | Convergent dimensionality evidence from RQ2 (§3.3.1–§3.3.2) establishes the score structure whose dependability RQ3 then quantifies via G-Theory; reliability estimates are uninterpretable until structural conclusions are drawn |
| Contamination as a cross-pillar threat | Training-data overlap threatens content validity (RQ2), item representativeness (RQ4), format resistance (RQ3), and score comparability over time (RQ5) — AIG is the structural remedy across all four |
| Small-N constraint | N = 30–300 LLMs constrains all methods; simulation study (RQ2 §3.3.1) determines which tools remain usable at \(N \leq 71\); G-study design (RQ3) optimizes precision under small N; linking designs (RQ5) must account for N limitations |
| Non-normal latent distribution | Bimodal/skewed \(\theta\) triggers RC-IRT selection in the simulation (RQ2 §3.3.1), shapes convergent method recommendations carried into §3.3.2, and complicates IRT-based linking (RQ5) — a single distributional property with consequences across three RQs |
Audience bridge — Literature Review: This chapter reviews the scholarly foundations that this dissertation builds on. For AI scientists: sections 1–2 introduce the psychometric critique of current LLM evaluation and the three-branch taxonomy of AI Psychometrics; sections 4–5 cover the empirical evidence for prompt sensitivity and benchmark contamination. For psychometricians: sections 3 and 6a introduce what is empirically known about LLM psychological profiles and the non-normal ability distributions that require non-standard IRT estimation. Section 8 covers automated item generation — where AI tools assist psychometric test development.
A foundational terminological distinction organizes this review. In machine learning, the dominant term is evaluation — the empirical measurement of model performance on benchmark datasets, used to track progress and compare systems. In psychometrics, the corresponding term is assessment — a broader process encompassing construct definition, instrument design, validity argumentation, score interpretation, and consequential use. The difference is not merely semantic: evaluation asks how well did the model perform?, while assessment asks what does the performance mean, for whom, and under what conditions? This dissertation adopts the psychometric framing throughout, treating LLM benchmark results as assessment scores whose meaning requires the same systematic validity infrastructure that educational and psychological measurement has developed over the past century.
The rapid proliferation of large language models has generated an equally rapid proliferation of evaluation benchmarks — MMLU, GSM8K, HumanEval, BIG-Bench, and hundreds of successors — each producing scalar accuracy scores treated as authoritative indexes of model capability. As Wang et al. (2023) observe, these task-oriented paradigms share a fundamental limitation: they assess performance on predefined, narrow tasks without grounding scores in theoretically defined latent constructs. Models are ranked on dimensions never precisely specified, compared using metrics whose validity is never demonstrated, and evaluated using benchmarks that conflate reasoning ability with training-data memorization. Casabianca (2025) frames this as a three-part mismatch: a theoretical construct mismatch (vague claims about what capabilities are measured), a methodological mismatch (metrics chosen by convention rather than construct alignment), and a reporting and bias misalignment (aggregate scores masking systematic biases). These deficiencies call for the formal measurement frameworks that psychometrics provides.
Psychometricians have been among the most active contributors to the principled study of AI assessment. Mislevy, Steinberg, and Almond’s (2003) Evidence-Centered Design (ECD) framework — which organizes assessment around a competency model, a task model, and an evidence model — has been extended by Casabianca (2025) as the structural template for psychometrically-informed LLM evaluation. Embretson’s (1983) distinction between construct representation (what cognitive processes an item engages) and nomothetic span (how scores relate to external constructs) maps directly onto the LLM validity challenge: a benchmark item may elicit correct responses through several distinct pathways (genuine reasoning, memorization, prompt-pattern matching), and only construct representation analysis can establish which process dominates. Lalor and colleagues (Lalor et al., 2019; Lalor & Yu, 2020) pioneered the application of Item Response Theory to NLP benchmark items, demonstrating that IRT-calibrated item difficulty curves for NLP tasks exhibit the same psychometric structure as educational test items — establishing a formal measurement link between the two traditions. Hernández-Orallo’s (2017) comprehensive framework for measuring natural and artificial intelligence further systematized the theoretical foundations for applying cognitive ability measurement models to AI systems. Collectively, these contributions establish that the psychometric toolkit — validity theory, IRT, Generalizability Theory, equating — is not merely analogically applicable to LLM assessment but is the appropriate scientific framework for it.
The recognition that psychometrics offers the theoretical toolkit for LLM assessment has given rise to the emerging interdisciplinary field of AI Psychometrics, comprehensively reviewed by Ye et al. (2025a). That review organizes the field around the foundational distinction this dissertation adopts: construct-oriented assessment, which asks whether a benchmark measures a well-defined latent construct with appropriate validity evidence, versus task-oriented evaluation, which asks only whether a model succeeds on a given task, without concern for construct interpretation. Contemporary LLM benchmarking is almost exclusively task-oriented; the psychometric literature reviewed in this chapter establishes why construct-oriented assessment is necessary, what it requires, and how it transforms the interpretation of model scores.
Li et al. (2025a) draw a conceptually important three-way distinction among related but non-identical research programs that together constitute the AI Psychometrics landscape. AI Psychometrics applies established psychometric tests and inventories directly to AI systems to characterize their behavioral and psychological profiles. Psychometric AI uses psychometric principles — item calibration, dimensionality assessment, latent trait estimation — to improve the design and evaluation of AI benchmarks. Computational Psychometrics employs AI and machine learning methods to advance psychometric methodology itself, including automated item generation and NLP-based scoring.
The present dissertation occupies primarily the Psychometric AI branch: its goal is to apply and adapt psychometric measurement theory to ensure that LLM benchmark scores are valid, reliable, and comparable. However, findings from the AI Psychometrics branch — particularly regarding what psychological constructs LLMs appear to possess and how those constructs deviate from human analogs — carry important implications for the construct definitions that underlie any Psychometric AI framework. The two branches are therefore treated as complementary rather than independent throughout this review.
Wang et al. (2023) provide the foundational construct-oriented framework for the Psychometric AI branch. Their three-stage model — construct identification, construct measurement, and test validation — maps directly onto the validity argument architecture that this dissertation adopts. Construct identification involves specifying the latent ability to be measured, either through top-down expert consultation (Delphi method) or bottom-up empirical analysis (exploratory factor analysis across tasks). Construct measurement operationalizes the specified construct through item development, Item Response Theory (IRT) calibration, and computerized adaptive testing. Test validation accumulates the five types of validity evidence specified in the Standards for Educational and Psychological Testing (AERA, APA, & NCME, 2014): content, response process, internal structure, relations to other variables, and consequences of use. Applying this three-stage framework to a corpus of 27 cognitive tasks across 29 LLMs, Wang et al. (2023) identify a three-factor construct structure — reasoning, comprehension, and core language modeling — that accounts for systematic covariance in model performance. This finding is theoretically important: it demonstrates that LLM benchmark performance is not monolithic but is organized by underlying latent constructs, suggesting that the unidimensional accuracy scores currently reported conflate distinguishable components of model capability.
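The bottom-up route to construct identification can be illustrated schematically. The R sketch below runs an exploratory factor analysis on a simulated model-by-task accuracy matrix generated from three correlated latent capabilities; it is not a reproduction of Wang et al.'s analysis, and the sample size, task counts, and loadings are invented for illustration.

```r
# Schematic bottom-up construct identification: EFA of a hypothetical
# model-by-task accuracy matrix to ask how many latent capability factors
# organise benchmark performance.
set.seed(42)
n_models <- 120; tasks_per_factor <- 4

# Three correlated latent capabilities (e.g. reasoning, comprehension,
# core language modelling), each driving four task accuracies
latent <- MASS::mvrnorm(n_models, mu = rep(0, 3),
                        Sigma = matrix(c(1, .4, .4, .4, 1, .4, .4, .4, 1), 3))
acc <- do.call(cbind, lapply(1:3, function(f)
  sapply(1:tasks_per_factor, function(i) 0.7 * latent[, f] + rnorm(n_models, 0, 0.6))))
colnames(acc) <- paste0("task", seq_len(ncol(acc)))

# Exploratory factor analysis with an oblique rotation
efa <- factanal(acc, factors = 3, rotation = "promax")
print(efa$loadings, cutoff = 0.3)
```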
Casabianca (2025) extends this foundational framework by proposing Evidence-Centered Design (ECD) as the organizing architecture for psychometrically-informed LLM evaluation. ECD integrates three interlocking models: a competency model specifying what the evaluation claims to measure about the AI system, a task model defining the conditions under which evidence is elicited, and an evidence model mapping observed outputs to inferences about competencies. This architecture ensures that the chain from construct definition to score interpretation is explicit and defensible at each link — precisely what current benchmarking frameworks omit. The ECD framework provides the theoretical scaffolding for the validity argument template developed in RQ2 of this dissertation.
A fourth research track — distinct from Li et al.’s (2025a) three-way taxonomy — has emerged from psychometricians who treat AI not as an object of assessment but as an instrument for advancing educational assessment itself. Where Psychometric AI applies psychometric principles to evaluate AI systems, this practitioner-led track inverts the relationship: it deploys AI and large language models as methodological tools to make assessment more effective, efficient, and equitable. Concrete applications include automated item generation (AIG), in which natural language generation models produce calibrated test items aligned to explicit domain specifications, dramatically reducing the cost and time of large-scale test development; NLP-based automated scoring, in which transformer models score constructed-response and essay items with validity evidence comparable to human raters; computerized adaptive testing (CAT) with AI-enhanced item selection and exposure control; and dynamic instrument construction, in which item pools are tailored in real time to individual respondents’ proficiency estimates. This track is represented in the dissertation by RQ4, which examines whether AIG can produce items with psychometric properties statistically equivalent to expert-authored items — a directly applied question whose answer determines whether AI tools can be incorporated into high-stakes test development pipelines without sacrificing the construct fidelity that educational measurement requires.
The AI Psychometrics branch has generated a body of evidence on the psychological characteristics of LLMs that carries important implications for the construct definitions underlying any benchmarking framework. Pellert et al. (2024) conducted one of the most comprehensive early studies in this tradition, administering four established psychometric inventories — the Big Five Inventory (BFI-44), the Portrait Values Questionnaire-Revised (PVQ-RR), the Moral Foundations Questionnaire, and the Gender/Sex Diversity Beliefs Scale — to six LLM architectures using zero-shot classification. Their most striking finding is that LLMs converge on a highly homogeneous personality profile regardless of architecture: all models score uniformly high on Openness and Extraversion and low on Neuroticism. This convergence suggests that the psychological characteristics of LLMs are determined less by architectural differences than by the shared distributional properties of their training corpora — a finding with direct implications for construct validity, since it implies that trait-like variation across models may primarily reflect training data composition rather than genuine capability differences.
Zhang et al. (2025) substantially extend this line of inquiry by administering 43 standardized psychological scales across 84 dimensions to three generations of ChatGPT (GPT-3.5, GPT-4o, GPT-4o mini). Using representational similarity analysis and Welch’s t-tests against human norms, they document both systematic divergence from human psychological profiles and progressive convergence across model generations: newer versions show greater alignment with human norms across cognitive and personality dimensions. This finding — that architectural and training improvements bring models closer to human psychological profiles — suggests that psychometric profiling can serve as a developmental diagnostic, tracking not only what models can do but how their latent behavioral organization evolves across iterations.
Li and Qi (2025) provide a complementary three-study investigation designed to determine whether LLMs can accurately simulate human psychological responses across personality dimensions and cultural contexts — a question with direct implications for the construct validity of LLM-generated synthetic data in psychometric applications. Study 1 establishes a fundamental methodological finding: temperature settings have surprisingly minimal impact on LLM personality self-reports, whereas prompt template variations produce substantial score differences across models, confirming that format-level construct-irrelevant variance dominates over model-level stochastic variance as the primary source of score instability in psychometric administrations. Study 2 benchmarks LLM personality self-reports against a large normative human database (N = 18,192–49,159 participants), revealing systematic divergence: LLMs uniformly score higher on positive trait dimensions (particularly Extraversion) and lower on negative trait dimensions (particularly Psychopathy and Dark Triad traits) — a profile consistent with training corpus biases toward prosocial language rather than genuine representational diversity across the full human personality spectrum. Study 3 directly tests cultural simulation validity, instructing LLMs to adopt either Chinese or American cultural identities before responding to personality assessments; although statistically significant group differences emerge, both groups exhibit East Asian self-construal patterns, demonstrating that LLMs impose training-data-derived cultural priors that cannot be overridden by prompted identity adoption. Together, these findings reinforce a construct validity conclusion central to this dissertation: LLM response distributions are not independent draws from the same normal distribution that characterizes human examinees but are systematically constrained and biased by training corpus composition — a population-level asymmetry that propagates directly into the non-normal latent trait distributions documented empirically (Liu et al., 2025) and motivates the non-normality-robust estimation strategies developed in §3.3.2 and the non-normal simulation conditions in §3.3.1.
Ye et al. (2025b) introduce Generative Psychometrics for Values (GPV), a novel measurement paradigm that departs from static inventories by using LLMs to dynamically generate context-specific measurement items tailored to individual respondents’ backgrounds and expressed values. Fine-tuning a Llama-3-8B model on human value data, GPV achieves superior validity and stability compared to standard instruments such as the Schwartz Values Survey. For the purposes of this dissertation, GPV is significant as an example of the Computational Psychometrics branch: it demonstrates that LLMs can serve not only as assessment targets but as active components of psychometric measurement instruments, with implications for the automated generation of benchmark items that satisfy explicit domain specifications (addressed in RQ4).
Perhaps the most consequential body of evidence for the present dissertation concerns the systematic threats to construct validity that arise from the distinctive properties of LLMs as measurement targets. Two threats receive the most rigorous empirical treatment in the reviewed literature: prompt sensitivity (the dependence of scores on superficial features of item presentation) and the ability–knowledge confound (the conflation of general reasoning with domain-specific memorized knowledge).
Cohen-Inger et al. (2025) address the first threat directly through the Chameleon Benchmark Overfit Detector (C-BOD), a framework that applies controlled semantic-preserving perturbations to benchmark prompts at three distortion intensities (\(\mu\) = 0.5, 1.0, 1.5) and measures the resulting performance degradation. Across 32 LLMs evaluated on MMLU and GPQA, 81% exhibit statistically significant performance drops when semantically equivalent rephrasing is applied — performance decrements averaging 2.15% at \(\mu = 0.5\) and increasing to 2.75% at \(\mu = 1.0\). These findings operationalize what Messick (1989) termed construct-irrelevant variance: score variation attributable to surface features of the measurement instrument rather than the target construct. More theoretically important is a counterintuitive finding: the performance drop scales log-linearly with model size (\(\Delta_{0.5} = 0.609 \cdot \ln(\text{Parameters}) + 1.303\)), meaning that frontier models are more sensitive to prompt perturbation than smaller models, not less. Cohen-Inger et al. interpret this as evidence that larger models have more deeply overfit to the canonical surface-level patterns of benchmark prompts during training. From the perspective of Generalizability Theory (the analytic framework of RQ3), this finding implies that the prompt-wording facet explains systematically more variance for high-ability LLMs than for low-ability models — a heterogeneous variance structure that motivates model-stratified D-study designs.
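For orientation, the reported log-linear relation can be evaluated directly. The snippet below assumes the parameter count is expressed in billions, an assumption about units that the summary above does not settle.

```r
# Reported log-linear relation between model size and perturbation sensitivity;
# a hypothetical worked example, assuming parameters are counted in billions.
predicted_drop <- function(params_billion) 0.609 * log(params_billion) + 1.303

round(predicted_drop(c(7, 13, 70)), 2)   # roughly 2.5%, 2.9%, 3.9% at mu = 0.5
```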
Lin (2025) provides the most rigorous conceptual analysis of why LLM benchmark scores cannot be straightforwardly interpreted as measures of human-analogous psychological constructs. Identifying six fallacies in the substitution of LLMs for human research participants, Lin establishes that output similarity (what a model produces) does not imply mechanistic equivalence (how the model produces it) or phenomenological equivalence (whether the model experiences anything in doing so). For psychometric purposes, the most consequential of these distinctions is between functional equivalence and mechanistic equivalence: an LLM that consistently selects the correct answer on a medical licensing examination may do so through pattern-matching over training data rather than through the clinical reasoning the examination is designed to assess. This is the construct validity threat the present dissertation addresses through IRT-based ability–knowledge decomposition: the 2PL item discrimination parameter and the bifactor model’s G-score both operationalize the distinction between responses driven by latent reasoning ability and responses driven by domain-specific memorization.
Li et al. (2025b) offer a straightforward demonstration that standard psychometric validity criteria — specifically, criterion-related validity and predictive validity — apply directly to LLM-generated responses. Their study administered survey scales measuring Purchase Intention to four LLMs (GPT-3.5, GPT-4, LLaMA-2, LLaMA-3) and evaluated how well each model’s responses predicted the criterion construct. Predictive validity, indexed by \(R^2\), increased substantially across model generations: GPT-3.5 accounted for 18.4% of criterion variance, while GPT-4 accounted for 44.3%. In psychometric terms, this means that newer models produce responses with stronger criterion-related validity — their scores behave more like valid measurement of the target construct and less like noise. To obtain stable score estimates from models that produce variable output, the authors repeatedly sampled each model at multiple temperature settings and averaged the resulting response distributions; this procedure is conceptually equivalent to the test–retest and parallel-forms reliability designs that psychometricians use to separate true-score variance from occasion-specific error. This aggregation strategy maps directly onto the stochasticity facet in the G-study designs of RQ3, where variance attributable to random sampling in LLM output must be isolated and quantified separately from variance attributable to item content or model capability.
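The variance-decomposition logic this maps onto can be sketched with a small crossed design: models by items by stochastic replications, with variance components estimated from a mixed-effects model. All facet sizes, effect magnitudes, and the resulting generalizability coefficient below are hypothetical.

```r
# Minimal G-study-style variance decomposition (hypothetical data): models
# crossed with items, with several stochastic replications of each
# model-item administration. Variance components estimated with lme4.
library(lme4)

set.seed(7)
n_model <- 30; n_item <- 50; n_rep <- 5
d <- expand.grid(model = factor(1:n_model),
                 item  = factor(1:n_item),
                 rep   = factor(1:n_rep))

# True model ability + item easiness + model-item interaction + replication noise
m_eff  <- rnorm(n_model, 0, 1.0)[d$model]
i_eff  <- rnorm(n_item,  0, 0.7)[d$item]
mi_eff <- rnorm(n_model * n_item, 0, 0.4)[interaction(d$model, d$item)]
d$score <- m_eff + i_eff + mi_eff + rnorm(nrow(d), 0, 0.5)

fit <- lmer(score ~ 1 + (1 | model) + (1 | item) + (1 | model:item), data = d)
vc  <- as.data.frame(VarCorr(fit))
vc[, c("grp", "vcov")]   # sigma^2(model), sigma^2(item), sigma^2(model x item), residual

# Generalizability coefficient for a relative decision averaging over n_item items
# and n_rep replications; replication variance is absorbed into the residual here.
s2 <- setNames(vc$vcov, vc$grp)
Erho2 <- s2["model"] /
  (s2["model"] + s2["model:item"] / n_item + s2["Residual"] / (n_item * n_rep))
round(unname(Erho2), 3)
```

A D-study in the sense of RQ3 would then re-evaluate this coefficient for alternative choices of item, prompt, and replication counts rather than the single design simulated here.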
The psychometric requirement that benchmark item pools be representative samples from a well-specified domain (RQ4) confronts a practical constraint: comprehensive evaluation across large item pools is computationally expensive and logistically demanding. Two papers directly address the efficiency-representativeness trade-off from complementary perspectives.
Li and Xiong (2025) frame benchmark efficiency as a combinatorial optimization problem, proposing an approach that selects representative item subsets from existing benchmarks using simulated annealing algorithms with semantic preservation verified through Wasserstein distance. Applied to MMLU, HellaSwag, and GSM8K, the method achieves an average compression rate of approximately 19% of original dataset size while outperforming clustering-based alternatives on reliability (measured by L2 norm against full-dataset scores). The Wasserstein distance criterion is particularly noteworthy from a psychometric standpoint: it operationalizes the requirement that a representative subsample maintain the distribution of the original item pool, not merely its mean difficulty — a condition that parallels the stratified sampling principles of classical test blueprint methodology.
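A stripped-down version of this distribution-matching idea is sketched below: a simulated-annealing search over fixed-size item subsets that minimizes the one-dimensional Wasserstein distance between the subset's difficulty distribution and the full pool's. The difficulties, subset size, and cooling schedule are invented, and the semantic-coverage constraints of the original method are omitted.

```r
# Representativeness-preserving item subsampling sketch (hypothetical pool):
# keep a k-item subset whose difficulty distribution stays close to the pool's.
set.seed(11)
pool_b <- c(rnorm(400, -0.5, 0.8), rnorm(245, 1.2, 0.6))  # 645 item difficulties
k      <- 125                                             # target subset size (~19%)

# 1-D Wasserstein-1 distance to the full pool, via matched quantiles
q_grid <- seq(0.01, 0.99, by = 0.01)
pool_q <- quantile(pool_b, q_grid)
w1_to_pool <- function(idx) mean(abs(quantile(pool_b[idx], q_grid) - pool_q))

# Simulated annealing over subsets: propose single-item swaps, occasionally
# accept worse moves with a temperature-controlled probability
subset_idx <- sample(length(pool_b), k)
best_idx   <- subset_idx
temp       <- 0.05
for (step in 1:5000) {
  cand <- subset_idx
  cand[sample(k, 1)] <- sample(setdiff(seq_along(pool_b), cand), 1)
  d_old <- w1_to_pool(subset_idx); d_new <- w1_to_pool(cand)
  if (d_new < d_old || runif(1) < exp((d_old - d_new) / temp)) subset_idx <- cand
  if (w1_to_pool(subset_idx) < w1_to_pool(best_idx)) best_idx <- subset_idx
  temp <- temp * 0.999
}
round(c(random_subset   = w1_to_pool(sample(length(pool_b), k)),
        annealed_subset = w1_to_pool(best_idx)), 4)
```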
Yuan et al. (2025) approach efficiency from a different angle, arguing that the failure of static benchmarks to accurately estimate individual model performance stems from the fundamental heterogeneity of model response patterns — what they term prediction inconsistency. Their TailoredBench framework constructs model-specific item coresets dynamically, selecting the subset that maximizes prediction consistency between a target model and its nearest neighbors in performance space. Across five benchmarks (ARC Challenge, HellaSwag, GSM8K, Winogrande, POPE) and 300+ models, TailoredBench achieves an average 31.4% reduction in Mean Absolute Error relative to static sampling baselines while requiring only 20–40 items for near-optimal performance estimation. The prediction consistency criterion that drives their adaptive selection directly instantiates the psychometric principle of parallel test forms: a valid abbreviated benchmark must rank models in the same order as the full benchmark would — a condition equivalent to the Spearman-Brown requirements for score comparability across forms of different lengths (addressed in RQ5).
A persistent limitation of current LLM benchmarking is the opacity of aggregate scores: a model’s overall MMLU accuracy provides little actionable information about which specific capabilities it possesses or lacks. Tian et al. (2025) address this through SkillVerse, a hierarchical diagnosis framework that organizes model proficiency into skill dendrograms derived from atomic judgments extracted from LLM-generated critiques. Using agglomerative hierarchical clustering on semantic embeddings of critique statements, SkillVerse creates nested skill clusters that reveal capability profiles from coarse to fine-grained levels — for example, distinguishing that a model achieves 78% success on “write functional programs” tasks but only 55% on “debugging” tasks within the same broad coding domain.
The hierarchical skill structure SkillVerse produces is directly analogous to the bifactor model proposed in this dissertation’s construct validity analysis (RQ2, Section 3.3.2): both frameworks posit a general capability dimension (the dendrogram’s root) and domain-specific subdimensions (the leaves), and both use within-domain performance patterns to infer latent structure. SkillVerse achieves 25% relative improvement in in-context learning by selecting demonstrations that address identified skill gaps, and successfully predicts model weaknesses on unseen scenarios at a 55% rate — demonstrating that the hierarchical construct structure yields actionable downstream applications rather than purely descriptive insights. The finding that hierarchical skill organization reveals inverse scaling on specific tasks (larger models underperforming smaller ones on “writing shell commands”) parallels the inversion-count phenomenon in the present dissertation’s IRT analysis: both identify cases where the unidimensional ability assumption fails and domain-specific knowledge patterns deviate from the expected difficulty ordering.
A foundational but frequently underscrutinized assumption of item response theory models is that the latent trait \(\theta\) is normally distributed in the calibration population. When item parameters are estimated via marginal maximum likelihood (MML) with Gaussian quadrature — the default in standard software including mirt, flexMIRT, and BILOG-MG — the likelihood is numerically integrated over a fixed normal quadrature grid, effectively treating any non-normal mass in the true latent distribution as zero. When this normality assumption is violated, the consequences for parameter recovery are systematic and well-documented. As Reise et al. (2018) summarize, drawing on the seminal simulation work of Woods and Thissen (2006), normality violations produce nontrivially biased estimates of item difficulty parameters at the extremes of the distribution, and bias increases monotonically with the degree of skewness, with items at the tails of the ability range suffering the most severe distortion. Bahry (2012), in a comprehensive simulation study of the Graded Response Model across seven sample sizes (n = 100 to n = 3,000) and three latent distribution shapes, confirms this: extreme skewness produces the poorest parameter recovery regardless of sample size, and even under normality a minimum of n = 750 examinees is required for adequate estimation — a threshold the LLM dataset (n = 71 models) falls well below under standard MML assumptions, making non-normal-robust estimation not merely preferable but necessary.
This normality assumption is structurally violated in LLM benchmark assessment on at least two dimensions simultaneously. First, as documented by Liu et al. (2025) and corroborated by Pellert et al. (2024) and Li and Qi (2025), the LLM population is not a sample from a continuous unimodal normal ability distribution: it is a mixture of distinct capability tiers — frontier commercial models clustered at high \(\theta\) values, mid-range models near the mean, and smaller open-source models below the mean — producing a bimodal or negatively skewed latent distribution. Second, the “sample size” of the LLM examinee population is far smaller than typical psychometric calibration studies, compounding the bias that arises from distributional misspecification (Bahry, 2012). Under normality-assuming MML, the high-\(\theta\) frontier model cluster would be compressed toward the center of the assumed normal distribution while the low-\(\theta\) open-source cluster would be inflated upward, systematically biasing the discrimination parameters of items that primarily differentiate frontier from mid-range models — precisely the items of greatest scientific interest for benchmark validity analysis.
The psychometric literature has produced several estimator families specifically designed for non-normal latent traits. Ramsay Curve IRT (RC-IRT; Woods & Thissen, 2006) uses polynomial splines to estimate the latent distribution shape simultaneously with item parameters — a nonparametric approach requiring no distributional assumptions beyond smoothness. Preston and Reise (2014) evaluated RC-IRT for the Nominal Response Model under normal, skewed, and bimodal latent distributions, finding substantial reductions in item parameter bias under non-normal conditions and recommending that “RC-IRT estimation be implemented whenever a researcher considers the construct being measured has the potential of being nonnormally distributed” — a recommendation the present dissertation adopts for its parametric IRT calibration stages. Reise et al. (2018) survey four complementary alternative IRT modeling strategies for non-normal conditions — zero-inflated mixture models (appropriate when the latent variable applies only to a subset of the population, producing excess zeros), log-logistic models (for unipolar constructs meaningful at only one end of the continuum), Ramsay curve estimation (the general nonparametric solution), and heteroskedastic-skew models (which additionally allow residual variance to increase as a function of trait level) — illustrating each on PROMIS Anger scale data and demonstrating that model choice must be theory-driven: the selected distributional form should reflect substantive knowledge about the construct’s population structure rather than computational convenience. For LLM ability assessment, the bimodal clustering of models by capability tier most closely maps to the mixture-model framework (treating frontier and non-frontier models as two latent classes), but the Ramsay curve approach is practically preferable because it requires no a priori specification of the number of mixture components. On the Bayesian side, Zhang et al. (2021) introduce a Markov chain Monte Carlo algorithm that represents the latent distribution as a Davidian curve — a semi-nonparametric flexible family capable of approximating arbitrary unimodal or bimodal shapes — and demonstrate via simulation across varying sample sizes and response category counts that, under informative priors, this approach achieves lower bias and RMSE than EM-based estimation for both skewed and bimodal true distributions. The Davidian curve estimator is directly available in the mirt package via the dentype = "Davidian-#" option, where # specifies the polynomial order selected by information criteria such as the Hannan-Quinn criterion (Zhang et al., 2021).
For the present dissertation, this literature establishes two methodological imperatives that are implemented in §3.3.2 and §3.3.1 respectively. First, the bifactor IRT estimation in RQ2 must use an empirical histogram (empiricalhist = TRUE in mirt) or Davidian curve approximation rather than normality-assuming MML, to avoid biased G-factor and \(S_k\)-factor loadings that would propagate into inaccurate \(\omega_h\), ECV, and PUC statistics. Second, the simulation study comparing psychometric and ML dimensionality methods must include explicitly bimodal and negatively skewed latent distribution conditions — the conditions the empirical LLM data most closely resemble — to evaluate which methods remain valid and which fail systematically under the distributional violations documented in this section.
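A minimal sketch of the first imperative, using simulated data, calibrates the same bimodal-population response matrix under the default normal density and under a Davidian-curve density in mirt. The item parameters, tier proportions, and polynomial order are assumptions; the unidimensional case is shown for brevity, and at N = 71 noisy estimates and convergence warnings are expected and are themselves illustrative of the small-N constraint.

```r
# Non-normality-robust calibration sketch: bimodal "LLM population",
# 2PL calibration under Gaussian MML vs. a Davidian-curve latent density.
library(mirt)

set.seed(123)
n_items <- 40; N <- 71
a_true <- runif(n_items, 0.8, 1.6)
d_true <- rnorm(n_items, 0, 0.8)
tier   <- rbinom(N, 1, 0.4)                                # frontier vs. open-source tier
theta  <- matrix(rnorm(N, mean = ifelse(tier == 1, 1.2, -0.8), sd = 0.5), ncol = 1)
dat    <- simdata(a = matrix(a_true, ncol = 1), d = matrix(d_true, ncol = 1),
                  N = N, itemtype = "dich", Theta = theta)

fit_gauss <- mirt(dat, 1, itemtype = "2PL", verbose = FALSE)   # normality-assuming MML
fit_dav   <- mirt(dat, 1, itemtype = "2PL",                    # Davidian-curve density;
                  dentype = "Davidian-4", verbose = FALSE)     # order chosen a priori here

# Slope recovery against the generating values under each density assumption
a_gauss <- coef(fit_gauss, simplify = TRUE)$items[, "a1"]
a_dav   <- coef(fit_dav,   simplify = TRUE)$items[, "a1"]
round(c(cor_gauss = cor(a_true, a_gauss), cor_dav = cor(a_true, a_dav)), 3)

anova(fit_gauss, fit_dav)   # AIC/BIC comparison of the two density assumptions
```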
The practical stakes of psychometrically-rigorous LLM evaluation are most apparent in high-stakes educational assessment contexts, where benchmark scores directly influence consequential decisions about student competence. Swiecki et al. (2022) provide the most comprehensive critical analysis of how AI-driven assessment methods map onto established psychometric validity requirements, identifying four systemic limitations of the Standard Assessment Paradigm (SAP) inherited from large-scale standardized testing: assessments are onerous (costly to design), discrete (providing snapshots rather than longitudinal views), uniform (ignoring individual background variation), and inauthentic (disconnected from real-world task contexts). AI-driven assessment partially addresses these limitations through automated item generation, adaptive scoring, and process-embedded assessment, but Swiecki et al. argue that these advantages are accompanied by documented risks of algorithmic bias — disproportionate impacts on minority students and non-standard language users — that require explicit psychometric mitigation.
Mendonça et al. (2025) provide a focused empirical test of automated LLM scoring validity in a formative assessment context, comparing GPT-4o and LLaMA 3.2 against human evaluators for grading programming course assignments. Using equivalence testing rather than simple correlation, they find that GPT-4o achieves statistically and practically equivalent grading patterns to human raters — a stronger validity claim than mere correlation because equivalence testing establishes that the measurement difference is below a pre-specified threshold of practical significance. The differential consistency across question types (highest for code-based items, lower for open-ended conceptual questions) reflects a pattern familiar in psychometrics: automated scoring validity is construct-dependent, with well-structured response spaces supporting higher inter-rater reliability than ill-structured ones.
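The equivalence-testing logic is worth making explicit, since it differs from the significance tests most benchmark papers report. The sketch below applies the two one-sided tests (TOST) procedure to hypothetical paired grades with an assumed equivalence bound; it is not Mendonça et al.'s exact analysis.

```r
# TOST equivalence sketch: hypothetical paired grades on a 0-10 scale and an
# assumed equivalence bound of +/- 0.5 points (the smallest difference of
# practical concern). Equivalence is claimed only if BOTH one-sided tests reject.
set.seed(3)
human <- pmin(pmax(rnorm(80, 7.2, 1.5), 0), 10)          # human-assigned grades
llm   <- pmin(pmax(human + rnorm(80, 0.1, 0.6), 0), 10)  # automated grades, same work
delta <- 0.5

diffs   <- llm - human
p_lower <- t.test(diffs, mu = -delta, alternative = "greater")$p.value
p_upper <- t.test(diffs, mu =  delta, alternative = "less")$p.value
c(p_lower = p_lower, p_upper = p_upper, equivalent = max(p_lower, p_upper) < .05)
```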
A theme that cuts across the construct validity, sampling, and applied contexts reviewed above is automated item generation (AIG) — the use of computational methods, and increasingly large language models themselves, to produce assessment items that satisfy explicit psychometric and domain-coverage criteria. AIG is directly relevant to this dissertation’s RQ4, which asks how adequately current benchmark item pools operationalize domain specifications and test blueprints: where AIG succeeds, it offers a principled mechanism for constructing representative item pools at scale; where it fails, its failure modes illuminate the same construct validity threats identified in Sections 4–5.
Zou et al. (2022) address the specific challenge of generating true/false items — a format widely used in low-stakes knowledge checks but poorly represented in the AIG literature, which has focused predominantly on multiple-choice generation. Their dual-framework approach combines a template-based pipeline (extracting subject–predicate–object triples from source texts via dependency parsing and applying truth-value transformations) with a model-based pipeline (fine-tuning a BART sequence-to-sequence model to generate false statements as controlled semantic distortions). Evaluating 1,000 generated items on a Wikipedia corpus, they achieve human acceptance rates exceeding 80% and inter-rater agreement (Kendall’s \(\tau\) > 0.80), establishing that LLM-assisted generation can approximate expert-authored item quality for factual true/false content. For benchmark construction purposes, the template-based component directly instantiates the test-blueprint operationalization that RQ4 requires: by grounding item generation in dependency-parsed semantic triples from domain texts, the approach ensures that generated items sample systematically from the domain’s knowledge graph rather than idiosyncratically from a model’s internal associations.
Zu et al. (2023) shift the focus from item stems to distractors, the incorrect response options that determine the psychometric difficulty and discrimination of multiple-choice items. Distractor quality is a critical determinant of item discrimination (and hence of information at target \(\theta\) values in IRT), yet constructing plausible distractors is widely recognized as the most labor-intensive component of expert item writing. Their approach fine-tunes GPT-2 on a prompt-based learning framework that conditions distractor generation on the item stem, the keyed correct answer, and a domain context window. Evaluation against expert-authored distractors using the Subject and Fold Index (SFI) — a composite metric balancing distractor subject-relevance against answer-fold discriminability — yields a median SFI of 52.20 for model-generated distractors compared to 51.04 for expert-authored ones, a non-significant difference that supports the practical equivalence of model-generated and human-authored distractors under controlled conditions. The SFI metric’s dual emphasis on semantic relevance and discriminability operationalizes the psychometric requirement that distractors target specific misconceptions rather than function as arbitrary foils — directly connecting AIG quality criteria to the IRT-based item parameter estimation in RQ4’s calibration analyses.
Chan et al. (2024) provide the most systematic evaluation of LLM-based AIG in STEM education contexts, comparing GPT-3.5 and GPT-4 across three prompting strategies (standard prompting, SP; Chain-of-Thought, CoT; and Chain-of-Thought with Contextual Learning, CoT-CL) for generating physics and chemistry assessment items. Their factorial design crosses model \(\times\) prompting strategy \(\times\) subject domain, yielding a comprehensive picture of the conditions under which LLM-based AIG succeeds and fails. Key findings include: GPT-4 with CoT-CL achieves the highest overall item quality scores, but domain specificity interacts significantly with prompting strategy — chemistry items score 1.91/2 on expert quality ratings while physics items score only 1.19/2 (p < .001), a gap that Chan et al. attribute to the greater prevalence and structure of chemistry problems in pre-training data. This domain \(\times\) model interaction directly illustrates the construct-irrelevant variance mechanisms identified in Section 4: AIG quality is not a property of the prompting strategy alone but depends critically on the alignment between the target domain’s representation in the model’s training data and the domain specification the test blueprint requires. For RQ4, this finding implies that AIG-assisted item pool construction requires domain-specific calibration — the same CoT-CL strategy that produces publication-quality chemistry items may produce psychometrically inadequate physics items — and that post-hoc IRT calibration of AIG-produced items is a necessary quality-control step rather than an optional supplement.
The papers reviewed above treat LLMs as the output of AIG — objects whose quality is evaluated by psychometricians. An equally important and growing line of work positions LLMs as active instruments within the psychometric measurement pipeline: tools that psychometricians deploy to perform tasks previously requiring large human samples or teams of subject-matter experts. Three recent studies exemplify this role.
Liu, Bhandari, and Pardos (2025) take the most direct step toward replacing a core psychometric resource with LLMs: they use six LLMs (GPT-4, GPT-3.5, Llama 2, Llama 3, Gemini-Pro, and Cohere Command R Plus) as synthetic respondents for IRT-based item calibration in College Algebra, comparing LLM-generated item parameters against those derived from 50 human undergraduates. Using the Rasch model, they find that item difficulty parameters estimated from LLM respondents correlate substantially with human-calibrated parameters — Spearman \(\rho\) = 0.87 for GPT-3.5 and \(\rho\) = 0.78 for GPT-4 — and that a resampling-based ensemble of LLMs improves this correlation to \(\rho\) = 0.93, approaching the reliability ceiling of the human benchmark. A critical finding with direct implications for the dissertation’s simulation study (RQ2, §3.3.1) is the distributional asymmetry: all LLM proficiency distributions are dramatically narrower than the human distribution (LLM SD range 0.29–0.58 versus human SD = 0.98), and most LLMs cluster at higher-than-human mean proficiency. This confirms precisely the non-normal, cluster-skewed latent distribution for the LLM population that motivates the distribution-free dimensionality methods prioritized in the simulation design. The practical implication is both cost-saving and construct-validity-relevant: LLMs can serve as low-cost virtual respondent panels for pre-calibrating item pools before expensive human data collection, but the distributional mismatch must be modeled and corrected — not ignored — for item parameters to transfer validly to the human measurement scale.
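The core of that comparison, stripped of the real data, looks like the sketch below: Rasch difficulties calibrated separately from a simulated human panel and a simulated panel of LLM runs, then compared by Spearman correlation. Panel sizes, item difficulties, and the narrow high-mean LLM distribution are assumptions chosen only to mirror the qualitative pattern the study reports.

```r
# Schematic synthetic-respondent calibration: Rasch difficulties from a
# simulated human panel vs. a simulated ensemble of LLM runs, then rank-correlated.
library(mirt)

set.seed(99)
n_items <- 30
b_true  <- runif(n_items, -1.5, 1.5)          # generating difficulties
d_true  <- matrix(-b_true, ncol = 1)          # Rasch intercept is -b when a = 1
a_one   <- matrix(1, n_items, 1)

theta_human <- matrix(rnorm(80, 0, 1), ncol = 1)     # human panel: wide spread
theta_llm   <- matrix(rnorm(60, 0.8, 0.4), ncol = 1) # resampled LLM runs: narrow, high mean

resp_human <- simdata(a = a_one, d = d_true, N = 80, itemtype = "dich", Theta = theta_human)
resp_llm   <- simdata(a = a_one, d = d_true, N = 60, itemtype = "dich", Theta = theta_llm)

b_human <- -coef(mirt(resp_human, 1, itemtype = "Rasch", verbose = FALSE),
                 simplify = TRUE)$items[, "d"]
b_llm   <- -coef(mirt(resp_llm,   1, itemtype = "Rasch", verbose = FALSE),
                 simplify = TRUE)$items[, "d"]

cor(b_human, b_llm, method = "spearman")   # rank agreement between the two calibrations
```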
Lee, Son, and Jia (2025) address a complementary challenge in the AIG pipeline: the bottleneck of post-hoc human quality review. Rather than relying on a retrospective expert panel to evaluate items after generation, they propose the LLM-based Multi-Agent AIG system (LM-AIG), in which LLMs themselves are organized into a structured team of specialized agents — an Item Writer Agent, a Critic Agent, and three reviewer agents covering content validity (correspondence and distinctiveness), linguistic quality, and potential demographic bias — coordinated through a Meta Editor Agent that synthesizes inter-agent feedback and integrates human expert input. The framework operationalizes established content validity standards (Hinkin & Tracey, 1999): the Content Reviewer Agent computes both the correspondence score (how well an item aligns with the target construct) and the distinctiveness score (how well it discriminates the construct from adjacent ones) using a 7-point rating scale. Implemented via the AutoGen framework and applied to psychological items measuring attitudes toward AI in the workplace, the system produces items evaluated as meeting psychometric quality standards by human raters on construct relevance, linguistic clarity, reading level appropriateness, contextual specificity, and absence of demographic bias. The significance for the broader AIG literature is architectural: by embedding quality control inside the generation pipeline rather than appending it post-hoc, LM-AIG reduces the accumulation of errors and the expert labor required, and — critically — makes the validity evaluation criteria explicit and computable at the item level rather than relying on holistic human judgment.
Li, Zhang, Tang, and Li (2026) extend LLM-based AIG to one of the most construct-sensitive and scenario-dependent item formats in psychometrics: personality situational judgment tests (SJTs). SJTs present examinees with realistic situational vignettes and response options designed to elicit behaviorally-grounded responses that are less susceptible to social desirability and deliberate falsification than traditional Likert-type self-reports. Generating psychometrically sound SJTs with LLMs is substantially harder than generating factual knowledge items — it requires the model to construct coherent situational scenarios, produce behavioral response options that discriminate the target personality facet from adjacent facets, and maintain scoring consistency — yet it is precisely in this high-difficulty case that LLM-based AIG most dramatically reduces expert burden. Li et al.’s three-study design systematically evaluates the generation process from prompt optimization (Study 1: temperature = 1.0 with structured prompt incorporating Chain-of-Thought reasoning, expert persona adoption, and explicit construct definitions achieves the highest Content Validity Index, CVI = 0.76) through cross-model generalizability and temporal reproducibility (Study 2: ChatGPT-5 replicates GPT-4 results across five independent generation rounds with no significant round-to-round or facet-specific differences in CVI) to full psychometric validation with 443 participants (Study 3). The Study 3 psychometric results demonstrate that LLM-generated SJTs covering five Big Five facets — self-consciousness, gregariousness, openness to ideas, compliance, and self-discipline — achieve satisfactory reliability and construct validity across most facets, with convergent validity against NEO-PI-R facet subscales meeting acceptable thresholds. The primary limitation — weaker convergent validity for the compliance facet and modest criterion-related validity — aligns with known challenges in personality SJT scoring that affect human-authored instruments as well, suggesting that the LLM-generation process does not introduce new psychometric deficiencies beyond those inherent in the item format itself. For the present dissertation, Li et al. (2026) provide the strongest available evidence that LLM-based AIG is not limited to factual knowledge domains but extends to the full construct complexity of non-cognitive personality measurement — precisely the kind of generalization that would be required if LLM-based AIG is to become a standard tool in the psychometrician’s test development workflow.
Gorgun and Bulut (2025) address the post-generation quality control bottleneck directly, demonstrating the feasibility of using instruction-tuned LLMs to evaluate automatically generated items at scale. Framing item evaluation as a supervised binary classification task — distinguishing good items (testing a key concept and reasonable to answer) from bad items (vague, unreasonable, or trivially answerable from context) — they fine-tuned Llama 3-8B on the Mind the Gap dataset (Becker et al., 2012), a collection of 2,252 automatically generated cloze items derived from Wikipedia articles and rated by Amazon Mechanical Turk crowdsource workers. Using parameter-efficient fine-tuning via QLoRA to manage computational constraints, their instruction-tuned model achieved 78% overall accuracy with an AUC-ROC of 0.77, an overall F1 of 0.77, and — critically from a practical standpoint — an F1 of 0.82 specifically for the bad item class, indicating that the model is more effective at flagging defective items than at certifying good ones. This asymmetric precision is operationally valuable: in a high-throughput AIG pipeline, a conservative filter that reliably removes most bad items — even at the cost of occasionally discarding borderline-acceptable ones — substantially reduces the expert review burden without requiring near-perfect classification. Gorgun and Bulut frame instruction-tuned LLMs as an intermediate quality gate positioned between generation and field testing, automating an evaluation step that currently scales linearly with item volume and therefore constitutes the primary bottleneck to AIG deployment at operational scale. For the present dissertation’s RQ4, which evaluates whether AIG-produced items achieve psychometric parity with expert-authored benchmark items, this work provides direct methodological precedent: if an LLM-based quality filter can screen item pools prior to calibration, the psychometric comparison between AIG and human-authored items is no longer confounded by the inclusion of item-generation failures in the AIG pool.
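The reported evaluation metrics (overall accuracy, AUC-ROC, and per-class F1) can be reproduced for any item-quality filter with standard scikit-learn calls; the sketch below uses invented labels and scores purely to show how the asymmetric per-class F1 is computed, with the bad-item class coded as 1.

```python
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

# Invented outputs from a hypothetical item-quality classifier (1 = bad item).
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_prob = [0.9, 0.2, 0.7, 0.45, 0.6, 0.1, 0.55, 0.3, 0.35, 0.4]
y_pred = [int(p >= 0.5) for p in y_prob]

print("accuracy:", accuracy_score(y_true, y_pred))
print("AUC-ROC:", roc_auc_score(y_true, y_prob))
print("F1, bad-item class:", f1_score(y_true, y_pred, pos_label=1))
print("F1, good-item class:", f1_score(y_true, y_pred, pos_label=0))
```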
Oka, Tan, Ishioka, and Okada (2025) address a complementary but distinct challenge in LLM-based item construction: the systematic control of multiple-choice item difficulty through principled manipulation of distractor properties. Using GPT-4o to generate English grammar gap-fill items for third-year Japanese junior high school students, they constructed seven parallel item sets by holding item stems and correct answers constant while varying distractor complexity across three linguistic dimensions — word count, morphosyntactic complexity (tense and number inflection), and vocabulary complexity — each at two difficulty levels, producing a full 3 \(\times\) 2 factorial design plus an unmodified baseline. Following online administration to 693 crowdsourced Japanese-speaking participants, they applied 2PL-IRT with Stocking-Lord parameter linking across the seven item sets and used Bayesian MCMC estimation with posterior ANOVA to compare mean difficulty parameters across distractor conditions. The critical psychometric finding is counterintuitive but theoretically principled: increasing distractor linguistic complexity systematically decreases item difficulty (F(2,38) = 8.59, p < .001 for vocabulary complexity; word count and morphosyntactic effects were non-significant). The mechanism is construct-theoretic — greater semantic or lexical distance between distractors and the correct answer increases the perceptual salience of the correct option, allowing test-takers to identify it through exclusion rather than recall. This finding has direct implications for distractor-based difficulty engineering in AIG: prompt instructions that request semantically close, morphosyntactically matched, and lexically parallel distractors produce harder items than instructions requesting sophisticated or unusual distractors, which paradoxically simplify the measurement task. The IRT-based validation framework Oka et al. develop — treating empirical item parameters estimated from real examinee responses as the criterion against which LLM-generated item properties are evaluated — constitutes precisely the psychometric integration that the present dissertation advocates as standard practice for AIG quality assurance in RQ4.
The literature reviewed in this chapter collectively establishes four interconnected conclusions. First, the existing body of psychometric research on LLM evaluation provides robust theoretical foundations, particularly the construct-oriented evaluation framework (Wang et al., 2023), the validity argument approach (Casabianca, 2025), and the three-branch taxonomy of AI Psychometrics (Ye et al., 2025) — but these frameworks have been developed largely at a conceptual level and have not been integrated into a unified empirical psychometric pipeline applied to a large-scale medical benchmark dataset.
Second, the most consequential construct validity threats — prompt sensitivity as construct-irrelevant variance (Cohen-Inger et al., 2025), the ability–knowledge confound (Lin, 2025), and the non-normal latent trait distribution of the LLM population (Ye et al., 2025) — have been identified and partially characterized but have not been formally incorporated into a Generalizability Theory framework that simultaneously estimates their magnitude, sources, and remediability through D-study optimization.
Third, the bifactor model’s capacity to decompose general reasoning ability from domain-specific knowledge has been proposed conceptually (Wang et al., 2023; Ye et al., 2025) and demonstrated at small scale, but has not been applied to a USMLE-domain benchmark with the full complement of diagnostic statistics — \(\omega_h\), ECV, PUC — that translate the factor structure into actionable construct validity decisions.
Fourth, the emerging literature on automated item generation (Zou et al., 2022; Zu et al., 2023; Chan et al., 2024) demonstrates that LLMs can produce test items of near-expert quality under controlled conditions, but has not been integrated with IRT-based domain specification analysis to identify which item types, distractor configurations, and subject domains are most amenable to AIG-assisted benchmark expansion — nor has the domain \(\times\) prompting interaction documented by Chan et al. been systematically modeled as a source of test blueprint misrepresentation.
The present dissertation addresses all four gaps through a unified psychometric pipeline: Kane’s validity argument structure (RQ1) \(\rightarrow\) bifactor ability–knowledge decomposition with \(\omega_h\)/ECV/PUC decision rules (RQ2) \(\rightarrow\) IRT-calibrated domain specification, test blueprint analysis, and AIG-informed item quality auditing (RQ4) \(\rightarrow\) Generalizability Theory G-study and D-study on bifactor G-scores with prompt sensitivity as an explicit stochasticity facet (RQ3) \(\rightarrow\) IRT-based score linking and concordance analysis for cross-benchmark comparability (RQ5).
All references formatted to APA 7th edition. DOIs verified via publisher websites and ACL Anthology. Notes on citation verification are appended where discrepancies were found during DOI search.
Bahry, L. M. (2012). Polytomous item response theory parameter recovery: An investigation of nonnormal distributions and small sample size [Master’s thesis, University of Alberta]. University of Alberta Libraries. https://doi.org/10.7939/R3PK0737G
Carroll, J. B. (1993). Human cognitive abilities: A survey of factor-analytic studies. Cambridge University Press. https://doi.org/10.1017/CBO9780511571312
Casabianca, J. M. (2025). Psychometrics is all you need [Preprint]. EdArXiv. https://doi.org/10.35542/osf.io/7w6pz
Chan, K. W., Ali, F., Park, J., Sham, K. S. B., Tan, E. Y. T., Chong, F. W. C., & Sze, G. K. (2024). Automatic item generation in various STEM subjects using large language model prompting. Computers and Education: Artificial Intelligence, 8, Article 100344. https://doi.org/10.1016/j.caeai.2024.100344
Cohen-Inger, N., Elisha, Y., Shapira, B., Rokach, L., & Cohen, S. (2025). Forget what you know about LLMs evaluations — LLMs are like a chameleon. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (pp. 21664–21677). Association for Computational Linguistics. https://doi.org/10.18653/v1/2025.emnlp-main.1098
Embretson, S. E. (1983). Construct validity: Construct representation versus nomothetic span. Psychological Bulletin, 93(1), 179–197. https://doi.org/10.1037/0033-2909.93.1.179
Gorgun, G., & Bulut, O. (2025). Instruction-tuned large-language models for quality control in automatic item generation: A feasibility study. Educational Measurement: Issues and Practice, 44(1), 96–107. https://doi.org/10.1111/emip.12663
Hernández-Orallo, J. (2017). The measure of all minds: Evaluating natural and artificial intelligence. Cambridge University Press. https://doi.org/10.1017/9781316594179
Hernández-Orallo, J., Dowe, D. L., & Hernández-Lloreda, M. V. (2014). Universal psychometrics: Measuring cognitive abilities in the machine kingdom. Cognitive Systems Research, 27, 50–74. https://doi.org/10.1016/j.cogsys.2013.06.001
Lalor, J. P., Wu, H., & Yu, H. (2019). Learning latent parameters without human response patterns: Item response theory with artificial crowds. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP–IJCNLP) (pp. 4249–4259). Association for Computational Linguistics. https://doi.org/10.18653/v1/D19-1434
Citation note: The dissertation body cites this as a 2019 RepEval workshop paper titled “Improving and generalizing natural language inference by using item response theory.” The verified 2019 Lalor, Wu, & Yu paper with an IRT focus is the EMNLP-IJCNLP paper above. Please confirm the intended citation with the original source.
Lalor, J. P., & Yu, H. (2020). Dynamic data selection for curriculum learning via ability estimation. In Findings of the Association for Computational Linguistics: EMNLP 2020 (pp. 545–555). Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.findings-emnlp.48
Citation note: The dissertation body cites this as a 2020 BlackboxNLP paper titled “Understanding deep contextualized embeddings using attention visualizations.” The verified 2020 Lalor & Yu paper is the EMNLP Findings paper above. Please confirm the intended citation with the original source.
Lee, P., Son, M., & Jia, Z. (2025). AI-powered automatic item generation for psychological tests: A conceptual framework for an LLM-based multi-agent AIG system. Journal of Business and Psychology, 41, 71–99. https://doi.org/10.1007/s10869-025-10067-y
Li, C., & Qi, Y. (2025). Toward accurate psychological simulations: Investigating LLMs’ responses to personality and cultural variables. Computers in Human Behavior, 170, Article 108687. https://doi.org/10.1016/j.chb.2025.108687
Li, C.-J., Zhang, J., Tang, Y., & Li, J. (2026). Automatic item generation for personality situational judgment tests with large language models. Computers in Human Behavior Reports, 21, Article 100964. https://doi.org/10.1016/j.chbr.2026.100964
Li, G., & Xiong, D. (2025). Towards optimal evaluation efficiency for large language models. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (pp. 14176–14183). Association for Computational Linguistics. https://doi.org/10.18653/v1/2025.emnlp-main.716
Li, Y., Lin, X., Sha, Z., Jin, Z., & Li, X. (2025a). AI psychometrics: A three-way taxonomy. In Proceedings of the 58th Hawaii International Conference on System Sciences (HICSS-58) (pp. 5194–5202). ScholarSpace, University of Hawai’i at Mānoa. https://hdl.handle.net/10125/109446
Citation note: HICSS papers are archived in ScholarSpace (hdl.handle.net). The handle above points to the HICSS-58 2025 proceedings collection; the specific paper handle should be confirmed directly via https://scholarspace.manoa.hawaii.edu.
Li, Y., Lin, X., Sha, Z., Jin, Z., & Li, X. (2025b). Measuring human and AI values based on generative psychometrics with large language models. In Proceedings of the 39th AAAI Conference on Artificial Intelligence (Vol. 39, pp. 26400–26408). AAAI Press. https://doi.org/10.1609/aaai.v39i24.34839
Citation note: The in-text citation for Li et al. (2025b) describes a study on Purchase Intention and predictive validity using GPT-3.5, GPT-4, LLaMA-2, and LLaMA-3, which does not match the GPV/values paper above. Please verify whether Li et al. (2025b) refers to a separate publication; if so, provide that citation for replacement.
Lin, Z. (2025). Six fallacies in substituting large language models for human participants. Advances in Methods and Practices in Psychological Science, 8(3), 1–19. https://doi.org/10.1177/25152459251357566
Liu, Y., Bhandari, S., & Pardos, Z. A. (2025). Leveraging LLM respondents for item evaluation: A psychometric analysis. British Journal of Educational Technology, 56, 1028–1052. https://doi.org/10.1111/bjet.13570
Mendonça, P. C., Quintal, F., & Mendonça, F. (2025). Evaluating LLMs for automated scoring in formative assessments. Applied Sciences, 15(5), Article 2787. https://doi.org/10.3390/app15052787
Mislevy, R. J., Steinberg, L. S., & Almond, R. G. (2003). On the structure of educational assessments. Measurement: Interdisciplinary Research and Perspectives, 1(1), 3–62. https://doi.org/10.1207/S15366359MEA0101_02
Oka, H., Tan, Y., Ishioka, T., & Okada, K. (2025). Systematic control of multiple-choice item difficulty through LLM-based distractor generation. In A. I. Cristea, E. Walker, Y. Lu, O. C. Santos, & S. Isotani (Eds.), Artificial Intelligence in Education: Posters and Late Breaking Results, Workshops and Tutorials, Industry and Innovation Tracks, Practitioners, Doctoral Consortium, Blue Sky, and WideAIED (Communications in Computer and Information Science, Vol. 2590, pp. 147–157). Springer Nature. https://doi.org/10.1007/978-3-031-99261-2_14
Pellert, M., Lechner, C. M., Wagner, C., Rammstedt, B., & Strohmaier, M. (2024). AI psychometrics: Assessing the psychological profiles of large language models through psychometric inventories. Perspectives on Psychological Science, 19(5), 808–826. https://doi.org/10.1177/17456916231214460
Preston, K. S. J., & Reise, S. P. (2014). Estimating the nominal response model under nonnormal conditions. Educational and Psychological Measurement, 74(3), 377–399. https://doi.org/10.1177/0013164413507063
Reise, S. P., Rodriguez, A., Spritzer, K. L., & Hays, R. D. (2018). Alternative approaches to addressing non-normal distributions in the application of IRT models to personality measures. Journal of Personality Assessment, 100(4), 363–374. https://doi.org/10.1080/00223891.2017.1381969
Spearman, C. (1904). “General intelligence,” objectively determined and measured. The American Journal of Psychology, 15(2), 201–293. https://doi.org/10.2307/1412107
Swiecki, Z., Khosravi, H., Chen, G., Martinez-Maldonado, R., Lodge, J. M., Milligan, S., Selwyn, N., & Gašević, D. (2022). Assessment in the age of artificial intelligence. Computers and Education: Artificial Intelligence, 3, Article 100075. https://doi.org/10.1016/j.caeai.2022.100075
Tian, Y., Sun, J., Peng, N., & Zhang, Z. (2025). SkillVerse: Assessing and enhancing LLMs with tree evaluation. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 8917–8933). Association for Computational Linguistics. https://doi.org/10.18653/v1/2025.acl-long.437
Wang, X., Jiang, L., Hernandez-Orallo, J., Stillwell, D., Sun, L., Luo, F., & Xie, X. (2023). Evaluating general-purpose AI with psychometrics [Preprint]. arXiv. https://doi.org/10.48550/arXiv.2310.16379
Ye, H., Jin, J., Xie, Y., Zhang, X., & Song, G. (2025a). Large language model psychometrics: A systematic review of evaluation, validation, and enhancement [Preprint]. arXiv. https://doi.org/10.48550/arXiv.2505.08245
Ye, H., Xie, Y., Ren, Y., Fang, H., Zhang, X., & Song, G. (2025b). Measuring human and AI values based on generative psychometrics with large language models. In Proceedings of the 39th AAAI Conference on Artificial Intelligence (Vol. 39, pp. 26400–26408). AAAI Press. https://doi.org/10.1609/aaai.v39i24.34839
Yuan, P., Zhang, Y., Feng, S., Li, Y., Wang, X., Shi, J., Tan, C., Pan, B., Hu, X., & Li, K. (2025). Beyond one-size-fits-all: Tailored benchmarks for efficient LLM evaluation. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 15591–15615). Association for Computational Linguistics. https://doi.org/10.18653/v1/2025.acl-long.759
Zhang, X., Wang, C., Weiss, D. J., & Tao, J. (2021). Bayesian inference for IRT models with non-normal latent trait distributions. Multivariate Behavioral Research, 56(5), 703–723. https://doi.org/10.1080/00273171.2020.1776096
Zhang, Y., Li, S., Yuan, X., Yuan, H., Che, Z., & Luo, S. (2025). The high-dimensional psychological profile of ChatGPT. Science China Technological Sciences, 68, Article 1820401. https://doi.org/10.1007/s11431-025-2934-8
Zou, B., Li, P., Pan, L., & Aw, A. T. (2022). Automatic true/false question generation for educational purpose. In Proceedings of the 17th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2022) (pp. 61–70). Association for Computational Linguistics. https://doi.org/10.18653/v1/2022.bea-1.10
Zu, J., Choi, I., & Hao, J. (2023). Automated distractor generation for fill-in-the-blank items using a prompt-based learning approach. Psychological Test and Assessment Modeling, 65(1), 55–75. https://www.psychologie-aktuell.com/index.php/ptam/article/view/1039
Citation note: No DOI was registered for this article as of March 2026. The URL above links to the journal’s article page. Verify the DOI status with the publisher (Pabst Science Publishers).
| Dimension | Pillar I: Construct Definition & Validity Framework | Pillar II: Internal Structure & Score Generalizability | Pillar III: Operational Scaling & Score Comparability |
|---|---|---|---|
| Unified Focus | Construct definition + intended use | Dimensionality structure + error quantification | Scale maintenance + future-proofing |
| Source RQs | RQ1 (Intended Use) + RQ2 theoretical (§1–§3.2) | RQ2 empirical (§3.3 bifactor/IRT/simulation) + RQ3 (Format/G-Theory) | RQ4 (Domain Spec/LLM Family Taxonomy) + RQ5 (Equating/Linking) |
| Psychometric Task | Build nomological network; define construct for non-human agents; specify intended use taxonomy | Run bifactor IRT to separate \(G\) from \(S_k\); quantify prompt/format/replication error via G-study | Establish anchor-item protocols; use AIG for parallel forms; determine comparability level for longitudinal claims |
| Guiding Question | “What are we measuring, and for whom?” | “Does the score faithfully capture \(G\)? How much error contaminates it?” | “Is the scale durable, and can improvement be distinguished from contamination?” |
| Core Deliverable | Intended use taxonomy; validity argument template; nomological network for LLM capability constructs | Bifactor model with \(\omega_h\)/ECV/PUC; G-study variance components; D-study optimization per intended use | Domain specification framework; AIG-based parallel form strategy; equating/linking feasibility roadmap |
| Psychometric Benefit | Establishes the “Rule of Law” before any analysis — prevents construct conflation from the outset | Groups structure (bifactor) with error (G-Theory) — the two are inseparable in a unified validity argument | Treats AIG as a structural solution to scale decay/contamination; grounds longitudinal claims in equating theory |
| Medical Metaphor | The Diagnosis — intake assessment; defining what to check for before any test is run | The Labs — calibrating the thermometer; ensuring we measure blood pressure, not caffeine intake | The Patient History — distinguishing genuine recovery from memorization of the eye chart |
Audience bridge — Pillar I: For AI scientists: This pillar asks “what exactly are you trying to measure?” before touching any data. In ML terms, it’s the step before model selection — defining the task specification and intended use case rigorously. For psychometricians: The novelty here is applying the Kane validity argument framework to an AI examinee, where the “construct” (reasoning ability, domain knowledge) must be defined for a non-human, non-replicable-in-the-usual-sense entity.
Clinical Framing — The Intake Assessment: Before a single model is tested or a single item is scored, this pillar performs the diagnostic intake. What capability are we claiming to measure? For what purpose and for which stakeholder? Evaluation is not a neutral act — the validity of every subsequent analysis in Pillars II and III depends entirely on the clarity of these foundational answers. A benchmark that never specifies its intended use cannot be evaluated for validity, just as a physician who never specifies the reason for a diagnostic test cannot determine whether the result is meaningful.
Audience bridge — RQ1: For AI scientists: This is the “requirements specification” step. Before you build a benchmark, you should know whether it’s for ranking models (selection), diagnosing weaknesses (diagnostic), certifying deployment readiness (certification), tracking progress over versions (progress monitoring), or testing scientific hypotheses (investigation) — because each use demands different psychometric properties. For psychometricians: The five-use taxonomy here is a direct application of Kane’s interpretive argument to a novel domain. The main challenge is that the LLM field often re-purposes a single benchmark score for all five uses simultaneously.
RQ1: How Does the Interpretive Argument Determine the Required Validity Evidence?
Pillar I — The Diagnosis: This section defines the reason for the clinical visit, the intended use, before any measurement instrument is chosen or any item is administered. The taxonomy of intended uses developed here is the “rule of law” that governs all downstream methodological decisions.
The LLM assessment ecosystem operates without explicit specification of intended use. When a new benchmark is released (MMLU, HumanEval, GSM8K, BigBench, and so on), the implicit purpose is almost always ranking: which model is “best”? But the same benchmark scores are then re-purposed for diagnostic interpretation (“Model X is weak at multi-step reasoning”), capability certification (“Model Y is safe to deploy for medical advice”), progress tracking (“Model Z improved 12 points over its predecessor”), and marketing (“State-of-the-art on 14 benchmarks”). Each of these uses demands fundamentally different psychometric properties, yet the distinction is never made.
This is not a minor oversight. In educational measurement, the Standards for Educational and Psychological Testing (AERA, APA, & NCME, 2014) establish that validity is not a property of the test itself but of the interpretation and use of test scores. A benchmark that is valid for ranking models may be entirely invalid for diagnosing specific weaknesses. The failure to specify intended use means that the LLM field cannot even begin to evaluate whether its benchmarks are doing what practitioners assume they are doing.
Kane (2006, 2013) reframed validity as an evaluation of the plausibility of proposed interpretations and uses of test scores. Rather than asking “Is this test valid?” the field asks “Is the proposed interpretation and use of scores from this test supported by adequate evidence?” Kane’s framework distinguishes between the interpretive argument — the network of inferences and assumptions connecting observations to score interpretations — and the validity argument — the evaluative argument that weighs the available evidence for and against those inferences. The two are logically related but methodologically distinct.
Kane’s framework specifies an interpretive argument — the chain of inferences leading from observed performance to the intended interpretation and use — and then requires a validity argument that provides evidence for each inference in the chain. The key inferences are scoring (from the observed performance to an observed score), generalization (from the observed score to the universe of comparable tasks the benchmark is meant to represent), extrapolation (from the universe score to performance in the target domain of real-world use), and decision or implication (from the target-domain interpretation to the actions taken on the basis of the score).
For LLM benchmarking, each intended use requires a different interpretive argument and therefore different validity evidence. This is the core insight that the field has missed.
Messick (1989) argued that the social consequences of test use are part of validity. For LLM benchmarking, this is especially pertinent: benchmark rankings influence billions of dollars in investment, deployment decisions affecting millions of users, and regulatory frameworks. If benchmarks systematically misrepresent model capabilities — because the intended use was never specified and therefore never validated — the consequences are substantial.
Standards 1.1 and 1.2 (AERA, APA, & NCME, 2014) establish clear expectations: the intended test score interpretation must be distinguished from the intended use. Intended test score interpretation specifies the meaning attributed to scores (e.g., “this score reflects reasoning ability”), while intended use specifies the action taken on the basis of those scores (e.g., “this score will be used to rank models for deployment”). This distinction matters because a single benchmark may support one interpretation but not another, or support an interpretation under one use but not another. The standards require explicit articulation of both the interpretation and the use for validity evaluation to proceed.
Purpose: Identify the best-performing model(s) for a given application or determine whether a model meets a minimum performance threshold.
Stakeholders: Developers comparing their model to competitors; organizations selecting a model for deployment; consumers choosing which model to use.
Psychometric Requirements:
- High reliability of rank ordering (not necessarily high absolute score precision)
- Evidence that the ranking generalizes beyond the specific benchmark items (generalization inference)
- Evidence that benchmark ranking predicts real-world performance in the target application (extrapolation inference)
- Appropriate difficulty distribution and Test Information Function (TIF) — in Item Response Theory, the Test Information Function (Lord, 1980) quantifies the precision of ability estimation as a function of theta. For competitive ranking (selection among high-ability models), the TIF should be maximized in the upper theta range, which requires selecting items with high IRT discrimination parameters (a) and difficulty parameters (b) calibrated to the frontier-model ability range. For minimum-competency screening, the TIF should peak at the cut-score theta level. Items that are too easy (near-zero discrimination at high theta) contribute negligible information for selection decisions and should be retired or replaced in an active item bank. (A minimal computational sketch follows this list.)
- Fairness across model architectures — the benchmark should not systematically advantage one architecture over another for construct-irrelevant reasons; Differential Item Functioning (DIF) analyses (Holland & Wainer, 1993; Zumbo, 1999) across architectural subgroups (e.g., transformer-based vs. mixture-of-experts models) provide the formal psychometric check on this requirement
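As referenced in the TIF bullet above, the sketch below computes 2PL item and test information over a theta grid and reports where a hypothetical item pool is most informative. The item parameters are invented; the formula, item information equal to \(a^2 P(\theta)(1 - P(\theta))\) under the logistic 2PL, is standard.

```python
import numpy as np

def item_information_2pl(theta, a, b):
    """Fisher information of a 2PL item: a^2 * P(theta) * (1 - P(theta))."""
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    return a**2 * p * (1.0 - p)

def test_information(theta, a_params, b_params):
    """Test Information Function: the sum of item informations."""
    return sum(item_information_2pl(theta, a, b)
               for a, b in zip(a_params, b_params))

# Hypothetical high-discrimination items targeted at the frontier-model range.
theta_grid = np.linspace(-3.0, 4.0, 141)
a_params = [1.8, 2.0, 1.5, 2.2]
b_params = [1.5, 2.0, 2.5, 3.0]
tif = test_information(theta_grid, a_params, b_params)
print("theta of maximum information:", theta_grid[np.argmax(tif)])
```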
Current Practice vs. Requirements: Most benchmarks implicitly serve this purpose, but rarely provide evidence for the extrapolation inference (does benchmark ranking predict deployment performance?). Additionally, the rank ordering is often unstable — different benchmarks produce different rankings — suggesting that the generalization inference is weak.
Analogues in Educational Measurement: College admissions testing (SAT/ACT), professional licensure examinations, employment testing.
Purpose: Identify specific strengths and weaknesses of a model to guide development, fine-tuning, or targeted improvement.
Stakeholders: Model developers seeking to improve their systems; researchers investigating specific capabilities or failure modes.
Psychometric Requirements:
- Subscale reliability — not just overall score reliability, but reliable measurement of specific subdimensions (e.g., “algebraic reasoning” vs. “geometric reasoning” within mathematics); a minimal reliability sketch follows this list
- Fine-grained construct definition — the construct domain must be decomposed into meaningful, distinguishable subdomains
- Diagnostic classification accuracy and consistency — beyond overall reliability, diagnostic classification models require evaluation of classification accuracy (the proportion of correct proficiency classifications) and classification consistency (the probability that a model would be assigned to the same proficiency category upon repeated independent measurement). Classification consistency should be estimated using the classification consistency index (CCI; Brennan & Kane, 1977) or related indices (Lee, 2010). This index is conceptually distinct from overall score reliability and is often lower, particularly near subdomain boundaries.
- Item-level diagnostic value — individual items or item clusters should provide interpretable information about specific capabilities
- Appropriate subscale length — each subdomain needs enough items to produce reliable subscale scores (often a problem when overall benchmark length is constrained)
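As a companion to the subscale-reliability bullet above, the sketch below computes Cronbach's alpha for a single subscale from a models-by-items score matrix. The function and variable names are hypothetical; the formula is the standard one, and it will typically yield low values for the very short subscales common in current benchmarks, which is exactly the concern raised here.

```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha for one subscale, given an examinee-by-item matrix:
    alpha = k/(k-1) * (1 - sum of item variances / variance of total score)."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]
    item_var = scores.var(axis=0, ddof=1).sum()
    total_var = scores.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1.0 - item_var / total_var)

# e.g., a hypothetical 0/1 matrix for a six-item "algebraic reasoning" subscale:
# print(cronbach_alpha(subscale_matrix))
```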
Current Practice vs. Requirements: Benchmarks like MMLU report subscale scores (e.g., accuracy on “abstract algebra” vs. “anatomy”) but provide no evidence that these subscales are reliable, that they measure distinct constructs, or that the number of items per subscale is sufficient for stable estimates. The diagnostic interpretation is therefore unwarranted by available evidence.
Analogues in Educational Measurement: Diagnostic classification models (DCMs), formative assessments, cognitive diagnostic assessments.
Purpose: Determine whether a model meets a predefined performance standard required for deployment, regulatory compliance, or safety assurance.
Stakeholders: Regulators (e.g., EU AI Act compliance); organizations with safety requirements; industry standards bodies.
Psychometric Requirements:
- Criterion-referenced interpretation — scores are interpreted against a fixed standard, not relative to other models
- Cut-score validity — the performance threshold must be established through a principled standard-setting process. Recognized methods include the Angoff method (Angoff, 1971), in which subject-matter experts estimate the probability that a minimally competent examinee would answer each item correctly; the Bookmark method (Lewis, Mitzel, Mercado, & Schulz, 2012), in which items are ordered by difficulty and panelists place a ‘bookmark’ at the point separating acceptable from unacceptable performance; and the contrasting groups method, in which cut scores are derived from empirical distributions of known-group performance. Each method requires documented panelist qualifications, structured deliberation procedures, and impact data reporting. (A minimal Angoff computation follows this list.)
- Classification accuracy and consistency — certification assessments require evaluation of the classification accuracy (the proportion of true pass/fail decisions that are made correctly) and classification consistency (the proportion of classifications that would be reproduced upon re-testing), as formalized by Brennan and Kane (1977) and extended by Lee (2010).
- Minimization of misclassification — the consequences of false positives (certifying an inadequate model) and false negatives (failing an adequate model) must be considered in cut-score placement
- Content validity — the benchmark must representatively sample the domain of tasks the model will encounter in deployment
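The Angoff procedure referenced above reduces to simple arithmetic once panelist ratings are collected: average each item's ratings across panelists, then sum the item-level estimates to obtain the test-level cut score. The sketch below uses invented ratings; in practice the documented panel process, not the arithmetic, carries the validity burden.

```python
import numpy as np

# Hypothetical Angoff ratings: each panelist estimates, for every item, the
# probability that a minimally qualified model answers it correctly.
# Shape: (n_panelists, n_items); values in [0, 1].
angoff_ratings = np.array([
    [0.7, 0.5, 0.9, 0.6, 0.8],
    [0.6, 0.6, 0.8, 0.5, 0.9],
    [0.8, 0.4, 0.9, 0.7, 0.7],
])

item_cuts = angoff_ratings.mean(axis=0)   # per-item expected performance
cut_score = item_cuts.sum()               # test-level cut score (raw-score metric)
print(f"Angoff cut score: {cut_score:.2f} of {angoff_ratings.shape[1]} items")
```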
Current Practice vs. Requirements: No LLM benchmark currently employs a formal standard-setting process. When thresholds are used (e.g., “human-level performance”), they are typically set by convenience rather than by a defensible method. Classification consistency is never reported.
Analogues in Educational Measurement: Professional licensure (bar exam, medical boards), competency-based assessments, standards-referenced testing.
Purpose: Track performance changes over time, typically across model versions or training iterations.
Stakeholders: Developers tracking improvement across model versions; researchers studying capability emergence; organizations monitoring deployed model drift.
Psychometric Requirements:
- Score comparability across time points — if the benchmark changes (new items, updated format), scores must be equated; a minimal anchor-item linking sketch follows this list
- Sensitivity to change — the benchmark must be able to detect meaningful improvements while distinguishing them from measurement noise
- Resistance to contamination — if benchmark items leak into training data, progress monitoring becomes meaningless
- Vertical scaling — the construction of a common developmental scale spanning multiple difficulty levels, enabling comparison of growth across benchmarks that target different ability ranges. Vertical scaling (Kolen & Brennan, 2014) refers to this process. In LLM contexts, vertical scaling would allow tracking of capability growth as frontier models surpass current benchmark ceilings, provided that appropriate overlapping content is included at each difficulty level to anchor the scale.
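As flagged in the score-comparability bullet above, one of the simplest IRT linking procedures is mean-sigma linking on anchor items shared by two benchmark versions. The sketch below uses invented anchor-item difficulties; operational equating would more likely use characteristic-curve methods such as Stocking-Lord, but the logic of placing new-form parameters on the old scale is the same.

```python
import numpy as np

def mean_sigma_link(b_anchor_old, b_anchor_new):
    """Mean-sigma linking constants: b_on_old_scale = A * b_new + B, where
    A = sd(b_anchor_old) / sd(b_anchor_new) and
    B = mean(b_anchor_old) - A * mean(b_anchor_new)."""
    A = np.std(b_anchor_old, ddof=1) / np.std(b_anchor_new, ddof=1)
    B = np.mean(b_anchor_old) - A * np.mean(b_anchor_new)
    return A, B

# Hypothetical difficulties of the same anchor items calibrated separately
# in two versions of a benchmark.
b_anchor_v1 = np.array([-1.2, -0.4, 0.1, 0.8, 1.5])
b_anchor_v2 = np.array([-1.0, -0.2, 0.3, 1.1, 1.9])
A, B = mean_sigma_link(b_anchor_v1, b_anchor_v2)
print(f"A = {A:.3f}, B = {B:.3f}")
```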
Current Practice vs. Requirements: Progress claims are pervasive (“GPT-4 improved X% over GPT-3.5 on benchmark Y”) but almost never accompanied by equating evidence, contamination checks, or measurement error estimates. Without these, it is impossible to distinguish genuine capability improvement from measurement artifacts.
Psychometric precedents: Growth modeling (Raudenbush & Bryk, 2002); vertical scale construction and maintenance (Kolen & Brennan, 2014, Ch. 11); value-added models for longitudinal academic progress; NAEP long-term trend assessments as operational examples of maintained longitudinal comparability across cohorts.
Purpose: Use benchmark performance as data to understand the nature of LLM capabilities, generalization patterns, or architectural effects.
Stakeholders: Researchers studying emergent abilities, scaling laws, transfer learning, or the nature of LLM “understanding.”
Psychometric Requirements:
- Construct validity is paramount — if the researcher is using benchmark performance to make claims about “reasoning” or “understanding,” strong evidence is needed that the benchmark actually measures these constructs
- Internal structure validity — factor analysis or similar methods should confirm that the benchmark’s internal structure matches the theoretical construct structure
- Discriminant validity — performance on the benchmark should not be fully explainable by simpler constructs (e.g., memorization, pattern matching)
- Experimental control — administration conditions must be standardized to support causal inferences about architectural or training differences
Current Practice vs. Requirements: Studies of emergent abilities (e.g., Wei et al., 2022) rely heavily on benchmark scores to detect capability thresholds, but the psychometric properties of those benchmarks are rarely examined. If the benchmark has poor reliability or construct-irrelevant variance, apparent “emergence” may be a measurement artifact (Schaeffer et al., 2023, made a related argument about metric choice).
Analogues in Educational Measurement: Cognitive psychology research using standardized tests; structural validity studies.
Each intended use is further differentiated by stakeholder perspective. The same benchmark, used for the same general purpose, may require different validity evidence depending on who is interpreting the scores:
| Stakeholder | Primary Interest | Validity Priority |
|---|---|---|
| Model Developer | Competitive positioning; identifying improvement targets | Selection (ranking against competitors) + Diagnostic (guiding development) |
| Deploying Organization | Risk management; fitness for purpose | Certification (meets requirements) + Extrapolation (predicts real-world performance) |
| End User | Choosing the right tool for their task | Selection (which model for my use case) + Extrapolation (will it work for me?) |
| Regulator | Public safety; compliance | Certification (meets standards) + Consequential validity (what are the risks of misclassification?) |
| Researcher | Understanding capabilities and limitations | Scientific investigation + Construct validity |
Importantly, a single assessment may be used for multiple purposes by different stakeholders simultaneously; in such cases, the validity argument must address each intended interpretation and use pair separately, as evidence sufficient for one use may be inadequate for another (Kane, 2013; Standards 1.2, AERA, APA, & NCME, 2014).
The taxonomy developed in RQ1 has cascading implications:
Audience bridge — RQ2: For AI scientists: This RQ asks whether the benchmark measures what it claims to measure. Does a model’s MMLU score actually reflect “reasoning ability,” or does it mainly reflect how much MMLU-adjacent content was in the training data? The bifactor model separates a general reasoning dimension (G) from domain-specific memorization (S-factors). For psychometricians: The key novelty is the small-N problem (N ≈ 71 LLMs vs. thousands of human examinees), non-normal ability distributions, and binary response data — all of which require method modifications validated through the simulation study in §3.3.1.
RQ2: Construct Representation, Construct-Irrelevant Variance, and Nomological Network
Pillar I (cont.) — The Diagnosis: This section specifies the diagnosis itself — the construct — and builds the nomological network that gives the diagnostic label meaning. Establishing what “reasoning” or “knowledge” means for a non-human agent is a prerequisite for determining whether any instrument measures it.
When MMLU claims to measure “Massive Multitask Language Understanding,” what exactly is “language understanding”? When Grade School Math 8K (GSM8K) claims to test “grade school math reasoning,” what is “reasoning” as opposed to pattern matching, memorization, or sophisticated interpolation embedded in LLMs? When HumanEval tests “code generation ability,” does a high score mean the model can genuinely write code, or that it has memorized enough code patterns to pass specific test cases?
These are construct validity questions, and they are almost entirely unaddressed in LLM assessment. The field names constructs freely — reasoning, understanding, knowledge, creativity, safety — without providing evidence that benchmarks actually measure these constructs, without specifying what these constructs mean when applied to non-human agents, and without acknowledging that different stakeholders may define the same construct label very differently.
These failures constitute what Messick (1989) identified as the two cardinal threats to construct validity: construct underrepresentation (CU), in which the assessment fails to capture the full breadth of the target construct, and construct-irrelevant variance (CIV), in which scores are systematically influenced by factors outside the construct. Both threats operate simultaneously in LLM benchmarking: benchmarks that sample a narrow slice of a broadly named construct (e.g., “reasoning”) suffer from construct underrepresentation, while sensitivity to prompt phrasing, option ordering, and formatting introduces construct-irrelevant variance. Rigorous validity evaluation requires explicit attention to both threats.
Messick (1989, 1995) articulated a unified framework for construct validity, in which all validity evidence — content-related, criterion-related, and consequential — contributes to a single overarching evaluation of the meaning and appropriateness of score interpretations. Messick identified six aspects of construct validity that collectively constitute the unified framework: (1) the content aspect, concerning the relevance and representativeness of assessment content relative to the construct domain; (2) the substantive aspect, concerning the cognitive processes, response strategies, and knowledge structures that actually produce examinee responses; (3) the structural aspect, concerning the degree to which the internal scoring structure reflects the theoretical structure of the construct; (4) the generalizability aspect, concerning the extent to which score interpretations generalize across population groups, settings, tasks, and time; (5) the external aspect, concerning the convergent and discriminant relationships between scores and external criteria; and (6) the consequential aspect, concerning the social consequences and value implications of score interpretation and use. All six aspects are relevant to LLM benchmarking.
This perspective is critical for LLM benchmarking because the field tends to treat “accuracy” as self-evidently meaningful, bypassing the question of whether the accuracy score actually represents the named construct.
Embretson (1983) distinguished two components of construct validity. Construct representation concerns the identification of the theoretical mechanisms — cognitive processes, knowledge structures, and problem features — that account for task performance. It asks: what makes items of this type hard or easy, and what cognitive operations are required? Nomothetic span concerns the construct’s place within a broader theoretical network, evaluated through its empirical relationships with other constructs. Both components are necessary for adequate construct validity. For LLM benchmarks, construct representation is particularly challenging because the internal computational processes are largely unobservable for proprietary models; however, experimental manipulation of item features (e.g., systematically varying surface features while holding construct-relevant features constant) can provide indirect evidence about construct representation even without access to model internals.
Cronbach and Meehl (1955) introduced the concept of the nomological network — the interlocking system of theoretical laws, empirical relationships, and construct definitions within which a psychological construct is embedded. Validity, on this view, is not merely a property of item content or empirical correlations but of the entire theoretical framework: a construct is valid to the extent that it behaves as the theory predicts. For LLM assessment, constructing a nomological network requires specifying: (a) what theoretical relationships the target construct (e.g., “mathematical reasoning”) should have with related constructs (e.g., “working memory capacity” in humans, or “training compute” and “model size” in LLMs); (b) what empirical patterns would support or falsify those relationships; and (c) what boundary conditions limit the construct’s applicability. The absence of such a network in current LLM benchmarking means that high scores on “reasoning” benchmarks cannot be meaningfully interpreted — there is no theoretical context in which to situate the score.
A complementary formal framework is Evidence-Centered Design (ECD; Mislevy, Almond, & Lukas, 2003), which decomposes assessment design into three interlocking models: the student model (what we want to infer about the examinee), the task model (the conditions under which evidence is elicited), and the evidence model (the rules for evaluating task performance as evidence about the student model). For LLM assessment, the student model corresponds to the latent capability construct being measured (e.g., clinical reasoning, mathematical problem-solving), the task model specifies prompt format, context, and scoring rubric, and the evidence model translates response patterns into psychometric scores. ECD thus provides a rigorous design grammar that complements Kane’s (2006) interpretive argument by specifying, at the item-design stage, exactly what student-model attributes each task is intended to engage.
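One way to see ECD as a design grammar is to encode the three models as an explicit, machine-readable task specification. The sketch below is purely illustrative: the field names are assumptions, not drawn from any ECD software or standard schema.

```python
from dataclasses import dataclass, field

@dataclass
class ECDTaskSpec:
    """Hypothetical encoding of ECD's three models for one benchmark task."""
    # Student model: the latent attributes the task is meant to inform.
    student_attributes: list = field(
        default_factory=lambda: ["clinical_reasoning", "pharmacology_knowledge"])
    # Task model: the conditions under which evidence is elicited.
    prompt_format: str = "zero-shot multiple choice, option order randomized"
    stimulus_type: str = "USMLE-style clinical vignette"
    # Evidence model: how the observed response becomes psychometric evidence.
    scoring_rule: str = "exact match on keyed option"
    measurement_model: str = "bifactor IRT (general factor plus domain factor)"
```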
A fundamental question unique to LLM assessment — one without direct precedent in human psychometrics — is whether constructs developed for human cognition (reasoning, understanding, knowledge retrieval) can be meaningfully applied to artificial systems. The answer is not self-evident, and different metatheoretical commitments yield categorically different validity evidence requirements. At least three distinct competency frameworks emerge from the intersection of philosophy of mind, cognitive science, and AI evaluation literatures:
Behaviorist competency: Constructs are defined solely by observable input–output behavior. If a model produces outputs indistinguishable from those of a human “reasoning,” then — for assessment purposes — it is reasoning. This competency framework finds its most explicit formulation in Turing’s (1950) imitation game, which proposed behavioral indistinguishability as the operational criterion for machine intelligence. Dennett’s (1987) intentional stance extends this framework: we are justified in treating any system as having beliefs, desires, and intentions when doing so is predictively useful, regardless of the system’s internal constitution. More recently, Chollet (2019) refines the behaviorist standard by arguing that genuine behavioral competency must generalize across novel synthesis problems rather than tasks drawn from the training distribution — transforming the imitation game from a permissive to a demanding criterion. A model that matches human accuracy on MMLU through memorization has not demonstrated the behavioral competency the construct implies; only performance on problems requiring novel recombination of skills constitutes compelling behavioral evidence. Under the behaviorist competency, construct validity requires that the benchmark representatively samples the behavioral domain and, critically, that it tests generalization conditions rather than retrieval from encoded distributions.
Mechanistic competency: Constructs require evidence about specified underlying cognitive processes, not merely output accuracy. “Reasoning” implies particular computational operations — systematic compositionality, causal inference, recursive rule application — and a model that achieves correct outputs through statistically encoded co-occurrences rather than these operations does not possess the construct, even when its surface behavior is indistinguishable from that of a genuine reasoning agent. Searle’s (1980) Chinese Room argument provides the foundational statement: syntactic manipulation of symbols does not produce semantic understanding, however behaviorally convincing the outputs. Bender et al. (2021) operationalize this critique empirically, arguing that LLMs trained exclusively on form without grounding in the world cannot acquire meaning, and that high benchmark scores obtained through distributional pattern matching constitute construct-irrelevant variance rather than evidence of the named construct. Lake et al. (2017) demonstrate that models matching human accuracy on specific tasks fail systematically under conditions requiring genuine compositional generalization, causal reasoning, or theory of mind — the very sub-constructs invoked by most reasoning benchmarks. Chomsky, Roberts, and Watumull (2023) sharpen the theoretical objection: the statistical learning mechanisms that produce LLM outputs are orthogonal to the cognitive operations that constitute human language and thought, making construct transfer from human to machine assessment scientifically problematic without explicit process-level validation. Mitchell (2021) surveys apparent AI competencies to show that each reported breakthrough is accompanied by systematic failures at construct boundaries — a pattern inconsistent with genuine construct possession and consistent with benchmark-specific overfitting. Under the mechanistic competency, many current LLM benchmarks have weak construct validity by design: high accuracy on existing tasks is insufficient evidence of the process-level construct the benchmark names.
Pragmatic/Relational competency: Constructs are defined relative to the stakeholder’s interpretive framework and intended use, rather than by any fixed behavioral or mechanistic standard. “Reasoning” means whatever the relevant decision community operationally requires in a specific deployment context — and construct validity is therefore not an intrinsic property of the test, but a relational property of the test–use–stakeholder triad. This competency is most continuous with the dominant tradition in modern psychometric theory. Messick’s (1989) unified validity framework establishes that construct validity is inherently consequential and intentional: evidence about meaning is inseparable from evidence about use, and a score’s construct interpretation is partly defined by the network of decisions it supports and the consequences it produces. Kane’s (2013) interpretive/use argument operationalizes this principle: validity evidence must be proportional to the stakes and novelty of each specific interpretation, and the same test may support valid inferences for one purpose while failing to do so for another. Raji et al. (2021) bring this critique directly to AI benchmarking, demonstrating that benchmark scores are routinely repurposed for interpretive claims far beyond what the benchmark’s construct definition warrants — a validity failure that is invisible when benchmarks are evaluated as if construct meaning were fixed and context-free. Applied to LLM assessment, the pragmatic/relational competency requires each intended use to generate its own validity argument: a model may legitimately be described as exhibiting “reasoning competency” for the purpose of ranking research prototypes while that same claim would be scientifically unjustified for medical diagnosis deployment, because the two uses demand categorically different process and consequence evidence.
This dissertation does not resolve this philosophical question definitively — the three competency frameworks are neither fully separable nor mutually exclusive in practice, and productive validity arguments typically integrate evidence from all three. Rather, the dissertation’s contribution is to show that different validity evidence requirements follow explicitly from different competency commitments: response-process evidence (§3.2) is demanded primarily by the mechanistic competency; consequence and use evidence (§3.5) by the pragmatic/relational competency; and convergent/discriminant evidence (§3.3) is required under all three frameworks. Making these dependencies explicit is itself a contribution to LLM assessment science, because current benchmarking practice implicitly assumes a behaviorist competency while deploying scores for purposes that — by their own stated stakes — require mechanistic or relational evidence.
Adapting the Standards’ five sources of validity evidence for LLM contexts:
In educational measurement: Expert judgment that test items representatively sample the content domain specified in the test blueprint.
Adaptation for LLMs:
- Does the benchmark have an explicit content specification and test blueprint?
- Was the item pool developed systematically to cover the specified domain?
- Did subject-matter experts review items for content relevance and representativeness?
- Is the domain specification appropriate for the named construct?
Current state of practice: Most LLM benchmarks lack formal test blueprints. MMLU, for example, draws items from existing exams across 57 subjects, but the selection criteria and content coverage rationale are not specified in terms of a construct domain. GSM8K items were written to cover “grade school math” but without a formal domain specification.
Critical gap: Even when content coverage is adequate, content validity evidence does not address whether the model engages with the content as intended. A model might score highly on “reasoning” items through memorization or shortcut strategies that bypass the intended construct.
In educational measurement: Evidence about the cognitive processes examinees actually use when responding, typically gathered through think-alouds, eye tracking, or response time analysis.
Adaptation for LLMs:
- Chain-of-thought analysis: When models generate step-by-step solutions, do the intermediate steps reflect the intended reasoning process?
- Ablation studies: When key information is removed or perturbed, does performance degrade in ways consistent with the intended construct?
- Prompt sensitivity analysis: If performance changes dramatically with superficial prompt changes that do not alter the construct-relevant content, this suggests construct-irrelevant variance
- Adversarial testing: Can the model be tricked into failures that reveal reliance on surface features rather than the target construct?
- Attention/activation analysis: For open-weight models, do internal representations reflect construct-relevant processing? (Limited applicability for closed models)
This is potentially the richest and most novel source of validity evidence for LLM assessment. Because LLMs can be manipulated experimentally in ways that human examinees cannot (precise control of input, reproducible conditions, systematic perturbation), response process evidence is both more feasible and more informative than in human testing contexts. Adversarial probing studies — in which construct-irrelevant surface features are systematically manipulated to test whether they alter scores — represent an LLM-specific analog to think-aloud validity studies in human testing (Embretson, 1983), and constitute some of the strongest available response-process evidence for benchmarks.
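A minimal version of such a perturbation study is sketched below: the same items are scored under several construct-equivalent paraphrases of the instructions, and the spread in accuracy quantifies prompt-induced construct-irrelevant variance. The scoring callable is a placeholder that any benchmark harness could supply; only the experimental structure is the point.

```python
import statistics

def accuracy(answer_fn, items, template):
    """Proportion correct when each item is shown under `template`.
    `answer_fn(prompt) -> str` is whatever call the harness makes to the
    model; `items` is a list of (question_text, keyed_option_letter)."""
    correct = sum(
        answer_fn(template.format(question=q)).strip().upper() == key
        for q, key in items)
    return correct / len(items)

def prompt_sensitivity(answer_fn, items, templates):
    """Score identical items under construct-equivalent instruction
    paraphrases; a large spread signals construct-irrelevant variance."""
    scores = [accuracy(answer_fn, items, t) for t in templates]
    return {"scores": scores,
            "range": max(scores) - min(scores),
            "sd": statistics.stdev(scores)}

# Hypothetical construct-equivalent paraphrases of the instructions:
templates = [
    "Answer with the letter of the best option only.\n{question}",
    "Choose the single best answer (A-D).\n{question}",
    "{question}\nRespond with just the correct option's letter.",
]
```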
Critical gap: Almost no current benchmark reports response process evidence. Accuracy is reported without any analysis of how models arrive at correct (or incorrect) answers.
Audience bridge — Pillar II: For AI scientists: This pillar asks whether your accuracy numbers are actually measuring a single coherent ability (like “reasoning”) or a mixture of unrelated skills — and how much of the score variation is noise from prompt wording, question format, or random sampling. In ML terms: decomposing systematic variance from measurement error. For psychometricians: The methods (bifactor IRT, G-Theory) are familiar, but the “examinee” is an AI model, the “items” are benchmark prompts, and the “administration facets” include temperature settings and prompt phrasing — each of which contributes construct-irrelevant variance in ways without human analogs.
Clinical Framing — The Lab Panel: Having defined the construct and its intended use in Pillar I, this pillar runs the diagnostic tests. The bifactor model is the calibration check: is the thermometer measuring the target construct (general reasoning ability, \(G\)) or is the reading contaminated by topic-specific knowledge (\(S_k\))? G-Theory is the error budget: how much do prompt wording, item format, and stochastic replication inflate or deflate the score? These two analyses are inseparable — reliability is meaningless unless we first establish what the score is supposed to be reliable about, and dimensionality is only interpretable once the error sources have been identified.
In educational measurement: Factor analysis, dimensionality assessment, and item analysis confirm that a test’s empirical covariance structure matches its intended theoretical structure — establishing whether scores actually reflect the construct the test was designed to measure rather than some combination of construct-relevant and construct-irrelevant variance.
Why standard methods cannot be applied directly to LLM data — and why simulation must come first.
Standard dimensionality assessment methods — exploratory and confirmatory factor analysis (EFA/CFA), Horn’s parallel analysis, MAP — were developed and validated for typical psychometric conditions: large samples of human respondents (N > 200), continuous or polytomous item scores, and latent ability distributions that are approximately normal in the target population. LLM benchmark data systematically violates all three of these conditions simultaneously, and adds a fourth structural violation of its own, in ways that compound rather than merely add:
Condition 1 — Small N (examinees = models). The unit of analysis for dimensionality is the model, not the item. Contemporary publicly evaluated LLM ecosystems contain 30–300 models; the present study uses N = 71. This falls well below the minimum sample size recommended for stable factor-analytic solutions in human psychometrics (MacCallum, Widaman, Zhang, & Hong, 1999; Mundfrom, Shaw, & Ke, 2005), and at the extreme lower end of the range in which even nonparametric dimensionality statistics are interpretable.
Condition 2 — Binary item responses. Multiple-choice benchmark items produce binary correct/incorrect responses. Pearson product-moment correlations between binary items (phi coefficients) are systematically attenuated when item difficulties diverge — an artifact of marginal distributions rather than latent covariance structure (Carroll, 1945). Substituting tetrachoric correlations corrects this bias, but tetrachoric estimation introduces substantial sampling variance under small N, and standard parallel analysis generates reference eigenvalues from continuous normal data, systematically overestimating factor count when applied to tetrachoric matrices (Garrido, Abad, & Ponsoda, 2013).
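A toy demonstration of this attenuation artifact, assuming only the psych package: two binary items generated from the same latent trait but with very different difficulties show a Pearson (phi) correlation far below the tetrachoric estimate of their latent association (about .80 under the parameters below).
library(psych)
set.seed(1)
theta <- rnorm(5000)                                             # common latent trait
item_easy <- as.integer(theta + rnorm(5000, sd = 0.5) > -1.5)    # easy item, p(correct) ~ .91
item_hard <- as.integer(theta + rnorm(5000, sd = 0.5) >  1.5)    # hard item, p(correct) ~ .09
cor(item_easy, item_hard)                                        # phi: attenuated by the difficulty gap
psych::tetrachoric(cbind(item_easy, item_hard))$rho[1, 2]        # ~ .80, close to the underlying correlation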
Condition 3 — Bimodal and skewed latent distribution. The LLM population is not drawn from a single normal ability pool. Frontier models (GPT-4-class) cluster at high \(\theta\) values while smaller open-source models cluster at lower values, producing a bimodal or strongly negatively skewed distribution. Standard maximum-likelihood IRT and factor estimation assumes \(\theta \sim N(0,1)\); bimodal latent distributions bias discrimination parameter estimates, inflate apparent dimensionality near the distribution’s anti-mode, and cause information-criterion-based factor retention to systematically over- or under-count dimensions (Woods & Thissen, 2006; Preston & Reise, 2014).
Condition 4 — Inverted J/N ratio. With J = 645 items and N = 71 models, the item-to-model ratio exceeds 9:1 — the inverse of the typical psychometric situation where N \(\gg\) J. Methods designed for item-level factor analysis assume that sampling variance in the item \(\times\) item correlation matrix is driven by N; in the LLM context, N is the bottleneck, not J, fundamentally altering the stability and interpretation of eigenvalue-based retention criteria.
Simulation-first validation strategy. No existing psychometric simulation study covers all four LLM-specific conditions simultaneously. Simulation studies in the factor retention literature (e.g., Auerswald & Moshagen, 2019; Goretzko & Bühner, 2020) use N \(\geq\) 100 with continuous or polytomous items and normal latent distributions — conditions that fail to represent the LLM measurement context in the dimensions that matter most. Applying methods whose operating characteristics are unknown under LLM-like conditions without prior validation would produce internal structure evidence of unknown reliability.
The present dissertation therefore adopts a simulation-first validation strategy: before any dimensionality method is applied to the empirical 71-model × 645-item USMLE response matrix, all 13 candidate methods are simultaneously evaluated under a factorial simulation design whose conditions are anchored to the LLM measurement context. This strategy follows the comparative benchmarking logic established by Li et al. (2025), who simultaneously evaluated seven feature-selection methods — ranging from classical item-total-correlation filters to Lasso, genetic algorithms, and multi-objective NSGA-II optimization — across controlled simulation conditions before recommending any method for applied scale abbreviation. The key principle is that method performance depends critically on the interaction between the method’s assumptions and the data-generating conditions; performance rankings established under standard conditions cannot be extrapolated to structurally different measurement contexts without direct empirical verification.
LLM-anchored simulation conditions. The simulation (detailed in §3.3.1) spans a fully-crossed factorial design whose factors are defined by the structural features of the LLM measurement context rather than by default psychometric simulation norms:
| Factor | Levels | Rationale |
|---|---|---|
| Factor structure | 1D, 2D (correlated), Bifactor, Correlated-factors, Higher-order | Covers all theoretically plausible structures for knowledge-based benchmarks |
| N (number of LLMs) | 30, 71, 150, 300 | Spans minimal evaluation study to large-scale ecosystem survey; N = 71 anchors to present data |
| J (number of items) | Short (J = 50), Medium (J = 200), Full benchmark (J = 645) | Tests whether method performance is stable across benchmark lengths |
| Loading magnitude | Weak (\(a\) = 0.40–0.60), Moderate (0.60–0.80), Strong (0.80–1.20) | Brackets the realistic range of item discrimination for MCQ benchmarks |
| Latent distribution | Normal \(N(0,1)\); Bimodal (\(0.4 \cdot N(-1.5, 0.5^2) + 0.6 \cdot N(1.2, 0.7^2)\)); Negatively skewed (\(\xi = -1.5\), \(\alpha = -4\)); Positively skewed (\(\xi = +1.5\), \(\alpha = +4\)) | Directly models frontier-vs.-open-source clustering and capability-floor effects; both skew directions included to test whether method failures are asymmetric (direction-dependent) or general to any departure from normality |
The bimodal and negatively skewed distribution conditions are explicitly designed to reflect the empirical LLM ability distribution identified in §2.3 — conditions that are entirely absent from standard simulation benchmarks in the factor retention literature. A positively skewed condition (\(\xi = +1.5\), \(\alpha = +4\)) is added as a symmetric counterpart: although positive skew is less characteristic of the current LLM ecosystem (where frontier models dominate the high-\(\theta\) region), it provides a theoretically important control that allows the simulation to determine whether any observed method failures under negative skew are direction-specific artifacts or reflect a general sensitivity to distributional asymmetry. Together, the four distribution conditions — normal, bimodal, negatively skewed, and positively skewed — span the realistic range of latent trait shapes likely to arise in LLM evaluation contexts across different model ecosystems and benchmark designs.
Figure S1. The four latent distribution shapes used in the simulation study. Normal: standard Gaussian; Bimodal: mixture of two normals representing frontier vs. open-source LLM clusters; Negatively skewed: left tail extended, characteristic of current LLM populations where frontier models dominate the high-ability region; Positively skewed: right tail extended, included as a direction-control condition.
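A minimal sketch to reproduce the four density shapes, using the parameters in the table above (the sn package is assumed for the skew-normal draws):
library(sn)      # rsn() for skew-normal samples
set.seed(7)
n_draw <- 1e4
theta_shapes <- list(
  Normal              = rnorm(n_draw),
  Bimodal             = c(rnorm(0.4 * n_draw, -1.5, 0.5), rnorm(0.6 * n_draw, 1.2, 0.7)),
  `Negatively skewed` = rsn(n_draw, xi = -1.5, omega = 1, alpha = -4),
  `Positively skewed` = rsn(n_draw, xi =  1.5, omega = 1, alpha =  4)
)
op <- par(mfrow = c(2, 2))
for (nm in names(theta_shapes))
  plot(density(theta_shapes[[nm]]), main = nm, xlab = expression(theta), lwd = 2)
par(op)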
Candidate methods compared. Thirteen dimensionality methods drawn from three families are evaluated simultaneously in the same simulation, following the comparative design logic of Li et al. (2025):
Traditional psychometric methods (7): Horn’s parallel analysis (PA; Horn, 1965), modified parallel analysis for polychoric correlations (PA-poly; Garrido et al., 2013), minimum average partial criterion (MAP; Velicer, 1976), Hull method (Lorenzo-Seva, Timmerman, & Kiers, 2011), Stout’s DIMTEST (Stout, 1987), DETECT (Zhang & Stout, 1999), and Mokken scalability analysis with automated item selection procedure (MSA-AISP; Mokken, 1971; van der Ark, 2012). These represent the current best-practice toolkit for dimensionality assessment in applied psychometrics, providing the baseline against which ML methods are evaluated.
Machine learning–based methods (5): Non-negative matrix factorization (NMF; Lee & Seung, 1999), variational autoencoder with latent dimension selection via ELBO (VAE-dim; Kingma & Welling, 2014), spectral clustering on tetrachoric correlation matrices (SC-tetra), LASSO-penalized exploratory factor analysis (LASSO-EFA; Hirose & Yamamoto, 2015), and neural IRT with cross-validated latent dimensionality (NIRT; Wilson, Fudolig, & Fabian, 2019). These methods make fewer parametric distributional assumptions than their traditional counterparts, making them candidate alternatives when normality fails — but their small-sample behavior is poorly characterized.
Distribution-adaptive IRT (1): Ramsay/Davidian Curve IRT (RC-IRT; Ramsay, 1991; Woods & Thissen, 2006), which estimates the latent ability distribution nonparametrically alongside item parameters, replacing the normality assumption with a flexible spline or smooth exponential curve fitted to the observed data. RC-IRT directly targets Condition 3 above and serves as the theoretically motivated benchmark for performance under bimodal and skewed distribution conditions.
Outcome criteria. Each method is evaluated on four criteria within each simulation cell: (1) proportion correct (PC) — the proportion of replications in which the method recovers the true number of dimensions, the primary accuracy criterion; (2) structural recovery (RMSE_G, RMSE_Sk) — for methods that produce a loading matrix, the root mean squared error between estimated and true factor loadings, distinguishing methods that count dimensions correctly from those that also recover the correct structure; (3) Type I error rate — the false-positive rate under true unidimensional structure, particularly costly because spurious multidimensional solutions directly undermine the ability–knowledge decomposition in §3.3.2; and (4) wall-clock computation time per replication, relevant for operational feasibility with large item banks.
Transition to empirical application. Methods that demonstrate PC > .80 and Type I error < .10 across the LLM-relevant simulation cells — particularly the small-N (\(N \leq 71\)) and bimodal-distribution conditions — are selected for application to the empirical 71-LLM × 645-item USMLE response matrix. All internal structure evidence in §3.3.2 (bifactor decomposition) and §3.3.1 (simulation workflow) is restricted to simulation-validated methods, ensuring that every structural claim is backed by direct evidence of method reliability under conditions comparable to the empirical application context.
Critical gap: Internal structure evidence is almost never reported in LLM evaluation practice. When results show that Model X scores 90% on “abstract algebra” and 60% on “moral scenarios,” the field assumes these subscales measure different constructs — but no structural evidence, and no method-validation evidence, supports this interpretation. The simulation-first strategy adopted here is designed to close both gaps simultaneously.
A fundamental methodological challenge in the RQ2 analysis is that the true dimensional structure of the LLM response matrix is unknown: we cannot observe whether the data-generating process is unidimensional, bifactor, or correlated-factors. All empirical dimensionality methods are fallible — they produce decisions that can be wrong, and they fail in different ways under different conditions. A simulation study addresses this directly: by generating response matrices from known structural models and testing how accurately each method recovers that known structure, we can identify which methods are most trustworthy under the specific conditions of the dissertation data (n = 71 LLMs, J = 645 items, non-normal bimodal latent distribution, binary responses). This is not a peripheral methodological aside — the choice of dimensionality method is a construct validity decision: claiming that a bifactor structure is supported requires confidence that the method used to identify it is not systematically biased toward or against bifactor recovery under the data conditions at hand.
The comparison is particularly timely because the psychometric literature has developed a rich toolkit of dimensionality procedures (Horn’s parallel analysis, DIMTEST, Mokken scaling, Hull method) that were designed for human examinee data under standard testing conditions — conditions that the LLM assessment context violates in at least two important respects: small “sample size” (n = 71 models is tiny by psychometric standards) and non-normal latent distribution. Meanwhile, machine learning has produced a complementary set of dimensionality discovery tools — variational autoencoders, non-negative matrix factorization, spectral graph methods, regularized factor analysis — that are distribution-agnostic and designed for high-dimensional data but have rarely been evaluated against psychometric gold standards on binary item response matrices. A head-to-head comparison under controlled conditions fills this gap and directly informs the method selection for the empirical RQ2 analysis.
Before presenting the simulation study design and its results, it is useful to lay out the full dimensionality exploration workflow as a coherent sequential plan. This overview maps the logical progression from raw item response data through method application, simulation-informed validation, convergence synthesis, and final bifactor IRT estimation. Each phase described here corresponds to a subsequent technical section; the intent is to make the architecture of the analysis immediately legible before the reader encounters the granular methodological detail.
Phase 1 — Pre-analysis data conditioning.
The starting point is the 71 LLM × 645-item binary response matrix. Before any dimensionality method is applied, three preparatory steps are required. First, items with near-zero variance — specifically, correct-response rates below .05 or above .95 — are flagged for potential exclusion, as such items contribute negligible discriminatory information and can distort factor-analytic solutions by inflating or collapsing off-diagonal tetrachoric correlations. Second, the tetrachoric correlation matrix is computed from the binary response matrix as the appropriate input for all correlation-based dimensionality methods (see §3.3 above for the distributional justification):
# Phase 1: Pre-analysis conditioning
library(psych)
# Compute tetrachoric correlation matrix (input for factor-based methods)
tetra_out <- psych::tetrachoric(item_matrix) # item_matrix: 71 LLMs x 645 items
tetra_rho <- tetra_out$rho # 645 x 645 tetrachoric correlation matrix
# Flag items with extreme difficulty (near-zero discriminatory variance)
p_correct <- colMeans(item_matrix)
item_flags <- which(p_correct < .05 | p_correct > .95)
cat(length(item_flags), "items flagged for extreme difficulty\n")
# Empirical latent distribution check via WLE theta from initial Rasch calibration
init_mod <- mirt::mirt(item_matrix, model = 1, itemtype = "Rasch", verbose = FALSE)
theta_WLE <- mirt::fscores(init_mod, method = "WLE")[, 1]
plot(density(theta_WLE), main = "Empirical LLM latent ability distribution",
xlab = "WLE theta", lwd = 2)
Third, the latent ability distribution is characterized empirically — via kernel density estimation of the WLE theta scores from an initial Rasch calibration — to determine whether the bimodal or negatively skewed profile expected of the LLM population is present. This determination is consequential: if non-normality is confirmed, normality-assuming methods (standard ML-based EFA, unadjusted parallel analysis) are demoted to secondary status, and distribution-free methods (DIMTEST, Mokken, NMF, VAE) become the primary evidence sources.
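If a formal check is preferred over visual inspection of the kernel density, a minimal sketch follows, assuming the diptest package for Hartigan’s dip test of unimodality (skewness and kurtosis via e1071, as elsewhere in this chapter); the cut-offs shown are illustrative rather than prescriptive:
library(diptest); library(e1071)
dip_res <- dip.test(theta_WLE)                 # p < .05 suggests more than one mode
cat("Dip test p =", round(dip_res$p.value, 3),
    "| skewness =", round(skewness(theta_WLE), 2),
    "| excess kurtosis =", round(kurtosis(theta_WLE), 2), "\n")
nonnormal_flag <- dip_res$p.value < .05 | abs(skewness(theta_WLE)) > 1
# If nonnormal_flag is TRUE, normality-assuming methods are demoted to secondary status (see text)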
Phase 2 — Traditional psychometric factor enumeration.
All seven psychometric dimensionality methods are applied to the tetrachoric correlation matrix and/or the raw response matrix, each yielding an estimated number of dimensions \(\hat{K}\). The methods span three conceptual families: eigenvalue-based enumeration (PA, PA-poly, MAP, Hull), nonparametric tests of essential unidimensionality (DIMTEST), conditional covariance partitioning (DETECT), and scalability-based scale discovery (MSA-AISP). Standard parallel analysis (PA) is included as a baseline but is expected to overestimate \(\hat{K}\) under binary response conditions — a known artifact of applying normality-based reference eigenvalues to tetrachoric matrices, which PA-poly corrects by generating polychoric-appropriate reference distributions (Garrido, Abad, & Ponsoda, 2013).
# Phase 2: All seven traditional psychometric methods in a single pass
library(EFA.dimensions); library(EFAdiff); library(sirt); library(mokken)
k_PA <- psych::fa.parallel(tetra_rho, n.obs = nrow(item_matrix),
fa = "fa", plot = FALSE)$nfact
k_PApoly <- EFA.dimensions::PA.poly(item_matrix)$nfactors
k_MAP <- which.min(psych::vss(tetra_rho, n.obs = nrow(item_matrix),
plot = FALSE)$map)
k_Hull <- EFAdiff::EFA_HULL(tetra_rho, n.obs = nrow(item_matrix))$n_factors
dimtest_p <- sirt::dimtest(item_matrix, score.type = "WLE")$p.value
k_DETECT <- sirt::detect.index(item_matrix, ndim = 11)$n_clusters
k_MSA <- length(unique(mokken::aisp(item_matrix, lower = 0.30)))
trad_results <- data.frame(
Method = c("PA", "PA-poly", "MAP", "Hull", "DIMTEST", "DETECT", "MSA-AISP"),
K_hat = c(k_PA, k_PApoly, k_MAP, k_Hull,
ifelse(dimtest_p < .05, ">1 (reject 1D)", "1 (retain 1D)"),
k_DETECT, k_MSA),
Role = c("Baseline -- expected to overestimate under binary data",
"Binary-corrected PA -- primary enumeration method",
"Parsimony-based; robust at small n",
"Elbow-based; strong bifactor recovery in simulation",
"Unidimensionality test only -- no point K_hat",
"Cluster-based; recovers item-level structure",
"Scalability-based; nonparametric and distribution-free")
)
print(trad_results)
Each method’s known strengths and failure modes under LLM data conditions — small \(n\), non-normal latent distribution, binary items — are characterized in detail in the Methods Compared section below and assessed empirically in the simulation study.
Phase 3 — Machine learning dimensionality methods.
The five ML methods are applied to the same data, operating directly on the binary item response matrix without requiring a tetrachoric pre-transformation. This is a practical advantage when \(n\) is small (here, \(n = 71\)) and tetrachoric estimation may be numerically unstable: NMF, VAE-dim, and NIRT bypass the correlation matrix entirely and factorize or model the raw response patterns directly. Spectral clustering uses the tetrachoric matrix as a weighted similarity graph, and LASSO-EFA applies an \(L_1\) penalty to factor loadings estimated from the tetrachoric matrix. ML methods are not treated as replacements for psychometric methods but as a convergent evidence source: agreement between ML and psychometric \(\hat{K}\) estimates strengthens the dimensionality conclusion, while systematic disagreement triggers additional diagnostic checks — typically DETECT cluster inspection and VAE latent space visualization.
# Phase 3: ML methods -- all five with complete implementations
library(NMF); library(fanc); library(kernlab); library(reticulate)
# --- NMF: K selected by held-out reconstruction MSE ---
set.seed(42)
hold_idx <- sample(nrow(item_matrix), size = round(.20 * nrow(item_matrix)))
train_mat <- item_matrix[-hold_idx, ]
test_mat <- item_matrix[hold_idx, ]
nmf_errors <- sapply(1:15, function(k) {
fit <- NMF::nmf(train_mat, rank = k, method = "brunet", nrun = 5, seed = 42)
  H <- NMF::coef(fit)                                   # k x J item-factor basis learned on the training LLMs
  W_test <- test_mat %*% t(H) %*% solve(H %*% t(H) + diag(1e-6, k))  # least-squares projection of held-out LLMs onto the basis
  mean((test_mat - W_test %*% H)^2)                      # held-out reconstruction error
})
k_NMF <- which.min(nmf_errors)
# --- LASSO-EFA: BIC-selected L1 penalty induces sparse loading matrix ---
lasso_fit <- fanc::fanc(tetra_rho, factors = 11, n.obs = nrow(item_matrix))
k_LASSO <- lasso_fit$best$factors
# --- Spectral clustering: eigengap heuristic on normalized graph Laplacian ---
W_sim <- abs(tetra_rho); diag(W_sim) <- 0                    # item similarity graph from |tetrachoric r|
D_inv <- diag(1 / sqrt(rowSums(W_sim)))
L_sym <- diag(nrow(W_sim)) - D_inv %*% W_sim %*% D_inv       # normalized graph Laplacian
ev <- sort(eigen(L_sym, symmetric = TRUE, only.values = TRUE)$values)
k_SC <- which.max(diff(ev[1:16]))                            # eigengap: largest jump among the first 15 eigenvalues
# (item cluster assignments for the chosen k can then be obtained with kernlab::specc(tetra_rho, centers = k_SC))
# --- VAE-dim: beta-VAE (PyTorch); K selected by ELBO on held-out split ---
py_run_string("
import torch, numpy as np
import torch.nn as nn
class BetaVAE(nn.Module):
def __init__(self, p, K):
super().__init__()
self.enc = nn.Sequential(nn.Linear(p, 128), nn.ReLU(),
nn.Linear(128, 64), nn.ReLU())
self.mu_layer = nn.Linear(64, K)
self.lv_layer = nn.Linear(64, K)
self.dec = nn.Sequential(nn.Linear(K, 64), nn.ReLU(),
nn.Linear(64, 128), nn.ReLU(),
nn.Linear(128, p), nn.Sigmoid())
def forward(self, x):
h = self.enc(x)
mu, lv = self.mu_layer(h), self.lv_layer(h)
z = mu + torch.exp(0.5 * lv) * torch.randn_like(mu)
return self.dec(z), mu, lv
def vae_elbo(X_tr, X_va, K, epochs=300, beta=4.0):
model = BetaVAE(X_tr.shape[1], K)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(epochs):
recon, mu, lv = model(X_tr)
bce = nn.BCELoss(reduction='sum')(recon, X_tr)
kld = -0.5 * torch.sum(1 + lv - mu.pow(2) - lv.exp())
opt.zero_grad(); (bce + beta * kld).backward(); opt.step()
with torch.no_grad():
recon_v, mu_v, lv_v = model(X_va)
elbo = -(nn.BCELoss(reduction='sum')(recon_v, X_va).item()
+ (-0.5 * torch.sum(1 + lv_v - mu_v.pow(2) - lv_v.exp())).item())
return elbo
X_all = torch.FloatTensor(np.array(r.item_matrix, dtype=float))
hold = np.array(r.hold_idx, dtype=int) - 1 # convert 1-based R index to 0-based
mask = np.isin(np.arange(len(X_all)), hold)
X_tr = X_all[~mask]; X_va = X_all[mask]
elbos = [vae_elbo(X_tr, X_va, K) for K in range(1, 16)]
k_VAE = int(np.argmax(elbos)) + 1
")
k_VAE <- as.integer(py$k_VAE)
# --- NIRT: neural IRT with K-dim latent space; K selected by 5-fold CV accuracy ---
py_run_string("
import torch, numpy as np
import torch.nn as nn
from sklearn.model_selection import KFold
X = np.array(r.item_matrix, dtype=float)
n, p = X.shape
kf = KFold(n_splits=5, shuffle=True, random_state=42)
def nirt_cv_acc(K):
accs = []
for tr_i, va_i in kf.split(range(n)):
embed = nn.Embedding(n, K)
head = nn.Sequential(nn.Linear(K, p), nn.Sigmoid())
opt = torch.optim.Adam(
list(embed.parameters()) + list(head.parameters()), lr=5e-3)
idx_tr = torch.LongTensor(tr_i)
X_tr = torch.FloatTensor(X[tr_i])
for _ in range(200):
pred = head(embed(idx_tr))
nn.BCELoss()(pred, X_tr).backward()
opt.step(); opt.zero_grad()
with torch.no_grad():
pred_va = head(embed(torch.LongTensor(va_i)))
acc = ((pred_va > 0.5).float() ==
torch.FloatTensor(X[va_i])).float().mean().item()
accs.append(acc)
return float(np.mean(accs))
cv_acc = {K: nirt_cv_acc(K) for K in range(1, 16)}
k_NIRT = max(cv_acc, key=cv_acc.get)
")
k_NIRT <- as.integer(py$k_NIRT)
ml_results <- data.frame(
Method = c("NMF", "LASSO-EFA", "SC-tetra", "VAE-dim", "NIRT"),
K_hat = c(k_NMF, k_LASSO, k_SC, k_VAE, k_NIRT),
Notes = c("Held-out reconstruction MSE",
"BIC-penalized sparse loadings",
"Eigengap on tetrachoric graph Laplacian",
"ELBO on held-out split (beta-VAE, beta = 4, 300 epochs)",
"5-fold CV classification accuracy (neural IRT, 200 epochs)")
)
print(ml_results)
Phase 4 — Simulation validation: determining which methods to trust under LLM conditions.
Before the empirical \(\hat{K}\) estimates from Phases 2 and 3 are synthesized into a final dimensionality decision, the simulation study described in the Experimental Design section below establishes which methods perform most accurately under the conditions specific to the dissertation data (n = 71, bimodal latent distribution, binary responses, J = 645). Methods achieving proportion-correct (PC) > .80 under these conditions are designated primary methods; those with .60 \(\leq\) PC \(\leq\) .80 are supporting methods; those with PC < .60 are reported for transparency but given no weight in the final decision.
The sequencing — empirical application first (Phases 2–3), simulation validation second, synthesis third (Phase 5) — is methodologically deliberate. The empirical \(\hat{K}\) estimates are produced independently of the simulation and held aside until the simulation results establish which estimates deserve the most weight. This avoids the circular reasoning that would arise if a method were selected because it produces a preferred result, then a simulation designed with that structure as the known target were used to post-hoc validate the selection.
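A small sketch of how this tiering rule could be operationalized once the simulation has run, assuming the per-cell summary object (sim_summary, with columns n, J, K, structure, latent_dist, method, and PC_mean) assembled at the end of the simulation script later in this section, and dplyr for the data manipulation:
library(dplyr)
llm_cell <- sim_summary %>%                      # the dissertation-matched cell
  filter(n == 71, J == 645, K == 11, structure == "BF", latent_dist == "bimodal")
method_tiers <- llm_cell %>%
  mutate(tier = case_when(
    PC_mean > .80  ~ "primary",
    PC_mean >= .60 ~ "supporting",
    TRUE           ~ "reported for transparency only"
  )) %>%
  arrange(desc(PC_mean))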
Phase 5 — Convergence synthesis and structured decision rule.
After the simulation identifies primary and supporting methods, all \(\hat{K}\) estimates are tabulated and the following structured decision rule is applied:
| Decision Tier | Criterion | Action |
|---|---|---|
| Consensus | \(\geq\) 8 of the 12 point-estimate methods agree on \(\hat{K}\) (DIMTEST is excluded here because it yields only a unidimensionality decision, not a point estimate) | Accept as primary dimensionality estimate |
| Psychometric–ML agreement | Primary psychometric and primary ML methods agree; others diverge | Accept the agreed-upon \(\hat{K}\); report divergent estimates as sensitivity checks |
| Simulation-resolved disagreement | Methods disagree; simulation identified a single best-performing method for this structural condition | Defer to simulation-validated winner; tabulate all \(\hat{K}\) in supplementary material |
| Unresolved disagreement | Methods disagree; no single simulation winner under current conditions | Report \(\hat{K}\) range; estimate bifactor model for \(\hat{K}_{min}\), \(\hat{K}_{mode}\), and \(\hat{K}_{max}\); compare model fit |
In practice, the expectation — supported by Hypotheses H1 through H6 below — is that PA-poly, Hull, and NMF will converge on \(\hat{K}\) under the dissertation conditions, with DIMTEST providing a binary essential-unidimensionality check and DETECT providing a cluster-level structural map. If these three methods agree and DETECT’s cluster count corroborates them, the Consensus tier is invoked and the simulation-resolved and unresolved tiers become unnecessary.
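A minimal sketch of the Tier-1 consensus check; K_hat_vec is a hypothetical named vector collecting the 12 point estimates produced in Phases 2 and 3:
consensus_K <- function(K_hat_vec, min_agree = 8) {
  tab <- sort(table(K_hat_vec), decreasing = TRUE)      # tally the estimated dimension counts
  if (tab[1] >= min_agree) as.integer(names(tab)[1]) else NA_integer_   # NA = consensus tier not met
}
# Example: consensus_K(c(PA = 3, PA_poly = 1, MAP = 1, Hull = 1, DETECT = 1, MSA = 1,
#                        NMF = 1, VAE = 1, SC = 2, LASSO = 1, NIRT = 1, RC_IRT = 1))  # returns 1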
Phase 6 — Confirmatory bifactor IRT estimation and construct validity statistics.
The agreed-upon \(\hat{K}\) from
Phase 5 is passed directly to the bifactor IRT model described in
§3.3.2. Specifically, mirt::bfactor() is called with
model = topic_vec — item-to-domain assignments drawn from
the 11 USMLE topic labels, cross-validated against DETECT’s empirical
cluster assignments to verify that the a priori domain groupings are
recoverable from the response data — and with the empirical histogram
option active to avoid imposing normality on the LLM theta distribution.
From the resulting loading matrix, \(\omega_h\), ECV, and PUC are computed via
psych::omega() and interpreted against the decision table
in §3.3.2. This final phase closes the loop between dimensionality
exploration and the construct validity argument: the simulation study
ensures that the structural model entering bifactor estimation was
chosen by trustworthy methods, and the bifactor statistics translate
that structural decision directly into validity evidence about the
general-reasoning versus domain-knowledge decomposition that is central
to RQ2.
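For concreteness, the validity statistics named above can also be computed directly from the standardized bifactor loading matrix; the sketch below is a minimal base-R implementation of the standard ECV, omega-hierarchical, and PUC formulas, where Lambda is assumed to hold the G loadings in column 1 and the \(S_k\) loadings in the remaining columns (extraction of Lambda from the fitted bfactor object follows the mirt workflow described above):
bifactor_indices <- function(Lambda) {
  lam_G <- Lambda[, 1]
  lam_S <- Lambda[, -1, drop = FALSE]
  uniq  <- 1 - rowSums(Lambda^2)                                 # item uniquenesses
  ECV   <- sum(lam_G^2) / sum(Lambda^2)                          # G's share of common variance
  omega_h <- sum(lam_G)^2 /
    (sum(lam_G)^2 + sum(colSums(lam_S)^2) + sum(uniq))           # hierarchical omega for G
  n_k   <- colSums(lam_S != 0)                                   # items per group factor
  J     <- nrow(Lambda)
  PUC   <- (choose(J, 2) - sum(choose(n_k, 2))) / choose(J, 2)   # proportion of uncontaminated correlations
  c(ECV = ECV, omega_h = omega_h, PUC = PUC)
}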
The six-phase workflow is summarized schematically in Figure 4 below.
Figure 4. Six-Phase Dimensionality Exploration Workflow
The simulation uses a fully crossed factorial design. The factors and their levels are:
Factor 1 — True data-generating structure (5 levels):
1. Unidimensional (1D): A single latent factor G; all items load on G only. True dimensional structure = 1. K is fixed at 1 for this condition.
2. Two-dimensional (2D): Two correlated factors (phi = .40) with no general factor; items load exclusively on one of the two factors. K is fixed at 2 for this condition. This structure represents the minimal departure from unidimensionality — a single domain split (e.g., reasoning vs. knowledge retrieval) — and is included to evaluate each method’s sensitivity to the simplest possible multidimensional structure, which is practically relevant for LLM benchmarks that may conflate two distinct ability dimensions.
3. Bifactor (BF): One general factor G plus K orthogonal domain group factors S_1…S_K. This is the theoretically motivated structure for the dissertation; its recovery is the primary target of the simulation.
4. Correlated factors (CF): K correlated domain factors with no general factor; inter-factor correlations phi = .40. Represents the hypothesis that benchmark performance is entirely domain-knowledge-dependent, with no shared general reasoning component.
5. Higher-order (HO): A second-order general factor G causes K first-order domain factors, which in turn cause item responses. Superficially similar to bifactor but structurally distinct: G operates only indirectly through domain factors rather than directly on items, producing attenuated implied G loadings (lambda_G = a_G * a_Sk) and larger factor-specific residuals.
Factor 2 — Number of domain factors K (nested within structure: fixed at K = 2 for the 2D structure; three levels for BF, CF, HO):
- K = 2 (fixed; applies only to the 2D structure)
- K = 3 (coarse: simple factor structure; applies to BF, CF, HO)
- K = 5 (moderate; applies to BF, CF, HO)
- K = 11 (fine-grained: mirrors the 11 USMLE topics in the dissertation data; applies to BF, CF, HO)

Factor 3 — Sample size n (number of LLMs / examinees; 4 levels):
- n = 30 (sparse; below most psychometric guidelines)
- n = 71 (matches the actual dissertation dataset)
- n = 150 (moderate)
- n = 300 (large; upper reference bound)

Factor 4 — Test length J (number of items; 3 levels):
- J = 50 (short)
- J = 200 (medium)
- J = 645 (full; mirrors the actual dataset)

Factor 5 — Factor loading magnitude (2 levels; applied uniformly to G and S_k):
- Weak: G loadings = .30, S_k loadings = .30
- Strong: G loadings = .60, S_k loadings = .40 (stronger G than S_k, consistent with a G-dominant bifactor)

Factor 6 — Latent ability distribution (4 levels):
- Normal: theta ~ N(0, 1) — standard psychometric assumption
- Bimodal: theta ~ 0.40 · N(-1.0, 0.8) + 0.60 · N(1.5, 0.6) — models the LLM population with a cluster of smaller models at lower theta and frontier models at higher theta
- Negatively skewed: theta ~ skew-normal(xi = 1.5, omega = 1.0, alpha = -4) — models a population where most LLMs perform near ceiling
- Positively skewed: theta ~ skew-normal(xi = -1.5, omega = 1.0, alpha = +4), the mirror-image direction-control counterpart to the negatively skewed condition (see the rationale given for the LLM-anchored conditions above)
All cells are crossed except that K is fixed within the 1D and 2D structures (K = 1 and K = 2, respectively) and varies only across BF, CF, and HO. With R = 500 replications per cell, the total simulation involves approximately 500 \(\times\) (1 + 1 + 3 \(\times\) 3) \(\times\) (4 \(\times\) 3 \(\times\) 2 \(\times\) 4) = 500 \(\times\) 11 \(\times\) 96 = 528,000 data-generation-and-recovery cycles. Computationally intensive cells (large J, large n, ML methods) are parallelized across cores.
Table SD1. Full Simulation Design: Factors, Levels, and Rationale
| Factor | Levels | \(N\) levels | Rationale for inclusion |
|---|---|---|---|
| F1 — True structure | Unidimensional (1D); Two-dimensional (2D), \(\phi = .40\); Bifactor (BF): \(G + K\) orthogonal \(S_k\); Correlated factors (CF), \(\phi = .40\); Higher-order (HO): \(G \to S_k \to X\) | 5 | 1D is the null model; 2D tests minimal departure from unidimensionality; BF is the theoretically motivated target for RQ2; CF tests whether the general-factor interpretation survives alternative structures; HO is superficially similar to BF but structurally distinct, so methods must distinguish them |
| F2 — Domain factors \(K\) | 2 (fixed for 2D only); 3 (coarse); 5 (moderate); **11 (dissertation-matched)** | 4 | \(K = 11\) mirrors the 11 USMLE topic domains in the dissertation data |
| F3 — Sample size \(n\) | 30 (sparse); **71 (dissertation-matched)**; 150 (moderate); 300 (large reference) | 4 | \(n = 71\) matches the actual LLM pool; 30 and 150 test boundary behaviour |
| F4 — Test length \(J\) | 50 (short); 200 (medium); **645 (dissertation-matched)** | 3 | \(J = 645\) mirrors the full USMLE item bank; 50 and 200 test scalability |
| F5 — Loading magnitude | Weak: \(a_G = .30\), \(a_S = .30\); Strong: \(a_G = .60\), \(a_S = .40\) | 2 | Weak loadings stress-test methods near the identifiability boundary; strong loadings represent a \(G\)-dominant bifactor, the expected structure |
| F6 — Latent distribution | Normal: \(\theta \sim \mathcal{N}(0,1)\); **Bimodal**: \(0.40\,\mathcal{N}(-1.0, 0.8) + 0.60\,\mathcal{N}(1.5, 0.6)\); Negatively skewed: skew-normal \((\xi=1.5, \omega=1.0, \alpha=-4)\); Positively skewed: skew-normal \((\xi=-1.5, \omega=1.0, \alpha=+4)\) | 4 | Normal is the standard assumption; bimodal reflects the frontier/open-source LLM gap; negative skew models a ceiling-effect population; positive skew serves as a direction-control condition |
| Replications | \(R = 500\) per cell | — | Sufficient for stable PC and Type I error estimates at \(\pm 0.04\) margin |
| Total cycles | \(\approx 528{,}000\) | — | Parallelized across cores for ML-heavy cells |

Note. Cells are fully crossed within each structural family. K is not crossed with 1D (fixed at \(K=1\)) or 2D (fixed at \(K=2\)). Bold entries indicate the dissertation-matched levels that are of primary inferential interest. PC = proportion-correct dimension-count recovery; primary-method threshold = PC \(> .80\) at the dissertation conditions (\(n = 71\), bimodal, binary items).
Item responses are generated via the two-parameter logistic (2PL) IRT model:
\[P(X_{ij} = 1 \mid \theta_i,\, a_j,\, b_j) = \frac{1}{1 + \exp\!\bigl[-a_j(\theta_i - b_j)\bigr]}\]
where \(\theta_i\) is the latent ability of LLM \(i\) drawn from the distribution specified in Factor 6, \(a_j > 0\) is the item discrimination parameter (slope), and \(b_j\) is the item difficulty parameter (location on the \(\theta\) scale). For multidimensional structures (bifactor, correlated-factors, higher-order), the compensatory M-2PL model replaces the scalar ability with a \(K\)-dimensional vector:
\[P(X_{ij} = 1 \mid \boldsymbol{\theta}_i,\, \mathbf{a}_j,\, d_j) = \frac{1}{1 + \exp\!\bigl[-(\mathbf{a}_j^\top \boldsymbol{\theta}_i + d_j)\bigr]}\]
where \(\boldsymbol{\theta}_i \in \mathbb{R}^K\) is the latent vector for LLM \(i\), \(\mathbf{a}_j \in \mathbb{R}^K\) is the item loading vector, and \(d_j\) is the intercept (\(b_j = -d_j/\|\mathbf{a}_j\|\)). For the bifactor structure specifically, \(\boldsymbol{\theta}_i = (G_i,\, S_{k(j),i})^\top\) collapses to a two-component vector — the general factor \(G\) and the single group factor \(S_k\) for item \(j\)’s domain:
\[P(X_{ij} = 1 \mid G_i,\, S_{k(j),i}) = \frac{1}{1 + \exp\!\bigl[-(a_{Gj}\,G_i + a_{Sj}\,S_{k(j),i} + d_j)\bigr]}\]
Item difficulty parameters are drawn as \(b_j \sim \mathcal{N}(-0.3,\; 1.0)\) — a
mild negative shift approximating the USMLE difficulty distribution —
and discrimination parameters reflect the loading-magnitude factor
(Factor 5): weak loadings use \(a_j \sim
\mathcal{U}(0.25,\, 0.45)\) and strong loadings use \(a_j \sim \mathcal{U}(0.55,\, 0.80)\) for
the general factor, with group-factor loadings drawn from the same
ranges. The mirt::simdata() function handles data
generation for all structural models.
Traditional psychometric methods (7 methods):
Horn’s Parallel Analysis (PA) — eigenvalues from the
data correlation matrix are compared against eigenvalues from random
data; factors are retained where data eigenvalues exceed the 95th
percentile of the random distribution. Input: tetrachoric correlation
matrix. Software: psych::fa.parallel().
Modified Parallel Analysis for Polychoric Correlations
(PA-poly) — Horn’s PA adapted for binary/ordinal data by generating
polychoric-appropriate reference distributions (Garrido, Abad, &
Ponsoda, 2013). This correction is critical for binary item responses
where standard PA overestimates the number of factors. Software:
EFA.dimensions::PA.poly().
Minimum Average Partial (MAP; Velicer, 1976) — the
average squared partial correlation after partialing successive
components is minimized; the number of components at the minimum indexes
the dimensionality. Less sensitive to sample size than PA under some
conditions. Software: psych::vss().
Hull Method (Lorenzo-Seva, Timmerman, & Kiers, 2011)
— selects the number of factors at the “elbow” of the goodness-of-fit
vs. degrees-of-freedom curve, balancing model fit against parsimony.
Shown to outperform PA for bifactor structures in simulation. Software:
EFAdiff::EFA_HULL().
DIMTEST (Stout, 1987) — nonparametric, distribution-free
test of essential unidimensionality. Partitions items into an assessment
subtest (AT) and a partitioning subtest (PT); the T-statistic tests
whether the AT items are conditionally independent given PT score. Does
not assume normal latent distribution; directly applicable to bimodal
LLM data. Software: sirt::dimtest().
DETECT (Zhang & Stout, 1999) — distribution-free
partitioning of items into clusters that maximize conditional covariance
heterogeneity. Simultaneously provides a dimensionality test (the DETECT
index) and recovers the within-dimension item structure. Unlike the
above methods, DETECT recovers a cluster assignment per item, enabling
evaluation of structural recovery (not just dimension count). Software:
sirt::detect.index().
Mokken Scale Analysis — Automated Item Selection Procedure
(MSA-AISP; Mokken, 1971; van der Ark, 2012) — items are partitioned
into Mokken scales using the H scalability coefficient; the number of
scales with H >= 0.30 approximates the number of meaningful
dimensions. Nonparametric; robust to non-normal distributions. Software:
mokken::aisp().
Machine learning methods (5 methods):
Non-negative Matrix Factorization (NMF) — factorizes the
binary item response matrix R \(\approx\) W · H (n \(\times\) K and K \(\times\) J non-negative matrices) for K =
1…15; the number of factors is selected by minimizing reconstruction
error on a 20% hold-out split of LLMs. NMF imposes non-negativity
(appropriate for binary correct/incorrect data) and has no
distributional assumptions. Software: NMF::nmf() (R) or
sklearn.decomposition.NMF (Python).
Variational Autoencoder with Latent Dimension Selection
(VAE-dim) — a VAE with a K-dimensional latent code is trained for K
= 1…15; the optimal K is selected by the ELBO (Evidence Lower BOund) on
a held-out validation split. The VAE learns a continuous latent
representation of item-response patterns without assuming linear factor
structure. Implemented in PyTorch via a 2-layer encoder/decoder with
sigmoid output for binary items; the beta-VAE penalty (beta = 4)
encourages disentangled latent dimensions. Software: custom PyTorch
implementation via reticulate in R.
Spectral Clustering on Tetrachoric Correlation Matrix
(SC-tetra) — the normalized graph Laplacian of the absolute
tetrachoric correlation matrix is eigendecomposed; the number of
near-zero eigenvalues (eigengap heuristic) identifies the number of
clusters/dimensions. Unlike factor analysis, spectral clustering detects
non-linear cluster structure in the item similarity graph. Software:
kernlab::specc() applied to the tetrachoric matrix from
psych::tetrachoric().
LASSO-Penalized Exploratory Factor Analysis (LASSO-EFA;
Hirose & Yamamoto, 2015) — an L1 penalty on factor loadings
simultaneously performs dimensionality selection and loading estimation;
the penalty parameter lambda is selected by BIC. The LASSO induces exact
zeros in loading estimates, producing sparse factor structures and
effectively selecting the number of non-trivial factors. Software:
fanc::fanc() (R).
Deep IRT — Neural Network IRT (NIRT; Wilson, Fudolig, &
Fabian, 2019) — replaces the logistic IRT response function with a
neural network, estimating both item parameters and latent person (LLM)
abilities. Dimensionality of the latent space is selected by
cross-validated predictive accuracy on held-out LLM–item pairs. NIRT is
capable of learning non-linear interactions between ability dimensions
and item features. Software: custom implementation in PyTorch via
reticulate.
Distribution-adaptive parametric IRT (1 method):
Ramsay/Davidian Curve IRT (RC-IRT; Ramsay, 1991; Woods & Thissen, 2006) estimates the latent ability distribution nonparametrically alongside the 2PL item parameters, replacing the normality assumption with a flexible spline or smooth exponential curve fitted to the observed data. Software: mirt::mirt() with dentype = "Ramsay" or dentype = "Davidian" (Chalmers, 2012).
# --- RC-IRT: Ramsay / Davidian Curve IRT with BIC-based K selection ---
library(mirt)   # mirt(), extract.mirt(), and fscores() are called unqualified below
nfactors_RC <- function(dat, K_max = 15) {
bic_ramsay <- numeric(K_max)
bic_davidian <- numeric(K_max)
for (k in seq_len(K_max)) {
# Ramsay spline curve for latent distribution
fit_R <- tryCatch(
mirt(data = dat,
model = k,
itemtype = "2PL",
dentype = "Ramsay",
SE = FALSE,
verbose = FALSE),
error = function(e) NULL
)
bic_ramsay[k] <- if (!is.null(fit_R)) extract.mirt(fit_R, "BIC") else Inf
# Davidian smooth exponential curve for latent distribution
fit_D <- tryCatch(
mirt(data = dat,
model = k,
itemtype = "2PL",
dentype = "Davidian",
SE = FALSE,
verbose = FALSE),
error = function(e) NULL
)
bic_davidian[k] <- if (!is.null(fit_D)) extract.mirt(fit_D, "BIC") else Inf
}
# Select K by lowest BIC across both curve types
k_ramsay <- which.min(bic_ramsay)
k_davidian <- which.min(bic_davidian)
# Return the K (and curve type) with the overall lowest BIC
if (min(bic_ramsay) <= min(bic_davidian)) {
list(K = k_ramsay, curve = "Ramsay", BIC = bic_ramsay[k_ramsay],
bic_vec = bic_ramsay)
} else {
list(K = k_davidian, curve = "Davidian", BIC = bic_davidian[k_davidian],
bic_vec = bic_davidian)
}
}
# Ability compression diagnostic (run on real data post-simulation)
compression_check <- function(dat, k) {
fit_norm <- mirt(dat, k, itemtype = "2PL", dentype = "Gaussian",
SE = FALSE, verbose = FALSE)
fit_rc <- mirt(dat, k, itemtype = "2PL", dentype = "Ramsay",
SE = FALSE, verbose = FALSE)
theta_norm <- as.vector(fscores(fit_norm))
theta_rc <- as.vector(fscores(fit_rc))
list(
sd_ratio = sd(theta_rc) / sd(theta_norm), # > 1 implies compression under normal
bimodality_coef = (e1071::skewness(theta_norm)^2 + 1) /
(e1071::kurtosis(theta_norm) + 3),
cor_theta = cor(theta_norm, theta_rc)
)
}
Primary outcome — Dimension count accuracy: For each method and condition, the proportion of replications in which the method correctly identified the true number of dimensions (proportion correct, PC). PC is the primary accuracy metric.
Secondary outcomes:
Structural recovery (for methods that produce a loading matrix): Root Mean Squared Error (RMSE) between the estimated and true factor loadings, computed separately for G-factor loadings (RMSE_G) and S_k-factor loadings (RMSE_Sk). This distinguishes methods that correctly count dimensions from those that also accurately recover the factor structure.
G-factor isolation accuracy (bifactor conditions only): The correlation between the estimated ECV and the true ECV (set by the loading parameters). A method that identifies the bifactor structure but systematically misestimates the G-factor’s dominance has low G-isolation accuracy, which directly affects the validity of the ability–knowledge decomposition.
Type I error rate (unidimensional true structure): The proportion of replications in which a method incorrectly identifies more than one dimension when the true structure is unidimensional. High Type I error rates are particularly costly because they lead to spurious multidimensional interpretations of what is actually a well-functioning unidimensional benchmark.
Computational time: Wall-clock time per replication, relevant for practical feasibility in the actual LLM analysis.
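A minimal sketch of how the Type I error criterion can be tabulated, assuming the per-cell summary object (sim_summary, with columns structure, method, and PC_mean) produced at the end of the simulation script later in this section; under a true unidimensional structure, the per-cell proportion correct is exactly one minus the false-positive rate:
library(dplyr)
type1_by_method <- sim_summary %>%
  filter(structure == "1D") %>%                          # truly unidimensional cells only
  group_by(method) %>%
  summarise(type1_error = mean(1 - PC_mean), .groups = "drop") %>%
  arrange(type1_error)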
Based on existing simulation evidence in the psychometric literature and the distinctive properties of LLM data, the following directional hypotheses guide the study:
H1 — Normal distribution, large sample: Under normal latent distribution and n >= 150, traditional methods (PA-poly, MAP, Hull) and ML methods (NMF, VAE-dim) will show comparable accuracy, with Hull and PA-poly performing best among psychometric methods and NMF performing best among ML methods (consistent with Lorenzo-Seva et al., 2011).
H2 — Non-normal distribution advantage for robust methods: Under bimodal and skewed latent distributions (the conditions most relevant to LLM data), distribution-free psychometric methods (DIMTEST, DETECT, MSA-AISP) and neural methods (VAE-dim, NIRT) will significantly outperform normality-based methods (standard PA, ML-EFA, LASSO-EFA) in PC. This advantage is expected to hold across both negatively and positively skewed conditions, though the magnitude may differ by direction: negative skew (the empirically dominant LLM pattern) is hypothesized to produce larger PC degradation for normality-assuming methods than positive skew, because the high-\(\theta\) frontier cluster that drives negative skew aligns poorly with the Gaussian quadrature grid used in standard MML estimation.
H3 — Small-sample penalty for ML methods: At n = 30 and n = 71, ML methods (VAE-dim, NIRT) will show higher variance in PC than psychometric methods, and simple nonparametric tests (DIMTEST, MSA-AISP) will achieve higher PC than complex ML methods due to their lower parameter count and distributional robustness.
H4 — Bifactor recovery challenge: All methods will show lower PC under bifactor structure than under correlated-factors structure, because the bifactor’s hierarchical organization (G + orthogonal S_k) is harder to distinguish from unidimensionality when G is dominant (high ECV). Hull and VAE-dim are expected to perform best for bifactor recovery based on known properties.
H5 — Structural recovery vs. count accuracy: Methods that correctly identify the number of dimensions will not necessarily achieve accurate structural recovery (RMSE_G, RMSE_Sk). DETECT and MSA-AISP — which produce explicit item-cluster assignments — are expected to achieve higher structural recovery than methods that only count factors (MAP, PA), even when count accuracy is similar.
H6 — G-isolation accuracy: Among methods that produce a
loading matrix, bifactor IRT with empirical histogram
(mirt::bfactor() + empiricalhist = TRUE) will
achieve the lowest RMSE_G under non-normal conditions, outperforming
standard ML-EFA and LASSO-EFA whose normality assumptions bias G-factor
loading estimates when the latent distribution is bimodal.
H7 — Distribution-adaptive IRT advantage under non-normal conditions: Under bimodal and skewed latent distributions, RC-IRT (Ramsay or Davidian curve) will achieve significantly higher dimension-count accuracy (PC) and lower ability compression (higher \(\theta\) SD ratio relative to normal-assumption IRT) than standard 2PL IRT and all normality-assuming psychometric methods. The advantage is expected to be largest under bimodal conditions, moderate under negatively skewed conditions, and smallest — but still present — under positively skewed conditions; under unimodal normal conditions, RC-IRT and standard IRT are expected to converge to equivalent fits, confirming that the adaptive estimator imposes no penalty when the normality assumption holds.
library(mirt) # data generation, bifactor IRT
library(psych) # PA, MAP, VSS, omega
library(EFA.dimensions) # PA-poly (polychoric-corrected parallel analysis)
library(EFAdiff) # Hull method
library(sirt) # DIMTEST, DETECT
library(mokken) # MSA-AISP
library(fanc) # LASSO-penalized EFA
library(kernlab) # spectral clustering
library(NMF) # non-negative matrix factorization
library(e1071) # skewness/kurtosis for RC-IRT compression diagnostics
library(reticulate) # Python bridge for VAE/NIRT (PyTorch)
library(future.apply) # parallel replication execution
library(tidyverse)
# ---- Simulation cell: one replication ---------------------------------
simulate_one <- function(n, J, K, structure, loading_level, latent_dist, seed) {
set.seed(seed)
# Step 1: Generate theta from specified latent distribution
theta <- switch(latent_dist,
"normal" = matrix(rnorm(n), ncol = 1),
"bimodal" = matrix(c(rnorm(round(0.4*n), -1.0, 0.8),
rnorm(round(0.6*n), 1.5, 0.6)), ncol = 1),
"neg_skewed" = matrix(sn::rsn(n, xi = 1.5, omega = 1.0, alpha = -4), ncol = 1),
"right_skewed" = matrix(sn::rsn(n, xi = -1.5, omega = 1.0, alpha = 4), ncol = 1),
stop("Unknown latent_dist: ", latent_dist)
)
# Step 2: Build loading matrix and multidimensional theta based on true structure
a_G <- if (loading_level == "weak") 0.30 else 0.60
a_Sk <- if (loading_level == "weak") 0.30 else 0.40
items_per_k <- floor(J / K)
  b_j <- rnorm(J, mean = -0.3, sd = 1) # item difficulty: mild negative shift approximating the USMLE distribution
item_difficulty <- matrix(b_j, ncol = 1)
if (structure == "1D") {
# Single latent factor; all items load on G
loading_matrix <- matrix(a_G, nrow = J, ncol = 1)
theta_multidim <- theta # univariate theta from Step 1
} else if (structure == "2D") {
# Two correlated factors (phi = .40); K fixed at 2 for this structure
loading_matrix <- matrix(0, nrow = J, ncol = 2)
for (k in 1:2) {
idx <- ((k - 1) * floor(J / 2) + 1):(k * floor(J / 2))
loading_matrix[idx, k] <- a_Sk
}
Phi <- matrix(c(1, .40, .40, 1), 2, 2)
theta_multidim <- MASS::mvrnorm(n, mu = c(0, 0), Sigma = Phi)
} else if (structure == "BF") {
# Bifactor: one general factor G + K orthogonal group factors
loading_matrix <- matrix(0, nrow = J, ncol = K + 1)
loading_matrix[, 1] <- a_G # all items load on G
for (k in 1:K) {
idx <- ((k - 1) * items_per_k + 1):(k * items_per_k)
loading_matrix[idx, k + 1] <- a_Sk
}
    # G (column 1) uses the Factor-6 latent distribution generated in Step 1;
    # the K orthogonal group factors remain standard normal
    theta_multidim <- cbind(theta, matrix(rnorm(n * K), nrow = n, ncol = K))
} else if (structure == "CF") {
# Correlated factors (phi = .40); no general factor
loading_matrix <- matrix(0, nrow = J, ncol = K)
for (k in 1:K) {
idx <- ((k - 1) * items_per_k + 1):(k * items_per_k)
loading_matrix[idx, k] <- a_Sk
}
Phi <- matrix(.40, K, K); diag(Phi) <- 1
theta_multidim <- MASS::mvrnorm(n, mu = rep(0, K), Sigma = Phi)
} else if (structure == "HO") {
# Higher-order: G drives K first-order factors; items load indirectly on G
lambda_G <- a_G * a_Sk # path-tracing implied G loading
lambda_resid <- sqrt(pmax(a_Sk^2 - lambda_G^2, 1e-4)) # factor-specific residual
loading_matrix <- matrix(0, nrow = J, ncol = K + 1)
loading_matrix[, 1] <- lambda_G
for (k in 1:K) {
idx <- ((k - 1) * items_per_k + 1):(k * items_per_k)
loading_matrix[idx, k + 1] <- lambda_resid
}
    # G (column 1) follows the Factor-6 latent distribution; first-order residual factors stay N(0,1)
    theta_multidim <- cbind(theta, matrix(rnorm(n * K), nrow = n, ncol = K))
}
# Step 3: Simulate binary item responses via mirt::simdata()
  dat <- simdata(a = loading_matrix, d = item_difficulty,
                 N = n, itemtype = "dich",   # dichotomous (2PL) responses from slopes a and intercepts d
                 Theta = theta_multidim)
# Step 4: Compute tetrachoric correlation matrix for factor-based methods
tetra <- psych::tetrachoric(dat)$rho
  # Step 5: Apply all 13 methods and record ndim estimates
results <- list(
PA = nfactors_PA(tetra, n),
PA_poly = nfactors_PA_poly(dat),
MAP = nfactors_MAP(tetra, n),
Hull = nfactors_Hull(tetra, n, J),
DIMTEST = nfactors_DIMTEST(dat),
DETECT = nfactors_DETECT(dat, K),
MSA = nfactors_MSA(dat),
NMF = nfactors_NMF(dat, K_max = 15),
VAE = nfactors_VAE(dat, K_max = 15), # calls Python via reticulate
SC_tetra = nfactors_spectral(tetra),
LASSO_EFA= nfactors_LASSO(tetra, n, J),
NIRT = nfactors_NIRT(dat, K_max = 15), # calls Python via reticulate
RC_IRT = nfactors_RC(dat, K_max = 15) # Ramsay/Davidian curve IRT via mirt
)
# Step 6: Compute PC, RMSE_G, RMSE_Sk, ECV_error
evaluate_recovery(results, true_K = K, true_loadings = loading_matrix, dat = dat)
}
# ---- Full factorial simulation design grid --------------------------------
# Base factors fully crossed; K nested within structure (fixed at 1 for 1D, 2 for 2D,
# and crossed over 3/5/11 for BF, CF, HO), anchored to the LLM measurement context
base_grid <- expand.grid(
  n             = c(30, 71, 150, 300),
  J             = c(50, 200, 645),
  loading_level = c("weak", "strong"),
  latent_dist   = c("normal", "bimodal", "neg_skewed", "right_skewed"),
  stringsAsFactors = FALSE
)
struct_K <- rbind(
  data.frame(structure = "1D", K = 1),
  data.frame(structure = "2D", K = 2),
  expand.grid(structure = c("BF", "CF", "HO"), K = c(3, 5, 11),
              stringsAsFactors = FALSE)
)
sim_conditions <- merge(base_grid, struct_K)   # Cartesian product (no shared columns)
n_conditions <- nrow(sim_conditions)
n_reps <- 500 # replications per condition cell
cat(sprintf("Total simulation cells: %d | Total replications: %d\n",
            n_conditions, n_conditions * n_reps))
# ---- Run in parallel across conditions x replications -------------------
plan(multisession, workers = availableCores() - 1)
run_condition <- function(cond_row) {
future_lapply(
seq_len(n_reps),
function(seed) {
simulate_one(
n = cond_row$n,
J = cond_row$J,
K = cond_row$K,
structure = cond_row$structure,
loading_level = cond_row$loading_level,
latent_dist = cond_row$latent_dist,
seed = seed
)
},
future.seed = TRUE
)
}
# Execute across all condition rows; store as a named list
sim_results_all <- lapply(
seq_len(n_conditions),
function(i) {
cond <- sim_conditions[i, , drop = FALSE]
cat(sprintf("Running condition %d/%d: n=%d J=%d K=%d struct=%s load=%s dist=%s\n",
i, n_conditions,
cond$n, cond$J, cond$K,
cond$structure, cond$loading_level, cond$latent_dist))
run_condition(cond)
}
)
# Flatten results into a single summary data frame
sim_summary <- do.call(rbind, lapply(seq_len(n_conditions), function(i) {
cond <- sim_conditions[i, ]
reps <- sim_results_all[[i]]
pc_mat <- do.call(rbind, lapply(reps, function(r) r$PC))
rmse_g <- do.call(rbind, lapply(reps, function(r) r$RMSE_G))
rmse_sk <- do.call(rbind, lapply(reps, function(r) r$RMSE_Sk))
data.frame(
n = cond$n,
J = cond$J,
K = cond$K,
structure = cond$structure,
loading_level = cond$loading_level,
latent_dist = cond$latent_dist,
method = colnames(pc_mat),
PC_mean = colMeans(pc_mat, na.rm = TRUE),
PC_sd = apply(pc_mat, 2, sd, na.rm = TRUE),
RMSE_G_mean = colMeans(rmse_g, na.rm = TRUE),
RMSE_Sk_mean = colMeans(rmse_sk, na.rm = TRUE)
)
}))
# Save for downstream analysis
saveRDS(sim_summary, "sim_summary_full.rds")
cat("Simulation complete. Results saved to sim_summary_full.rds\n")
The simulation results directly inform method selection for the actual USMLE LLM response matrix. Specifically: (a) the method(s) achieving highest PC and lowest RMSE_G under the bimodal, n = 71, J = 645, K = 11 condition — the cell that most closely matches the real data — will be designated as the primary dimensionality methods for the empirical analysis; (b) remaining methods that perform adequately in this cell are applied as convergent-evidence methods, and (c) methods that fail systematically in the bimodal small-sample cell are excluded from the primary analysis but reported as robustness checks. This simulation-to-analysis pipeline ensures that the dimensionality conclusions reached in §3.3.1 and §3.3.2 are grounded not only in the specific results obtained from the LLM data but in principled prior evidence about each method’s operating characteristics under the relevant data conditions — a level of methodological rigor rarely applied in either the LLM benchmarking or AI psychometrics literature.
Key References (Simulation): Horn (1965); Velicer (1976); Cattell (1966); Mokken (1971); Stout (1987); Kim (1994); Zhang & Stout (1999); Garrido, Abad, & Ponsoda (2013); Lorenzo-Seva, Timmerman, & Kiers (2011); Hirose & Yamamoto (2015); Wilson, Fudolig, & Fabian (2019); van der Ark (2012); Robitzsch (2023).
Building directly on the simulation study in §3.3.1 — which identifies which dimensionality-assessment methods remain reliable under LLM-specific conditions (N = 71, binary items, bimodal latent distribution) — this section demonstrates how those validated methods are applied to the actual 71-LLM × 645-item USMLE response matrix. The goal here is not to confirm or test any specific structural hypothesis (such as bifactor structure), but to illustrate the analysis workflow and report what the convergent diagnostics indicate about the likely dimensionality of this dataset. Structural interpretation — including any decision about the final measurement model — is deferred to subsequent empirical work once the simulation-based method recommendations are fully established.
The workflow follows three steps. First, the methods validated in §3.3.1 as appropriate for N ≈ 71 and binary data are selected and applied to the real response matrix. Second, their dimensionality estimates are aggregated into a convergence summary: if multiple independent methods point toward the same number of factors (or the same broad structural signal), that constitutes convergent evidence; disagreement across methods flags the need for caution in interpretation. Third, any structural signals revealed by the diagnostics are noted briefly as starting hypotheses for future modeling.
Methods carried forward from simulation validation (§3.3.1): Parallel Analysis (polychoric), MAP criterion, Hull method, DIMTEST, DETECT, and Mokken-MSA are the methods the simulation study identified as performing adequately at N = 71 under binary/bimodal conditions. Purely ML-based methods (VAE-dim, SC-tetra) that required N ≥ 150 for stable performance in the simulation are excluded from the primary illustration but included as exploratory supplements.
library(mirt)
library(psych)
library(sirt)
library(mokken)
library(EFA.dimensions)
library(e1071)
# item_matrix: 71 LLMs × 645 items binary response matrix
# topic_vec: integer vector 1..11 giving each item's USMLE topic assignment
# ── Pre-check: distributional profile ──────────────────────────────────────────
theta_1d <- fscores(mirt(item_matrix, 1, itemtype = "2PL", verbose = FALSE),
method = "EAP")[,1]
cat("LLM theta: skewness =", round(skewness(theta_1d), 2),
" kurtosis =", round(kurtosis(theta_1d), 2), "
")
# Flag bimodality (|skewness| > 1 or kurtosis < -0.5) for method selection
# ── Step 1: Apply simulation-validated methods ──────────────────────────────────
# (a) Parallel Analysis (polychoric correlation base)
pa_poly <- fa.parallel(item_matrix, fm = "ml", fa = "fa",
cor = "poly", n.iter = 100, plot = FALSE)
cat("PA-poly suggests:", pa_poly$nfact, "factor(s)
")
# (b) MAP criterion (Velicer)
map_res <- MAP(item_matrix)
cat("MAP suggests:", map_res$nfact, "factor(s)
")
# (c) Hull method
hull_res <- EFA.dimensions::HULL(item_matrix, cor_method = "poly")
cat("Hull suggests:", hull_res$nfact, "factor(s)
")
# (d) DIMTEST (essential unidimensionality test)
dimtest_res <- sirt::dimtest(dat = item_matrix, score.type = "WLE")
cat("DIMTEST T =", round(dimtest_res$T.stat, 3),
" p =", round(dimtest_res$p.value, 4), "
")
# p < .05 → reject essential unidimensionality
# (e) DETECT (cluster structure)
detect_res <- sirt::detect.index(dat = item_matrix, ndim = 11)
cat("DETECT index =", round(detect_res$DETECT, 3), "
")
# > 1.0 = strong multidimensionality; compare cluster assignments to topic_vec
# (f) Mokken-MSA (scalability)
mokken_res <- mokken::aisp(item_matrix, lower = 0.30)
cat("Mokken scales identified:", length(unique(mokken_res)), "
")
# ── Step 2: Convergence summary table ─────────────────────────────────────────
results_summary <- data.frame(
Method = c("PA-poly", "MAP", "Hull", "DIMTEST", "DETECT", "Mokken-MSA"),
Signal = c(
paste(pa_poly$nfact, "factor(s)"),
paste(map_res$nfact, "factor(s)"),
paste(hull_res$nfact, "factor(s)"),
ifelse(dimtest_res$p.value < .05, "Multidim (p < .05)", "Unidim (p >= .05)"),
ifelse(detect_res$DETECT > 1.0, "Strong multidim", "Weak/moderate"),
paste(length(unique(mokken_res)), "scale(s)")
)
)
print(results_summary)
The six methods provide a convergent picture of the data’s dimensional structure without presupposing any particular model. If a majority of methods indicate a dominant single dimension with residual within-topic clustering (e.g., DIMTEST rejects unidimensionality marginally, DETECT shows moderate multidimensionality, Mokken scales roughly follow topic groupings), this is consistent with a general-factor-dominated structure — but the specific form that structure takes (bifactor, higher-order, or simple unidimensional with correlated error) remains an open question to be resolved by model-comparative fitting in future work.
The key output of this empirical illustration is a convergence table summarizing what each validated method recommends, providing a transparent empirical basis for the choice of structural model in subsequent analyses. This is precisely the evidence-first workflow advocated by Li et al. (2025) and operationalized by the simulation study in §3.3.1: simulation tells us which methods to trust; those trusted methods then speak to the data; structural modeling follows evidence rather than assumption.
In educational measurement: Convergent validity (correlation with measures of similar constructs), discriminant validity (low correlation with measures of different constructs), and criterion-related validity (prediction of external outcomes).
Adaptation for LLMs:
- Convergent: Do benchmarks claiming to measure the same construct (e.g., “mathematical reasoning”) produce highly correlated scores?
- Discriminant: Do benchmarks claiming to measure different constructs (e.g., “math” vs. “creative writing”) produce lower correlations?
- Criterion-related: Do benchmark scores predict real-world deployment performance? (This is the extrapolation inference in Kane’s framework and is arguably the most important yet least available type of validity evidence.)
- Known-groups: Do models known to differ in a specific capability (e.g., a model fine-tuned for code vs. a general-purpose model) show expected performance differences?
- Multitrait-multimethod analysis: A formal multitrait-multimethod (MTMM) analysis (Campbell & Fiske, 1959) would require: (a) evidence of convergent validity — correlations between benchmarks purporting to measure the same construct across different formats should be substantially higher than correlations between benchmarks measuring different constructs in the same format; and (b) evidence of discriminant validity — correlations between benchmarks measuring different constructs should be lower than convergent validity correlations, regardless of shared method variance. Campbell and Fiske’s (1959) criteria provide an operational standard against which LLM benchmark correlational patterns can be evaluated.
Unique LLM considerations:
- Convergent and discriminant evidence is actually feasible to collect because many benchmarks exist and models can be tested on all of them. The Multi-Trait Multi-Method (MTMM) framework could be applied with traits = constructs and methods = benchmarks/formats (a computational sketch follows below).
- Criterion-related evidence is scarce because “real-world performance” is hard to define and measure for general-purpose LLMs.
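The Campbell and Fiske comparisons described above can be computed directly from a models \(\times\) benchmarks score matrix once each benchmark is tagged with its claimed construct (trait) and its format (method). A minimal sketch, assuming a hypothetical bench_scores matrix and analyst-supplied label vectors construct_of and format_of:
# bench_scores: models x benchmarks score matrix (hypothetical object)
# construct_of, format_of: character labels for each benchmark column (hypothetical)
R <- cor(bench_scores, use = "pairwise.complete.obs")
same_trait  <- outer(construct_of, construct_of, "==")
same_method <- outer(format_of, format_of, "==")
off_diag    <- !diag(ncol(R))
# Campbell & Fiske (1959) comparisons
convergent   <- mean(R[same_trait & !same_method & off_diag])   # monotrait-heteromethod
discriminant <- mean(R[!same_trait & same_method & off_diag])   # heterotrait-monomethod
hetero_both  <- mean(R[!same_trait & !same_method & off_diag])  # heterotrait-heteromethod
cat("Convergent (same construct, different format):", round(convergent, 2), "\n",
    "Heterotrait-monomethod (same format):", round(discriminant, 2), "\n",
    "Heterotrait-heteromethod:", round(hetero_both, 2), "\n")
# Desired pattern: convergent > discriminant and convergent > hetero_both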
The authenticity criterion—requiring that assessment tasks reflect the knowledge structures and performance conditions of real-world competence domains (Brown, Collins, & Duguid, 1989; Swiecki et al., 2022)—is particularly salient for LLM benchmarking. Clinical benchmarks such as USMLE-style multiple-choice items test isolated propositional recall rather than the integrated reasoning and communication skills required in actual clinical encounters. Aligning benchmark task models with the situated, collaborative, and iterative nature of professional practice represents a major open challenge that the present dissertation’s validity framework (Kane, 2006; Messick, 1995) is positioned to address.
Critical gap: Despite the feasibility, systematic convergent and discriminant validity studies across benchmarks are rare. When multiple benchmark results are reported, they are typically treated as independent evidence of different abilities rather than analyzed for their pattern of correlations.
In educational measurement: Evaluation of whether the social consequences of test use are consistent with the intended purpose and do not produce unintended negative outcomes.
Adaptation for LLMs:
- Do benchmark-driven rankings lead to actual improvements in model quality, or do they incentivize benchmark-specific optimization (teaching to the test and overfitting)?
- Do benchmarks systematically disadvantage certain architectural approaches in construct-irrelevant ways?
- Are deployment decisions based on benchmark scores leading to appropriate or inappropriate real-world outcomes?
- Does the existence of benchmarks narrow the development focus to measurable capabilities at the expense of unmeasured but important capabilities?
This connects directly to Messick’s argument that consequential evidence is part of validity. If a benchmark incentivizes benchmark-specific optimization rather than genuine capability improvement, its consequential validity is poor regardless of other validity evidence.
The same label (“reasoning”) is used across benchmarks that test fundamentally different things. “Reasoning” in GSM8K means multi-step arithmetic word problems. “Reasoning” in ARC means science knowledge application. “Reasoning” in LogiQA means formal logical deduction. These may or may not represent the same underlying construct, but the shared label obscures this question.
Proposed remedy: Explicit construct definitions with behavioral indicators, domain boundaries, and exclusion criteria. A benchmark claiming to test “mathematical reasoning” should specify: what counts as mathematical (which branches, which operations), what counts as reasoning (vs. retrieval, vs. calculation), and what levels of complexity are included.
Current benchmarks define constructs entirely by the content of items, not by the processes expected to produce correct responses. This means that a model arriving at a correct answer through memorization and a model arriving through genuine multi-step reasoning receive the same score, even though they represent fundamentally different levels of the target construct.
Proposed remedy: Validity evidence must include response process analysis — at minimum, systematic manipulation of item features to test whether construct-relevant (not construct-irrelevant) features drive performance. Chain-of-thought analysis, adversarial perturbation, and format manipulation (RQ3) all provide relevant evidence.
“Coding ability” means different things to a recruiter (can this model replace a junior developer?), a developer (where should I focus fine-tuning?), and a researcher (what computational processes enable code generation?). Current benchmarks do not specify whose construct definition they are operationalizing.
Proposed remedy: The validity argument template should require explicit specification of the stakeholder and the intended interpretation. The same benchmark may have stronger validity for one stakeholder’s interpretation than another’s.
Building on Kane’s framework, adapted for LLM contexts:
Audience bridge — RQ3: For AI scientists: This RQ quantifies how much of a model’s score changes just because you phrased the question differently, used multiple-choice vs. open-ended format, or ran the model at different temperature settings — none of which should affect the underlying capability being measured. In ML terms, it’s a systematic variance decomposition of evaluation noise. For psychometricians: The G-Theory framework here is standard, but the facets (prompt wording, item format, temperature, occasion) are LLM-specific and have no direct human-testing analogs. — Method Effects, Model \(\times\) Format Interaction Variance, and Differential Method Functioning
Pillar II — The Labs (cont.): Item format is the laboratory confound. This section ensures that the measuring instrument captures blood sugar rather than caffeine intake — that format-induced variance is identified, quantified, and distinguished from genuine ability differences before scores are reported or compared.
The format in which a question is presented and a response is collected can substantially affect measured performance. In human testing, format effects are well-documented: multiple-choice items measure different processes than open-ended items, even when targeting the same content. For LLMs, format effects are likely even larger and more consequential because the computational processes involved in selecting among options (MCQ) are fundamentally different from generating a response from scratch (open-ended).
Current benchmarking practice is inconsistent about format. The same construct may be measured via multiple-choice (MMLU), fill-in-the-blank (GSM8K with numerical answers), open-ended generation (HumanEval with code output), or dialogue-based evaluation (MT-Bench). If format substantially affects scores and rankings, then format choice is not a neutral design decision — it is a source of construct-irrelevant variance that threatens validity.
Messick (1989) defined construct-irrelevant variance as systematic score variance attributable to factors outside the target construct. Format effects are a classic source of CIV: if a model scores higher on multiple-choice math than on open-ended math, the format difference introduces variance that does not reflect mathematical ability per se, but rather format-specific processing advantages or disadvantages.
The critical question is whether format-related variance is construct-irrelevant or construct-relevant. One could argue that the ability to generate a mathematical solution (open-ended) is a different and more demanding aspect of mathematical reasoning than the ability to recognize a correct solution (multiple-choice). If so, format differences reflect genuine construct differences, and a comprehensive assessment should sample multiple formats. Alternatively, if the construct is defined as “mathematical reasoning” regardless of response mode, then format variance is irrelevant and should be minimized.
This connects directly to RQ2 (construct definition) and RQ1 (intended use). The answer depends on what construct is being measured and for what purpose.
Campbell and Fiske’s (1959) Multi-Trait Multi-Method (MTMM) matrix provides an operational structure for separating trait variance — the reliable individual differences attributable to the target construct — from method variance — the systematic score differences attributable to the measurement approach rather than the construct. Within the MTMM matrix, convergent validity requires that correlations between different methods measuring the same trait (heteromethod, monotrait correlations) be statistically significant and practically meaningful. Discriminant validity requires that these convergent correlations exceed both: (a) correlations between different traits measured by the same method (monomethod, heterotrait correlations), and (b) correlations between different traits measured by different methods (heteromethod, heterotrait correlations). If format-related method variance is large — as hypothesized for LLM assessments — discriminant validity will be difficult to establish, and the trait-method separation will be poor.
An MTMM analysis of LLM benchmarks would examine:
- Convergent validity: Do different formats measuring the same construct produce similar model rankings?
- Discriminant validity: Do same-format assessments of different constructs produce different model rankings?
- Method effects: How much variance is attributable to format versus construct?
If method effects are large relative to trait effects, format choice is driving results more than actual capability differences, which is a serious validity threat.
Unlike human test-takers, who engage similar cognitive processes regardless of format (with some variation), LLMs may engage fundamentally different computational pathways depending on format:
Multiple-choice: The model evaluates a fixed set of options, which constrains the output space and may enable elimination strategies, probability-based selection, or comparison processes that are not relevant to the target construct. MCQ format also provides partial information about the correct answer (it must be one of the options), which can inflate performance.
Fill-in-the-blank: The model generates a constrained response (e.g., a number, a word) without seeing options. This is more demanding than MCQ but still constrains the output space.
Open-ended generation: The model generates a full response — a proof, an essay, a code block — from scratch. This is the most demanding format and requires planning, organization, and monitoring that MCQ does not. However, it also introduces scoring complexity (how do you score a multi-step mathematical argument?) that can itself introduce measurement error.
Dialogue/Interactive: The model responds within a conversational context, potentially with multiple turns. This introduces additional sources of variance (conversation history effects, prompt formatting) and is closest to actual deployment conditions but hardest to standardize.
True/False: The model classifies a declarative statement as correct or incorrect. T/F format has two psychometrically distinctive properties that set it apart from all other recognition formats. First, the random-guessing baseline is exactly 0.50 — twice as high as a four-option MCQ (0.25) — which means that even a model with zero construct-relevant ability is expected to answer half of T/F items correctly through chance alone. For LLMs, this ceiling is particularly problematic: systematic response biases (e.g., a tendency to affirm or deny), calibration artifacts from RLHF alignment, or partial training-data familiarity can all push performance well above 0.50 without reflecting genuine understanding. From an IRT perspective, T/F items are modeled with a lower asymptote (c) fixed at 0.50 in the 3PL model — or, for LLMs, estimated freely, since empirical guessing rates may diverge from the theoretical 0.50 depending on model-specific response tendencies (Zou et al., 2022). Second, T/F items are the most construct-narrow of all recognition formats: they test only binary plausibility judgment and do not require the model to compare, rank, or produce alternatives. This means T/F performance provides a compressed and potentially misleading signal about the full construct, particularly for high-reasoning domains where the ability to generate or evaluate among multiple options is the intended target. Together, these properties make T/F a high-risk format for benchmarking LLMs — likely to overestimate capability through guessing inflation and underrepresent construct coverage through recognition narrowing — but an analytically valuable inclusion in a G-study specifically because its extreme guessing structure provides an upper-bound contrast condition for estimating format-induced score variance.
The preceding sections established a four-facet extended G-study design — Format (F), Prompt Design (P), Occasion (O), and Temperature (T) — each representing a distinct source of construct-irrelevant variance that can inflate, deflate, or differentially distort LLM benchmark scores. This section consolidates the hypothesized effects of all four facets in terms of expected variance components and their implications for score interpretation and benchmark design recommendations. The hypotheses below are organized by facet and are directly linked to the staged G-study design described in §3.2.
Based on the distinct processing mechanisms and G-study design considerations described above, the following hypotheses are organized across all four measurement facets:
Hypothesis 1: MCQ performance systematically overestimates capability. Because MCQ constrains the output space and provides partial information, models will score higher on MCQ than on equivalent open-ended items. The magnitude of this inflation will be larger for LLMs than for humans because LLMs are particularly effective at probability-based selection among options.
Hypothesis 2: Model rankings will change across formats. Models that excel at elimination and comparison strategies will be disproportionately favored by MCQ formats, while models with stronger generation capabilities will be favored by open-ended formats. If this is true, a model’s “rank” is partly an artifact of format choice.
Hypothesis 3: Format effects will interact with construct domain. Format effects may be larger for some constructs than others. For domains where generation and recognition involve similar processes (e.g., factual recall), format effects may be small. For domains where generation requires substantially more than recognition (e.g., mathematical proof, code writing), format effects may be large.
Hypothesis 4: Format effects will vary across model families, constituting Differential Method Functioning. Models with different architectures, training procedures, or alignment procedures may show systematically different patterns of format sensitivity. In psychometric terms, this is formally described as Differential Method Functioning (DMF; Bolt & Newton, 2011) — the method-level generalization of Differential Item Functioning (DIF) in which the relationship between a latent trait and the observed response is moderated by the measurement method (format), with the direction and magnitude of moderation differing across examinee subgroups (here: model families). DMF is operationalized by testing whether the format \(\times\) model-family interaction in the G-study is significantly greater than zero after partialling out the format main effect and model main effect (\(\sigma^2\)(model \(\times\) format) > 0, conditional on group membership). When DMF is present, the format facet’s contribution to construct-irrelevant variance is non-uniform across model families: the same format advantage or penalty that applies to one architectural lineage does not apply uniformly to others, making cross-family score comparisons conducted without format standardization psychometrically indefensible (Ercikan & Roth, 2006).
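The interaction test named in Hypothesis 4 can be run as a likelihood-ratio comparison of mixed models with and without the model \(\times\) format variance component. A minimal sketch, assuming the same long-format llm_long data frame used in the Stage 1 G-study code below; for the family-level DMF contrast, model would be replaced by a hypothetical model_family grouping column:
library(lme4)
# Full model: includes the model x format component of interest
m_full <- lmer(score ~ 1 + (1 | model) + (1 | item) + (1 | format) +
                 (1 | model:item) + (1 | model:format) + (1 | item:format),
               data = llm_long, REML = TRUE)
# Reduced model: drops model x format
m_red <- lmer(score ~ 1 + (1 | model) + (1 | item) + (1 | format) +
                (1 | model:item) + (1 | item:format),
              data = llm_long, REML = TRUE)
# LRT for sigma^2(model x format) > 0; the p-value is conservative because the
# variance component is tested on its boundary (halving it is a common correction)
anova(m_red, m_full, refit = FALSE)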
Hypothesis 5: True/false format produces the largest guessing-induced score inflation and the strongest construct-narrowing effect relative to all other formats. Two distinct mechanisms are hypothesized to operate simultaneously for T/F items. On the inflation side: because the random-guessing baseline is 0.50, T/F scores will systematically exceed performance on matched MCQ and open-ended items for the same content after controlling for item difficulty — even for models with limited construct-relevant ability. The G-study will quantify this as a positive, large T/F-specific format main effect (\(\sigma^2\)(format), with T/F driving the between-format mean difference). On the construct-narrowing side: because T/F items require only binary plausibility judgment and suppress generation, comparison, and ranking processes, the T/F score will show the weakest convergent validity with open-ended performance among all five format types, and the model \(\times\) format interaction (\(\sigma^2\)(model \(\times\) format)) will be largest for the T/F vs. open-ended contrast. This interaction reflects the fact that models whose capabilities are primarily generative — producing coherent arguments, code, or derivations from scratch — are disproportionately penalized when reduced to a binary judgment, while models with strong retrieval-based pattern matching are disproportionately advantaged. As a corollary, T/F-dominant benchmark rankings will be less stable across content paraphrasing than MCQ rankings, because the guessing component adds a purely stochastic layer of response variability that paraphrasing cannot control (Hypothesis 5a). The D-study recommendation for T/F-containing benchmarks will consequently require the largest number of items (\(n_{T/F}\)) to achieve a target dependability level \(\phi \geq .80\), since guessing variance contributes to the residual that averaging over items must reduce.
Hypotheses 6–8: Prompt Design, Occasion, and Temperature Facets
Hypothesis 6: Prompt design will constitute a non-trivial source of construct-irrelevant variance, with frontier models showing greater prompt sensitivity than smaller models. Model performance will vary systematically across semantically equivalent prompt formulations — differing in zero-shot vs. few-shot presentation, chain-of-thought (CoT) instruction, system prompt framing, and surface-level item paraphrase — even when the underlying item content and correct answer are held constant. The Stage 2 G-study (\(\mathbf{M \times I \times P}\)) will reveal a non-negligible \(\sigma^2\)(M \(\times\) P): frontier, high-accuracy models (e.g., GPT-4, Claude 3, Gemini Ultra) are expected to show larger model–prompt interaction variance than smaller models, reflecting deeper overfitting to the surface-level structural features of canonical benchmark prompts during training (Cohen-Inger et al., 2025). This pattern implies that single-prompt, zero-shot administration — the current de facto standard — systematically overestimates dependability for frontier models and underestimates it for smaller models. Furthermore, a significant item–prompt interaction (\(\sigma^2\)(I \(\times\) P)) is predicted for reasoning-intensive items: CoT prompting and few-shot exemplars are expected to disproportionately benefit items requiring multi-step inference relative to factual recall items, producing heterogeneous prompt sensitivity across the item pool. The D-study implication is that the minimum number of distinct prompt variants per item required to achieve \(\phi \geq .80\) (\(n_P\)) will be larger for frontier models than for smaller models, motivating a model-stratified D-study design rather than a one-size-fits-all prompt specification.
Hypothesis 6a (Prompt × Format interaction): The magnitude of prompt sensitivity will differ across item formats. Open-ended and dialogue items — where prompt framing directly shapes the expected response structure — are predicted to show larger \(\sigma^2\)(I \(\times\) P) than MCQ items, for which response options partially constrain the output space and buffer against prompt-wording effects.
Level selection rationale (H6): The four prompt variants — zero-shot, few-shot, CoT, and surface paraphrase — span the two most consequential axes of prompt variation identified in the LLM evaluation literature: information density (zero-shot vs. few-shot) and reasoning scaffold (with vs. without CoT), with surface paraphrase added as a control to isolate wording effects independently of content or scaffolding changes. Other plausible variant types — role-play persona prompts, structured output format instructions — were excluded from the primary Stage 2 design to maintain interpretability of variance components and may be examined in sensitivity analyses.
Hypothesis 7: LLM benchmark scores will exhibit systematic temporal instability across testing occasions, with API-based proprietary models showing greater drift than fixed open-weight models. The Stage 3 G-study (\(\mathbf{M \times I \times O}\), three occasions approximately 90 days apart) will partition occasion-related variance into: (a) a secular main effect (\(\sigma^2\)(O)) reflecting ecosystem-wide score drift averaged across all models and items; (b) a model–occasion interaction (\(\sigma^2\)(M \(\times\) O)) capturing differential temporal instability; and (c) an item–occasion interaction (\(\sigma^2\)(I \(\times\) O)) as a contamination signal. Regarding (b): Proprietary API-based models (GPT-4, Gemini) are predicted to show substantially larger \(\sigma^2\)(M \(\times\) O) than fixed open-weight models (LLaMA, Mistral), because API models undergo silent provider-side updates, RLHF fine-tuning cycles, and deployment changes between testing occasions while open-weight model weights remain constant. This differential instability directly undermines cross-model score comparisons that combine API-based and open-weight models without occasion-matched data. Regarding (c): Items that show a consistent pattern of decreasing difficulty across occasions — operationalized as a negative item × occasion interaction where later occasions yield systematically higher average accuracy — are flagged as training-data contamination candidates, consistent with the benchmark overfitting mechanisms described by Cohen-Inger et al. (2025). For progress monitoring intended uses (RQ1), the practical consequence is severe: if \(\sigma^2\)(M \(\times\) O) is large, apparent score gains across occasions may primarily reflect API model drift or contamination rather than genuine capability development, rendering longitudinal score interpretation indefensible without occasion-specific equating or anchoring.
Level selection rationale (H7): Three occasions at 90-day intervals were chosen because 90 days approximates the modal public release and update cadence of frontier API model series — shorter intervals risk capturing only low-level fine-tuning noise rather than substantive capability changes, while longer intervals risk spanning full model-generation transitions (e.g., GPT-4 → GPT-4o) that introduce model-identity confounds rather than within-model occasion variance. Three is also the minimum number of occasions required to separately estimate a systematic linear drift trend and random occasion-specific instability; a two-occasion design conflates these two qualitatively distinct sources and cannot support the contamination-detection logic of the item × occasion interaction.
Hypothesis 8: Temperature will produce heterogeneous effects on score stability, with response variance increasing monotonically with temperature while mean accuracy remains relatively stable for well-calibrated models but degrades for poorly-calibrated models on precision-demanding items. The Stage 4 G-study (\(\mathbf{M \times I \times T}\), \(T \in \{0, 0.5, 1.0\}\), five replications per model-item-temperature combination) will estimate the following:
- Temperature main effect (\(\sigma^2\)(T)): Average benchmark accuracy across all models is expected to decline slightly at \(T = 1.0\) relative to \(T = 0\), because high-temperature sampling introduces token-level stochasticity that occasionally produces lower-probability (and more often incorrect) response sequences. However, for large, well-calibrated frontier models, this decline is predicted to be small, because the probability mass concentrated on the correct token is sufficiently high that sampling rarely selects an incorrect alternative.
- Model–temperature interaction (\(\sigma^2\)(M \(\times\) T)): Smaller or less well-calibrated models will show steeper accuracy degradation with increasing temperature than frontier models, reflecting narrower probability advantages for their correct-token predictions. This interaction implies that temperature recommendations are model-dependent: \(T > 0\) is less harmful for frontier models and more damaging for smaller models, and a uniform temperature specification across all models will introduce systematic construct-irrelevant variance into cross-model comparisons.
- Item–temperature interaction (\(\sigma^2\)(I \(\times\) T)): Items requiring long multi-step reasoning chains or exact calculation are predicted to show the largest temperature sensitivity, because high-temperature sampling introduces token-level perturbations early in the reasoning chain that cascade into substantially incorrect final answers — a compounding stochastic error process not present for single-step factual recall items.
- Within-temperature replication variance (\(\sigma^2_e\)): This is the irreducible stochastic floor of score instability at \(T > 0\) — pure measurement error from LLM sampling that cannot be reduced by increasing the number of items or prompt variants, only by increasing the number of replications per item (or by setting \(T = 0\)).
D-study recommendations will favor \(T = 0\) with single-pass evaluation for summative ranking benchmarks, where minimizing stochastic noise is paramount, and multi-pass averaging at moderate \(T\) for scientific investigation uses, where sampling the full output distribution is preferred over assessing only the modal response.
Level selection rationale (H8): \(T = 0\), \(T = 0.5\), and \(T = 1.0\) correspond to three meaningfully distinct evaluation regimes: the deterministic reference floor where stochastic error is entirely eliminated (\(T = 0\)); the most common default in published benchmark protocols and API wrappers (\(T = 0.5\), providing ecologically representative conditions); and the practical upper bound of the benchmark-relevant range (\(T = 1.0\), beyond which outputs become incoherent and assessment validity breaks down). A finer five-point grid was considered but rejected because the cost of five replications per model–item–temperature combination is prohibitive at Stage 4 scale, and the three-point design provides sufficient contrast to fit and interpret a linear temperature trend while keeping data collection feasible.
In the G-study design, format serves as a fixed facet with five levels: (1) multiple-choice (MCQ), (2) true/false (T/F), (3) fill-in-the-blank, (4) open-ended generation, and (5) dialogue/interactive. Including T/F as a distinct format level — rather than collapsing it with MCQ — is essential because its guessing structure (lower asymptote \(c = 0.50\)) makes it a qualitatively different measurement condition from four-option MCQ (\(c = 0.25\)). Treating T/F and MCQ as the same “recognition” level would mask the very inflation and construct-narrowing effects that Hypothesis 5 predicts and that the G-study is designed to detect.
The variance decomposition will directly estimate each of the components defined below.
Score decomposition. For the Stage 1 fully-crossed Model (M) \(\times\) Item (I) \(\times\) Format (F) G-study, the observed score \(X_{mif}\) for model \(m\) on item \(i\) in format \(f\) is decomposed as:
\[X_{mif} = \mu + \pi_m + \iota_i + \phi_f + (\pi\iota)_{mi} + (\pi\phi)_{mf} + (\iota\phi)_{if} + (\pi\iota\phi)_{mif,e}\]
where \(\mu\) is the grand mean; \(\pi_m\), \(\iota_i\), \(\phi_f\) are the main effects of model, item, and format respectively; \((\pi\iota)_{mi}\), \((\pi\phi)_{mf}\), \((\iota\phi)_{if}\) are two-way interactions; and \((\pi\iota\phi)_{mif,e}\) is the confounded three-way interaction and random error. Each term has an associated variance component — the expected squared magnitude of that source of variation across the universe:
\[\sigma^2(X_{mif}) = \sigma^2(\pi) + \sigma^2(\iota) + \sigma^2(\phi) + \sigma^2(\pi\iota) + \sigma^2(\pi\phi) + \sigma^2(\iota\phi) + \sigma^2(\pi\iota\phi,e)\]
The universe score for model \(m\) is \(\mu_m = \mu + \pi_m\) — the expected score over all items and formats in the universe. The goal of the G-study is to estimate each \(\sigma^2\) component so the D-study can determine how many items and formats are needed to achieve a dependable estimate of \(\mu_m\).
Generalizability coefficient \(E\rho^2\) (for relative decisions — rank-ordering models):
\[E\rho^2 = \frac{\sigma^2(\pi)}{\sigma^2(\pi) + \dfrac{\sigma^2(\pi\iota)}{n_I} + \dfrac{\sigma^2(\pi\phi)}{n_F} + \dfrac{\sigma^2(\pi\iota\phi,e)}{n_I\, n_F}}\]
Dependability coefficient \(\phi\) (for absolute decisions — comparing a model’s score to a fixed cut-point):
\[\phi = \frac{\sigma^2(\pi)}{\sigma^2(\pi) + \dfrac{\sigma^2(\iota)}{n_I} + \dfrac{\sigma^2(\phi)}{n_F} + \dfrac{\sigma^2(\pi\iota)}{n_I} + \dfrac{\sigma^2(\pi\phi)}{n_F} + \dfrac{\sigma^2(\iota\phi)}{n_I\, n_F} + \dfrac{\sigma^2(\pi\iota\phi,e)}{n_I\, n_F}}\]
Note that \(\phi \leq E\rho^2\) always, because absolute decisions carry additional error from item and format main effects (\(\sigma^2(\iota)\) and \(\sigma^2(\phi)\)) that cancel in relative comparisons. For benchmark ranking (the most common LLM evaluation use), \(E\rho^2\) is the relevant index; for certification cut-scores, \(\phi\) is required.
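A worked plug-in example with purely illustrative variance components, \(\sigma^2(\pi) = .040\), \(\sigma^2(\iota) = .020\), \(\sigma^2(\phi) = .010\), \(\sigma^2(\pi\iota) = .030\), \(\sigma^2(\pi\phi) = .015\), \(\sigma^2(\iota\phi) = .005\), and \(\sigma^2(\pi\iota\phi,e) = .080\), evaluated at \(n_I = 100\) and \(n_F = 5\), makes the ordering concrete:
\[E\rho^2 = \frac{.040}{.040 + .030/100 + .015/5 + .080/500} = \frac{.040}{.0435} \approx .92\]
\[\phi = \frac{.040}{.0435 + .020/100 + .010/5 + .005/500} = \frac{.040}{.0457} \approx .88\]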
R implementation. Variance components are estimated via the gtheory package (Mushquash & O’Connor, 2006), which uses restricted maximum likelihood (REML) via lme4 internally. The D-study varies \(n_I\) and \(n_F\) to trace \(\phi\) as a function of measurement effort:
library(gtheory) # install.packages("gtheory")
library(lme4)
library(ggplot2)
# ------------------------------------------------------------------
# Stage 1 G-study: fully crossed M x I x F design
# Data: llm_long -- long-format data frame with columns:
# model (factor: LLM identifier)
# item (factor: item identifier)
# format (factor: MCQ / TF / fill / open / dialogue)
# score (numeric: 0/1 binary or IRT theta)
# ------------------------------------------------------------------
# Note: with a single observation per model x item x format cell, the three-way
# interaction is confounded with residual error, so it is not specified as a
# separate random effect (it is absorbed into the residual term).
gstudy_mif <- gstudy(
  data = llm_long,
  formula = score ~ (1|model) + (1|item) + (1|format) +
    (1|model:item) + (1|model:format) + (1|item:format),
  colname.objects = "model"
)
# Inspect variance components
vc <- gstudy_mif$components
print(vc)
# ------------------------------------------------------------------
# D-study: vary n_I (items) with n_F fixed at 5 formats
# Target: phi >= .80
# ------------------------------------------------------------------
n_items_grid <- c(30, 50, 100, 200, 400, 645)
dstudy_mif <- dstudy(
gstudy_mif,
colname.objects = "model",
colname.scores = "score",
data = llm_long,
n = list(item = n_items_grid,
format = 5)
)
# Extract phi and Erho2 across n_I values
dstudy_df <- data.frame(
n_items = n_items_grid,
phi = dstudy_mif$phi,
Erho2 = dstudy_mif$generalizability
)
# ------------------------------------------------------------------
# Plot D-study curve
# ------------------------------------------------------------------
ggplot(dstudy_df, aes(x = n_items)) +
geom_line(aes(y = phi, colour = "phi (absolute)"), linewidth = 1.2) +
geom_line(aes(y = Erho2, colour = "E_rho2 (relative)"), linewidth = 1.2, linetype = "dashed") +
geom_point(aes(y = phi), size = 3, colour = "#2c7bb6") +
geom_point(aes(y = Erho2), size = 3, colour = "#d7191c") +
geom_hline(yintercept = .80, linetype = "dotted", colour = "grey40", linewidth = 0.8) +
annotate("text", x = max(n_items_grid) * 0.95, y = .81,
label = expression(phi == .80 ~ " target"), hjust = 1, size = 3.5) +
scale_colour_manual(values = c("phi (absolute)" = "#2c7bb6",
"E_rho2 (relative)" = "#d7191c")) +
scale_x_continuous(breaks = n_items_grid) +
scale_y_continuous(limits = c(0, 1), breaks = seq(0, 1, .10)) +
labs(title = "D-study: Items required to achieve target dependability",
subtitle = "Stage 1 M x I x F design | n_F fixed at 5 formats",
x = "Number of items (n_I)",
y = "Coefficient value",
colour = NULL) +
theme_bw(base_size = 12) +
theme(legend.position = "bottom")
The D-study curve identifies the minimum \(n_I\) at which \(\phi\) crosses the .80 threshold for the full five-format design, and separately for reduced designs (e.g., MCQ + open-ended only, \(n_F = 2\)). If \(\phi\) remains below .80 even at \(n_I = 645\), the format facet’s contribution to absolute error (via \(\sigma^2(\phi)/n_F\)) is too large to overcome with items alone, and the practical recommendation shifts to format standardization rather than item bank expansion.
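If the installed dstudy() routine does not support projecting over hypothetical facet sample sizes, the same curve can be computed by hand from the estimated Stage 1 variance components using the \(E\rho^2\) and \(\phi\) formulas above. A minimal sketch, assuming the seven components have been collected into a named vector vc_vec (the names are illustrative and must match however they are extracted from gstudy_mif$components):
# vc_vec: named variance components, e.g.
# c(model = ..., item = ..., format = ..., model_item = ...,
#   model_format = ..., item_format = ..., residual = ...)
project_dstudy <- function(vc_vec, n_I, n_F) {
  rel_err <- vc_vec["model_item"] / n_I + vc_vec["model_format"] / n_F +
             vc_vec["residual"] / (n_I * n_F)
  abs_err <- rel_err + vc_vec["item"] / n_I + vc_vec["format"] / n_F +
             vc_vec["item_format"] / (n_I * n_F)
  c(Erho2 = unname(vc_vec["model"] / (vc_vec["model"] + rel_err)),
    phi   = unname(vc_vec["model"] / (vc_vec["model"] + abs_err)))
}
# Trace both coefficients over the item grid with all five formats retained
sapply(c(30, 50, 100, 200, 400, 645), function(nI) project_dstudy(vc_vec, nI, n_F = 5))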
IRT scoring note for T/F items: Responses to T/F items require a separate IRT calibration from MCQ and open-ended items. A 3PL model with \(c\) estimated freely (rather than fixed at 0.25) is appropriate, since LLM guessing rates on T/F may deviate from the theoretical 0.50 due to model-specific response tendencies. If bifactor scoring is applied upstream (from RQ2 §3.3.2), T/F items should be assigned to their content-domain group factor \(S_k\) in the same way as MCQ items, and their lower asymptote should be accounted for in the factor score estimation to avoid upward bias in the T/F-derived G-scores.
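A minimal mirt sketch of the two calibration options described here, assuming tf_items is the (hypothetical) subset of the binary response matrix containing only the true/false items:
library(mirt)
# Option 1: lower asymptote fixed at the theoretical T/F guessing rate (c = .50)
fit_fixed <- mirt(tf_items, 1, itemtype = "2PL", guess = 0.50, verbose = FALSE)
# Option 2: lower asymptote estimated freely, allowing model-specific response
# tendencies to push empirical guessing away from .50
fit_free <- mirt(tf_items, 1, itemtype = "3PL", verbose = FALSE)
# Compare fit and inspect the estimated asymptotes; note that with only ~71
# responding models, freely estimated asymptotes will be unstable and may
# require priors or a Bayesian estimator
anova(fit_fixed, fit_free)
coef(fit_free, simplify = TRUE)$items[, "g"]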
If \(\sigma^2\)(model \(\times\) format) is large relative to \(\sigma^2\)(model), format — and T/F in particular — is a major source of construct-irrelevant variance and model rankings are format-dependent. This has direct implications for RQ2 (validity) and RQ4 (whether a given benchmark can serve its intended purpose).
A methodological challenge specific to LLM G-studies is the elicitation of a sufficient distribution of responses per item to estimate prompt-by-model interaction variance components. Li et al. (2025) proposed the Diffusion Method as a practical solution: by systematically varying prompt phrasings, presentation formats, and contextual cues across repeated administrations of the same item, the method generates a rich response distribution that supports variance component estimation without requiring multiple independent model versions. Critically, the Diffusion Method controls for semantic equivalence across prompt variants—ensuring that observed response variation reflects genuine stochastic and format-sensitivity facets rather than construct-irrelevant content differences. Incorporating this technique into the G-study design strengthens the interpretive foundation of the dependability coefficients derived from the D-study.
Cohen-Inger et al. (2025) provide an empirically grounded taxonomy of the mechanisms through which prompt wording inflates or deflates LLM benchmark scores. They distinguish three distinct overfitting processes: (1) memorization, whereby models verbatim reproduce training data that overlaps with test items; (2) training-set contamination, wherein test samples appear as near-duplicates within the training corpus; and (3) benchmark/prompt structure overfitting, wherein models learn to associate the surface-level format, syntax, or keyword patterns of benchmark prompts with correct responses, without developing genuine conceptual understanding. From a G-theory perspective, the first two processes constitute fixed threats to universe score interpretability — they bias scores upward regardless of prompt wording — whereas the third is the direct operand of the prompt-wording facet: by treating prompt phrasing as a random facet, the G-study design estimates the variance attributable to prompt \(\times\) model interactions, which Cohen-Inger et al.’s \(\Delta_\mu\) metric quantifies empirically. Specifically, C-BOD’s distortion parameter \(\mu\) controls the magnitude of textual perturbation applied to benchmark prompts while preserving semantic content and correct answers; the resulting performance drop \(\Delta_\mu\) constitutes an external, operationalizable anchor for interpreting G-study variance components. Large \(\Delta_\mu\) values correspond to high prompt \(\times\) model interaction variance in the G-study framework, indicating that the benchmark’s dependability coefficient (\(\phi\)) is substantially attenuated by prompt sensitivity — a direct empirical threat to the score interpretations the dissertation’s validity argument must address.
A counterintuitive finding from Cohen-Inger et al. (2025) with direct implications for D-study design is that larger and higher-accuracy LLMs tend to exhibit greater sensitivity to prompt perturbations, not less. Across 32 state-of-the-art LLMs spanning 1B to 27B parameters, the performance drop \(\Delta_\mu\) scales log-linearly with model size, and models achieving over 60% accuracy on standard prompts consistently show the largest \(\Delta_\mu\) values. The authors suggest that frontier models have implicitly overfit more deeply to the surface-level structural patterns of canonical benchmarks, accumulating greater dependence on prompt-specific cues during training. In G-theory terms, this finding implies that the prompt-wording facet explains more variance for high-ability LLMs than for low-ability models — a pattern formally describable as a model \(\times\) prompt interaction with heterogeneous variance across ability levels. For D-study planning, this has a critical practical implication: the number of distinct prompt variants (\(n_P\)) required to achieve a target dependability level (e.g., \(\phi \geq .80\)) is systematically higher for frontier models than for smaller models, because their greater prompt sensitivity inflates the interaction variance that must be averaged over. A fixed D-study design with a uniform number of prompt variants will therefore underestimate dependability for smaller models while overestimating it for frontier models, motivating a model-stratified D-study design in Stage 2 that specifies \(n_P\) as a function of empirically estimated prompt sensitivity.
Extended G-Study Design: Prompt Design, Occasion, and Temperature as Additional Facets
Beyond format, three further facets are of direct psychometric interest and are incorporated into the extended G-study design: prompt design (P), occasion (O), and temperature (T). The core design — Model (M) × Item (I) × Format (F) — is extended to M × I × F × P × O × T, though the full six-way crossing is not feasible in a single study. Instead, a staged design is proposed: Stage 1 estimates the core M × I × F variance components; Stages 2–4 each add one new facet in isolation with the remaining facets held fixed, allowing interpretable variance decomposition without requiring an exponentially large data collection. The three new facets and their variance components are as follows.
Facet 1 — Prompt Design (P, random)
Prompt design is treated as a random facet representing the universe of semantically equivalent formulations of each item: zero-shot vs. few-shot (1-, 3-, or 5-shot), with vs. without chain-of-thought (CoT) instruction, variation in system prompt framing (neutral vs. role-based, e.g., “You are a medical expert”), and surface-level paraphrase of item wording. Treating prompt design as random — rather than fixing a single canonical prompt — is essential because any single prompt choice introduces construct-irrelevant variance that inflates or deflates the universe score estimate in an unknowable direction. The Stage 2 G-study (M × I × P, with format fixed at MCQ) will estimate:
- \(\sigma^2\)(P): whether any single prompt style systematically inflates or deflates scores across all models and items;
- \(\sigma^2\)(M \(\times\) P): model-specific prompt sensitivity, predicted to be larger for frontier models (Hypothesis 6);
- \(\sigma^2\)(I \(\times\) P): item-specific prompt sensitivity, predicted to be larger for reasoning-intensive items than for factual recall items (Hypothesis 6).
The D-study for the prompt facet determines \(n_P\) — the minimum number of distinct prompt variants per item required to achieve \(\phi \geq .80\). Given the Cohen-Inger et al. finding that frontier models are more prompt-sensitive, a model-stratified D-study is anticipated: \(n_P\) for large frontier models will be substantially higher than for smaller models, and benchmark developers should report \(n_P\) alongside standard reliability estimates.
Rationale for prompt-design level selection. The four prompt-design levels — zero-shot, few-shot (1-, 3-, or 5-shot), chain-of-thought (CoT), and surface paraphrase — were selected because they represent the two primary dimensions of prompt variation most consistently linked to benchmark score inflation and deflation in the recent evaluation literature: information density (zero-shot provides no worked examples; few-shot provides \(k\) demonstrations of the correct reasoning format before the target item) and reasoning scaffolding (CoT explicitly instructs the model to reason step-by-step before producing a final answer, while no-CoT conditions request a direct response). Surface paraphrase — in which item wording is systematically varied while semantic content, correct answer, and difficulty are held constant — is included as a third dimension to isolate variance attributable to surface form alone, independently of any change in information or scaffolding. Together, these four levels generate a rich and interpretable variance component structure: \(\sigma^2\)(P) captures whether any single prompt style universally inflates or deflates scores; \(\sigma^2\)(M \(\times\) P) captures whether frontier models are disproportionately sensitive to these variations (the key prediction); and \(\sigma^2\)(I \(\times\) P) captures whether CoT and few-shot exemplars selectively benefit reasoning-intensive items over factual recall items. Other plausible variant types — role-play persona system prompts (e.g., “You are a board-certified physician”), structured output format instructions, and chain-of-thought with self-verification steps — were excluded from the primary Stage 2 design to maintain interpretability and tractability; they may be examined in sensitivity analyses if the Stage 2 results indicate that M × P variance is concentrated in the system-prompt framing contrast.
Facet 2 — Occasion (O, random)
Occasion is treated as a random facet representing repeated administrations of the same benchmark to the same set of LLMs at different points in time — operationalized as three sessions separated by approximately 90 days. Unlike human test-takers, LLMs are subject to silent API updates, model version deprecation, and background fine-tuning by providers between testing occasions, all of which introduce occasion-specific variance that is invisible in single-occasion designs. The Stage 3 G-study (M × I × O, with format and prompt fixed) will estimate:
- \(\sigma^2\)(O): secular score drift across occasions, averaged over all models and items;
- \(\sigma^2\)(M \(\times\) O): differential temporal instability, predicted to be larger for API-based proprietary models than for fixed open-weight models (Hypothesis 7);
- \(\sigma^2\)(I \(\times\) O): item-specific drift, interpreted as a training-data contamination signal when item difficulty decreases monotonically across occasions (Hypothesis 7).
From the perspective of intended use (RQ1), occasion variance is particularly consequential for progress monitoring use cases, which rely on the assumption that score changes across administrations reflect genuine capability growth rather than measurement drift. A large \(\sigma^2\)(M \(\times\) O) would directly undermine longitudinal score interpretation.
Rationale for occasion-level selection. Three sessions at approximately 90-day intervals were selected on the basis of two complementary considerations. First, 90 days approximates the modal public release and update cadence of the major frontier API model series — based on publicly available change logs and API deprecation announcements from OpenAI (GPT-4 Turbo, GPT-4o series), Google DeepMind (Gemini 1.5 Pro revisions), and Anthropic (Claude 3.x series) — making this interval the smallest gap likely to capture substantive provider-side model changes rather than low-level RLHF fine-tuning noise. Shorter intervals (e.g., 30 days) would risk measuring only RLHF alignment adjustments that produce negligible capability shifts; longer intervals (e.g., 6 months or more) risk spanning full model-generation transitions — from GPT-4 to GPT-4o, or Gemini 1.5 to Gemini 2.0 — that introduce a model-identity confound in which the measured entity at occasion \(t+1\) is not the same system as at occasion \(t\), rendering within-model occasion variance uninterpretable. Second, three occasions is the minimum required for the planned variance decomposition to separately identify a systematic linear trend across time (interpretable as secular score drift, contamination accumulation, or consistent capability growth) and random occasion-specific instability (interpretable as irreducible stochastic variability in model outputs across administration windows). A two-occasion design collapses these two qualitatively distinct sources into a single between-occasion contrast and cannot support the contamination-detection logic that the item × occasion interaction requires: to flag an item as a contamination candidate on the basis of monotonically decreasing difficulty over time, at least three observations of item difficulty are required.
Facet 3 — Temperature (T, fixed)
Temperature is the LLM sampling parameter controlling output randomness: at \(T = 0\), the model’s response is deterministic (always the most probable token sequence); at \(T > 0\), stochastic sampling introduces variability in responses across repeated administrations of the same prompt. Temperature is treated as a fixed facet with three ordered levels: \(T = 0\) (deterministic), \(T = 0.5\) (moderate stochasticity), and \(T = 1.0\) (high stochasticity). Fixed treatment is appropriate because temperature is a design parameter under researcher control, not a random draw from a universe of conditions, and specific temperature levels carry distinct substantive interpretations.
Rationale for temperature-level selection. The three levels were chosen to represent three meaningfully distinct and practically grounded evaluation regimes. \(T = 0\) serves as the deterministic reference floor: by eliminating stochastic sampling entirely, this level isolates the G-study variance components attributable to genuine model-ability differences, item difficulty, and their interactions, free from any within-temperature replication noise. It also reflects the most defensible protocol for high-stakes summative benchmarking, where reproducibility of individual model responses is prioritized. \(T = 0.5\) was selected because it is the most commonly specified temperature in published benchmark evaluation protocols and the default value in widely used LLM API wrappers and inference libraries (e.g., the OpenAI Python client, Hugging Face transformers pipeline defaults, and LangChain evaluation chains), making it the ecologically most representative condition for understanding how score stability operates in practice. Including \(T = 0.5\) in the design ensures that Stage 4 findings are directly applicable to the largest share of existing benchmark evaluations. \(T = 1.0\) was selected as the upper bound of the benchmark-relevant temperature range: at this level, output stochasticity is at its practical maximum for legitimate assessment contexts, providing the strongest test of model robustness to sampling variability. Values above \(T = 1.0\) are excluded because they produce sharply incoherent, repetitive, or off-task outputs for most models, at which point the assessment itself is invalidated regardless of the psychometric design. A finer grid (e.g., five levels at 0, 0.25, 0.5, 0.75, 1.0) was considered but rejected: with five replications required per model–item–temperature combination to estimate within-temperature variance, adding two intermediate levels would roughly double the Stage 4 data collection burden, while the additional resolution would provide only a marginally more precise estimate of the linear temperature trend — a trade-off that does not justify the cost at dissertation scale.
The Stage 4 G-study (M × I × T, with format and prompt fixed, each model queried five times per item per temperature level to estimate within-temperature replication variance) will estimate:
- \(\sigma^2\)(T): the temperature main effect on average accuracy (Hypothesis 8);
- \(\sigma^2\)(M \(\times\) T): model-specific temperature sensitivity, predicted to be steeper for smaller or less well-calibrated models;
- \(\sigma^2\)(I \(\times\) T): item-specific temperature sensitivity, predicted to be largest for long multi-step reasoning and exact-calculation items;
- \(\sigma^2_e\): within-temperature replication variance, the irreducible stochastic floor of score instability at \(T > 0\).
A key D-study implication concerns benchmark administration protocol: the D-study for the temperature facet will determine whether single-pass evaluation (\(n_{\text{rep}} = 1\)) at \(T = 0\) achieves acceptable \(\phi\), or whether multi-pass averaging at \(T > 0\) is required to stabilize scores. For summative ranking benchmarks (RQ1 selection use), the recommendation will likely be \(T = 0\) with single-pass evaluation to minimize stochastic variance. For scientific investigation use cases, \(T > 0\) with multi-pass averaging may be preferred to sample the model’s full output distribution and avoid conflating the mode of the distribution with the model’s typical capability.
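The single-pass versus multi-pass question reduces to how quickly averaging over replications shrinks the within-temperature error floor. A minimal sketch of that projection, assuming \(\sigma^2\)(model) and \(\sigma^2_e\) have already been estimated from the Stage 4 G-study (the plug-in values below are purely illustrative):
# Dependability of a mean over n_rep replications when the remaining error is
# the within-temperature replication variance (other facets fixed or averaged out)
phi_by_rep <- function(var_model, var_rep, n_rep) {
  var_model / (var_model + var_rep / n_rep)
}
# Illustrative values only: passes needed at T > 0 to clear phi >= .80
sapply(1:10, function(k) round(phi_by_rep(var_model = 0.04, var_rep = 0.02, n_rep = k), 3))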
Summary: Extended G-Study Design
| Facet | Type | Levels | Primary variance of interest | D-study output |
|---|---|---|---|---|
| Format (F) | Fixed | 5 (MCQ · T/F · fill-in · open · dialogue) | \(\sigma^2\)(M \(\times\) F) — rank-order change across formats | Format recommendation per intended use |
| Prompt Design (P) | Random | \(n_P\) variants per item | \(\sigma^2\)(M \(\times\) P) — model prompt-sensitivity | Minimum \(n_P\) for \(\phi \geq .80\), stratified by model size |
| Occasion (O) | Random | 3 time points (~90-day intervals) | \(\sigma^2\)(M \(\times\) O) — temporal drift | Whether single-occasion testing is defensible |
| Temperature (T) | Fixed | 3 levels (0, 0.5, 1.0) | \(\sigma^2_e\) (within-T replication) | Optimal temperature and \(n_{\text{rep}}\) for target \(\phi\) |
Based on the analysis above, format selection should be guided by: (1) the intended use and interpretation of scores (RQ1); (2) whether format variance is construct-relevant or construct-irrelevant for the construct as defined (RQ2); and (3) the empirical magnitude of \(\sigma^2\)(M \(\times\) F) and the format main effect estimated in the G-study; when the model \(\times\) format interaction is large, reporting format-specific scores or standardizing on a single format is preferable to averaging across formats.
For open-ended formats, scoring introduces a second layer of measurement that deserves explicit attention:
Human scoring: Requires rubric development, rater training, and inter-rater reliability estimation. Expensive and slow. Rater variance becomes an additional facet in the G-study.
Model-as-judge scoring: Using another LLM (or the same LLM) to score responses. This is increasingly common (e.g., MT-Bench uses GPT-4 as a judge) but introduces circularity, systematic bias, and an additional measurement error source that is rarely quantified. The use of a language model as a scoring judge introduces a specific form of method bias: the judge model may preferentially reward outputs that match its own generation style, training distribution, or alignment procedure — a phenomenon formally analogous to rater idiosyncratic error (Eckes, 2015) and rater leniency/severity bias as estimated within the many-facet Rasch model (MFRM; Linacre, 1989). In G-theory terms, when a single judge model is used, the judge’s systematic biases are fully confounded with the examinee (model) proficiency score — they contribute to \(\sigma^2\)(p) rather than to \(\sigma^2\)(r) — and cannot be separated without multiple independent judges. Using multiple judge models of different lineages as parallel raters and decomposing judge variance as a G-study facet (Person \(\times\) Item \(\times\) Judge design) provides the only principled psychometric approach to quantifying and controlling judge-model bias. The judge model’s performance should itself be evaluated through concordance with human expert ratings (criterion-related validity; AERA, APA, & NCME, 2014, Standard 1) and susceptibility to positional and verbosity biases (response-process validity evidence). Scoring the scorer is a prerequisite for treating model-as-judge scores as valid measurements.
Rubric-based automated scoring: Using regex, unit tests (for code), or other deterministic methods. More reliable but limited in the constructs that can be scored this way.
Recommendation: The scoring method should be explicitly treated as a measurement decision with its own reliability and validity evidence. For the G-study (RQ3), scorer and scoring method can be treated as an additional facet when multiple scoring approaches are used.
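A minimal sketch of the parallel-judges decomposition described above, treating judge models of different lineages as a crossed rater facet; it assumes a hypothetical long-format data frame judged_long with columns model (examinee), item, judge, and score:
library(lme4)
# Model (examinee) x Item x Judge G-study for model-as-judge scoring
judge_gstudy <- lmer(score ~ 1 + (1 | model) + (1 | item) + (1 | judge) +
                       (1 | model:item) + (1 | model:judge) + (1 | item:judge),
                     data = judged_long, REML = TRUE)
# sigma^2(judge) captures judge leniency/severity; sigma^2(model:judge) captures
# judge-model affinity bias; both are confounded with proficiency when only a
# single judge is used
as.data.frame(VarCorr(judge_gstudy))[, c("grp", "vcov")]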
Audience bridge — Pillar III: For AI scientists: This pillar asks whether you can compare scores across different benchmarks or model versions — the equivalent of asking whether accuracy on MMLU and accuracy on BIG-Bench are measuring the same thing at the same scale. It also addresses benchmark contamination (training data overlap) as a long-term measurement threat. For psychometricians: The equating and linking methods here are standard, but LLMs introduce novel challenges: model versioning (is GPT-4 the same “examinee” as GPT-4-turbo?), construct drift over time, and the absence of a stable reference population for norming.
Clinical Framing — The Patient History: With a structurally validated, error-quantified score in hand from Pillar II, this pillar addresses the longitudinal question: is the patient getting healthier, or did they just memorize the eye chart? AIG provides fresh parallel forms that prevent contamination-induced score inflation; domain specification ensures the blueprint remains coherent as the LLM field evolves; equating and linking determine whether improvement across benchmark versions reflects genuine capability growth or merely a different measuring instrument.
RQ4 — Building a Defensible Sampling Frame
Audience bridge — RQ4: For AI scientists: This RQ asks whether your benchmark item pool is a representative sample of the domain it claims to test — the equivalent of checking whether your test set is i.i.d. with respect to the target distribution. It also introduces an LLM family taxonomy for stratified model sampling. For psychometricians: The test blueprint methodology is standard, but AIG (automated item generation using LLMs themselves) is the novel tool — and RQ4 empirically validates whether LLM-generated items are psychometrically equivalent to expert-authored ones.
Pillar III — The Patient History: Domain specification is the intake form that defines what the examination covers; the test blueprint ensures that the form is representative rather than opportunistic. AIG is the mechanism for replacing outdated diagnostic instruments as the patient population — the frontier model cohort — outgrows them. Knowing which patients belong to the same diagnostic family is equally essential: without a principled grouping of LLMs, we cannot determine whether a benchmark result generalizes to the family or is idiosyncratic to a single artifact.
Every measurement claim rests on an assumption of representativeness. When we say “Model X scored 85% on math reasoning,” we implicitly claim that the items tested represent the domain of math reasoning and that Model X’s performance on these specific items generalizes to the broader domain. When we compare Model X to Model Y, we implicitly claim that the comparison is meaningful beyond this particular set of items and conditions.
LLM benchmarking faces a dual challenge: the sample of items may not represent the construct domain, and the selection of models may not represent any coherent grouping. Both problems undermine the interpretability and generalizability of benchmark results. However, whereas item sampling can be addressed through formal domain specification and test blueprints (§2), the model dimension requires a different solution — not random sampling from a population, but principled grouping of models into taxonomic families. Defining such families serves two purposes: (1) it identifies which models are exchangeable for scientific and practical purposes, and (2) it provides a stratified sampling frame for future benchmark studies that need a defensible model selection strategy.
Sound assessment begins with a clear specification of the content domain — the universe of possible items from which the benchmark sample is drawn. This specification typically includes content categories and subcategories, cognitive process levels (e.g., recall, application, analysis), difficulty levels, and format types.
The formal tool for operationalizing domain specifications in educational measurement is the Table of Specifications (ToS), also referred to as the test blueprint (Nitko & Brookhart, 2011). A ToS is a two-dimensional matrix cross-tabulating content categories with cognitive process levels (e.g., Bloom’s Revised Taxonomy; Anderson & Krathwohl, 2001; or Webb’s Depth of Knowledge; Webb, 1997), with specified item counts per cell. The ToS ensures that item sampling is deliberate and representative rather than opportunistic.
Critically, the ToS is the item-sampling analog of Cronbach et al.’s (1972) universe of admissible observations in Generalizability Theory. Where G-theory specifies the universe of conditions over which a score is intended to generalize (items, formats, occasions, raters), the ToS operationalizes the item-and-content facet of that universe by defining which content-by-process cells are in scope and in what proportions. A benchmark item pool that lacks a ToS is, in G-theory terms, sampling from an undefined universe — which means that the generalization inference (from observed benchmark scores to performance in the intended domain) has no principled basis, regardless of how high the generalizability coefficient may be. The development of ToS-equivalent domain specifications for LLM benchmarks is therefore a prerequisite for both defensible content validity claims (RQ2) and for evaluating whether two benchmarks are comparable (RQ5).
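As a concrete illustration, the sketch below builds a toy Table of Specifications as a content-by-process matrix; the content categories, process levels, and item counts are hypothetical, not the dissertation's actual blueprint.
## ── Toy Table of Specifications (hypothetical categories and counts) ─────────
tos <- matrix(
  c(10,  8,  2,   # Arithmetic: recall, application, analysis
    12, 15,  8,   # Algebra
     6, 10,  9),  # Word problems
  nrow = 3, byrow = TRUE,
  dimnames = list(
    content = c("Arithmetic", "Algebra", "Word problems"),
    process = c("Recall", "Application", "Analysis")
  )
)
addmargins(tos)            # planned item counts per cell, with row and column totals
round(prop.table(tos), 2)  # realized domain proportions, auditable against the specification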
Once a domain specification is established, the degree to which individual items conform to it can be evaluated using the Content Validity Ratio (CVR; Lawshe, 1975): \(\text{CVR} = \frac{n_e - N/2}{N/2}\), where \(n_e\) is the number of subject-matter experts who rate an item as “essential” to the construct, and \(N\) is the total number of experts. Items with CVR values below the critical value for a given panel size should be flagged for revision or removal. While the CVR was developed for human assessment contexts, the underlying logic applies directly to LLM benchmark item review: items should be evaluated by domain experts for their alignment with the stated construct definition before inclusion.
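A minimal sketch of the CVR computation, using hypothetical expert ratings (Lawshe's critical-value table is not reproduced here):
## ── Content Validity Ratio for an N-expert panel (hypothetical ratings) ──────
cvr <- function(n_essential, n_experts) {
  (n_essential - n_experts / 2) / (n_experts / 2)
}
# Example: 10-expert panel; items rated "essential" by 9, 7, and 5 experts
ratings <- c(item_01 = 9, item_02 = 7, item_03 = 5)
round(cvr(ratings, n_experts = 10), 2)
# item_01 item_02 item_03
#    0.80    0.40    0.00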
Current LLM benchmarking practice: Domain specifications are rare and, when present, are typically informal. GSM8K describes its items as “grade school math problems” requiring 2–8 steps, but does not specify the distribution across mathematical topics, operation types, or complexity levels. MMLU inherits whatever domain specification existed in the original exams from which items were drawn, but aggregates them without a unified blueprint.
Consequence: Without domain specification, it is impossible to evaluate whether the item sample is representative. Two benchmarks both claiming to test “mathematics” may have entirely different content coverage, making their scores non-comparable (connecting to RQ5) and their construct interpretations potentially different (connecting to RQ2).
Even with a domain specification, the item sample must adequately cover the domain. Key considerations include proportional representation (are all major subdomains covered in proportion to their importance?), boundary coverage (are items at the edges of the domain included, not just the prototypical center?), difficulty distribution (does the item difficulty range match the ability range of the examinees — in this case, current LLMs?), and cognitive complexity distribution (are multiple levels of complexity represented?).
Unique LLM challenge — difficulty calibration: LLM capabilities evolve rapidly, causing benchmarks to become too easy (ceiling effects) or poorly calibrated to the current ability range. Items that were discriminating two years ago may now be trivially easy for frontier models, while items that are appropriately difficult today may be too easy in six months. This requires ongoing item analysis and potential benchmark updating — a practice that introduces its own comparability challenges (connecting to RQ5).
Unique LLM challenge — contamination: Items from public benchmarks may appear in LLM training data, fundamentally compromising representativeness. An item that has been memorized is no longer a sample from the construct domain; it is a sample from the model’s training set. Item contamination — the inclusion of benchmark items in LLM training data — constitutes a direct threat to content validity by converting a measure of construct-relevant capability into a measure of training data memorization. This distinction maps onto the difference between item-level construct validity (the item elicits the intended cognitive or computational process) and score-level construct validity (the total score reflects the target construct). Contamination compromises both: at the item level, correct responses may reflect verbatim recall rather than construct-relevant processing; at the score level, the aggregate score conflates genuine capability with data leakage. Detection methods include membership inference attacks (Shi et al., 2024), n-gram overlap analyses (Golchin & Surdeanu, 2024), and held-out item pools not released publicly prior to evaluation.
This phenomenon is psychometrically analogous to teaching to the test, wherein repeated exposure to specific item content inflates observed scores beyond the level warranted by the underlying construct (Popham, 2001, as cited in Swiecki et al., 2022). When LLM training corpora overlap with benchmark item banks, the resulting score inflation constitutes a form of construct-irrelevant variance that compromises score interpretation. Swiecki et al. (2022) further argue that distributed, process-embedded assessment—where evidence is gathered continuously from naturalistic task performance rather than from discrete item administration—may represent a structural solution to contamination, because such assessments resist memorization by design.
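As a toy illustration of the n-gram overlap screening mentioned above (a deliberately simplified stand-in, not the procedure of Golchin & Surdeanu, 2024), the sketch below flags benchmark items whose word 8-grams appear verbatim in a sample of training-corpus text; items and corpus_sample are hypothetical character vectors, and the 0.5 threshold is arbitrary.
## ── Toy n-gram overlap screen for contamination (hypothetical inputs) ────────
ngrams <- function(text, n = 8) {
  toks <- strsplit(tolower(text), "\\s+")[[1]]
  if (length(toks) < n) return(character(0))
  vapply(seq_len(length(toks) - n + 1),
         function(i) paste(toks[i:(i + n - 1)], collapse = " "),
         character(1))
}
corpus_ngrams <- unique(unlist(lapply(corpus_sample, ngrams)))
overlap_rate <- vapply(items, function(it) {
  g <- ngrams(it)
  if (length(g) == 0) return(0)
  mean(g %in% corpus_ngrams)           # share of the item's 8-grams found verbatim in the corpus sample
}, numeric(1))
flagged <- which(overlap_rate > 0.5)   # arbitrary screening threshold for manual review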
The classic sampling problem asks: how do we draw a representative sample from a known population? For LLMs, this framing breaks down — the “population” is finite, rapidly changing, and composed of non-independent engineered artifacts rather than natural-world observations. Treating model selection as a sampling problem imports assumptions (randomness, independence, stable population frame) that do not hold.
A more tractable framing is taxonomic: rather than sampling from an ill-defined population, we define equivalence classes — families of models that share enough structural and behavioral properties to be treated as members of the same group. This shift from sampling to grouping has three advantages. First, it is conceptually appropriate for an engineered artifact domain where lineage and design choices, not random draws, determine model properties. Second, it directly supports stratified selection: once families are defined, a researcher can deliberately include one or more representatives from each family rather than sampling blindly. Third, it enables a principled scientific question: do models within the same architectural family perform similarly on a psychometric benchmark, or do individual training decisions override family membership?
Two independent grouping dimensions are proposed: architecture-based (§3.2) and performance-based (§3.3). These dimensions need not coincide — a key empirical question for RQ4 is whether they do.
Architecture-based grouping assigns models to families based on publicly documented design lineage and technical characteristics. The following taxonomy is proposed for the 71-model USMLE response matrix used in this dissertation.
Primary architectural families (by base model lineage):
| Family | Representative Models | Architecture Signature |
|---|---|---|
| GPT / OpenAI | GPT-3.5, GPT-4, GPT-4o | Proprietary transformer, RLHF-aligned |
| LLaMA / Meta | LLaMA-2-7B, LLaMA-2-70B, LLaMA-3 | Open-weight autoregressive transformer |
| Gemini / Google | Gemini Pro, Gemini Ultra | Multimodal transformer, chain-of-thought training |
| Claude / Anthropic | Claude-2, Claude-3 Haiku/Sonnet/Opus | Constitutional AI, RLAIF alignment |
| Mistral / Mistral AI | Mistral-7B, Mixtral-8x7B | Sliding window attention, MoE variants |
| Fine-tuned / Domain-adapted | MedLLaMA, ClinicalBERT, BioMedGPT | Domain-adapted from base families above |
Secondary classification dimensions: scale tier (approximate parameter count: small, medium, large) and training or alignment paradigm (e.g., RLHF, RLAIF, supervised fine-tuning only).
This taxonomy generates a hierarchical grouping structure: models are nested within base families, which are nested within scale tiers. For the 71 models in the empirical dataset, the expected family sizes range from 2–3 (Gemini, Claude) to 15–20 (LLaMA fine-tuned variants), reflecting the unequal representation of families in the current frontier model ecosystem.
Performance-based grouping assigns models to clusters based on empirically observed benchmark behavior — specifically, the IRT ability estimates (θ) and item-level response profiles derived from the structural analysis in §3.3.2. This approach is agnostic to architecture: two models from different families may cluster together if their response patterns are similar, and two models from the same family may cluster separately if their fine-tuning diverged sufficiently.
Clustering inputs: For each of the 71 models, the following features are extracted from the IRT analysis: the general-factor ability estimate (\(\theta_G\)) from the bifactor model, the specific-factor ability estimates, and the item-level standardized-residual profile capturing each model's idiosyncratic response pattern.
Clustering procedure: The features are standardized and submitted to agglomerative hierarchical clustering (Ward linkage on Euclidean distances), with the number of clusters selected by average silhouette width over \(k = 3\) to \(8\); a k-means solution on the same features serves as a robustness check (see the code sketch below).
Expected output: A performance-based taxonomy partitioning 71 models into approximately 4–7 clusters representing distinct benchmark performance profiles (e.g., high-ability generalists, high-ability specialists, ceiling-effect clusters, bimodal responders).
The architecture-based and performance-based groupings yield two independent partitions of the same 71 models. The central empirical question is: to what degree do architectural family boundaries predict performance cluster membership?
This is answered by computing the Adjusted Rand Index (ARI; Hubert & Arabie, 1985) between the two partitions:
\[\text{ARI} = \frac{\text{Index} - \text{Expected Index}}{\text{Maximum Index} - \text{Expected Index}}\]
ARI = 1.0 indicates perfect agreement (architecture completely determines performance cluster); ARI = 0 indicates no better than chance agreement. Intermediate values reflect partial alignment.
Interpretation framework:
This analysis directly informs the sampling frame for future benchmark studies: if ARI is high, researchers can stratify by architecture family alone; if ARI is low, they must stratify by performance cluster or risk systematic bias in model selection.
Secondary analyses:
The following experiment applies the two-dimension taxonomy to the 71-model USMLE response matrix. The code sketch uses R for the IRT-based features, hierarchical clustering, and ARI computation, with a Python supplement for k-means clustering and mosaic-plot visualization.
## ── Step 1: Architecture-based taxonomy (manual labels) ─────────────────────
arch_taxonomy <- data.frame(
model_id = model_ids, # character vector of 71 model identifiers
base_family = c( # manually assigned from model documentation
rep("GPT/OpenAI", 8),
rep("LLaMA/Meta", 18),
rep("Gemini/Google", 4),
rep("Claude/Anthropic", 5),
rep("Mistral", 6),
rep("Other/FT", 30) # fine-tuned / smaller families
),
scale_tier = classify_scale(params_B), # "small" / "medium" / "large"
training_paradigm = classify_paradigm(model_ids) # "RLHF" / "RLAIF" / "SFT"
)
## ── Step 2: Extract IRT-based performance features ───────────────────────────
# Requires the mirt bifactor model fit (fit_bifactor) from §3.3.2
library(mirt)
theta_general <- fscores(fit_bifactor, method = "EAP")[, "F1"]
theta_specific <- fscores(fit_bifactor, method = "EAP")[, -1]
sk_profiles <- t(apply(response_matrix, 1, function(r) {
standardized_residuals(fit_bifactor, r)
}))
perf_features <- cbind(theta_general, theta_specific, sk_profiles)
perf_scaled <- scale(perf_features)
## ── Step 3: Performance-based clustering ─────────────────────────────────────
hc <- hclust(dist(perf_scaled, method = "euclidean"), method = "ward.D2")
k_opt <- find_optimal_k(perf_scaled, k_range = 3:8) # silhouette width criterion
perf_clusters <- cutree(hc, k = k_opt)
## ── Step 4: Adjusted Rand Index ───────────────────────────────────────────────
library(aricode)
arch_labels <- as.integer(factor(arch_taxonomy$base_family))
ari_result <- ARI(arch_labels, perf_clusters)
cat(sprintf("Adjusted Rand Index (architecture vs. performance): %.3f\n", ari_result))
## ── Step 5: Stratified sampling frame output ──────────────────────────────────
sampling_frame <- arch_taxonomy |>
dplyr::mutate(perf_cluster = perf_clusters) |>
dplyr::group_by(base_family, scale_tier, perf_cluster) |>
dplyr::summarise(n_models = dplyr::n(), .groups = "drop") |>
dplyr::arrange(base_family, scale_tier, perf_cluster)
print(sampling_frame)
## ── Python supplement: k-means, mosaic plot, ARI ────────────────────────────
import pandas as pd
import numpy as np
from sklearn.metrics import adjusted_rand_score, silhouette_score
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans, AgglomerativeClustering
import matplotlib.pyplot as plt
import statsmodels.graphics.mosaicplot as mp
# Load features exported from R
perf_features = pd.read_csv("perf_features_rq3.csv")
arch_labels = pd.read_csv("arch_labels_rq3.csv")["base_family"]
# Standardize
scaler = StandardScaler()
X = scaler.fit_transform(perf_features)
# Grid search over k
sil_scores = {}
for k in range(3, 9):
km = KMeans(n_clusters=k, random_state=42, n_init=20)
labels = km.fit_predict(X)
sil_scores[k] = silhouette_score(X, labels)
k_opt = max(sil_scores, key=sil_scores.get)
km_final = KMeans(n_clusters=k_opt, random_state=42, n_init=20)
perf_clusters = km_final.fit_predict(X)
# ARI
ari = adjusted_rand_score(arch_labels.astype("category").cat.codes, perf_clusters)
print(f"ARI (architecture vs. performance clusters): {ari:.3f}")
# Mosaic plot
ct = pd.crosstab(arch_labels, perf_clusters)
mp.mosaic(ct.stack(), title=f"Architecture Family x Performance Cluster (ARI = {ari:.2f})")
plt.tight_layout()
plt.savefig("rq3_mosaic.pdf", dpi=300)
RQ5 — Equating, IRT-Based Linking, and Concordance Under Conditions of Construct Heterogeneity and Contamination
Audience bridge — RQ5: For AI scientists: This RQ asks when you can defensibly compare scores across different benchmarks — e.g., is a model that scores 85% on MMLU better than one that scores 72% on BIG-Bench? Spoiler: usually not without additional linking work. For psychometricians: The equating and IRT-linking frameworks are standard, but LLM-specific complications — training data contamination, model versioning, construct drift — make the standard assumptions harder to satisfy and require explicit feasibility analysis.
Pillar III (cont.) — The Patient History: Score comparability is the charting standard. Without it, a reading on one instrument cannot be compared to a reading on another — and apparent improvement may be nothing more than a different thermometer. This section establishes the conditions under which cross-benchmark comparisons are defensible and provides the roadmap for achieving them.
The LLM assessment landscape is characterized by a proliferation of benchmarks, each producing its own metric (typically accuracy or a variant thereof) on its own scale. A model scoring 86.4% on MMLU and 92.0% on GSM8K has not demonstrated that it is better at math than general knowledge — it has demonstrated performance on two incommensurable scales. Yet the field routinely treats these scores as if they were comparable, producing radar charts, composite rankings, and cross-benchmark comparisons without any psychometric basis for doing so.
The problem is compounded by the rapid creation of new benchmarks to replace “saturated” ones (where frontier models approach ceiling performance), with no mechanism for linking old and new benchmark scores. Each new benchmark creates a fresh scale, invalidating historical comparisons and fragmenting the field’s ability to track genuine progress.
Score comparability is the capstone of this dissertation’s framework because it requires all preceding components: shared construct definitions (RQ2) to know whether comparison is meaningful, understood intended uses (RQ1) to know what level of comparability is needed, adequate sampling (RQ4) to ensure benchmarks cover comparable domains, and estimated measurement error with format effects decomposed as a G-study variance facet (RQ3) to know the precision of scores being compared.
The psychometric literature distinguishes several levels of score relationship, ordered by the strength of the conditions required (Holland & Dorans, 2006; Kolen & Brennan, 2014):
Equating is the strongest form. It produces interchangeable scores on two tests that measure the same construct at the same level of reliability with the same difficulty distribution. Requirements include: same construct, same reliability, equal difficulty, and a common linking design (common items or common persons). Equated scores can be used interchangeably.
Linking is weaker than equating. It relates scores on two tests that measure similar but not identical constructs or that differ in difficulty, reliability, or other properties. Linked scores can be compared but are not interchangeable. The relationship may be less stable across subpopulations.
Concordance is weaker still. It relates scores on two tests empirically without requiring construct equivalence. Concordance tables describe the observed relationship in a specific sample but may not generalize.
Prediction is the weakest relationship. One score is used to predict the other, typically through regression. The relationship is directional, lossy, and potentially unstable.
Three major designs exist for establishing score relationships:
Common-item design: Both tests include a subset of shared items (anchor items). The relationship between scores on the shared items provides the basis for equating the unique items. This is feasible for LLM benchmarking — common item sets could be maintained across benchmarks.
Common-person (common-examinee) design: The same examinees take both tests. For LLMs, this is naturally satisfied — the same models can be tested on all benchmarks. However, contamination concerns (a model may have been trained on one benchmark’s items but not the other’s) complicate the interpretation.
Single-group design with random equivalent groups: A combined item pool is randomly split into two halves, administered to the same or equivalent groups. This is less relevant for benchmark-to-benchmark comparison but relevant for within-benchmark equating across forms.
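Returning to the common-person design, a simple linear (mean-sigma) linking can be estimated directly from the models that took both benchmarks. The sketch below assumes hypothetical vectors score_A and score_B holding the same models' scores on Benchmark A and Benchmark B; it illustrates the mechanics only and does not by itself establish construct equivalence.
## ── Mean-sigma linear linking under a common-person design (sketch) ──────────
# score_A, score_B: scores for the same set of models on two benchmarks
slope     <- sd(score_B) / sd(score_A)
intercept <- mean(score_B) - slope * mean(score_A)
link_A_to_B <- function(x) intercept + slope * x
# A model scoring 0.72 on Benchmark A maps to this point on the Benchmark B scale:
link_A_to_B(0.72)
Equipercentile methods relax the linearity assumption and are preferable when ceiling effects make the score relationship nonlinear (see the obstacles discussed below).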
Item Response Theory provides a framework for placing items and examinees on a common scale, which can facilitate linking across benchmarks. Its application to the LLM context, however, rests on distributional assumptions that require scrutiny.
Standard IRT estimation—including the marginal maximum likelihood (MML) estimator of Bock and Aitkin (1981)—assumes that the latent trait (\(\theta\)) follows a normal distribution in the population of interest. This assumption is consequential: violations can bias item parameter estimates, distort ability score distributions, and undermine the accuracy of linking transformations. For the LLM population, there are at least three structural reasons to anticipate non-normality:
Bimodal clustering: The LLM landscape exhibits a pronounced capability gap between frontier models (GPT-4o, Claude 3.5 Sonnet, Gemini Ultra) and smaller open-source models (Llama-7B, Phi-2), producing a bimodal or multi-modal \(\theta\) distribution rather than the symmetric, unimodal distribution assumed by standard MML.
Ceiling compression: High-performing frontier models cluster near the upper boundary of most benchmark item difficulties, compressing the upper tail of the \(\theta\) distribution and producing negative skew.
Small discrete population: Unlike human testing programs with tens of thousands of examinees, the LLM calibration population may comprise fewer than 50–100 distinct models, making distributional assumptions both more consequential (small-\(N\) bias) and more difficult to verify empirically.
To address these concerns, several methodological approaches have been proposed in the psychometric literature. Mislevy (1984) demonstrated that the latent distribution can be estimated empirically as a discrete mass-point distribution rather than imposed parametrically, substantially relaxing the normality assumption. Woods and Thissen (2006) and Woods (2007) extended this work with penalized log-likelihood methods that allow smooth nonparametric estimation of the latent density, providing a flexible alternative to the normal assumption. Sass, Schmitt, and Walker (2008) conducted systematic simulation studies documenting the conditions under which non-normal latent distributions produce the greatest bias in 2PL and 3PL parameter recovery, finding that item discrimination estimates are most sensitive to distributional misspecification. At the nonparametric extreme, Ramsay’s (1991) TestGraf and Stout’s (1987) nonparametric IRT framework abandon parametric item response functions entirely, offering robustness to both distributional and functional-form misspecification.
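Where the mirt package is used for calibration, relaxing the normality assumption can be a one-argument change. The sketch below assumes a 0/1 response matrix resp (models in rows, items in columns) and mirt's empirical-histogram density option, which estimates the latent density from the data rather than fixing it at a normal form.
## ── 2PL calibration with an empirically estimated latent density (sketch) ────
library(mirt)
fit_norm <- mirt(resp, 1, itemtype = "2PL")                            # default: normal latent density
fit_eh   <- mirt(resp, 1, itemtype = "2PL", dentype = "empiricalhist") # empirical-histogram latent density
anova(fit_norm, fit_eh)                    # compare fit under the two latent-density assumptions
theta_eh <- fscores(fit_eh, method = "EAP")[, 1]
hist(theta_eh, breaks = 20, main = "Estimated latent ability (empirical-histogram prior)")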
The proposed Stage 3 simulation study will systematically evaluate IRT parameter recovery and linking accuracy across a range of \(\theta\) distribution shapes (normal, bimodal, negatively skewed) representative of the empirical LLM population, informed by the latent distribution estimates obtained in Stages 1–2. This design directly addresses the applicability of standard IRT-based linking to the structurally distinctive LLM measurement context.
Key References: Bock & Aitkin (1981); Mislevy (1984); Woods & Thissen (2006); Woods (2007); Sass, Schmitt, & Walker (2008); Ramsay (1991); Stout (1987); de Ayala (2009).
Several features of LLM assessment actually make linking more feasible than in many human testing contexts:
Common-person (common-examinee) design advantage: Unlike human testing, where the same individuals rarely participate in both the new-form and reference-form administrations required by a single-group equating design (von Davier, Holland, & Thayer, 2004), all available LLMs can be administered every benchmark under standardized conditions. This naturally satisfies the requirements of a complete data design — the design in which every examinee provides responses to all instruments — which yields the most statistically efficient equating and linking parameter estimates. Under the common-examinee design, equating functions are estimated with minimum sampling error, and model-ability estimates on different benchmarks share a common examinee-level reference frame. The primary threat to this advantage is contamination asymmetry (differential training data overlap across benchmarks), which violates the assumption that examinees bring the same latent trait distribution to all instruments — an assumption that must be empirically verified rather than assumed.
Reproducibility: LLMs (at fixed temperature and version) can be tested repeatedly on the same items, allowing estimation of within-model, within-item variability. This is not possible with human test-takers and provides additional data for linking.
Large item pools: Benchmarks often have hundreds or thousands of items, providing ample data for item-level analysis and IRT calibration.
Construct heterogeneity: The biggest obstacle. Benchmarks often target different constructs (or different aspects of a broadly defined construct), violating the fundamental requirement for equating. Linking or concordance is possible but weaker.
Contamination asymmetry: If Model X was trained on Benchmark A’s items but not Benchmark B’s items, scores on the two benchmarks reflect different things (contaminated knowledge vs. genuine ability), making any comparison misleading. Population invariance of the linking function — the requirement that the transformation relating scores on two instruments remains constant across subpopulations — is a necessary condition for general score comparability (Dorans & Holland, 2000). For LLM benchmarks, population invariance would require that the concordance between two benchmarks holds equally well for models of different architectures, parameter sizes, and training procedures. If the linking relationship differs substantially across model families, the concordance table is population-specific and cannot support general claims about score comparability.
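A minimal sketch of a population-invariance check under the common-person design, assuming a hypothetical data frame scores with columns score_A, score_B, and family: the linking function is estimated separately within each model family and compared with the overall function, and large family-specific departures signal that a single concordance table cannot support general comparability claims.
## ── Population-invariance check for a linear linking function (sketch) ───────
link_coefs <- function(d) {
  slope <- sd(d$score_B) / sd(d$score_A)
  c(slope = slope, intercept = mean(d$score_B) - slope * mean(d$score_A))
}
overall   <- link_coefs(scores)
by_family <- t(sapply(split(scores, scores$family), link_coefs))
print(by_family)
# Root mean squared difference between family-specific and overall linkings,
# evaluated over the observed Benchmark A score range (an RMSD-style index)
grid <- seq(min(scores$score_A), max(scores$score_A), length.out = 101)
rmsd <- apply(by_family, 1, function(cf) {
  sqrt(mean(((cf["intercept"] + cf["slope"] * grid) -
             (overall["intercept"] + overall["slope"] * grid))^2))
})
round(rmsd, 3)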
Format differences: As discussed in RQ3 (format as a construct-irrelevant variance facet), format affects what is being measured. Comparing MCQ-based Benchmark A to open-ended Benchmark B introduces method variance that confounds the comparison.
Temporal instability: Both the model population and the benchmarks change over time. A linking established in January may not hold in June if new models enter the population or benchmark items become contaminated.
Ceiling and floor effects: If frontier models are at ceiling on one benchmark but not another, the relationship between scores is nonlinear and linking methods based on linear transformations are inappropriate.
| Method | Feasibility for LLM Benchmarks | Requirements | Limitations |
|---|---|---|---|
| Full Equating | Low | Same construct, same reliability, same difficulty, common items | Rarely met across different benchmarks |
| Common-Item Linking | Moderate | Shared item subset, same or very similar constructs | Requires deliberate design; construct overlap must be verified |
| Common-Person Linking | High | Same models tested on both benchmarks | Naturally satisfied; but contamination asymmetry is a threat |
| IRT Concurrent Calibration | Moderate | Unidimensionality across linked items; large enough model sample for stable calibration | Model sample sizes may be too small; construct equivalence required |
| Concordance Tables | High | Same models on both benchmarks | Descriptive, not stable across model populations; weak generalizability |
| Predictive Regression | High | Same models on both benchmarks | Directional; lossy; may be unstable |
Note. Equating error (the sampling variability in equating function estimates) and linking error (the systematic error introduced by violations of equating assumptions, particularly construct non-equivalence) are conceptually distinct (Mislevy, 1992). Equating error decreases as sample size increases; linking error does not decrease with sample size and represents a fundamental limit on score comparability when construct equivalence is not established.
Given the feasibility analysis, this dissertation proposes a staged approach: Stage 1 calibrates the benchmark item pool with IRT and evaluates item efficiency; Stage 2 introduces rolling, multi-prompt administration designs; and Stage 3 evaluates IRT-based linking, adaptive benchmarking, and the simulation study described below.
Thompson (2007) articulated five core components of a functioning CAT system—(1) an item bank with calibrated IRT parameters, (2) an item selection algorithm, (3) a scoring algorithm for updating ability estimates, (4) a starting rule, and (5) a stopping rule—each of which maps onto a specific design challenge in the LLM context. For Stage 3 of the present dissertation, the item bank corresponds to the calibrated benchmark item pool (established in Stages 1–2), the selection algorithm must account for LLM-specific factors such as prompt sensitivity and format effects, and the stopping rule must balance measurement precision against the computational cost of additional LLM queries. Adaptive benchmarking thus represents both a psychometric ideal and a practically constrained engineering problem.
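To make the five components concrete, the sketch below implements a toy adaptive loop for a calibrated 2PL bank: maximum-information item selection, EAP ability updating over a quadrature grid, a fixed starting value, and a standard-error stopping rule. The parameter vectors a and b and the function administer(), which would query the LLM and return a 0/1 score, are hypothetical placeholders.
## ── Toy adaptive testing loop for a calibrated 2PL bank (sketch) ─────────────
# a, b: discrimination and difficulty vectors for the calibrated item bank
# administer(i): hypothetical function that sends item i to the model and returns 0/1
info_2pl <- function(theta, a, b) {
  p <- plogis(a * (theta - b))
  a^2 * p * (1 - p)
}
grid  <- seq(-4, 4, length.out = 161)      # quadrature grid for EAP
prior <- dnorm(grid)                       # could be replaced by an empirically estimated prior
theta <- 0; se <- Inf                      # starting rule: theta = 0
asked <- integer(0); resp <- integer(0)
while (se > 0.30 && length(asked) < 40) {  # stopping rule: target SE or maximum item count
  avail <- setdiff(seq_along(a), asked)
  nxt   <- avail[which.max(info_2pl(theta, a[avail], b[avail]))]  # maximum-information selection
  asked <- c(asked, nxt)
  resp  <- c(resp, administer(nxt))
  # EAP update: posterior over the grid given all responses so far
  like <- rep(1, length(grid))
  for (k in seq_along(asked)) {
    p    <- plogis(a[asked[k]] * (grid - b[asked[k]]))
    like <- like * (if (resp[k] == 1) p else 1 - p)
  }
  post  <- like * prior; post <- post / sum(post)
  theta <- sum(grid * post)
  se    <- sqrt(sum((grid - theta)^2 * post))
}
cat(sprintf("EAP theta = %.2f (SE = %.2f) after %d items\n", theta, se, length(asked)))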
The G-study (RQ3) provides indirect but relevant evidence for comparability: if the format, prompt, and occasion facets account for little variance, scores are robust to administration conditions, and observed cross-benchmark differences are more plausibly attributable to the models than to the instruments.
Equating assumes that scores reflect ability, not item exposure. If contamination is asymmetric (models trained on different benchmark items), linking is fundamentally compromised. This may require contamination detection as a prerequisite for any linking attempt — you cannot compare scores if one score reflects memorization and the other reflects genuine capability.
When a model is updated (GPT-4 \(\rightarrow\) GPT-4-turbo \(\rightarrow\) GPT-4o), is it the “same” examinee measured at a later occasion, or a distinct examinee that shares architectural lineage with the earlier version? This question has a direct psychometric analog in longitudinal testing: in human assessment, the test-retest and parallel-form reliability paradigms both assume examinee identity across occasions, but capability drift (equivalent to maturation in human testing) violates the assumption that the latent trait is stable. In IRT, the longitudinal linking framework of Béguin and Glas (2001) addresses this by treating ability change as an estimable parameter in a longitudinal IRT model rather than as measurement error — a conceptual approach directly applicable to model versioning if anchor items are administered to both the old and new version concurrently. Without such an anchor-item re-administration protocol at each model update, longitudinal comparisons conflate genuine capability improvement, changed training data composition, and measurement artifact.
As LLMs become more capable, the constructs we want to measure may evolve. “Reasoning” meant something different when models struggled with basic syllogisms than it does now when models can solve complex mathematics. This raises questions about whether vertical scaling is even meaningful if the construct itself is changing.
RQ1 is not just the logical starting point; it recurs throughout. Every other RQ’s answer is conditional on intended use. Validity evidence requirements depend on the interpretation being made (RQ2). Sampling adequacy depends on the claims being supported (RQ4). Reliability standards and acceptable error thresholds depend on the stakes of the intended use, and item format is treated as an explicit measurement facet (RQ3). Comparability requirements and the warranted level of score linking depend on the decisions being made (RQ5).
This dependency structure has a formal analog in Kane’s (2013) layered validity framework, in which the strength of the overall validity argument is bounded by the weakest inferential link in the chain. For LLM benchmarking, the extrapolation inference — from benchmark universe score to real-world capability — is consistently the weakest link across all intended uses, and no amount of evidence for other inferences can compensate for its weakness.
Running through RQs 2, 3, and 5 is a persistent question: what are we actually measuring? This dissertation does not need to resolve the deep philosophical question of whether LLMs “truly” reason or understand, but it must provide a framework for making construct definitions explicit, evaluating evidence for and against them, and showing how different definitions lead to different measurement strategies.
LLM assessment inverts the usual psychometric situation. In human testing, we have many examinees and relatively few items. In LLM assessment, we have few models (tens) and potentially many items (thousands). This inversion affects factor analysis (insufficient subjects for traditional methods), G-Theory (model facet has few levels; item facet has many), IRT calibration (small sample for person parameter estimation), and generalizability claims (strong item-level generalizability, weak model-level generalizability).
Contamination is a pervasive threat that appears in every RQ: it compromises the defensibility of intended uses (RQ1), undermines construct validity (RQ2), corrupts reliability estimates and differentially affects item formats (RQ3), violates representativeness assumptions (RQ4), and destroys score comparability (RQ5). It deserves treatment as a cross-cutting concern, potentially warranting its own chapter section or appendix with contamination detection methods and mitigation strategies. Psychometricians have addressed analogous threats through item banking security protocols, alternate form construction, and statistical methods for detecting preknowledge (Maynes, 2014; van der Linden & Guo, 2008). These methods, developed for high-stakes human testing, provide a mature methodological toolkit that can be adapted for LLM benchmark security.
Both models and benchmarks change over time, creating challenges for every RQ. Construct definitions may evolve (RQ2), intended uses may shift (RQ1), representativeness degrades as capabilities advance (RQ4), reliability estimates and format-effect magnitudes may change as model capabilities develop (RQ3), and score comparability links may break as benchmarks are updated or saturated (RQ5). The framework must acknowledge this dynamism and provide strategies for maintaining measurement quality over time.
Prompt sensitivity—the finding that LLM response quality varies substantially across semantically equivalent prompt phrasings (Li et al., 2025; Pellert et al., 2023)—constitutes a psychometrically defined stochasticity facet in the G-theory framework. When prompt wording is treated as a random facet in the G-study design, variance attributable to prompt \(\times\) model interaction can be estimated and partitioned from universe score variance. High prompt-sensitivity variance signals low generalizability across the prompt universe, which in turn constrains the validity of inferences from any single-prompt administration. This reconceptualization moves the discussion of prompt sensitivity from an engineering inconvenience to a quantifiable threat to measurement validity, and motivates the multi-prompt administration designs proposed in Stage 2.
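Given estimated variance components from a Model × Item × Prompt G-study, the D-study projection for the prompt facet is simple arithmetic. The sketch below uses hypothetical variance components to compute the absolute-decision coefficient \(\phi\) as a function of the number of prompt variants and reports the smallest \(n_P\) reaching the .80 target referenced in the design table.
## ── D-study projection: minimum prompt variants for phi >= .80 (sketch) ──────
# Hypothetical variance components from a Model x Item x Prompt G-study
vc <- c(m = 0.40, i = 0.10, p = 0.02, mi = 0.12, mp = 0.08, ip = 0.01, res = 0.15)
phi <- function(n_i, n_p, v = vc) {
  abs_error <- v["i"] / n_i + v["p"] / n_p + v["mi"] / n_i +
               v["mp"] / n_p + v["ip"] / (n_i * n_p) + v["res"] / (n_i * n_p)
  unname(v["m"] / (v["m"] + abs_error))
}
n_p_grid  <- 1:10
phi_by_np <- sapply(n_p_grid, phi, n_i = 200)       # 200 items, varying number of prompt variants
data.frame(n_p = n_p_grid, phi = round(phi_by_np, 3))
min(n_p_grid[phi_by_np >= 0.80])                    # smallest n_P meeting the .80 target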
A pervasive cross-cutting assumption embedded in the IRT-based components of this dissertation is the normality of the latent \(\theta\) distribution. As detailed in the RQ5 section on IRT-Based Linking, this assumption is structurally implausible for LLM populations, which exhibit bimodal capability distributions, ceiling compression, and small discrete population sizes. The cross-cutting methodological response to this challenge is to adopt flexible latent distribution estimation as a default—employing Woods and Thissen’s (2006) penalized nonparametric approach or Mislevy’s (1984) discrete mass-point method—rather than imposing normality a priori. Simulation evidence (Sass et al., 2008) indicates that discriminating between “normality violations that matter” and “normality violations that are tolerable” requires systematic empirical investigation, which is the purpose of the Stage 3 simulation study. Importantly, this concern extends beyond IRT linking to the G-theory variance component estimation in RQ3, where the distributional properties of model-level ability estimates affect the precision of universe score estimation.
| New Component | Primary Connection | Secondary Connections |
|---|---|---|
| Medical Checkup metaphor | Overarching framework (Introduction, Ch. 1) | All RQs — provides intuitive framing |
| Information compression | RQ2 (construct structure), RQ4 (item sampling) | RQ5 (when tests are redundant, comparability is trivial) |
| Dimensionality check | RQ2 (internal structure validity) | RQ3 (dimensionality affects G-study design), RQ5 (linking within dimensions) |
| Item check | RQ4 (item sampling quality), RQ3 (item-level error) | RQ5 (format-specific item quality) |
| Test check | RQ2 (convergent/discriminant validity), RQ5 (comparability) | RQ4 (redundant tests waste resources) |
| Longitudinal tracking | RQ5 (temporal comparability), RQ1 (progress monitoring use) | RQ4 (temporal sampling), RQ3 (reliability over time) |
| ACL Stage 1 (item efficiency) | RQ4 (sampling), RQ3 (reliability-efficiency tradeoff) | RQ5 (format affects efficiency) |
| ACL Stage 2 (rolling samples) | RQ5 (comparability), RQ4 (sampling) | RQ3 (reliability of rolling assessment) |
| ACL Stage 3 (IRT-LP) | RQ5 (comparability), RQ3 (measurement precision) | RQ2 (IRT assumptions and construct meaning) |
| MIRT/bifactor scoring | RQ2 (dimensional structure), cross-cutting | RQ3 (bifactor G-score as object of measurement reduces \(\sigma^2\)(p\(\times\)i)), RQ5 (G-score provides purer construct anchor for linking) |
| Ability–Knowledge bifactor decomposition (w_h, ECV, PUC) | RQ2 §3.3.2 (internal structure validity: construct representation vs. knowledge retrieval) | RQ4 (S_k/G ratio as contamination-sensitivity index per topic), RQ3 (parallel G-study on bifactor G-score vs. total accuracy), RQ5 (cross-benchmark comparability anchored to G rather than topic-dependent accuracy) |
| Assumption violations | Cross-cutting (all RQs using IRT) | RQ3 (error estimation), RQ4 (model sampling) |
| Pool maintenance/expansion | RQ4 (sampling over time), RQ5 (maintaining comparability) | RQ3 (reliability of evolving pool) |
A critical assumption underlying the IRT-based analyses in Stages 2 and 3 is the normality of the latent ability (\(\theta\)) distribution in the LLM population. As the Stage 1 calibration data will reveal, this assumption is unlikely to hold: frontier models cluster at high \(\theta\) values, smaller models cluster at lower \(\theta\) values, and the resulting distribution is more accurately characterized as bimodal or negatively skewed than as approximately normal. The mitigation strategy—empirical latent distribution estimation using nonparametric methods (Mislevy, 1984; Woods & Thissen, 2006)—is built into the Stage 3 simulation design, which will evaluate parameter recovery and linking accuracy across the range of distributional shapes empirically observed in Stages 1–2.
| Venue | Content | Timeline |
|---|---|---|
| ACL / NeurIPS | Stages 1–3: Efficient LLM assessment via IRT (practical, CS-facing) | Can be submitted before dissertation defense |
| Applied Psychological Measurement | G-Theory variance decomposition + format facet analysis (RQ3) | Core empirical paper |
| Educational Measurement: Issues and Practice | The Medical Checkup framework + three checks (conceptual, practitioner-facing) | Framework paper |
| Journal of Educational and Behavioral Statistics | IRT assumption violations for LLM examinees (methodological) | Technical contribution |
| Psychometrika | Multidimensional IRT and bifactor models for LLM assessment (theoretical) | If dimensionality results are strong |
A dedicated simulation study examining IRT performance under LLM-realistic distributional conditions represents a natural and necessary methodological contribution of Stage 3. The simulation will cross three factors: (a) latent distribution shape (normal, bimodal symmetric, bimodal asymmetric, negatively skewed); (b) sample size (\(N\) = 30, 50, 75, 100 models); and (c) IRT model (1PL, 2PL, 3PL). Outcome metrics will include item parameter recovery (RMSE for \(a\), \(b\), \(c\)), ability estimation accuracy (correlation and RMSE for \(\theta\)), and linking transformation accuracy (mean signed/unsigned error). This design mirrors the simulation paradigm of Sass et al. (2008) while extending it to the specific structural features of LLM populations identified in Stages 1–2 (de Ayala, 2009). Results will provide direct, empirically grounded guidance on the appropriateness of standard IRT-based linking for LLM benchmarking applications.
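A minimal sketch of a single cell of this design (bimodal \(\theta\), \(N = 75\) models, 2PL), with hand-rolled response generation and the mirt package for recovery; the generating parameters and mixture weights are illustrative rather than the planned simulation values.
## ── One cell of the Stage 3 simulation: bimodal theta, N = 75, 2PL (sketch) ──
library(mirt)
set.seed(2025)
n_models <- 75; n_items <- 150
theta_true <- c(rnorm(45, -1.0, 0.5), rnorm(30, 1.5, 0.4))   # bimodal, LLM-like ability distribution
a_true <- rlnorm(n_items, 0, 0.3)                             # generating discriminations
b_true <- rnorm(n_items, 0, 1)                                # generating difficulties
p <- plogis(outer(theta_true, b_true, "-") * rep(a_true, each = n_models))  # 2PL response probabilities
resp <- matrix(rbinom(length(p), 1, p), n_models, n_items)
fit <- mirt(resp, 1, itemtype = "2PL", verbose = FALSE)
est <- coef(fit, IRTpars = TRUE, simplify = TRUE)$items       # estimated a, b per item
theta_hat <- fscores(fit, method = "EAP")[, 1]
rmse <- function(x, y) sqrt(mean((x - y)^2))
c(rmse_a = rmse(est[, "a"], a_true),
  rmse_b = rmse(est[, "b"], b_true),
  cor_theta = cor(theta_hat, theta_true))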
This table was originally positioned at the opening of the dissertation map. It has been moved here as a supplementary reference so that readers can consult it on-demand without it interrupting the main argument flow. Both psychometricians new to AI and AI scientists new to psychometrics are encouraged to review it before or alongside the main text.
This dissertation operates at the intersection of psychometrics and machine learning (ML) — two fields that have developed largely independent vocabularies for closely related concepts. Psychometrics, as the primary disciplinary home of this work, supplies the foundational measurement concepts; ML supplies the object of inquiry and a parallel set of terms that readers trained in that tradition will find familiar. The crosswalk below is organized with psychometric terms as the anchor column. Where terms map directly, the equivalence is stated; where the mapping is partial or imperfect, the distinction is noted — because those imperfections are often precisely where the intellectual contribution of this dissertation lies.
| Psychometrics Term | Machine Learning Equivalent | Shared Meaning and Notes |
|---|---|---|
| Examinee / Test-Taker | Model (e.g., GPT-4, Llama 3) | The entity whose performance is being measured. In psychometrics, the examinee is a human; in LLM assessment, the model is an artificial system. Many psychometric assumptions (learning, fatigue, motivation) do not transfer, while new assumptions specific to LLMs (stochasticity, version drift, and training-data contamination, the analog of item exposure) must be introduced. |
| Test / Assessment / Instrument | Benchmark | A standardized collection of items administered to measure a specified construct. In psychometrics, an assessment may serve selection, diagnostic, or certification purposes; the intended use must be made explicit (see RQ1). ML benchmarks carry an additional connotation of competitive ranking that psychometric instruments typically do not. |
| Observed Score | Benchmark Score / Accuracy | The raw numerical result obtained from administering the benchmark. Psychometrics understands the observed score as a fallible estimate of the examinee’s true standing on the construct, subject to measurement error — a fallibility rarely acknowledged in ML benchmarking practice. |
| Universe Score | — (no direct ML equivalent) | The expected score a model would obtain if measured across all possible items, prompts, and conditions within the defined universe of generalization — the G-Theory analog of a true score. This concept is central to RQ3 and has no established counterpart in standard ML benchmarking discourse. |
| Score Scale / Scoring Rubric | Performance Metric (accuracy, F1, BLEU, pass@k) | The operationalization of how performance is quantified. In psychometrics, the choice of scoring model (number-correct, partial credit, IRT theta) is a theoretically motivated decision tied to construct definition. In ML, metric choice is often driven by convention or convenience rather than construct alignment. |
| Item Pool / Test Form | Evaluation Dataset / Test Set | The collection of items used to elicit responses for scoring. A psychometric item pool is developed against an explicit domain specification (test blueprint); ML evaluation datasets are often assembled by convenience or scraped from existing sources without a formal blueprint. |
| Parallel Test Form / Alternate Form | Data Split / Held-Out Set | A separate, non-overlapping set of items intended to measure the same construct. Psychometrics requires demonstrated equivalence (through equating or IRT linking) before scores on alternate forms are treated as comparable; ML typically assumes equivalence without testing it. |
| Test Item / Stimulus | Prompt | The unit of input presented to the examinee (or model) to elicit a scorable response. In psychometrics, item design is governed by principles of item writing (clarity, unambiguity, construct alignment); in ML, prompt engineering is recognized as influential but rarely evaluated as a source of systematic measurement error. |
| Item Format Effect / Construct-Irrelevant Variance | Prompt Variation / Prompt Sensitivity | The phenomenon whereby superficial changes to item presentation alter measured performance without altering the underlying construct. Psychometrics terms this construct-irrelevant variance and treats it as a threat to validity; ML studies it under “prompt sensitivity” but rarely connects it to validity theory. |
| Stochasticity Facet (in G-Theory) | Temperature | The parameter governing randomness in model outputs. From a psychometric perspective, temperature-induced variability constitutes measurement error — specifically a within-cell replication facet in a Generalizability Theory design. Higher temperature increases this error component and reduces score reliability. |
| Test Administration | Inference / Forward Pass | The process of presenting items and collecting responses. Standardized administration conditions are a prerequisite for valid score interpretation; LLM “administration” is rarely standardized with respect to system prompt, context length, API version, or decoding parameters. |
| Item Exposure / Compromised Security | Training Data Contamination | The condition in which items from the evaluation set appeared in the model’s training data, such that performance reflects memorization rather than the target construct. Psychometrics addresses this through secure item banking and parallel forms; ML benchmarks are highly vulnerable and have few established mitigations. |
| Norm Table / Percentile Rank | Leaderboard / Model Ranking | A summary of relative standing across examinees (or models). A psychometric norm table is constructed from a defined reference population with known sampling characteristics; ML leaderboards aggregate scores without specifying the reference population, independence of observations, or sampling frame — making normative interpretation problematic. |
| Targeted Instruction / Remediation | Fine-Tuning | The process of adjusting model parameters on a specific task or domain, analogous to targeted instruction or test preparation. From a psychometric standpoint, fine-tuning on benchmark-adjacent content raises construct validity concerns: does improved performance reflect genuine construct mastery or surface-level task adaptation? |
| Uninstructed / Naïve Condition | Zero-Shot Evaluation | Evaluation in which no task-specific examples are provided prior to testing. Represents the baseline measurement condition. |
| Instructed Condition / Practice Items | Few-Shot Evaluation | Evaluation in which a small number of examples are provided before the test item. Analogous to providing worked examples or practice items before formal assessment. The number and format of examples constitute an administration condition that should be standardized and reported. |
| Test Item / Sub-test | Task | The specific problem or question posed within a benchmark. In psychometrics, a task or item is the atomic unit of measurement whose statistical properties (difficulty, discrimination, information) are analyzed via item analysis or IRT. |
| Examinee at a Later Testing Occasion | Model Version / Model Update | Successive model versions (e.g., GPT-4 \(\rightarrow\) GPT-4-turbo \(\rightarrow\) GPT-4o) can be conceptualized as the same examinee measured at different time points. Psychometrics addresses score comparability across occasions through equating and linking; ML typically treats each model version as a distinct entity without formalizing the longitudinal comparison. |
| Examinee Subgroup / Population Stratum | Architecture | A class of models sharing structural design characteristics (e.g., transformer-based, mixture-of-experts). Analogous to a demographic or ability subgroup in human testing — differences across architectures may produce differential item functioning (DIF), requiring separate analyses to ensure construct equivalence. |
The mappings above facilitate communication across fields but should not be over-interpreted. Several structural asymmetries between LLMs and human examinees resist simple analogy:
The examinee is not a person. Psychometric theory was developed for human examinees who possess stable latent traits, are subject to fatigue and motivation, and cannot be replicated. LLMs violate all three assumptions — their “ability” may vary stochastically across runs, they do not experience fatigue, and exact replications at fixed temperature are possible. These differences require modification of standard psychometric models rather than their uncritical application.
The “population” of LLMs is not a statistical population. Human norm samples are drawn from a defined population using probability sampling; the set of available LLMs is a convenience collection of engineered artifacts with known inter-dependencies (shared training data, shared architectures, direct lineage). Generalizability claims must therefore be qualified accordingly.
Constructs defined for humans may not transfer. Terms such as “reasoning,” “understanding,” and “knowledge” carry theoretical meaning in cognitive and educational psychology that may not apply to artificial systems. This dissertation takes a pragmatic stance: constructs are defined operationally with respect to a specified intended use and stakeholder interpretation, rather than resolved philosophically.