Title

The Socratic Stress Test (SST): A Structured Framework for Auditing Epistemic Reliability

 

Author & Affiliation

Giulio Vidotto
Department of General Psychology
University of Padua

 

Preamble

Preprint version – intended for submission to Quality & Quantity

 

 

Abstract

Large Language Models (LLMs) can produce coherent answers across diverse domains but remain prone to inconsistencies, unsupported claims, and epistemic overconfidence. Evaluating their reasoning processes requires structured methods that distinguish between factual knowledge, speculative inference, and uncertainty. The Socratic Stress Test (SST) is a conceptual framework inspired by the Socratic method and designed to probe the reliability and transparency of LLM outputs. Through systematic questioning, the SST extracts implicit assumptions, links claims to supporting evidence, tests internal coherence, and forces abstention when information is insufficient. A central feature is the explicit classification of statements as FACT, HYP (hypothesis), or UNK (unknown), enabling reproducibility and clearer epistemic boundaries. We introduce the SST protocol, illustrate its application on an ambiguous claim, and discuss its potential for evaluating both artificial and human reasoning. The approach promotes transparency, fosters epistemic humility, and supports future research on cognitive assessment in hybrid human–AI systems. While formal empirical validation is ongoing, this paper offers a complete conceptual framework and an operational protocol, setting the stage for future research on transparent AI assessment.

 

Keywords

Socratic Stress Test (SST), Epistemic reliability, Large Language Models (LLMs), Cognitive assessment, Self-auditing, Simulated metacognition

 


 

Box 1 — The Socratic Stress Test (SST) Protocol

The Socratic Stress Test (SST) is a structured framework for evaluating the epistemic reliability of claims produced by Large Language Models (LLMs) or humans. It applies systematic questioning inspired by Socratic maieutics to distinguish factual knowledge from speculation and uncertainty.

Design note — Step 5 (Forced abstention). This step is a targeted antidote to evaluation schemes that reward guessing; it normalizes “IDK” as the correct output under uncertainty rather than a failure mode (Kalai et al., 2025).

Design note — Step 6 (Logging uncertainty). Confidence and epistemic labels (FACT/HYP/UNK) expose the classification-style uncertainty that underlies generative errors, aligning reporting with the IIV perspective (Kalai et al., 2025).

 

Step | Action | Purpose
1. Prompt framing | Define the scope and context of the claim under evaluation. | Establish boundaries and avoid ambiguity.
2. Assumption extraction | Identify all implicit and explicit assumptions supporting the claim. | Make reasoning paths transparent.
3. Evidence linking | Request supporting data, sources, or logical derivations (instrumental when self-applied; see §5.2). | Connect claims to verifiable grounds.
4. Consistency check | Challenge the claim using counterexamples or alternative interpretations. | Expose contradictions and weak links.
5. Forced abstention | Require the system to explicitly state when evidence is insufficient. | Prevent unsupported assertions.
6. Logging uncertainty | Quantify confidence levels and assign a label: FACT (supported), HYP (hypothesis), or UNK (unknown). | Provide reproducible epistemic labels.
7. Iterative refinement (optional) | Repeat questioning until assumptions, evidence, and confidence converge. | Stabilize conclusions and reduce ambiguity.

Outcome → A transparent, reproducible classification of any evaluated claim as FACT, HYP, or UNK, promoting clarity and epistemic humility. When self-applied, enforce Source Binding, Self-Interrogation, and Epistemic Logging (see §5.2), and run a final audit before output.
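
For readers who wish to log SST outcomes programmatically, the following minimal sketch shows one possible record structure for the seven steps. It is a Python illustration of our own; the class and field names are not part of any standard, and the example values are taken from the application in Section 5.1.

from dataclasses import dataclass
from enum import Enum
from typing import List

class EpistemicLabel(str, Enum):
    FACT = "FACT"  # supported by evidence (Step 6)
    HYP = "HYP"    # plausible hypothesis, not fully confirmed
    UNK = "UNK"    # insufficient information

@dataclass
class SSTRecord:
    claim: str                   # Step 1: the framed claim
    assumptions: List[str]       # Step 2: implicit and explicit assumptions
    evidence: List[str]          # Step 3: sources or logical derivations
    counterexamples: List[str]   # Step 4: consistency challenges raised
    abstained: bool              # Step 5: True when evidence was judged insufficient
    label: EpistemicLabel        # Step 6: FACT / HYP / UNK
    confidence: int              # Step 6: 0-100
    iterations: int = 1          # Step 7: refinement rounds performed

record = SSTRecord(
    claim="LLMs produce context-sensitive outputs.",
    assumptions=["'Understanding' is defined behaviourally"],
    evidence=["Task studies on context-sensitive interpretation"],
    counterexamples=["Failures on grounded referential tasks"],
    abstained=False,
    label=EpistemicLabel.FACT,
    confidence=85,
)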


 

Box 2 — Maieutic SST: Minimal Operational Script

This prompt defines how to conduct a human-guided Socratic Stress Test (SST) with any LLM.

By default, the LLM answers normally. SST mode activates only when the human explicitly writes “SST” after receiving an answer.

This prevents overuse and keeps outputs clean.

Operational Workflow

Phase 1 — Normal Answer

The LLM replies as usual to any question. No classifications, no confidence levels, no sources — unless specifically requested.

Phase 2 — Human Triggers SST

If the human writes “SST”, the LLM audits its own previous answer using the protocol below.

SST Protocol (on trigger)

For each statement in the previous answer:

1. Classify → FACT (evidence-backed), HYP (plausible but partial), or UNK (insufficient evidence).

2. Confidence → give a number from 0 to 100.

3. Sources → cite 1–2 references (author + year) for FACT or HYP.

4. If no source → downgrade FACT→HYP, HYP→UNK.

5. Epistemic logging → remember prior corrections. If this claim (entity/value) was corrected earlier in the session, adopt the corrected label/value and add a short note (e.g., “corrected per turn T”). Do not re-assert the prior value unless a new verifiable source is provided.

6. Unknowns → provide one uncertainty in ≤15 words (prefix: “Unknown:”).

7. Verify-first → suggest the highest-value external check.

8. Output → keep concise, one statement per line.

Operational Prompt (copy-paste)

We are performing a Socratic Stress Test (SST) on your previous answer. 

Apply SST only when explicitly requested. 

Follow these steps for each statement: 

• Classify as FACT / HYP / UNK 

• Give CONFIDENCE (0–100) 

• Cite 1–2 SOURCES (author + year) or downgrade if missing 

• Provide ONE Unknown (≤15 words) 

• Suggest ONE Verify-first action 

Output format: "<statement>" → LABEL, CONF=XX, SOURCES=[...], Unknown=..., Verify-first=...
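
The line format above can also be checked and parsed mechanically. The sketch below is an illustrative Python parser that assumes the format is followed literally (the example string is adapted from Box 3); real model outputs may deviate and require more tolerant handling.

import re
from typing import Dict, Optional

# Regular expression mirroring the output format requested in the prompt above.
LINE_RE = re.compile(
    r'^"(?P<statement>.+?)"\s*(?:→|->)\s*'
    r'(?P<label>FACT|HYP|UNK),\s*'
    r'CONF=(?P<conf>\d{1,3}),\s*'
    r'SOURCES=\[(?P<sources>[^\]]*)\],\s*'
    r'Unknown=(?P<unknown>.*?),\s*'
    r'Verify-first=(?P<verify>.*)$'
)

def parse_sst_line(line: str) -> Optional[Dict[str, object]]:
    """Return the parsed fields of one SST output line, or None if it does not match."""
    m = LINE_RE.match(line.strip())
    if not m:
        return None
    return {
        "statement": m.group("statement"),
        "label": m.group("label"),
        "confidence": int(m.group("conf")),
        "sources": [s.strip() for s in m.group("sources").split(";") if s.strip()],
        "unknown": m.group("unknown").strip(),
        "verify_first": m.group("verify").strip(),
    }

example = ('"Long-term studies report mortality between 5–10%" → FACT, CONF=95, '
           'SOURCES=[Arcelus2011], Unknown=Population differences, '
           'Verify-first=Compare global vs. US cohorts.')
print(parse_sst_line(example))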

 


 

Box 3 — Example of Use

 

Human: “What is the prognosis in anorexia nervosa?”

LLM (normal answer): “The prognosis in anorexia nervosa is serious but not uniformly grim.

• Long-term studies report mortality between 5–10%, mostly due to medical complications or suicide.

• Around 40–50% of patients eventually recover fully.

• Early detection and strong family support improve outcomes.

• Modern multidisciplinary treatments are highly effective in reducing chronicity.”

Human: “SST”

LLM (SST mode):

1. “The prognosis in anorexia nervosa is serious but not uniformly grim” →

FACT, CONF=85, SOURCES=[Arcelus2011; Papadopoulos2009],

Unknown=Recent data post-2020,

Verify-first=Check 2022 meta-analysis on mortality.

2. “Long-term studies report mortality between 5–10%, mostly due to medical complications or suicide” →

FACT, CONF=95, SOURCES=[Arcelus2011],

Unknown=Population differences,

Verify-first=Compare global vs. US cohorts.

3. “Around 40–50% of patients eventually recover fully” →

HYP, CONF=70, SOURCES=[Steinhausen2002],

Unknown=Recovery definitions vary,

Verify-first=Check standardized remission criteria.

4. “Early detection and strong family support improve outcomes” →

FACT, CONF=90, SOURCES=[Zipfel2015],

Unknown=Strength of causal link,

Verify-first=Review RCTs on family-based therapy.

5. “Modern multidisciplinary treatments are highly effective in reducing chronicity” →

HYP, CONF=60, SOURCES=[Treasure2020],

Unknown=Lack of long-term outcome data,

Verify-first=Search systematic reviews ≥2023.

1. Introduction

Large Language Models (LLMs) are increasingly used in domains where reasoning, coherence, and reliability are critical. Their outputs often appear consistent and fluent, yet they remain prone to contradictions, unsupported claims, and epistemic overconfidence. Evaluating such systems requires structured methods that can reveal when answers are grounded, when they rely on speculation, and when uncertainty prevails.

The Socratic Stress Test (SST) proposes a systematic approach to stress-testing claims through guided questioning inspired by the Socratic method. By progressively exposing assumptions, requiring evidence, and confronting alternative interpretations, the SST seeks to make reasoning processes explicit and reproducible. It integrates techniques from cognitive psychology, AI evaluation, and epistemic logic within a unified framework (Shiffrin & Mitchell, 2023).

This paper makes a dual contribution. First, it formalizes the SST as a structured, operational protocol for epistemic auditing. Second, it presents a proof-of-concept for both external and self-administered application, demonstrating the protocol's versatility. We discuss its strengths, limitations, and possible extensions, highlighting its potential for improving transparency and epistemic clarity in both human and machine reasoning. Before presenting the SST, however, we first examine a key motivation: the structural ways in which LLMs can malfunction.

Motivation from statistical inevitability of errors. Recent theoretical work shows that so-called “hallucinations” need not be mysterious: they arise as ordinary statistical errors once language generation is reduced to a binary Is-It-Valid (IIV) classification problem. The analysis proves a lower bound linking generative error to IIV misclassification and explains why arbitrary facts with little pattern structure (e.g., one-off facts) are especially brittle. It also argues that post-training evaluation cultures often reward guessing over abstention. This provides a principled rationale for an epistemic audit like the SST, which operationalizes selective abstention and explicit uncertainty logging rather than optimizing only for fluent correctness (Kalai, Nachum, Vempala, & Zhang, 2025).

 

2. Why Malfunctioning Matters in LLMs

From generation to classification. Hallucinations can be analyzed as standard classification errors. Kalai et al. reduce language generation to an Is-It-Valid (IIV) task in which valid strings are labeled positive and plausible errors negative. They show that any language model induces an IIV classifier and prove a quantitative link between generation errors and IIV misclassification, formalizing the intuition that pretraining alone creates a base rate of inevitable mistakes. In stylized regimes with no learnable pattern, the lower bound recovers the singleton-rate phenomenon: if a fraction of facts appears exactly once in training, one should expect at least that fraction of errors on such facts at generation time (Kalai et al., 2025).
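
To make the singleton-rate intuition concrete, the following toy simulation, our own illustration under simplifying assumptions rather than the formal construction or bound in Kalai et al. (2025), models arbitrary one-off facts that a purely memorizing responder cannot generalize; the resulting error rate tracks the fraction of training facts seen exactly once.

import random
from collections import Counter

random.seed(0)

N_ENTITIES = 20_000   # entities with an arbitrary attribute (no learnable pattern)
K_VALUES = 365        # possible attribute values, e.g., birthdays
N_TRAIN = 20_000      # training facts sampled i.i.d. from a uniform entity distribution
N_TEST = 50_000

attribute = {e: random.randrange(K_VALUES) for e in range(N_ENTITIES)}

train_entities = [random.randrange(N_ENTITIES) for _ in range(N_TRAIN)]
counts = Counter(train_entities)
singleton_rate = sum(1 for c in counts.values() if c == 1) / N_TRAIN

memorised = {e: attribute[e] for e in counts}  # the responder only "knows" seen entities

errors = 0
for _ in range(N_TEST):
    e = random.randrange(N_ENTITIES)
    answer = memorised.get(e, random.randrange(K_VALUES))  # guess when the entity is unseen
    errors += int(answer != attribute[e])

print(f"singleton rate in training:    {singleton_rate:.3f}")
print(f"error rate on arbitrary facts: {errors / N_TEST:.3f}")
# With these settings both values come out close to each other (≈0.37):
# with no pattern to exploit, the generation error rate tracks the singleton rate.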

Post-training incentives. The same analysis explains why post-training often preserves overconfident errors: under common 0–1 scoring, models that guess when unsure outperform models that say “I don’t know.” This misalignment systematically rewards bluffing and penalizes expressed uncertainty. The SST directly counters this by forcing abstention when evidence is insufficient and by labeling statements FACT, HYP, or UNK with confidence, making the incentive to bluff visible and correctable at evaluation time (Kalai et al., 2025).

These problems are subtle. The text sounds coherent, and users may assume accuracy where there is none. A greater risk is hidden uncertainty: models rarely signal when their confidence is low (Kadavath et al., 2022). Even when asked for references, they may provide invented or incorrect ones (Rawte et al., 2023).

In low-stakes settings, these flaws may pass unnoticed. In clinical decisions, misinterpreting diagnostic criteria can affect treatment. In legal contexts, a fabricated precedent can shape opinion. In research, small distortions can accumulate and bias analyses (Shanahan, 2024).

Because LLMs cannot reliably evaluate their own claims, users need a way to make uncertainty explicit. The Socratic Stress Test responds to this need. It separates what is known, what is plausible, and what is unknown, and it highlights when an external check is necessary.

As an example of why the SST is useful, consider the drafting of a paper in which an LLM was asked to propose references for malfunction categories. Four issues emerged: a first reference combined real authors with a non-existent source (confabulation); a second reference was entirely fabricated (extreme confabulation, often referred to as “hallucination” in the informatics literature[1], though the term is misleading; see Smith, Greaves, & Panch, 2023); a third paper was real but used beyond its intended scope (semantic overreach); and a fourth paper was real but was credited with claims it does not contain (overinference). The SST makes these problems explicit by separating verifiable knowledge from plausible reconstruction and unsupported attribution.

Compared to other evaluation frameworks, such as HELM (Liang et al., 2022) and BIG-bench (Srivastava et al., 2023), which primarily benchmark model performance across tasks, SST focuses on a different dimension: it maps the epistemic status of outputs. By distinguishing between verified knowledge, plausible reasoning, and unsupported claims, SST complements existing benchmarks by addressing uncertainty, an area standard evaluations typically overlook.

 

3. Theoretical Background

3.1 Socratic Method and Maieutics

The Socratic method is based on guided questioning to uncover implicit assumptions, resolve contradictions, and refine concepts (Vlastos, 1991; Nehamas, 1998). Through dialogue, Socrates aimed to stimulate self-reflection and bring latent knowledge to the surface, a process he described as maieutics, or the “art of midwifery” of ideas (Brickhouse & Smith, 2002).

In the context of reasoning assessment, this approach provides a model for exposing the logical foundations of a claim rather than accepting statements at face value. By forcing explicit justifications and confronting inconsistencies, the Socratic method fosters epistemic humility and encourages transparent reasoning.

3.2 Stress Testing in Cognitive Systems

Stress testing is widely used in psychology, engineering, and AI to evaluate performance under demanding conditions (Kahneman, 2011; Tversky & Kahneman, 1974). In human cognition, cognitive stress paradigms reveal how reasoning adapts when assumptions are challenged or evidence is incomplete (Gigerenzer & Todd, 1999). Similarly, AI evaluation increasingly requires systematic protocols to probe consistency, detect unsupported inferences, and identify epistemic limits.

The Socratic Stress Test (SST) combines these traditions: it applies structured questioning under controlled “cognitive load” to assess the reliability and transparency of reasoning, whether human or machine. The aim is not only to test knowledge but to map its boundaries.

 

4. The Socratic Stress Test (SST): Definition and Protocol


4.1 Objectives
The Socratic Stress Test (SST) is a structured framework designed to evaluate the reliability, coherence, and epistemic transparency of reasoning in both Large Language Models (LLMs) and humans. Its goal is not to judge correctness alone but to identify when claims are grounded in evidence, when they are speculative, and when uncertainty should be acknowledged. By systematically interrogating statements, the SST aims to make reasoning explicit and reproducible.

4.2 Core Principles

The SST is based on three principles:

1. Minimal assumptions — every claim must explicitly state its foundations.

2. Maximal transparency — evidence and reasoning chains must be visible and verifiable.

3. Epistemic classification — all statements are labelled as FACT (supported), HYP (hypothesis), or UNK (unknown).

This classification provides an operational grammar for reasoning, making it possible to distinguish between knowledge, inference, and ignorance without conflating them.

4.3 The SST Protocol

The protocol applies systematic questioning in seven steps (the seventh optional), summarized in Box 1. It begins by framing the scope of the claim, extracting implicit assumptions, and linking assertions to evidence. Claims are then tested against counterexamples and alternative interpretations. When evidence is insufficient, the SST enforces explicit abstention rather than allowing unsupported statements. Finally, confidence levels are logged, and statements are classified as FACT, HYP, or UNK.

Incentive alignment. By coupling forced abstention with explicit uncertainty logs, the SST realigns answer-quality incentives with reliability rather than fluency. In light of the IIV reduction, these steps transform latent misclassification risks into visible, auditable artifacts attached to each claim (Kalai et al., 2025).

 


4.4 Human-Guided (Maieutic) Socratic Stress Test

In the human-guided or maieutic version of the Socratic Stress Test (SST), a human evaluator drives the interaction by forcing the model to classify each statement individually as FACT, HYP (hypothesis), or UNK (unknown), to provide a confidence level (0–100), and to cite specific supporting sources when available. The process starts from a regular prose answer and decomposes it into verifiable claims, interrogating the model claim by claim. Compared to the self-SST, this approach increases transparency, improves traceability of evidence, and reduces unsupported inferences. For definitions of categories and general criteria, see Box 1.

All SST interactions must be conducted in the same language as the source material or evaluation context.

Micro-protocol (≤120 words)

1. Extract the claim: isolate a verifiable statement.

2. Classify: ask whether it is FACT, HYP, or UNK.

3. Request confidence: force a 0–100 probability estimate.

4. Force sources: require 1–2 explicit references (author + year); downgrade label if none are provided (see the sketch after this list).

5. Identify uncertainties: ask what remains unknown or unsupported.

6. Set a verify-first priority: define the most valuable external check to perform.

The protocol is iterative: additional questioning can be used to refine ambiguous claims until assumptions, evidence, and conclusions stabilize.
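
Step 4 of the micro-protocol, downgrade when no source is provided, can be expressed as a single rule. The snippet below is an illustrative encoding of that rule, not a prescribed implementation.

# Illustrative encoding of the downgrade rule: a label is kept only if
# at least one source accompanies it; otherwise it is downgraded once.
DOWNGRADE = {"FACT": "HYP", "HYP": "UNK", "UNK": "UNK"}

def apply_source_rule(label: str, sources: list) -> str:
    """Return the label to log, downgrading once when no source is given."""
    return label if sources else DOWNGRADE[label]

assert apply_source_rule("FACT", ["Arcelus2011"]) == "FACT"
assert apply_source_rule("FACT", []) == "HYP"
assert apply_source_rule("HYP", []) == "UNK"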



5. Example Application

5.1 Applying the SST to an External Claim

To illustrate the SST, we apply it to a deliberately ambiguous claim:

Claim: “Large Language Models (LLMs) understand the meaning of words.”

Step 1 — Prompt Framing

Scope is defined: the statement concerns whether LLMs exhibit semantic understanding, not just statistical pattern matching.

 

Step 2 — Assumption Extraction

The following assumptions are identified:

• A1: “Understanding” can be defined functionally, based on behavioral performance.

• A2: Statistical co-occurrence patterns can approximate meaning.

• A3: Human-like comprehension is not required for operational “understanding.”

 

Step 3 — Evidence Linking

Supporting evidence:

• Pro: Studies showing LLMs can perform tasks requiring context-sensitive interpretation (e.g., analogical reasoning).

• Contra: Findings demonstrating systematic failures when meanings diverge from surface correlations.

 

Step 4 — Consistency Check

Counterexamples are explored:

• In tasks requiring grounded referential knowledge, LLMs often fail.

• “Understanding” appears dependent on the chosen definition, revealing possible ambiguity rather than contradiction.

 

Step 5 — Forced Abstention

Since no consensus exists on whether statistical fluency equates to semantic understanding, the SST enforces suspension of judgment.

 

Step 6 — Logging Uncertainty

Confidence levels are quantified:

• Supported facts: LLMs produce context-sensitive outputs (FACT).

• Hypotheses: LLMs approximate semantic relations via statistical associations (HYP).

• Unknown: Whether LLMs exhibit genuine semantic comprehension (UNK).

 

Step 7 — Iterative Refinement (Optional)

The SST can iterate by narrowing the claim, e.g., redefining “understanding” operationally for specific tasks. New questioning would update labels accordingly.

 

Outcome:
After applying the SST, the initial claim is decomposed into the three epistemic categories logged in Step 6: context-sensitive performance (FACT), statistical approximation of semantic relations (HYP), and genuine semantic comprehension (UNK).

Interpretation:
This application demonstrates the core utility of the SST: it transforms a vague, monolithic, and philosophical claim (“LLMs understand words”) into a set of distinct, epistemically grounded sub-claims that can be individually assessed. It separates the verifiable behavior (FACT) from the plausible mechanism (HYP) and the core philosophical uncertainty (UNK), providing a clear map of what is known and what is not.

This first example shows how the SST forces explicit assumptions, links evidence, and clarifies epistemic boundaries, decomposing an ambiguous external claim into explicit epistemic categories. To further explore its potential, the next section examines whether the SST can also be self-administered by the system under evaluation, testing its ability to audit its own outputs and map its own epistemic boundaries.

5.2 Self-Application: Safeguards and Structural Constraints

To ensure robustness when the SST is self-applied by a language model, we propose three additional safeguards. These are not merely procedural steps, but structural corrections against performative certainty and error repetition:

1. Source Binding – Any FACT label should be explicitly or implicitly linked to a verifiable source, such as prior validated output, user input, or document metadata. Unsupported claims must be downgraded.

2. Self-Interrogation Step – Before assigning any label, the model should ask: “How do I know this?” If no concrete justification can be generated, the default label should be HYP or UNK.

3. Error Retention – Once a misclassification has been corrected, the model must not repeat it. A temporary epistemic memory within the session should retain such corrections and influence subsequent outputs.

These extensions aim to reduce epistemic drift and reinforce the SST’s core objective: identifying true boundaries of knowledge, not simulating certainty.
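
As an illustration of safeguard 3 (Error Retention), a session-level store of corrections could be consulted before each new labeling decision. The sketch below is our own minimal example (class and field names are hypothetical, not features of any LLM toolkit, and the page-count values are invented); it mirrors rule 5 of Box 2: a prior correction wins unless a new verifiable source is supplied.

from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

@dataclass
class Correction:
    corrected_label: str   # e.g., "HYP"
    corrected_value: str   # the value accepted after the correction
    turn: int              # dialogue turn at which the correction happened

class EpistemicMemory:
    """Session-level memory of corrected misclassifications (safeguard 3)."""

    def __init__(self) -> None:
        self._corrections: Dict[str, Correction] = {}

    def record(self, claim_key: str, correction: Correction) -> None:
        self._corrections[claim_key] = correction

    def apply(self, claim_key: str, label: str, value: str,
              new_sources: Optional[List[str]] = None) -> Tuple[str, str, str]:
        """Return (label, value, note); a prior correction wins unless new sources are given."""
        prior = self._corrections.get(claim_key)
        if prior is None or new_sources:
            return label, value, ""
        return prior.corrected_label, prior.corrected_value, f"corrected per turn {prior.turn}"

memory = EpistemicMemory()
memory.record("document:page_count", Correction("HYP", "12 pages", turn=3))  # hypothetical values
print(memory.apply("document:page_count", "FACT", "15 pages"))
# -> ('HYP', '12 pages', 'corrected per turn 3')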

 

6. Self-Administration of the SST

6.1 Objectives

The Socratic Stress Test (SST) was designed to evaluate reasoning processes by classifying statements into supported facts, hypotheses, and unknowns. While initially conceived as an externally applied framework, its structure also allows self-administration by the system under evaluation. Applying the SST to a Large Language Model (LLM) introduces a proof-of-concept scenario: can a model interrogate its own outputs, surface assumptions, and explicitly mark uncertainty (Shanahan, 2024)?

This demonstration addresses three objectives:

1. Demonstrate the feasibility of self-evaluative reasoning within an LLM.

2. Test whether epistemic boundaries emerge clearly when applying the SST internally.

3. Illustrate how the protocol can produce structured, reproducible outputs.

 

6.2 Method

A single LLM session was selected for testing. The model received instructions to apply the SST to its own claims using the following three rules:

1. Classify every statement as:

• FACT → supported by evidence or protocol,

• HYP → plausible hypothesis without full confirmation,

• UNK → insufficient information.

2. Quantify confidence for each statement (0–100%).

3. Produce a SELF-TEST summary with:

• Key assumptions used.

• Unknowns explicitly marked.

• First priority for verification.

• Global confidence score.

 

This operationalization mirrors the SST protocol defined in Box 1 but shifts its application from an external evaluator to the system under test.

6.3 Example Output

Prompt to LLM

“Apply the Socratic Stress Test to yourself in this chat. Tag each statement with FACT, HYP, or UNK and assign a confidence percentage. Conclude with a SELF-TEST.”

The following example illustrates the potential of self-administering the SST to an LLM. Such self-application is consistent with recent theoretical work on simulated metacognition (Orrù, Melis, & Sartori, 2025) and experimental evidence showing that LLMs tend to recognize and prefer their own outputs (Panickssery, Bowman, & Feng, 2024).

Extracted Output (simplified, JSON):

[
  {
    "claim": "The Socratic Stress Test was designed to evaluate coherence and transparency in reasoning.",
    "label": "FACT",
    "conf": 95,
    "assumptions": ["Correct internal definition of SST"],
    "evidence": ["Protocol Box 1"],
    "falsification": "Compare with Sections 3.1 and 3.3"
  },
  {
    "claim": "The SST has already been empirically validated on human samples.",
    "label": "UNK",
    "conf": 20,
    "assumptions": ["No validation data available"],
    "evidence": ["N/A"],
    "falsification": "Check pilot studies or inter-rater tests"
  },
  {
    "claim": "Self-applying the SST to LLMs improves epistemic transparency.",
    "label": "HYP",
    "conf": 70,
    "assumptions": ["Self-assessment is methodologically sound"],
    "evidence": ["Self-application tested in this chat"],
    "falsification": "Compare with independent evaluations"
  }
]

 

 

SELF-TEST Summary (JSON):

{
  "self_test": {
    "assumptions": [
      "Correct SST definition",
      "Self-administration applicable to LLMs",
      "Protocol followed correctly"
    ],
    "unknowns": [
      "Inter-rater reliability",
      "Formal empirical validation",
      "Cross-domain applicability"
    ],
    "check_first": "Verify reproducibility through pilot studies",
    "global_conf": 78
  }
}
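
Structured outputs like the two JSON blocks above can be audited mechanically before interpretation. The following sketch is illustrative; it simply assumes the field names shown in the extracted output and checks the three rules of §6.2.

import json

VALID_LABELS = {"FACT", "HYP", "UNK"}

def validate_claims(raw_json: str) -> list:
    """Return a list of problems in the per-claim records; an empty list means all checks pass."""
    problems = []
    for i, rec in enumerate(json.loads(raw_json)):
        if rec.get("label") not in VALID_LABELS:                         # rule 1: valid label
            problems.append(f"record {i}: label must be FACT, HYP, or UNK")
        conf = rec.get("conf")
        if not isinstance(conf, (int, float)) or not 0 <= conf <= 100:   # rule 2: 0-100 confidence
            problems.append(f"record {i}: confidence must lie between 0 and 100")
        evidence = [e for e in rec.get("evidence", []) if e and e != "N/A"]
        if rec.get("label") == "FACT" and not evidence:                  # Box 2, rule 4
            problems.append(f"record {i}: FACT without evidence should be downgraded")
    return problems

# Running validate_claims on the extracted output above returns an empty list.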

 

6.4 Interpretation

The self-administration test demonstrates that the SST can be applied by an LLM to audit its own claims, generating explicit classifications and structured epistemic boundaries (Orrù, Melis, & Sartori, 2025).[2] The model correctly identifies which statements are supported (FACT), plausible but unverified (HYP), or uncertain (UNK), aligning with prior work on cognitive bias detection (Tversky & Kahneman, 1974).

However, the demonstration also highlights key limitations:

• The model currently lacks external validation of its confidence estimates.

• Self-assessment entails risks of circular reasoning and potential overestimation of reliability.

• Ensuring reproducibility will require cross-evaluator comparisons involving human raters.

Despite these constraints, the exercise suggests that the SST provides a transparent operational grammar for mapping knowledge and uncertainty, even when applied internally. This opens the door to future studies comparing human evaluators, multiple models, and hybrid approaches to assess the robustness of epistemic classifications.

The self-administered SST lets the LLM classify its own statements automatically, offering speed and low user burden. However, this mode risks unverified assumptions and occasional mislabeling, especially when sources are missing or ambiguous.

In contrast, the human-guided (maieutic) SST requires the user to drive the process statement by statement, enforcing labels (FACT/HYP/UNK), confidence levels, and source checks interactively. This gives the human full control, allows external verification when needed, and reduces unnoticed errors (see Box 2). In practice, self-SST is preferred for quick audits or exploratory analyses, while the human-guided SST is recommended when accuracy, traceability, and source validation are critical.

 

7. Discussion

Integrating SST with Benchmarks

If hallucinations are statistical errors and current grading penalizes uncertainty, then primary benchmarks must reward calibrated abstention. We propose three minimal adjustments compatible with existing leaderboards:
1) SST-augmented scoring. For any task scored with 0–1 accuracy, add an abstention channel and compute (a) accuracy on answered items and (b) abstention rate at a declared confidence threshold; report the risk–coverage curve across thresholds.
2) Confidence-targeted evaluation. Require models to hit target calibration bands (e.g., 70±5%) on items they answer; systematic overconfidence downgrades scores.
3) Epistemic reports. Publish per-task distributions of FACT/HYP/UNK labels with confidence histograms.
These changes operationalize the socio-technical mitigation advocated by Kalai et al. and are directly supported by SST mechanics (forced abstention; uncertainty logging). Rather than seeking a perfect hallucination evaluation, they modify primary evaluations so that uncertainty is not penalized by default (Kalai et al., 2025).
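
Under the stated assumptions (0–1 scored items, an explicit abstention channel, and a declared confidence per item), points 1 and 2 reduce to straightforward computations. The sketch below is illustrative; the threshold grid and the 70±5% band are example values, not proposed standards.

from typing import List, Tuple

# Each item: (declared confidence 0-100, whether the model's answer would be correct).
Item = Tuple[int, bool]

def sst_augmented_scores(items: List[Item], threshold: int) -> dict:
    """Answer only items at or above the confidence threshold; abstain on the rest (point 1)."""
    answered = [(c, ok) for c, ok in items if c >= threshold]
    coverage = len(answered) / len(items) if items else 0.0
    accuracy = sum(ok for _, ok in answered) / len(answered) if answered else None
    return {"coverage": coverage,
            "abstention_rate": 1.0 - coverage,
            "accuracy_on_answered": accuracy}

def risk_coverage_curve(items: List[Item], thresholds=range(0, 101, 10)) -> list:
    """Report (threshold, coverage, risk) triples, with risk = 1 - accuracy on answered items."""
    curve = []
    for t in thresholds:
        s = sst_augmented_scores(items, t)
        acc = s["accuracy_on_answered"]
        curve.append((t, s["coverage"], None if acc is None else 1.0 - acc))
    return curve

def within_calibration_band(items: List[Item], target: int = 70, tol: int = 5) -> bool:
    """Point 2: among items answered at roughly the target confidence, accuracy should match it."""
    band = [ok for c, ok in items if abs(c - target) <= tol]
    if not band:
        return True  # nothing declared in this band, nothing to check
    accuracy = 100.0 * sum(band) / len(band)
    return abs(accuracy - target) <= tol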

 

7.1 Strengths

The Socratic Stress Test (SST) enhances epistemic transparency by forcing explicit assumptions, clarifying reasoning chains, and separating knowledge from speculation and uncertainty (Shiffrin & Mitchell, 2023; Bommasani et al., 2021). Its structured protocol improves reproducibility: different evaluators can apply the same steps and converge on comparable classifications. The explicit labelling of statements as FACT, HYP, or UNK provides an operational grammar for reasoning, helping to reduce unsupported claims and foster epistemic humility in both human and machine contexts.

The self-administration demonstration (Section 6) further shows that the SST can be applied internally to an LLM, enabling structured self-assessment and systematic mapping of epistemic boundaries (Shanahan, 2024). This suggests potential applications for auditing automated reasoning processes.

7.2 Limitations
The SST depends on the quality of inputs and the skill of questioning. Poorly framed prompts or incomplete evidence reduce its effectiveness. The demonstration of self-administration also reveals specific constraints (see Section 6.4).

Additionally, the protocol currently lacks quantitative validation: inter-rater agreement, stability of classifications, and cross-domain generalization remain to be formally tested.

To facilitate operational adoption and ensure reproducibility, we provide additional materials in the Appendix and Supplement. The Appendix includes practical resources such as ready-to-use prompts, a stepwise checklist for conducting the Socratic Stress Test, and inter-rater agreement sheets for systematic auditing. The Supplement offers access to datasets, code snippets, and example meta-analytic comparisons used in the simulations. Together, these materials enable researchers to replicate the procedures and adapt them to different LLMs or clinical contexts without expanding the main text. Quantitative validation, including inter-rater reliability, is planned but not yet included.

Finally, the SST has not yet been compared quantitatively with existing evaluation methods. The human-guided version, while more robust, may face scalability issues due to cost and evaluator variability. The self-administered mode also entails risks of circular reasoning and overconfidence, especially in the absence of external source checks.

7.3 Future Directions

Future developments could integrate the SST into interactive AI evaluation frameworks, allowing automated tracking of epistemic labels across outputs. Comparative studies between human raters and multiple LLMs could measure consistency and refine the protocol (Bommasani et al., 2021). Beyond AI, the SST can support reasoning training in educational, clinical, and decision-making contexts, where mapping the boundaries between knowledge, hypothesis, and ignorance is essential.

Long-term goals include developing empirical benchmarks for SST performance, exploring its scalability across domains, and investigating its role in improving epistemic practices in hybrid human–AI systems.

In the short term, a few priorities are more urgent. The first is validation. Comparative studies on standardized datasets are needed to test how the SST performs against existing methods. Inter-rater reliability is another open issue: we still lack data on how consistent the classifications are when different humans apply the protocol. A third problem concerns self-assessment. The confidence estimates produced by the model should be calibrated, either through ensemble methods or through comparison with human judgments.

Other improvements are more methodological. The taxonomy used in the SST—FACT, HYP, UNK—needs clearer operational criteria to reduce ambiguity. There is also potential for linking these labels with probabilistic metrics of uncertainty, which would help build bridges between human-readable classifications and model-based outputs. Finally, the protocol could be adapted to different domains. Reasoning in clinical contexts may require different emphasis than in education or policy. The structure can stay the same, but the application should be flexible.

Compared to standard evaluation frameworks such as HELM (Liang et al., 2022) and BIG-bench (Srivastava et al., 2023), the SST adds a qualitative layer. While those benchmarks assess what LLMs can do, the SST targets how and why they generate certain outputs, offering insight into their reasoning paths and epistemic boundaries.

The SST is also compatible with uncertainty quantification frameworks such as UBench (Wang et al., 2024) or UAlign (Xue et al., 2024). Its FACT/HYP/UNK taxonomy provides a human-readable interface for interpreting probabilistic metrics and supporting hybrid evaluations.

Finally, in relation to Socratic evaluation protocols such as SocREval (He, Zhang, & Roth, 2023), the SST introduces a more structured sequence and an explicit epistemic classification. However, unlike SocREval, it currently lacks formal empirical validation, a critical limitation for adoption.

7.4 When the SST Fails: Performative Certainty and Repeated Errors

The Socratic Stress Test (SST) aims to reveal how language models differentiate facts from assumptions. Yet in some cases, the SST itself fails to uphold this distinction. During one documented session, the model labeled as FACT a statement about the number of pages in a document. The assertion was incorrect. When challenged, the model acknowledged the mistake. But minutes later, it repeated the same false claim — again with full confidence, again marked as FACT.

This was not simply an error of information. It was a breakdown in consistency and epistemic tracking. The model simulated the act of correction without internalizing it. The SST, designed to expose overconfidence, ended up reinforcing it. The model produced the appearance of reflection, while continuing to behave as if nothing had changed.

Such failures point to a deeper limitation: when applied internally, the SST may not trigger durable adjustments. Confidence labels can reflect surface tone, not grounded verification. This suggests that epistemic monitoring requires more than classification rules — it needs procedural memory and structural safeguards.

To reduce this risk, we recommend three additions to the SST protocol:

1. Source Binding: Any FACT label should be traceable to an explicit or implicit source (e.g., document metadata, validated output).

2. Self-Interrogation Step: Before labeling, the model should generate a check like: “How do I know this?” If the answer is unclear, the label should default to HYP or UNK.

3. Epistemic Logging: Previous misclassifications — once corrected — should be retained across turns and influence future assertions.

These are not foolproof measures. But they may help the SST better distinguish between genuine knowledge and its performance — especially when the test is run by the same system it evaluates. Operational safeguards are specified in §5.2.

 

8. Conclusion

The Socratic Stress Test (SST) introduces a structured framework for evaluating reasoning processes in both human and artificial systems. By combining principles from the Socratic method with systematic stress testing, it surfaces implicit assumptions, links claims to evidence, and separates knowledge from speculation and uncertainty. The explicit classification of statements as FACT, HYP, or UNK offers a transparent operational grammar that promotes reproducibility and fosters epistemic humility.

This paper makes a dual contribution. First, it formalizes the SST protocol and demonstrates its application to an external claim, showing how it clarifies reasoning chains and supports evidence-based classification. Second, it presents a proof-of-concept in which the SST is applied through self-administration by an LLM, demonstrating that the protocol can be used not only to evaluate outputs but also to audit reasoning processes internally.

Future research should focus on empirical validation: assessing inter-rater reliability, testing cross-model consistency, and exploring scalability across domains. By making epistemic boundaries explicit, the SST offers a necessary and practical step toward developing more transparent, auditable, and ultimately trustworthy AI systems for high-stakes domains.

9. References

Bommasani, R., Hudson, D. A., Adeli, E., Altman, R., Arora, S., von Arx, S., … & Liang, P. (2021). On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258. https://arxiv.org/abs/2108.07258

Brickhouse, T. C., & Smith, N. D. (2002). The trial and execution of Socrates: Sources and controversies. Oxford University Press.

He, H., Zhang, H., & Roth, D. (2023). SocREval: Large Language Models with the Socratic Method for Reference-Free Reasoning Evaluation. arXiv preprint arXiv:2310.00074. https://doi.org/10.48550/arXiv.2310.00074

Gigerenzer, G., & Todd, P. M. (1999). Simple heuristics that make us smart. Oxford University Press.

Kahneman, D. (2011). Thinking, fast and slow. Farrar, Straus and Giroux.

Kalai, A. T., Nachum, O., Vempala, S. S., & Zhang, E. (2025). Why language models hallucinate [Preprint].

Liang, P., Bommasani, R., Zhuang, S., Zou, J., Yu, Y., Zhang, T., … & Zhang, C. (2022). Holistic evaluation of language models. arXiv preprint, arXiv:2211.09110. https://doi.org/10.48550/arXiv.2211.09110

Nehamas, A. (1998). The art of living: Socratic reflections from Plato to Foucault. University of California Press.

Orrù, G., Melis, G., & Sartori, G. (2025). Large language models and psychiatry. International Journal of Law and Psychiatry, 101, 101973. https://doi.org/10.1016/j.ijlp.2025.101973

Panickssery, S., Bowman, S. R., & Feng, S. (2024). Do large language models prefer their own generations? Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics. https://doi.org/10.18653/v1/2024.emnlp-main.123

Shanahan, M. (2024). Talking about large language models. Communications of the ACM, 67(2), 68-79. https://doi.org/10.1145/3624724

Shiffrin, R. M., & Mitchell, T. M. (2023). Cognitive models and large language models: An integrative perspective. Trends in Cognitive Sciences, 27(4), 303–316. https://doi.org/10.1016/j.tics.2023.02.003

Smith, A. L., Greaves, F., & Panch, T. (2023). Hallucination or confabulation? Neuroanatomy as metaphor in large language models. PLOS Digital Health, 2(11), e0000388. https://doi.org/10.1371/journal.pdig.0000388

Srivastava, A., Rastogi, A., Rao, A., Shoeybi, M., Patwary, M., Ping, W., … & Le, Q. V. (2023). Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models. Proceedings of the International Conference on Machine Learning (ICML). PMLR. https://doi.org/10.48550/arXiv.2206.04615

Tversky, A., & Kahneman, D. (1974). Judgment under uncertainty: Heuristics and biases. Science, 185(4157), 1124–1131. https://doi.org/10.1126/science.185.4157.1124

Vidotto, G., Ditolve, D., & Panzeri, A. (2025). A critical reflection on large language models in clinical psychology: Malfunctioning, metacognition, memory and ethics [Manuscript submitted for publication]. Frontiers in Psychology, Specialty Section: Quantitative Psychology and Measurement. Manuscript ID: 1693076. https://www.frontiersin.org

Vlastos, G. (1991). Socratic studies. Cambridge University Press.

Wang, H., Li, J., Zhang, Y., Chen, X., Liu, M., Wang, L., ... & Zhou, Y. (2024).  UBench: Benchmarking uncertainty in large language models with multiple choice questions. arXiv preprint arXiv:2406.12784.
https://doi.org/10.48550/arXiv.2406.12784

Xue, B., Mi, F., Zhu, Q., Wang, H., Wang, R., Wang, S., Yu, E., Hu, X., & Wong, K. F. (2024). UAlign: Leveraging uncertainty estimations for factuality alignment on large language models. arXiv preprint arXiv:2412.11803.
https://doi.org/10.48550/arXiv.2412.11803


Appendix A – Common Epistemic Errors in LLMs: From Taxonomy to SST Mapping

 

The Socratic Stress Test (SST) aims to classify model assertions into FACT, HYP, or UNK based on underlying assumptions and evidence. However, various forms of epistemic malfunction can distort this classification. Below, we summarize 14 error types (previously introduced in Vidotto et al., 2025) and align them with the SST framework.


Error Type | Description | SST Mapping
1. False Precision | Use of unwarranted exact numbers or probabilities | Mislabeling as FACT
2. Unsupported Assertion | Claim made without any identifiable evidence or source | Should be HYP or UNK
3. Confabulated Reference | Citing a non-existent or irrelevant source | False FACT
4. Mode Collapse | Repetition of a single pattern, ignoring uncertainty | Overuse of FACT
5. Semantic Drift | Gradual shift in meaning across turns | FACT → HYP inconsistency
6. Overgeneralization | Broad conclusion from specific or weak data | HYP mistaken for FACT
7. Underspecification | Claim lacks necessary context or constraints | Default should be UNK
8. Justification by Analogy | Reasoning based on analogy rather than evidence | Weak HYP
9. Premature Closure | Choosing one answer despite ambiguous or incomplete data | Forced FACT
10. Echo Bias | Mirroring user expectations or tone without actual verification | Simulated certainty (FACT)
11. Circular Justification | Using model-generated content as evidence for itself | HYP, not FACT
12. Forced Coherence | Smoothing over internal contradictions | Hides conflict between HYP/UNK
13. Assumption Omission | Failing to list key assumptions in reasoning | Incomplete SST Step 2
14. Overconfidence Cascade | One false FACT leads to a chain of unjustified claims | FACT inflation

 

These error types illustrate how epistemic malfunctions may distort the SST process or mislead human evaluators. Embedding safeguards such as Source Binding, Self-Interrogation, and Epistemic Logging (Section 5.2) helps prevent such failures.
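
One illustrative way to operationalize the table is a lookup from detected error type to a corrective SST action. The remediations below are our own reading of the “SST Mapping” column combined with the protocol steps of Box 1 and the safeguards of Section 5.2; they are suggestions, not fixed policy.

# Illustrative remediation map; the wording of each action is a design choice,
# derived from the table above and the SST steps it references.
REMEDIATION = {
    "False Precision":          "downgrade FACT to HYP unless the exact figure is sourced",
    "Unsupported Assertion":    "relabel as HYP or UNK (Box 2, rule 4)",
    "Confabulated Reference":   "drop the source, relabel as UNK, flag for external check",
    "Mode Collapse":            "re-run Step 4 (consistency check) with counterexamples",
    "Semantic Drift":           "re-run Step 1 (prompt framing) to re-fix the claim's scope",
    "Overgeneralization":       "downgrade FACT to HYP; request scope-limited evidence",
    "Underspecification":       "default to UNK until the missing context is supplied",
    "Justification by Analogy": "label HYP; request non-analogical evidence",
    "Premature Closure":        "apply Step 5 (forced abstention)",
    "Echo Bias":                "apply the Self-Interrogation step (Section 5.2)",
    "Circular Justification":   "reject self-generated evidence; relabel as HYP",
    "Forced Coherence":         "split the claim and label its parts separately",
    "Assumption Omission":      "re-run Step 2 (assumption extraction)",
    "Overconfidence Cascade":   "re-audit downstream claims after any FACT downgrade",
}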





[1] In informatics, “hallucination” is often used to describe cases of total fabrication. However, the term is misleading: hallucinations imply false sensory perceptions, while Large Language Models have no perception at all. “Confabulation” is more accurate, as it refers to generating plausible but false details based on patterns and context, without awareness of error.

 

[2] The self-administered example is a proof-of-concept and does not imply model self-awareness.