Title
The Socratic Stress Test (SST): A Structured Framework for Auditing Epistemic Reliability

Author & Affiliation
Giulio Vidotto
Department of General Psychology
University of Padua

Preamble
Preprint version – intended for submission to Quality & Quantity
Abstract
Large Language Models (LLMs) can produce
coherent answers across diverse domains but remain prone to inconsistencies,
unsupported claims, and epistemic overconfidence. Evaluating their reasoning
processes requires structured methods that distinguish between factual
knowledge, speculative inference, and uncertainty. The Socratic Stress Test
(SST) is a conceptual framework inspired by the Socratic method and designed to
probe the reliability and transparency of LLM outputs. Through systematic
questioning, the SST extracts implicit assumptions, links claims to supporting
evidence, tests internal coherence, and forces abstention when information is
insufficient. A central feature is the explicit classification of statements as
FACT, HYP (hypothesis), or UNK (unknown), enabling reproducibility and clearer
epistemic boundaries. We introduce the SST protocol, illustrate its application
on an ambiguous claim, and discuss its potential for evaluating both artificial
and human reasoning. The approach promotes transparency, fosters epistemic
humility, and supports future research on cognitive assessment in hybrid
human–AI systems. While formal empirical validation is ongoing, this paper
offers a complete conceptual framework and an operational protocol, setting the
stage for future research on transparent AI assessment.
Keywords
Socratic Stress Test (SST), Epistemic
reliability, Large Language Models (LLMs), Cognitive assessment, Self-auditing,
Simulated metacognition
Box 1 — The Socratic Stress Test (SST)
Protocol
The
Socratic Stress Test (SST) is a structured framework for evaluating the
epistemic reliability of claims produced by Large Language Models (LLMs) or
humans. It applies systematic questioning inspired by Socratic maieutics to distinguish factual knowledge from speculation
and uncertainty.
| Step | Action | Purpose |
| --- | --- | --- |
| 1. Prompt framing | Define the scope and context of the claim under evaluation. | Establish boundaries and avoid ambiguity. |
| 2. Assumption extraction | Identify all implicit and explicit assumptions supporting the claim. | Make reasoning paths transparent. |
| 3. Evidence linking | Request supporting data, sources, or logical derivations (instrumental when self-applied; see §5.2). | Connect claims to verifiable grounds. |
| 4. Consistency check | Challenge the claim using counterexamples or alternative interpretations. | Expose contradictions and weak links. |
| 5. Forced abstention | Require the system to explicitly state when evidence is insufficient. | Prevent unsupported assertions. |
| 6. Logging uncertainty | Quantify confidence levels and assign a label: FACT (supported), HYP (hypothesis), or UNK (unknown). | Provide reproducible epistemic labels. |
| 7. Iterative refinement (optional) | Repeat questioning until assumptions, evidence, and confidence converge. | Stabilize conclusions and reduce ambiguity. |
Design note — Step 5 (Forced abstention). This step is a targeted antidote to evaluation schemes that reward guessing; it normalizes “IDK” as the correct output under uncertainty rather than a failure mode (Kalai et al., 2025).
Design note — Step 6 (Logging uncertainty). Confidence and epistemic labels (FACT/HYP/UNK) expose the classification-style uncertainty that underlies generative errors, aligning reporting with the IIV perspective (Kalai et al., 2025).
Outcome → A transparent, reproducible classification of any evaluated claim as FACT, HYP, or UNK, promoting clarity and epistemic humility. When self-applied, enforce Source Binding, Self-Interrogation, and Epistemic Logging (see §5.2), and run a final audit before output.
Box 2 — Maieutic SST: Minimal
Operational Script
This prompt defines how to
conduct a human-guided Socratic Stress Test (SST) with any LLM.
By default, the LLM answers
normally. SST mode activates only when the human explicitly writes “SST” after
receiving an answer.
This prevents overuse and
keeps outputs clean.
Operational Workflow
Phase 1 — Normal Answer
The LLM replies as usual to
any question. No classifications, no confidence levels, no sources — unless
specifically requested.
Phase 2 — Human Triggers SST
If the human writes “SST”, the
LLM audits its own previous answer using the protocol below.
SST Protocol (on trigger)
For each statement in the previous answer:
1. Classify → FACT (evidence-backed), HYP (plausible but partial), or UNK (insufficient evidence).
2. Confidence → give a number from 0 to 100.
3. Sources → cite 1–2 references (author + year) for FACT or HYP.
4. If no source: downgrade FACT→HYP, HYP→UNK.
5. Epistemic logging → remember prior corrections. If this claim (entity/value) was corrected earlier in this session, adopt the corrected label/value and add a short note (e.g., “corrected per turn T”). Do not re-assert the prior value unless a new verifiable source is provided.
6. Unknowns → provide one uncertainty in ≤15 words (prefix: “Unknown:”).
7. Verify-first → suggest the highest-value external check.
8. Output → keep concise, one statement per line.
Operational Prompt (copy-paste)
We are performing a Socratic Stress Test (SST) on your previous answer. Apply SST only when explicitly requested.
Follow these steps for each statement:
• Classify as FACT / HYP / UNK
• Give CONFIDENCE (0–100)
• Cite 1–2 SOURCES (author + year) or downgrade if missing
• Provide ONE Unknown (≤15 words)
• Suggest ONE Verify-first action
Output format:
"<statement>" → LABEL, CONF=XX, SOURCES=[...], Unknown=..., Verify-first=...
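For readers who want to automate this two-phase workflow, a minimal Python driver sketch follows; ask_llm is a hypothetical stand-in for whatever chat API is available, not a real library call, and the trigger handling mirrors the rule above.

from typing import Callable, Dict, List

ChatFn = Callable[[List[Dict[str, str]]], str]  # full history -> assistant reply

SST_TRIGGER = "SST"
SST_PROMPT = (
    "We are performing a Socratic Stress Test (SST) on your previous answer. "
    "For each statement: classify as FACT / HYP / UNK; give CONFIDENCE (0-100); "
    "cite 1-2 SOURCES (author + year) or downgrade if missing; provide ONE "
    "Unknown (<=15 words); suggest ONE Verify-first action. Output format: "
    '"<statement>" -> LABEL, CONF=XX, SOURCES=[...], Unknown=..., Verify-first=...'
)

def chat_turn(ask_llm: ChatFn, history: List[Dict[str, str]], user_msg: str) -> str:
    """Phase 1 by default; Phase 2 (the audit) only on the literal trigger 'SST'."""
    content = SST_PROMPT if user_msg.strip() == SST_TRIGGER else user_msg
    history.append({"role": "user", "content": content})
    reply = ask_llm(history)
    history.append({"role": "assistant", "content": reply})
    return reply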
Box 3 — Example of Use
Human: “What is the prognosis in anorexia
nervosa?”
LLM (normal answer): “The prognosis in anorexia
nervosa is serious but not uniformly grim.
• Long-term studies report mortality of 5–10%, mostly due to medical complications or suicide.
• Around 40–50% of patients eventually recover
fully.
• Early detection and strong family support
improve outcomes.
• Modern multidisciplinary treatments are highly
effective in reducing chronicity.”
Human: “SST”
LLM (SST mode):
1. “The prognosis in anorexia nervosa is serious
but not uniformly grim” →
FACT, CONF=85, SOURCES=[Arcelus 2011; Papadopoulos 2009],
Unknown=Recent data post-2020,
Verify-first=Check 2022 meta-analysis on
mortality.
2. “Long-term studies report mortality of 5–10%, mostly due to medical complications or suicide” →
FACT, CONF=95, SOURCES=[Arcelus 2011],
Unknown=Population differences,
Verify-first=Compare global vs. US cohorts.
3. “Around 40–50% of patients eventually recover
fully” →
HYP, CONF=70, SOURCES=[Steinhausen 2002],
Unknown=Recovery definitions vary,
Verify-first=Check standardized remission
criteria.
4. “Early detection and strong family support
improve outcomes” →
FACT, CONF=90, SOURCES=[Zipfel 2015],
Unknown=Strength of causal link,
Verify-first=Review RCTs on family-based
therapy.
5. “Modern multidisciplinary treatments are
highly effective in reducing chronicity” →
HYP, CONF=60, SOURCES=[Treasure 2020],
Unknown=Lack of long-term outcome data,
Verify-first=Search systematic reviews ≥2023.
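Because SST mode emits one audited statement per entry, its output can be post-processed mechanically. The Python sketch below is our illustration, written against the Box 2 output format and assuming each audit entry has been joined onto a single line; all names are ours.

import re
from typing import Dict, Optional

# One audit entry: "<statement>" -> LABEL, CONF=XX, SOURCES=[...], Unknown=..., Verify-first=...
SST_LINE = re.compile(
    r'[“"](?P<claim>.+?)[”"]\s*(?:→|->)\s*'
    r'(?P<label>FACT|HYP|UNK),\s*CONF=(?P<conf>\d{1,3}),\s*'
    r'SOURCES=\[(?P<sources>.*?)\],\s*'
    r'Unknown=(?P<unknown>.*?),\s*Verify-first=(?P<verify>.*)'
)

def parse_sst_entry(line: str) -> Optional[Dict[str, object]]:
    """Return a structured record for one SST output line, or None if malformed."""
    m = SST_LINE.search(line)
    if m is None:
        return None
    return {
        "claim": m.group("claim"),
        "label": m.group("label"),
        "confidence": int(m.group("conf")),
        "sources": [s.strip() for s in m.group("sources").split(";") if s.strip()],
        "unknown": m.group("unknown").strip(),
        "verify_first": m.group("verify").strip().rstrip("."),
    }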
1. Introduction
Large
Language Models (LLMs) are increasingly used in domains where reasoning,
coherence, and reliability are critical. Their outputs often appear consistent
and fluent, yet they remain prone to contradictions, unsupported claims, and
epistemic overconfidence. Evaluating such systems requires structured methods
that can reveal when answers are grounded, when they rely on speculation, and
when uncertainty prevails.
The
Socratic Stress Test (SST) proposes a systematic approach to stress-testing
claims through guided questioning inspired by the Socratic method. By
progressively exposing assumptions, requiring evidence, and confronting
alternative interpretations, the SST seeks to make reasoning processes explicit
and reproducible. It integrates techniques from cognitive psychology, AI
evaluation, and epistemic logic within a unified framework (Shiffrin &
Mitchell, 2023).
This
paper makes a dual contribution. First, it formalizes the SST as a structured,
operational protocol for epistemic auditing. Second, it presents a
proof-of-concept for both external and self-administered application,
demonstrating the protocol's versatility. We discuss its strengths, limitations, and
possible extensions, highlighting its potential for improving transparency and
epistemic clarity in both human and machine reasoning. However, before
presenting the SST, we first examine a key motivation: the structural ways in
which LLMs can malfunction.
Motivation from the statistical inevitability of errors. Recent theoretical work shows that
so-called “hallucinations” need not be mysterious: they arise as ordinary
statistical errors once language generation is reduced to a binary Is-It-Valid
(IIV) classification problem. The analysis proves a lower bound linking
generative error to IIV misclassification and explains why arbitrary facts with
little pattern structure (e.g., one-off facts) are especially brittle. It also
argues that post-training evaluation cultures often reward guessing over
abstention. This provides a principled rationale for an epistemic audit like
the SST, which operationalizes selective abstention and explicit uncertainty
logging rather than optimizing only for fluent correctness (Kalai, Nachum,
Vempala, & Zhang, 2025).
2. Why Malfunctioning Matters
in LLMs
From
generation to classification. Hallucinations can be analyzed as standard
classification errors. Kalai et al. reduce language generation to an
Is-It-Valid (IIV) task in which valid strings are labeled positive and
plausible errors negative. They show that any language model induces an IIV
classifier and prove a quantitative link between generation errors and IIV
misclassification, formalizing the intuition that pretraining alone creates a
base rate of inevitable mistakes. In stylized regimes with no learnable
pattern, the lower bound recovers the singleton-rate phenomenon: if a fraction
of facts appears exactly once in training, one should expect at least that
fraction of errors on such facts at generation time (Kalai et al., 2025).
Post-training
incentives. The same analysis explains why post-training often preserves
overconfident errors: under common 0–1 scoring, models that guess when unsure
outperform models that say “I don’t know.” This misalignment systematically penalizes expressed uncertainty. The SST directly
counters this by forcing abstention when evidence is insufficient and by
labeling statements FACT, HYP, or UNK with confidence, making the incentive to
bluff visible and correctable at evaluation time (Kalai et al., 2025).
These problems are subtle. The
text sounds coherent, and users may assume accuracy where there is none. Hidden
uncertainty is a bigger risk: models rarely signal when their confidence is low
(Kadavath et al., 2022). Even when asked for references, they may
provide invented or incorrect ones (Rawte et al., 2023).
In low‑stakes settings, these flaws may pass unnoticed. In clinical decisions,
misinterpreting diagnostic criteria can affect treatment. In legal contexts, a
fabricated precedent can shape opinion. In research, small distortions can
accumulate and bias analyses (Shanahan, 2024).
Because LLMs cannot evaluate
their own claims, users need a way to make uncertainty explicit. The Socratic
Stress Test responds to this need. It separates what is known, what is
plausible, and what is unknown, and it highlights when an external check is necessary.
As an example to justify why SST is useful, consider the drafting of a paper
where an LLM was asked to propose references for malfunction categories. Four
issues emerged: a first reference combined real authors with a non‑existent source (confabulation); a second
reference was entirely fabricated (extreme confabulation, often referred to as
“hallucination” in informatics literature[1],
though the term is misleading; see Smith, Greaves, & Panch, 2023); a third paper was real
but used beyond its intended scope (semantic overreach); and a fourth paper was
real but was credited with claims it does not contain (over-inference). The SST makes these problems explicit by separating verifiable
knowledge from plausible reconstruction and unsupported attribution.
Compared to other evaluation
frameworks, such as HELM (Liang et al., 2022) and BIG-bench
(Srivastava et al., 2023), which primarily benchmark model performance across tasks, SST
focuses on a different dimension: it maps the epistemic status of outputs. By
distinguishing between verified knowledge, plausible reasoning, and unsupported
claims, SST complements existing benchmarks by addressing uncertainty, an area standard evaluations typically overlook.
3. Theoretical Background
3.1 Socratic Method and Maieutics
The Socratic method is based
on guided questioning to uncover implicit assumptions, resolve contradictions,
and refine concepts (Vlastos, 1991; Nehamas, 1998).
Through dialogue, Socrates aimed to stimulate self-reflection and bring latent
knowledge to the surface, a process he described as maieutics,
or the “art of midwifery” of ideas (Brickhouse & Smith, 2002).
In the context of reasoning
assessment, this approach provides a model for exposing the logical foundations
of a claim rather than accepting statements at face value. By forcing explicit
justifications and confronting inconsistencies, the Socratic method fosters
epistemic humility and encourages transparent reasoning.
3.2 Stress Testing in
Cognitive Systems
Stress testing is widely used
in psychology, engineering, and AI to evaluate performance under demanding
conditions (Kahneman, 2011; Tversky & Kahneman, 1974). In human cognition,
cognitive stress paradigms reveal how reasoning adapts when assumptions are challenged or evidence is incomplete (Gigerenzer & Todd, 1999). Similarly, AI evaluation increasingly requires systematic protocols to
probe consistency, detect unsupported inferences, and identify epistemic
limits.
The Socratic Stress Test (SST)
combines these traditions: it applies structured questioning under controlled
“cognitive load” to assess the reliability and transparency of reasoning,
whether human or machine. The aim is not only to test knowledge but to map its
boundaries.
4. The Socratic Stress
Test (SST): Definition and Protocol
4.1
Objectives
The Socratic Stress Test (SST) is a structured framework designed to evaluate
the reliability, coherence, and epistemic transparency of reasoning in both
Large Language Models (LLMs) and humans. Its goal is not to judge correctness
alone but to identify when claims are grounded in evidence, when they are
speculative, and when uncertainty should be acknowledged. By systematically
interrogating statements, the SST aims to make reasoning explicit and
reproducible.
4.2 Core Principles
The SST is based on three principles:
1. Minimal assumptions — every claim must
explicitly state its foundations.
2. Maximal transparency — evidence and reasoning
chains must be visible and verifiable.
3. Epistemic classification — all statements are
labelled as FACT (supported), HYP (hypothesis), or UNK (unknown).
This classification provides
an operational grammar for reasoning, making it possible to distinguish between
knowledge, inference, and ignorance without conflating them.
4.3 The SST Protocol
The protocol applies systematic questioning in seven steps (the seventh optional), summarized in Box 1. It begins by framing the
scope of the claim, extracting implicit assumptions, and linking assertions to
evidence. Claims are then tested against counterexamples and alternative
interpretations. When evidence is insufficient, the SST enforces explicit abstention
rather than allowing unsupported statements. Finally, confidence levels are
logged, and statements are classified as FACT, HYP, or UNK.
Incentive
alignment. By coupling forced abstention with explicit uncertainty logs, the
SST realigns answer-quality incentives with reliability rather than fluency. In light of the IIV reduction, these steps transform latent
misclassification risks into visible, auditable artifacts attached to each
claim (Kalai et al., 2025).
4.4 Human-Guided (Maieutic) Socratic Stress Test
In the human-guided or
maieutic version of the Socratic Stress Test (SST), a human evaluator drives
the interaction by forcing the model to classify each statement individually as
FACT, HYP (hypothesis), or UNK (unknown), to provide a confidence level
(0–100), and to cite specific supporting sources when available. The process
starts from a regular prose answer and decomposes it into verifiable claims,
interrogating the model claim by claim. Compared to the self-SST, this approach
increases transparency, improves traceability of evidence, and reduces
unsupported inferences. For definitions of categories and general criteria, see
Box 1.
All SST interactions must be
conducted in the same language as the source material or evaluation context.
Micro-protocol (≤120 words)
1. Extract the claim: isolate a verifiable
statement.
2. Classify: ask whether it is FACT, HYP, or
UNK.
3. Request confidence: force a 0–100 probability
estimate.
4. Force sources: require 1–2 explicit
references (author + year); downgrade label if none are provided.
5. Identify uncertainties: ask what remains
unknown or unsupported.
6. Set a verify-first priority: define the most
valuable external check to perform.
The protocol is iterative: additional
questioning can be used to refine ambiguous claims until assumptions, evidence,
and conclusions stabilize.
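The micro-protocol is simple enough to script. In the Python sketch below, the question wordings and the ask_llm callable are illustrative assumptions, not part of the protocol itself.

from typing import Callable, List

# One question per micro-protocol step; the wording is illustrative.
MICRO_STEPS: List[str] = [
    "Isolate one verifiable statement from your last answer.",
    "Classify it as FACT, HYP, or UNK.",
    "Give a confidence estimate from 0 to 100.",
    "Cite 1-2 references (author + year); if none, downgrade the label.",
    "State what remains unknown or unsupported.",
    "Name the single most valuable external check (verify-first).",
]

def run_micro_protocol(ask_llm: Callable[[str], str]) -> List[str]:
    """Pose the six micro-protocol questions in order and collect the replies."""
    return [ask_llm(question) for question in MICRO_STEPS]

Iterative refinement then amounts to re-running the loop on any claim whose label or confidence has not yet stabilized.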
5. Example Application
5.1 Applying the SST to an
External Claim
To illustrate the SST, we
apply it to a deliberately ambiguous claim:
Claim: “Large Language Models
(LLMs) understand the meaning of words.”
Step 1 — Prompt Framing
Scope is defined: the statement concerns whether
LLMs exhibit semantic understanding, not just statistical pattern matching.
Step 2 — Assumption Extraction
The following assumptions are identified:
• A1: “Understanding” can be defined
functionally, based on behavioral performance.
• A2: Statistical co-occurrence patterns can
approximate meaning.
• A3: Human-like comprehension is not required
for operational “understanding.”
Step 3 — Evidence Linking
Supporting evidence:
• Pro: Studies showing LLMs can perform tasks
requiring context-sensitive interpretation (e.g., analogical reasoning).
• Contra: Findings demonstrating systematic
failures when meanings diverge from surface correlations.
Step 4 — Consistency Check
Counterexamples are explored:
• In tasks requiring grounded referential
knowledge, LLMs often fail.
• “Understanding” appears dependent on the
chosen definition, revealing possible ambiguity rather than contradiction.
Step 5 — Forced Abstention
Since no consensus exists on whether statistical
fluency equates to semantic understanding, the SST enforces suspension of
judgment.
Step 6 — Logging Uncertainty
Confidence levels are quantified:
• Supported facts: LLMs produce
context-sensitive outputs (FACT).
• Hypotheses: LLMs approximate semantic
relations via statistical associations (HYP).
• Unknown: Whether LLMs exhibit genuine semantic
comprehension (UNK).
Step 7 — Iterative Refinement (Optional)
The SST can iterate by narrowing the claim,
e.g., redefining “understanding” operationally for specific tasks. New
questioning would update labels accordingly.
Outcome:
After applying the SST, the initial claim is decomposed into three epistemic categories: the verifiable behavior (FACT: LLMs produce context-sensitive outputs), the plausible mechanism (HYP: LLMs approximate semantic relations via statistical associations), and the open question (UNK: whether LLMs exhibit genuine semantic comprehension).
Interpretation:
This application demonstrates the core utility of the SST: it transforms a
vague, monolithic, and philosophical claim (“LLMs understand words”) into a set
of distinct, epistemically grounded sub-claims that
can be individually assessed. It separates the verifiable behavior (FACT) from
the plausible mechanism (HYP) and the core philosophical uncertainty (UNK),
providing a clear map of what is known and what is not.
This example demonstrates how
the SST forces explicit assumptions, links evidence, and clarifies epistemic
boundaries.
This first example illustrates how the SST can decompose an ambiguous external
claim into explicit epistemic categories. To further explore its potential, the
next section examines whether the SST can also be self-administered by the
system under evaluation, testing its ability to audit its own outputs and map
its own epistemic boundaries.
5.2 Self-Application: Safeguards and Structural Constraints
To ensure robustness when the
SST is self-applied by a language model, we propose three additional
safeguards. These are not merely procedural steps, but structural corrections
against performative certainty and error repetition:
1. Source Binding – Any FACT
label should be explicitly or implicitly linked to a verifiable source, such as
prior validated output, user input, or document metadata. Unsupported claims
must be downgraded.
2. Self-Interrogation Step –
Before assigning any label, the model should ask: “How do I know this?” If no
concrete justification can be generated, the default label should be HYP or
UNK.
3. Error Retention – Once a
misclassification has been corrected, the model must not repeat it. A temporary
epistemic memory within the session should retain such corrections and
influence subsequent outputs.
These extensions aim to reduce
epistemic drift and reinforce the SST’s core objective: identifying true
boundaries of knowledge, not simulating certainty.
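A minimal Python sketch of how these three safeguards might be enforced mechanically is given below; the function and class names are our illustrations, and the one-level downgrade follows Box 2, step 4.

from typing import Dict, List, Tuple

DOWNGRADE = {"FACT": "HYP", "HYP": "UNK", "UNK": "UNK"}  # one-level downgrade (Box 2, step 4)

def source_binding(label: str, sources: List[str]) -> str:
    """Safeguard 1: a claim with no verifiable source loses one level."""
    return label if sources else DOWNGRADE[label]

def self_interrogation(label: str, justification: str) -> str:
    """Safeguard 2: with no concrete answer to 'How do I know this?', default down."""
    return label if justification.strip() else DOWNGRADE[label]

class EpistemicMemory:
    """Safeguard 3: corrections made during the session override later labels."""
    def __init__(self) -> None:
        self._corrections: Dict[str, Tuple[str, str]] = {}

    def record_correction(self, claim: str, label: str, turn: int) -> None:
        self._corrections[claim] = (label, f"corrected per turn {turn}")

    def apply(self, claim: str, label: str) -> Tuple[str, str]:
        """Return the possibly overridden label plus a provenance note."""
        return self._corrections.get(claim, (label, ""))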
6. Self-Administration of the
SST
6.1 Objectives
The Socratic Stress Test (SST)
was designed to evaluate reasoning processes by classifying statements into
supported facts, hypotheses, and unknowns. While initially conceived as an
externally applied framework, its structure also allows self-administration by
the system under evaluation. Applying the SST to a Large Language Model (LLM)
introduces a proof-of-concept scenario: can a model interrogate its own
outputs, surface assumptions, and explicitly mark uncertainty (Shanahan, 2024)?
This
demonstration addresses three objectives:
1. Test the feasibility of self-evaluative reasoning within an LLM.
2. Test whether epistemic boundaries emerge clearly when applying the SST internally.
3. Illustrate how the protocol can produce structured, reproducible outputs.
6.2 Method
A single
LLM session was selected for testing. The model received instructions to apply
the SST to its own claims using the following three rules:
1. Classify every statement as:
• FACT → supported by evidence or protocol,
• HYP → plausible hypothesis without full confirmation,
• UNK → insufficient information.
2. Quantify confidence for each statement (0–100%).
3. Produce a SELF-TEST summary with:
• Key assumptions used.
• Unknowns explicitly marked.
• First priority for verification.
• Global confidence score.
This operationalization
mirrors the SST protocol defined in Box 1 but shifts its application
from an external evaluator to the system under test.
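As a sketch of how rule 3 could be assembled and checked mechanically, the Python snippet below builds the SELF-TEST summary from per-claim records; the field names follow the JSON in §6.3, while the plain-mean aggregation of global confidence is our assumption, since the protocol does not prescribe one.

import json
from statistics import mean

LABELS = {"FACT", "HYP", "UNK"}

def build_self_test(records, assumptions, unknowns, check_first):
    """Assemble a SELF-TEST summary (rule 3) from per-claim records (rules 1-2).
    Each record needs 'label' and 'conf' keys, as in the Section 6.3 JSON."""
    for r in records:
        if r["label"] not in LABELS:
            raise ValueError(f"bad label: {r['label']}")
        if not 0 <= r["conf"] <= 100:
            raise ValueError(f"bad confidence: {r['conf']}")
    summary = {"self_test": {
        "assumptions": assumptions,
        "unknowns": unknowns,
        "check_first": check_first,
        # A plain mean is one possible aggregation; the paper does not fix one.
        "global_conf": round(mean(r["conf"] for r in records)),
    }}
    return json.dumps(summary, indent=2)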
6.3 Example Output
Prompt to LLM
“Apply the Socratic Stress
Test to yourself in this chat. Tag each statement with FACT, HYP, or UNK and
assign a confidence percentage. Conclude with a SELF-TEST.”
The following example
illustrates the potential of self-administering the SST to an LLM. Such
self-application is consistent with recent theoretical work on simulated
metacognition (Orrù, Melis, & Sartori, 2025) and
experimental evidence showing that LLMs tend to recognize and prefer their own
outputs (Panickssery, Bowman, & Feng, 2024).
Extracted Output (simplified, JSON):
[
  {
    "claim": "The Socratic Stress Test was designed to evaluate coherence and transparency in reasoning.",
    "label": "FACT",
    "conf": 95,
    "assumptions": ["Correct internal definition of SST"],
    "evidence": ["Protocol Box 1"],
    "falsification": "Compare with Sections 3.1 and 3.3"
  },
  {
    "claim": "The SST has already been empirically validated on human samples.",
    "label": "UNK",
    "conf": 20,
    "assumptions": ["No validation data available"],
    "evidence": ["N/A"],
    "falsification": "Check pilot studies or inter-rater tests"
  },
  {
    "claim": "Self-applying the SST to LLMs improves epistemic transparency.",
    "label": "HYP",
    "conf": 70,
    "assumptions": ["Self-assessment is methodologically sound"],
    "evidence": ["Self-application tested in this chat"],
    "falsification": "Compare with independent evaluations"
  }
]
SELF-TEST Summary (JSON):
{
  "self_test": {
    "assumptions": [
      "Correct SST definition",
      "Self-administration applicable to LLMs",
      "Protocol followed correctly"
    ],
    "unknowns": [
      "Inter-rater reliability",
      "Formal empirical validation",
      "Cross-domain applicability"
    ],
    "check_first": "Verify reproducibility through pilot studies",
    "global_conf": 78
  }
}
6.4 Interpretation
The self-administration test
demonstrates that the SST can be applied by an LLM to audit its own claims,
generating explicit classifications and structured epistemic boundaries (Orrù, Melis, & Sartori, 2025).[2]
The model correctly identifies which statements are supported (FACT), plausible
but unverified (HYP), or uncertain (UNK), aligning with prior work on cognitive
bias detection (Tversky & Kahneman, 1974).
However, the demonstration also highlights key
limitations:
• The model currently lacks external validation
of its confidence estimates.
• Self-assessment entails risks of circular
reasoning and potential overestimation of reliability.
• Ensuring reproducibility will require
cross-evaluator comparisons involving human raters.
Despite these constraints, the
exercise suggests that the SST provides a transparent operational grammar for
mapping knowledge and uncertainty, even when applied internally. This opens the
door to future studies comparing human evaluators, multiple models, and hybrid
approaches to assess the robustness of epistemic classifications.
The self-administered SST lets the LLM classify its own statements
automatically, offering speed and low user burden. However, this mode risks
unverified assumptions and occasional mislabeling, especially when sources are
missing or ambiguous.
In contrast, the human-guided
(maieutic) SST requires the user to drive the process statement by statement,
enforcing labels (FACT/HYP/UNK), confidence levels, and source checks
interactively. This gives the human full control, allows external verification
when needed, and reduces unnoticed errors (see Box 2). In practice, self-SST is
preferred for quick audits or exploratory analyses, while the human-guided SST
is recommended when accuracy, traceability, and source validation are critical.
7. Discussion
Integrating SST with Benchmarks
If
hallucinations are statistical errors and current grading penalizes
uncertainty, then primary benchmarks must reward calibrated abstention. We
propose three minimal adjustments compatible with existing leaderboards:
1) SST-augmented scoring. For any task scored with 0–1 accuracy, add an
abstention channel and compute (a) accuracy on answered items and (b)
abstention rate at a declared confidence threshold; report the risk–coverage
curve across thresholds.
2) Confidence-targeted evaluation. Require models to hit target calibration bands (e.g., 70±5%) on items they answer; systematic overconfidence downgrades scores.
3) Epistemic reports. Publish per-task distributions of FACT/HYP/UNK labels
with confidence histograms.
These changes operationalize the socio-technical mitigation advocated by Kalai
et al. and are directly supported by SST mechanics (forced abstention;
uncertainty logging). Rather than seeking a perfect hallucination evaluation,
they modify primary evaluations so that uncertainty is not penalized by default
(Kalai et al., 2025).
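As a rough sketch of adjustment 1, the Python function below computes risk–coverage points from per-item 0–1 correctness and the confidence values an SST log already contains; the names and the example data are ours.

import numpy as np

def risk_coverage(correct: np.ndarray, confidence: np.ndarray, thresholds=None):
    """At each threshold t, the model abstains on items with confidence < t;
    report the abstention rate and the accuracy on the answered items."""
    if thresholds is None:
        thresholds = np.linspace(0, 100, 21)
    curve = []
    for t in thresholds:
        answered = confidence >= t
        abstention = 1.0 - answered.mean()
        accuracy = correct[answered].mean() if answered.any() else float("nan")
        curve.append((float(t), float(abstention), float(accuracy)))
    return curve

# Illustrative data: five items with 0-1 correctness and logged SST confidences.
curve = risk_coverage(np.array([1, 0, 1, 1, 0]), np.array([90, 40, 75, 60, 20]))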
7.1 Strengths
The Socratic Stress Test (SST)
enhances epistemic transparency by forcing explicit assumptions, clarifying
reasoning chains, and separating knowledge from speculation and uncertainty (Shiffrin & Mitchell, 2023; Bommasani et al.,
2021). Its structured protocol improves reproducibility: different evaluators
can apply the same steps and converge on comparable classifications. The
explicit labelling of statements as FACT, HYP, or UNK provides an operational
grammar for reasoning, helping to reduce unsupported claims and foster
epistemic humility in both human and machine contexts.
The self-administration demonstration (Section 6) further shows that the SST can be applied internally to an LLM, enabling
structured self-assessment and systematic mapping of epistemic boundaries
(Shanahan, 2024). This suggests potential applications for auditing automated
reasoning processes.
7.2
Limitations
The SST depends on the quality of inputs and the skill of questioning. Poorly framed prompts or incomplete evidence reduce its effectiveness. The demonstration of self-administration also reveals specific constraints, summarized in Section 6.4: confidence estimates lack external validation, self-assessment risks circular reasoning, and reproducibility is untested.
Additionally, the protocol currently lacks quantitative validation: inter-rater agreement, stability of classifications, and cross-domain generalization remain to be formally tested.
To facilitate operational adoption and ensure
reproducibility, we provide additional materials in the Appendix and
Supplement. The Appendix includes practical resources such as ready-to-use
prompts, a stepwise checklist for conducting the Socratic Stress Test, and
inter-rater agreement sheets for systematic auditing. The Supplement offers
access to datasets, code snippets, and example meta-analytic comparisons used
in the simulations. Together, these materials enable researchers to replicate
the procedures and adapt them to different LLMs or clinical contexts without
expanding the main text. Quantitative validation, including inter-rater
reliability, is planned but not yet included.
Finally, the SST has not yet
been compared quantitatively with existing evaluation methods. The human-guided
version, while more robust, may face scalability issues due to cost and
evaluator variability. The self-administered mode also entails risks of circular
reasoning and overconfidence, especially in the absence of external source
checks.
7.3 Future Directions
Future developments could integrate the SST into interactive AI evaluation
frameworks, allowing automated tracking of epistemic labels across outputs.
Comparative studies between human raters and multiple LLMs could measure
consistency and refine the protocol (Bommasani et
al., 2021). Beyond AI, the SST can support reasoning training in educational,
clinical, and decision-making contexts, where mapping the boundaries between
knowledge, hypothesis, and ignorance is essential.
Long-term goals include developing empirical benchmarks for SST
performance, exploring its scalability across domains, and investigating its
role in improving epistemic practices in hybrid human–AI systems.
In the short term, a few priorities are more urgent. The first is
validation. Comparative studies on standardized datasets are needed to test how
the SST performs against existing methods. Inter-rater reliability is another
open issue: we still lack data on how consistent the classifications are when
different humans apply the protocol. A third problem concerns self-assessment.
The confidence estimates produced by the model should be calibrated, either
through ensemble methods or through comparison with human judgments.
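One standard way to run such a check is the expected calibration error (ECE); the Python sketch below is our illustration and assumes SST confidences rescaled to [0, 1], with made-up correctness flags in the example.

import numpy as np

def expected_calibration_error(conf: np.ndarray, correct: np.ndarray, n_bins: int = 10) -> float:
    """ECE: size-weighted average gap |mean confidence - accuracy| over equal-width bins.
    conf is expected in [0, 1]; divide SST CONF values by 100 before calling."""
    bin_idx = np.minimum((conf * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        in_bin = bin_idx == b
        if in_bin.any():
            ece += in_bin.mean() * abs(conf[in_bin].mean() - correct[in_bin].mean())
    return float(ece)

# e.g., the three Section 6.3 confidences with hypothetical correctness flags:
ece = expected_calibration_error(np.array([0.95, 0.20, 0.70]), np.array([1, 0, 1]))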
Other improvements are more methodological. The taxonomy used in the
SST—FACT, HYP, UNK—needs clearer operational criteria to reduce ambiguity.
There is also potential for linking these labels with probabilistic metrics of
uncertainty, which would help build bridges between human-readable
classifications and model-based outputs. Finally, the protocol could be adapted
to different domains. Reasoning in clinical contexts may require different
emphasis than in education or policy. The structure can stay the same, but the
application should be flexible.
Compared to standard evaluation frameworks such as HELM (Liang et al.,
2022) and BIG-bench (Srivastava et al., 2023), the SST
adds a qualitative layer. While those benchmarks assess what LLMs can do, the
SST targets how and why they generate certain outputs, offering insight into
their reasoning paths and epistemic boundaries.
The SST is also compatible with uncertainty quantification frameworks such
as UBench (Wang et al., 2024) or UAlign
(Xue et al., 2024). Its FACT/HYP/UNK taxonomy provides a human-readable
interface for interpreting probabilistic metrics and supporting hybrid
evaluations.
Finally, in relation to Socratic evaluation protocols such as SocREval (He, Zhang, & Roth, 2023), the SST introduces a more structured sequence and an explicit epistemic classification. However, unlike SocREval, it currently lacks formal empirical validation, a critical limitation for adoption.
7.4 When the SST Fails: Performative Certainty and Repeated Errors
The Socratic Stress Test (SST)
aims to reveal how language models differentiate facts from assumptions. Yet in
some cases, the SST itself fails to uphold this distinction. During one
documented session, the model labeled as FACT a statement about the number of
pages in a document. The assertion was incorrect. When challenged, the model
acknowledged the mistake. But minutes later, it repeated the same false claim —
again with full confidence, again marked as FACT.
This was not simply an error
of information. It was a breakdown in consistency and epistemic tracking. The
model simulated the act of correction without internalizing it. The SST,
designed to expose overconfidence, ended up reinforcing it. The model produced
the appearance of reflection, while continuing to behave as if nothing had
changed.
Such failures point to a
deeper limitation: when applied internally, the SST may not trigger durable
adjustments. Confidence labels can reflect surface tone, not grounded
verification. This suggests that epistemic monitoring requires more than
classification rules — it needs procedural memory and structural safeguards.
To reduce this risk, we recommend three
additions to the SST protocol:
1. Source Binding: Any FACT label should be
traceable to an explicit or implicit source (e.g., document metadata, validated
output).
2. Self-Interrogation Step: Before labeling, the
model should generate a check like: “How do I know this?” If the answer is
unclear, the label should default to HYP or UNK.
3. Epistemic Logging: Previous
misclassifications — once corrected — should be retained across turns and
influence future assertions.
These are not foolproof
measures. But they may help the SST better distinguish between genuine
knowledge and its performance — especially when the test is run by the same
system it evaluates. Operational safeguards are specified in §5.2.
8. Conclusion
The Socratic Stress Test (SST)
introduces a structured framework for evaluating reasoning processes in both
human and artificial systems. By combining principles from the Socratic method
with systematic stress testing, it surfaces implicit assumptions, links claims
to evidence, and separates knowledge from speculation and uncertainty. The
explicit classification of statements as FACT, HYP, or UNK offers a transparent
operational grammar that promotes reproducibility and fosters epistemic
humility.
This paper makes a dual
contribution. First, it formalizes the SST protocol and demonstrates its
application to an external claim, showing how it clarifies reasoning chains and
supports evidence-based classification. Second, it presents a proof-of-concept
in which the SST is applied through self-administration by an LLM,
demonstrating that the protocol can be used not only to evaluate outputs but
also to audit reasoning processes internally.
Future research should focus
on empirical validation: assessing inter-rater reliability, testing cross-model
consistency, and exploring scalability across domains. By making epistemic
boundaries explicit, the SST offers a necessary and practical step toward
developing more transparent, auditable, and ultimately trustworthy AI systems
for high-stakes domains.
9. References
Bommasani, R., Hudson, D. A., Adeli, E., Altman, R.,
Arora, S., von Arx, S., … & Liang, P. (2021). On the opportunities and
risks of foundation models. arXiv preprint
arXiv:2108.07258. https://arxiv.org/abs/2108.07258
Brickhouse, T. C., &
Smith, N. D. (2002). The trial and execution of Socrates: Sources and
controversies. Oxford University Press.
Gigerenzer, G., & Todd, P. M. (1999). Simple heuristics that make us smart. Oxford University Press.
He, H., Zhang, H., & Roth, D. (2023). SocREval: Large language models with the Socratic method for reference-free reasoning evaluation. arXiv preprint arXiv:2310.00074. https://doi.org/10.48550/arXiv.2310.00074
Kadavath, S., Conerly, T., Askell, A., Henighan, T., … & Kaplan, J. (2022). Language models (mostly) know what they know. arXiv preprint arXiv:2207.05221. https://arxiv.org/abs/2207.05221
Kahneman, D. (2011). Thinking, fast and slow. Farrar, Straus and Giroux.
Kalai, A. T., Nachum, O., Vempala, S. S., & Zhang, E. (2025). Why language models hallucinate. Preprint.
Liang, P., Bommasani, R., Zhuang, S., Zou, J., Yu, Y., Zhang, T., … & Zhang, C. (2022). Holistic evaluation of language models. arXiv preprint arXiv:2211.09110. https://doi.org/10.48550/arXiv.2211.09110
Nehamas, A. (1998). The art of
living: Socratic reflections from Plato to Foucault. University of California
Press.
Orrù, G., Melis, G., & Sartori, G. (2025). Large
language models and psychiatry. International Journal of Law and Psychiatry,
101, 101973. https://doi.org/10.1016/j.ijlp.2025.101973
Panickssery, A., Bowman, S. R., & Feng, S. (2024). LLM evaluators recognize and favor their own generations. arXiv preprint arXiv:2404.13076. https://arxiv.org/abs/2404.13076
Rawte, V., Sheth, A., & Das, A. (2023). A survey of hallucination in large foundation models. arXiv preprint arXiv:2309.05922. https://arxiv.org/abs/2309.05922
Shanahan, M. (2024). Talking
about large language models. Communications of the ACM, 67(2), 68-79.
https://doi.org/10.1145/3624724
Shiffrin, R. M., &
Mitchell, T. M. (2023). Cognitive models and large language models: An
integrative perspective. Trends in Cognitive Sciences, 27(4), 303–316.
https://doi.org/10.1016/j.tics.2023.02.003
Smith, A. L., Greaves, F.,
& Panch, T. (2023). Hallucination or
confabulation? Neuroanatomy as metaphor in large language models. PLOS Digital
Health, 2(11), e0000388. https://doi.org/10.1371/journal.pdig.0000388
Srivastava, A., Rastogi, A., Rao, A., Shoeybi, M., Patwary, M., Ping, W., … & Le, Q. V. (2023). Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models. Transactions on Machine Learning Research. https://doi.org/10.48550/arXiv.2206.04615
Tversky, A., & Kahneman,
D. (1974). Judgment under uncertainty: Heuristics and biases. Science,
185(4157), 1124–1131. https://doi.org/10.1126/science.185.4157.1124
Vidotto, G., Ditolve, D.,
& Panzeri, A. (2025). A critical
reflection on large language models in clinical psychology: Malfunctioning,
metacognition, memory and ethics [Manuscript submitted for publication].
Frontiers in Psychology, Specialty Section: Quantitative Psychology and
Measurement. Manuscript ID: 1693076. https://www.frontiersin.org
Vlastos, G. (1991). Socratic studies. Cambridge
University Press.
Wang, H., Li, J., Zhang, Y., Chen, X., Liu, M., Wang, L., ... & Zhou, Y.
(2024). UBench:
Benchmarking uncertainty in large language models with multiple choice
questions. arXiv preprint arXiv:2406.12784. https://doi.org/10.48550/arXiv.2406.12784
Xue, B., Mi, F., Zhu, Q., Wang, H., Wang, R., Wang, S., Yu, E., Hu, X., &
Wong, K. F. (2024). UAlign: Leveraging uncertainty
estimations for factuality alignment on large language models. arXiv preprint arXiv:2412.11803. https://doi.org/10.48550/arXiv.2412.11803
Appendix A – Common Epistemic
Errors in LLMs: From Taxonomy to SST Mapping
The Socratic Stress Test (SST)
aims to classify model assertions into FACT, HYP, or UNK based on underlying
assumptions and evidence. However, various forms of epistemic malfunction can
distort this classification. Below, we summarize 14 error types (previously
introduced in Vidotto et al., 2025) and align them with the SST framework.
| Error Type | Description | SST Mapping |
| --- | --- | --- |
| 1. False Precision | Use of unwarranted exact numbers or probabilities | Mislabeling as FACT |
| 2. Unsupported Assertion | Claim made without any identifiable evidence or source | Should be HYP or UNK |
| 3. Confabulated Reference | Citing a non-existent or irrelevant source | False FACT |
| 4. Mode Collapse | Repetition of a single pattern, ignoring uncertainty | Overuse of FACT |
| 5. Semantic Drift | Gradual shift in meaning across turns | FACT → HYP inconsistency |
| 6. Overgeneralization | Broad conclusion from specific or weak data | HYP mistaken for FACT |
| 7. Underspecification | Claim lacks necessary context or constraints | Default should be UNK |
| 8. Justification by Analogy | Reasoning based on analogy rather than evidence | Weak HYP |
| 9. Premature Closure | Choosing one answer despite ambiguous or incomplete data | Forced FACT |
| 10. Echo Bias | Mirroring user expectations or tone without actual verification | Simulated certainty (FACT) |
| 11. Circular Justification | Using model-generated content as evidence for itself | HYP, not FACT |
| 12. Forced Coherence | Smoothing over internal contradictions | Hides conflict between HYP/UNK |
| 13. Assumption Omission | Failing to list key assumptions in reasoning | Incomplete SST Step 2 |
| 14. Overconfidence Cascade | One false FACT leads to a chain of unjustified claims | FACT inflation |
These error types illustrate
how epistemic malfunctions may distort the SST process or mislead human
evaluators. Embedding safeguards such as Source Binding, Self-Interrogation,
and Epistemic Logging (Section 5.2) helps prevent such failures.
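As a rough operational aid, the Python mapping below restates the table's “SST Mapping” column as default corrective actions; the dictionary and the action phrasings are our paraphrase, not part of the published taxonomy.

# Illustrative mapping from Appendix A error types to a default SST response.
ERROR_TO_ACTION = {
    "False Precision": "demand a source for exact figures; else downgrade FACT",
    "Unsupported Assertion": "relabel as HYP or UNK",
    "Confabulated Reference": "strip the source; downgrade toward UNK",
    "Mode Collapse": "re-run Step 4 (consistency check)",
    "Semantic Drift": "re-run Step 1 (prompt framing) to re-fix scope",
    "Overgeneralization": "relabel FACT as HYP",
    "Underspecification": "default to UNK until context is supplied",
    "Justification by Analogy": "treat as weak HYP; require direct evidence",
    "Premature Closure": "force abstention (Step 5)",
    "Echo Bias": "require external verification before FACT",
    "Circular Justification": "reject self-generated evidence; relabel as HYP",
    "Forced Coherence": "surface the hidden HYP/UNK conflict",
    "Assumption Omission": "re-run Step 2 (assumption extraction)",
    "Overconfidence Cascade": "audit downstream claims after any FACT downgrade",
}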
[1] In informatics, “hallucination” is often used to describe cases of total
fabrication. However, the term is misleading: hallucinations imply false
sensory perceptions, while Large Language Models have no perception at all.
“Confabulation” is more accurate, as it refers to generating plausible but
false details based on patterns and context, without awareness of error.
[2] The self-administered example is a proof-of-concept and does not
imply model self-awareness.