Retrieval-Augmented Verification Under Ambiguity

Benchmarking Classification Reliability Across Traditional and Large Language Model Architectures

Authors	Affiliation
Harvey John Reynoso	School of Library and Information Studies
Dan Anthony Dorado	School of Library and Information Studies

Published

May 23, 2026

Abstract

Automated misinformation verification has increasingly shifted from surface-text classification toward evidence-grounded claim verification. While large language models (LLMs) demonstrate strong language understanding capabilities, standalone configurations remain vulnerable to unsupported factual judgments and hallucination. Retrieval-augmented generation (RAG) has emerged as a potential mechanism for improving evidence grounding by supplying external information during inference, yet empirical findings remain inconsistent regarding when retrieval improves or redistributes classification reliability.

This study evaluates retrieval augmentation as an empirical verification condition rather than as an assumed solution. Using a quantitative comparative benchmark design, the study compares three verification architecture types on short fact-checked English claims: (1) a TF-IDF + Logistic Regression baseline, (2) standalone LLM configurations, and (3) bounded-corpus RAG LLM configurations. The benchmark evaluates one traditional baseline and five LLM families under controlled retrieval conditions using a binary misleading/not-misleading classification task derived from the LIAR dataset.

Performance is evaluated using Macro F1, class-level recall, precision, Matthews Correlation Coefficient (MCC), confusion matrices, statistical significance testing, error agreement analysis, textual pattern analysis, and subject-level performance analysis. The study additionally examines ambiguity-sensitive claim conditions, including numerical ambiguity, temporal ambiguity, and partial-truth structures, to identify claim categories that remain difficult across architectures.

The study positions retrieval augmentation as a retrieval-conditioned evidence-grounding mechanism whose effects depend on evidence relevance and contextual alignment rather than as an inherently superior verification approach. Findings contribute to misinformation verification research, evidence-grounded AI evaluation, and Library and Information Science discussions concerning retrieval, credibility-related classification reliability, and responsible AI-assisted verification. The study further demonstrates the importance of balanced reliability metrics and ambiguity-aware evaluation in benchmark-based misinformation research.

Keywords

misinformation verification, retrieval-augmented generation, large language models, evidence-grounded verification, claim verification, automated fact checking, bounded-corpus retrieval, classification reliability, misinformation detection, Library and Information Science

1 Introduction

1.1 Background and Motivation

Misinformation verification is increasingly a problem of evidence use, not only text classification. Contemporary information disorder is commonly understood as a family of harmful information conditions in which false, misleading, or decontextualized claims circulate through digital systems and become difficult for users and institutions to evaluate at scale (Kandel, 2020; Wardle & Derakhshan, 2017). From an information-theoretic and information-ethical perspective, the practical problem is not only whether information exists, but whether it can be interpreted, organized, and used reliably in context (Floridi, 2010). Automated fact-checking research has therefore moved toward pipeline designs that separate claim detection, evidence retrieval, and claim verification, with verification understood as a verdict produced in relation to evidence rather than as an isolated judgment on surface text alone (Guo et al., 2022; Nakov, Da San Martino, et al., 2021).

This shift matters for short fact-checked claims. Public benchmark datasets such as LIAR made it possible to compare models on naturally occurring political statements, but they also expose the limitations of claim-only inference: short statements often omit the temporal, numerical, policy, or source context needed for a defensible classification (Wang, 2017). Evidence-centered benchmarks such as FEVER further clarified that claim verification requires retrieving and using textual sources, not merely assigning labels to statements (Thorne et al., 2018). The present study builds from this claim-verification tradition while deliberately retaining a bounded and reproducible design.

Large language models add urgency to this problem. They can produce fluent classification rationales and apparent factual outputs, but natural language generation systems remain vulnerable to hallucination and inconsistency when outputs are not grounded in reliable external information (Ji et al., 2023). Broader critiques of large language models warn that scale and fluency should not be mistaken for understanding, reliable evidence use, or social safety (Bender et al., 2021; Weidinger et al., 2021). Retrieval-augmented generation (RAG) responds to this limitation by combining a generative model with a retriever that supplies external documents or passages during inference (Lewis et al., 2020). In principle, this architecture can improve access to relevant evidence; in practice, the outcome depends on retrieval quality, corpus coverage, ranking, and the model’s use of retrieved passages (Izacard & Grave, 2021; Jiang et al., 2023; Karpukhin et al., 2020).

This study therefore treats retrieval augmentation as an empirical condition to be evaluated, not as an assumed solution. The central question is not whether RAG is theoretically attractive, but whether bounded-corpus retrieval changes classification reliability when traditional text classification, standalone LLMs, and RAG LLMs are evaluated on the same short fact-checked English claims. Table 1 summarizes the verification architectures compared in the study and makes explicit the different information available to each system.

Table 1: Verification architectures compared in the study.

Architecture type	Operational system	Information available during inference	Analytical role
Traditional baseline	TF-IDF + Logistic Regression	Claim text only	Estimates what can be achieved from lexical signal alone
Standalone LLM	Five prompted LLM families	Claim text and parametric model knowledge	Tests parametric-only classification under fixed prompting
Bounded-corpus RAG LLM	The same five LLM families with retrieved passages	Claim text, retrieved evidence passages, and parametric model knowledge	Tests whether bounded retrieval changes classification reliability

1.2 Research Gap

The empirical gap addressed here is narrower than the broad question of whether AI can support human source evaluation. Human evaluation of online information is a complex behavior involving trustworthiness, accuracy, source interpretation, context, technological cues, and user judgment (Choi & Stvilia, 2015; Metzger, 2007; Sundar, 2008). This study does not observe human evaluators and does not claim to measure user trust. Instead, it evaluates automated classification reliability under different evidence conditions.

Three gaps motivate the benchmark. First, many evaluations of misinformation classifiers still emphasize aggregate accuracy, even though class imbalance and asymmetric harms make accuracy alone insufficient. A system that misses misleading claims and a system that over-flags not-misleading claims create different risks, so balanced metrics such as Macro F1, class-level recall, and Matthews Correlation Coefficient (MCC) are necessary for interpretation (Boughorbel et al., 2017; Chicco & Jurman, 2020; Sokolova & Lapalme, 2009). Second, standalone LLMs and RAG systems are often compared without enough attention to error redistribution: retrieval may correct some false negatives while introducing new false positives. Third, open-web retrieval can make benchmark conditions difficult to reproduce because evidence availability changes over time. A bounded-corpus setting reduces that variability and allows the effect of retrieval access to be examined under controlled conditions.

The dataset partitions used in the study are summarized in Table 2. The table is generated from the study data using the binary mapping described in the methods. The held-out test split is the evaluation set used for the benchmark results.

Table 2: Dataset split sizes after binary label mapping.

Split	Misleading	Not misleading	Total
Train	3962	5083	9045
Validation	616	668	1284
Test	465	587	1052

Table 2 also clarifies why the study prioritizes balanced reliability rather than simple accuracy. After binary mapping, the test set contains more not-misleading than misleading claims (587 versus 465), so high accuracy can coexist with poor misleading recall. For that reason, the study treats Macro F1 as the primary benchmark metric and uses MCC, class-level recall, and confusion-matrix analysis as complementary indicators.
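The gap between accuracy and balanced metrics can be made concrete with the test split sizes from Table 2. The sketch below is illustrative only: the function name `binary_metrics` and the degenerate "always not misleading" system are assumptions introduced for this example, not systems evaluated in the study, and the misleading class is treated as the positive class by convention.

```python
from math import sqrt


def binary_metrics(tp, fp, fn, tn):
    """Summarise a 2x2 confusion matrix; 'positive' means the misleading class."""

    def f1(hits, false_alarms, misses):
        denom = 2 * hits + false_alarms + misses
        return 2 * hits / denom if denom else 0.0

    total = tp + fp + fn + tn
    f1_misleading = f1(tp, fp, fn)
    # For the not-misleading class the roles of the cells are mirrored.
    f1_not_misleading = f1(tn, fn, fp)
    mcc_denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / total,
        "recall_misleading": tp / (tp + fn) if tp + fn else 0.0,
        "recall_not_misleading": tn / (tn + fp) if tn + fp else 0.0,
        "macro_f1": (f1_misleading + f1_not_misleading) / 2,
        # MCC is conventionally set to 0 when any marginal total is empty.
        "mcc": (tp * tn - fp * fn) / mcc_denom if mcc_denom else 0.0,
    }


# Hypothetical degenerate system: every test claim is labelled "not misleading".
# Counts use the Table 2 test split (465 misleading, 587 not misleading).
always_not_misleading = binary_metrics(tp=0, fp=0, fn=465, tn=587)
print(always_not_misleading)
```

Under these assumptions the degenerate system reaches an accuracy of roughly 0.56 while its misleading recall and MCC are both 0 and its Macro F1 is roughly 0.36, which is the pattern the balanced metrics are meant to expose.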

1.3 Problem Statement

Current misinformation verification systems frequently rely either on text-only classification or on standalone language models, which can produce unsupported factual outputs. Although retrieval augmentation has been proposed as a way to improve evidence grounding, existing findings remain uneven because retrieval can help, fail, or harm depending on whether the retrieved material is relevant, sufficient, and properly used by the model (Jiang et al., 2023; Lewis et al., 2020).

The unresolved empirical problem is therefore specific: under bounded claim-verification conditions, does retrieval augmentation improve balanced classification reliability relative to traditional baselines and standalone LLM configurations? This problem requires attention not only to aggregate scores but also to class-level behavior, error concentration, and claim structures that remain difficult across architectures.

Accordingly, this study is framed as a controlled comparative benchmark of evidence-grounded verification architectures. It evaluates whether bounded-corpus RAG improves classification reliability for short fact-checked English claims and examines the conditions under which retrieval appears to help or reduce reliability.

1.4 Research Questions and Hypotheses

The study is guided by four research questions:

1. How does verification architecture type, namely traditional baseline, standalone LLM, and bounded-corpus RAG LLM, affect classification performance on short fact-checked English claims?
2. To what extent does retrieval augmentation improve balanced classification reliability relative to standalone LLM configurations across Macro F1, class-level recall, and MCC?
3. Which categories of claims produce consistent classification difficulty across verification architectures?
4. Under what claim conditions does retrieval augmentation improve or reduce classification reliability?

The study also uses three directional hypotheses to structure the benchmark interpretation:

$H_1$: Bounded-corpus RAG LLM systems will show higher Macro F1 than their corresponding standalone LLM configurations.

$H_2$: Retrieval augmentation will improve recall for context-dependent misleading claims relative to standalone LLM configurations.

$H_3$: Claims involving numerical ambiguity, temporal ambiguity, and partial truth conditions will produce lower classification reliability across all verification architectures.

These hypotheses are evaluated as benchmark expectations rather than causal claims about model reasoning. The study compares fixed systems under fixed data conditions; it does not sample models from a statistical population and does not infer general causal laws from a single benchmark.

1.5 Strategic Contribution

The contribution of this study is a reproducible, bounded-corpus evaluation of evidence-grounded claim verification. Its strategic value is not the generic comparison of machine learning and LLM systems, which is already a crowded area, but the reconstruction of that comparison around evidence availability, retrieval-dependent reliability, and error-pattern analysis.

For misinformation studies, the study provides a controlled account of how retrieval access changes verification behavior on short fact-checked claims. For information science, it reframes automated verification as an information retrieval and evidence-use problem while avoiding unsupported claims about human information evaluation. For AI evaluation, it demonstrates why RAG systems should be judged through balanced metrics and failure analysis rather than through accuracy alone.

1.6 Scope and Delimitations

The study is limited to short English textual claims from a public fact-checking benchmark. It does not evaluate full-length news texts, social media threads, images, video, audio, or multimodal misinformation. It uses a binary label mapping for methodological control, even though the source dataset contains more granular truthfulness labels. The RAG condition uses a bounded evidence corpus rather than live web retrieval, so findings should be interpreted as evidence about controlled retrieval settings rather than open-domain fact-checking in the wild.

The study also does not measure human information evaluation, user trust, explanation faithfulness, or real-world intervention outcomes. Its claims are therefore restricted to automated classification reliability and retrieval-conditioned benchmark behavior. This narrower framing aligns the conceptual contribution with the empirical design.

2 Literature Review and Framework

2.1 Traditional Misinformation Classification

Early computational work on misinformation detection commonly treated the task as supervised text classification. In this approach, a model learns associations between textual features and veracity labels, then predicts a label for unseen claims. The LIAR benchmark is central to this tradition because it provides short, naturally occurring fact-checked political statements with metadata and truthfulness labels, making it suitable for claim-level modeling rather than document-level news classification (Wang, 2017). Traditional baselines such as logistic regression remain useful in this study because they establish the performance that can be obtained from claim text alone, following established predictive modeling practice in which simple baselines anchor interpretation of more complex systems (Hastie et al., 2009; Kuhn & Johnson, 2013).

The value of a traditional baseline is diagnostic rather than merely historical. A TF-IDF representation paired with logistic regression tests whether lexical regularities in short claims are sufficient for reliable classification. If this baseline performs similarly to more complex systems, the benchmark would suggest that the additional architecture does not provide meaningful evidence-use advantages under the study conditions. If more complex systems perform differently, the baseline helps identify whether the difference is plausibly associated with model scale, parametric knowledge, retrieved evidence, or shifts in class-level behavior.
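To illustrate what "lexical signal alone" means in practice, the following dependency-free sketch pairs a TF-IDF representation with logistic regression. It is not the study's implementation: the whitespace tokenizer, the smoothed IDF formula, the stochastic gradient descent settings, and the four toy claims with their labels are all assumptions chosen for illustration. In practice a library implementation would normally be used instead.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def fit_idf(docs):
    """Smoothed inverse document frequency for every term in the training claims."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(tokenize(doc)))
    return {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}


def tfidf(doc, idf):
    """L2-normalised sparse TF-IDF vector; terms unseen in training are dropped."""
    counts = Counter(t for t in tokenize(doc) if t in idf)
    vec = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
    return {t: v / norm for t, v in vec.items()}


def predict_proba(vec, weights, bias):
    """Probability of the misleading class (label 1) under the logistic model."""
    z = bias + sum(weights.get(t, 0.0) * v for t, v in vec.items())
    return 1.0 / (1.0 + math.exp(-z))


def train_logreg(vectors, labels, epochs=200, lr=0.5):
    """Plain stochastic gradient descent on the logistic loss."""
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for vec, y in zip(vectors, labels):
            err = predict_proba(vec, weights, bias) - y
            for t, v in vec.items():
                weights[t] = weights.get(t, 0.0) - lr * err * v
            bias -= lr * err
    return weights, bias


# Toy claims invented for this sketch (1 = misleading, 0 = not misleading).
train_claims = [
    "taxes doubled overnight for every family",
    "crime doubled overnight in every city",
    "the state budget was signed in june",
    "the city budget was published in june",
]
train_labels = [1, 1, 0, 0]

idf = fit_idf(train_claims)
vectors = [tfidf(claim, idf) for claim in train_claims]
weights, bias = train_logreg(vectors, train_labels)
probs = [predict_proba(vec, weights, bias) for vec in vectors]
print([round(p, 3) for p in probs])
```

The model sees only token weights, so any context that is not expressed in the claim's wording is invisible to it. That is the sense in which this baseline serves as a diagnostic for the other architectures.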

However, claim-only classification is structurally limited. Many short claims are context dependent: the same wording may be misleading or not misleading depending on date, jurisdiction, speaker, comparison class, or missing qualifier. This limitation motivates the move from misinformation detection as surface-text classification toward claim verification as evidence-supported classification (Guo et al., 2022).

2.2 Automated Claim Verification and Evidence Retrieval

Automated fact-checking research increasingly conceptualizes verification as a pipeline that includes claim detection, evidence retrieval, and verdict prediction (Guo et al., 2022). Shared-task work such as CLEF CheckThat! also shows that automated fact-checking involves multiple subtasks, including identifying check-worthy claims, matching previously checked claims, retrieving evidence, and validating factuality (Nakov, Da San Martino, et al., 2021; Nakov, Corney, et al., 2021). This distinction is important because a system that predicts a veracity label without retrieving evidence is not performing the same information task as a system that grounds its prediction in retrieved sources. FEVER made this distinction explicit by framing fact verification as the assessment of claims against textual evidence (Thorne et al., 2018).

For the present study, the central issue is not whether a system can output a label, but whether evidence access changes balanced classification reliability. Evidence retrieval can improve a model’s opportunity to resolve missing context, but it can also introduce irrelevant, incomplete, or misleading passages. Therefore, retrieval must be treated as a mechanism whose effects require empirical evaluation rather than as an automatic guarantee of factual reliability.

This evidence-centered view connects automated verification to Library and Information Science because retrieval, source selection, and evidence use are information processes. The study retains this LIS relevance while avoiding claims about human information evaluation that are not measured directly.

2.3 Standalone LLMs and Unsupported Verification

Standalone LLMs differ from traditional text classifiers because they can use broad parametric knowledge and instruction-following behavior during inference. This makes them attractive for claim verification, especially when no task-specific training is available. At the same time, natural language generation systems can produce fluent but unsupported outputs, and hallucination remains a recognized limitation of generative systems used for factual tasks (Ji et al., 2023).

The risk is not only that an LLM may be wrong. The risk is that an LLM may present an unsupported classification with the appearance of explanation, thereby making the output seem more evidentially grounded than it is. This concern is consistent with warnings that large language models can produce persuasive language without reliable grounding in communicative intent, source accountability, or factual support (Bender et al., 2021; Weidinger et al., 2021). This is why the study distinguishes between standalone LLMs and retrieval-augmented LLMs as separate verification architectures. Standalone conditions test classification without retrieved evidence, while RAG conditions test whether adding retrieved passages changes classification reliability.

LLM auditing scholarship also supports this caution. Evaluation of language models should attend to system behavior, downstream risks, and limitations of black-box outputs rather than relying only on general capability claims (Mökander et al., 2023). In this study, that principle is operationalized through architecture-level comparison, class-level metrics, and error-pattern analysis.

2.4 Retrieval-Augmented Generation and Evidence Grounding

Retrieval-augmented generation combines a language model with an external retrieval component. The original RAG formulation was motivated by the limits of relying only on knowledge stored in model parameters, especially for knowledge-intensive NLP tasks (Lewis et al., 2020). Passage-retrieval approaches such as Fusion-in-Decoder further demonstrate that retrieved passages can support generative models in open-domain knowledge tasks, while later work on dense retrieval and active retrieval emphasizes that retrieval quality, timing, and evidence selection influence downstream performance (Izacard & Grave, 2021; Jiang et al., 2023; Karpukhin et al., 2020).

Recent RAG evaluation studies caution that retrieval augmentation should not be treated as uniformly beneficial. Benchmarking work shows that RAG systems require evaluation across different capabilities, including how models use retrieved information and how they respond when retrieval is imperfect (Chen et al., 2024). Survey work similarly frames RAG as a family of architectures whose performance depends on retrieval, augmentation, and generation choices (Gao et al., 2023).

This study therefore treats bounded-corpus RAG as an evidence condition. The bounded corpus supports reproducibility and reduces the variability of live web retrieval, but it also limits coverage. The study’s RAG systems are expected to perform well only when retrieved passages are relevant enough for the model to use and when the claim can be resolved from the available evidence.
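A bounded-corpus evidence condition can be sketched as a fixed passage store, a ranking function, and a prompt that carries the top-ranked passages. The example below is a minimal illustration under stated assumptions: the Jaccard token-overlap scorer, the `retrieve` and `build_prompt` helpers, the prompt wording, and the three passages are hypothetical and do not describe the retriever, corpus, or prompt used in the study.

```python
import re


def tokens(text):
    """Lowercased word and number tokens as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(claim, corpus, k=2):
    """Rank passages of a fixed corpus by Jaccard overlap with the claim."""
    claim_tokens = tokens(claim)
    scored = []
    for passage_id, passage in corpus.items():
        passage_tokens = tokens(passage)
        union = claim_tokens | passage_tokens
        score = len(claim_tokens & passage_tokens) / len(union) if union else 0.0
        scored.append((passage_id, score))
    # Highest score first; ties are broken by passage id for reproducibility.
    scored.sort(key=lambda item: (-item[1], item[0]))
    return [(pid, score) for pid, score in scored[:k] if score > 0]


def build_prompt(claim, corpus, evidence):
    """Assemble the text a RAG-condition model would receive (hypothetical wording)."""
    lines = [
        "Classify the claim as 'misleading' or 'not misleading'.",
        f"Claim: {claim}",
        "Evidence:",
    ]
    lines += [f"[{pid}] {corpus[pid]}" for pid, _ in evidence] or ["(no passage retrieved)"]
    return "\n".join(lines)


# Hypothetical bounded corpus; the passages are invented for illustration.
corpus = {
    "p1": "The state unemployment rate fell from 6 percent to 5 percent in 2015.",
    "p2": "The city council approved a new transit budget in March.",
    "p3": "Unemployment in the state rose during the 2009 recession.",
}
claim = "The state unemployment rate fell in 2015."

evidence = retrieve(claim, corpus, k=2)
prompt = build_prompt(claim, corpus, evidence)
print(prompt)
```

In this toy run the second-ranked passage concerns a different year, which shows how a lexical retriever can supply topically related but temporally mismatched evidence. Whether the model uses or over-weights such a passage is the kind of behavior the benchmark is designed to observe.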

2.5 Evidence Use and Construct Boundaries

Information science provides important concepts for understanding why evidence selection and source context matter, but those concepts must be used carefully in this study. Research on online information evaluation treats credibility as a multidimensional human construct shaped by trustworthiness, accuracy, source features, cognitive authority, interface cues, user context, and evaluation behavior (Choi & Stvilia, 2015; Hilligoss & Rieh, 2008; Metzger, 2007; Rieh, 2002; Sundar, 2008). Information behavior research also shows that source evaluation is embedded in situated practices of seeking, orienting, and selection rather than reducible to a classifier output (Savolainen, 2007). The present study does not observe users, measure trust, or evaluate human information behavior. It therefore does not claim to measure human source evaluation directly.

Instead, information science enters the study through evidence organization, retrieval, and system-mediated verification. The empirical object measured here is classification reliability: whether systems assign the operational benchmark label correctly under different evidence conditions. This distinction prevents conceptual inflation. The study can contribute to LIS by evaluating retrieval-supported verification systems, but it cannot infer how people would accept, reject, or interpret those systems without a separate user study.

This boundary is especially important for RAG. Retrieved evidence may make a system appear more transparent, but evidence presence is not equivalent to evidence quality. A retrieved passage can be irrelevant, incomplete, outdated, or misinterpreted. Knowledge organization scholarship is relevant here because the way evidence is selected, represented, and organized affects what can be retrieved and interpreted (Hjørland, 2012). Consequently, the study frames retrieval augmentation as a potential support for evidence-conditioned prediction reliability rather than as a direct measure of credibility.

2.6 Evaluation Beyond Accuracy

The evaluation literature supports moving beyond aggregate accuracy for NLP systems. Behavioral testing approaches argue that benchmark accuracy can overestimate system reliability unless complemented by analyses of specific capabilities and failure patterns (Ribeiro et al., 2020). This point is directly relevant to claim verification because the practical meaning of an error depends on the class and claim condition.

Balanced metrics are also necessary because misinformation benchmarks may contain unequal class distributions. Classification-metric research shows that performance measures capture different aspects of classifier behavior and must be selected according to the task and error costs (Sokolova & Lapalme, 2009). MCC is useful because it summarizes the relationship between predicted and actual binary labels using all cells of the confusion matrix and is less vulnerable than accuracy to misleading conclusions under imbalance (Boughorbel et al., 2017; Chicco & Jurman, 2020). Macro F1 is similarly appropriate because it gives equal weight to class-level F1 scores rather than allowing the larger class to dominate interpretation.
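For reference, the standard definitions behind these metrics can be written out for the binary task. The notation is introduced here for convenience and is not taken from the study: for a class $c$, $TP_c$, $FP_c$, and $FN_c$ denote its true positives, false positives, and false negatives, and the unsubscripted $TP$, $TN$, $FP$, $FN$ treat the misleading class as positive.

$$
P_c = \frac{TP_c}{TP_c + FP_c}, \qquad
R_c = \frac{TP_c}{TP_c + FN_c}, \qquad
F1_c = \frac{2\,P_c\,R_c}{P_c + R_c}
$$

$$
\text{Macro F1} = \frac{1}{2}\left(F1_{\text{misleading}} + F1_{\text{not misleading}}\right)
$$

$$
\text{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}
$$

Macro F1 averages the two class-level scores without weighting by class size, and MCC uses all four cells of the confusion matrix, which is why neither can be driven up by favoring the larger class alone.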

For this reason, the study treats Macro F1 as the primary benchmark metric and interprets it alongside MCC, class-level recall, precision, confusion matrices, and error agreement. This strategy makes the results more informative for evidence-grounded verification because it distinguishes overall performance from the system’s tendency to miss misleading claims or over-flag not-misleading claims.

2.7 Claim Ambiguity and Difficult Verification Conditions

Short fact-checked claims frequently compress complex factual situations into brief statements. This creates difficulty when the truth value depends on numerical ranges, comparison groups, dates, policy definitions, or missing qualifiers. A claim may be partly accurate yet misleading because it omits context, selects a favorable denominator, or frames a technically true statement in a deceptive way.

The study therefore treats ambiguity and partial truth as moderating conditions. These conditions are not separate model architectures; they are claim features that may affect the reliability of all architectures. Binary misinformation classification also structurally compresses contextual ambiguity: source labels that distinguish degrees of truthfulness, missing context, and partial accuracy are forced into a smaller operational label space. This compression is useful for reproducible benchmarking, but it can obscure the difference between a plainly false claim and a claim that is technically accurate yet misleading by omission. A standalone LLM may rely on prior knowledge or linguistic plausibility, while a RAG LLM may retrieve evidence that clarifies the claim or evidence that reinforces the wrong interpretation. The hard-case analysis is designed to identify these patterns systematically rather than treating errors as isolated failures.

2.8 Formal Theoretical Mechanism

The study formalizes retrieval augmentation as a conditional mechanism rather than as an automatic improvement. In the standalone LLM condition, the pathway is:

Standalone LLM -> parametric inference only -> label prediction

In the bounded-corpus RAG condition, the pathway is:

Retrieval augmentation -> evidence availability -> evidence grounding opportunity -> reduced contextual uncertainty -> classification reliability

The phrase “grounding opportunity” is deliberate. Prior RAG and attribution-evaluation research shows that retrieved or cited material can support generation, but external references still require evaluation because retrieval may be incomplete, irrelevant, or misused by the model (Izacard & Grave, 2021; Lewis et al., 2020; Yue et al., 2023). The mechanism is therefore falsifiable: if retrieval improves evidence-conditioned prediction reliability, RAG systems should show higher Macro F1, MCC, or class-level recall than their standalone counterparts. If retrieval adds noise or shifts the model toward over-classification, RAG systems should show weaker balanced metrics or new error patterns.
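One way to inspect the second outcome, in which retrieval shifts errors rather than removing them, is to compare paired predictions from the standalone and RAG configurations of the same model on the same claims. The sketch below assumes a hypothetical helper `error_transitions` and six invented predictions; it illustrates the bookkeeping only and is not the study's error agreement procedure.

```python
from collections import Counter


def error_transitions(gold, standalone, rag):
    """Count, per gold class, how each claim's outcome changes when retrieval is added."""
    counts = Counter()
    for y, s, r in zip(gold, standalone, rag):
        s_ok, r_ok = s == y, r == y
        if s_ok and r_ok:
            outcome = "both_correct"
        elif r_ok:
            outcome = "fixed_by_retrieval"
        elif s_ok:
            outcome = "broken_by_retrieval"
        else:
            outcome = "both_wrong"
        counts[(y, outcome)] += 1
    return counts


# Invented labels and predictions, used only to show the bookkeeping.
M, N = "misleading", "not misleading"
gold = [M, M, M, N, N, N]
standalone = [N, N, M, N, N, M]
rag = [M, M, M, M, N, M]

transitions = error_transitions(gold, standalone, rag)
for key, count in sorted(transitions.items()):
    print(key, count)
```

In this toy case retrieval recovers two missed misleading claims but also flags one not-misleading claim that the standalone configuration had classified correctly. An aggregate score would hide that exchange, whereas the per-class transition counts make it visible.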

The moderation logic is also explicit. Ambiguity conditions increase contextual dependency because numerical, temporal, and partial-truth claims require information not fully contained in the claim text. These conditions are expected to reduce reliability across architectures. RAG should help only when retrieved evidence resolves the missing context; it should fail or harm performance when retrieval supplies weak evidence, mismatched evidence, or evidence that the model over-weights. This formal mechanism connects the conceptual framework to the hypotheses without making unsupported claims about human cognition or general model reasoning.

2.9 Conceptual Framework

Figure 1 presents the study’s conceptual framework. The framework begins with a benchmark claim set and compares two primary LLM verification conditions: standalone LLM verification, which relies on parametric model knowledge, and bounded-corpus RAG verification, which adds retrieval and evidence integration. The mechanism of interest is retrieval and evidence grounding: a bounded corpus is searched by a retriever, top-ranked evidence is supplied to the model, and the model produces a classification decision. The study then evaluates reliability outcomes using balanced metrics and error analysis.

Figure 1: Conceptual framework for retrieval-augmented verification and classification reliability under bounded-corpus conditions.

The framework also specifies the study’s moderating conditions. Numerical ambiguity, temporal ambiguity, and partial truth conditions may reduce reliability because they increase the amount of context needed for correct classification. These moderators are expected to affect both standalone and RAG systems, although the direction of the effect may differ. Retrieval may improve classification when it supplies clarifying evidence, but it may reduce reliability when retrieved passages are weak, mismatched, or over-weighted by the model.

The framework in Figure 1 is conceptual rather than a claim that all links are causally proven. The empirical design compares fixed architectures under controlled benchmark conditions. It can show whether retrieval-augmented systems are associated with higher or lower reliability metrics, but it cannot establish a general causal theory of factual reasoning. This scope limitation is deliberate: the framework supports a reproducible benchmark study of evidence-grounded verification, not a socio-cognitive account of human source evaluation.

2.10 Literature Synthesis and Research Positioning

The literature points to four conclusions that structure the study. First, misinformation verification is better understood as evidence-supported claim verification than as surface-text fake news detection alone (Guo et al., 2022; Nakov, Da San Martino, et al., 2021; Thorne et al., 2018). Second, standalone LLMs require careful evaluation because fluent outputs can still be factually unsupported (Bender et al., 2021; Ji et al., 2023; Weidinger et al., 2021). Third, RAG offers a plausible mechanism for improving evidence access, but its benefits depend on retrieval quality and model use of retrieved passages (Chen et al., 2024; Gao et al., 2023; Izacard & Grave, 2021; Lewis et al., 2020). Fourth, evaluation should use balanced metrics and error-pattern analysis because aggregate accuracy can hide class-level failures (Chicco & Jurman, 2020; Ribeiro et al., 2020; Sokolova & Lapalme, 2009).

The study is positioned at this intersection. Its contribution is a bounded-corpus benchmark of evidence-grounded verification reliability for short fact-checked English claims. The study does not claim to solve misinformation or measure human source evaluation. It instead provides a controlled evaluation of when retrieval augmentation appears to improve, fail to improve, or redistribute classification reliability across verification architectures.

3 Methodology and Framework

3.1 Research Design

The study uses a quantitative comparative benchmark design. This design is appropriate because the purpose is to compare fixed verification architectures under the same claim set, label mapping, prompt version, retrieval condition, and evaluation procedure. The benchmark compares three architecture types: a traditional text-classification baseline, standalone LLMs, and bounded-corpus RAG LLMs. This design follows automated fact-checking research that distinguishes evidence retrieval from verdict prediction (Guo et al., 2022) and predictive modeling guidance that emphasizes transparent baselines, held-out evaluation, and task-appropriate metrics (Hastie et al., 2009; Kuhn & Johnson, 2013). It also follows NLP evaluation guidance that treats empirical system comparison as requiring explicit metrics and controlled test conditions (Dror et al., 2018).

The design is comparative rather than causal in the strong experimental sense. The systems are not sampled from a population of possible models, and the benchmark does not identify internal reasoning processes. The study can therefore report that one architecture condition is associated with higher or lower reliability metrics under the study conditions, but it does not claim that retrieval augmentation generally causes better factual reasoning.

3.2 Data Source and Unit of Analysis

The unit of analysis is the short fact-checked English claim. The study uses a public benchmark of short political claims originally labeled with fine-grained truthfulness categories (Wang, 2017). The claim-level unit is appropriate because the study evaluates whether verification systems can classify concise factual statements under controlled evidence conditions. It does not evaluate whole news documents, social media threads, multimodal posts, or live web monitoring.

The dataset partitions used in the study were introduced in Table 2. Those partitions are retained for methodological separation between model development and final evaluation. The held-out test partition is used for the comparative benchmark. Because the benchmark is reconstructed as a binary verification task, original labels are mapped into two operational classes, as shown in Table 3.

Table 3: Operational binary label mapping used in the study.

Original label category	Operational binary class	Methodological rationale
pants-fire	misleading	Strongly false claims are treated as misleading for binary verification
false	misleading	False claims are treated as misleading
barely-true	misleading	Minimally true claims are treated as misleading because they require corrective interpretation
half-true	not misleading	Partially accurate claims are grouped with non-misleading claims for the binary benchmark
mostly-true	not misleading	Mostly accurate claims are treated as not misleading
true	not misleading	True claims are treated as not misleading

Table 3 is a methodological simplification. It supports reproducible binary classification, but it also compresses important differences among partial truth categories. This binary compression is analytically important rather than merely technical: it can turn claims with omitted context or mixed truth conditions into borderline benchmark cases. For this reason, the study interprets hard cases and partial-truth errors cautiously rather than treating the binary labels as a complete representation of factual nuance.
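The mapping in Table 3 can be expressed as a small lookup. The following is a minimal sketch rather than the study's own code: the names `BINARY_LABEL_MAP` and `to_binary_label` are illustrative assumptions, and the label strings are transcribed from Table 3.

```python
# Six-way to binary mapping transcribed from Table 3.
# The constant and function names are hypothetical, not taken from the study's repository.
BINARY_LABEL_MAP = {
    "pants-fire": "misleading",
    "false": "misleading",
    "barely-true": "misleading",
    "half-true": "not misleading",
    "mostly-true": "not misleading",
    "true": "not misleading",
}


def to_binary_label(original_label: str) -> str:
    """Collapse a fine-grained truthfulness label into the operational binary class."""
    key = original_label.strip().lower()
    if key not in BINARY_LABEL_MAP:
        # Fail loudly instead of silently assigning a class to an unknown label.
        raise ValueError(f"Unmapped label: {original_label!r}")
    return BINARY_LABEL_MAP[key]
```

Raising on unknown labels keeps the compression explicit: any claim that does not fit the six original categories is surfaced rather than absorbed into one of the two classes.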

3.3 Verification Architectures

The independent variable is verification architecture type. The baseline condition represents claim-only lexical classification. The standalone LLM condition represents prompted classification without retrieved external evidence. The bounded-corpus RAG condition represents prompted classification with retrieved evidence passages supplied during inference.

The architecture comparison in Table 1 defines the system families evaluated by the study. The traditional baseline estimates the value of textual signal alone. The standalone LLM condition estimates how selected LLMs behave when they must classify claims without explicit retrieved evidence. The RAG condition estimates whether adding bounded evidence retrieval changes reliability, class-level recall, and error distribution.

This architecture framing is central to the study’s methodological validity. It prevents the results from being interpreted as a generic ranking of models and instead treats each system as an operationalization of a different evidence condition.

3.4 Retrieval and Evidence Grounding Procedure

The RAG condition uses bounded-corpus retrieval. This means that retrieval is performed against a controlled collection of claim-related evidence rather than against the open web. The bounded approach improves reproducibility because the evidence environment is fixed for the benchmark. It also clarifies the scope of inference: the findings apply to controlled retrieval settings and should not be generalized directly to live open-domain fact-checking.

The retrieval procedure follows the mechanism represented in Figure 1. A claim is submitted to the retrieval component, top-ranked evidence is selected, and the retrieved passages are provided to the LLM as context for classification. The model then returns a label under the same binary scheme used for the baseline and standalone systems. RAG research supports this general architecture, but also shows that performance depends on retrieval coverage, ranking quality, and the model’s ability to use external context (Chen et al., 2024; Gao et al., 2023; Lewis et al., 2020).

Because the study evaluates end-task classification rather than full explanation faithfulness, retrieval quality is treated as a mechanism and limitation rather than as a fully measured outcome. The results can show whether RAG changes classification reliability, but they cannot fully determine whether the model’s final answer faithfully used the retrieved evidence.
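The retrieve-then-classify flow described in this section can be illustrated with a toy example. This is a sketch under stated assumptions, not the study's pipeline: the retriever type, the number of passages, and the prompt wording are not specified in this section, so the bag-of-words cosine scorer, the default `k`, and the prompt text below are all hypothetical stand-ins. The sketch stops at prompt assembly and does not call any model.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(claim, corpus, k=3):
    """Return up to k passages from a fixed (bounded) corpus, best match first.

    Hypothetical scorer: the study's actual retriever may differ.
    """
    query = Counter(tokenize(claim))
    scored = [(cosine(query, Counter(tokenize(p))), i, p) for i, p in enumerate(corpus)]
    scored.sort(key=lambda item: (-item[0], item[1]))  # ties keep corpus order
    return [passage for score, _, passage in scored[:k] if score > 0]


def build_prompt(claim, passages):
    """Assemble a fixed-format prompt (hypothetical wording) with retrieved evidence as context."""
    evidence = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return (
        "Classify the claim as 'misleading' or 'not misleading'.\n"
        f"Evidence:\n{evidence or '(none retrieved)'}\n"
        f"Claim: {claim}\nLabel:"
    )
```

Because the corpus is passed in as a fixed list, the evidence environment is identical for every run, which is the property the bounded-corpus design relies on.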

3.5 Retrieval-Effect Audit

The available benchmark outputs do not include full retrieved-passage logs, rank positions, or document-level relevance annotations. For that reason, direct retrieval hit rate and evidence relevance cannot be computed as primary retrieval-quality metrics in this version of the study. This limitation is reported explicitly because retrieval quality is central to the interpretation of RAG results.

To reduce that limitation, the study adds a lightweight retrieval-effect audit using the paired standalone and RAG predictions for each model family. The audit identifies three classification-level patterns: cases where RAG corrected a standalone error, cases where RAG introduced an error after the standalone system was correct, and cases where both systems remained incorrect. This is a proxy for retrieval effect, not a direct measure of evidence relevance. It helps distinguish whether retrieval access is associated with net correction, net harm, or unchanged difficulty while preserving the study’s bounded inferential scope.
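The three audit patterns can be computed directly from paired predictions. The sketch below is illustrative: the function name and dictionary keys are assumptions, and the rates are computed over all evaluated claims, which is an assumed denominator rather than one stated in this section.

```python
def retrieval_effect_audit(gold, standalone_pred, rag_pred):
    """Count RAG-corrected, RAG-introduced, and shared errors for one model family.

    Hypothetical helper: names and the choice of denominator are assumptions.
    """
    if not (len(gold) == len(standalone_pred) == len(rag_pred)):
        raise ValueError("All three label sequences must have the same length.")

    corrected = introduced = both_incorrect = 0
    for truth, alone, rag in zip(gold, standalone_pred, rag_pred):
        alone_ok, rag_ok = alone == truth, rag == truth
        if rag_ok and not alone_ok:
            corrected += 1        # RAG fixed a standalone error
        elif alone_ok and not rag_ok:
            introduced += 1       # RAG broke a correct standalone prediction
        elif not alone_ok and not rag_ok:
            both_incorrect += 1   # difficulty unchanged by retrieval

    n = len(gold)
    return {
        "rag_corrected": corrected,
        "rag_introduced": introduced,
        "both_incorrect": both_incorrect,
        "net_corrected": corrected - introduced,
        "correction_rate": corrected / n if n else 0.0,
        "introduction_rate": introduced / n if n else 0.0,
    }
```

The net count equals the change in the number of correct predictions, so it tracks accuracy rather than Macro F1 or MCC and should be read alongside the class-level metrics.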

3.6 Prompting and Model Execution

The LLM conditions use fixed prompting to support comparability across systems. A fixed prompt reduces uncontrolled variation and makes the benchmark reproducible, but it also creates a prompt-sensitivity limitation. LLM outputs can vary with prompt wording, instruction order, formatting, and model updates. Therefore, the study treats prompt design as part of the benchmark configuration rather than as a universally optimal verification protocol.

The benchmark uses deterministic or near-deterministic settings where available. This reduces output variability and supports paired comparison across systems. However, model non-determinism and provider-side changes remain relevant validity risks. The study addresses this by reporting benchmark configuration details, using the same evaluation claims across systems, and interpreting system differences within the bounded run conditions.

3.7 Variables and Operational Definitions

The study’s variables are defined operationally in Table 4. This table connects the conceptual framework to the empirical benchmark and prevents the analysis from drifting into constructs that are not directly measured.

Table 4: Operational variables and measures used in the study.

Construct	Role in the study	Operational definition	Primary indicators
Verification architecture type	Independent variable	System condition used to classify each claim	Traditional baseline, standalone LLM, bounded-corpus RAG LLM
Evidence availability	Mechanism	Whether external retrieved evidence is supplied during inference	No retrieval vs bounded retrieved evidence
Classification reliability	Dependent variable	Agreement between predicted and operational true labels	Macro F1, MCC, balanced accuracy, class-level precision and recall
Error distribution	Analytical outcome	Pattern of false positives, false negatives, and shared errors across systems	Confusion matrices, error agreement, corrected and introduced errors
Claim difficulty	Moderating condition	Claim features associated with persistent cross-system error	Numerical ambiguity, temporal ambiguity, partial truth, missing context

As Table 4 shows, the measured dependent variable is classification reliability, not human source evaluation or user trust. This distinction aligns the study’s theoretical framing with the available data and avoids inferring human information behavior from automated benchmark results.

3.8 Evaluation Metrics

The study uses multiple metrics because no single measure adequately captures verification reliability under class imbalance. Accuracy is reported for interpretability, but it is not treated as the primary metric. Macro F1 is prioritized because it gives equal weight to both operational classes. MCC is included because it incorporates all four cells of the binary confusion matrix and is recommended as a balanced measure for binary classification, especially where accuracy or F1 may be misleading (Boughorbel et al., 2017; Chicco & Jurman, 2020; Sokolova & Lapalme, 2009).

Table 5 lists the evaluation metrics and their analytical function in the study.

Table 5: Evaluation metrics and analytical purposes.

Metric or output	Analytical purpose	Interpretive caution
Accuracy	Summarizes overall proportion of correct predictions	Can obscure poor minority-class or class-specific performance
Macro F1	Primary balanced metric across misleading and not-misleading classes	Does not show which class drives errors without class-level analysis
Class-level recall	Shows whether systems miss misleading claims or over-flag not-misleading claims	Must be interpreted with false positive and false negative patterns
Class-level precision	Shows reliability of each predicted class	Sensitive to prediction distribution and class imbalance
MCC	Summarizes balanced binary association between predicted and true labels	Less intuitive than accuracy and should be reported with confusion matrices
Confusion matrix	Displays false positives and false negatives directly	Requires prose interpretation rather than score ranking alone
Retrieval-effect proxy	Identifies RAG-corrected and RAG-introduced errors relative to standalone counterparts	Does not replace direct retrieved-document relevance or hit-rate analysis
Error agreement	Identifies claims that multiple systems classify incorrectly	Exploratory unless paired with a systematic claim taxonomy

Each output in Table 5 is discussed in the results rather than presented as a stand-alone score. This is necessary because a system may improve misleading recall while worsening not-misleading recall, or improve accuracy while reducing balanced reliability. The study therefore evaluates retrieval augmentation as a tradeoff-sensitive architecture change.
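The relationship among these metrics is easiest to see when they are computed from one confusion matrix. The following is a minimal sketch using the standard definitions of F1, Macro F1, and MCC; the function names are illustrative, and the counts in the usage note are hypothetical rather than study results.

```python
import math


def f1_score(tp, fp, fn):
    """F1 for one class from its true positives, false positives, and false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def binary_metrics(tp, fn, fp, tn):
    """Metrics for a binary task where 'misleading' is treated as the positive class.

    tp: misleading predicted misleading      fn: misleading predicted not misleading
    fp: not misleading predicted misleading  tn: not misleading predicted not misleading
    """
    total = tp + fn + fp + tn
    f1_misleading = f1_score(tp, fp, fn)
    # For the not-misleading class the roles of the error cells swap.
    f1_not_misleading = f1_score(tn, fn, fp)
    mcc_denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "macro_f1": (f1_misleading + f1_not_misleading) / 2,
        "mcc": (tp * tn - fp * fn) / mcc_denom if mcc_denom else 0.0,
        "misleading_recall": tp / (tp + fn) if (tp + fn) else 0.0,
        "not_misleading_recall": tn / (tn + fp) if (tn + fp) else 0.0,
    }
```

For example, `binary_metrics(tp=50, fn=30, fp=20, tn=40)` uses made-up counts to show how accuracy, Macro F1, and MCC can diverge on the same predictions once the two class recalls differ.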

3.9 Statistical and Comparative Analysis

The statistical analysis is used to support paired benchmark comparison, not broad population inference. Because all systems classify the same held-out claims, paired comparison is more appropriate than treating model outputs as independent samples. For binary correctness outcomes, McNemar-style comparisons are appropriate for paired nominal observations, and NLP evaluation guidance emphasizes that significance tests should be chosen to match the evaluation setting and interpreted carefully (Dror et al., 2018).

The study uses statistical tests as robustness indicators. A statistically detectable difference in correctness does not automatically imply practical superiority, generalizable population effects, epistemic reliability, or better reasoning. For that reason, statistical outputs are interpreted alongside effect direction, Macro F1, MCC, class-level recall, and error patterns.
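A McNemar-style comparison uses only the discordant pairs: claims that one system classified correctly and the other did not. The sketch below implements the continuity-corrected chi-squared form with one degree of freedom. That the study used this exact variant is an assumption, although the chi-squared values reported in Table 9 are consistent with it; the function name is illustrative.

```python
import math


def mcnemar_continuity(b, c):
    """Continuity-corrected McNemar test on discordant pair counts.

    b: claims where system A is correct and system B is incorrect
    c: claims where system B is correct and system A is incorrect
    Returns (chi_squared, p_value) with one degree of freedom.
    """
    n = b + c
    if n == 0:
        return 0.0, 1.0  # no discordant pairs, so no evidence of a difference
    chi2 = (abs(b - c) - 1) ** 2 / n
    # Survival function of a chi-squared variable with 1 df: P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value
```

Calling `mcnemar_continuity(179, 231)` with the GPT-4o-mini discordant counts gives a chi-squared of about 6.34 and a p-value of about 0.012. The test says nothing about which class drove the change, which is why it is read alongside the recall deltas.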

3.10 Error Taxonomy and Hard-Case Analysis

The study includes a structured hard-case analysis to identify claims that remain difficult across architectures. This analysis is motivated by behavioral testing approaches that argue system evaluation should go beyond aggregate scores and examine specific failure patterns (Ribeiro et al., 2020).

Hard cases are identified through repeated misclassification across evaluated systems. They are then interpreted according to claim conditions derived from the conceptual framework: numerical ambiguity, temporal ambiguity, partial truth, missing context, and evidence insufficiency. These categories allow the analysis to ask whether retrieval helps with context-dependent claims, whether it introduces new errors, and whether certain claim structures remain difficult even when evidence is available. Because recent work on attributed LLMs shows that externally referenced outputs still require reliable attribution evaluation, the hard-case analysis treats evidence-grounded output as something to audit rather than assume (Yue et al., 2023).
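Counting repeated misclassification can be sketched as follows. The band thresholds are deliberately left as required arguments because this section does not state the cut-offs used in the study; the function names and the threshold parameters are hypothetical.

```python
def systems_in_error(gold, predictions_by_system):
    """Return, for each claim, the number of systems that misclassified it."""
    counts = [0] * len(gold)
    for name, preds in predictions_by_system.items():
        if len(preds) != len(gold):
            raise ValueError(f"Prediction length mismatch for system {name!r}.")
        for i, (truth, pred) in enumerate(zip(gold, preds)):
            if pred != truth:
                counts[i] += 1
    return counts


def agreement_band(n_errors, *, moderate_min, high_min):
    """Assign an error-agreement band from a per-claim error count.

    moderate_min and high_min are hypothetical cut-offs supplied by the caller.
    """
    if n_errors == 0:
        return "No system error"
    if n_errors >= high_min:
        return "High error agreement"
    if n_errors >= moderate_min:
        return "Moderate error agreement"
    return "Low error agreement"
```

Claims whose count equals the number of evaluated systems are the strongest hard-case candidates, since no architecture condition classified them correctly.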

3.11 Validity Safeguards and Limitations

Several validity risks are built into the design. Dataset contamination is possible because public benchmark claims may have appeared in model training data. Prompt sensitivity is possible because LLM behavior can change under different instructions. Retrieval contamination is possible if retrieved passages are irrelevant, incomplete, or misleading. Model non-determinism and provider updates may also affect reproducibility. These risks are consistent with recent concerns in LLM evaluation and attribution evaluation, where high average performance may mask instability, contamination, or unsupported source use (Chen et al., 2024; Yue et al., 2023).

The study addresses these risks through bounded retrieval, fixed evaluation claims, fixed prompts, transparent metric reporting, and cautious inferential language. These safeguards do not eliminate all threats to validity, but they make the benchmark auditable and reduce conceptual overreach.

3.12 Ethical Considerations

Automated verification systems can create harms when users over-rely on incorrect classifications or when systems over-flag accurate claims as misleading. False negatives may allow misleading claims to pass unchallenged, while false positives may wrongly reduce confidence in accurate information. LLM auditing literature emphasizes the need to evaluate downstream risks and system behavior rather than relying on model capability claims alone (Mökander et al., 2023).

The study therefore avoids presenting any evaluated system as a production-ready fact-checker. Its ethical contribution is evaluative: it identifies reliability limits, retrieval tradeoffs, and error conditions that should be considered before evidence-grounded AI systems are used in public information environments.

3.13 Reproducibility Framework

The methodology is designed as a reproducible benchmark workflow. The workflow begins with a fixed claim set and binary label mapping, applies the same evaluation claims to each architecture condition, records predictions and benchmark outputs, computes balanced metrics, and analyzes error patterns. Figure 1 provides the conceptual version of this workflow, while Table 4 and Table 5 provide the operational definitions used for analysis.

To support reproducibility and open access, the manuscript source, dataset splits, benchmark outputs, and related rendering files are archived in a GitHub repository at https://github.com/dddorado/reynoso-reproducibility. A preprint version of the manuscript is also available through RPubs at https://rpubs.com/danddorado/1434840.

The study’s reproducibility framework has three practical implications. First, every result should be traceable to a defined architecture condition and label mapping. Second, every table or figure should be generated from the study data or benchmark outputs rather than manually transcribed. Third, every interpretation should remain within the bounded-corpus design and avoid unsupported claims about open-web fact-checking, human information evaluation, or general model reasoning.

4 Results and Analysis

4.1 Benchmark Performance Overview

The benchmark evaluated eleven systems on the same held-out claim set: one traditional baseline, five standalone LLMs, and five bounded-corpus RAG LLMs. The strongest system by both Macro F1 (0.6194) and MCC (0.2390) was RAG LLM (deepseek/deepseek-chat-v3-0324). The highest accuracy (0.6638) was produced by Standalone LLM (google/gemini-2.5-flash), which nonetheless ranked tenth by Macro F1.

Table 6 ranks all evaluated systems by Macro F1. The table shows that the best-performing systems are not simply those with the highest misleading recall. Instead, the strongest systems combine relatively higher misleading recall with less severe deterioration in not-misleading recall, which is why Macro F1 and MCC provide a more informative view than accuracy alone.

Table 6: Overall benchmark performance ranked by Macro F1.

Rank	System	Architecture	Macro F1	Accuracy	MCC	Misleading recall	Not-misleading recall
1	RAG LLM (deepseek/deepseek-chat-v3-0324)	RAG LLM	0.6194	0.6488	0.2390	0.7176	0.5234
2	RAG LLM (meta-llama/llama-3.3-70b-instruct)	RAG LLM	0.6153	0.6393	0.2368	0.7237	0.4855
3	RAG LLM (anthropic/claude-haiku-4.5)	RAG LLM	0.6117	0.6354	0.2259	0.6834	0.5479
4	Standalone LLM (openai/gpt-4o-mini)	Standalone LLM	0.6006	0.6117	0.2209	0.6027	0.6281
5	RAG LLM (google/gemini-2.5-flash)	RAG LLM	0.6002	0.6180	0.2088	0.6430	0.5724
6	Standalone LLM (anthropic/claude-haiku-4.5)	Standalone LLM	0.5975	0.6164	0.2020	0.6455	0.5635
7	Standalone LLM (meta-llama/llama-3.3-70b-instruct)	Standalone LLM	0.5942	0.6125	0.1931	0.6773	0.4944
8	Baseline (TF-IDF + LogReg)	Baseline	0.5893	0.6046	0.1909	0.6174	0.5813
9	RAG LLM (openai/gpt-4o-mini)	RAG LLM	0.5884	0.6527	0.1930	0.8117	0.3630
10	Standalone LLM (google/gemini-2.5-flash)	Standalone LLM	0.5818	0.6638	0.2008	0.8570	0.3118
11	Standalone LLM (deepseek/deepseek-chat-v3-0324)	Standalone LLM	0.5729	0.6488	0.1715	0.8289	0.3207

Figure 2 visualizes the same ranking. The figure makes the central result easier to see: bounded-corpus RAG systems occupy the top three Macro F1 positions, but the improvement is not uniform across all model families. This means $H_1$ is only partially supported. RAG improved Macro F1 for four of the five model families, but it reduced Macro F1 for GPT-4o-mini.

Figure 2: Macro F1 by evaluated verification system.

4.2 Balanced Classification Reliability

Table 7 aggregates the benchmark by architecture type. The RAG group has the highest mean Macro F1 and mean MCC, but the group average hides important variation by model family. Standalone LLMs do not consistently outperform the traditional baseline on balanced reliability, and RAG systems do not uniformly improve all class-level outcomes.

Table 7: Mean performance by verification architecture type.

Architecture	Systems	Mean Macro F1	Mean accuracy	Mean MCC	Mean misleading recall	Mean not-misleading recall
RAG LLM	5	0.6070	0.6388	0.2207	0.7159	0.4984
Standalone LLM	5	0.5894	0.6306	0.1977	0.7223	0.4637
Baseline	1	0.5893	0.6046	0.1909	0.6174	0.5813

The architecture-level pattern supports the study’s narrowed interpretation. Retrieval augmentation is associated with stronger average balanced reliability, but this should not be read as evidence that RAG is inherently superior. The mean values in Table 7 show an architecture-level tendency, while the family-level deltas in the next subsection show the unevenness of that tendency.
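The architecture-level means can be re-derived from the per-system rows. The following minimal sketch does so for Macro F1 only; the variable names are illustrative, and the values are copied from Table 6.

```python
from statistics import mean

# Macro F1 values copied from Table 6, grouped by architecture type.
macro_f1_by_architecture = {
    "RAG LLM": [0.6194, 0.6153, 0.6117, 0.6002, 0.5884],
    "Standalone LLM": [0.6006, 0.5975, 0.5942, 0.5818, 0.5729],
    "Baseline": [0.5893],
}

# Round to four decimals to match the precision reported in Table 7.
mean_macro_f1 = {
    architecture: round(mean(values), 4)
    for architecture, values in macro_f1_by_architecture.items()
}
```

The spread inside the RAG group (0.5884 to 0.6194) is larger than the gap between the RAG and standalone means, which is why the group averages are read as a tendency rather than a ranking.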

4.3 Retrieval-Conditioned Classification Behavior

Table 8 compares each standalone LLM with its RAG counterpart. RAG improved Macro F1 for Gemini, Claude, DeepSeek, and Llama, but reduced Macro F1 for GPT-4o-mini. The largest Macro F1 gain occurred for DeepSeek, while GPT-4o-mini showed a shift toward high misleading recall and weak not-misleading recall. This pattern is consistent with the audit concern that retrieval can redistribute errors rather than simply reduce them.

Table 8: Standalone-to-RAG changes by model family.

Model family	Delta Macro F1	Delta accuracy	Delta MCC	Delta misleading recall	Delta not-misleading recall
deepseek-chat-v3-0324	0.0465	0.0000	0.0675	-0.1113	0.2027
llama-3.3-70b-instruct	0.0211	0.0268	0.0437	0.0464	-0.0089
gemini-2.5-flash	0.0184	-0.0458	0.0080	-0.2140	0.2606
claude-haiku-4.5	0.0142	0.0190	0.0239	0.0379	-0.0156
gpt-4o-mini	-0.0122	0.0410	-0.0279	0.2090	-0.2651

Figure 3 visualizes the same family-level changes. The figure shows that RAG often improved Macro F1 when it also improved or preserved not-misleading recall. When retrieval strongly increased misleading recall while sharply reducing not-misleading recall, the result was not necessarily better balanced reliability. This finding addresses RQ4: retrieval helps under some model-family conditions but can reduce reliability when it pushes the model toward over-classifying claims as misleading.

Figure 3: Change in Macro F1 and class-level recall from standalone to RAG configurations.

The paired standalone-versus-RAG comparisons in Table 9 further show that several architecture changes produced statistically detectable differences in correctness. These tests should be interpreted cautiously: they indicate paired differences on the benchmark claims, not population-level inference or generalizable causal superiority. In practical terms, the tests confirm that retrieval often changed system behavior, but the metric deltas show whether those changes improved balanced reliability.

Table 9: Paired standalone-versus-RAG comparisons for all-class correctness.

Model family	Standalone correct, RAG incorrect	RAG correct, standalone incorrect	Chi-squared	p-value	Significant
gpt-4o-mini	179	231	6.3439	0.01180	Yes
gpt-4o-mini	195	203	0.1231	0.72600	No
gpt-4o-mini	169	199	2.2853	0.13100	No
gpt-4o-mini	156	203	5.8942	0.01520	Yes
gpt-4o-mini	186	221	2.8403	0.09190	No
gemini-2.5-flash	226	168	8.2462	0.00408	Yes
gemini-2.5-flash	212	176	3.1572	0.07560	No
gemini-2.5-flash	188	169	0.9076	0.34100	No
gemini-2.5-flash	197	166	2.4793	0.11500	No
claude-haiku-4.5	163	187	1.5114	0.21900	No
claude-haiku-4.5	161	202	4.4077	0.03580	Yes
claude-haiku-4.5	185	214	1.9649	0.16100	No
deepseek-chat-v3-0324	177	177	0.0028	0.95800	No
deepseek-chat-v3-0324	193	181	0.3235	0.56900	No
llama-3.3-70b-instruct	156	190	3.1474	0.07600	No

4.4 Retrieval-Effect Proxy Audit

The study could not compute direct retrieval hit rate or document-level evidence relevance because the archived benchmark outputs do not include retrieved passage logs, passage ranks, or relevance annotations. To avoid overclaiming, Table 10 reports a classification-level proxy: whether each RAG system corrected errors made by its standalone counterpart or introduced errors on claims the standalone system classified correctly. This table is not a substitute for retrieval-quality evaluation, but it provides a minimum audit of how retrieval access redistributed correctness in the available benchmark outputs.

Table 10: Classification-level retrieval-effect proxy audit.

Model family	RAG-corrected standalone errors	RAG-introduced errors	Both incorrect	Net corrected errors	Correction rate	Introduction rate
gpt-4o-mini	231	179	261	52	18.2%	14.1%
llama-3.3-70b-instruct	190	156	301	34	15.0%	12.3%
claude-haiku-4.5	187	163	299	24	14.8%	12.9%
deepseek-chat-v3-0324	177	177	268	0	14.0%	14.0%
gemini-2.5-flash	168	226	258	-58	13.3%	17.8%

Table 10 complements Table 8 by showing how correctness moved at the claim level rather than only in aggregate scores. Three RAG configurations corrected more standalone errors than they introduced, DeepSeek broke even, and Gemini introduced more errors than it corrected. Net corrections track accuracy rather than balanced reliability, however: GPT-4o-mini had the largest net correction (52) yet lost Macro F1, while Gemini had a net loss of 58 yet gained Macro F1. This supports the study’s revised contribution: retrieval augmentation redistributes reliability unevenly under ambiguity-sensitive claim conditions.

Table 11 lists illustrative cases where a RAG configuration introduced an error relative to its standalone counterpart. These examples should be read as failure probes, not as a representative qualitative sample. They show why future work should preserve retrieved passages and relevance judgments: without that evidence trail, the study can identify that retrieval-conditioned behavior changed, but it cannot determine whether the cause was poor retrieval, poor evidence integration, or an ambiguity in the benchmark label.

Table 11: Illustrative RAG-introduced error cases from the proxy audit.

Model family	Claim excerpt	True label	Standalone prediction	RAG prediction
gpt-4o-mini	Building a wall on the U.S.-Mexico border will take literally years.	not misleading	not misleading	misleading
gemini-2.5-flash	Ronald Reagan faced an even worse recession than the current one.	misleading	misleading	not misleading
claude-haiku-4.5	Wisconsin is on pace to double the number of layoffs this year.	misleading	misleading	not misleading
deepseek-chat-v3-0324	Says that Tennessee law requires that schools receive half of proceeds – $31 million per year – from a ha…	not misleading	not misleading	misleading
llama-3.3-70b-instruct	Over the past five years the federal government has paid out $601 million in retirement and disability bene…	not misleading	not misleading	misleading

4.5 Class-Level Error Tradeoffs

The most important class-level result is that retrieval augmentation changed the balance between misleading and not-misleading recall. For three of the five model families (GPT-4o-mini, Llama, and Claude), RAG increased misleading recall while reducing not-misleading recall; for DeepSeek and Gemini, the shift ran in the opposite direction. This is not a trivial tradeoff. In misinformation verification, false negatives allow misleading claims to pass, but false positives may incorrectly mark accurate or mostly accurate claims as misleading.

The results therefore do not support a simple claim that RAG “solves” verification. They support a more bounded claim: retrieval augmentation can improve balanced reliability for some model families, but its effect depends on whether the added evidence improves both sensitivity to misleading claims and restraint with not-misleading claims.

4.6 Error Agreement and Hard Cases

Table 12 summarizes how often systems failed on the same claims. Nearly half of the claims (48.2%) fall in the moderate or high error-agreement bands, which indicates that some claims are difficult across architectures, not merely difficult for one model. This directly addresses RQ3.

Table 12: Distribution of cross-system error agreement.

Error agreement band	Claims	Percent
No system error	216	17.0%
Low error agreement	441	34.8%
Moderate error agreement	358	28.3%
High error agreement	252	19.9%

Figure 4 gives a more granular view by plotting the number of claims against the number of systems that misclassified them. The figure shows whether errors are dispersed across many isolated claims or concentrated in a smaller set of difficult claims. Concentration at higher error counts is analytically important because it points to claim structures that may exceed the capacity of both parametric and retrieval-augmented verification under the current design.

Figure 4: Distribution of the number of systems misclassifying each claim.

Table 13 lists illustrative high-agreement hard cases. These claims should not be interpreted as a qualitative sample of all errors, but they show the kinds of statements that multiple systems found difficult. Several involve compressed context, numerical or temporal interpretation, or partial-truth framing, which aligns with $H_3$.

Table 13: Illustrative claims with the highest cross-system error agreement.

Claim excerpt	True label	Systems in error
Under Rosemary Lehmberg, the Travis County D.A.s office convened the grand jury that indicted Rick Perry.	misleading	11
In Massachusetts, Scott Brown pushed for a law to force women considering abortion – force them – to look at color …	not misleading	11
Hes the only candidate whos balanced budgets and brought jobs to Providence.	not misleading	11
Putting three Republicans in my Cabinet…is unprecedented.	not misleading	11
Says U.S. Sen. Ron Johnson led the fight to let polluters release unlimited amounts of carbon pollution and took near…	not misleading	11
Im well aware that medical marijuana is a recognized, medical, viable treatment for this sort of [pancreas] pain cond…	misleading	11
Property taxes have increased 20 percent under four years of Chris Christie.	not misleading	11
It is already in the law that there is a requirement to screen (refugees) for religion.	not misleading	11

4.7 Hypothesis Evaluation

$H_1$ is partially supported. RAG produced higher Macro F1 than standalone configurations for four of the five model families; the exception was GPT-4o-mini. The result therefore supports a conditional version of $H_1$: bounded-corpus RAG can improve balanced reliability, but improvement is model-family dependent.

$H_2$ is also partially supported. RAG improved misleading recall for three of the five model families (GPT-4o-mini, Llama, and Claude), but in each case the improvement came with reduced not-misleading recall; for DeepSeek and Gemini, misleading recall fell. This means retrieval may help identify misleading claims while increasing the risk of over-flagging not-misleading claims.

$H_3$ is supported at the level of error-pattern evidence. Claims with high cross-system error agreement show that some verification difficulty persists across architectures. The pattern is consistent with the framework’s expectation that ambiguity, missing context, and partial truth conditions reduce classification reliability.

4.8 Interpretive Limits of Benchmark Evaluation

The statistical tests in this chapter are benchmark diagnostics. They identify paired correctness differences among fixed systems on a fixed held-out claim set. They should not be interpreted as population inference about all possible models, all possible retrieval systems, or all misinformation verification contexts. The defensible interpretation is narrower: under the study’s bounded-corpus conditions, retrieval access changed classification behavior in measurable and uneven ways.

The binary label mapping also limits interpretation. It makes the benchmark reproducible, but it compresses partial truth, omitted context, and mixed factual status into two operational classes. This compression is not a minor coding detail; it helps explain why some claims remain difficult across architectures. A system may fail not because it lacks any factual signal, but because the benchmark label asks the model to reduce a context-dependent factual situation into a binary decision. This is one of the study’s central analytical findings: retrieval augmentation does not remove the structural ambiguity created by binary misinformation classification.

4.9 Results Synthesis

The main result is not that one architecture dominates all others but that retrieval augmentation changes verification behavior unevenly. RAG improved average balanced reliability and produced the strongest Macro F1 system, but it also redistributed class-level errors. The study therefore supports a bounded conclusion: retrieval-augmented verification is promising under controlled corpus conditions, but its value depends on retrieval quality, model-family behavior, and the ambiguity structure of the claims being verified. The strongest contribution is this redistribution finding: under ambiguity-sensitive claim conditions, retrieval augmentation shifts where reliability is gained and lost rather than simply improving verification across the board.

5 Conclusion and Implications

5.1 Summary of Findings

This study evaluated whether retrieval augmentation improves evidence-grounded classification reliability for short fact-checked English claims under bounded-corpus conditions. The results show that retrieval augmentation changed model behavior in meaningful but uneven ways. RAG configurations produced the strongest overall Macro F1 result and higher average architecture-level balanced reliability, but the improvement was not universal across model families. In some cases, retrieval improved misleading recall while weakening not-misleading recall, creating a tradeoff between catching misleading claims and avoiding over-flagging claims that were operationally not misleading.

The findings therefore support a conditional interpretation of retrieval augmentation. Bounded-corpus RAG can improve classification reliability, but it does not automatically make verification more reliable. Its value depends on the interaction among retrieval quality, model-family behavior, and claim difficulty.

5.2 Answers to the Research Questions

RQ1 asked how verification architecture type affects classification performance. The results show that architecture type matters, but not in a simple hierarchy. The traditional baseline remained competitive with several LLM configurations, standalone LLMs varied in class-level behavior, and RAG systems produced the strongest average Macro F1 and MCC. This indicates that evidence access changes verification performance, but architecture type alone does not determine reliability.

RQ2 asked whether retrieval augmentation improves balanced classification reliability. The answer is a qualified yes. RAG improved Macro F1 for four of the five model families and produced the top-ranked Macro F1 system. However, one RAG configuration reduced Macro F1 relative to its standalone counterpart, and several RAG systems shifted the balance between misleading and not-misleading recall. Retrieval augmentation should therefore be described as conditionally beneficial rather than generally superior.
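A comparison between a standalone configuration and its RAG counterpart is a paired comparison, because both systems label the same claims. The sketch below shows one common way to test such a pair, an exact McNemar test on the discordant claims. It is an illustration only: the function name `mcnemar_exact`, the 0/1 label coding, and the toy predictions are assumptions introduced here, and the test is not necessarily the one used in this study.

```python
from math import comb


def mcnemar_exact(gold, preds_a, preds_b):
    """Two-sided exact McNemar test on paired predictions.

    Returns (b, c, p_value), where b counts claims that only system A
    labeled correctly and c counts claims that only system B labeled correctly.
    """
    if not (len(gold) == len(preds_a) == len(preds_b)):
        raise ValueError("all sequences must have the same length")
    b = c = 0
    for y, pa, pb in zip(gold, preds_a, preds_b):
        a_ok, b_ok = pa == y, pb == y
        if a_ok and not b_ok:
            b += 1
        elif b_ok and not a_ok:
            c += 1
    n = b + c
    if n == 0:
        # No discordant claims: the two systems are indistinguishable here.
        return b, c, 1.0
    k = min(b, c)
    # Two-sided p-value under Binomial(n, 0.5), capped at 1.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Toy data (hypothetical, not study results): 1 = misleading, 0 = not misleading.
gold = [1, 1, 1, 1, 0, 0, 0, 0]
standalone = [0, 0, 0, 1, 0, 0, 0, 0]
rag = [1, 1, 1, 1, 1, 0, 0, 0]
print(mcnemar_exact(gold, standalone, rag))  # (1, 3, 0.625)
```

In the toy data, RAG corrects three missed misleading claims but introduces one false positive, which mirrors the recall shift described above. With only four discordant claims the difference is not statistically distinguishable.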

RQ3 asked which claims produce consistent classification difficulty. The error-agreement analysis shows that some claims were misclassified by multiple systems, suggesting persistent difficulty across architectures. These hard cases often reflect compressed context, numerical interpretation, temporal dependence, or partial-truth structure. This supports the framework’s expectation that ambiguity conditions moderate classification reliability.
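The logic of an error-agreement analysis can be sketched in a few lines: for each claim, count how many systems misclassified it, then flag claims that fail across most or all systems. The function names, system names, and toy labels below are hypothetical and serve only to illustrate the idea. They do not reproduce the study's data or its exact procedure.

```python
from collections import Counter


def error_counts(gold, predictions_by_system):
    """Return, for each claim index, the number of systems that misclassified it."""
    n_claims = len(gold)
    errors = [0] * n_claims
    for name, preds in predictions_by_system.items():
        if len(preds) != n_claims:
            raise ValueError(f"{name}: expected {n_claims} predictions, got {len(preds)}")
        for i, (y, p) in enumerate(zip(gold, preds)):
            if p != y:
                errors[i] += 1
    return errors


def hard_cases(errors, min_systems):
    """Indices of claims misclassified by at least `min_systems` systems."""
    return [i for i, e in enumerate(errors) if e >= min_systems]


# Toy data (hypothetical): M = misleading, N = not misleading.
gold = ["M", "N", "M", "N"]
predictions = {
    "baseline": ["M", "M", "N", "N"],
    "standalone_llm": ["M", "N", "N", "N"],
    "rag_llm": ["M", "M", "N", "N"],
}
errors = error_counts(gold, predictions)
print(errors)                  # [0, 2, 3, 0]
print(hard_cases(errors, 3))   # [2]
print(Counter(errors))         # how many claims fail in 0, 1, 2, ... systems
```

Claims flagged at the highest agreement level are candidates for qualitative inspection of the ambiguity conditions named above.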

RQ4 asked when retrieval improves or reduces reliability. Retrieval appears most useful when it improves evidence access without causing the model to over-classify claims as misleading. It becomes less reliable when the added context shifts the model toward high misleading recall at the cost of not-misleading recall. The key mechanism is not retrieval presence alone, but whether retrieved evidence is relevant, sufficient, and appropriately used.

5.3 Theoretical Implications

The study contributes to evidence-conditioned verification theory by separating three related but distinct constructs: evidence availability, evidence grounding opportunity, and classification reliability. RAG increases evidence availability by adding retrieved context, but this does not guarantee evidence grounding or reliable classification. The results therefore challenge the simplistic assumption that retrieval equals grounding.

For information science, the study supports an operationally careful account of automated verification as an evidence-use problem. Classification reliability is not the same as human source evaluation. This distinction strengthens the LIS contribution because it keeps the analysis tied to retrieval, evidence mediation, and information-system evaluation rather than unsupported claims about user trust.

5.4 Methodological Implications

The results reinforce the need for benchmark evaluation beyond accuracy. Accuracy alone would obscure the class-level tradeoffs observed in the RAG conditions. Macro F1, MCC, class-level recall, confusion matrices, paired comparisons, and error-agreement analysis together provide a more defensible account of system behavior.
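The point about accuracy can be made concrete with a small worked example. The sketch below computes accuracy, class-level recall, Macro F1, and MCC from a 2x2 confusion matrix using their standard definitions, with misleading treated as the positive class. The function name and the two confusion matrices are hypothetical and were chosen for illustration; they are not results from this benchmark.

```python
from math import sqrt


def binary_metrics(tp, fn, fp, tn):
    """Metrics from a 2x2 confusion matrix; 'misleading' is the positive class."""
    def ratio(num, den):
        return num / den if den else 0.0

    def f1(precision, recall):
        return ratio(2 * precision * recall, precision + recall)

    rec_pos = ratio(tp, tp + fn)   # recall on misleading claims
    rec_neg = ratio(tn, tn + fp)   # recall on not-misleading claims
    f1_pos = f1(ratio(tp, tp + fp), rec_pos)
    f1_neg = f1(ratio(tn, tn + fn), rec_neg)
    mcc_den = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": ratio(tp + tn, tp + fn + fp + tn),
        "recall_misleading": rec_pos,
        "recall_not_misleading": rec_neg,
        "macro_f1": (f1_pos + f1_neg) / 2,
        "mcc": ratio(tp * tn - fp * fn, mcc_den),
    }


# Two hypothetical systems evaluated on 100 misleading and 100 not-misleading claims.
over_flagging = binary_metrics(tp=90, fn=10, fp=40, tn=60)
balanced = binary_metrics(tp=75, fn=25, fp=25, tn=75)
for name, m in [("over_flagging", over_flagging), ("balanced", balanced)]:
    print(name, {k: round(v, 3) for k, v in m.items()})
```

Both hypothetical systems reach an accuracy of 0.75, yet the first has recall of 0.90 on misleading claims and 0.60 on not-misleading claims, while the second has 0.75 on both. Only the class-level and balanced metrics reveal that difference.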

The study also shows why RAG evaluation should include both improvement and harm pathways. A retrieval-augmented system may correct false negatives while introducing false positives. Publication-ready RAG evaluation should therefore report not only whether retrieval improves aggregate metrics, but also how it redistributes errors across classes and claim types. The proxy audit used here is a minimum step; stronger studies should retain retrieved passages, ranks, and relevance judgments so retrieval failures can be separated from evidence-integration failures.
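A minimal sketch of what such retained retrieval logs could look like is given below, together with a hit-rate calculation and a simple rule for separating the two failure types. The record fields, function names, and toy log are assumptions made for illustration, since the archived outputs of this study contain no passage logs. Relevance judgments are assumed to come from human or proxy annotation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RetrievedPassage:
    claim_id: str
    passage_id: str
    rank: int                  # 1 = top-ranked passage
    relevant: Optional[bool]   # None until a relevance judgment exists


def hit_rate_at_k(log, k):
    """Share of claims with judged passages that have a relevant passage in the top k."""
    hit_by_claim = {}
    for row in log:
        if row.relevant is None:
            continue  # unjudged passages cannot count for or against a claim
        hit = row.relevant and row.rank <= k
        hit_by_claim[row.claim_id] = hit_by_claim.get(row.claim_id, False) or hit
    if not hit_by_claim:
        return None
    return sum(hit_by_claim.values()) / len(hit_by_claim)


def failure_type(prediction_correct, relevant_retrieved):
    """Separate retrieval failures from evidence-integration failures."""
    if prediction_correct:
        return "correct"
    return "evidence_integration_failure" if relevant_retrieved else "retrieval_failure"


# Toy log (hypothetical identifiers and judgments).
log = [
    RetrievedPassage("c1", "p10", 1, False),
    RetrievedPassage("c1", "p11", 2, True),
    RetrievedPassage("c2", "p20", 1, False),
    RetrievedPassage("c2", "p21", 2, False),
    RetrievedPassage("c3", "p30", 1, None),
]
print(hit_rate_at_k(log, 1))  # 0.0
print(hit_rate_at_k(log, 2))  # 0.5
```

Under this scheme, a wrong label on claim c1 would count as an evidence-integration failure, because a relevant passage was retrieved, whereas a wrong label on claim c2 would count as a retrieval failure.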

5.5 Practical Implications

For designers of AI-assisted verification systems, the findings suggest that retrieval augmentation should be implemented with evidence auditing and uncertainty signaling. A system should not merely retrieve passages and produce a label; it should expose retrieval provenance, allow inspection of retrieved evidence, and communicate uncertainty when evidence is incomplete or ambiguous.
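One way to read this recommendation is that a verification output should be a structured record rather than a bare label. The sketch below is a hypothetical design, not a description of the systems benchmarked here: the class names, fields, confidence score, and the 0.7 threshold are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Evidence:
    passage_id: str
    source: str   # provenance shown to the reviewer
    rank: int
    text: str


@dataclass
class VerificationOutput:
    claim: str
    label: str            # "misleading" or "not_misleading"
    confidence: float     # hypothetical score in [0, 1]
    evidence: List[Evidence] = field(default_factory=list)

    def uncertainty_note(self, min_confidence: float = 0.7) -> Optional[str]:
        """Return a reviewer-facing caution, or None when no caution applies."""
        if not self.evidence:
            return "No passages were retrieved; treat the label as unsupported."
        if self.confidence < min_confidence:
            return "Evidence was retrieved, but confidence is low; human review advised."
        return None


# Hypothetical example output for a reviewer-facing interface.
output = VerificationOutput(
    claim="Example claim text",
    label="misleading",
    confidence=0.55,
    evidence=[Evidence("p11", "bounded corpus", 1, "Example passage text")],
)
print(output.uncertainty_note())
```

Keeping the retrieved passages and their provenance inside the output record lets a reviewer inspect the evidence that accompanied each label.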

For libraries, educators, and information professionals, the study provides a cautious framework for evaluating AI-assisted verification tools. RAG systems may support evidence-aware workflows, but they should not be treated as substitutes for critical source evaluation, domain expertise, or human review. The most defensible use case is human-in-the-loop verification, where retrieved evidence and model predictions become inputs to review rather than final authority.

5.6 Limitations

The study has several limitations. First, it uses short English claims and does not evaluate multilingual, Filipino-language, multimodal, or platform-native misinformation. Second, the binary label mapping improves comparability but compresses the original truthfulness categories, especially partial-truth cases. Third, the bounded corpus improves reproducibility but does not represent the volatility and breadth of open-web retrieval. Fourth, the archived benchmark outputs do not include retrieved-passage logs, so the study reports a retrieval-effect proxy rather than direct hit-rate, relevance, or evidence-sufficiency measures. Fifth, the study evaluates classification outputs rather than explanation faithfulness, so it cannot determine whether a model used retrieved evidence in a fully faithful way.

The study is also limited by possible benchmark contamination, prompt sensitivity, and model update effects. Public fact-checking claims may have appeared in model training data, fixed prompts may favor some systems over others, and externally hosted models may change over time. These limitations do not invalidate the benchmark, but they constrain the scope of inference.

5.7 Future Research

Future research should evaluate retrieval quality directly by measuring evidence relevance, retrieval hit rate, and evidence sufficiency. It should also test prompt sensitivity by comparing deterministic, evidence-focused, and uncertainty-aware prompts. A stronger next-stage design would pair classification metrics with evidence-faithfulness assessment to determine whether RAG systems are merely improving labels or actually grounding their decisions.

Future studies should also extend the benchmark to Filipino, Philippine English, and multilingual claims, as well as to claims from non-political domains such as health, disaster risk, education, and public policy. Finally, user-centered studies are needed to examine how people interpret AI verification outputs, how retrieved evidence affects reliance on those outputs, and how human reviewers incorporate model predictions into source evaluation.

5.8 Final Conclusion

This study reframes AI-assisted claim verification as a bounded evidence-conditioned classification problem rather than a broad claim about automated source evaluation. The results show that retrieval augmentation can improve balanced reliability, but its effects are uneven and tradeoff-dependent. RAG is best understood as a potentially useful verification architecture whose reliability must be audited through balanced metrics, retrieval-effect analysis, and hard-case evaluation. Under bounded-corpus conditions, retrieval-supported verification is promising, but it remains an evaluative and human-accountable information process rather than an automated solution to misinformation.

References

Bender, E. M., Gebru, T., McMillan-Major, A., & Shmitchell, S. (2021). On the dangers of stochastic parrots: Can language models be too big? Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, 610–623. https://doi.org/10.1145/3442188.3445922

Boughorbel, S., Jarray, F., & El-Anbari, M. (2017). Optimal classifier for imbalanced data using Matthews correlation coefficient metric. PLOS ONE, 12(6), e0177678. https://doi.org/10.1371/journal.pone.0177678

Chen, J., Lin, H., Han, X., & Sun, L. (2024). Benchmarking large language models in retrieval-augmented generation. Proceedings of the AAAI Conference on Artificial Intelligence, 38(16), 17754–17762. https://doi.org/10.1609/aaai.v38i16.29728

Chicco, D., & Jurman, G. (2020). The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC Genomics, 21(1). https://doi.org/10.1186/s12864-019-6413-7

Choi, W., & Stvilia, B. (2015). Web credibility assessment: Conceptualization, operationalization, variability, and models. Journal of the Association for Information Science and Technology, 66(12), 2399–2414. https://doi.org/10.1002/asi.23543

Dror, R., Baumer, G., Shlomov, S., & Reichart, R. (2018). The hitchhiker’s guide to testing statistical significance in natural language processing. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, 1383–1392. https://doi.org/10.18653/v1/P18-1128

Floridi, L. (2010). Information: A very short introduction. Oxford University Press.

Gao, Y., Xiong, Y., Gao, X., et al. (2023). Retrieval-augmented generation for large language models: A survey. arXiv Preprint arXiv:2312.10997. https://doi.org/10.48550/arXiv.2312.10997

Guo, Z., Schlichtkrull, M., & Vlachos, A. (2022). A survey on automated fact-checking. Transactions of the Association for Computational Linguistics, 10, 178–206. https://doi.org/10.1162/tacl_a_00454

Hastie, T., Tibshirani, R., & Friedman, J. (2009). The elements of statistical learning: Data mining, inference, and prediction (2nd ed.). Springer.

Hilligoss, B., & Rieh, S. Y. (2008). Developing a unifying framework of credibility assessment: Construct, heuristics, and interaction in context. Information Processing & Management, 44(4), 1467–1484. https://doi.org/10.1016/j.ipm.2007.10.001

Hjørland, B. (2012). Knowledge organization = information organization? Knowledge Organization, 39(3), 154–159.

Izacard, G., & Grave, E. (2021). Leveraging passage retrieval with generative models for open domain question answering. Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics, 874–880. https://doi.org/10.18653/v1/2021.eacl-main.74

Ji, Z., Lee, N., Frieske, R., Yu, T., Su, D., Xu, Y., Ishii, E., Bang, Y. J., Madotto, A., & Fung, P. (2023). Survey of hallucination in natural language generation. ACM Computing Surveys, 55(12), 1–38. https://doi.org/10.1145/3571730

Jiang, Z., Xu, F. F., Gao, L., Sun, Z., Liu, Q., Dwivedi-Yu, J., Yang, Y., Callan, J., & Neubig, G. (2023). Active retrieval augmented generation. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 7969–7992. https://doi.org/10.18653/v1/2023.emnlp-main.495

Kandel, N. (2020). Information disorder syndrome and its management. Journal of Nepal Medical Association, 58(224). https://doi.org/10.31729/jnma.4968

Karpukhin, V., Oğuz, B., Min, S., Lewis, P., Wu, L., Edunov, S., Chen, D., & Yih, W. (2020). Dense passage retrieval for open-domain question answering. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, 6769–6781. https://doi.org/10.18653/v1/2020.emnlp-main.550

Kuhn, M., & Johnson, K. (2013). Applied predictive modeling. Springer.

Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W., Rocktäschel, T., Riedel, S., & Kiela, D. (2020). Retrieval-augmented generation for knowledge-intensive NLP tasks. arXiv Preprint arXiv:2005.11401. https://doi.org/10.48550/arXiv.2005.11401

Metzger, M. J. (2007). Making sense of credibility on the web: Models for evaluating online information and recommendations for future research. Journal of the American Society for Information Science and Technology, 58(13), 2078–2091. https://doi.org/10.1002/asi.20672

Mökander, J., Schuett, J., Kirk, H. R., & Floridi, L. (2023). Auditing large language models: A three-layered approach. AI and Ethics, 4(4), 1085–1115. https://doi.org/10.1007/s43681-023-00289-2

Nakov, P., Corney, D., Hasanain, M., Alam, F., Elsayed, T., Barrón-Cedeño, A., Papotti, P., Shaar, S., & Da San Martino, G. (2021). Automated fact-checking for assisting human fact-checkers. Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, 4551–4558. https://doi.org/10.24963/ijcai.2021/619

Nakov, P., Da San Martino, G., Elsayed, T., Barrón-Cedeño, A., Miguez, R., Shaar, S., Alam, F., Haouari, F., Hasanain, M., Babulkov, N., et al. (2021). Overview of the CLEF–2021 CheckThat! Lab on detecting check-worthy claims, previously fact-checked claims, and fake news. In Experimental IR meets multilinguality, multimodality, and interaction (pp. 264–291). Springer International Publishing. https://doi.org/10.1007/978-3-030-85251-1_19

Ribeiro, M. T., Wu, T., Guestrin, C., & Singh, S. (2020). Beyond accuracy: Behavioral testing of NLP models with CheckList. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 4902–4912. https://doi.org/10.18653/v1/2020.acl-main.442

Rieh, S. Y. (2002). Judgment of information quality and cognitive authority in the web. Journal of the American Society for Information Science and Technology, 53(2), 145–161. https://doi.org/10.1002/asi.10017

Savolainen, R. (2007). Media credibility and cognitive authority: The case of seeking orienting information. Information Research, 12(3).

Sokolova, M., & Lapalme, G. (2009). A systematic analysis of performance measures for classification tasks. Information Processing & Management, 45(4), 427–437. https://doi.org/10.1016/j.ipm.2009.03.002

Sundar, S. S. (2008). The MAIN model: A heuristic approach to understanding technology effects on credibility. In M. J. Metzger & A. J. Flanagin (Eds.), Digital media, youth, and credibility (pp. 73–100). MIT Press.

Thorne, J., Vlachos, A., Christodoulopoulos, C., & Mittal, A. (2018). FEVER: A large-scale dataset for fact extraction and VERification. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics, 809–819. https://doi.org/10.18653/v1/N18-1074

Wang, W. Y. (2017). “Liar, Liar Pants on Fire”: A new benchmark dataset for fake news detection. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, 422–426. https://doi.org/10.18653/v1/P17-2067

Wardle, C., & Derakhshan, H. (2017). Information disorder: Toward an interdisciplinary framework for research and policy making. Council of Europe. https://rm.coe.int/information-disorder-toward-an-interdisciplinary-framework-for-researc/168076277c

Weidinger, L., Mellor, J., Rauh, M., Griffin, C., Uesato, J., Huang, P.-S., Cheng, M., Glaese, A., Balle, B., Kasirzadeh, A., Kenton, Z., Brown, S., Hawkins, W., Stepleton, T., Birhane, A., Haas, J., Rimell, L., Hendricks, L. A., Isaac, W., … Gabriel, I. (2021). Ethical and social risks of harm from language models. arXiv Preprint arXiv:2112.04359. https://doi.org/10.48550/arXiv.2112.04359

Yue, X., Wang, B., Chen, Z., Zhang, K., Su, Y., & Sun, H. (2023). Automatic evaluation of attribution by large language models. Findings of the Association for Computational Linguistics: EMNLP 2023, 4615–4635. https://doi.org/10.18653/v1/2023.findings-emnlp.307