Zero-Shot Large Language Models as High-Recall Triage Systems for Election Monitoring: Evidence from VoteReportPH During the 2025 Philippine Elections

Authors and Affiliations

Lanz Anthonee A. Lagman

UP Data Science Society, College of Science Administration Building, University of the Philippines - Diliman, Quezon City

Computational Science Research Center, College of Science Administration Building, University of the Philippines - Diliman, Quezon City

Jess Vincent Ybut

UP Data Science Society, College of Science Administration Building, University of the Philippines - Diliman, Quezon City

National Institute of Physics, Address

Hershel Mikhail Fajardo

UP Data Science Society, College of Science Administration Building, University of the Philippines - Diliman, Quezon City

AI Program, College of Engineering, University of the Philippines - Diliman, Quezon City

Juan Carlos Cruz

UP Data Science Society, College of Science Administration Building, University of the Philippines - Diliman, Quezon City

Dan Anthony Dorado

UP Data Science Society, College of Science Administration Building, University of the Philippines - Diliman, Quezon City

UP School of Library and Information Studies

Ian Angelo Aragoza

Computer Professionals’ Union, Philippines

Andresito de Guzman

PWA Pilipinas, Philippines

Abstract

Election monitoring increasingly depends on the capacity to process high-volume citizen reports, social media posts, and platform-based submissions under conditions of uncertainty. This study evaluates the conditions under which a zero-shot large language model (LLM) pipeline can function as a high-recall triage system for election monitoring using VoteReportPH data from the 2025 Philippine elections. Drawing on signal detection theory, information overload theory, and human-AI complementarity, the study frames LLM classification as decision support rather than autonomous adjudication and formalizes triage performance as an error-cost problem. The analysis used a postprocessed Election Monitoring System dataset of 3,618 reports and a cleaned model-evaluation dataset of 4,158 reports. For binary validity detection, the model correctly surfaced 166 of 181 human-validated reports, yielding a recall of 0.9171, specificity of 0.8973, and accuracy of 0.8983, but low precision of 0.3198. Bootstrap intervals indicate that the high-recall result is substantively stable, while comparison with a keyword baseline shows that the zero-shot model’s advantage is concentrated in reducing false negatives rather than improving overall accuracy. In multiclass incident categorization, the model performed strongly on explicit categories such as automated counting machine errors and illegal campaigning, but weakly on rare, residual, and procedurally ambiguous categories. The findings show that zero-shot LLMs can support civic monitoring as triage infrastructure when missed violations are weighted more heavily than review burden, but they require human verification, uncertainty routing, transparent error handling, and category-specific workflow design.

Keywords

election monitoring, large language models, civic AI, zero-shot classification, human-AI collaboration, signal detection theory, information overload, computational social science, Philippines, VoteReportPH

Introduction

Election monitoring depends on the capacity to convert dispersed public observations into verifiable civic information. In the Philippine context, this work carries high democratic stakes because election-related irregularities do not appear only as formal complaints. They also circulate through posts, messages, photographs, videos, and citizen reports produced during the compressed time frame of election day. Volunteer monitoring groups must therefore process a stream of heterogeneous information while distinguishing actionable reports from political commentary, duplicate posts, vague claims, and non-violative public discourse.

This creates an information filtering problem. Information overload occurs when the volume or complexity of incoming information exceeds the processing capacity of decision-makers (Eppler & Mengis, 2004; Roetzel, 2019). In election monitoring, the overload problem has two consequences. First, valid reports may be missed if volunteers cannot process incoming reports quickly enough. Second, limited volunteer attention may be spent on non-actionable content rather than on reports that require verification, escalation, or public communication. The central methodological problem is not only whether an automated system can classify reports accurately, but whether it can help prioritize human attention under conditions of uncertainty.

Large language models (LLMs) offer one possible response to this filtering problem. Recent work on LLM-based text classification shows that these models can perform zero-shot and few-shot classification tasks using prompts rather than large task-specific training datasets (Chae & Davidson, 2023; Davidson & Chae, 2025). This feature matters for civic monitoring because volunteer organizations often lack the labeled data, computational infrastructure, and annotation capacity required for conventional supervised machine learning. Zero-shot classification can support rapid deployment, especially when incident categories already exist as an operational taxonomy. Yet this advantage also introduces risks. Model outputs may reflect prompt design, class imbalance, ambiguity in source text, and the uneven linguistic distribution of social media data.

This study examines those tensions through the VoteReportPH election monitoring pipeline for the 2025 Philippine elections. VoteReportPH receives election-related reports from public and partner channels, then transforms these reports into structured categories for monitoring and communication. The pipeline evaluated in this article uses a zero-shot LLM classification process to detect and categorize potential election-related incidents. The practical goal is not to replace volunteer judgment. It is to assess the conditions under which an LLM can function as a high-recall triage layer that preserves valid reports for human verification while keeping review burden within defensible limits.

The article frames this problem through three linked bodies of literature. First, information overload theory explains why volunteer monitoring systems face bottlenecks when the rate of incoming information exceeds available human processing capacity (Eppler & Mengis, 2004; Jackson & Farzaneh, 2012; Roetzel, 2019). Second, signal detection theory provides a language for evaluating the trade-off between missed violations and false alarms (Green & Swets, 1966). In this setting, false negatives represent potentially missed civic harms, while false positives represent additional verification work. Third, human-AI collaboration research stresses that AI systems should be evaluated not only by standalone accuracy but by how their outputs interact with human judgment, institutional workflows, and complementary sources of knowledge (Holstein et al., 2022; Vössing et al., 2022).

The contribution of this study is fourfold. First, it provides an empirical evaluation of a zero-shot LLM pipeline using election monitoring data from the Philippines. Second, it reframes automated election report classification as a triage problem rather than a replacement problem. Third, it formalizes triage performance as a decision problem in which missed violations and false positives carry different civic costs. Fourth, it examines performance under class imbalance, where rare but important violations may be difficult to evaluate statistically.

The guiding argument is that LLMs may hold value in civic monitoring when they operate as recall-oriented filters within a human-in-the-loop system. Their value depends less on perfect classification and more on whether their error profile fits the ethical and operational demands of election monitoring. A system that misses few valid reports may support situational awareness, but a system that generates many false positives may still burden volunteers. The research problem is therefore not simply whether zero-shot LLMs “work.” It is: under what conditions do zero-shot LLMs produce acceptable error trade-offs for human-in-the-loop triage systems in high-noise civic data environments?

The analysis is organized around five research questions. RQ1 asks how valid and non-valid reports are distributed in the monitoring corpus. RQ2 asks what recall-precision trade-off the zero-shot LLM produces for binary triage. RQ3 asks how that performance compares with a transparent keyword-based baseline. RQ4 asks what patterns explain variation in performance across incident categories. RQ5 asks how triage evaluation changes when missed violations and false positives are assigned different expected civic costs.

Theory and Literature Review

Election Monitoring and Civic Data Infrastructures

Election monitoring has historically relied on formal observation missions, institutional reporting, and structured documentation of electoral processes. Recent developments in digital communication have expanded this model by enabling citizen-based reporting through online platforms and social media. These systems allow individuals to contribute real-time observations, thereby increasing the geographic and temporal coverage of monitoring activities. Evidence from election monitoring initiatives shows that digital platforms can support citizen oversight and increase participation in reporting irregularities, though their effectiveness depends on participation rates, access to technology, and the ability to process incoming data at scale (Garbiras-Díaz & Montenegro, 2022; Tsandzana, 2019).

Crowdsourced reporting systems, such as crisis mapping platforms, demonstrate how citizen-generated data can be aggregated, structured, and visualized to support decision-making during high-stakes events. These systems rely on combining heterogeneous inputs into actionable information streams, often under conditions of uncertainty and time pressure (Norheim-Hagtun & Meier, 2010). However, the shift toward digital reporting introduces a core tension: while the volume of available information increases, the capacity to interpret and verify that information does not scale at the same rate. This creates a structural bottleneck in civic data infrastructures, where the challenge lies not only in data collection but in effective filtering and prioritization.

Large Language Models for Text Classification

Large language models (LLMs) represent a recent development in natural language processing that has transformed approaches to text classification. Unlike traditional supervised machine learning systems, which require labeled training data, LLMs can perform classification tasks using natural language prompts through zero-shot or few-shot learning. This allows models to generalize across domains and perform classification without task-specific training (Chae & Davidson, 2023; Davidson & Chae, 2025).

Empirical studies show that LLMs can achieve competitive performance in a wide range of classification tasks, including sentiment analysis, stance detection, and content categorization. Their ability to process unstructured text and adapt to new classification schemas makes them attractive for applications where labeled datasets are limited or evolving. In social media contexts, zero-shot classification has been used to extract meaningful information from posts without extensive annotation pipelines, highlighting its relevance for real-time monitoring tasks.

Despite these advantages, LLM-based classification introduces methodological challenges. Performance depends on prompt design, category definitions, and the clarity of textual input. Zero-shot models may struggle with ambiguous language, overlapping categories, and rare classes, particularly in domains where semantic distinctions are subtle or context-dependent. These limitations are critical in civic monitoring, where classification errors can have operational and ethical consequences.

Human-AI Decision Systems

The integration of AI systems into decision-making processes has shifted research attention from standalone model performance to human-AI collaboration. In such systems, AI outputs function as inputs to human judgment rather than as final decisions. The effectiveness of these systems depends on how well human and machine capabilities complement each other.

Research on human-AI collaboration highlights that AI systems and human decision-makers often have access to different types of information and exhibit different strengths. AI systems can process large volumes of structured and unstructured data, while humans can interpret context, ambiguity, and domain-specific nuances (Holstein et al., 2022). Effective collaboration requires aligning these complementary capabilities rather than assuming that AI outputs can replace human reasoning.

Designing for human-AI complementarity involves addressing issues of transparency, trust, and interaction. Decision support systems must present outputs in a way that allows humans to interpret and act on them without overreliance or blind acceptance (Vössing et al., 2022). In high-stakes domains, including civic monitoring, this implies that AI systems should support human workflows by prioritizing information, highlighting uncertainty, and enabling verification rather than automating final judgments.

Information Overload and Filtering

Information overload theory provides a framework for understanding the limitations of human information processing in environments characterized by high data volume and complexity. Information overload occurs when the amount of information exceeds an individual’s capacity to process it effectively, leading to reduced decision quality and increased cognitive strain (Eppler & Mengis, 2004; Jackson & Farzaneh, 2012).

In digital environments, the problem is intensified by continuous data streams, heterogeneous formats, and varying levels of information quality. Social media platforms exemplify these conditions, where relevant signals are embedded within large amounts of noise. Studies show that when information load surpasses processing capacity, individuals may rely on heuristics, ignore relevant data, or fail to identify critical signals (Roetzel, 2019).

In election monitoring, information overload manifests as the difficulty of identifying valid reports among large volumes of non-violative content. This creates a need for filtering mechanisms that can reduce the volume of information presented to human reviewers without discarding relevant signals. The effectiveness of such mechanisms depends on their ability to balance coverage and selectivity.

Signal Detection Theory and Error Trade-Offs

Signal detection theory (SDT) provides a formal framework for analyzing decision-making under uncertainty by distinguishing between signals (relevant information) and noise (irrelevant information). In classification tasks, SDT characterizes performance through four outcomes: true positives, false positives, true negatives, and false negatives (Green & Swets, 1966; Lynn & Barrett, 2014).

This framework is particularly useful for evaluating automated classification systems in civic monitoring contexts. False negatives represent missed valid reports, which may correspond to unreported electoral violations. False positives represent non-violative content incorrectly flagged as relevant, which increases the workload for human reviewers. The relative cost of these errors depends on the operational context. In election monitoring, missing a valid violation may carry greater consequences than reviewing additional non-violative reports.

SDT emphasizes that classification systems operate under trade-offs between sensitivity (recall, the true positive rate) and specificity (the true negative rate). Adjusting decision thresholds can increase detection rates at the cost of higher false positives, or reduce false positives at the cost of missed detections. The choice of operating point reflects the priorities of the system. In high-stakes monitoring environments, systems may be designed to prioritize recall to ensure that critical events are not missed, even if this increases verification workload.

This study translates that trade-off into a simple expected-cost framework. Let \(FN\) denote the number of valid incidents missed by the triage system and \(FP\) the number of non-valid reports forwarded for review. Triage cost can be represented as:

\[C = \alpha \cdot FN + \beta \cdot FP\]

where \(\alpha\) is the civic cost assigned to a missed violation and \(\beta\) is the review cost assigned to a false positive. In election monitoring, \(\alpha\) is plausibly greater than \(\beta\), because missing a valid incident may carry democratic or rights-related consequences while a false positive primarily increases verification workload. This does not mean false positives are harmless. It means the preferred operating point depends on the relative weighting of civic harm and reviewer burden. A model is therefore triage-effective only if its error profile lowers expected cost under a defensible range of \(\alpha / \beta\) values.
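To make the weighting concrete, the cost function can be computed directly. The following R sketch (R being the language of the study's analysis workflow) applies the formula to the false negative and false positive counts reported later in Table 6; the function and object names are illustrative, not part of the production pipeline.

```r
# Expected triage cost C = alpha * FN + beta * FP for two candidate systems.
# Error counts are taken from Table 6; names are illustrative.
triage_cost <- function(fn, fp, alpha, beta = 1) {
  alpha * fn + beta * fp
}

alphas <- c(1, 5, 10)
data.frame(
  alpha     = alphas,
  zero_shot = triage_cost(fn = 15, fp = 353, alpha = alphas),
  keyword   = triage_cost(fn = 39, fp = 332, alpha = alphas)
)
```

With equal weights the two systems are nearly tied; as \(\alpha\) grows, the lower false negative count dominates the comparison.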

Conceptual Framework

This study integrates the preceding theoretical perspectives into a conceptual model of election monitoring as a human-AI triage system. Social media and citizen reports generate a continuous stream of unstructured information. A zero-shot LLM classification layer processes this stream to identify potential election-related incidents. The output of this layer is then reviewed by human volunteers, who verify, contextualize, and finalize classifications.

The effectiveness of this system depends on the interaction between three components:

  1. Information filtering: the ability of the LLM to reduce the volume of data while retaining relevant reports.
  2. Error trade-offs: the balance between false positives and false negatives as defined by signal detection theory.
  3. Human-AI complementarity: the alignment between automated classification and human verification processes.

Figure 1: Conceptual framework for evaluating zero-shot LLMs as high-recall triage systems in election monitoring.

Figure 1 presents the conceptual framework guiding this study. The framework positions election monitoring as a human-in-the-loop information filtering process rather than a fully automated classification task. It begins with social media and citizen reports, which enter the monitoring system as high-volume, heterogeneous, and noisy inputs. These reports may contain genuine election incidents, vague allegations, partisan commentary, duplicated posts, or incomplete descriptions. This first stage reflects the problem of information overload, where the quantity and complexity of incoming information can exceed the processing capacity of human reviewers (Eppler & Mengis, 2004; Roetzel, 2019).

The second stage introduces the zero-shot LLM as a triage layer. In this framework, the model’s role is to identify potentially relevant reports and assign preliminary incident categories. The model does not serve as the final arbiter of election violations. Instead, it functions as a filtering mechanism that redistributes attention toward reports that may require review. This design follows the logic of signal detection theory: the system must distinguish signal from noise while managing the consequences of false positives and false negatives (Green & Swets, 1966; Lynn & Barrett, 2014). In election monitoring, false negatives are especially costly because they represent valid incidents that the system fails to surface. False positives also matter because they increase the verification burden placed on volunteers.

The third stage centers on human volunteer verification. This stage reflects the principle of human-AI complementarity. LLMs can process large quantities of text at speed, but human reviewers retain responsibility for contextual judgment, local interpretation, and final classification. Human reviewers can assess ambiguity, procedural nuance, place-specific references, and the political meaning of reports in ways that automated systems may not reliably capture (Holstein et al., 2022; Vössing et al., 2022).

The final stage produces structured civic intelligence. Reports that pass through human verification can support dashboards, public communication, escalation, and post-election analysis. The framework therefore treats AI classification as part of a broader civic data infrastructure. Its value depends not only on accuracy but on whether its error profile fits the ethical and operational demands of election monitoring. A high-recall system may help preserve situational awareness, but only if its false positives remain manageable and its uncertain cases remain visible to human reviewers.

Figure 1 therefore frames the study’s empirical analysis. The evaluation asks whether the zero-shot LLM pipeline can operate as a recall-oriented triage mechanism under conditions of information overload, class imbalance, and human verification constraints.

Methodology

Research Design

This study uses an observational comparative evaluation design to assess whether a zero-shot large language model (LLM) pipeline can operate as a high-recall triage layer for election monitoring. The study does not estimate the causal effect of AI use on electoral integrity, volunteer performance, or institutional response. Rather, it evaluates the correspondence between AI-generated classifications and human validation outputs within the VoteReportPH 2025 election monitoring workflow.

The unit of analysis is the individual election-related report. Each report may contain social media text, webform content, SMS-submitted information, partner-provided records, or related metadata. The primary empirical task is to evaluate whether the AI pipeline can distinguish potentially actionable election-related reports from non-actionable content and, where applicable, assign incident categories consistent with volunteer classifications.

The design treats human validation as the operational benchmark. This benchmark should not be interpreted as an error-free ground truth. Volunteer decisions were made within an applied monitoring workflow, and they may reflect time pressure, incomplete information, inconsistent category boundaries, or validator disagreement. The study therefore evaluates model-human alignment rather than absolute truth.

This design follows the conceptual framework introduced in Figure 1. The LLM is evaluated not as an autonomous adjudicator of election violations but as a triage mechanism operating within a human-in-the-loop civic monitoring system. The Results section operationalizes this design in five ways: corpus characterization in Table 1, binary validity evaluation in Table 4 and Figure 3, comparison with a transparent keyword baseline in Table 6, uncertainty-aware routing in Table 8, and multiclass category evaluation in Table 13 and Figure 6.

Data Sources and Data Lineage

The analysis uses two related analytical corpora drawn from the VoteReportPH monitoring workflow. The first corpus represents the broader stream of reports processed by the Election Monitoring System after automated postprocessing. It is used to evaluate whether the model can separate reports that require human attention from reports that are non-actionable or insufficiently supported. This corpus supports the binary validity analysis, descriptive label distributions, and workload-reduction estimate.

The second corpus represents the subset of reports for which model-generated incident labels could be compared with volunteer-assigned incident labels. It is used to evaluate model-human alignment in multiclass incident categorization. This corpus supports the volunteer-confirmed incident distribution, overall multiclass metrics, per-class performance estimates, and error analysis.

The two corpora should not be treated as duplicates. They answer different analytical questions and reflect different stages of the monitoring workflow. The broader postprocessed corpus is appropriate for triage and validity analysis because it preserves operational validity decisions. The comparison corpus is appropriate for incident-category evaluation because it contains paired AI and volunteer labels. The difference in sample size reflects workflow sequence and variable availability rather than a contradiction in the data.

Variables

The analysis uses four main variable groups, described here conceptually because the underlying monitoring data are not publicly distributed.

First, report text refers to the citizen report, social media post, webform entry, partner record, or related textual description used as the basis for classification. Report text may originate from different intake channels and may vary in length, completeness, language, and evidentiary detail. The current analysis focuses on classification outputs rather than text-level predictors.

Second, human validity status refers to whether VoteReportPH validators treated a report as valid within the applied monitoring workflow. This measure serves as the operational benchmark for binary validity detection. Its distribution is reported in Table 2, which shows that valid reports make up a small minority of the broader analysis corpus.

Third, AI-generated incident classification refers to the incident label assigned by the zero-shot LLM pipeline. For binary validity detection, this label is recoded into an AI validity prediction. Reports labeled as non-violative or insufficiently supported are treated as AI-predicted invalid or non-actionable. Reports assigned any substantive incident category are treated as AI-predicted valid. The distribution of original AI labels is reported in Table 3 and Figure 2.

Fourth, volunteer-assigned incident classification refers to the category assigned by human reviewers during the VoteReportPH validation process. Volunteer labels are used as the comparison benchmark for multiclass classification. Because some reports may involve multiple categories, the first listed volunteer category is treated as the primary category for single-label evaluation. This decision makes volunteer labels comparable with the AI pipeline, which returns one principal incident category. The resulting volunteer-confirmed incident distribution is shown in Table 11.

Data Preparation

The data preparation process followed four steps.

First, both analytical corpora were loaded and inspected for sample size, completeness, and suitability for the relevant evaluation task. This step established the role of each corpus, as reflected in Table 1.

Second, AI-generated labels were standardized by trimming whitespace and converting missing or empty labels into explicit missing values. This was necessary because label distributions form the basis of Table 3 and Figure 2.

Third, the binary AI validity prediction was derived from the model’s incident label. The recoding follows the logic of the monitoring workflow: labels indicating no violation or insufficient information are treated as non-actionable predictions, while all substantive incident categories are treated as actionable predictions. This recoding produces the confusion matrix reported in Table 4 and Figure 3.
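A minimal sketch of this recoding rule is shown below. The data frame, labels, and column names are illustrative, not the actual VoteReportPH schema.

```r
# Recode AI incident labels into a binary validity prediction:
# non-violative and insufficient-information labels are non-actionable.
reports <- data.frame(
  ai_label = c("ACM Errors", "No violation ", "Information not enough", "Vote Buying")
)

non_actionable <- c("No violation", "Information not enough")
reports$ai_predicted_valid <- !(trimws(reports$ai_label) %in% non_actionable)
reports
```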

Fourth, volunteer labels were parsed for multiclass evaluation. Since some reports may receive more than one volunteer category, the first listed category was extracted as the primary label for single-label comparison. Empty labels were treated as non-confirmed incidents. The multiclass analysis then focused on reports with volunteer-confirmed incident labels. This preparation step supports Table 11, Table 12, Table 13, and Figure 6.
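The first-listed-category extraction can be sketched as follows; the semicolon separator and the example strings are assumptions for illustration only.

```r
# Extract the first listed volunteer category as the primary label.
# Empty entries are treated as non-confirmed ("No violation").
volunteer_raw <- c("ACM Errors; Disenfranchisement", "Illegal Campaigning", "")

primary_label <- trimws(sapply(strsplit(volunteer_raw, ";"), `[`, 1))
primary_label[is.na(primary_label) | primary_label == ""] <- "No violation"
primary_label
```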

Binary Validity Evaluation

Binary validity detection evaluates whether the AI pipeline can identify reports that human validators marked as valid. The analysis uses four confusion-matrix outcomes:

  • true positives: valid reports predicted valid by AI;

  • false negatives: valid reports predicted invalid by AI;

  • false positives: invalid reports predicted valid by AI;

  • true negatives: invalid reports predicted invalid by AI.

The confusion matrix is reported in Table 4 and visualized in Figure 3. The following metrics are computed from this matrix and reported in Table 5 and Figure 4: accuracy, precision, recall or sensitivity, specificity, false positive rate, false negative rate, and F1 score.

The binary evaluation is central to the article because the pipeline is framed as a high-recall triage system. Recall indicates the proportion of human-valid reports surfaced by the AI. Precision indicates the proportion of AI-flagged reports later validated by humans. In election monitoring, this distinction matters because false negatives may correspond to missed civic incidents, while false positives increase review burden.
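All seven metrics can be reproduced from the confusion matrix counts alone. A minimal R sketch, using the counts reported later in Table 4:

```r
# Binary triage metrics derived from the Table 4 confusion matrix counts.
tp <- 166; fn <- 15; fp <- 353; tn <- 3084

metrics <- c(
  accuracy    = (tp + tn) / (tp + fn + fp + tn),
  precision   = tp / (tp + fp),
  recall      = tp / (tp + fn),
  specificity = tn / (tn + fp),
  fpr         = fp / (fp + tn),
  fnr         = fn / (fn + tp),
  f1          = 2 * tp / (2 * tp + fp + fn)
)
round(metrics, 4)
```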

Baseline Comparison and Uncertainty Routing

To avoid evaluating the LLM in isolation, the binary triage results are compared with a transparent keyword-based baseline. The baseline flags reports for review when report text contains election-monitoring terms associated with vote buying, illegal campaigning, machine problems, ballot issues, harassment, intimidation, disenfranchisement, or related procedural concerns. This baseline is intentionally simple: it approximates a low-cost rule-based triage system that a civic organization could deploy without an LLM. It is not presented as an optimized classifier, but as a counterfactual for assessing whether the zero-shot model adds value beyond lexical matching.
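A condensed sketch of such a rule is shown below. The term list is an abbreviated stand-in for the full lexicon, not the exact production rule.

```r
# Flag a report for review if its text matches any election-monitoring term.
keywords <- c("vote buying", "vote-buying", "campaign", "machine",
              "ballot", "harass", "intimidat", "disenfranchis")

flag_report <- function(text) {
  grepl(paste(keywords, collapse = "|"), tolower(text))
}

flag_report(c("VCM rejected my ballot twice",
              "Good morning everyone, stay safe at the precinct"))
```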

The analysis also treats uncertainty as a distinct routing problem. The main binary evaluation treats non-violative and insufficient-information labels as non-actionable predictions because both would ordinarily remove a report from priority review. However, insufficient information is conceptually different from invalidity. For this reason, the Results section reports a separate routing table that distinguishes reports predicted valid, predicted invalid, and uncertain. This triage view makes visible the cases that should be routed for follow-up rather than silently collapsed into invalidity.

Statistical Inference

Metric uncertainty is assessed using nonparametric bootstrap resampling of reports. The bootstrap procedure resamples reports with replacement and recomputes recall, precision, and F1 for the binary triage task. The resulting percentile intervals quantify sampling variability around the reported metrics. Model-baseline paired accuracy is compared using McNemar’s test because the zero-shot model and keyword baseline are evaluated on the same reports. These inferential checks are descriptive rather than causal: they assess the stability of observed performance differences within the available corpus.
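Both procedures can be sketched in a few lines of R. The vectors below are simulated stand-ins for per-report labels, not the actual VoteReportPH data.

```r
# Simulated report-level labels: truth, LLM prediction, keyword prediction.
set.seed(42)
n        <- 500
truth    <- rbinom(n, 1, 0.05)
pred_llm <- ifelse(truth == 1, rbinom(n, 1, 0.92), rbinom(n, 1, 0.10))
pred_kw  <- ifelse(truth == 1, rbinom(n, 1, 0.78), rbinom(n, 1, 0.10))

recall_of <- function(y, pred) sum(pred[y == 1]) / sum(y)

# Percentile bootstrap over reports (shown here for recall).
boot_recall <- replicate(2000, {
  i <- sample(n, replace = TRUE)
  recall_of(truth[i], pred_llm[i])
})
quantile(boot_recall, c(0.025, 0.975), na.rm = TRUE)

# McNemar's test on paired per-report correctness of the two systems.
mcnemar.test(table(pred_llm == truth, pred_kw == truth))
```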

Workload Reduction Estimate

The workload reduction estimate uses the AI binary validity recoding. Reports predicted valid are treated as the priority-review pool. Reports predicted invalid are treated as deprioritized by the AI triage layer. Because the study does not observe actual review time, reviewer staffing, or queue dynamics, this measure is interpreted strictly as theoretical filtering capacity, not as measured labor savings. The estimate is computed as:

\[\text{Estimated workload reduction} = 1 - \frac{\text{AI-predicted valid reports}}{\text{Total reports}}\]

This estimate is reported in Table 10 and Figure 5.

This measure should be interpreted as a triage estimate rather than direct evidence of labor savings. It indicates how many reports would be forwarded or deprioritized under the adopted AI-validity rule. It does not measure actual review time, volunteer fatigue, or decision quality.

Multiclass Incident Categorization

The multiclass evaluation compares the model-generated incident label against the primary volunteer-assigned incident label. This analysis is conducted on the subset of reports with volunteer-confirmed incident labels. The category distribution is reported in Table 11, which also shows the class imbalance that constrains interpretation.

The multiclass evaluation reports three overall metrics: accuracy, macro F1, and weighted F1. These are shown in Table 12. Accuracy measures the proportion of exactly matched AI and volunteer labels. Macro F1 gives equal weight to each category and is sensitive to poor performance in rare classes. Weighted F1 weights categories by support and therefore reflects performance across the observed distribution.

Per-class precision, recall, F1, and support are reported in Table 13. Figure 6 visualizes F1 scores across categories.

The multiclass evaluation is interpreted cautiously because several categories have small support. Rare categories may be important in civic terms, but small sample sizes limit the stability of precision, recall, and F1 estimates. For this reason, the Results section reports both class support and per-class performance rather than only aggregate metrics.
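The difference between the two averaging strategies can be made concrete with a short sketch. The values below are an illustrative subset of the per-class results reported later in Table 13.

```r
# Macro F1 averages classes equally; weighted F1 scales by class support.
per_class <- data.frame(
  f1      = c(0.89, 0.95, 0.33, 0.38, 0.00),  # e.g., ACM Errors ... Others
  support = c(116,  29,   11,   7,    12)
)

macro_f1    <- mean(per_class$f1)
weighted_f1 <- sum(per_class$f1 * per_class$support) / sum(per_class$support)
c(macro = macro_f1, weighted = weighted_f1)
```

Because the dominant class scores well, the weighted average sits far above the macro average, mirroring the gap reported in Table 12.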

Error Analysis

The error analysis identifies where AI labels diverge from volunteer labels. Mismatches are interpreted as evidence of classification difficulty, not simply model failure. Some mismatches may result from ambiguous source text, overlapping category definitions, rare categories, or differences between single-label AI output and multi-label volunteer annotation.

The Results section interprets these errors using the per-class pattern in Table 13 and Figure 6. High performance in categories such as ACM Errors and Illegal Campaigning suggests that the model performs better when reports contain explicit lexical or procedural markers. Lower performance in categories such as BEI/EB Non-Compliance with Election Procedures, Disenfranchisement, and Others suggests that procedural ambiguity and residual category design weaken zero-shot classification.

Bias, Validity, and Interpretive Limits

Five limitations guide the interpretation of the results.

First, human validation is used as the benchmark, but it may contain uncertainty. The labels reflect applied monitoring decisions rather than controlled laboratory annotation.

Second, both the binary and multiclass analyses operate under class imbalance. The validity distribution in Table 2 shows that valid reports are rare. The volunteer-confirmed incident distribution in Table 11 shows that some violation categories have very small support. These imbalances explain why accuracy must be interpreted alongside recall, macro F1, per-class F1, and support.

Third, source composition may influence performance. Reports submitted through social media, SMS, partner files, or webforms may differ in length, completeness, language, and evidentiary quality. The current analysis evaluates aggregate performance, while future work should examine source-specific performance when source-level labels can be harmonized.

Fourth, the AI pipeline produces single-label outputs, while human validators may assign multiple labels. The use of the first volunteer label as the primary label is analytically necessary for single-label evaluation, but it may simplify reports that involve multiple violation types.

Fifth, the workload reduction estimate in Table 10 and Figure 5 is not a direct labor measurement. It estimates the proportion of reports forwarded or deprioritized under the AI triage rule. Future research should measure review time, reviewer burden, and decision outcomes directly.

Reproducibility

The analysis is written in Quarto with executable R code. The workflow loads the analytical corpora, cleans labels, derives binary validity predictions, computes evaluation metrics, and produces tables and figures. This structure allows future researchers with appropriate data access to revise recoding assumptions, test alternative category harmonization rules, or compare the zero-shot LLM pipeline against few-shot prompts and supervised baselines.

Results

Dataset Characteristics

The analysis used two related analytical corpora. The broader postprocessed monitoring corpus contained 3,618 reports and was used for binary validity evaluation because it preserved both human validity decisions and AI-generated incident labels. The model-volunteer comparison corpus contained 4,158 reports and was used to assess agreement between AI-generated incident labels and volunteer-assigned incident labels.

Table 1: Summary of datasets used in the analysis.

| Analytical corpus                 | Reports | Primary use                                      |
|-----------------------------------|---------|--------------------------------------------------|
| Postprocessed monitoring corpus   | 3,618   | Binary validity detection; AI label distribution |
| Model-volunteer comparison corpus | 4,158   | AI-volunteer incident classification comparison  |

Table 1 clarifies the analytical role of each corpus. This distinction is important because the two corpora represent different stages of the VoteReportPH data workflow. The postprocessed monitoring corpus supports validity detection because it preserves human validation decisions. The comparison corpus supports label comparison because it pairs AI incident labels with volunteer incident labels.

The broader monitoring corpus showed a highly imbalanced validity distribution. Of 3,618 reports, 181 were marked valid and 3,437 were marked invalid. Valid reports therefore represented only 5.00% of the corpus. This imbalance reflects the high-noise environment in which election monitoring systems operate: most captured reports do not become verified election-related incidents.

Table 2: Human validity distribution in the broader monitoring corpus.

| Human validity | n     | Percent |
|----------------|-------|---------|
| Valid          | 181   | 5.00%   |
| Invalid        | 3,437 | 95.00%  |

Table 2 establishes the first interpretive constraint of the results. Because valid incidents are rare, overall accuracy alone can mislead. A degenerate model that classified every report as invalid would achieve 95.00% accuracy while surfacing no valid reports at all, failing entirely as an election monitoring tool. For this reason, the following analysis emphasizes recall, false negative rate, specificity, and false positive rate rather than relying only on accuracy.

AI-Generated Incident Label Distribution

The AI-generated labels were dominated by non-violative classifications. The model classified 2,894 reports as No violation and 205 reports as Information not enough. The remaining classifications were distributed across election-related categories such as ACM Errors, Illegal Campaigning, Vote Buying, BEI/EB Non-Compliance with Election Procedures, and Disenfranchisement.

Table 3: Distribution of AI-generated incident labels in the broader monitoring corpus.

| AI label                                       | n     | Percent |
|------------------------------------------------|-------|---------|
| No violation                                   | 2,894 | 79.99%  |
| Information not enough                         | 205   | 5.67%   |
| ACM Errors                                     | 195   | 5.39%   |
| Illegal Campaigning                            | 106   | 2.93%   |
| Black Propaganda                               | 44    | 1.22%   |
| Vote Buying                                    | 43    | 1.19%   |
| BEI/EB Non-Compliance with Election Procedures | 40    | 1.11%   |
| Disenfranchisement                             | 31    | 0.86%   |
| Others                                         | 20    | 0.55%   |
| Red-tagging                                    | 13    | 0.36%   |
| Harassment of Voters/Poll Watchers/EB          | 12    | 0.33%   |
| Election Violence                              | 10    | 0.28%   |
| Tampered Ballots                               | 5     | 0.14%   |

Figure 2: Distribution of AI-generated incident labels in the broader monitoring corpus.

Table 3 and Figure 2 show the model’s triage behavior. The model assigned most reports to non-incident categories, while still surfacing a smaller set of reports as possible election violations. This pattern aligns with the intended role of the pipeline as a filtering layer: it reduces a large stream of noisy reports into a smaller pool requiring closer review.

The category distribution also reveals an analytical challenge. Some AI labels appear frequently, such as ACM Errors and Illegal Campaigning, while other categories appear rarely, such as Tampered Ballots, Election Violence, and Harassment of Voters/Poll Watchers/EB. This class imbalance shapes the reliability of per-class estimates.

Binary Validity Detection

For the binary validity analysis, AI labels were recoded into a validity prediction. Reports classified as No violation or Information not enough were treated as AI-predicted invalid or non-actionable. Reports assigned to any substantive incident category were treated as AI-predicted valid. Human validation status served as the benchmark.

The resulting confusion matrix is shown in Table 4 and Figure 3.

Table 4: Confusion matrix for binary validity detection.

| Human status | AI predicted valid | AI predicted invalid |
|--------------|--------------------|----------------------|
| Valid        | 166                | 15                   |
| Invalid      | 353                | 3,084                |

Figure 3: Binary validity detection confusion matrix comparing human validity status and AI validity prediction.

The model correctly identified 166 of 181 valid reports. It missed 15 valid reports, producing a false negative rate of 8.29%. It also correctly rejected 3,084 of 3,437 invalid reports. The main trade-off appears in the 353 false positives: reports that the AI flagged as potential incidents but human validation marked as invalid.

This pattern supports the interpretation of the system as a high-recall triage mechanism. The model captured most valid incidents, but it did so by forwarding a larger set of non-valid reports for potential review.

Table 5: Binary validity detection performance metrics.

| Metric               | Value  |
|----------------------|--------|
| Accuracy             | 0.8983 |
| Precision            | 0.3198 |
| Recall / Sensitivity | 0.9171 |
| Specificity          | 0.8973 |
| False positive rate  | 0.1027 |
| False negative rate  | 0.0829 |
| F1 score             | 0.4743 |

Figure 4: Performance metrics for binary validity detection.

Table 5 and Figure 4 clarify the model’s operating profile. Accuracy was high at 0.8983, but this metric must be interpreted against the severe class imbalance shown in Table 2. The more informative result is the contrast between recall and precision. Recall reached 0.9171, indicating that the model surfaced more than nine out of ten valid reports. Precision was 0.3198, meaning that many AI-flagged reports were not validated by humans.

From a signal detection perspective, this is a low-threshold classification profile. The system favors sensitivity over selectivity. In an election monitoring context, such a design can be defensible if the cost of missing valid incidents exceeds the cost of reviewing additional false positives. It also means that the system should not be treated as an autonomous classifier. Its output still requires human verification.

Baseline Comparison and Metric Uncertainty

To assess whether the zero-shot model improves on a simple rule-based alternative, Table 6 compares the LLM triage rule with a keyword baseline. The keyword baseline achieved nearly identical accuracy because invalid reports dominate the corpus, but its recall was lower. The zero-shot model missed 15 valid reports, while the keyword baseline missed 39. This difference is substantively important in a triage setting because false negatives represent missed reports that would not reach priority review.

Table 6: Binary triage performance of the zero-shot LLM and keyword baseline.

| Model            | TP  | FN | FP  | TN    | Accuracy | Precision | Recall | Specificity | F1     |
|------------------|-----|----|-----|-------|----------|-----------|--------|-------------|--------|
| Zero-shot LLM    | 166 | 15 | 353 | 3,084 | 0.8983   | 0.3198    | 0.9171 | 0.8973      | 0.4743 |
| Keyword baseline | 142 | 39 | 332 | 3,105 | 0.8975   | 0.2996    | 0.7845 | 0.9034      | 0.4336 |

Bootstrap intervals in Table 7 indicate that the zero-shot model’s high-recall profile is stable within the observed corpus. The interval for recall remains high, while precision remains much lower, reinforcing the interpretation that the model is useful as a triage filter but not as a final validity classifier. A McNemar test comparing paired correctness between the zero-shot model and the keyword baseline did not indicate a statistically significant difference in overall accuracy (p = 0.9253), which is expected because accuracy is dominated by the large number of invalid reports. The more relevant comparison is therefore the decision-theoretic reduction in false negatives.

Table 7: Bootstrap confidence intervals for zero-shot binary triage metrics.

| Metric    | Estimate | CI lower | CI upper |
|-----------|----------|----------|----------|
| Recall    | 0.9171   | 0.8769   | 0.9548   |
| Precision | 0.3198   | 0.2819   | 0.3603   |
| F1 score  | 0.4743   | 0.4292   | 0.5174   |

Uncertainty-Aware Triage Routing

The binary analysis collapses insufficient-information outputs into the non-actionable class to evaluate a simple forwarding rule. Table 8 shows why this should be treated as an operational simplification rather than a final workflow recommendation. The model assigned 205 reports to an uncertainty-like insufficient-information state. Most were not human-validated as valid, but seven were valid reports. These seven cases illustrate the civic risk of treating uncertainty as simple invalidity.

Table 8: Uncertainty-aware routing view of zero-shot model outputs.

| Model routing state                  | Human valid | Human invalid | Total |
|--------------------------------------|-------------|---------------|-------|
| Predicted valid                      | 166         | 353           | 519   |
| Predicted invalid                    | 8           | 2,886         | 2,894 |
| Uncertain / information insufficient | 7           | 198           | 205   |

Decision-Theoretic Evaluation of Triage Performance

The decision-theoretic question is whether reducing missed valid reports justifies additional review burden. Table 9 applies the expected-cost function \(C = \alpha \cdot FN + \beta \cdot FP\) under three illustrative civic-harm weights, holding \(\beta = 1\). When missed violations and false positives are weighted equally, the zero-shot model and keyword baseline have nearly identical expected costs; setting the two cost functions equal gives a break-even weight of \(\alpha \approx 0.88\), beyond which the zero-shot model is always cheaper. As the civic cost of a missed valid report increases, the zero-shot model becomes clearly preferable because it substantially reduces false negatives. This formalizes the paper's central claim: the model is not better because it is universally more accurate; it is better under operating conditions where missed violations are more costly than additional review.

Table 9: Expected triage cost under alternative missed-violation weights (β = 1).

| Model            | FN | FP  | Cost (α = 1) | Cost (α = 5) | Cost (α = 10) |
|------------------|----|-----|--------------|--------------|---------------|
| Zero-shot LLM    | 15 | 353 | 368          | 428          | 503           |
| Keyword baseline | 39 | 332 | 371          | 527          | 722           |

Estimated Workload Reduction

A practical question is whether the AI triage layer can reduce the number of reports requiring immediate review. Using the binary recoding above, the model flagged 519 reports as potentially valid and treated 3,099 reports as non-actionable. This implies that 14.34% of the 3,618 reports would be forwarded for priority review, while 85.66% would be deprioritized under the rule. This is a theoretical filtering estimate, not an observed labor outcome.

Table 10: Estimated triage workload reduction based on AI validity prediction.

| Category                      | n     | Percent |
|-------------------------------|-------|---------|
| Forwarded for priority review | 519   | 14.34%  |
| Deprioritized by AI triage    | 3,099 | 85.66%  |

Figure 5: Estimated share of reports forwarded for priority review and deprioritized by AI triage.

Table 10 and Figure 5 should be interpreted as theoretical filtering capacity rather than direct evidence of labor savings. The data show how many reports the model would forward or deprioritize under the chosen recoding rule. They do not measure actual review time, volunteer fatigue, queue dynamics, or decision quality. The estimate is useful because it shows the possible scale of filtering: the model reduced the priority-review pool from 3,618 reports to 519 reports while retaining 166 of 181 valid reports. Whether this translates into actual workload reduction depends on implementation details that are not observed in the present study.

Multi-Class Incident Categorization

The model-volunteer comparison corpus was used to compare AI-generated incident labels with volunteer-assigned labels. Volunteer labels were parsed from multi-category entries, and the first listed volunteer category was treated as the primary category for single-label evaluation. Empty volunteer labels were treated as No violation for the purpose of distinguishing volunteer-confirmed incidents from non-confirmed reports.

After label harmonization, 188 reports contained volunteer-confirmed incident labels. These reports formed the multiclass evaluation subset. The distribution remained highly imbalanced. ACM Errors accounted for 116 cases, followed by Illegal Campaigning with 29 cases. Several categories had fewer than five observations.

Table 11: Volunteer-confirmed incident distribution in the multiclass evaluation subset.

| Volunteer label                                | Support | Percent |
|------------------------------------------------|---------|---------|
| ACM Errors                                     | 116     | 61.70%  |
| Illegal Campaigning                            | 29      | 15.43%  |
| Others                                         | 12      | 6.38%   |
| BEI/EB Non-Compliance with Election Procedures | 11      | 5.85%   |
| Disenfranchisement                             | 7       | 3.72%   |
| Vote Buying                                    | 5       | 2.66%   |
| Red-tagging                                    | 3       | 1.60%   |
| Election Violence                              | 2       | 1.06%   |
| Harassment of Voters/Poll Watchers/EB          | 2       | 1.06%   |
| Tampered Ballots                               | 1       | 0.53%   |

Table 11 shows why multiclass results require caution. The sample supports meaningful interpretation for common categories such as ACM Errors and Illegal Campaigning, but it does not support strong claims about rare categories such as Tampered Ballots, Election Violence, or Harassment of Voters/Poll Watchers/EB.

Overall multiclass accuracy was 0.75. The weighted average F1 score was 0.78, while the macro average F1 score was 0.49. The gap between these two values indicates that the model performed better on frequent categories than on rare ones.

Table 12: Overall multiclass classification performance.

| Metric      | Value |
|-------------|-------|
| Accuracy    | 0.75  |
| Macro F1    | 0.49  |
| Weighted F1 | 0.78  |

Table 12 shows that the model’s apparent performance depends on the averaging strategy. Weighted F1 gives greater influence to frequent categories, especially ACM Errors. Macro F1 treats every class equally and therefore exposes weak performance for rare categories. For election monitoring, the macro result matters because rare categories may still carry high civic significance.

Per-Class Performance

Per-class metrics show a polarized performance pattern. The model performed well on explicit and relatively frequent categories, especially ACM Errors and Illegal Campaigning. It performed poorly on ambiguous, rare, or catch-all categories.

Table 13: Per-class multiclass classification performance.

| Incident category                              | Precision | Recall | F1   | Support |
|------------------------------------------------|-----------|--------|------|---------|
| ACM Errors                                     | 0.93      | 0.85   | 0.89 | 116     |
| BEI/EB Non-Compliance with Election Procedures | 0.31      | 0.36   | 0.33 | 11      |
| Disenfranchisement                             | 0.33      | 0.43   | 0.38 | 7       |
| Election Violence                              | 1.00      | 1.00   | 1.00 | 2       |
| Harassment of Voters/Poll Watchers/EB          | 0.00      | 0.00   | 0.00 | 2       |
| Illegal Campaigning                            | 1.00      | 0.90   | 0.95 | 29      |
| No violation                                   | 0.00      | 0.00   | 0.00 | 0       |
| Others                                         | 0.00      | 0.00   | 0.00 | 12      |
| Red-tagging                                    | 1.00      | 1.00   | 1.00 | 3       |
| Tampered Ballots                               | 0.00      | 0.00   | 0.00 | 1       |
| Vote Buying                                    | 1.00      | 0.80   | 0.89 | 5       |

Figure 6: Per-class F1 scores for multiclass incident categorization.

Table 13 and Figure 6 show that category-level performance depends on both semantic clarity and class frequency. Illegal Campaigning achieved an F1 score of 0.95, while ACM Errors achieved an F1 score of 0.89. These categories often contain recognizable lexical cues, such as references to campaign materials, sample ballots, malfunctioning machines, ballot rejection, or printing problems.

In contrast, BEI/EB Non-Compliance with Election Procedures and Disenfranchisement produced much lower F1 scores of 0.33 and 0.38, respectively. These categories are procedurally complex and may overlap in ordinary reporting language. A voter who reports being unable to vote may be describing administrative non-compliance, technical failure, or disenfranchisement. This ambiguity limits the reliability of zero-shot classification without further contextual evidence.

The model also failed to classify Others, Harassment of Voters/Poll Watchers/EB, and Tampered Ballots reliably. These results should not be overinterpreted because some categories had very small support. Yet the zero score for Others also reflects a conceptual limitation: catch-all categories are difficult for models because they lack stable semantic boundaries.

Error Structure and Interpretation

The multiclass results suggest three types of classification error.

First, the model performs well when reports contain explicit markers. Categories such as ACM Errors and Illegal Campaigning often contain concrete signals. These include machine problems, ballot rejection, receipt issues, campaign materials, or sample ballot distribution.

Second, the model struggles with procedural ambiguity. BEI/EB Non-Compliance with Election Procedures and Disenfranchisement require the model to infer whether a reported voting disruption reflects official procedural failure, voter exclusion, or another technical problem. These distinctions may be clear to trained volunteers but less clear from short social media text alone.

Third, the model struggles with rare and residual categories. Rare categories provide too few examples for stable evaluation, while residual categories such as Others contain heterogeneous incidents that do not share a clear semantic profile.

Taken together, the results support a bounded interpretation of the zero-shot LLM pipeline. The model is useful as a high-recall filtering system for surfacing potential reports, but it remains weaker as a final multiclass adjudicator. Its best role is to prioritize reports for human verification, especially when valid incidents are rare and information volume is high.

Discussion

LLM Classification as Civic Triage Rather Than Automated Judgment

The results support the argument that the zero-shot LLM pipeline is best understood as a civic triage mechanism rather than an autonomous election violation classifier. The distinction matters. A classifier implies final judgment. A triage system reorganizes attention under conditions of scarcity. In the VoteReportPH workflow, this means that the model’s main contribution lies in narrowing a high-volume stream of reports into a smaller pool of likely actionable cases for human review.

The validity distribution in Table 2 shows why triage is necessary. Only 181 of 3,618 postprocessed reports were human-validated as valid incidents. This means that valid reports made up only 5.00% of the dataset. In such a setting, human reviewers face a difficult information filtering problem: the relevant signals are rare, while non-actionable reports dominate the stream. This empirical pattern is consistent with information overload theory, which argues that decision quality can decline when information volume exceeds human processing capacity (Eppler & Mengis, 2004; Roetzel, 2019).

The AI-generated label distribution in Table 3 and Figure 2 shows that the model reproduced this filtering logic at scale. Most reports were assigned to No violation or Information not enough, while a smaller subset was assigned to substantive incident categories. The model therefore operated as a selective attention mechanism: it did not solve verification, but it created a prioritized review pool.

This interpretation is reinforced by Table 10 and Figure 5. Under the binary recoding rule, the AI would forward 519 of 3,618 reports for priority review and deprioritize 3,099 reports. This corresponds to an estimated filtering capacity of 85.66%. This result should not be read as direct evidence of labor savings because the study did not measure review time, volunteer fatigue, or decision quality. It instead shows the scale of reports that would be routed differently under the triage rule.

The Recall-Precision Trade-Off as an Ethical Design Problem

The binary classification results show a clear recall-oriented operating profile. As shown in Table 4 and Figure 3, the model correctly identified 166 of 181 valid reports and missed 15. It also produced 353 false positives. Table 5 and Figure 4 summarize this trade-off: recall reached 0.9171, while precision was 0.3198.

This pattern is not merely a technical result. It reflects a normative design choice. Signal detection theory provides a useful frame because it separates two kinds of classification error: missed detections and false alarms (Green & Swets, 1966; Lynn & Barrett, 2014). In this study, false negatives represent valid election reports that the model failed to surface. False positives represent reports that the model flagged even though human validation marked them invalid.

The ethical significance of these errors differs. In election monitoring, a false negative may correspond to a missed case of voter disenfranchisement, illegal campaigning, machine malfunction, intimidation, or other irregularity. A false positive imposes a verification burden on volunteers. Neither error is trivial, but they do not carry the same civic consequence. The expected-cost analysis in Table 9 makes this trade-off explicit. The zero-shot model is preferable to the keyword baseline when missed valid reports are weighted several times more heavily than false positives. If an organization weights both errors equally, the advantage is much weaker.

The model’s low precision shows why human verification remains indispensable. Of the 519 reports flagged as potentially valid, 353 were not validated by humans. If the system were treated as an autonomous decision-maker, this false positive rate would be problematic. Treated as a triage layer, however, the same error profile becomes more defensible because the model’s task is to surface candidates for review, not to decide final validity.

Baseline Comparison and Generalization

The baseline comparison tempers the interpretation of the findings. The zero-shot model does not dominate a simple keyword rule on every metric: overall accuracy is nearly identical because both systems correctly reject many invalid reports. The zero-shot model’s advantage lies in recall; it misses fewer valid reports than the keyword baseline. This suggests a more general design principle for high-noise civic data systems: LLM triage is most defensible when the institutional goal is not to maximize aggregate accuracy but to reduce the probability that rare but important valid reports are lost in the stream.
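For concreteness, a keyword rule of this kind can be as simple as a case-insensitive term match. The sketch below is illustrative only; the terms shown are hypothetical and do not reproduce the study's actual keyword list.

```r
# Hypothetical keyword baseline: flag a report if any term matches.
keywords <- c("vote buying", "ballot", "machine error", "intimidation",
              "campaign material", "disenfranchis")

keyword_flag <- function(text) {
  any(sapply(keywords, grepl, x = tolower(text), fixed = TRUE))
}

keyword_flag("The ACM rejected my ballot twice")   # TRUE  -> forward
keyword_flag("Long lines but everything orderly")  # FALSE -> deprioritize
```

A rule like this is transparent and cheap, but it matches only surface forms; its misses concentrate in reports that describe violations without the anticipated vocabulary, which is plausibly where the zero-shot model's recall advantage comes from.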

This also limits generalization. The VoteReportPH case does not prove that zero-shot LLMs will outperform rule-based systems in every election monitoring environment. It shows that their value depends on the reporting environment, category taxonomy, language mix, error-cost weights, and review capacity. The generalizable contribution is therefore the evaluation framework: compare the model against transparent baselines, report uncertainty, separate invalidity from uncertainty, and evaluate triage through expected civic cost.

Class Imbalance and the Limits of Accuracy

The results also show why aggregate accuracy can obscure civic risk. The binary accuracy of 0.8983 in Table 5 appears strong, but it must be interpreted against the severe imbalance in Table 2. Since invalid reports dominate the dataset, a model can achieve high accuracy by correctly rejecting many non-valid cases. For rare-event monitoring, accuracy alone does not answer the central question: does the system capture the rare reports that matter?

The multiclass results show the same problem. Table 12 reports an overall accuracy of 0.75 and a weighted F1 score of 0.78. These values suggest acceptable performance. Yet the macro F1 score is only 0.49, which reveals uneven category-level performance. The divergence between weighted and macro F1 matters because weighted metrics give greater influence to common categories, while macro metrics expose performance weaknesses across rare classes. In imbalanced classification problems, metric choice affects interpretation because class prevalence can inflate aggregate performance (Farhadpour et al., 2024).
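The mechanics of this divergence are easy to reproduce. The toy R example below uses illustrative per-class values, not the study's actual table, to show how a dominant, well-classified category inflates the weighted average while the macro average exposes failing rare classes.

```r
# Toy per-class F1 scores and supports (illustrative values only)
f1      <- c(0.90, 0.95, 0.35, 0.00)
support <- c(116,   40,   10,    2)

macro_f1    <- mean(f1)                          # every class counts equally
weighted_f1 <- sum(f1 * support) / sum(support)  # common classes dominate

c(macro = macro_f1, weighted = weighted_f1)
#> macro ~ 0.55, weighted ~ 0.87
```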

Table 11 shows that the multiclass evaluation subset is highly uneven. ACM Errors accounts for 116 cases, while several categories have fewer than five cases. This means that the model can perform well overall while still failing on rare but potentially severe incident types. For election monitoring, rare categories cannot be dismissed as analytically marginal. Incidents such as harassment, tampered ballots, red-tagging, or election violence may occur infrequently in the dataset but carry high democratic significance.

The implication is that future evaluations should report per-class metrics as a default. Table 13 and Figure 6 make this variation visible. Without these per-class results, the analysis would overstate the reliability of the model for categories with limited support.

Why the Model Performs Better on Explicit Than Procedural Categories

The per-class results reveal that the model performs best when incident categories have clear lexical and procedural markers. Table 13 shows strong performance for Illegal Campaigning and ACM Errors, with F1 scores of 0.95 and 0.89, respectively. These categories often contain explicit cues: reports of ACM Errors may mention ballot rejection, machine failure, printing problems, or transmission issues, while reports of Illegal Campaigning may mention campaign materials, sample ballots, candidate paraphernalia, or activity near voting sites.

By contrast, the model performed weakly on more procedurally ambiguous categories. BEI/EB Non-Compliance with Election Procedures had an F1 score of 0.33, while Disenfranchisement had an F1 score of 0.38, as shown in Table 13 and Figure 6. These categories require more than lexical matching. They require an interpretation of election procedure, voter access, authority, intent, and context. A short report that says a voter was unable to vote may reflect machine failure, administrative error, queue cutoff, missing names, or deliberate exclusion. The text alone may not contain enough information to distinguish these possibilities.

This result is consistent with the broader literature on LLM-based classification. LLMs can perform well on text classification tasks, especially when categories are clearly described and text contains recognizable cues (Chae & Davidson, 2023; Davidson & Chae, 2025). Yet zero-shot classification remains vulnerable when category boundaries overlap or when the relevant context is implicit. The VoteReportPH case illustrates that civic classification is not only a language task. It is also a procedural and institutional interpretation task.

The Problem of Residual and Rare Categories

The weak performance for Others, Harassment of Voters/Poll Watchers/EB, and Tampered Ballots deserves careful interpretation. Table 13 shows zero F1 scores for these categories, but the causes are not identical.

For rare categories, low support limits statistical stability. Tampered Ballots had only one case, while Harassment of Voters/Poll Watchers/EB had two cases. A single classification error can therefore reduce recall or F1 to zero. These estimates should be treated as diagnostic signals rather than definitive evidence of model incapacity.
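A percentile bootstrap makes this instability visible. With 181 valid cases, the interval around the binary recall is narrow; a two-case category run through the same procedure spans the entire unit interval. The sketch below uses the binary counts reported above, plus a hypothetical two-case vector.

```r
set.seed(1)

# Percentile bootstrap for recall: resample the per-case hit/miss vector
boot_recall <- function(correct, B = 10000) {
  replicate(B, mean(sample(correct, length(correct), replace = TRUE)))
}

# Binary validity task: 166 of 181 valid reports surfaced
quantile(boot_recall(c(rep(1, 166), rep(0, 15))), c(0.025, 0.975))
# a narrow interval around 0.92

# Hypothetical rare category with two cases, one recovered
quantile(boot_recall(c(1, 0)), c(0.025, 0.975))
# spans 0 to 1
```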

The Others category poses a different problem. It had 12 cases yet still produced an F1 score of zero. This suggests a category-design issue. Residual categories do not have stable semantic boundaries. They collect cases that do not fit predefined labels, which means they may not share common lexical or conceptual features. A zero-shot model that relies on category descriptions will struggle when a category is defined by exclusion rather than by positive characteristics.

For future monitoring systems, this has a practical implication. Others should not be treated as a final category. It should function as a routing state for human review, taxonomy revision, or secondary adjudication. If repeated patterns appear within Others, those patterns may justify new incident categories. In this way, classification errors can become a source of taxonomy improvement.

Human-AI Complementarity in Election Monitoring

The results support a complementarity model of human-AI collaboration. The AI system performs useful work by filtering large volumes of text and surfacing likely incidents. Human reviewers remain necessary because they supply contextual judgment, procedural knowledge, and accountability. This division of labor matches the human-AI complementarity literature, which argues that effective systems should combine machine-scale processing with human interpretation rather than assume substitution (Holstein et al., 2022; Vössing et al., 2022).

The binary results show where AI helps most. The model’s recall of 0.9171 in Table 5 indicates that it can surface most valid reports. This supports situational awareness in high-volume periods. The multiclass results show where humans remain central. Procedural ambiguity, rare cases, and residual categories require review beyond model output. Figure 6 makes this boundary visible: the model handles explicit categories better than ambiguous or sparse categories.

This has a design implication for civic AI systems. The user interface should not merely display the AI label. It should expose uncertainty, highlight reports assigned to ambiguous categories, and route low-confidence or procedurally complex cases to trained volunteers. Reports labeled Information not enough should not automatically disappear from the workflow. They may require secondary evidence, location confirmation, or manual escalation.

The uncertainty-routing analysis strengthens this point. A small number of human-valid reports appeared in the insufficient-information group. Even if most uncertain reports are ultimately invalid, collapsing uncertainty into invalidity can hide cases that require follow-up. A safer design is therefore a three-route workflow: priority review for substantive incident predictions, low-priority archival review for non-violative predictions, and evidence-gathering or manual screening for uncertain reports.
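The three-route rule can be written directly as a routing function. The sketch below uses label strings from the study's taxonomy; the route names are illustrative.

```r
# Three-route triage: uncertainty is kept separate from invalidity.
route_report <- function(ai_label) {
  if (ai_label == "Information not enough") {
    "evidence-gathering"      # uncertain: follow-up or manual screening
  } else if (ai_label == "No violation") {
    "low-priority archive"    # non-violative: periodic archival review
  } else {
    "priority review"         # substantive incident prediction
  }
}

sapply(c("ACM Errors", "No violation", "Information not enough"), route_report)
```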

Implications for Civic AI Design

The findings suggest four design principles for AI-assisted election monitoring.

First, optimize for recall only when the expected civic cost justifies it. The model’s high recall in Table 5 is valuable because valid incidents are rare, but the 353 false positives in Table 4 show that review burden remains a real constraint. A civic monitoring system should choose an operating point based on explicit error-cost weights rather than treating high recall as automatically optimal.

Second, separate validity detection from incident categorization. Binary triage and multiclass classification serve different purposes. The model is stronger at identifying potentially actionable reports than at assigning difficult procedural categories. Treating both tasks as one problem would hide this difference.

Third, maintain human review for rare and ambiguous cases. Table 13 shows that rare and residual categories remain unreliable. These categories should be routed to human validators by design.

Fourth, use model errors to revise the taxonomy. Misclassifications are not only failures; they reveal where category boundaries are unclear. Persistent confusion between BEI/EB Non-Compliance with Election Procedures, Disenfranchisement, and ACM Errors may indicate that the operational taxonomy needs clearer decision rules, more examples, or multi-label classification.
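One way to operationalize this principle is a disagreement tabulation: cross-tabulate volunteer and AI labels and rank the off-diagonal pairs. The sketch below runs on a toy evaluation frame; the column names and rows are illustrative.

```r
library(dplyr)

# Toy evaluation frame; in practice, the volunteer-confirmed subset
eval_df <- tibble::tibble(
  volunteer_label = c("Disenfranchisement", "ACM Errors", "Disenfranchisement"),
  ai_label        = c("ACM Errors", "ACM Errors", "BEI/EB Non-Compliance")
)

# Recurring off-diagonal pairs flag unclear category boundaries
eval_df |>
  filter(volunteer_label != ai_label) |>
  count(volunteer_label, ai_label, sort = TRUE)
```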

Contribution to Computational Social Science and Election Monitoring

This study contributes to computational social science by showing how LLM evaluation can be reframed around civic workflows rather than standalone prediction. The central question is not whether the model achieves maximum accuracy. The more relevant question is whether its error profile supports the institutional task at hand. In the VoteReportPH case, the model’s high recall suggests value for triage, while its low precision and uneven multiclass performance limit its use as a final classifier.

The study also contributes to election monitoring research by documenting how citizen reports can be processed through a human-in-the-loop AI pipeline. Digital monitoring systems expand the volume of observable reports, but they also create verification bottlenecks. The results show that zero-shot LLMs can help manage this bottleneck, but only if embedded in a workflow that preserves human review and treats model outputs as provisional.

Finally, the study contributes a methodological lesson: evaluation metrics should reflect civic stakes. In this case, recall, false negative rate, macro F1, per-class performance, baseline comparison, uncertainty routing, and expected triage cost are more informative than accuracy alone. Civic AI systems should be evaluated by the consequences of their errors, not just by aggregate performance.

Limitations

This study has several limitations that should guide the interpretation of its findings and the design of future evaluations.

First, the analysis uses human validation as the operational benchmark, but this benchmark should not be treated as infallible ground truth. Volunteer labels are appropriate for evaluating model alignment with the VoteReportPH workflow, yet they may contain inconsistency, uncertainty, or disagreement. Election monitoring decisions often occur under time pressure and with incomplete evidence. Some reports may contain vague descriptions, missing locations, unclear timestamps, or ambiguous procedural details. These conditions can introduce label noise, a known issue in classification research where imperfect labels affect model evaluation and interpretation (Jindal et al., 2019). In this study, label noise may affect both binary validity evaluation and multiclass incident categorization.

Second, the study is constrained by class imbalance. As shown in Table 2, valid reports account for only 5.00% of the postprocessed EMS dataset. The multiclass distribution in Table 11 is also uneven, with ACM Errors dominating the volunteer-confirmed subset and several incident categories represented by only one to five cases. This imbalance limits the stability of category-level performance estimates. The strong weighted F1 score in Table 12 therefore should not be interpreted as uniform performance across all violation types. The lower macro F1 score and the per-class variation shown in Figure 6 provide a more cautious picture.

Third, the study evaluates a zero-shot classification pipeline without comparing it against alternative model configurations. The results show that the LLM performs well as a high-recall triage layer, as reflected in Table 5, but the analysis does not establish whether zero-shot prompting is superior to few-shot prompting, supervised classifiers, retrieval-augmented generation, or ensemble approaches. A stronger comparative design would test multiple models and prompting strategies on the same validation set.

The baseline comparison partially addresses this limitation by including a transparent keyword rule, but it should not be interpreted as a full model benchmark. The keyword rule is useful because it approximates a low-cost non-LLM triage alternative. It does not replace comparison with supervised machine learning, few-shot prompting, calibrated LLM confidence scores, or human-only workflows.

Fourth, the binary validity analysis depends on a recoding rule. Reports labeled No violation or Information not enough were treated as AI-predicted invalid or non-actionable, while all other AI-generated incident labels were treated as AI-predicted valid. This rule is defensible for triage evaluation because it reflects whether a report would be forwarded for priority review. Yet the rule also compresses uncertainty. In particular, Information not enough may include reports that are genuinely non-actionable and reports that require more evidence. Future systems should treat this category as a separate routing state rather than merging it automatically with invalid reports.

Fifth, the multiclass evaluation simplifies volunteer annotations. Some volunteer labels appear as list-like or multi-label entries, while the AI pipeline produces a single primary incident label. To make the comparison tractable, the analysis used the first volunteer label as the primary category. This decision enables single-label evaluation, but it may reduce the complexity of reports involving multiple violations. Future work should explore multi-label evaluation metrics, especially because election incidents may involve overlapping categories such as illegal campaigning, use of public resources, harassment, and procedural non-compliance.

Sixth, the workload reduction estimate is indirect. Table 10 and Figure 5 estimate the proportion of reports that would be forwarded or deprioritized by the AI triage rule. This estimate should be read as theoretical filtering capacity, not measured labor savings. A fuller evaluation would require workflow data, such as time-to-review, number of validators per report, escalation decisions, queue length, and disagreement rates before and after AI assistance.

Seventh, the analysis does not fully account for source, language, and platform bias. Reports from SMS, webforms, partner organizations, and social media may differ in structure, completeness, and evidentiary strength. Social media reports may also overrepresent users with internet access, public posting behavior, or particular political networks. Reports in Filipino, English, code-switched text, or regional languages may create different classification challenges. Without source-specific and language-specific performance estimates, the aggregate metrics may hide uneven model behavior across reporting channels.

Eighth, the study does not evaluate visual evidence. Election monitoring reports may include images, videos, screenshots, or links. The present evaluation focuses on text-based classification. This limits the model’s ability to assess incidents where the decisive evidence is visual, such as campaign materials near precincts, marked ballots, crowding, intimidation, or broken equipment. A future vision-language pipeline could evaluate whether multimodal evidence improves classification and reduces ambiguous cases.

Ninth, the explanatory analysis remains category-level rather than report-level. The results identify weaker performance in rare, residual, and procedurally ambiguous categories, but the study does not estimate a predictive model of report-level error. A stronger design would operationalize ambiguity using report length, lexical diversity, category frequency, evidence completeness, and source channel, then estimate whether these factors predict AI-volunteer mismatch.

Finally, the study examines one election monitoring context. VoteReportPH provides a rich case for evaluating civic AI in the Philippines, but the findings should not be generalized without caution to other countries, election systems, reporting cultures, or institutional environments. The broader contribution of the study lies in the evaluation framework: LLMs should be assessed as human-in-the-loop triage systems whose value depends on error trade-offs, class imbalance, and civic accountability.

Conclusion and Recommendations

This study evaluated a zero-shot large language model pipeline for election monitoring using VoteReportPH data from the 2025 Philippine elections. The central question was not whether the model could replace human volunteers, but whether it could support them by functioning as a high-recall triage layer under conditions of information overload, class imbalance, and limited verification capacity.

The findings support a bounded but meaningful role for LLMs in civic monitoring. The postprocessed EMS dataset contained 3,618 reports, but only 181 were human-validated as valid incidents. This distribution shows the central operational problem: valid civic signals are rare within a much larger stream of noisy, incomplete, or non-actionable reports. The model addressed this problem by forwarding 519 reports for priority review while deprioritizing 3,099 reports. This suggests that the pipeline can reduce the immediate review pool while retaining most valid incidents.

The strongest empirical result is the model’s high recall in binary validity detection. The model correctly surfaced 166 of 181 human-valid reports, yielding a recall of 0.9171. This supports the claim that the pipeline can serve as a recall-oriented filter. Yet the same analysis also shows the limit of automation: precision was only 0.3198, indicating that many AI-flagged reports still required human rejection. The model therefore produced useful prioritization, not final judgment.

The multiclass results further clarify this boundary. The model performed well on explicit and frequent categories such as ACM Errors and Illegal Campaigning, but it performed poorly on ambiguous, rare, and residual categories such as BEI/EB Non-Compliance with Election Procedures, Disenfranchisement, Others, and Tampered Ballots. This pattern shows that zero-shot classification is strongest when reports contain clear semantic cues and weakest when classification depends on procedural interpretation or contextual knowledge.

The study makes four contributions. First, it provides empirical evidence on the use of zero-shot LLMs for election monitoring in the Philippine context. Second, it reframes civic AI evaluation around triage rather than replacement. Third, it shows why class imbalance and rare-event detection require metrics beyond accuracy, including recall, false negative rate, macro F1, and per-class F1. Fourth, it offers a reproducible R and Quarto workflow for evaluating AI-assisted civic monitoring systems.

Several recommendations follow from these findings.

First, election monitoring systems should use LLMs as decision-support tools, not decision-making authorities. AI-generated labels should remain provisional until reviewed by human validators. This is especially important for categories involving voter access, intimidation, procedural non-compliance, or legal interpretation.

Second, future deployments should separate validity detection from incident categorization. The model appears more useful for identifying reports that deserve attention than for assigning final incident categories. A two-stage workflow is more defensible: use the model first to surface likely actionable reports, then route those reports to human reviewers for final classification.

Third, reports labeled as insufficiently supported should receive their own workflow status. They should not be automatically treated as invalid. Some may be genuinely non-actionable, but others may require follow-up information, location confirmation, or supporting evidence. Treating uncertainty as a separate routing state would make the workflow more accountable.

Fourth, category definitions should be revised using error analysis. Persistent confusion between ACM Errors, BEI/EB Non-Compliance with Election Procedures, and Disenfranchisement suggests that the taxonomy needs clearer decision rules and examples. The Others category should function as a temporary routing category rather than a final analytical label.

Fifth, future evaluation should test alternative model designs. Zero-shot prompting offers speed and low deployment cost, but it should be compared with few-shot prompting, retrieval-augmented generation, supervised classifiers, and human-only baselines. Such comparisons would clarify whether the observed performance reflects the underlying capacity of the LLM, the prompt design, or the structure of the taxonomy.

Sixth, future work should measure human workflow outcomes directly. The present study estimates theoretical filtering capacity using model predictions, but it does not measure review time, reviewer fatigue, disagreement rates, or escalation quality. A stronger evaluation would record how AI assistance changes volunteer workload and decision outcomes during live monitoring.

Seventh, future work should model error mechanisms directly. The present findings suggest that ambiguity, class frequency, and category design shape performance, but those mechanisms should be tested with report-level predictors. A next-stage study could estimate whether text length, lexical specificity, evidence completeness, source channel, and category frequency predict false negatives, false positives, or multiclass mismatch.

Eighth, future systems should incorporate multimodal evidence. Many election reports include images, screenshots, links, or videos. Text-only classification cannot fully evaluate reports where visual evidence carries the key information. A vision-language model could help assess campaign materials, marked ballots, crowding, equipment problems, or intimidation claims, while still requiring human review.

The broader implication is that civic AI should be designed around accountability rather than automation. Election monitoring involves democratic rights, institutional trust, and public evidence. In this context, the value of AI lies not in replacing judgment but in helping people see what might otherwise be missed. A high-recall LLM triage system can support election monitoring when its limits are visible, its errors are reviewed, and its outputs remain subordinate to human verification.

References

Chae, Y., & Davidson, T. (2023). Large language models for text classification: From zero-shot learning to instruction-tuning. Center for Open Science. https://doi.org/10.31235/osf.io/sthwk
Davidson, T. J., & Chae, Y. (2025). Large language models for text classification: From zero-shot learning to instruction-tuning. Center for Open Science. https://doi.org/10.31235/osf.io/sthwk_v2
Eppler, M. J., & Mengis, J. (2004). The concept of information overload: A review of literature from organization science, accounting, marketing, MIS, and related disciplines. The Information Society, 20(5), 325–344. https://doi.org/10.1080/01972240490507974
Farhadpour, S., Warner, T. A., & Maxwell, A. E. (2024). Selecting and interpreting multiclass loss and accuracy assessment metrics for classifications with class imbalance: Guidance and best practices. Remote Sensing, 16(3), 533. https://doi.org/10.3390/rs16030533
Garbiras-Díaz, N., & Montenegro, M. (2022). All eyes on them: A field experiment on citizen oversight and electoral integrity. American Economic Review, 112(8), 2631–2668. https://doi.org/10.1257/aer.20210778
Green, D. M., & Swets, J. A. (1966). Signal detection theory and psychophysics. Wiley.
Holstein, K., De-Arteaga, M., & Tumati, L. (2022). Toward supporting perceptual complementarity in human-AI collaboration via reflection on unobservables. arXiv. https://doi.org/10.48550/arXiv.2207.13834
Jackson, T., & Farzaneh, P. (2012). Theory-based model of factors affecting information overload. International Journal of Information Management, 32(6), 523–532. https://doi.org/10.1016/j.ijinfomgt.2012.04.006
Jindal, I., Pressel, D., & Lester, B. (2019). An effective label noise model for DNN text classification. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 3246–3256. https://doi.org/10.18653/v1/N19-1328
Lynn, S. K., & Barrett, L. F. (2014). “Utilizing” signal detection theory. Psychological Science, 25(9), 1663–1673. https://doi.org/10.1177/0956797614541991
Norheim-Hagtun, I., & Meier, P. (2010). Crowdsourcing for crisis mapping in Haiti. Innovations: Technology, Governance, Globalization, 5(4), 81–89. https://doi.org/10.1162/INOV_a_00046
Roetzel, P. G. (2019). Information overload in the information age: A review of the literature from business administration, business psychology, and related disciplines with a bibliometric approach and framework development. Business Research, 12(2), 479–522. https://doi.org/10.1007/s40685-018-0069-z
Tsandzana, D. (2019). Using on-line platforms to observe and monitor elections: A netnography of Mozambique. Journal of African Elections, 18(2), 46–71. https://doi.org/10.20940/jae/2019/v18i2a3
Vössing, M., Kühl, N., & Lind, M. (2022). Designing transparency for effective human-AI collaboration. Information Systems Frontiers, 24(3), 877–895. https://doi.org/10.1007/s10796-022-10284-3