Probability Language Lexicon

Not all categories were used for the SEPA 2026 analyses. Categories are being continuously refined.

Category 1 — Modals (not Deontic)

Function: Grammatical markers of possibility or potentiality.

Interpretation:

High counts → cautious, hypothetical, or advisory framing.
Low counts → more declarative, confident communication.

Example words:

can, cannot, could, couldn’t, may, might, would, will, won’t, wouldn’t

Note: Deontic modals such as should, must, shall, and shouldn’t are not counted.
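
The "high counts / low counts" interpretation can be illustrated with a minimal counting sketch. This is not the procedure used for the SEPA 2026 analyses; the tokenizer, the per-1,000-token normalization, and the names `tokenize` and `category_rate` are assumptions made for illustration, and the word set is a subset of the example words above.

```python
import re
from collections import Counter

# Subset of the Category 1 example words (deontic modals left out, per the note).
MODALS = {"can", "cannot", "could", "couldn't", "may", "might",
          "would", "will", "won't", "wouldn't"}

def tokenize(text):
    """Lowercase, normalize curly apostrophes, and split into word tokens."""
    text = text.lower().replace("\u2019", "'")
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text)

def category_rate(text, lexicon):
    """Return (raw count, hits per 1,000 tokens, per-word counts) for one category."""
    tokens = tokenize(text)
    hits = Counter(t for t in tokens if t in lexicon)
    total = sum(hits.values())
    rate = 1000 * total / len(tokens) if tokens else 0.0
    return total, rate, hits

total, rate, hits = category_rate(
    "This may help, but it might not; surgery couldn\u2019t be ruled out.", MODALS
)
print(total, rate, dict(hits))
```

Normalizing by text length is one way to make counts comparable across documents of different sizes; raw counts alone would favor longer texts.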

Category 2 — Epistemic Hedges

Function: Lexical cues of epistemic distance, inference, or uncertainty about the truth of a proposition.

Interpretation:

High counts → tentative or interpretive stance; speaker distances themselves from full commitment (“it seems,” “appears”).
Low counts → more assertive, authoritative tone.

Example words:

perhaps, apparently, maybe, seem, seems, seemed, seeming, seemingly, appears, appeared, appearing, suggest, suggests, suggested, suggesting, presumably, possibly, possible, potentially, potential, roughly, approximately

Category 3 — Magnitude / Frequency (Linguistic)

Function: Qualitative expressions of how frequent, typical, or probable an outcome is, without explicit numbers.

Interpretation:

High counts → intuitive or narrative frequency framing (lay style).
Low counts → less emphasis on qualitative descriptions.

Example words:

common, commonly, uncommon, rare, rarely, frequent, frequently, infrequent, seldom, sometimes, often, occasionally, occasional, usually, unusually, typical, typically, normal, normally, majority, minority, most, many, few, several, nearly, about, likely, unlikely

Category 4 — Magnitude / Frequency (Numerical)

Function: Explicit numeric or quantitative expressions of likelihood, frequency, or proportion.

Interpretation:

High counts → precise, data-driven, quantitative framing.
Low counts → qualitative or narrative description predominates.

Notes: Captures digits (1, 2, 3…), number words (“one,” “ten”), and patterns like “1 in 8,” “1/1000,” or “one out of five.”

Example words:

one, two, three, four, five, six, seven, eight, nine, ten, eleven, twelve, thirteen, fourteen, fifteen, sixteen, seventeen, eighteen, nineteen, twenty, hundred, thousand, million, billion, percent, percentage
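
The multi-word patterns mentioned in the notes for this category ("1 in 8," "1/1000," "one out of five") cannot be matched by a word list alone. The sketch below shows one possible way to capture them with regular expressions. The pattern names, the `count_numeric_patterns` helper, and the percent pattern are assumptions for illustration, not the lexicon's actual implementation, and only a subset of the number words is included.

```python
import re

# Subset of the number words from the example list above.
NUMBER_WORDS = ("one|two|three|four|five|six|seven|eight|nine|ten|"
                "hundred|thousand|million|billion")
# A number is either digits (with optional commas/decimals) or a number word.
NUM = rf"(?:\d[\d,]*(?:\.\d+)?|(?:{NUMBER_WORDS}))"

PATTERNS = {
    "x_in_y": re.compile(rf"\b{NUM}\s+in\s+{NUM}\b", re.IGNORECASE),
    "x_out_of_y": re.compile(rf"\b{NUM}\s+out\s+of\s+{NUM}\b", re.IGNORECASE),
    "fraction": re.compile(r"\b\d+\s*/\s*\d[\d,]*\b"),
    "percent": re.compile(r"\b\d+(?:\.\d+)?\s*(?:%|percent\b)", re.IGNORECASE),
}

def count_numeric_patterns(text):
    """Return the matched strings for each hypothetical pattern name."""
    return {name: [m.group(0) for m in pattern.finditer(text)]
            for name, pattern in PATTERNS.items()}

sample = "About 1 in 8 women, or 12.5%, versus 1/1000 and one out of five."
for name, matches in count_numeric_patterns(sample).items():
    print(name, matches)
```

A fuller version would need to decide how to treat dates, dosages, and other numbers that are not probability statements; this sketch counts any match.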

Category 5 — Comparative / Relative Framing

Function: Language contrasting magnitudes or likelihoods between groups or conditions.

Interpretation:

High counts → comparative or contrastive framing of uncertainty (“higher risk than,” “twice as common”).
Low counts → absolute, standalone risk descriptions.

Example words:

more, less, greater, smaller, higher, lower, increase, increases, increased, decrease, decreases, decreased, rises, rose, rising, drop, drops, dropped, twice, double, half, multiple, ratio, relative, comparative, versus, than

Category 6 — Risk-as-Object Language

Function: Domain-specific nouns that name or refer to risk, odds, or probability as factual properties of the world.

Interpretation:

High counts → risk is presented as a measurable attribute (“risk of,” “rate of”).
Low counts → discourse lacks explicit reference to “risk” as an object.

Example words:

risk, risks, chance, chances, odds, rate, rates, ratio, ratios, probability, probabilities, prevalence, incidence, hazard, likelihood, possibility, possibilities, exposure, event, events

Category 7 — Epistemic Boosters

Function: Lexical cues of speaker confidence, certainty, or full commitment to the truth of a proposition.

Interpretation:

High counts → assertive, authoritative framing; speaker presents outcomes as definite or unconditional (“always works,” “certainly effective”).
Low counts → more tentative or qualified communication.

Example words:

certainly, clearly, definitely, always, never, absolute, absolutely, indeed

Category 8 — Conditionals

Function: Language marking that an outcome or probability is contingent, variable, or dependent on other factors.

Interpretation:

High counts → probability is framed as context-sensitive or indeterminate rather than fixed (“depends on the stage,” “varies by patient”).
Low counts → probability statements presented as unconditional.

Example words:

depend, depends, depending, dependent, vary, varies, variable, variation, predict, predicted, predictive, unpredictable, expect, expected, expectation, unexpected

Category 9 — Intensifiers

Function: Words that amplify the degree or strength of an adjacent expression, potentially modifying how probability or risk language should be interpreted.

Interpretation:

High counts → probability language is frequently amplified or emphasized. Interpretation requires context: intensifiers modify likelihood terms (“very likely,” “extremely rare”) but also non-probability expressions (“very tired”). Bag-of-words counts alone are insufficient to distinguish these uses.

Example words:

very, extremely, highly, strongly, quite, rather, fairly, somewhat, pretty, especially, particularly, remarkably, notably, significantly, significant
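
Because bag-of-words counts cannot separate "very likely" from "very tired," one option is to count only intensifiers that directly precede a likelihood term. The following is a minimal sketch under that assumption; the function name `intensified_likelihood`, the adjacency rule, and the choice of likelihood terms (taken from the Category 3 examples) are illustrative, not part of the lexicon.

```python
import re

# Subsets of the Category 9 and Category 3 example words.
INTENSIFIERS = {"very", "extremely", "highly", "quite", "fairly", "somewhat"}
LIKELIHOOD_TERMS = {"likely", "unlikely", "rare", "common", "frequent", "infrequent"}

def intensified_likelihood(text):
    """Return (intensifier, likelihood term) pairs that are directly adjacent."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [(a, b) for a, b in zip(tokens, tokens[1:])
            if a in INTENSIFIERS and b in LIKELIHOOD_TERMS]

sample = ("Recurrence is very likely, but complications are extremely rare "
          "and she felt very tired.")
print(intensified_likelihood(sample))
```

In this sample the pairs "very likely" and "extremely rare" are kept, while "very tired" is ignored because "tired" is not a likelihood term.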

Category 10 — Negations

Function: Words that reverse or negate the polarity of an adjacent expression, often fundamentally altering the meaning of probability and risk language.

Interpretation:

High counts → probability claims are frequently negated or qualified by reversal (“not likely,” “no evidence”). Bag-of-words counts alone cannot distinguish negated probability language from affirmative usage.

Example words:

not, no, nor, never, neither, without, unlikely, impossible, none, nothing
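
To separate negated probability language ("not likely") from affirmative usage, one simple heuristic is to look for a negation within a few tokens before a probability term. The sketch below assumes a fixed two-token window; the window size, the `negated_probability` helper, and the probability terms (drawn from the Category 2, 3, and 6 examples) are assumptions for illustration only.

```python
import re

# Subsets of the example words from this lexicon.
NEGATIONS = {"not", "no", "nor", "never", "neither", "without"}
PROBABILITY_TERMS = {"likely", "risk", "chance", "possible", "common"}

def negated_probability(text, window=2):
    """Return (negation, term) pairs where a negation occurs within
    `window` tokens before a probability term."""
    tokens = re.findall(r"[a-z]+", text.lower())
    hits = []
    for i, tok in enumerate(tokens):
        if tok in PROBABILITY_TERMS:
            preceding = tokens[max(0, i - window):i]
            negs = [t for t in preceding if t in NEGATIONS]
            if negs:
                hits.append((negs[-1], tok))  # keep the closest negation
    return hits

sample = ("Recurrence is not very likely, and there is no risk of infection; "
          "relapse is common.")
print(negated_probability(sample))
```

The window ignores sentence boundaries and syntactic scope, so it is only a rough approximation; a dependency parser would give a more reliable picture of what each negation modifies.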

Lexicon developed for SEPA 2026. For questions, contact the authors.

SEPA 2026 — Corpus Linguistics Studies
