Text Mining for Uncertainty: A Case Study Using the Loughran-McDonald Dictionary

Author

Saurabh C Srivastava

Published

May 1, 2025

Objective of the Analysis

The goal of this project is to examine the language of uncertainty in the Unabomber Manifesto using a financial sentiment lexicon, the Loughran-McDonald Dictionary. By identifying and counting uncertainty-related terms, the project aims to provide insight into the author’s psychological state and rhetorical strategies, particularly in how the text expresses ambiguity, doubt, or fear about modern society.

Practical Applications

The techniques and approach used in this project can be extended to a wide range of real-world applications across different fields:

  • Business & Finance: Detecting uncertainty in earnings calls, annual reports (10-Ks), and investor communications to assess market sentiment or potential risks.

  • Public Policy & Government: Analyzing political speeches, public health updates, or government statements to monitor shifts in tone, public confidence, or emerging societal concerns.

  • Legal & Compliance: Identifying vague or hedging language in contracts, testimonies, or legal opinions to evaluate risk, clarity, or intent.

  • Media & Journalism: Exploring how uncertainty is framed in news articles or editorials, especially during crises or breaking events (e.g., elections, pandemics, economic downturns).

  • Education & Research: Examining academic writing or student essays to assess argumentative clarity or confidence in claims.

  • Customer Experience & Product Research: Applying uncertainty detection in product reviews or survey responses to highlight areas where customers express hesitation, doubt, or dissatisfaction.

By combining dictionary-based sentiment analysis with NLP techniques such as tokenization, stop-word removal, and stemming, this approach provides a structured way to turn raw text into insights. It is adaptable to any domain where understanding the tone and intent of language is critical.

Brief Overview of Code

library(lingmatch)    # Download and read sentiment dictionaries
library(dplyr)        # Data manipulation
library(tidytext)     # Tokenization and text preprocessing
library(tibble)       # Tibble format for clean dataframes
library(stringr)      # String processing
library(rvest)        # (Unused here but often for scraping)
library(ggplot2)      # Data visualization
library(SnowballC)    # Word stemming

1. Load the Loughran-McDonald Dictionary

  • Downloads the financial sentiment dictionary.

  • Reads it into a named list, where categories like "Uncertainty" hold words associated with that sentiment.

lingmatch::download.dict("loughranmcdonald", dir = tempdir())
loughranmcdonald dict downloaded:
  /private/var/folders/rp/mkwcs2k94r570kpc9y6hsywc0000gn/T/Rtmpuonhgf/loughranmcdonald.dic
lm_dict <- read.dic(file.path(tempdir(), "loughranmcdonald.dic"))

2. Clean and Tidy the Dictionary

  • Converts the dictionary into a long-form dataframe: one word per row.

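  • Stems words to improve matching (e.g., “depends”, “depending” → “depend”).

  • Removes the duplicate rows that stemming creates, so each stem appears once per category and later joins do not multiply counts.

dict_df <- tibble::enframe(lm_dict, name = "sentiment", value = "word") %>%
  tidyr::unnest(word) %>%
  mutate(word = wordStem(word)) %>%
  distinct()   # several entries can share a stem; keep one row per (sentiment, stem)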
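To make the reshaping concrete, here is a minimal sketch on a made-up two-category list. The `toy_dict` entries are illustrative assumptions, not the actual contents of the Loughran-McDonald file. It shows how stemming can collapse several dictionary entries onto one stem, which is why a de-duplication step after stemming may be worth having.

```r
library(dplyr)
library(tidyr)
library(tibble)
library(SnowballC)

# Hypothetical stand-in for the named list returned by read.dic()
toy_dict <- list(
  Uncertainty = c("depend", "depends", "depending", "doubt"),
  Negative    = c("loss", "losses")
)

toy_df <- tibble::enframe(toy_dict, name = "sentiment", value = "word") %>%
  tidyr::unnest(word) %>%          # one row per (sentiment, word) pair
  mutate(word = wordStem(word))    # Porter stemming via SnowballC

# Rows that became identical after stemming are dropped here
toy_unique <- distinct(toy_df)

nrow(toy_df)      # 6 rows, one per original dictionary entry
nrow(toy_unique)  # 3 rows: depend, doubt, loss
```

In this toy case three of the four Uncertainty entries share the stem "depend", so without `distinct()` that stem would appear three times in the lookup table.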

3. Load the Manifesto and Preprocess

  • Reads the text file line by line.

  • Tokenizes it into single words.

  • Removes common stop words (e.g., “the”, “and”).

  • Applies stemming so words match with the dictionary.

bomber <- readLines("UnabomberManifesto.txt")

bomber_tib <- tibble(text = bomber) %>%
  unnest_tokens(word, text) %>%            # one lowercase word per row
  anti_join(stop_words, by = "word") %>%   # drop common stop words
  mutate(word = wordStem(word))            # stem so tokens match the dictionary
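The same preprocessing steps can be tried on a couple of sentences before running them on the full text. This is a small sketch under stated assumptions: the `toy_lines` below are made up for illustration and are not taken from the manifesto.

```r
library(dplyr)
library(tibble)
library(tidytext)
library(SnowballC)

# Made-up lines standing in for the output of readLines()
toy_lines <- c(
  "The outcome depends on the technology.",
  "It is depending on society, and the risks are unpredictable."
)

# Tokenize only, to see how many words there are before filtering
raw_tokens <- tibble(text = toy_lines) %>%
  unnest_tokens(word, text)                # lowercases and strips punctuation

toy_tokens <- raw_tokens %>%
  anti_join(stop_words, by = "word") %>%   # remove words found in tidytext's stop_words
  mutate(word = wordStem(word))            # reduce remaining words to their stems

nrow(raw_tokens)   # token count before stop-word removal
toy_tokens         # remaining stemmed tokens
```

Comparing `nrow(raw_tokens)` with `nrow(toy_tokens)` gives a quick sense of how much of the text the stop-word filter removes.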

4. Extract & Plot Uncertainty Terms

  • Matches words in the manifesto with the dictionary.

  • Filters only uncertainty-related words.

  • Counts their frequency.

  • Creates a horizontal bar chart to display the top uncertainty terms.

# The pipeline ends in a ggplot object, so the result is named as a plot
uncertainty_plot <- bomber_tib %>%
  inner_join(dict_df, by = "word") %>%         # keep only tokens found in the dictionary
  filter(sentiment == 'Uncertainty') %>%       # restrict to the Uncertainty category
  dplyr::count(word, sort = TRUE) %>%          # frequency of each stemmed term
  ggplot(aes(x = reorder(word, n), y = n)) +
  geom_col(fill = "#008080") +
  coord_flip() +                               # horizontal bars, most frequent on top
  labs(title = "Uncertainty in the Unabomber Manifesto",
       x = "Stemmed Uncertainty Terms", y = "Word Frequency",
       subtitle = "Based on Loughran-McDonald Dictionary | April 9, 2025",
       caption = "Prepared by Saurabh Srivastava") +
  theme(legend.position = "none",
        plot.title = element_text(size = 14, face = "bold", hjust = 0.5))
Warning in inner_join(., dict_df, by = "word"): Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 6 of `x` matches multiple rows in `y`.
ℹ Row 2027 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
  "many-to-many"` to silence this warning.
uncertainty_plot
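The many-to-many warning deserves attention because it can affect the counts in the chart. The sketch below uses entirely made-up data (`toy_tokens` and `toy_dict` are hypothetical) to illustrate one way this can happen: if several dictionary entries share a stem, every matching token is joined to each of them and is counted more than once.

```r
library(dplyr)
library(tibble)

# Made-up stemmed tokens: "depend" occurs twice, "predict" once
toy_tokens <- tibble(word = c("depend", "depend", "predict"))

# Made-up stemmed dictionary in which three entries collapsed onto "depend"
toy_dict <- tibble(
  sentiment = "Uncertainty",
  word      = c("depend", "depend", "depend", "predict")
)

# Joining against the duplicated dictionary multiplies the matches
inflated <- toy_tokens %>%
  inner_join(toy_dict, by = "word", relationship = "many-to-many") %>%
  count(word, sort = TRUE)

# Joining against a de-duplicated dictionary keeps one match per token
deduped <- toy_tokens %>%
  inner_join(distinct(toy_dict), by = "word") %>%
  count(word, sort = TRUE)

inflated  # "depend" has n = 6 (2 tokens x 3 dictionary rows)
deduped   # "depend" has n = 2, its actual frequency
```

A stem that belongs to more than one sentiment category can trigger the same warning without inflating the Uncertainty counts, since the later `filter()` keeps only one category. Checking the dictionary for repeated (sentiment, word) rows is a quick way to tell the two cases apart.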

Conclusion

The analysis of the Unabomber’s manifesto using the Loughran-McDonald Uncertainty dictionary reveals a distinct linguistic pattern centered on conditional or qualified statements. The overwhelming frequency of the stem “depend” indicates a tendency to frame arguments with contingencies rather than absolutes. Other uncertainty-related terms such as “predict”, “suggest”, and “believe” are also present, but their comparatively lower frequency suggests that the narrative leans more toward cautious justification than chaotic ambiguity.

The frequent use of uncertainty words suggests that the writer was careful and cautious in expressing ideas, especially about modern technology. The language is not wild or emotional; it expresses doubt in a deliberate, reasoned way. This indicates that the message was not only emotional anger but also a planned and reasoned criticism, which is important to keep in mind when analyzing such extreme writings.