Comparing Political Speeches: A TF-IDF and Cosine Similarity Analysis of Obama, Romney, Trump, and User-defined ‘UB’

Author

Saurabh C Srivastava

Published

April 2, 2025

Objective of the Analysis

The objective of this analysis is to compare the textual similarity between speeches of different political figures (Obama, Romney, Trump, and a user-defined “UB”) using the Term Frequency-Inverse Document Frequency (TF-IDF) approach and cosine similarity. By analyzing word usage across the speeches, the goal is to identify patterns of similarity and distinctiveness in the language used by these individuals. Additionally, the study visualizes these similarities to show how closely related the speeches are in terms of vocabulary.

Practical Implementation

  • Professors and Academics: Professors can use this technique to detect instances of plagiarism or to verify the authorship of student papers or research documents. By comparing the language style and vocabulary usage, they can confirm whether the document aligns with the student’s writing style or whether it has been copied from other sources.

  • Investigation Agencies: Law enforcement or investigative agencies can use authorship attribution to analyze documents in criminal investigations. This can help determine whether a document, such as a ransom note, threatening letter, or anonymous communication, was written by a suspect based on their known writing patterns.

  • Natural Language Processing (NLP): The techniques demonstrated (TF-IDF, cosine similarity) can be applied to various NLP tasks, such as document clustering, content recommendation, or topic modeling, helping in fields like content curation, sentiment analysis, and personalized communication strategies.

Brief Overview of Code

1. Loading and Preparing Data

First, we need to load the speeches from different authors (Obama, Romney, Trump, and the user-defined “UB”) into the R environment. We do this by using the tm package, which provides functions to load text files into a corpus. A corpus is a collection of text documents. In this case, we are reading .txt files from different directories for each author.

After loading the speeches, we rename each document in the corpus for clarity and then convert the corpus into a tidy format where each word is an individual token.

# Load necessary libraries
library(dplyr)      # For data manipulation, including functions like mutate(), count(), filter(), bind_rows()

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
library(tm)         # For text mining tasks, used to create and manage text corpora from documents
Loading required package: NLP
library(tidytext)   # Provides tools to manipulate text data in a tidy format, including unnesting text into tokens (words)
library(widyr)      # Provides tools for calculating similarity metrics like pairwise similarity between words or authors
library(ggplot2)    # For creating visualizations, used here to generate bar plots of similarity scores

Attaching package: 'ggplot2'
The following object is masked from 'package:NLP':

    annotate
library(stopwords)  # Provides predefined stop words (commonly occurring words to remove in text analysis)

Attaching package: 'stopwords'
The following object is masked from 'package:tm':

    stopwords
library(tidyr)      # For tidying data, used here to unnest text into individual tokens and reshape data
library(stringr)    # For string manipulation, used here with str_detect() to filter out unwanted tokens


# Load and tokenize the user-defined speech (UB)
bomber = readLines("UnabomberManifesto.txt")
bomber_df = data.frame(linenumber = 1:length(bomber), text = bomber)
bomber_ttt = bomber_df %>% unnest_tokens(word, text)   # one row per word
bomber_tt = bomber_ttt %>% anti_join(stop_words)       # remove common stop words
Joining with `by = join_by(word)`
bomber_tt = bomber_tt %>% 
  filter(!str_detect(word, "^[0-9]*$")) %>%   # drop purely numeric tokens
  filter(!str_detect(word, "ct"))             # drop tokens containing "ct"
  
# Display the 20 most frequent words
bomber_tt %>% 
  count(word, sort = TRUE) %>% 
  top_n(20) %>% 
  ggplot(aes(x = reorder(word, n), y = n, fill = as.factor(n))) +
  geom_col() + 
  coord_flip() +
  theme(legend.position = "none")           
Selecting by n

# Tag each token with the author label "UB"
bomber_ttt = bomber_tt %>% mutate(screen_name = "UB")
head(bomber_ttt)
  linenumber         word screen_name
1          1   industrial          UB
2          1   revolution          UB
3          1 consequences          UB
4          1     disaster          UB
5          2        human          UB
6          2         race          UB
# Loading text data (political speeches) into a corpus
speeches.corp <- tm::VCorpus(DirSource(directory = "./obama", pattern = "*.txt"))
romney.corp <- tm::VCorpus(DirSource(directory = "./romney", pattern = "*.txt"))
trump.corp <- tm::VCorpus(DirSource(directory = "./trump", pattern = "*.txt"))

# Rename documents for clarity
names(speeches.corp) <- c(paste("Speech", 1:21, sep = " "))
names(romney.corp) <- c(paste("Speech", 1:22, sep = " "))
names(trump.corp) <- c(paste("Speech", 1:4, sep = " "))

# Convert corpus to tidy format and tokenize text into words
speeches_corp <- tidy(speeches.corp)
speeches_text <- unnest_tokens(speeches_corp, word, text)
romney_tidy <- tidy(romney.corp)
romney_tt <- unnest_tokens(romney_tidy, word, text)
trump_corp <- tidy(trump.corp)
trump_text <- unnest_tokens(trump_corp, word, text)
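
The same load-tidy-tokenize steps are repeated for each author, so they could be wrapped in a small helper. This is only a sketch; load_author_tokens() is a hypothetical convenience function, not part of tm or tidytext.

# Hypothetical helper: read every .txt file in a directory and return one row per token
load_author_tokens <- function(directory) {
  tm::VCorpus(DirSource(directory = directory, pattern = "*.txt")) %>%
    tidy() %>%
    unnest_tokens(word, text)
}

# Equivalent to the three pipelines above, e.g.:
# speeches_text <- load_author_tokens("./obama")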

2. Preprocessing Data

Now, we clean the data by removing stop words (e.g., “the”, “and”, “is”) that do not add meaningful information for text analysis. We then count the frequency of each word per document (speech). This gives us a representation of how many times each word occurs in each author’s speeches.

# Count word frequencies per speech and remove stop words
obama_counts = speeches_text %>% anti_join(stop_words, by = "word") %>%
  count(id, word, sort = TRUE) %>%
  ungroup()
names(obama_counts)[1] = "author"   # relabel the speech id column; it is overwritten with "Obama" below

romney_counts = romney_tt %>% anti_join(stop_words, by = "word") %>%
  count(id, word, sort = TRUE)
names(romney_counts)[1] = "author"

trump_counts = trump_text %>% anti_join(stop_words, by = "word") %>%
  count(id, word, sort = TRUE)
names(trump_counts)[1] = "author"

# Align the UB tokens with the other data sets by renaming screen_name to author
bomber_ttt = rename(bomber_ttt, author = screen_name)

3. Combining Data

Once we have the word counts for each author, we need to combine all the word counts from the different authors (Obama, Romney, Trump, and UB) into one dataset. This is done using bind_rows() to stack all the word counts together.

# Combine word counts from all authors (Obama, Romney, Trump, UB)
word_counts <- bomber_ttt %>%
  count(author, word, sort = TRUE) %>%
  bind_rows(obama_counts %>% mutate(author = "Obama")) %>%
  bind_rows(romney_counts %>% mutate(author = "Romney")) %>%
  bind_rows(trump_counts %>% mutate(author = "Trump"))
head(word_counts); tail(word_counts)
  author       word   n
1     UB    society 258
2     UB     people 232
3     UB     system 224
4     UB      power 188
5     UB      human 164
6     UB technology 143
      author      word n
28650  Trump withdrawn 1
28651  Trump wonderful 1
28652  Trump    wouldn 1
28653  Trump   wyoming 1
28654  Trump        xl 1
28655  Trump     zinke 1

4. TF-IDF Calculation

TF-IDF (Term Frequency-Inverse Document Frequency) is a statistical measure used to evaluate how important a word is to a document in a collection or corpus. We compute this to weigh each word based on its frequency in the document (speech) and how common it is across all documents. This allows us to focus on the most distinctive words for each author.
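
As a minimal illustration (toy data with made-up counts), the quantities that bind_tf_idf() returns can be computed by hand: tf is a word’s count divided by the total words in its document, idf is the natural log of the number of documents divided by the number of documents containing the word, and tf_idf is their product.

# Toy example of the TF-IDF arithmetic (hypothetical counts, not the speech data)
toy_counts <- data.frame(
  author = c("A", "A", "B", "B"),
  word   = c("economy", "freedom", "economy", "security"),
  n      = c(4, 1, 2, 3)
)

toy_counts %>%
  group_by(author) %>%
  mutate(tf = n / sum(n)) %>%                                  # frequency within each "document"
  ungroup() %>%
  group_by(word) %>%
  mutate(idf = log(n_distinct(toy_counts$author) / n())) %>%   # rarity across documents
  ungroup() %>%
  mutate(tf_idf = tf * idf)

# The same numbers come out of bind_tf_idf(toy_counts, word, author, n).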

# Compute TF-IDF (Term Frequency - Inverse Document Frequency)
word_tf_idf <- word_counts %>%
  bind_tf_idf(word, author, n) %>%
  arrange(desc(tf_idf))
Warning: A value for tf_idf is negative:
 Input should have exactly one row per document-term combination.
head(word_tf_idf)
  author          word  n          tf      idf      tf_idf
1     UB       leftism 75 0.005727814 1.386294 0.007940437
2     UB       leftist 70 0.005345960 1.386294 0.007411074
3     UB      leftists 63 0.004811364 1.386294 0.006669967
4     UB technological 63 0.004811364 1.386294 0.006669967
5     UB psychological 56 0.004276768 1.386294 0.005928859
6     UB     surrogate 44 0.003360318 1.386294 0.004658389
tail(word_tf_idf)
      author   word   n         tf       idf      tf_idf
28650  Trump     92 101 0.01397924 -2.014903 -0.02816681
28651     UB system 224 0.01710707 -1.791759 -0.03065176
28652     UB  power 188 0.01435772 -2.140066 -0.03072647
28653  Trump     97 167 0.02311419 -1.558145 -0.03601525
28654     UB people 232 0.01771804 -2.484907 -0.04402767
28655  Trump     92 481 0.06657439 -2.014903 -0.13414095
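
The warning above (and the negative tf_idf values in the tail) appears because word_counts keeps one row per speech-word combination for Obama, Romney, and Trump; once the author column is overwritten with a single name, the same author-word pair occurs many times, whereas bind_tf_idf() expects exactly one row per document-term combination. Below is a minimal sketch of one way to collapse the counts to the author level first (note that this would change the numbers shown above).

# One possible fix (a sketch): aggregate to a single row per author-word pair
word_counts_agg <- word_counts %>%
  group_by(author, word) %>%
  summarise(n = sum(n), .groups = "drop")

word_tf_idf_agg <- word_counts_agg %>%
  bind_tf_idf(word, author, n) %>%
  arrange(desc(tf_idf))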

5. Cosine Similarity Calculation

  • Cosine similarity measures how similar two authors’ word-usage profiles are: each author is represented as a vector of TF-IDF weights, and the cosine of the angle between two vectors is close to 1 when the vocabularies overlap heavily and close to 0 when they share little.

  • The widyr::pairwise_similarity() function is used to calculate the cosine similarity between each pair of authors’ word vectors; a by-hand version of the same calculation is sketched after the code below.

# Find similarities between authors using cosine similarity
word_tf_idf %>%
  widyr::pairwise_similarity(author, word, tf_idf) %>% slice(1:3)
# A tibble: 3 × 3
  item1  item2 similarity
  <chr>  <chr>      <dbl>
1 Trump  UB         0.184
2 Obama  UB         0.175
3 Romney UB         0.184
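
To make the calculation concrete, the same cosine similarity can be written out by hand for a single pair of authors. This is only a sketch and assumes one tf_idf row per author-word pair (for example the aggregated word_tf_idf_agg from the earlier sketch); widyr::pairwise_similarity() remains the version used in the analysis.

# Cosine similarity between Trump and UB, computed directly from the TF-IDF vectors
a <- word_tf_idf_agg %>% filter(author == "Trump")
b <- word_tf_idf_agg %>% filter(author == "UB")

shared <- inner_join(a, b, by = "word", suffix = c("_a", "_b"))

# dot product over shared words divided by the product of the two vector norms
sum(shared$tf_idf_a * shared$tf_idf_b) /
  (sqrt(sum(a$tf_idf^2)) * sqrt(sum(b$tf_idf^2)))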

6. Visualization of Similarities

  • A visualization is created to display the similarity between authors, making the results of the cosine similarity calculations easier to compare at a glance.

  • A bar plot is generated using ggplot2 to show the similarity score for each selected pair of authors.

# Visualization: Similarity Between Author Pairs
word_tf_idf %>%
  widyr::pairwise_similarity(author, word, tf_idf, sort = TRUE) %>%
  # Keep one direction of each author pair
  filter((item1 == "Romney" & item2 == "Trump") |
           (item1 == "Romney" & item2 == "Obama") |
           (item1 == "Romney" & item2 == "UB") |
           (item1 == "Trump" & item2 == "UB") |
           (item1 == "Trump" & item2 == "Obama") |
           (item1 == "Obama" & item2 == "UB")) %>%
  # For each pair, keep the top similarity score
  group_by(item1, item2) %>%
  slice_max(similarity, n = 1) %>%
  ungroup() %>%
  ggplot(aes(x = interaction(item1, item2), y = similarity, fill = interaction(item1, item2))) +
  geom_col(position = "dodge") +
  labs(title = "Similarity Between Author Pairs",
       x = "Author Pair",
       y = "Similarity Score",
       fill = "Author Pair",
       caption = "Saurabh's Work") +
  theme_minimal() +
  theme(legend.position = "none",
        plot.title = element_text(hjust = 0.5, face = "bold"))

Conclusion

The analysis of the speeches from Obama, Romney, Trump, and “UB” using TF-IDF and cosine similarity reveals insights into the commonality and differences in their word usage. The cosine similarity scores indicate the level of similarity between different speech datasets, showing how closely related the political figures are in terms of their vocabulary. Visualizations further emphasize these relationships, highlighting top similarities and providing a clearer picture of the linguistic patterns across the speeches. The approach allows for an effective comparison of political discourse and can be extended to other datasets for deeper analysis of language usage in political speeches.