Comparing Political Speeches: A TF-IDF and Cosine Similarity Analysis of Obama, Romney, Trump, and User-defined ‘UB’
Author: Saurabh C Srivastava
Published: April 2, 2025
Objective of the Analysis
The objective of this analysis is to compare the textual similarity between speeches of different political figures (Obama, Romney, Trump, and a user-defined “UB”) using Term Frequency-Inverse Document Frequency (TF-IDF) weighting and cosine similarity. By comparing word usage across the speeches, the goal is to identify patterns of similarity and distinctiveness in each speaker’s language, and to visualize how closely related the speeches are in terms of vocabulary.
Practical Implementation
Professors and Academics: This technique can help detect plagiarism or verify the authorship of student papers and research documents. By comparing language style and vocabulary usage, professors can check whether a document aligns with a student’s known writing style or has been copied from other sources.
Investigation Agencies: Law enforcement or investigative agencies can use authorship attribution to analyze documents in criminal investigations. This can help determine whether a document, such as a ransom note, threatening letter, or anonymous communication, was written by a suspect based on their known writing patterns.
Natural Language Processing (NLP): The techniques demonstrated (TF-IDF, cosine similarity) can be applied to various NLP tasks, such as document clustering, content recommendation, or topic modeling, helping in fields like content curation, sentiment analysis, and personalized communication strategies.
Brief Overview of Code
1. Loading and Preparing Data
First, we need to load the speeches from different authors (Obama, Romney, Trump, and the user-defined “UB”) into the R environment. We do this by using the tm package, which provides functions to load text files into a corpus. A corpus is a collection of text documents. In this case, we are reading .txt files from different directories for each author.
After loading the speeches, we rename each document in the corpus for clarity and then convert the corpus into a tidy format where each word is an individual token.
# Load necessary libraries
library(dplyr)    # For data manipulation, including functions like mutate(), count(), filter(), bind_rows()
library(tm) # For text mining tasks, used to create and manage text corpora from documents
library(tidytext) # Provides tools to manipulate text data in a tidy format, including unnesting text into tokens (words)
library(widyr)    # Provides tools for calculating similarity metrics like pairwise similarity between words or authors
library(ggplot2)  # For creating visualizations, used here to generate bar plots of similarity scores
library(stopwords) # Provides predefined stop words (commonly occurring words to remove in text analysis)
library(tidyr)    # For tidying data, used here to unnest text into individual tokens and reshape data
library(stringr)  # For string matching with str_detect()

# Load the user-defined speech (UB) and add a screen_name column
# (the column is needed for the rename to "author" later on)
bomber = readLines("UnabomberManifesto.txt")
bomber_df = data.frame(linenumber = 1:length(bomber), text = bomber, screen_name = "UB")
bomber_ttt = bomber_df %>% unnest_tokens(word, text)
bomber_tt = bomber_ttt %>% anti_join(stop_words)
bomber_tt = bomber_tt %>%
  filter(!str_detect(word, "^[0-9]*$")) %>%  # drop purely numeric tokens
  filter(!str_detect(word, "ct"))            # drop tokens containing "ct"

# Display the top 20 words
bomber_tt %>%
  count(word, sort = TRUE) %>%
  top_n(20) %>%
  ggplot(aes(x = reorder(word, n), y = n, fill = as.factor(n))) +
  geom_col() +
  coord_flip() +
  theme(legend.position = "none")
# Loading text data (political speeches) into a corpus
speeches.corp <- tm::VCorpus(DirSource(directory = "./obama", pattern = "*.txt"))
romney.corp   <- tm::VCorpus(DirSource(directory = "./romney", pattern = "*.txt"))
trump.corp    <- tm::VCorpus(DirSource(directory = "./trump", pattern = "*.txt"))

# Rename documents for clarity
names(speeches.corp) <- c(paste("Speech", 1:21, sep = " "))
names(romney.corp)   <- c(paste("Speech", 1:22, sep = " "))
names(trump.corp)    <- c(paste("Speech", 1:4, sep = " "))

# Convert corpus to tidy format and tokenize text into words
speeches_corp <- tidy(speeches.corp)
speeches_text <- unnest_tokens(speeches_corp, word, text)
romney_tidy   <- tidy(romney.corp)
romney_tt     <- unnest_tokens(romney_tidy, word, text)
trump_corp    <- tidy(trump.corp)
trump_text    <- unnest_tokens(trump_corp, word, text)
2. Preprocessing Data
Now, we clean the data by removing stop words (e.g., “the”, “and”, “is”, etc.) that do not add meaningful information for text analysis. We then count the frequency of each word per document (speech). This gives us a representation of how many times each word occurs in each author’s speech.
# Count word frequencies and remove stopwords
obama_counts = speeches_text %>%
  anti_join(stop_words, by = "word") %>%
  count(id, word, sort = TRUE) %>%
  ungroup()
names(obama_counts)[1] = "author"

romney_counts = romney_tt %>%
  anti_join(stop_words, by = "word") %>%
  count(id, word, sort = TRUE)
names(romney_counts)[1] = "author"

trump_counts = trump_text %>%
  anti_join(stop_words, by = "word") %>%
  count(id, word, sort = TRUE)
names(trump_counts)[1] = "author"

# Rename the UB screen_name column to match the other datasets
# (the stop-word-filtered bomber_tt is used so UB's counts are comparable)
bomber_tt = rename(bomber_tt, author = screen_name)
3. Combining Data
Once we have the word counts for each author, we need to combine all the word counts from the different authors (Obama, Romney, Trump, and UB) into one dataset. This is done using bind_rows() to stack all the word counts together.
# Combine word counts from all authors (Obama, Romney, Trump, UB)
word_counts <- bomber_tt %>%
  count(author, word, sort = TRUE) %>%
  bind_rows(obama_counts %>% mutate(author = "Obama")) %>%
  bind_rows(romney_counts %>% mutate(author = "Romney")) %>%
  bind_rows(trump_counts %>% mutate(author = "Trump"))

head(word_counts); tail(word_counts)
  author       word   n
1     UB    society 258
2     UB     people 232
3     UB     system 224
4     UB      power 188
5     UB      human 164
6     UB technology 143
4. TF-IDF Calculation
TF-IDF (Term Frequency-Inverse Document Frequency) is a statistical measure of how important a word is to a document within a collection or corpus. We compute it to weight each word by its frequency within a document (speech) relative to how common the word is across all documents, which highlights the most distinctive words for each author.
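The code that builds the word_tf_idf object used below does not appear in the rendered post. A minimal sketch, assuming the word_counts data frame from step 3 and tidytext’s bind_tf_idf(), would be:

# Sketch (not shown in the original post): compute per-author TF-IDF weights.
# Assumes word_counts from step 3; word_tf_idf is the name the later code expects.
word_tf_idf <- word_counts %>%
  bind_tf_idf(word, author, n) %>%
  arrange(desc(tf_idf))

One caveat: because obama_counts, romney_counts, and trump_counts keep one row per speech before the author label is overwritten, word_counts can contain several rows for the same author-word pair. bind_tf_idf() then sees inflated document frequencies, which is the likely reason some idf values in the output below are negative; summing counts per author first (e.g., count(author, word, wt = n)) would avoid this.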
      author   word   n         tf       idf      tf_idf
28650  Trump     92 101 0.01397924 -2.014903 -0.02816681
28651     UB system 224 0.01710707 -1.791759 -0.03065176
28652     UB  power 188 0.01435772 -2.140066 -0.03072647
28653  Trump     97 167 0.02311419 -1.558145 -0.03601525
28654     UB people 232 0.01771804 -2.484907 -0.04402767
28655  Trump     92 481 0.06657439 -2.014903 -0.13414095
5. Cosine Similarity Calculation
Cosine similarity measures how similar two authors’ word usage is by comparing their TF-IDF word vectors: the closer the score is to 1, the more the authors rely on the same distinctive vocabulary.
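Concretely, for two TF-IDF vectors $A$ and $B$ (one per author, with one component per word):

$$\text{similarity}(A, B) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert} = \frac{\sum_i A_i B_i}{\sqrt{\sum_i A_i^2}\,\sqrt{\sum_i B_i^2}}$$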
The widyr::pairwise_similarity() function is used to calculate the cosine similarity between each pair of authors’ word vectors.
# Find similarities between authors using cosine similarity
word_tf_idf %>%
  widyr::pairwise_similarity(author, word, tf_idf) %>%
  slice(1:3)
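To make the metric concrete, here is a rough sketch (not from the original post) of what pairwise_similarity() computes for one pair of authors; manual_cosine is a hypothetical helper, and its result may differ slightly from widyr’s depending on how duplicate author-word rows are handled:

# Hypothetical helper (not in the original analysis): cosine similarity
# between two authors' tf-idf vectors, computed by hand.
manual_cosine <- function(df, a, b) {
  wide <- df %>%
    filter(author %in% c(a, b)) %>%
    group_by(author, word) %>%
    summarise(tf_idf = sum(tf_idf), .groups = "drop") %>%  # collapse duplicate rows, if any
    pivot_wider(names_from = author, values_from = tf_idf, values_fill = 0)
  x <- wide[[a]]
  y <- wide[[b]]
  sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))  # dot product over vector norms
}

manual_cosine(word_tf_idf, "Obama", "Romney")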
6. Visualization
Finally, we visualize the results of the cosine similarity calculations. A bar plot generated with ggplot2 shows the top similarity score for each pair of authors.
# Visualization: Top Similarities Between Authors
word_tf_idf %>%
  widyr::pairwise_similarity(author, word, tf_idf, sort = TRUE) %>%
  # Filter for specific author comparisons
  filter((item1 == "Romney" & item2 == "Trump") |
         (item1 == "Romney" & item2 == "Obama") |
         (item1 == "Romney" & item2 == "UB") |
         (item1 == "Trump"  & item2 == "UB") |
         (item1 == "Trump"  & item2 == "Obama") |
         (item1 == "Obama"  & item2 == "UB")) %>%
  # For each pair, keep the top similarity score
  group_by(item1, item2) %>%
  slice_max(similarity, n = 1) %>%
  ungroup() %>%
  ggplot(aes(x = interaction(item1, item2), y = similarity,
             fill = interaction(item1, item2))) +
  geom_col(position = "dodge") +
  labs(title = "Top Similarities Between Author Pairs",
       x = "Author Pair",
       y = "Similarity Score",
       fill = "Author Pair",
       caption = "Saurabh's Work") +
  theme_minimal() +
  theme(legend.position = "none",
        plot.title = element_text(hjust = 0.5, face = "bold"))
Conclusion
The TF-IDF and cosine similarity analysis of the speeches from Obama, Romney, Trump, and “UB” reveals both commonalities and differences in their word usage. The cosine similarity scores quantify how closely related the speakers are in vocabulary, and the visualizations highlight the strongest pairwise relationships, providing a clearer picture of the linguistic patterns across the speeches. The approach offers an effective way to compare political discourse and can be extended to other datasets for deeper analysis of language usage.