Inaugural addresses are speeches given by the newly elected President of the United States at the beginning of their term. These speeches are significant because they often outline the president's vision, goals, and values for the presidency. In this analysis, we explore a dataset of inaugural addresses to identify the most frequent terms used by U.S. presidents in these historic speeches. We will also look at the positive-negative sentiment of the words used in these speeches and at the total word count of each speech over the years.
This R Notebook requires the following packages. 'quanteda' is a package for quantitative text analysis. It provides a range of functions for creating and analyzing textual data, including tokenization, document-feature matrix creation, and various text processing tasks. In this notebook, quanteda is used for tokenizing the inaugural addresses dataset and creating a document-feature matrix from it.
'ggplot2' is a powerful and flexible package for creating visualizations in R. Based on the grammar of graphics, it lets users build complex and customized plots. In this notebook, ggplot2 is used to generate a bar chart of the top 10 most frequent terms in the inaugural addresses, excluding certain common words. We also load 'dplyr' and 'tidyr' for data manipulation and use 'tidytext' for the Bing sentiment lexicon.
# Install and load necessary packages
if (!require("quanteda")) install.packages("quanteda")
## Loading required package: quanteda
## Package version: 3.3.1
## Unicode version: 14.0
## ICU version: 71.1
## Parallel computing: 8 of 8 threads used.
## See https://quanteda.io for tutorials and examples.
if (!require("ggplot2")) install.packages("ggplot2")
## Loading required package: ggplot2
if (!require("ggplot2")) install.packages("tidyr")
library(quanteda)
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
The dataset used in this analysis ships with the quanteda package and is called data_corpus_inaugural. It consists of the inaugural addresses of U.S. Presidents and provides insight into the priorities and themes of different presidential administrations throughout U.S. history.
data(data_corpus_inaugural): this line loads the inaugural addresses dataset (data_corpus_inaugural) into the R environment, making it accessible for further analysis.
head(data_corpus_inaugural): this function displays the first few documents in the dataset, giving an overview of the structure and format of the data. Its output provides a sample of the text you'll be working with; each document represents a specific inaugural address.
The line dfm_inaug <- dfm(tokens(data_corpus_inaugural)) converts the textual data from the inaugural addresses corpus into a Document-Feature Matrix (DFM). The corpus is first tokenized with tokens(), as current versions of quanteda require, and the tokens are then tabulated into a matrix in which each row corresponds to a specific document (inaugural address) and each column represents a unique term in the corpus. (The object is named dfm_inaug rather than dfm so that it does not mask quanteda's dfm() function.) This structured representation enables quantitative analysis of the textual content, such as computing term frequencies, and makes the subsequent analysis and visualization steps straightforward, allowing us to explore recurring themes and language patterns in the inaugural addresses of U.S. Presidents throughout history.
# Load the inaugural addresses dataset
data(data_corpus_inaugural)
# Display the first few documents in the dataset
head(data_corpus_inaugural)
## Corpus consisting of 6 documents and 4 docvars.
## 1789-Washington :
## "Fellow-Citizens of the Senate and of the House of Representa..."
##
## 1793-Washington :
## "Fellow citizens, I am again called upon by the voice of my c..."
##
## 1797-Adams :
## "When it was first perceived, in early times, that no middle ..."
##
## 1801-Jefferson :
## "Friends and Fellow Citizens: Called upon to undertake the du..."
##
## 1805-Jefferson :
## "Proceeding, fellow citizens, to that qualification which the..."
##
## 1809-Madison :
## "Unwilling to depart from examples of the most revered author..."
# Tokenize the corpus, then create a document-feature matrix
dfm_inaug <- dfm(tokens(data_corpus_inaugural))
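Before moving on, a quick sanity check (not part of the original pipeline) confirms the shape of the matrix:
# One row per address, one column per unique term
ndoc(dfm_inaug)  # number of documents in the corpus
nfeat(dfm_inaug) # number of unique terms across all addresses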
The next step is to get the most frequent terms:
topfeatures(dfm_inaug, n = 39): this line uses the topfeatures function from the quanteda package to get the 39 most frequent terms in the document-feature matrix, sorted by their frequency in the corpus. We request 39 terms so that roughly ten meaningful terms remain after the 29 stopwords and punctuation marks below are filtered out.
We then define the words we want to exclude (if any):
exclude_words <- c("the", "of", ",", ... "be", "been"): this line creates a vector exclude_words containing common English stopwords, punctuation, and other words that are unlikely to provide significant insight. The goal is to filter these out of the results so the analysis focuses on more meaningful terms.
filtered_terms <- top_terms[!(names(top_terms) %in% exclude_words)]: this line creates a new vector filtered_terms by excluding the words specified in exclude_words from top_terms. The %in% operator identifies which names in top_terms match the words in exclude_words, and !(...) negates the match, effectively dropping those words.
By excluding common and less informative words, the analysis focuses on the most meaningful terms that can provide insights into the content of the inaugural addresses. This step helps in generating a more refined visualization that highlights the distinctive language used by U.S. presidents in these historic speeches.
# Get the most frequent terms
top_terms <- topfeatures(dfm_inaug, n = 39)
# Omit specific words from the results
exclude_words <- c("the", "of", ",", "and", "to", "in", "a", "our", ".", "that", "is", "it", "for", "which", "with", "as", "this", "are", "but", "has", ";", "by", "its", "or", "their", "on", "have", "be", "been")
filtered_terms <- top_terms[!(names(top_terms) %in% exclude_words)]
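As an alternative to maintaining an exclusion list by hand, quanteda ships a standard English stopword list; here is a minimal sketch (equivalent in spirit, though the exact terms retained may differ slightly from the manual list above):
# Remove punctuation at tokenization time, then drop standard English stopwords
tokens_clean <- tokens(data_corpus_inaugural, remove_punct = TRUE)
dfm_clean <- dfm_remove(dfm(tokens_clean), stopwords("en"))
topfeatures(dfm_clean, n = 10)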
With our data pre-processed, we can now create a bar chart of the most frequent terms in the inaugural addresses dataset. The ggplot function performs this task, giving each word its own fill color.
# Create a bar chart of the most frequent terms
ggplot(data.frame(word = names(filtered_terms), frequency = filtered_terms),
aes(x = reorder(word, -frequency), y = frequency, fill = word)) +
geom_bar(stat = "identity") +
labs(title = "Top 10 Most Frequent Terms in Inaugural Addresses (excluding specific words)",
x = "Terms",
y = "Frequency") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
To prepare for analyzing the positive-negative sentiment of the words used in these speeches, this code fetches the Bing lexicon of word-level sentiment labels via tidytext, displays the resulting table, and creates a bar graph of the distribution of positive and negative labels across the lexicon's entries. Note that this chart describes the lexicon itself; connecting it to the speeches requires matching the lexicon's words against the corpus vocabulary, as sketched after the chart.
# Install tidytext (if needed) and fetch word-level sentiment labels from the Bing lexicon
if (!require("tidytext")) install.packages("tidytext")
sentiment_scores <- tidytext::get_sentiments("bing")
# View the resulting sentiment_scores table
print(sentiment_scores)
## # A tibble: 6,786 × 2
## word sentiment
## <chr> <chr>
## 1 2-faces negative
## 2 abnormal negative
## 3 abolish negative
## 4 abominable negative
## 5 abominably negative
## 6 abominate negative
## 7 abomination negative
## 8 abort negative
## 9 aborted negative
## 10 aborts negative
## # ℹ 6,776 more rows
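For the exact counts behind the bar graph, one can tally the lexicon directly (dplyr is already loaded); the Bing lexicon contains roughly 4,800 negative and 2,000 positive entries:
# Tally positive vs. negative entries in the lexicon
sentiment_scores %>% count(sentiment)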
# Create a bar graph of sentiment scores
ggplot(sentiment_scores, aes(x = sentiment, fill = sentiment)) +
geom_bar() +
labs(title = "Distribution of Sentiment Scores", x = "Sentiment Category", y = "Count") +
theme_minimal()
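To connect the lexicon to the speeches themselves, here is a minimal sketch (assuming dfm_inaug from above) that weights each lexicon word by its frequency in the corpus rather than counting lexicon entries alone:
# Frequency of every term in the corpus (topfeatures with n = nfeat returns all terms)
all_freq <- topfeatures(dfm_inaug, n = nfeat(dfm_inaug))
word_freq <- data.frame(word = names(all_freq), frequency = as.numeric(all_freq))
# Keep only the words that appear in the Bing lexicon, with their sentiment labels
speech_sentiment <- merge(word_freq, sentiment_scores, by = "word")
# Total occurrences of positive vs. negative words across all addresses
ggplot(speech_sentiment, aes(x = sentiment, y = frequency, fill = sentiment)) +
geom_col() +
labs(title = "Sentiment of Words Used in Inaugural Addresses", x = "Sentiment Category", y = "Total Occurrences") +
theme_minimal()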
To view the total word count of each speech over the years, this code computes the total term frequency for each document in the inaugural corpus and visualizes how speech length evolves over time using a line plot.
# Install and load slam (if needed) for efficient row sums over sparse matrices
if (!require("slam")) install.packages("slam", repos = "https://cran.r-project.org")
# The corpus docvars include Year, which serves as the time axis
# Convert the corpus to a data frame with document-level information
df <- convert(data_corpus_inaugural, to = "data.frame")
# Tokenize and create a document-feature matrix (as above)
dfm_inaug <- dfm(tokens(data_corpus_inaugural))
# Sum the term frequencies for each document using slam::row_sums
term_freq <- slam::row_sums(dfm_inaug)
# Combine the date information with term frequencies
df <- cbind(df, term_freq)
# Plot the evolution of term frequency over time
library(ggplot2)
ggplot(data = df, aes(x = Year, y = term_freq)) +
geom_line(color = "blue") +
labs(title = "Term Frequency Evolution Over Time", x = "Year", y = "Total Term Frequency")
From the visualization, we can see that the most common word in these addresses is "we", which suggests that presidents tend to address the nation as a whole. Other notable words in this visualization are "people", "government", and "will". "People" also supports the hypothesis that presidents like to address their audience collectively, perhaps to promote unity, much like "we". The frequency of "government" tells us the importance of this topic and its consistent relevance across speeches throughout the years. Lastly, "will" can be interpreted as presidents' desire to promise change or resilience on matters of interest (e.g., "We will...").
In the bar chart, we see that the Bing lexicon itself is heavily skewed negative, with over 4,500 entries labeled negative versus around 2,000 labeled positive. This imbalance is a property of the lexicon, not of the speeches, so the raw label counts alone should not be read as evidence that inaugural addresses dwell on problems. If, after matching lexicon words to the corpus as sketched above, negative words still outnumbered positive ones, that could point to speeches fixating on the country's current problems or world affairs rather than its positive developments; the lexicon's composition by itself cannot establish this.
In the line chart, we can see that the word count of speeches varies greatly over the years, spiking in the early 1840s (William Henry Harrison's lengthy 1841 address) and hitting its lowest point before 1800 (Washington's brief 1793 address). Speech length also appears to plateau and stabilize after the 1950s. Note that these totals count every token, including punctuation, so they slightly overstate true word counts. Given the large fluctuations, not much can be concluded from this visualization alone: presidents may simply have different preferences and stylistic approaches, which need not correlate with events at the time of delivery or with any overall trend in speech lengths.
Analyzing these core features of inaugural addresses provides insight into the recurring themes and priorities of U.S. presidents. By excluding common words such as articles and prepositions, reviewing positive-negative sentiment, and examining word counts, we can focus on the distinctive language presidents use to convey their messages over time. This exploration contributes to a better understanding of the historical and rhetorical dimensions of these important speeches throughout U.S. history.
This analysis serves as a starting point for more in-depth investigations into the evolving language and priorities of U.S. presidents over time. It showcases the value of text analysis in uncovering patterns and themes in large datasets, contributing to a richer understanding of political discourse and leadership communication.