Introduction

Inaugural addresses are speeches given by the President of the United States at the start of each term. They are significant because they often outline the president’s vision, goals, and values for the coming term. In this analysis, we explore a dataset of inaugural addresses to identify the terms U.S. presidents use most frequently in these historic speeches.

Breakdown of Code

Downloading Packages

This R Notebook requires two packages. ‘quanteda’ is a package for quantitative text analysis. It provides functions for creating and analyzing textual data, including tokenization, document-feature matrix creation, and other text processing tasks. In this notebook, quanteda is used to tokenize the inaugural addresses dataset and build a document-feature matrix from it.

‘ggplot2’ is a powerful and flexible package for creating visualizations in R. It is based on the grammar of graphics, allowing users to create complex and customized plots. In this notebook, ggplot2 is used to generate a bar chart that visualizes the top 10 most frequent terms in the inaugural addresses, excluding certain common words.

# Install and load necessary packages
if (!require("quanteda")) install.packages("quanteda")

## Loading required package: quanteda

## Warning in .recacheSubclasses(def@className, def, env): undefined subclass
## "pcorMatrix" of class "replValueSp"; definition not updated

## Warning in .recacheSubclasses(def@className, def, env): undefined subclass
## "pcorMatrix" of class "xMatrix"; definition not updated

## Warning in .recacheSubclasses(def@className, def, env): undefined subclass
## "pcorMatrix" of class "mMatrix"; definition not updated

## Package version: 3.3.1
## Unicode version: 14.0
## ICU version: 71.1

## Parallel computing: 8 of 8 threads used.

## See https://quanteda.io for tutorials and examples.

if (!require("ggplot2")) install.packages("ggplot2")

## Loading required package: ggplot2

library(quanteda)
library(ggplot2)
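
As a hedged alternative to the install-and-load pattern above, the sketch below checks for each package with requireNamespace(), which tests for a package without attaching it. The helper name ensure_package and the choice of CRAN mirror are assumptions for illustration and are not part of the original notebook.

```r
# Hypothetical helper (not in the original notebook): install a package only if it is missing.
ensure_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    # The mirror is an assumption; any CRAN mirror works here
    install.packages(pkg, repos = "https://cloud.r-project.org")
  }
  # Report whether the package is available after the attempt
  invisible(requireNamespace(pkg, quietly = TRUE))
}

# Packages this notebook relies on
needed <- c("quanteda", "ggplot2")
installed_ok <- vapply(needed, ensure_package, logical(1))
installed_ok
```

The packages still need to be attached with library() afterwards, as the notebook does.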

Loading and Exploring the Dataset

The dataset used in this analysis ships with the quanteda package under the name data_corpus_inaugural. It contains the inaugural addresses of U.S. Presidents, with one document per address. These addresses provide insights into the priorities and themes of different presidential administrations throughout U.S. history.

Load the Dataset:

data(data_corpus_inaugural): This line loads the inaugural addresses dataset (data_corpus_inaugural) into the R environment. The dataset is now accessible for further analysis.

Display the First Few Documents:

head(data_corpus_inaugural): This function displays the first six documents in the corpus, each of which is one inaugural address. The output shows the document names and the opening words of each address, giving an overview of the structure and format of the text data you’ll be working with.

The dfm() call converts the text of the inaugural addresses corpus into a Document-Feature Matrix (DFM). In a DFM, each row corresponds to one document (an inaugural address) and each column to one unique term found anywhere in the corpus, so each cell holds the number of times that term occurs in that address. Note that calling dfm() directly on a corpus is deprecated in quanteda 3; the recommended pattern is to tokenize first with tokens() and pass the result to dfm(). This transformation makes quantitative analysis possible, such as counting term frequencies, and the later analysis and visualization steps all work from this matrix.
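
To make the row-and-column structure concrete, here is a minimal sketch on two made-up sentences rather than the real corpus. The object names toy_texts and toy_dfm are illustrative only.

```r
library(quanteda)

# Two made-up sentences (not from the inaugural corpus), named so the rows are easy to read
toy_texts <- c(doc1 = "We the people", doc2 = "The people will govern")

# Tokenize first, then build the document-feature matrix
toy_dfm <- dfm(tokens(toy_texts))

# One row per document, one column per unique (lower-cased) term
toy_dfm
dim(toy_dfm)        # 2 documents, 5 unique features
featnames(toy_dfm)  # the five distinct terms
```

Because dfm() lower-cases by default, "The" and "the" are counted as the same feature.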

# Load the inaugural addresses dataset
data(data_corpus_inaugural)

# Display the first few documents in the dataset
head(data_corpus_inaugural)

## Corpus consisting of 6 documents and 4 docvars.
## 1789-Washington :
## "Fellow-Citizens of the Senate and of the House of Representa..."
## 
## 1793-Washington :
## "Fellow citizens, I am again called upon by the voice of my c..."
## 
## 1797-Adams :
## "When it was first perceived, in early times, that no middle ..."
## 
## 1801-Jefferson :
## "Friends and Fellow Citizens: Called upon to undertake the du..."
## 
## 1805-Jefferson :
## "Proceeding, fellow citizens, to that qualification which the..."
## 
## 1809-Madison :
## "Unwilling to depart from examples of the most revered author..."
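
Beyond head(), a few other quanteda functions can help with a first look at the corpus. The following is a small sketch; which of these is useful depends on what you want to inspect.

```r
library(quanteda)

# Number of addresses in the corpus
ndoc(data_corpus_inaugural)

# Document-level variables (the docvars reported in the head() output)
head(docvars(data_corpus_inaugural))

# Per-document counts of types, tokens, and sentences for the first five addresses
summary(data_corpus_inaugural, n = 5)
```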

# Create a document-feature matrix
# (tokenize first: calling dfm() directly on a corpus is deprecated in quanteda 3)
dfm <- dfm(tokens(data_corpus_inaugural))

The next step is to get the most frequent terms:

topfeatures(dfm, n = 39): This line uses the topfeatures function from the quanteda package to get the 39 most frequent terms in the document-feature matrix (dfm). The result is a named numeric vector of counts, sorted from most to least frequent.
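
The shape of the topfeatures() result is easiest to see on a tiny example. The sketch below uses made-up text, not the inaugural corpus, and the names toy_dfm and toy_top are illustrative.

```r
library(quanteda)

# Made-up text (not from the corpus) in which "we" is clearly the most frequent token
toy_dfm <- dfm(tokens(c(d1 = "we will we must", d2 = "we the people will")))

# Named numeric vector, sorted from most to least frequent
toy_top <- topfeatures(toy_dfm, n = 2)
toy_top
names(toy_top)  # the terms themselves, which is what the filtering step matches on
```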

Next, we choose the words to exclude, if any:

exclude_words <- c("the", "of", ",", ... "be", "been"): This line creates a vector exclude_words containing common English stopwords, punctuation marks, and other words that are unlikely to provide insight. Filtering these out focuses the results on more meaningful terms.

filtered_terms <- top_terms[!(names(top_terms) %in% exclude_words)]: This line creates a new vector filtered_terms by dropping the words in exclude_words from top_terms. The %in% operator flags which names in top_terms appear in exclude_words, and !(…) negates the match so that only the remaining terms are kept.

By excluding common and less informative words, the analysis focuses on the most meaningful terms that can provide insights into the content of the inaugural addresses. This step helps in generating a more refined visualization that highlights the distinctive language used by U.S. presidents in these historic speeches.
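
The filtering logic can be tried out in base R on a small named vector. The counts and the names top_terms_demo, exclude_demo, and filtered_demo below are made up for illustration.

```r
# Made-up counts standing in for the output of topfeatures()
top_terms_demo <- c(the = 120, we = 45, "," = 300, people = 30, of = 110)
exclude_demo <- c("the", "of", ",")

# TRUE where the term's name appears in the exclusion list
names(top_terms_demo) %in% exclude_demo

# Negate the match to keep only the terms that are not excluded
filtered_demo <- top_terms_demo[!(names(top_terms_demo) %in% exclude_demo)]
filtered_demo
```

Only "we" and "people" remain, with their counts and original order preserved.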

# Get the most frequent terms
top_terms <- topfeatures(dfm, n = 39)

# Omit specific words from the results
exclude_words <- c("the", "of", ",", "and", "to", "in", "a", "our", ".", "that", "is", "it", "for", "which", "with", "as", "this", "are", "but", "has", ";", "by", "its", "or", "their", "on", "have", "be", "been")
filtered_terms <- top_terms[!(names(top_terms) %in% exclude_words)]
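
A hand-written exclusion list is one option. As a hedged alternative, quanteda can drop punctuation during tokenization and remove a built-in English stopword list from the matrix. Note that this is not equivalent to the notebook's approach: the built-in list is broader and includes pronouns such as "we", so the resulting top terms will differ from those discussed later. The names toks, dfm_nostop, and top_nostop are illustrative.

```r
library(quanteda)

# Drop punctuation while tokenizing instead of listing "," "." ";" by hand
toks <- tokens(data_corpus_inaugural, remove_punct = TRUE)

# Remove quanteda's built-in English stopword list from the matrix
dfm_nostop <- dfm_remove(dfm(toks), pattern = stopwords("en"))

# Ten most frequent remaining terms
top_nostop <- topfeatures(dfm_nostop, n = 10)
top_nostop
```

The manual list gives finer control over which function words stay in the chart, while the built-in list is quicker and less error-prone to maintain.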

Data Visualization: Bar Chart of Most Frequent Terms

With the term counts filtered, we can now create a bar chart of the most frequent terms in the inaugural addresses dataset. The ggplot function builds the chart, and mapping the fill aesthetic to the word gives each bar its own color.

# Create a bar chart of the most frequent terms
# (bars ordered from most to least frequent; geom_col() plots the counts as given)
ggplot(data.frame(word = names(filtered_terms), frequency = filtered_terms),
       aes(x = reorder(word, -frequency), y = frequency, fill = word)) +
  geom_col() +
  labs(title = "Top 10 Most Frequent Terms in Inaugural Addresses (excluding specific words)",
       x = "Terms",
       y = "Frequency") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
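
A horizontal layout is a common variation when term labels are long. The sketch below uses made-up counts in place of filtered_terms so that it runs on its own; demo_terms, plot_df, and p are illustrative names, and the styling choices are assumptions rather than the notebook's.

```r
library(ggplot2)

# Made-up counts standing in for filtered_terms from the notebook
demo_terms <- c(we = 50, will = 40, people = 30, government = 25)

# Build the plotting data frame explicitly
plot_df <- data.frame(word = names(demo_terms), frequency = unname(demo_terms))

# Horizontal bars, most frequent term at the top, no legend
p <- ggplot(plot_df, aes(x = reorder(word, frequency), y = frequency, fill = word)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  labs(x = "Terms", y = "Frequency") +
  theme_minimal()
p
```

Hiding the legend avoids repeating information that the axis labels already carry.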

Analysis and Observations

From the visualization, the most common word in these addresses is “we”, which suggests that presidents tend to speak to and for the nation as a whole. Other notable frequent words are “people”, “government”, and “will”. “People” supports the same reading: like “we”, it addresses the audience collectively, perhaps to promote unity. The frequency of “government” points to how consistently the topic recurs across addresses over the years. Lastly, “will” may reflect presidents’ promises of change or resilience (e.g., “We will…”), although a raw count cannot distinguish the verb from the noun “will”.
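
One way to check an interpretation like the “We will…” reading is to look at the word in context. The following is a minimal sketch using quanteda's keyword-in-context function; the window size and the object names toks and we_will are arbitrary choices for illustration.

```r
library(quanteda)

toks <- tokens(data_corpus_inaugural)

# Every occurrence of the two-word phrase "we will", with three tokens of context on each side
we_will <- kwic(toks, pattern = phrase("we will"), window = 3)

head(we_will)
nrow(we_will)  # how many times the phrase occurs across all addresses
```

Reading a sample of these matches gives a better sense of whether “will” is mostly used for promises than the frequency count alone.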

Conclusion

Analyzing the most frequent terms in inaugural addresses provides insights into the recurring themes and priorities of U.S. presidents. By excluding common words such as articles and prepositions, we can focus on the distinctive language used by presidents to convey their messages. This exploration contributes to a better understanding of the historical and rhetorical aspects of these important speeches throughout U.S. history.

This analysis serves as a starting point for more in-depth investigations into the evolving language and priorities of U.S. presidents over time. It showcases the value of text analysis in uncovering patterns and themes in large datasets, contributing to a richer understanding of political discourse and leadership communication.