Introduction

Inaugural addresses are speeches given by the newly elected President of the United States at the beginning of their term. These speeches are significant as they often outline the president’s vision, goals, and values for their presidency. In this analysis, we explore a dataset containing inaugural addresses to identify the most frequent terms used by U.S. presidents in these historic speeches. We will also look at the positive-negative sentiment of the words used in these speeches and at the word count of each speech over the years.

Breakdown of Code

Downloading Packages

This R Notebook requires the following packages. ‘quanteda’ is a package for quantitative text analysis. It provides a range of functions for creating and analyzing textual data, including tokenization, document-feature matrix creation, and various text processing tasks. In this notebook, quanteda is used to tokenize the inaugural addresses dataset and create a document-feature matrix from it.

‘ggplot2’ is a powerful and flexible package for creating visualizations in R. It is based on the grammar of graphics, allowing users to create complex and customized plots. In this notebook, ggplot2 is used to generate a bar chart that visualizes the top 10 most frequent terms in the inaugural addresses, excluding certain common words. The notebook also loads ‘dplyr’ and ‘tidyr’ for data manipulation, and later draws on ‘tidytext’ (for the Bing sentiment lexicon) and ‘slam’ (for sparse matrix row sums).

# Install and load necessary packages
if (!require("quanteda")) install.packages("quanteda")
## Loading required package: quanteda
## Package version: 3.3.1
## Unicode version: 14.0
## ICU version: 71.1
## Parallel computing: 8 of 8 threads used.
## See https://quanteda.io for tutorials and examples.
if (!require("ggplot2")) install.packages("ggplot2")
## Loading required package: ggplot2
if (!require("ggplot2")) install.packages("tidyr")
library(quanteda)
library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyr)

Loading and Exploring the Dataset

The dataset used in this analysis is part of the quanteda package and is called data_corpus_inaugural. It consists of the inaugural addresses of U.S. Presidents described above, and provides insight into the priorities and themes of different presidential administrations throughout U.S. history.

  1. Load the Dataset:

data(data_corpus_inaugural): This line loads the inaugural addresses dataset (data_corpus_inaugural) into the R environment. The dataset is now accessible for further analysis.

  2. Display the First Few Documents:

head(data_corpus_inaugural): This function displays the first few documents in the dataset. In this case, it shows a glimpse of the textual content of the inaugural addresses, giving an overview of the structure and format of the data.

The output of head(data_corpus_inaugural) allows you to see the initial documents in the corpus, providing a sample of the text data you’ll be working with. Each document represents a specific inaugural address.

The line dfm <- dfm(tokens(data_corpus_inaugural)) converts the textual data from the inaugural addresses corpus into a Document-Feature Matrix (DFM). (Recent versions of quanteda deprecate calling dfm() directly on a corpus, so the corpus is tokenized first.) The DFM is a structured representation where each row corresponds to a specific document (inaugural address) and each column represents a unique term present in the entire corpus. This transformation enables quantitative analysis of the textual content, such as computing term frequencies, and makes the subsequent analysis and visualization steps straightforward, allowing for the exploration of recurring themes and language patterns within the inaugural addresses of U.S. Presidents throughout history.

# Load the inaugural addresses dataset
data(data_corpus_inaugural)

# Display the first few documents in the dataset
head(data_corpus_inaugural)
## Corpus consisting of 6 documents and 4 docvars.
## 1789-Washington :
## "Fellow-Citizens of the Senate and of the House of Representa..."
## 
## 1793-Washington :
## "Fellow citizens, I am again called upon by the voice of my c..."
## 
## 1797-Adams :
## "When it was first perceived, in early times, that no middle ..."
## 
## 1801-Jefferson :
## "Friends and Fellow Citizens: Called upon to undertake the du..."
## 
## 1805-Jefferson :
## "Proceeding, fellow citizens, to that qualification which the..."
## 
## 1809-Madison :
## "Unwilling to depart from examples of the most revered author..."
# Create a document-feature matrix (tokenizing first, since dfm() on a corpus is deprecated)
dfm <- dfm(tokens(data_corpus_inaugural))
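To get a feel for the structure of the DFM, it can help to inspect its dimensions and a small corner of the matrix. A minimal sketch (the exact counts depend on the quanteda version and its tokenizer):

# Inspect the DFM: one row per address, one column per unique term
dim(dfm)
# Peek at the first few documents and features
dfm[1:3, 1:5]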

Data Visualization 1: Bar Chart of Most Frequent Terms

The next step is to get the most frequent terms:

topfeatures(dfm, n = 39): This line uses the topfeatures function from the quanteda package to get the 39 most frequent terms in the document-feature matrix (dfm), sorted by their frequency in the corpus. The count of 39 is chosen so that roughly 10 meaningful terms remain after the exclusions below (assuming all of the excluded words fall within the top 39).

We then should find the words we want to exclude (if any):

exclude_words <- c("the", "of", ",", ... "be", "been"): This line creates a vector exclude_words containing a list of common English stopwords, punctuation, and other words that might not provide significant insights for analysis. The goal is to filter out these words from the results to focus on more meaningful terms.

filtered_terms <- top_terms[!(names(top_terms) %in% exclude_words)]: This line creates a new vector filtered_terms by excluding the words specified in exclude_words from top_terms. The %in% operator identifies which terms from top_terms match the words in exclude_words, and !(…) negates the match, effectively excluding those words.

By excluding common and less informative words, the analysis focuses on the most meaningful terms that can provide insights into the content of the inaugural addresses. This step helps in generating a more refined visualization that highlights the distinctive language used by U.S. presidents in these historic speeches.

# Get the most frequent terms
top_terms <- topfeatures(dfm, n = 39)

# Omit specific words from the results
exclude_words <- c("the", "of", ",", "and", "to", "in", "a", "our", ".", "that", "is", "it", "for", "which", "with", "as", "this", "are", "but", "has", ";", "by", "its", "or", "their", "on", "have", "be", "been")
filtered_terms <- top_terms[!(names(top_terms) %in% exclude_words)]
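As an aside, quanteda ships with a built-in English stopword list, so a hand-written exclusion vector is not strictly necessary. A sketch of the alternative (note that stopwords("en") differs slightly from the manual list above, so the resulting top terms may differ too):

# Alternative: strip punctuation at tokenization time and drop built-in stopwords
dfm_clean <- dfm_remove(dfm(tokens(data_corpus_inaugural, remove_punct = TRUE)),
                        pattern = stopwords("en"))
topfeatures(dfm_clean, n = 10)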

With our data pre-processed, we can now create a bar chart of the most frequent terms in our Inaugural Addresses Dataset. The ggplot function is used to perform this task, marking each word with its own designated color.

# Create a bar chart of the most frequent terms
ggplot(data.frame(word = names(filtered_terms), frequency = filtered_terms),
       aes(x = reorder(word, -frequency), y = frequency, fill = word)) +
  geom_bar(stat = "identity") +
  labs(title = "Top 10 Most Frequent Terms in Inaugural Addresses (excluding specific words)",
       x = "Terms",
       y = "Frequency") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
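A small optional refinement: since each bar’s fill color just mirrors its x-axis label, the legend adds no information. ggplot2’s last_plot() helper can re-render the chart above without it:

# Optional: suppress the redundant fill legend on the chart above
last_plot() + theme(legend.position = "none")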

Data Visualization 2: Sentiment of Speeches Based on Words Used

To examine positive-negative word sentiment, this code fetches sentiment scores from the Bing lexicon via tidytext, displays the resulting table, and creates a bar graph of the distribution of sentiment labels. Note that, as written, the chart summarizes the lexicon itself rather than the words actually used in the speeches; a sketch of how to score the speeches directly follows the chart.

# Get sentiment scores for each word
sentiment_scores <- tidytext::get_sentiments("bing")
# View the resulting sentiment_scores table
print(sentiment_scores)
## # A tibble: 6,786 × 2
##    word        sentiment
##    <chr>       <chr>    
##  1 2-faces     negative 
##  2 abnormal    negative 
##  3 abolish     negative 
##  4 abominable  negative 
##  5 abominably  negative 
##  6 abominate   negative 
##  7 abomination negative 
##  8 abort       negative 
##  9 aborted     negative 
## 10 aborts      negative 
## # ℹ 6,776 more rows
# Create a bar graph of sentiment scores
ggplot(sentiment_scores, aes(x = sentiment, fill = sentiment)) +
  geom_bar() +
  labs(title = "Distribution of Sentiment Scores", x = "Sentiment Category", y = "Count") +
  theme_minimal()
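To measure the sentiment of the addresses themselves, the speech tokens need to be matched against the lexicon. A minimal sketch, assuming tidytext and dplyr are available (both are loaded in the setup above): unnest_tokens lower-cases and tokenizes the text, and the inner join keeps only words present in the Bing lexicon.

# Sketch: count positive/negative words actually used in the addresses
library(tidytext)
library(dplyr)

speech_words <- convert(data_corpus_inaugural, to = "data.frame") %>%
  unnest_tokens(word, text)                      # one row per word per speech

speech_sentiment <- speech_words %>%
  inner_join(sentiment_scores, by = "word") %>%  # keep only lexicon words
  count(sentiment)                               # totals across all speeches
speech_sentiment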

Data Visualization 3: Words per Speech Over Time

To view the total word count of each speech over the years, this code performs text analysis on the inaugural address corpus, sums the term frequencies for each document, and visualizes how speech length evolves over time using a line plot.

# 'slam' provides efficient row sums for sparse matrices
if (!require("slam")) install.packages("slam", repos = "https://cran.r-project.org")

# The corpus docvars include a 'Year' column giving the date of each address
# Convert the corpus to a data frame with document-level information
df <- convert(data_corpus_inaugural, to = "data.frame")

# Re-create the document-feature matrix (tokenizing first, as above)
dfm <- dfm(tokens(data_corpus_inaugural))
# Sum the term frequencies for each document using slam::row_sums
term_freq <- slam::row_sums(dfm)

# Combine the date information with term frequencies
df <- cbind(df, term_freq)

# Plot the evolution of term frequency over time
library(ggplot2)
ggplot(data = df, aes(x = Year, y = term_freq)) +
  geom_line(color = "blue") +
  labs(title = "Term Frequency Evolution Over Time", x = "Year", y = "Total Term Frequency")

Analysis and Observations

Data Visualization 1: Bar Chart of Most Frequent Terms

From the visualization, we can see that the most common word in these addresses is “we”, suggesting that presidents tend to address the nation as a whole with this pronoun. Other notable common words in this visualization are “people”, “government”, and “will”. “People” also supports the hypothesis that presidents like to address their audience as a whole, perhaps to promote unity, much like the word “we”. The frequency of “government” tells us the importance of this topic and its consistent relevance across speeches throughout the years. Lastly, “will” can be interpreted as presidents’ desire to promise change or resilience on matters of interest (e.g., “We will…”).

Data Visualization 2: Sentiment of Speeches Based on Words Used

In the bar chart, we see that a majority of the words are labeled negative: over 4,500 negative versus around 2,000 positive. One caveat is that these counts describe the Bing lexicon itself rather than the words presidents actually spoke, so on its own the chart tells us more about the lexicon’s composition than about the speeches; the join sketched above is needed to score the addresses directly. To the extent that the speeches draw on this vocabulary, a negative tilt could point toward speech topics fixating on the country’s current problems or world affairs at the time rather than its positive events or improvements.

Data Visualization 3: Words per Speech Over Time

In the line chart, we can see that the word count of speeches varies greatly over the years, with a spike occurring in the 1840s and the lowest count falling shortly before 1800. Word counts also seem to plateau and stabilize after the 1950s. However, given the great fluctuation in word count, not much can be concluded from this visualization alone. Presidents may simply have different preferences and stylistic approaches when giving speeches, which need not correlate with any particular event at the time of delivery, nor with a trend in speech lengths.

Conclusion

Analyzing these core features of inaugural addresses provides insights into the recurring themes and priorities of U.S. presidents. By excluding common words such as articles and prepositions, and by reviewing sentiment and word counts, we can focus on the distinctive language presidents use to convey their messages over time. This exploration contributes to a better understanding of the historical and rhetorical aspects of these important speeches throughout U.S. history.

This analysis serves as a starting point for more in-depth investigations into the evolving language and priorities of U.S. presidents over time. It showcases the value of text analysis in uncovering patterns and themes in large datasets, contributing to a richer understanding of political discourse and leadership communication.