When we think of data, we are tempted to think of numbers, but there are many other forms of data! In this guide, we will delve into the fundamentals of Natural Language Processing (NLP) using R. NLP is a branch of artificial intelligence that focuses on the interaction between computers and human language. It enables us to analyze, interpret, and derive meaningful information from large amounts of textual data.
Our journey will include loading and preprocessing textual data, performing sentiment analysis, and exploring topic modeling. We will work with the Miller Center’s dataset of speeches given by U.S. presidents, focusing specifically on speeches by Donald Trump.
Before we begin, ensure you have the necessary packages installed. We will use:
- tidyverse for data manipulation
- tidytext for text processing
- stopwords for handling common words
- textstem for word lemmatization
- topicmodels for topic modeling

We also need the reshape2 package installed, though we don’t need to activate it explicitly (tidytext takes care of this for us).
Install the packages (if you haven’t already) using:
install.packages(c("reshape2", "tidyverse", "tidytext", "stopwords", "textstem", "topicmodels"))
Now, load the libraries:
library(tidyverse)
library(stopwords)
library(textstem)
library(topicmodels)
library(tidytext)
We will use the file speeches.RDS, which contains speeches from U.S. presidents. Let’s load the dataset and focus on speeches given by Donald Trump.
# Load the dataset
speeches <- readRDS("speeches.RDS") %>%
filter(president == "Donald Trump")
# Preview the data
str(speeches)
## 'data.frame': 43 obs. of 5 variables:
## $ doc_name : chr "october-26-2020-swearing-ceremony-honorable-amy-coney-barrett" "january-19-2021-farewell-address" "february-4-2020-state-union-address" "september-25-2018-address-73rd-session-united-nations-general" ...
## $ date : chr "2020-10-27" "2021-01-20" "2020-02-05" "2018-09-25" ...
## $ transcript: chr " \r\n\r\nTHE PRESIDENT: Thank you very much. Appreciate it. Thank you very much. Distinguished guests and m"| __truncated__ "My fellow Americans: Four years ago, we launched a great national effort to rebuild our country, to renew its s"| __truncated__ "Thank you very much. Thank you. Thank you very much.\r\n\r\nMadam Speaker, Mr. Vice President, members of Congr"| __truncated__ "THE PRESIDENT: Madam President, Mr. Secretary-General, world leaders, ambassadors, and distinguished delegates:"| __truncated__ ...
## $ president : chr "Donald Trump" "Donald Trump" "Donald Trump" "Donald Trump" ...
## $ title : chr "October 26, 2020: Swearing-In Ceremony of the Honorable Amy Coney Barrett to the US Supreme Court " "January 19, 2021: Farewell Address" "February 4, 2020: State of the Union Address" "September 25, 2018: Address at the 73rd Session of the United Nations General Assembly" ...
Two critical columns are:
- title: the title of each speech (which serves as an identifier)
- transcript: the full text of the speech

Text data often contains noise and inconsistencies that can hinder analysis. Preprocessing transforms raw text into a clean and consistent format, which is crucial for effective NLP. The key preprocessing steps, each covered below, are tokenization, removal of non-alphabetic characters, lemmatization, and removal of stop words.
Tokenization breaks the text into smaller components, making it easier to analyze. We will use the unnest_tokens() function from the tidytext package.
# Tokenize the transcripts
tidy_speeches <- speeches %>%
unnest_tokens(output = word, input = transcript)
# View the tokenized data
head(tidy_speeches$word, 30)
## [1] "the" "president" "thank" "you"
## [5] "very" "much" "appreciate" "it"
## [9] "thank" "you" "very" "much"
## [13] "distinguished" "guests" "and" "my"
## [17] "fellow" "citizens" "this" "is"
## [21] "a" "momentous" "day" "for"
## [25] "america" "for" "the" "united"
## [29] "states" "constitution"
This function creates a new dataframe where each row corresponds to a single word from the transcripts.
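As an optional sanity check, you can count how many word-tokens each speech produced; this snippet is only a sketch, and the exact numbers will depend on your copy of the dataset.
# Optional: number of word-tokens per speech
tidy_speeches %>%
  count(title, name = "n_tokens") %>%
  arrange(desc(n_tokens)) %>%
  head()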
Text often contains punctuation and special characters that are not useful for analysis. We need to remove these to ensure our data consists only of meaningful words.
# Remove non-alphabetic characters
tidy_speeches <- tidy_speeches %>%
mutate(word = str_replace_all(word, "[^a-zA-Z]", "")) %>% # Keep only letters
filter(word != "") # Remove empty strings
The pattern [^a-zA-Z] matches any character that is not an uppercase or lowercase letter. In concert with str_replace_all(), we replace any character that is not a letter with an empty string.
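To see what this does in isolation, here is a toy example (not part of the pipeline) applied to a few made-up tokens; it also shows why we filter out empty strings afterwards.
# Toy example: the same pattern applied to individual tokens
str_replace_all(c("it's", "2020", "great,"), "[^a-zA-Z]", "")
# Returns "its", "", and "great" -- digits-only tokens become empty strings, hence the filter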
Lemmatization reduces words to their base or dictionary form. For example, “running,” “runs,” and “ran” become “run.” This helps in grouping different forms of the same word, improving the consistency of our data.
# Lemmatize the words
tidy_speeches <- tidy_speeches %>%
mutate(word = lemmatize_words(word))
# View the lemmatized words
head(tidy_speeches$word, 30)
## [1] "the" "president" "thank" "you" "very"
## [6] "much" "appreciate" "it" "thank" "you"
## [11] "very" "much" "distinguish" "guest" "and"
## [16] "my" "fellow" "citizen" "this" "be"
## [21] "a" "momentous" "day" "for" "america"
## [26] "for" "the" "unite" "state" "constitution"
The first few words look similar to what they were before lemmatization, but they’re not identical! Notice that “guests” and “citizens” turned into “guest” and “citizen.” Lemmatization enhances the quality of our analysis by ensuring that variations of a word are treated as the same term.
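You can also try the lemmatizer on its own to see the behavior described above; with textstem’s default dictionary, the example words from this section should collapse as expected.
# Quick standalone check of the lemmatizer
lemmatize_words(c("running", "runs", "ran", "guests", "citizens"))
# Expect something like "run" "run" "run" "guest" "citizen" with the default dictionary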
Stop words are common words like “the,” “and,” “is,” which do not contribute much to the meaning of the text. Removing them helps focus on the more meaningful words. The stopwords package gives us a helpful list of stop words that we can use.
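If you are curious, you can peek at this list before using it. The snippet below is purely exploratory; the size and contents depend on the source you choose.
# Exploratory peek at the stopwords-iso English list
length(stopwords("en", source = "stopwords-iso"))
head(stopwords("en", source = "stopwords-iso"), 10)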
# Get a list of English stop words
sw <- stopwords("en", source = "stopwords-iso")
# Remove stop words from the dataset
tidy_speeches <- tidy_speeches %>%
filter(!word %in% sw)
# Check the result
head(tidy_speeches$word, 30)
## [1] "president" "distinguish" "guest" "fellow" "citizen"
## [6] "momentous" "day" "america" "unite" "constitution"
## [11] "fair" "impartial" "rule" "law" "constitution"
## [16] "ultimate" "defense" "american" "liberty" "faithful"
## [21] "application" "law" "cornerstone" "republic" "president"
## [26] "solemn" "obligation" "honor" "appoint" "supreme"
By filtering out stop words, we concentrate on words that carry significant meaning.
A basic form of textual analysis is simply analyzing word frequencies, which can help identify common themes and topics within the speeches.
# Calculate word frequencies
word_counts <- tidy_speeches %>%
count(word, sort = TRUE)
# Display the most frequent words
head(word_counts, 10)
## word n
## 1 people 1238
## 2 president 1219
## 3 american 783
## 4 country 769
## 5 applause 676
## 6 time 497
## 7 nation 455
## 8 lot 436
## 9 unite 390
## 10 america 376
By looking at the most frequent words, we can gain some insights. Trump may be talking about uniting the nation, for example. We also see the word “applause” which is likely captured in the transcript at moments when audiences applaud. We could certainly add “applause” to the list of stopwords we exclude.
sw <- c(sw, "applause")
tidy_speeches <- tidy_speeches %>%
filter(!word %in% sw)
# Calculate word frequencies
word_counts <- tidy_speeches %>%
count(word, sort = TRUE)
# Display the most frequent words
head(word_counts, 10)
## word n
## 1 people 1238
## 2 president 1219
## 3 american 783
## 4 country 769
## 5 time 497
## 6 nation 455
## 7 lot 436
## 8 unite 390
## 9 america 376
## 10 job 342
We now see “job” as one of the top ten words.
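If you prefer a visual summary, a quick bar chart of the most frequent words is easy to sketch with ggplot2 (already loaded via the tidyverse); this mirrors the plotting approach we use for topics later on.
# Optional: visualize the 15 most frequent words
word_counts %>%
  slice_max(n, n = 15) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col() +
  coord_flip() +
  labs(title = "Most Frequent Words (After Removing Stop Words)",
       x = "Word", y = "Count") +
  theme_minimal()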
Sentiment analysis assesses the emotional tone behind words, helping us understand the attitudes and emotions expressed in the text. While there are sophisticated methods that consider groups of words and context (like phrases or sentences), we will keep it simple and analyze the emotional tone of individual words.
Different sentiment analysis approaches offer various ways to classify words. Some provide binary classifications (positive or negative), while others offer multifaceted sentiment ratings. For our purposes, we will use a binary classification of positive and negative sentiments. Sentiment lexicons are curated lists of words associated with sentiment scores. These scores are often based on human annotations where people rate the sentiment of words.
We will use the Bing sentiment lexicon, which categorizes each word as either positive or negative (a binary choice).
# Load the Bing sentiment lexicon
sentiment_words <- get_sentiments("bing")
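Before joining, it can help to glance at the lexicon itself. The snippet below is optional, and the exact counts depend on the version of the lexicon bundled with your installation of tidytext.
# Optional: inspect the lexicon and how many words it labels positive vs. negative
head(sentiment_words)
sentiment_words %>% count(sentiment)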
We join our cleaned words with the sentiment lexicon to assign sentiment labels to each word in our dataset. We can then peek at a sample.
# Join speeches with sentiment words
speech_sentiments <- tidy_speeches %>%
inner_join(sentiment_words, by = "word")
# View the sentiment-labeled words
speech_sentiments %>% select(word, sentiment) %>% sample_n(20)
## word sentiment
## 1 applaud positive
## 2 demean negative
## 3 nasty negative
## 4 brave positive
## 5 terrible negative
## 6 myth negative
## 7 honor positive
## 8 imperil negative
## 9 fantastic positive
## 10 trump positive
## 11 wrong negative
## 12 recommendation positive
## 13 devastate negative
## 14 incredible positive
## 15 thrill positive
## 16 opponent negative
## 17 powerful positive
## 18 tough positive
## 19 radical negative
## 20 trouble negative
Looking at this list shows us some of the limitations. For instance, “defeat” is classified as negative, but if we’re talking about defeating an enemy rather than suffering our own defeat, is it?
Regardless, we can calculate the net sentiment for each speech by subtracting the number of negative words from the number of positive words. This gives us an overall sentiment score for each speech.
# Compute sentiment scores per speech
speech_sentiment_scores <- speech_sentiments %>%
count(title, sentiment) %>%
pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
mutate(net_sentiment = positive - negative)
# View the sentiment scores
head(speech_sentiment_scores)
## # A tibble: 6 × 4
## title negative positive net_sentiment
## <chr> <int> <int> <int>
## 1 "April 13, 2020: Coronavirus Task Force Brief… 399 348 -51
## 2 "April 15, 2020: Press Briefing with the Coro… 152 221 69
## 3 "April 23, 2020: Task Force Briefing on the C… 246 245 -1
## 4 "August 8, 2020: Press Conference on Executiv… 124 114 -10
## 5 "December 18, 2017: Remarks on National Secur… 89 120 31
## 6 "February 1, 2018: Remarks at the House and S… 58 187 129
This dataframe shows the number of positive and negative words in each speech and the net sentiment score. Let’s find the speeches with the most positive and most negative net sentiment and look at what they say.
# Most positive speech
most_positive_speech <- speech_sentiment_scores %>%
filter(net_sentiment == max(net_sentiment)) %>%
pull(title)
# Most negative speech
most_negative_speech <- speech_sentiment_scores %>%
filter(net_sentiment == min(net_sentiment)) %>%
pull(title)
# Output the results
cat("Most Positive Speech (First 500 Characters):\n")
## Most Positive Speech (First 500 Characters):
most_positive_excerpt <- speeches %>%
filter(title == most_positive_speech) %>%
pull(transcript) %>%
substr(1, 500)
cat(most_positive_excerpt, "\n\n")
## THE PRESIDENT: Thank you, Paul and Mitch, for the introduction and for your tremendous leadership. You folks have done well. I just looked at some numbers. You’ve even done better than you thought, I think—(laughter)—based on what we just saw about 10 minutes ago.
##
## And I want to thank you, to the Governor of this incredible state, my very good friend, Jim Justice and his wonderful wife, Cathy, who are with us. And Jim is now a proud member of the Republican Party. He was a Democrat. He switche
cat("Most Negative Speech (First 500 Characters):\n")
## Most Negative Speech (First 500 Characters):
most_negative_excerpt <- speeches %>%
filter(title == most_negative_speech) %>%
pull(transcript) %>%
substr(1, 500)
cat(most_negative_excerpt, "\n")
## Thank you, thank you. So we begin, Oklahoma, we begin. Thank you, Oklahoma. And thank you to Vice President Mike Pence. We begin, we begin our campaign. Thank you. We begin our campaign and I just want to thank all of you, you are warriors. I’ve been watching the fake news for weeks now, and everything is negative. Don’t go, don’t come, don’t do anything. Today it was like, I’ve never seen anything like it. I’ve never seen anything like it. You are warriors, thank you. We had some very bad peopl
Topic modeling is a technique that can help uncover thematic structures within a collection of documents. It allows us to identify topics discussed across the speeches without us deciding in advance on a list of topics to look for. There are many algorithms for topic modeling, and they can be quite complex, involving advanced statistical and computational methods. We will use the Latent Dirichlet Allocation (LDA) algorithm, which is a widely used method in topic modeling. We will not delve into the intricate computer science and linguistic details of LDA. Instead, we’ll focus on applying it to our data.
It’s important to note that the parameter choices we make, such as the number of topics we’re looking for, are arbitrary and can be adjusted based on the analysis needs.
To use LDA, we need to create a Document-Term Matrix (DTM), where each row represents a document (speech), and each column represents a term (word). The entries in the matrix indicate the frequency of each term in each document.
# Create the Document-Term Matrix
dtm <- tidy_speeches %>%
count(title, word) %>%
cast_dtm(document = title, term = word, value = n)
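Before fitting a model, it can be reassuring to check the shape of this matrix; we expect roughly one row per speech (43, per the str() output above) and one column per distinct word.
# Check the dimensions of the DTM: documents (speeches) x terms (words)
dim(dtm)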
We can now fit the LDA model to our DTM. We’ll specify the number of topics we want the model to identify.
# Set the number of topics
num_topics <- 2
# Fit the LDA model
lda_model <- LDA(dtm, k = num_topics, control = list(seed = 1234))
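As noted above, the number of topics is a judgment call. If you want a rough quantitative cue, one common heuristic (shown here only as a sketch) is to refit the model for several candidate values of k and compare their perplexity with topicmodels’ perplexity() function; lower values are nominally better, but the interpretability of the resulting topics matters more than this number.
# Sketch: compare a few candidate numbers of topics by perplexity (a heuristic, not a rule)
candidate_k <- 2:5
perplexities <- sapply(candidate_k, function(k) {
  model <- LDA(dtm, k = k, control = list(seed = 1234))
  perplexity(model)  # works directly on VEM-fitted models
})
data.frame(k = candidate_k, perplexity = perplexities)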
We can extract the top terms associated with each topic and visualize them to interpret the themes.
# Get the per-topic-per-word probabilities
topics <- tidytext::tidy(lda_model, matrix = "beta")
# Identify the top terms for each topic
top_terms <- topics %>%
group_by(topic) %>%
slice_max(beta, n = 10) %>% # Top 10 terms
ungroup() %>%
arrange(topic, -beta)
top_terms %>%
mutate(term = reorder_within(term, beta, topic)) %>%
ggplot(aes(term, beta, fill = factor(topic))) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic, scales = "free_y") +
coord_flip() +
scale_x_reordered() +
labs(title = "Top Terms in Each Topic",
x = "Term",
y = "Beta (Probability of Term in Topic)") +
theme_minimal()
I don’t know about you, but I actually do not find this terribly enlightening. Let’s see if we can improve by looking at sequences of words, called n-grams, instead of single words. An n-gram is a sequence of n words that appear together in a text. For example:
- Unigrams (1-word): “great,” “speech,” “freedom”
- Bigrams (2-word): “great speech,” “freedom fighters”
- Trigrams (3-word): “United States citizens”
N-grams are useful because they help capture multi-word phrases, such as “United States” or “American people,” that are more meaningful than individual words.
The approach is very similar to what we did before. However, very early in our text preprocessing pipeline, we used unnest_tokens to break the text into single words; now, we will instead use it to break the text into overlapping sets of n words. For instance, the sentence “Chad wrote a natural language processing guide” broken up into 3-grams would be “Chad wrote a,” “wrote a natural,” “a natural language,” and so on.
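To make this concrete, here is a toy sketch (not part of the main pipeline) that tokenizes that example sentence into 3-grams; note that unnest_tokens() lowercases the text by default.
# Toy example: break a single sentence into overlapping 3-grams
tibble(text = "Chad wrote a natural language processing guide") %>%
  unnest_tokens(ngram, text, token = "ngrams", n = 3)
# Each row is one 3-gram: "chad wrote a", "wrote a natural", "a natural language", ...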
Let’s perform a 2-gram based topic modeling of Trump’s most negative and most positive speeches. The code below is sophisticated. I wouldn’t expect you to be able to produce it, but I am sharing it here as a stimulating example.
cleaned_speeches <- speeches %>%
unnest_tokens(word, transcript) %>% # Tokenize words
mutate(word = str_replace_all(word, "[^a-zA-Z]", "")) %>% # Remove non-alphabetic characters
filter(word != "") %>% # Remove empty entries
mutate(word = lemmatize_words(word)) %>% # Lemmatize words
filter(!word %in% sw) # Remove stopwords
perform_lda <- function(speech_title, num_topics = 2, ngram_size = 2) {
# Create n-grams dynamically based on user input
tidy_ngrams <- cleaned_speeches %>%
filter(title == speech_title) %>%
summarize(title = first(title), # Retain the title column
text = str_c(word, collapse = " ")) %>% # Re-combine words
unnest_tokens(ngram, text, token = "ngrams", n = ngram_size) # Create n-grams
# Create Document-Term Matrix
dtm_ngrams <- tidy_ngrams %>%
count(title, ngram) %>% # Count n-grams
cast_dtm(title, ngram, n) # Create DTM
# Fit LDA model
lda_model <- LDA(dtm_ngrams, k = num_topics, control = list(seed = 1234))
# Extract and display top terms
topics <- tidy(lda_model, matrix = "beta")
top_terms <- topics %>%
group_by(topic) %>%
slice_max(beta, n = 5) %>%
ungroup() %>%
arrange(topic, -beta)
# Visualize top n-grams
top_terms %>%
mutate(term = reorder_within(term, beta, topic)) %>%
ggplot(aes(term, beta, fill = factor(topic))) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic, scales = "free_y") +
coord_flip() +
scale_x_reordered() +
labs(title = paste0("Top ", ngram_size, "-grams in Each Topic for speech:"),
subtitle = speech_title,
x = paste0(ngram_size, "-gram"),
y = "Beta (Topic Probability)") +
theme_minimal()
}
perform_lda(most_positive_speech, num_topics = 2, ngram_size = 2)
perform_lda(most_negative_speech, num_topics = 2, ngram_size = 2)
In his most positive speech, Trump is talking about policy issues affecting the American people: taxes, immigration, education, and so forth. In his most negative speech, he’s talking about Joe Biden, fake news, bad people, and the radical left (actually “leave,” derived erroneously from “left” during our lemmatization).
In this guide, we have introduced the fundamental concepts of Natural Language Processing in R. We covered loading and preprocessing textual data (tokenization, cleaning, lemmatization, and stop word removal), word frequency analysis, sentiment analysis with the Bing lexicon, and topic modeling with LDA, including an n-gram variant.
These techniques empower us to extract meaningful insights from textual data, which can be valuable in justice work where understanding narratives and discourse is crucial.