When we think of data, we are tempted to think of numbers, but there are many other forms of data! In this guide, we will delve into the fundamentals of Natural Language Processing (NLP) using R. NLP is a branch of artificial intelligence that focuses on the interaction between computers and human language. It enables us to analyze, interpret, and derive meaningful information from large amounts of textual data.
Our journey will include loading and preprocessing textual data, performing sentiment analysis, and exploring topic modeling. We will work with the Miller Center’s dataset of speeches given by U.S. presidents, focusing specifically on speeches by Donald Trump.
Before we begin, ensure you have the necessary packages installed. We will use:
- tidyverse for data manipulation
- tidytext for text processing
- stopwords for handling common words
- textstem for word lemmatization
- topicmodels for topic modeling

We also need the reshape2 package installed, though we don’t need to activate it explicitly (tidytext takes care of this for us).
Install the packages (if you haven’t already) using:
install.packages(c("reshape2", "tidyverse", "tidytext", "stopwords", "textstem", "topicmodels"))
Now, load the libraries:
library(tidyverse)
library(stopwords)
library(textstem)
library(topicmodels)
library(tidytext)
We will use the file speeches.RDS, which contains speeches from U.S. presidents. Let’s load the dataset and focus on speeches given by Donald Trump.
# Load the dataset
speeches <- readRDS("speeches.RDS") %>%
filter(president == "Donald Trump")
# Preview the data
str(speeches)
## 'data.frame': 43 obs. of 5 variables:
## $ doc_name : chr "october-26-2020-swearing-ceremony-honorable-amy-coney-barrett" "january-19-2021-farewell-address" "february-4-2020-state-union-address" "september-25-2018-address-73rd-session-united-nations-general" ...
## $ date : chr "2020-10-27" "2021-01-20" "2020-02-05" "2018-09-25" ...
## $ transcript: chr " \r\n\r\nTHE PRESIDENT: Thank you very much. Appreciate it. Thank you very much. Distinguished guests and m"| __truncated__ "My fellow Americans: Four years ago, we launched a great national effort to rebuild our country, to renew its s"| __truncated__ "Thank you very much. Thank you. Thank you very much.\r\n\r\nMadam Speaker, Mr. Vice President, members of Congr"| __truncated__ "THE PRESIDENT: Madam President, Mr. Secretary-General, world leaders, ambassadors, and distinguished delegates:"| __truncated__ ...
## $ president : chr "Donald Trump" "Donald Trump" "Donald Trump" "Donald Trump" ...
## $ title : chr "October 26, 2020: Swearing-In Ceremony of the Honorable Amy Coney Barrett to the US Supreme Court " "January 19, 2021: Farewell Address" "February 4, 2020: State of the Union Address" "September 25, 2018: Address at the 73rd Session of the United Nations General Assembly" ...
Two critical columns are:
- title: the title of each speech (which serves as an identifier)
- transcript: the full text of the speech

Text data often contains noise and inconsistencies that can hinder analysis. Preprocessing transforms raw text into a clean and consistent format, which is crucial for effective NLP. The key preprocessing steps, each covered below, are tokenization, removal of non-alphabetic characters, lemmatization, and removal of stop words.
Tokenization breaks the text into smaller components, making it easier to analyze. We will use the unnest_tokens() function from the tidytext package.
# Tokenize the transcripts
tidy_speeches <- speeches %>%
unnest_tokens(output = word, input = transcript)
# View the tokenized data
head(tidy_speeches$word, 30)
## [1] "the" "president" "thank" "you"
## [5] "very" "much" "appreciate" "it"
## [9] "thank" "you" "very" "much"
## [13] "distinguished" "guests" "and" "my"
## [17] "fellow" "citizens" "this" "is"
## [21] "a" "momentous" "day" "for"
## [25] "america" "for" "the" "united"
## [29] "states" "constitution"
This function creates a new dataframe where each row corresponds to a single word from the transcripts.
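As an optional sanity check, you can count how many word-tokens each speech produced; this snippet is only a sketch, and the exact numbers will depend on your copy of the dataset.
# Optional: number of word-tokens per speech
tidy_speeches %>%
  count(title, name = "n_tokens") %>%
  arrange(desc(n_tokens)) %>%
  head()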
Text often contains punctuation and special characters that are not useful for analysis. We need to remove these to ensure our data consists only of meaningful words.
# Remove non-alphabetic characters
tidy_speeches <- tidy_speeches %>%
mutate(word = str_replace_all(word, "[^a-zA-Z]", "")) %>% # Keep only letters
filter(word != "") # Remove empty strings
The pattern [^a-zA-Z] matches any character that is not an uppercase or lowercase letter. In concert with str_replace_all(), we replace any character that is not a letter with an empty string.
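To see what this does in isolation, here is a toy example (not part of the pipeline) applied to a few made-up tokens; it also shows why we filter out empty strings afterwards.
# Toy example: the same pattern applied to individual tokens
str_replace_all(c("it's", "2020", "great,"), "[^a-zA-Z]", "")
# Returns "its", "", and "great" -- digits-only tokens become empty strings, hence the filter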
Lemmatization reduces words to their base or dictionary form. For example, “running,” “runs,” and “ran” become “run.” This helps in grouping different forms of the same word, improving the consistency of our data.
# Lemmatize the words
tidy_speeches <- tidy_speeches %>%
mutate(word = lemmatize_words(word))
# View the lemmatized words
head(tidy_speeches$word, 30)
## [1] "the" "president" "thank" "you" "very"
## [6] "much" "appreciate" "it" "thank" "you"
## [11] "very" "much" "distinguish" "guest" "and"
## [16] "my" "fellow" "citizen" "this" "be"
## [21] "a" "momentous" "day" "for" "america"
## [26] "for" "the" "unite" "state" "constitution"
The first few words look similar to what they were before lemmatization, but they’re not identical! Notice that “guests” and “citizens” turned into “guest” and “citizen.” Lemmatization enhances the quality of our analysis by ensuring that variations of a word are treated as the same term.
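You can also try the lemmatizer on its own to see the behavior described above; with textstem’s default dictionary, the example words from this section should collapse as expected.
# Quick standalone check of the lemmatizer
lemmatize_words(c("running", "runs", "ran", "guests", "citizens"))
# Expect something like "run" "run" "run" "guest" "citizen" with the default dictionary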
Stop words are common words like “the,” “and,” “is,” which do not contribute much to the meaning of the text. Removing them helps focus on the more meaningful words. The stopwords package gives us a helpful list of stop words that we can use.
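If you are curious, you can peek at this list before using it. The snippet below is purely exploratory; the size and contents depend on the source you choose.
# Exploratory peek at the stopwords-iso English list
length(stopwords("en", source = "stopwords-iso"))
head(stopwords("en", source = "stopwords-iso"), 10)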
# Get a list of English stop words
sw <- stopwords("en", source = "stopwords-iso")
# Remove stop words from the dataset
tidy_speeches <- tidy_speeches %>%
filter(!word %in% sw)
# Check the result
head(tidy_speeches$word, 30)
## [1] "president" "distinguish" "guest" "fellow" "citizen"
## [6] "momentous" "day" "america" "unite" "constitution"
## [11] "fair" "impartial" "rule" "law" "constitution"
## [16] "ultimate" "defense" "american" "liberty" "faithful"
## [21] "application" "law" "cornerstone" "republic" "president"
## [26] "solemn" "obligation" "honor" "appoint" "supreme"
By filtering out stop words, we concentrate on words that carry significant meaning.
A basic form of textual analysis is simply analyzing word frequencies, which can help identify common themes and topics within the speeches.
# Calculate word frequencies
word_counts <- tidy_speeches %>%
count(word, sort = TRUE)
# Display the most frequent words
head(word_counts, 10)
## word n
## 1 people 1238
## 2 president 1219
## 3 american 783
## 4 country 769
## 5 applause 676
## 6 time 497
## 7 nation 455
## 8 lot 436
## 9 unite 390
## 10 america 376
By looking at the most frequent words, we can gain some insights. Trump may be talking about uniting the nation, for example. We also see the word “applause” which is likely captured in the transcript at moments when audiences applaud. We could certainly add “applause” to the list of stopwords we exclude.
sw <- c(sw, "applause")
tidy_speeches <- tidy_speeches %>%
filter(!word %in% sw)
# Calculate word frequencies
word_counts <- tidy_speeches %>%
count(word, sort = TRUE)
# Display the most frequent words
head(word_counts, 10)
## word n
## 1 people 1238
## 2 president 1219
## 3 american 783
## 4 country 769
## 5 time 497
## 6 nation 455
## 7 lot 436
## 8 unite 390
## 9 america 376
## 10 job 342
We now see “job” as one of the top ten words.
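If you prefer a visual summary, a quick bar chart of the most frequent words is easy to sketch with ggplot2 (already loaded via the tidyverse); this mirrors the plotting approach we use for topics later on.
# Optional: visualize the 15 most frequent words
word_counts %>%
  slice_max(n, n = 15) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col() +
  coord_flip() +
  labs(title = "Most Frequent Words (After Removing Stop Words)",
       x = "Word", y = "Count") +
  theme_minimal()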
Sentiment analysis assesses the emotional tone behind words, helping us understand the attitudes and emotions expressed in the text. While there are sophisticated methods that consider groups of words and context (like phrases or sentences), we will keep it simple and analyze the emotional tone of individual words.
Different sentiment analysis approaches offer various ways to classify words. Some provide binary classifications (positive or negative), while others offer multifaceted sentiment ratings. For our purposes, we will use a binary classification of positive and negative sentiments. Sentiment lexicons are curated lists of words associated with sentiment scores. These scores are often based on human annotations where people rate the sentiment of words.
We will use the Bing sentiment lexicon, which categorizes each word as either positive or negative (a binary choice).
# Load the Bing sentiment lexicon
sentiment_words <- get_sentiments("bing")
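Before joining, it can help to glance at the lexicon itself. The snippet below is optional, and the exact counts depend on the version of the lexicon bundled with your installation of tidytext.
# Optional: inspect the lexicon and how many words it labels positive vs. negative
head(sentiment_words)
sentiment_words %>% count(sentiment)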
We join our cleaned words with the sentiment lexicon to assign sentiment labels to each word in our dataset. We can then peek at a sample.
# Join speeches with sentiment words
speech_sentiments <- tidy_speeches %>%
inner_join(sentiment_words, by = "word")
# View the sentiment-labeled words
speech_sentiments %>% select(word, sentiment) %>% sample_n(20)
## word sentiment
## 1 applaud positive
## 2 demean negative
## 3 nasty negative
## 4 brave positive
## 5 terrible negative
## 6 myth negative
## 7 honor positive
## 8 imperil negative
## 9 fantastic positive
## 10 trump positive
## 11 wrong negative
## 12 recommendation positive
## 13 devastate negative
## 14 incredible positive
## 15 thrill positive
## 16 opponent negative
## 17 powerful positive
## 18 tough positive
## 19 radical negative
## 20 trouble negative
Looking at this list shows us some of the limitations. For instance, “defeat” is classified as negative, but if we’re talking about defeating an enemy rather than suffering our own defeat, is it?
Regardless, we can calculate the net sentiment for each speech by subtracting the number of negative words from the number of positive words. This gives us an overall sentiment score for each speech.
# Compute sentiment scores per speech
speech_sentiment_scores <- speech_sentiments %>%
count(title, sentiment) %>%
pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
mutate(net_sentiment = positive - negative)
# View the sentiment scores
head(speech_sentiment_scores)
## # A tibble: 6 × 4
## title negative positive net_sentiment
## <chr> <int> <int> <int>
## 1 "April 13, 2020: Coronavirus Task Force Brief… 399 348 -51
## 2 "April 15, 2020: Press Briefing with the Coro… 152 221 69
## 3 "April 23, 2020: Task Force Briefing on the C… 246 245 -1
## 4 "August 8, 2020: Press Conference on Executiv… 124 114 -10
## 5 "December 18, 2017: Remarks on National Secur… 89 120 31
## 6 "February 1, 2018: Remarks at the House and S… 58 187 129
This dataframe shows the number of positive and negative words in each speech and the net sentiment score. Let’s find the speeches with the most positive and most negative net sentiment and look at what they say.
# Most positive speech
most_positive_speech <- speech_sentiment_scores %>%
filter(net_sentiment == max(net_sentiment)) %>%
pull(title)
# Most negative speech
most_negative_speech <- speech_sentiment_scores %>%
filter(net_sentiment == min(net_sentiment)) %>%
pull(title)
# Output the results
cat("Most Positive Speech (First 500 Characters):\n")
## Most Positive Speech (First 500 Characters):
most_positive_excerpt <- speeches %>%
filter(title == most_positive_speech) %>%
pull(transcript) %>%
substr(1, 500)
cat(most_positive_excerpt, "\n\n")
## THE PRESIDENT: Thank you, Paul and Mitch, for the introduction and for your tremendous leadership. You folks have done well. I just looked at some numbers. You’ve even done better than you thought, I think—(laughter)—based on what we just saw about 10 minutes ago.
##
## And I want to thank you, to the Governor of this incredible state, my very good friend, Jim Justice and his wonderful wife, Cathy, who are with us. And Jim is now a proud member of the Republican Party. He was a Democrat. He switche
cat("Most Negative Speech (First 500 Characters):\n")
## Most Negative Speech (First 500 Characters):
most_negative_excerpt <- speeches %>%
filter(title == most_negative_speech) %>%
pull(transcript) %>%
substr(1, 500)
cat(most_negative_excerpt, "\n")
## Thank you, thank you. So we begin, Oklahoma, we begin. Thank you, Oklahoma. And thank you to Vice President Mike Pence. We begin, we begin our campaign. Thank you. We begin our campaign and I just want to thank all of you, you are warriors. I’ve been watching the fake news for weeks now, and everything is negative. Don’t go, don’t come, don’t do anything. Today it was like, I’ve never seen anything like it. I’ve never seen anything like it. You are warriors, thank you. We had some very bad peopl
Topic modeling is a technique that can help uncover thematic structures within a collection of documents. It allows us to identify topics discussed across the speeches without us deciding in advance on a list of topics to look for. There are many algorithms for topic modeling, and they can be quite complex, involving advanced statistical and computational methods. We will use the Latent Dirichlet Allocation (LDA) algorithm, which is a widely used method in topic modeling. We will not delve into the intricate computer science and linguistic details of LDA. Instead, we’ll focus on applying it to our data.
It’s important to note that the parameter choices we make, such as the number of topics we’re looking for, are arbitrary and can be adjusted based on the analysis needs.
To use LDA, we need to create a Document-Term Matrix (DTM), where each row represents a document (speech), and each column represents a term (word). The entries in the matrix indicate the frequency of each term in each document.
# Create the Document-Term Matrix
dtm <- tidy_speeches %>%
count(title, word) %>%
cast_dtm(document = title, term = word, value = n)
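Before fitting a model, it can be reassuring to check the shape of this matrix; we expect roughly one row per speech (43, per the str() output above) and one column per distinct word.
# Check the dimensions of the DTM: documents (speeches) x terms (words)
dim(dtm)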
We can now fit the LDA model to our DTM. We’ll specify the number of topics we want the model to identify.
# Set the number of topics
num_topics <- 2
# Fit the LDA model
lda_model <- LDA(dtm, k = num_topics, control = list(seed = 1234))
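As noted above, the number of topics is a judgment call. If you want a rough quantitative cue, one common heuristic (shown here only as a sketch) is to refit the model for several candidate values of k and compare their perplexity with topicmodels’ perplexity() function; lower values are nominally better, but the interpretability of the resulting topics matters more than this number.
# Sketch: compare a few candidate numbers of topics by perplexity (a heuristic, not a rule)
candidate_k <- 2:5
perplexities <- sapply(candidate_k, function(k) {
  model <- LDA(dtm, k = k, control = list(seed = 1234))
  perplexity(model)  # works directly on VEM-fitted models
})
data.frame(k = candidate_k, perplexity = perplexities)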
We can extract the top terms associated with each topic and visualize them to interpret the themes.
# Get the per-topic-per-word probabilities
topics <- tidytext::tidy(lda_model, matrix = "beta")
# Identify the top terms for each topic
top_terms <- topics %>%
group_by(topic) %>%
slice_max(beta, n = 10) %>% # Top 10 terms
ungroup() %>%
arrange(topic, -beta)
top_terms %>%
mutate(term = reorder_within(term, beta, topic)) %>%
ggplot(aes(term, beta, fill = factor(topic))) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic, scales = "free_y") +
coord_flip() +
scale_x_reordered() +
labs(title = "Top Terms in Each Topic",
x = "Term",
y = "Beta (Probability of Term in Topic)") +
theme_minimal()
I don’t know about you, but I actually do not find this terribly enlightening. Let’s see if we can improve by looking at sequences of words, called n-grams, instead of single words. An n-gram is a sequence of n words that appear together in a text. For example:
- Unigrams (1-word): “great,” “speech,” “freedom”
- Bigrams (2-word): “great speech,” “freedom fighters”
- Trigrams (3-word): “United States citizens”
N-grams are useful because they help capture multi-word phrases, such as “United States” or “American people,” that are more meaningful than individual words.
The approach is very similar to what we did before. However, very early in our text preprocessing pipeline, we used unnest_tokens to break the text into single words; now, we will instead use it to break the text into overlapping sets of n words. For instance, the sentence “Chad wrote a natural language processing guide” broken up into 3-grams would be “Chad wrote a,” “wrote a natural,” “a natural language,” and so on.
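To make this concrete, here is a toy sketch (not part of the main pipeline) that tokenizes that example sentence into 3-grams; note that unnest_tokens() lowercases the text by default.
# Toy example: break a single sentence into overlapping 3-grams
tibble(text = "Chad wrote a natural language processing guide") %>%
  unnest_tokens(ngram, text, token = "ngrams", n = 3)
# Each row is one 3-gram: "chad wrote a", "wrote a natural", "a natural language", ...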
Let’s perform a 2-gram based topic modeling of Trump’s most negative and most positive speeches. The code below is sophisticated. I wouldn’t expect you to be able to produce it, but I am sharing it here as a stimulating example.
cleaned_speeches <- speeches %>%
unnest_tokens(word, transcript) %>% # Tokenize words
mutate(word = str_replace_all(word, "[^a-zA-Z]", "")) %>% # Remove non-alphabetic characters
filter(word != "") %>% # Remove empty entries
mutate(word = lemmatize_words(word)) %>% # Lemmatize words
filter(!word %in% sw) # Remove stopwords
perform_lda <- function(speech_title, num_topics = 2, ngram_size = 2) {
# Create n-grams dynamically based on user input
tidy_ngrams <- cleaned_speeches %>%
filter(title == speech_title) %>%
summarize(title = first(title), # Retain the title column
text = str_c(word, collapse = " ")) %>% # Re-combine words
unnest_tokens(ngram, text, token = "ngrams", n = ngram_size) # Create n-grams
# Create Document-Term Matrix
dtm_ngrams <- tidy_ngrams %>%
count(title, ngram) %>% # Count n-grams
cast_dtm(title, ngram, n) # Create DTM
# Fit LDA model
lda_model <- LDA(dtm_ngrams, k = num_topics, control = list(seed = 1234))
# Extract and display top terms
topics <- tidy(lda_model, matrix = "beta")
top_terms <- topics %>%
group_by(topic) %>%
slice_max(beta, n = 5) %>%
ungroup() %>%
arrange(topic, -beta)
# Visualize top n-grams
top_terms %>%
mutate(term = reorder_within(term, beta, topic)) %>%
ggplot(aes(term, beta, fill = factor(topic))) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic, scales = "free_y") +
coord_flip() +
scale_x_reordered() +
labs(title = paste0("Top ", ngram_size, "-grams in Each Topic for speech:"),
subtitle = speech_title,
x = paste0(ngram_size, "-gram"),
y = "Beta (Topic Probability)") +
theme_minimal()
}
perform_lda(most_positive_speech, num_topics = 2, ngram_size = 2)
perform_lda(most_negative_speech, num_topics = 2, ngram_size = 2)
In his most positive speech, Trump is talking about policy issues affecting the American people: taxes, immigration, education, and so forth. In his most negative speech, he’s talking about Joe Biden, fake news, bad people, and the radical left (actually “leave,” derived erroneously from “left” during our lemmatization).
In this guide, we have introduced the fundamental concepts of Natural Language Processing in R. We covered loading and preprocessing textual data (tokenization, cleaning, lemmatization, and stop word removal), word frequency analysis, sentiment analysis with the Bing lexicon, and topic modeling with LDA, including an n-gram variant.
These techniques empower us to extract meaningful insights from textual data, which can be valuable in justice work where understanding narratives and discourse is crucial.