Basic term frequency analysis in R

There’s an old joke about how algebra was created when Satan suggested putting the alphabet in math. R-based text analysis involves doing something like the opposite: Putting math in the alphabet.

The general idea is that the more often a word appears in the text of a document, the likelier it is that the document is about whatever idea that word represents.

Consider Lincoln’s 271-word Gettysburg Address. The most common non-trivial word in the address is “here,” which occurs eight times. Next is “nation” (five times), followed by “dedicated” (four times). Knowing these words and their frequency counts certainly cannot get the full meaning and nuance of the speech. But it can help you get the idea that Lincoln was saying something about “dedication” and a “here” that was important to the “nation.”
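To make the counting idea concrete, here is a minimal base-R sketch that tallies the words in just the opening sentence of the address. The object names (sentence, words, word_counts) are my own, and the punctuation handling is deliberately crude, so treat this as an illustration rather than the method used later in this lesson.

```r
# First sentence of the Gettysburg Address, used here as a tiny sample text
sentence <- paste(
  "Four score and seven years ago our fathers brought forth on this",
  "continent, a new nation, conceived in Liberty, and dedicated to the",
  "proposition that all men are created equal."
)

# Lowercase the text, strip punctuation, and split on whitespace
words <- strsplit(gsub("[[:punct:]]", "", tolower(sentence)), "\\s+")[[1]]

# Tally each word and sort from most to least frequent
word_counts <- sort(table(words), decreasing = TRUE)
head(word_counts, n = 5)
```

In a sample this small, the only repeated word is "and," which hints at why trivial words have to be set aside before the counts say anything useful.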

Examining Sen. Blackburn’s posts on X

R can quickly produce such word counts, even for documents many times the length of Lincoln’s famous address. As an introduction to what’s possible, let’s look at everything Tennessee Sen. Marsha Blackburn posted on X, formerly called Twitter, between mid-April 2022 and mid-April 2024. There’s nothing especially significant about that time frame. It’s just a dataset I happen to have access to. I downloaded the posts using Brandwatch, a proprietary - and, I’m afraid, very expensive - social media content monitoring platform available to students and faculty in Middle Tennessee State University’s School of Journalism and Strategic Media.

Downloading the data

This code will download Blackburn’s posts from a file in my GitHub space:

mydata <- read.csv("https://raw.githubusercontent.com/drkblake/Data/main/BlackburnX.csv")

I get 5,420 posts, with four columns showing each post’s Date, Url, Author (always “MarshaBlackburn,” which is Blackburn’s X/Twitter handle), and Full.Text.

Here’s a look at the dataset’s first few rows. They aren’t formatted very neatly here, because your web browser is probably wrapping the columns instead of letting you scroll them from left to right. But if you run this code in RStudio, you’ll be able to open the data frame in RStudio’s data frame viewer and see the data in something that looks more like a spreadsheet.

head(mydata, n = 10)

##                     Date
## 1  2024-04-14 00:22:43.0
## 2  2024-04-13 23:29:25.0
## 3  2024-04-13 22:47:14.0
## 4  2024-04-13 22:03:00.0
## 5  2024-04-13 21:56:31.0
## 6  2024-04-13 21:50:24.0
## 7  2024-04-13 21:28:16.0
## 8  2024-04-13 20:46:57.0
## 9  2024-04-13 20:11:43.0
## 10 2024-04-13 18:48:21.0
##                                                                Url
## 1  http://twitter.com/MarshaBlackburn/statuses/1779304240151703768
## 2  http://twitter.com/MarshaBlackburn/statuses/1779290825597280447
## 3  http://twitter.com/MarshaBlackburn/statuses/1779280212204703960
## 4  http://twitter.com/MarshaBlackburn/statuses/1779269080664428875
## 5  http://twitter.com/MarshaBlackburn/statuses/1779267449805803972
## 6  http://twitter.com/MarshaBlackburn/statuses/1779265908327825679
## 7  http://twitter.com/MarshaBlackburn/statuses/1779260340015972748
## 8  http://twitter.com/MarshaBlackburn/statuses/1779249940344045970
## 9  http://twitter.com/MarshaBlackburn/statuses/1779241072989806912
## 10 http://twitter.com/MarshaBlackburn/statuses/1779220095824175510
##             Author
## 1  MarshaBlackburn
## 2  MarshaBlackburn
## 3  MarshaBlackburn
## 4  MarshaBlackburn
## 5  MarshaBlackburn
## 6  MarshaBlackburn
## 7  MarshaBlackburn
## 8  MarshaBlackburn
## 9  MarshaBlackburn
## 10 MarshaBlackburn
##                                                                                                                                                                                                                           Full.Text
## 1                                                                                                                                                                                                              Pray for Israel 🇮🇱🇺🇸
## 2                                                                                                                                                                         Joe Biden’s policies have funded Iran’s attack on Israel.
## 3                                                                            Under President Trump, Iran was broke. President Biden gifted them billions of dollars and then naively said “don’t.” “Don’t” is not a foreign policy.
## 4                                                                                                                                                                                 Never back down to terrorists. Stand with Israel.
## 5                                                                                                                        RT @AIPAC Thank you @MarshaBlackburn America must stand with Israel as it confronts this attack from Iran.
## 6    RT @MorganOrtagus Reminder: Biden allowed the UN sanctions on Iran's drones and ballistic missiles to expire less than six months ago. The very same drones and missiles en route to Israel right now. https://t.co/KKIpjBWD0j
## 7                                                                    The Iranian regime is no longer relying on its proxy terrorist groups to carry out a war on Israel. They are now directly involved. We MUST stand with Israel.
## 8                                                                                                Iran has begun launching drone strikes on Israel. @POTUS — we must move quickly and launch aggressive retaliatory strikes on Iran.
## 9                                                                                                                                                                                         Praying for Israel and the Jewish people.
## 10 Under a new “Ability to Pay” bond policy, the suspect in the murder of a Memphis police officer was released back onto the streets instead of remaining behind bars. This is unconscionable. Stop bail reform and cashless bail.
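If you want to confirm the time frame the posts cover, one option is to parse the Date column and ask for its range. This is only a sketch: it uses three Date values copied from the output above as a stand-in for mydata$Date, and the UTC time zone is my assumption, because the file doesn't say which zone Brandwatch used.

```r
# Three Date values copied from the output above, standing in for mydata$Date
dates <- c("2024-04-14 00:22:43.0",
           "2024-04-13 23:29:25.0",
           "2024-04-13 18:48:21.0")

# %OS reads seconds, including the trailing ".0"; tz = "UTC" is an assumption
parsed <- as.POSIXct(dates, format = "%Y-%m-%d %H:%M:%OS", tz = "UTC")

# Earliest and latest timestamps, then the same thing as plain dates
range(parsed)
as.Date(range(parsed))
```

With the real data, replacing dates with mydata$Date should report the earliest and latest posts in the whole dataset.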

Required packages

Analyzing the text of all these tweets will require not only the now-familiar tidyverse package but also a new package called tidytext. This code installs both, if needed, and activates them:

if (!require("tidyverse")) install.packages("tidyverse")
if (!require("tidytext")) install.packages("tidytext")

library(tidyverse)
library(tidytext)

Extracting each word and term

The first step is to get a list of each word or term in the document, and count the number of times each one occurs. This code does both and puts the results into a data frame called tidy_text. It also sorts the words and terms by frequency, in descending order.

tidy_text <- mydata %>% 
  unnest_tokens(word,Full.Text) %>% 
  count(word, sort = TRUE)
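To see what those two steps do, it can help to run them on a couple of short posts first. The sketch below uses a made-up two-row data frame called toy, with text borrowed from the posts shown earlier; the names toy and toy_counts are mine. It loads dplyr directly, which is one of the packages tidyverse provides.

```r
library(dplyr)     # part of the tidyverse
library(tidytext)

# A two-post stand-in for mydata, using the same Full.Text column name
toy <- tibble(Full.Text = c("Stand with Israel.",
                            "We MUST stand with Israel."))

# One row per word (lowercased, punctuation dropped), then a count per word
toy_counts <- toy %>%
  unnest_tokens(word, Full.Text) %>%
  count(word, sort = TRUE)

toy_counts
```

Notice that "MUST" and "Stand" come out as "must" and "stand." By default, unnest_tokens() lowercases everything, so capitalization differences don't split a word's count.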

Deleting stop words

Open tidy_text, and you’ll see that the most frequently occurring words are “the,” “to,” “and,” “of,” “is,” “in,” and “a.” That doesn’t tell us much; such words are used so often in English that they can’t do much to indicate a particular context or idea. In text analysis, these are called “stop words,” and the usual approach is to delete them.

Likewise, certain terms, like “https,” “t.co,” and “rt,” occur often in X/Twitter content but offer no insights. Again, the usual approach is to delete them. This code handles both tasks:

# Deleting standard stop words
data("stop_words")
tidy_text <- tidy_text %>%
  anti_join(stop_words)

## Joining with `by = join_by(word)`

# Deleting custom stop words
my_stopwords <- tibble(word = c("https",
                                "t.co",
                                "rt"))
tidy_text <- tidy_text %>% 
  anti_join(my_stopwords)

## Joining with `by = join_by(word)`
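As a variation, the two deletions can be combined into a single step, and naming the join column explicitly keeps R from printing the "Joining with" message. This is a sketch on a made-up four-row word count, not a replacement for the code above; toy_counts, all_stopwords and cleaned are names I invented for the example.

```r
library(dplyr)     # part of the tidyverse
library(tidytext)

data("stop_words")

# A made-up word count standing in for tidy_text
toy_counts <- tibble(word = c("the", "border", "rt", "china"),
                     n = c(10, 5, 4, 3))

# Custom stop words, as defined above
my_stopwords <- tibble(word = c("https", "t.co", "rt"))

# Stack the standard and custom lists, then remove both in one anti_join
all_stopwords <- bind_rows(stop_words, my_stopwords)
cleaned <- toy_counts %>%
  anti_join(all_stopwords, by = "word")

cleaned
```

Only "border" and "china" survive. An anti_join keeps the rows of the first data frame that have no match in the second, which is exactly what stop-word removal needs.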

Time for a peek

Finally, let’s have a look at what Blackburn posts about most often on X/Twitter. Open the tidy_text data frame, and you’ll see that her favorite terms include some version of President Joe Biden’s name, followed by “border,” “american,” and “china.” Here are the data frame’s first 20 rows:

head(tidy_text, n = 20)

##               word    n
## 1            biden 1450
## 2           border  854
## 3          biden’s  525
## 4              joe  491
## 5         american  394
## 6            china  388
## 7        communist  373
## 8  marshablackburn  359
## 9        americans  356
## 10             u.s  333
## 11  administration  328
## 12           hamas  296
## 13       president  292
## 14         illegal  275
## 15       democrats  269
## 16          people  267
## 17       tennessee  243
## 18             day  232
## 19          israel  232
## 20          crisis  230

If you know much about Blackburn, these results make sense. These words reference some of her favorite topics. At the other end of the spectrum, she talks about “trump” less often, and barely mentions “abortion.” Again, if you’re paying attention to current national politics, you will understand why.
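One wrinkle in the table above is that “biden” and “biden’s” are counted separately. If you would rather treat them as one term, a possible approach is to strip a trailing possessive and add the counts back together. The sketch below works on four rows copied from the output above; the pattern is my own assumption about what counts as a possessive, and it will also trim other words that end in an apostrophe and "s."

```r
library(dplyr)     # part of the tidyverse
library(stringr)   # part of the tidyverse

# Four rows copied from the output above; \u2019 is the curly apostrophe
toy_counts <- tibble(word = c("biden", "border", "biden\u2019s", "joe"),
                     n = c(1450, 854, 525, 491))

# Drop a trailing 's (curly or straight apostrophe), then re-add the counts
merged <- toy_counts %>%
  mutate(word = str_remove(word, "(\u2019|')s$")) %>%
  count(word, wt = n, sort = TRUE)

merged
```

Here count() with wt = n sums the existing counts instead of counting rows, so the two Biden rows collapse into a single row with their combined total.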

Counting topic indicators

Let’s get more formal about counting how often Blackburn refers to one of her favorite topics: Joe Biden. She refers to Biden in a number of ways: “Joe” and “Biden,” of course, sometimes in combination, sometimes separately. But “administration” and “president” can also allude to Biden.

This code will add a variable called Biden to the mydata data frame, then look through the data frame’s Full.Text column and make Biden a 1 if it finds “Biden,” “Joe,” “president” or “administration” and a 0 if it doesn’t. Then, it will report the sum of Biden, which ends up being the total number of posts that allude to Joe Biden.

searchterms <- "Biden|Joe|president|administration"
mydata$Biden <- ifelse(grepl(searchterms,
                               mydata$Full.Text,
                               ignore.case = TRUE),1,0)
sum(mydata$Biden)

## [1] 2085

Goodness … as many as 2,085 of Blackburn’s posts allude to Biden, or about 38 percent of her 5,420 posts over the two years.
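The "as many as" matters, because a pattern like this matches any post containing one of the terms, whoever the post is really about. Here is a small sketch of how the matching behaves. The first two posts come from the data shown earlier (with straight apostrophes substituted), the last two are made up for illustration, and the names posts, broad and narrow are mine.

```r
posts <- c(
  "Joe Biden's policies have funded Iran's attack on Israel.",  # from the data
  "Never back down to terrorists. Stand with Israel.",          # from the data
  "President Trump secured the border.",                        # made up
  "The ADMINISTRATION must act."                                # made up
)

# The search pattern used above: any one of four terms, ignoring case
broad <- grepl("Biden|Joe|president|administration", posts, ignore.case = TRUE)

# A stricter pattern that requires Biden's name
narrow <- grepl("Biden|Joe", posts, ignore.case = TRUE)

sum(broad)                    # number of posts flagged by the broad pattern
round(100 * mean(broad), 1)   # the same thing as a percentage of all posts
sum(narrow)                   # number of posts flagged by the stricter pattern
```

The broad pattern flags the made-up Trump post because it contains "President," while the stricter pattern does not. Which pattern is better depends on whether you would rather risk overcounting or undercounting.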

Here’s the whole script, all in one place:

# Load packages

if (!require("tidyverse")) install.packages("tidyverse")
if (!require("tidytext")) install.packages("tidytext")

library(tidyverse)
library(tidytext)

# Read the data

mydata <- read.csv("https://raw.githubusercontent.com/drkblake/Data/main/BlackburnX.csv")

# Extract individual words to the tidy_text data frame

tidy_text <- mydata %>% 
  unnest_tokens(word,Full.Text) %>% 
  count(word, sort = TRUE)
  
# Delete standard stop words

data("stop_words")
tidy_text <- tidy_text %>%
  anti_join(stop_words)

# Delete custom stop words

my_stopwords <- tibble(word = c("https",
                                "t.co",
                                "rt"))
tidy_text <- tidy_text %>% 
  anti_join(my_stopwords)
  
# Define search terms and count items that include them
# "Biden" terms are used as an example

searchterms <- "Biden|Joe|president|administration"
mydata$Biden <- ifelse(grepl(searchterms,
                               mydata$Full.Text,
                               ignore.case = TRUE),1,0)
sum(mydata$Biden)

Extracting multi-word phrases

One last trick: Adding the token="ngrams",n=2 arguments to the unnest_tokens() function will let you parse the text into two-word phrases rather than individual words. Change n=2 to n=3, and you’ll get three-word phrases, and so on. Here’s an example, with the code set to extract two-word phrases:

# Load packages

if (!require("tidyverse")) install.packages("tidyverse")
if (!require("tidytext")) install.packages("tidytext")

library(tidyverse)
library(tidytext)

# Read the data

mydata <- read.csv("https://raw.githubusercontent.com/drkblake/Data/main/BlackburnX.csv")

# Extract two-word phrases to the tidy_text data frame

tidy_text <- mydata %>% 
  unnest_tokens(word,Full.Text,token="ngrams",n=2) %>% 
  count(word, sort = TRUE)
  
# Delete standard stop words

data("stop_words")
tidy_text <- tidy_text %>%
  anti_join(stop_words)

# Delete custom stop words

my_stopwords <- tibble(word = c("https",
                                "t.co",
                                "rt"))
tidy_text <- tidy_text %>% 
  anti_join(my_stopwords)
  
# Define search terms and count items that include them
# "Biden" terms are used as an example

searchterms <- "Biden|Joe|president|administration"
mydata$Biden <- ifelse(grepl(searchterms,
                               mydata$Full.Text,
                               ignore.case = TRUE),1,0)
sum(mydata$Biden)

## [1] 2085

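And here’s a peek at the tidy_text data frame, this time containing counts for all two-word phrases. Notice that phrases like “https t.co” and “in the” are still there. The stop-word code compares each whole phrase to single stop words, so it finds nothing to delete: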

head(tidy_text, n = 20)

##                    word    n
## 1            https t.co 1859
## 2                in the  429
## 3                of the  408
## 4             the biden  406
## 5                to the  376
## 6             joe biden  318
## 7                on the  284
## 8       communist china  258
## 9                  is a  249
## 10 biden administration  242
## 11              this is  237
## 12              for the  225
## 13      southern border  224
## 14              the u.s  207
## 15           the border  204
## 16              and the  202
## 17               at the  190
## 18              we must  180
## 19            thank you  167
## 20              we need  152
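If you want stop words out of the two-word phrases, one common approach is to split each phrase into its two words, drop any phrase in which either word is a stop word, and then glue the survivors back together. The sketch below does that on two made-up posts. The names toy and bigrams are mine, and separate() and unite() come from tidyr, which is part of the tidyverse.

```r
library(dplyr)     # part of the tidyverse
library(tidyr)     # part of the tidyverse
library(tidytext)

data("stop_words")

# Two made-up posts standing in for mydata
toy <- tibble(Full.Text = c("Secure the southern border.",
                            "The southern border is in crisis."))

bigrams <- toy %>%
  # One row per two-word phrase
  unnest_tokens(bigram, Full.Text, token = "ngrams", n = 2) %>%
  # Split each phrase into its first and second word
  separate(bigram, into = c("word1", "word2"), sep = " ") %>%
  # Keep a phrase only if neither word is a standard stop word
  filter(!word1 %in% stop_words$word,
         !word2 %in% stop_words$word) %>%
  # Rejoin the two words and count each phrase
  unite(bigram, word1, word2, sep = " ") %>%
  count(bigram, sort = TRUE)

bigrams
```

Only "southern border" remains, because every other phrase in these two posts includes "the," "is," or "in." To drop the custom X/Twitter terms as well, you could add them to the filter in the same way.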

Your turn

Nike took a beating on social media after this April 11, 2024, post shared images of two Nike-designed uniforms for U.S. track athletes who will be competing in the 2024 Summer Olympics in Paris, France.

The contrast between the male uniform’s mid-thigh compression shorts and the female uniform’s high-cut leg openings prompted varied - and spicy - denunciations of Nike as sexist. In fairness to Nike, the full line of athlete uniforms includes some 50 pieces of various cuts and styles, so athletes will have more than just these two choices. But Nike picked, or at least didn’t keep someone from picking, these particular two uniforms as examples.

Brandwatch can capture and download all replies to a particular post on X/Twitter. You can use this code to download all 5,479 replies that had been posted as of around 9 p.m. on April 14:

mydata <- read.csv("https://raw.githubusercontent.com/drkblake/Data/main/NikeUniforms.csv")

Use your new term-frequency-analysis skills to identify at least three of the more common themes in user responses to the post. I’ll stop you at about 12:15 p.m. and ask you to summarize what you found.

As you might imagine, many of the posts use crude language and explicit terms. Please express your summary using language that would be appropriate for a general-circulation news outlet.