There’s an old joke about how algebra was created when Satan suggested putting the alphabet in math. R-based text analysis involves doing something like the opposite: Putting math in the alphabet.
The general idea is that the more often a word appears in the text of a document, the likelier it is that the document is about whatever idea that word represents.
Consider Lincoln’s 271-word Gettysburg Address. The most common non-trivial word in the address is “here,” which occurs eight times. Next is “nation” (five times), followed by “dedicated” (four times). Knowing these words and their frequency counts certainly cannot convey the full meaning and nuance of the speech. But it can help you get the idea that Lincoln was saying something about “dedication” and a “here” that was important to the “nation.”
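To make the counting idea concrete, here is a minimal base R sketch. It assumes a one-sentence excerpt from the address, with the punctuation simplified, rather than the full speech, and the object names (excerpt, words, word_counts) are mine, chosen for illustration.

```r
# A short excerpt from the Gettysburg Address (not the full speech)
excerpt <- "we can not dedicate, we can not consecrate, we can not hallow this ground"

# Lowercase the text, split on anything that is not a letter, drop empty strings
words <- strsplit(tolower(excerpt), "[^a-z]+")[[1]]
words <- words[words != ""]

# Tally each word and sort from most to least frequent
word_counts <- sort(table(words), decreasing = TRUE)
print(word_counts)
```

Even in this tiny excerpt, the most frequent words (“we,” “can,” “not”) are the least informative ones, which is a problem the stop-word step later on addresses.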
R can quickly produce such word counts, even for documents many times the length of Lincoln’s famous address. As an introduction to what’s possible, let’s look at everything Tennessee Sen. Marsha Blackburn posted on X, formerly called Twitter, between mid-April 2022 and mid-April 2024. There’s nothing especially significant about that time frame. It’s just a dataset I happen to have access to. I downloaded the posts using Brandwatch, a proprietary - and, I’m afraid, very expensive - social media content monitoring platform available to students and faculty in Middle Tennessee State University’s School of Journalism and Strategic Media.
This code will download Blackburn’s posts from a file in my GitHub space:
mydata <- read.csv("https://raw.githubusercontent.com/drkblake/Data/main/BlackburnX.csv")
I get 5,420 posts, with four columns showing each post’s Date, Url, Author (always “MarshaBlackburn,” which is Blackburn’s X/Twitter handle), and Full.Text.
Here’s a look at the dataset’s first few rows. They aren’t formatted very neatly here, because your web browser is probably wrapping the columns instead of letting you scroll them from left to right. But if you run this code in RStudio, you’ll be able to open the data frame in RStudio’s data frame viewer and see the data in something that looks more like a spreadsheet format.
head(mydata, n = 10)
## Date
## 1 2024-04-14 00:22:43.0
## 2 2024-04-13 23:29:25.0
## 3 2024-04-13 22:47:14.0
## 4 2024-04-13 22:03:00.0
## 5 2024-04-13 21:56:31.0
## 6 2024-04-13 21:50:24.0
## 7 2024-04-13 21:28:16.0
## 8 2024-04-13 20:46:57.0
## 9 2024-04-13 20:11:43.0
## 10 2024-04-13 18:48:21.0
## Url
## 1 http://twitter.com/MarshaBlackburn/statuses/1779304240151703768
## 2 http://twitter.com/MarshaBlackburn/statuses/1779290825597280447
## 3 http://twitter.com/MarshaBlackburn/statuses/1779280212204703960
## 4 http://twitter.com/MarshaBlackburn/statuses/1779269080664428875
## 5 http://twitter.com/MarshaBlackburn/statuses/1779267449805803972
## 6 http://twitter.com/MarshaBlackburn/statuses/1779265908327825679
## 7 http://twitter.com/MarshaBlackburn/statuses/1779260340015972748
## 8 http://twitter.com/MarshaBlackburn/statuses/1779249940344045970
## 9 http://twitter.com/MarshaBlackburn/statuses/1779241072989806912
## 10 http://twitter.com/MarshaBlackburn/statuses/1779220095824175510
## Author
## 1 MarshaBlackburn
## 2 MarshaBlackburn
## 3 MarshaBlackburn
## 4 MarshaBlackburn
## 5 MarshaBlackburn
## 6 MarshaBlackburn
## 7 MarshaBlackburn
## 8 MarshaBlackburn
## 9 MarshaBlackburn
## 10 MarshaBlackburn
## Full.Text
## 1 Pray for Israel 🇮🇱🇺🇸
## 2 Joe Biden’s policies have funded Iran’s attack on Israel.
## 3 Under President Trump, Iran was broke. President Biden gifted them billions of dollars and then naively said “don’t.” “Don’t” is not a foreign policy.
## 4 Never back down to terrorists. Stand with Israel.
## 5 RT @AIPAC Thank you @MarshaBlackburn America must stand with Israel as it confronts this attack from Iran.
## 6 RT @MorganOrtagus Reminder: Biden allowed the UN sanctions on Iran's drones and ballistic missiles to expire less than six months ago. The very same drones and missiles en route to Israel right now. https://t.co/KKIpjBWD0j
## 7 The Iranian regime is no longer relying on its proxy terrorist groups to carry out a war on Israel. They are now directly involved. We MUST stand with Israel.
## 8 Iran has begun launching drone strikes on Israel. @POTUS — we must move quickly and launch aggressive retaliatory strikes on Iran.
## 9 Praying for Israel and the Jewish people.
## 10 Under a new “Ability to Pay” bond policy, the suspect in the murder of a Memphis police officer was released back onto the streets instead of remaining behind bars. This is unconscionable. Stop bail reform and cashless bail.
Analyzing the text of all these tweets will require not only the now-familiar tidyverse package but also a new package called tidytext. This code installs both, if needed, and activates them:
if (!require("tidyverse")) install.packages("tidyverse")
if (!require("tidytext")) install.packages("tidytext")
library(tidyverse)
library(tidytext)
The first step is to get a list of each word or term in the document, and count the number of times each one occurs. This code does both and puts the results into a data frame called tidy_text. It also sorts the words and terms by frequency, in descending order.
tidy_text <- mydata %>%
  unnest_tokens(word, Full.Text) %>%
  count(word, sort = TRUE)
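If you’d like to see what unnest_tokens() and count() do before running them on thousands of posts, here is a small sketch on two posts adapted from the sample rows shown earlier (I dropped the emoji from the first one). The names toy and toy_counts are mine, and I load only dplyr and tidytext here, which I’m assuming is enough for these two functions.

```r
library(dplyr)
library(tidytext)

# Two short posts adapted from the sample rows shown earlier
toy <- tibble(Full.Text = c("Pray for Israel",
                            "Never back down to terrorists. Stand with Israel."))

toy_counts <- toy %>%
  unnest_tokens(word, Full.Text) %>%  # one lowercase word per row, punctuation dropped
  count(word, sort = TRUE)            # tally each word, most frequent first

print(toy_counts)
```

The word “israel” lands at the top with a count of 2, because it is the only word that appears in both posts.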
Open tidy_text, and you’ll see that the most frequently occurring words are “the,” “to,” “and,” “of,” “is,” “in,” and “a.” That doesn’t tell us much; such words are used so often in English that they can’t do much to indicate a particular context or idea. In text analysis, these are called “stop words,” and the usual approach is to delete them.
Likewise, certain terms, like “https,” “t.co,” and “rt,” occur often in X/Twitter content but offer no insights. Again, the usual approach is to delete them. This code handles both tasks:
# Deleting standard stop words
data("stop_words")
tidy_text <- tidy_text %>%
  anti_join(stop_words)
## Joining with `by = join_by(word)`
# Deleting custom stop words
my_stopwords <- tibble(word = c("https",
                                "t.co",
                                "rt"))
tidy_text <- tidy_text %>%
  anti_join(my_stopwords)
## Joining with `by = join_by(word)`
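Here is a small sketch of what the two anti_join() calls do, using a toy table of counts. The words come from the discussion above, but the numbers are made up purely for illustration. I’ve also added by = "word", which is an optional tweak of mine that names the join column explicitly and should keep R from printing the “Joining with” message.

```r
library(dplyr)
library(tidytext)

data("stop_words")

# Illustrative counts only; the numbers are made up
toy_counts <- tibble(word = c("the", "border", "https", "china", "to", "rt"),
                     n    = c(6, 5, 4, 3, 2, 1))

my_stopwords <- tibble(word = c("https", "t.co", "rt"))

cleaned <- toy_counts %>%
  anti_join(stop_words, by = "word") %>%   # drops "the" and "to"
  anti_join(my_stopwords, by = "word")     # drops "https" and "rt"

print(cleaned)
```

Only “border” and “china” survive, because anti_join() keeps the rows whose word has no match in the stop-word table.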
Finally, let’s have a look at what Blackburn posts about most often on X/Twitter. Open the tidy_text data frame, and you’ll see that her favorite terms include some version of President Joe Biden’s name, followed by “border,” “american,” and “china.” Here are the data frame’s first 20 rows:
head(tidy_text, n = 20)
## word n
## 1 biden 1450
## 2 border 854
## 3 biden’s 525
## 4 joe 491
## 5 american 394
## 6 china 388
## 7 communist 373
## 8 marshablackburn 359
## 9 americans 356
## 10 u.s 333
## 11 administration 328
## 12 hamas 296
## 13 president 292
## 14 illegal 275
## 15 democrats 269
## 16 people 267
## 17 tennessee 243
## 18 day 232
## 19 israel 232
## 20 crisis 230
If you know much about Blackburn, these results make sense. These words reference some of her favorite topics. At the other end of the spectrum, she talks about “trump” less often, and barely mentions “abortion.” Again, if you’re paying attention to current national politics, you will understand why.
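Notice that “biden” and “biden’s” are counted as separate terms. If you’d rather treat them as one, a rough base R sketch like the one below could merge them. It uses counts copied from the table above, and it assumes that stripping a trailing ’s is acceptable for your purposes (it would also turn a word like “it’s” into “it”). The names top_words and merged are mine.

```r
# Counts copied from the table above
top_words <- data.frame(word = c("biden", "border", "biden\u2019s", "joe"),
                        n    = c(1450, 854, 525, 491))

# Strip a trailing 's (straight or curly apostrophe) so the possessive form matches "biden"
top_words$word <- sub("('|\u2019)s$", "", top_words$word)

# Re-total the counts for each word and sort in descending order
merged <- aggregate(n ~ word, data = top_words, FUN = sum)
merged <- merged[order(-merged$n), ]
print(merged)
```

With the two forms combined, “biden” accounts for 1,450 + 525 = 1,975 occurrences.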
Let’s get more formal about counting how often Blackburn refers to one of her favorite topics: Joe Biden. She refers to Biden in a number of ways: “Joe” and “Biden,” of course, sometimes in combination, sometimes separately. But “administration” and “president” can also allude to Biden.
This code will add a variable called Biden to the mydata data frame, then look through the data frame’s Full.Text column and make Biden a 1 if it finds “Biden,” “Joe,” “president” or “administration” and a 0 if it doesn’t. Then, it will report the sum of Biden, which ends up being the total number of posts that allude to Joe Biden.
searchterms <- "Biden|Joe|president|administration"
mydata$Biden <- ifelse(grepl(searchterms,
                             mydata$Full.Text,
                             ignore.case = TRUE), 1, 0)
sum(mydata$Biden)
## [1] 2085
Goodness … as many as 2,085 of Blackburn’s 5,420 posts, or about 38 percent, alluded to Biden over the last two years. Treat that figure as an upper bound: a term like “president” also matches posts that mention other presidents.
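To see how the flagging works, and why the count is best read as a ceiling, here is a sketch on four posts adapted from the sample rows shown earlier. I kept only the first sentence of the third post, so that it mentions President Trump but not Biden. The names posts and flag are mine.

```r
# Four posts adapted from the sample rows shown earlier
posts <- c("Pray for Israel",
           "Joe Biden's policies have funded Iran's attack on Israel.",
           "Under President Trump, Iran was broke.",   # first sentence only
           "Never back down to terrorists. Stand with Israel.")

searchterms <- "Biden|Joe|president|administration"

# 1 if any search term appears in the post (ignoring case), 0 otherwise
flag <- ifelse(grepl(searchterms, posts, ignore.case = TRUE), 1, 0)

print(flag)
sum(flag)    # number of flagged posts
mean(flag)   # share of flagged posts
```

The third post gets flagged even though it never mentions Biden, because “President” matches the “president” search term. Taking the mean of a 0/1 variable is also a handy way to get the share of flagged posts directly.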
Here’s the whole script, all in one place:
# Load packages
if (!require("tidyverse")) install.packages("tidyverse")
if (!require("tidytext")) install.packages("tidytext")
library(tidyverse)
library(tidytext)
# Read the data
mydata <- read.csv("https://raw.githubusercontent.com/drkblake/Data/main/BlackburnX.csv")
# Extract individual words to a data frame called tidy_text
tidy_text <- mydata %>%
  unnest_tokens(word, Full.Text) %>%
  count(word, sort = TRUE)
# Delete standard stop words
data("stop_words")
tidy_text <- tidy_text %>%
  anti_join(stop_words)
# Delete custom stop words
my_stopwords <- tibble(word = c("https",
                                "t.co",
                                "rt"))
tidy_text <- tidy_text %>%
  anti_join(my_stopwords)
# Define search terms and count items that include them
# "Biden" terms are used as an example
searchterms <- "Biden|Joe|president|administration"
mydata$Biden <- ifelse(grepl(searchterms,
                             mydata$Full.Text,
                             ignore.case = TRUE), 1, 0)
sum(mydata$Biden)
One last trick: Adding the token="ngrams",n=2 arguments to the unnest_tokens() function will let you parse the text into two-word phrases rather than individual words. Change n=2 to n=3, and you’ll get three-word phrases, and so on. Here’s an example, with the code set to extract two-word phrases:
# Load packages
if (!require("tidyverse")) install.packages("tidyverse")
if (!require("tidytext")) install.packages("tidytext")
library(tidyverse)
library(tidytext)
# Read the data
mydata <- read.csv("https://raw.githubusercontent.com/drkblake/Data/main/BlackburnX.csv")
# Extract two-word phrases to a data frame called tidy_text
tidy_text <- mydata %>%
  unnest_tokens(word, Full.Text, token = "ngrams", n = 2) %>%
  count(word, sort = TRUE)
# Delete standard stop words
# (the list holds single words, so two-word phrases are left unchanged)
data("stop_words")
tidy_text <- tidy_text %>%
  anti_join(stop_words)
# Delete custom stop words (same caveat for two-word phrases)
my_stopwords <- tibble(word = c("https",
                                "t.co",
                                "rt"))
tidy_text <- tidy_text %>%
  anti_join(my_stopwords)
# Define search terms and count items that include them
# "Biden" terms are used as an example
searchterms <- "Biden|Joe|president|administration"
mydata$Biden <- ifelse(grepl(searchterms,
                             mydata$Full.Text,
                             ignore.case = TRUE), 1, 0)
sum(mydata$Biden)
## [1] 2085
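And here’s a peek at the tidy_text data frame, this time containing counts for all two-word phrases. Notice that phrases like “https t.co” and “in the” survive. The stop-word lists contain single words, so the anti_join() steps find nothing to match in a column of two-word phrases: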
head(tidy_text, n = 20)
## word n
## 1 https t.co 1859
## 2 in the 429
## 3 of the 408
## 4 the biden 406
## 5 to the 376
## 6 joe biden 318
## 7 on the 284
## 8 communist china 258
## 9 is a 249
## 10 biden administration 242
## 11 this is 237
## 12 for the 225
## 13 southern border 224
## 14 the u.s 207
## 15 the border 204
## 16 and the 202
## 17 at the 190
## 18 we must 180
## 19 thank you 167
## 20 we need 152
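Stop-word phrases like “in the” and “of the” crowd the top of that list. One common workaround, sketched below, is to split each phrase into its two words and keep only the phrases in which neither word is a stop word. This is not part of the script above. It assumes the tidyr package’s separate() function, and the two toy posts are sentences I invented for illustration, as are the names toy and bigram_counts.

```r
library(dplyr)
library(tidyr)
library(tidytext)

data("stop_words")

# Two invented posts, for illustration only
toy <- tibble(Full.Text = c("Stand up to communist China",
                            "Communist China is a threat"))

bigram_counts <- toy %>%
  unnest_tokens(word, Full.Text, token = "ngrams", n = 2) %>%  # two-word phrases
  count(word, sort = TRUE) %>%                                 # tally each phrase
  separate(word, into = c("word1", "word2"), sep = " ") %>%    # split into two columns
  filter(!word1 %in% stop_words$word,                          # drop a phrase if either
         !word2 %in% stop_words$word)                          # word is a stop word

print(bigram_counts)
```

Only “communist china” remains, with a count of 2. Every other phrase in the toy posts contains at least one stop word, such as “up,” “to,” “is,” or “a.”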
Nike took a beating on social media after this April 11, 2024, post shared images of two Nike-designed uniforms for U.S. track athletes who will be competing in the 2024 Summer Olympics in Paris, France.
The contrast between the male uniform’s mid-thigh compression shorts and the female uniform’s high-cut leg openings prompted varied - and spicy - denunciations of Nike as sexist. In fairness to Nike, the full line of athlete uniforms includes some 50 pieces of various cuts and styles, so athletes will have more than just these two choices. But Nike picked, or at least didn’t keep someone from picking, these particular two uniforms as examples.
Brandwatch can capture and download all replies to a particular post on X/Twitter. You can use this code to download all 5,479 replies that had been posted as of around 9 p.m. on April 14:
mydata <- read.csv("https://raw.githubusercontent.com/drkblake/Data/main/NikeUniforms.csv")
Use your new term-frequency-analysis skills to identify at least three of the more common themes in user responses to the post. I’ll stop you at about 12:15 p.m. and ask you to summarize what you found.
As you might imagine, many of the posts use crude language and explicit terms. Please express your summary using language that would be appropriate for a general-circulation news outlet.