So far, we have dealt with data sets that contain short text strings or sentences. But text content usually arrives in non-tabular formats - think literature, speeches, lyrics, social media posts. To analyze these formats, we must first convert their text into tabular data.
Lucky for us, Julia Silge and David Robinson created the tidytext package, which not only converts our text to a format we can analyze, but also builds the table using a one-token-per-row format.
Token? A token is whatever digestible portion of text we want to analyze: characters, words, bigrams, sentences, paragraphs, pages, books. For our purposes, we'll usually create tables formatted as one word per row, but the package allows for other interesting possibilities.
library(tidyverse)
# Install these two once; after that, library() is all you need
install.packages('tidytext')
install.packages('textdata')
library(tidytext)
library(textdata)
Text analysis can often point to a writer or speaker's underlying desires and motivations. While rhetorical flourish and satire can undermine specific observations, much can still be gleaned from a text's word choices. What could one say, for instance, about the most frequent words and phrases in a novel, a speech, or a celebrity's social media feed?
Additionally, words can be classified using lexicons - collections of words that share some quality. There are existing lexicons for sentiment, which assign an emotional value to words. Folks can also make their own lexicons: a list of 'swear words' to see how 'dirty' a text is, or a list of 'love words' to see how much a set of lyrics is about relationships.
All of this is largely quantitative, of course - we cannot analyze the words of a song to determine with certainty the intent of the lyricist. But as with many analyses, the more data we have, the more accurate our insights become.
So, rather than analyzing, say, a song’s lyrics, why not analyze all of the song lyrics in popular music? Instead of analyzing the most recent Inaugural Address of the President, what if we looked at all of the Inaugural speeches ever, and tried to find differences based on political affiliation? What if, instead of analyzing a handful of Tweets by a celebrity or politician, we instead analyzed their entire Twitter feed since it began?
Like most tidyverse packages, tidytext combines a number of functions and uses that previously could only be performed with multiple, potentially incompatible packages authored independently of one another.
What can we do with it?
Let's start by creating some data to work with: a variable that contains a text string.
sentence <- "The quick brown fox jumps over the lazy dog."
sentence
## [1] "The quick brown fox jumps over the lazy dog."
In order to analyze a text, we need to break it out of its prose format and into a tidy format, which means one observation per row. In other words, we need to break up our sentence into words: one word per row, in a dataframe. (Again, we could also break it up by letter, sentence, paragraph or page, but let’s start with words.)
The tidytext package does this with a function called unnest_tokens() - a token being the element we are breaking the text down into (letter, word, sentence, etc.). To break our sentence into words, we'd start by converting it to a data frame, since right now it's just a character string. as.data.frame() is a base R function:
as.data.frame(sentence) -> sentence_df
And now we can extract the words from ‘sentence_df’ like so:
sentence_df %>%
unnest_tokens(word, sentence)
## word
## 1 the
## 2 quick
## 3 brown
## 4 fox
## 5 jumps
## 6 over
## 7 the
## 8 lazy
## 9 dog
Great! You can see that our sentence has been broken into a column with one word in each row.
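Words are just the default, though. As a taste of those other token possibilities, here's a minimal sketch of breaking the same sentence into bigrams (overlapping two-word pairs) via the token and n arguments of unnest_tokens():

sentence_df %>%
unnest_tokens(bigram, sentence, token = "ngrams", n = 2)

Now, back to word tokens - let's try a much bigger dataset: all of the lyrics from the Billboard Top 100, going back to the 1960s: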
install.packages('billboard')
library(billboard)
# The package ships several datasets; wiki_hot_100s holds the chart data
View(wiki_hot_100s)
This has a lot of data about when songs were on the charts, but where are the lyrics? In a separate dataset called lyrics, according to the documentation:
data(lyrics)
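A quick glimpse() should confirm the shape - one row per song, with the full text in a lyrics column alongside metadata like the title and year:

glimpse(lyrics)

Now we can unnest the lyrics column into words: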
lyrics %>%
unnest_tokens(word, lyrics) -> billboard_lyrics
By doing this, we've drastically expanded the size of the Billboard data, since every individual word of every song now gets its own row.
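To see just how much it grew, compare the row counts before and after unnesting (the exact numbers depend on your version of the billboard package):

nrow(lyrics) # one row per song
nrow(billboard_lyrics) # one row per word

Now let's count the words and see which are most common: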
billboard_lyrics %>%
count(word, sort = TRUE) %>%
head(10)
## word n
## 1 you 55899
## 2 i 48484
## 3 the 43658
## 4 to 30490
## 5 and 27618
## 6 me 26071
## 7 a 24564
## 8 it 19684
## 9 my 19405
## 10 in 15115
Unsurprisingly, the most common words are pronouns, articles, prepositions and conjunctions - filler words referred to as 'stop words.' The tidytext package ships a ready-made stop_words data frame of them.
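The lexicon itself is just a two-column data frame - each word, plus the source list it came from. Peek at it with:

head(stop_words)

Let's remove these stop words with anti_join() and re-count: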
billboard_lyrics %>%
anti_join(stop_words) %>%
count(word, sort = TRUE) %>%
head(10)
## Joining, by = "word"
## word n
## 1 love 14180
## 2 baby 8503
## 3 chorus 7009
## 4 yeah 6163
## 5 verse 5813
## 6 time 4852
## 7 girl 3792
## 8 gonna 3717
## 9 wanna 3676
## 10 night 3204
This code says: 'take the billboard lyrics data frame, remove any instance of a word that also appears in the stop_words list provided by the tidytext package, and then count the remaining words.' The results are much more interesting - but you'll also notice some errors: the words 'chorus' and 'verse' are not actual lyrics in these songs, so let's filter them out, using '!' to negate the condition inside filter():
billboard_lyrics %>%
anti_join(stop_words) %>%
filter(!word %in% c("verse", "chorus")) %>%
count(word, sort = TRUE) %>%
head(10)
## Joining, by = "word"
## word n
## 1 love 14180
## 2 baby 8503
## 3 yeah 6163
## 4 time 4852
## 5 girl 3792
## 6 gonna 3717
## 7 wanna 3676
## 8 night 3204
## 9 feel 3080
## 10 1 2958
Love, baby - yeah.
We could plot this:
billboard_lyrics %>%
anti_join(stop_words) %>%
filter(!word %in% c("verse", "chorus", "1", "2","3", "4")) %>%
count(word, sort = TRUE) %>%
head(10) %>%
ggplot(aes(reorder(word, n),n)) + geom_col() +
coord_flip()
## Joining, by = "word"
I prefer to avoid making users read text vertically, so get used to a lot of coord_flip() coming up. I’m also limiting our data to 10 results - more would be nice, but at a certain point creating a list or table would make more sense.
Our quick analysis shows that most popular songs are about love - specifically, love narrated from a straight male perspective (baby and girl).
[Side note: what if we wanted to plot the frequency of these words over time? Unfortunately, the 'year' column supplied here is a character string - it's not (currently) recognized as a date - so we can't plot over time without reformatting it. As we'll learn soon, working with dates can be very challenging, so we'll revisit this concept later.]
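That said, because the years here are simple four-digit strings, a rough preview is possible with a plain as.numeric() conversion - a minimal sketch tracking the word 'love' by year:

billboard_lyrics %>%
filter(word == "love") %>%
count(year) %>%
mutate(year = as.numeric(year)) %>%
ggplot(aes(year, n)) + geom_line()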
Let’s create a lexicon of ‘gender words’ and see if this male love theory rings true:
female_words <- c("girl", "baby", "honey", "sister", "babe", "woman", "lady", "gal", "chick", "mama", "she")
billboard_lyrics %>%
anti_join(stop_words) %>%
filter(word %in% female_words ) %>%
count(word, sort = TRUE)
## Joining, by = "word"
## word n
## 1 baby 8503
## 2 girl 3792
## 3 woman 775
## 4 babe 608
## 5 lady 577
## 6 mama 505
## 7 honey 430
## 8 sister 104
## 9 chick 89
## 10 gal 51
male_words <- c("boy", "fella", "man", "dude", "gent", "brother", "he", "sir", "bro", "chap", "lad", "bloke", "gentleman" )
billboard_lyrics %>%
anti_join(stop_words) %>%
filter(word %in% male_words ) %>%
count(word, sort = TRUE)
## Joining, by = "word"
## word n
## 1 boy 1088
## 2 brother 184
## 3 sir 72
## 4 dude 33
## 5 fella 18
## 6 gentleman 15
## 7 bro 7
## 8 bloke 3
## 9 lad 1
I believe we have shown an imbalance of narratives. (Notice that 'he', 'she', and 'man' never appear in these counts: they are in the stop_words lexicon, so anti_join() removed them.) And while my lists of female and male words are hardly complete, I doubt any adjustment to their content would change the overall result.
Again, it would be more interesting to see whether this trend has changed over time - is there a more nuanced approach to love in modern popular music? But that gets complicated, since reformatting dates is tricky. Skip to chapter [x] if you can't wait.
How do different political parties express their vision of the country? Let's install the quanteda package, which includes the text of every American President's Inaugural address, going back to George Washington.
install.packages('quanteda')
library(quanteda)
## Package version: 3.2.1
## Unicode version: 14.0
## ICU version: 70.1
## Parallel computing: 16 of 16 threads used.
## See https://quanteda.io for tutorials and examples.
library(ggthemes) # install.packages('ggthemes') if you don't have it yet
Load the inaugural data. It arrives as a quanteda corpus object - a strange format at first glance - so we use tidy() (tidytext supplies a tidier for quanteda corpora) to convert it into a data frame:
tidy(data_corpus_inaugural) -> inaugural_df
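Each row of inaugural_df is one speech; glimpse() should show the full text plus metadata columns like Year, President, and Party:

glimpse(inaugural_df)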
# Let's look only at speeches since 1960:
inaugural_df %>%
filter(Year > 1960) %>%
unnest_tokens(word, text) -> inaugural_text
#Most frequent words, by party:
inaugural_text %>%
anti_join(stop_words) %>%
count(word, President,Party, sort = TRUE) %>%
arrange(desc(n)) %>%
head(30) %>%
ggplot(aes(reorder(word,n), n,fill = Party)) +
geom_col() +
coord_flip() +
facet_wrap(~Party, scales = "free_y")
## Joining, by = "word"
#Rather than splitting the graph into two panels, let's try a stacked bar chart with all of the data together:
inaugural_text %>%
anti_join(stop_words) %>%
group_by(Party) %>%
count(word, sort = TRUE) %>%
head(30) %>%
ggplot(aes(reorder(word,n),n, fill = Party)) + geom_col() +
coord_flip() +
theme_economist()
## Joining, by = "word"
Let’s look for specific words:
inaugural_text %>%
filter(word %in% c("war", "terrorism", "terror", "military", "battle", "fight", "enemy")) %>%
count(word, Party, sort = TRUE) %>%
arrange(desc(word))
## # A tibble: 13 × 3
## word Party n
## <chr> <fct> <int>
## 1 war Democratic 21
## 2 war Republican 12
## 3 terrorism Republican 2
## 4 terrorism Democratic 1
## 5 terror Democratic 4
## 6 terror Republican 1
## 7 military Republican 3
## 8 fight Democratic 4
## 9 fight Republican 4
## 10 enemy Democratic 4
## 11 enemy Republican 1
## 12 battle Democratic 3
## 13 battle Republican 3
inaugural_text %>%
filter(word %in% c("drugs", "economy", "union", "change", "advance", "protest", "love")) %>%
count(word, Party, sort = TRUE) %>%
arrange(desc(word))
## # A tibble: 12 × 3
## word Party n
## <chr> <fct> <int>
## 1 union Democratic 16
## 2 union Republican 8
## 3 love Republican 15
## 4 love Democratic 7
## 5 economy Republican 12
## 6 economy Democratic 9
## 7 drugs Republican 3
## 8 drugs Democratic 2
## 9 change Democratic 29
## 10 change Republican 2
## 11 advance Democratic 6
## 12 advance Republican 4
## Sentiment
As mentioned earlier, there are existing lexicons for sentiment, or measuring emotion. One comes built into the 'tidytext' package, and the other two are downloaded via the 'textdata' package. They are:
bing: Classifies each word as simply 'positive' or 'negative'.
AFINN: Ascribes a numeric value for sentiment, ranging from -5 (very negative) to 5 (very positive), with 0 as the neutral midpoint.
nrc: Tags words with emotion categories - joy, anger, fear, trust, sadness and so on - in addition to 'positive' and 'negative'.
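You can inspect any of them with get_sentiments(); the first call to a 'textdata' lexicon will ask permission to download it:

get_sentiments('afinn') %>% head()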
None of these sentiment lexicons are perfect - they all misunderstand irony, satire, slang and context, and tend towards more traditional definitions of words than newer usages. But again, the more text data we process, the less these styles of prose will affect our results.
To calculate the sentiments of a body of text, we have to merge the sentiment lexicon to the text data. We do this by performing an inner_join(), which translates as ‘only keep words that appear in both the original text and the sentiment lexicon.’ So we’re going to lose a lot of words that don’t have a sentiment score, like proper nouns, place names, dates, and the like. That’s fine, but note the word counts before and after merging with a sentiment lexicon.
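Here's a quick way to see that loss - a minimal sketch counting how many distinct words survive an inner_join() with the 'bing' lexicon:

inaugural_text %>% distinct(word) %>% nrow()
inaugural_text %>%
distinct(word) %>%
inner_join(get_sentiments('bing')) %>%
nrow()

Now let's tally sentiment by party: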
inaugural_text %>%
group_by(Party) %>%
count(word, sort = TRUE) %>%
inner_join(get_sentiments('bing')) -> inaugural_bing
## Joining, by = "word"
inaugural_bing %>%
group_by(Party) %>%
# count(sentiment) tallies how many distinct words of each sentiment appear;
# add wt = n to weight by how often each word was actually spoken
count(sentiment, sort = TRUE) %>%
ggplot(aes(reorder(sentiment, n), n, fill = Party)) + geom_col() +
coord_flip()
That’s a very simple chart - let’s use a different lexicon for better results.
inaugural_text %>%
group_by(Party) %>%
count(word, sort = TRUE) %>%
inner_join(get_sentiments('nrc')) -> inaugural_nrc
## Joining, by = "word"
inaugural_nrc %>%
group_by(Party) %>%
count(sentiment, sort = TRUE) %>%
ggplot(aes(reorder(sentiment, n),n, fill = Party)) + geom_col() +
coord_flip()
Finally, the ‘AFINN’ lexicon creates a ‘value’ column with the numeric ‘score’ of each word:
inaugural_text %>%
group_by(Party) %>%
count(word, sort = TRUE) %>%
inner_join(get_sentiments('afinn')) -> inaugural_afinn
## Joining, by = "word"
inaugural_afinn %>%
group_by(Party) %>%
count(value, sort = TRUE) %>%
ggplot(aes(value,n ,fill = Party)) + geom_col()
None of these visualizations are very informative - perhaps both political parties give similar inaugural speeches. Or perhaps we can see differences at the margins: which strongly negative words (an AFINN value of -3) does each party use?
inaugural_afinn %>%
filter(value == -3) %>%
head(20) %>%
ggplot(aes(word, n, fill = Party)) + geom_col() +
coord_flip()
The colors for each party are a little confusing - ggplot() chooses them automatically, since we didn’t specify any. Let’s do so - and also use a theme to make it look better:
inaugural_afinn %>%
filter(value == -3) %>%
head(20) %>%
ggplot(aes(word, n, fill = Party)) + geom_col() +
coord_flip() -> inaug
inaug +
scale_fill_manual(values = c("blue", "red")) +
theme_economist()
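One design note: scale_fill_manual() assigns colors in the order of the factor levels present in the data - alphabetical here - so 'blue' maps to Democratic and 'red' to Republican.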