So far, we have dealt with data sets that contain short text strings or sentences. But text content usually arrives in non-tabular formats - think literature, speeches, lyrics, social media posts. To analyze these formats, we must first convert their text into tabular data.
Lucky for us, Julia Silge and David Robinson created the tidytext package, which not only converts our text to a format we can analyze, but also builds the table using a one-token-per-row format.
Token? A token is whatever digestible portion of text we want to analyze: characters, words, bigrams, sentences, paragraphs, pages, books. For our purposes, we'll usually create tables formatted as one word per row, but the package allows for other interesting possibilities.
library(tidyverse)
# Install these two once; after that, library() is all you need
install.packages('tidytext')
install.packages('textdata')
library(tidytext)
library(textdata)
Text analysis can often point to a writer or speaker's underlying desires and motivations. While rhetorical flourish and satire can undermine specific observations, much can still be gleaned from a text's word choices. What could one say, for instance, about the most frequent words and phrases in a novel, a speech, or a celebrity's social media feed?
Additionally, words can be classified using lexicons - collections of words that share some quality. There are existing lexicons for sentiment, which assign an emotional value to words. Folks can also make their own lexicons: a list of 'swear words' to see how 'dirty' a text is, or a list of 'love words' to see how much a set of lyrics is about relationships.
All of this is largely quantitative, of course - we cannot analyze the words of a song to determine with certainty the intent of the lyricist. But as with many analyses, the more data we have, the more accurate our insights become.
So, rather than analyzing, say, a song’s lyrics, why not analyze all of the song lyrics in popular music? Instead of analyzing the most recent Inaugural Address of the President, what if we looked at all of the Inaugural speeches ever, and tried to find differences based on political affiliation? What if, instead of analyzing a handful of Tweets by a celebrity or politician, we instead analyzed their entire Twitter feed since it began?
Like most tidyverse packages, tidytext combines a number of functions and uses that previously could only be performed with multiple, potentially incompatible packages authored independently of one another.
What can we do with it?
Let's start by creating some data to work with: a variable that contains a text string.
sentence <- "The quick brown fox jumps over the lazy dog."
sentence
## [1] "The quick brown fox jumps over the lazy dog."
In order to analyze a text, we need to break it out of its prose format and into a tidy format, which means one observation per row. In other words, we need to break up our sentence into words: one word per row, in a dataframe. (Again, we could also break it up by letter, sentence, paragraph or page, but let’s start with words.)
The tidytext package does this with a function called unnest_tokens() - a token being the element we are breaking the text down into (letter, word, sentence, etc.). To break our sentence into words, we'd start by converting it to a data frame, since right now it's just a character string. as.data.frame() is a base R function:
as.data.frame(sentence) -> sentence_df
And now we can extract the words from ‘sentence_df’ like so:
sentence_df %>%
unnest_tokens(word, sentence)
## word
## 1 the
## 2 quick
## 3 brown
## 4 fox
## 5 jumps
## 6 over
## 7 the
## 8 lazy
## 9 dog
Great! You can see that our sentence has been broken into a column with one word in each row.
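Words are just the default, though. As a taste of those other token possibilities, here's a minimal sketch of breaking the same sentence into bigrams (overlapping two-word pairs) via the token and n arguments of unnest_tokens():

sentence_df %>%
unnest_tokens(bigram, sentence, token = "ngrams", n = 2)

Now, back to word tokens - let's try a much bigger dataset: all of the lyrics from the Billboard Top 100, going back to the 1960s: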
install.packages('billboard')
library(billboard)
# The package ships several datasets; wiki_hot_100s holds the chart data
View(wiki_hot_100s)
This has a lot of data about when songs were on the charts, but where are the lyrics? In a separate dataset called lyrics, according to the documentation:
data(lyrics)
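A quick glimpse() should confirm the shape - one row per song, with the full text in a lyrics column alongside metadata like the title and year:

glimpse(lyrics)

Now we can unnest the lyrics column into words: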
lyrics %>%
unnest_tokens(word, lyrics) -> billboard_lyrics
By doing this, we've drastically expanded the size of the Billboard data, since every individual word of every song now gets its own row.
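To see just how much it grew, compare the row counts before and after unnesting (the exact numbers depend on your version of the billboard package):

nrow(lyrics) # one row per song
nrow(billboard_lyrics) # one row per word

Now let's count the words and see which are most common: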
billboard_lyrics %>%
count(word, sort = TRUE) %>%
head(10)
## word n
## 1 you 55899
## 2 i 48484
## 3 the 43658
## 4 to 30490
## 5 and 27618
## 6 me 26071
## 7 a 24564
## 8 it 19684
## 9 my 19405
## 10 in 15115
Unsurprisingly, the most common words are pronouns, articles, prepositions and conjunctions - filler words referred to as 'stop words.' The tidytext package ships a ready-made stop_words data frame of them.
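The lexicon itself is just a two-column data frame - each word, plus the source list it came from. Peek at it with:

head(stop_words)

Let's remove these stop words with anti_join() and re-count: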
billboard_lyrics %>%
anti_join(stop_words) %>%
count(word, sort = TRUE) %>%
head(10)
## Joining, by = "word"
## word n
## 1 love 14180
## 2 baby 8503
## 3 chorus 7009
## 4 yeah 6163
## 5 verse 5813
## 6 time 4852
## 7 girl 3792
## 8 gonna 3717
## 9 wanna 3676
## 10 night 3204
This code says: 'take the billboard lyrics data frame, remove any instance of a word that also appears in the stop_words list provided by the tidytext package, and then count the remaining words.' The results are much more interesting - but you'll also notice some errors: the words 'chorus' and 'verse' are not actual lyrics in these songs, so let's filter them out, using '!' to negate the condition inside filter():
billboard_lyrics %>%
anti_join(stop_words) %>%
filter(!word %in% c("verse", "chorus")) %>%
count(word, sort = TRUE) %>%
head(10)
## Joining, by = "word"
## word n
## 1 love 14180
## 2 baby 8503
## 3 yeah 6163
## 4 time 4852
## 5 girl 3792
## 6 gonna 3717
## 7 wanna 3676
## 8 night 3204
## 9 feel 3080
## 10 1 2958
Love, baby - yeah.
We could plot this:
billboard_lyrics %>%
anti_join(stop_words) %>%
filter(!word %in% c("verse", "chorus", "1", "2","3", "4")) %>%
count(word, sort = TRUE) %>%
head(10) %>%
ggplot(aes(reorder(word, n),n)) + geom_col() +
coord_flip()
## Joining, by = "word"
I prefer to avoid making users read text vertically, so get used to a lot of coord_flip() coming up. I’m also limiting our data to 10 results - more would be nice, but at a certain point creating a list or table would make more sense.
Our quick analysis shows that most popular songs are about love - specifically, love narrated from a straight male perspective (baby and girl).
[Side note: what if we wanted to plot the frequency of these words over time? Unfortunately, the 'year' column supplied here is a character string - it's not (currently) recognized as a date - so we can't plot over time without reformatting it. As we'll learn soon, working with dates can be very challenging, so we'll revisit this concept later.]
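That said, because the years here are simple four-digit strings, a rough preview is possible with a plain as.numeric() conversion - a minimal sketch tracking the word 'love' by year:

billboard_lyrics %>%
filter(word == "love") %>%
count(year) %>%
mutate(year = as.numeric(year)) %>%
ggplot(aes(year, n)) + geom_line()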
Let’s create a lexicon of ‘gender words’ and see if this male love theory rings true:
female_words <- c("girl", "baby", "honey", "sister", "babe", "woman", "lady", "gal", "chick", "mama", "she")
billboard_lyrics %>%
anti_join(stop_words) %>%
filter(word %in% female_words ) %>%
count(word, sort = TRUE)
## Joining, by = "word"
## word n
## 1 baby 8503
## 2 girl 3792
## 3 woman 775
## 4 babe 608
## 5 lady 577
## 6 mama 505
## 7 honey 430
## 8 sister 104
## 9 chick 89
## 10 gal 51
male_words <- c("boy", "fella", "man", "dude", "gent", "brother", "he", "sir", "bro", "chap", "lad", "bloke", "gentleman" )
billboard_lyrics %>%
anti_join(stop_words) %>%
filter(word %in% male_words ) %>%
count(word, sort = TRUE)
## Joining, by = "word"
## word n
## 1 boy 1088
## 2 brother 184
## 3 sir 72
## 4 dude 33
## 5 fella 18
## 6 gentleman 15
## 7 bro 7
## 8 bloke 3
## 9 lad 1
I believe we have shown an imbalance of narratives. (Notice that 'he', 'she', and 'man' never appear in these counts: they are in the stop_words lexicon, so anti_join() removed them.) And while my lists of female and male words are hardly complete, I doubt any adjustment to their content would change the overall result.
Again, it would be more interesting to see whether this trend has changed over time - is there a more nuanced approach to love in modern popular music? But that gets complicated, since reformatting dates is tricky. Skip to chapter [x] if you can't wait.
How do different political parties express their vision of the country? Let's install the quanteda package, which includes the text of every American President's Inaugural address, going back to George Washington.
install.packages('quanteda')
library(quanteda)
## Package version: 3.2.1
## Unicode version: 14.0
## ICU version: 70.1
## Parallel computing: 16 of 16 threads used.
## See https://quanteda.io for tutorials and examples.
library(ggthemes) # install.packages('ggthemes') if you don't have it yet
Load the inaugural data. It arrives as a quanteda corpus object - a strange format at first glance - so we use tidy() (tidytext supplies a tidier for quanteda corpora) to convert it into a data frame:
tidy(data_corpus_inaugural) -> inaugural_df
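Each row of inaugural_df is one speech; glimpse() should show the full text plus metadata columns like Year, President, and Party:

glimpse(inaugural_df)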
# Let's look only at speeches since 1960:
inaugural_df %>%
filter(Year > 1960) %>%
unnest_tokens(word, text) -> inaugural_text
#Most frequent words, by party:
inaugural_text %>%
anti_join(stop_words) %>%
count(word, President,Party, sort = TRUE) %>%
arrange(desc(n)) %>%
head(30) %>%
ggplot(aes(reorder(word,n), n,fill = Party)) +
geom_col() +
coord_flip() +
facet_wrap(~Party, scales = "free_y")
## Joining, by = "word"
#Rather than splitting the graph into two panels, let's try a stacked bar chart with all of the data together:
inaugural_text %>%
anti_join(stop_words) %>%
group_by(Party) %>%
count(word, sort = TRUE) %>%
head(30) %>%
ggplot(aes(reorder(word,n),n, fill = Party)) + geom_col() +
coord_flip() +
theme_economist()
## Joining, by = "word"
Let’s look for specific words:
inaugural_text %>%
filter(word %in% c("war", "terrorism", "terror", "military", "battle", "fight", "enemy")) %>%
count(word, Party, sort = TRUE) %>%
arrange(desc(word))
## # A tibble: 13 × 3
## word Party n
## <chr> <fct> <int>
## 1 war Democratic 21
## 2 war Republican 12
## 3 terrorism Republican 2
## 4 terrorism Democratic 1
## 5 terror Democratic 4
## 6 terror Republican 1
## 7 military Republican 3
## 8 fight Democratic 4
## 9 fight Republican 4
## 10 enemy Democratic 4
## 11 enemy Republican 1
## 12 battle Democratic 3
## 13 battle Republican 3
inaugural_text %>%
filter(word %in% c("drugs", "economy", "union", "change", "advance", "protest", "love")) %>%
count(word, Party, sort = TRUE) %>%
arrange(desc(word))
## # A tibble: 12 × 3
## word Party n
## <chr> <fct> <int>
## 1 union Democratic 16
## 2 union Republican 8
## 3 love Republican 15
## 4 love Democratic 7
## 5 economy Republican 12
## 6 economy Democratic 9
## 7 drugs Republican 3
## 8 drugs Democratic 2
## 9 change Democratic 29
## 10 change Republican 2
## 11 advance Democratic 6
## 12 advance Republican 4
## Sentiment
As mentioned earlier, there are existing lexicons for sentiment, or measuring emotion. One comes built into the 'tidytext' package, and the other two are downloaded via the 'textdata' package. They are:
bing: Classifies each word as simply 'positive' or 'negative'.
AFINN: Ascribes a numeric value for sentiment, ranging from -5 (very negative) to 5 (very positive), with 0 as the neutral midpoint.
nrc: Tags words with emotion categories - joy, anger, fear, trust, sadness and so on - in addition to 'positive' and 'negative'.
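You can inspect any of them with get_sentiments(); the first call to a 'textdata' lexicon will ask permission to download it:

get_sentiments('afinn') %>% head()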
None of these sentiment lexicons are perfect - they all misunderstand irony, satire, slang and context, and tend towards more traditional definitions of words than newer usages. But again, the more text data we process, the less these styles of prose will affect our results.
To calculate the sentiments of a body of text, we have to merge the sentiment lexicon to the text data. We do this by performing an inner_join(), which translates as ‘only keep words that appear in both the original text and the sentiment lexicon.’ So we’re going to lose a lot of words that don’t have a sentiment score, like proper nouns, place names, dates, and the like. That’s fine, but note the word counts before and after merging with a sentiment lexicon.
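Here's a quick way to see that loss - a minimal sketch counting how many distinct words survive an inner_join() with the 'bing' lexicon:

inaugural_text %>% distinct(word) %>% nrow()
inaugural_text %>%
distinct(word) %>%
inner_join(get_sentiments('bing')) %>%
nrow()

Now let's tally sentiment by party: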
inaugural_text %>%
group_by(Party) %>%
count(word, sort = TRUE) %>%
inner_join(get_sentiments('bing')) -> inaugural_bing
## Joining, by = "word"
inaugural_bing %>%
group_by(Party) %>%
# count(sentiment) tallies how many distinct words of each sentiment appear;
# add wt = n to weight by how often each word was actually spoken
count(sentiment, sort = TRUE) %>%
ggplot(aes(reorder(sentiment, n), n, fill = Party)) + geom_col() +
coord_flip()
That’s a very simple chart - let’s use a different lexicon for better results.
inaugural_text %>%
group_by(Party) %>%
count(word, sort = TRUE) %>%
inner_join(get_sentiments('nrc')) -> inaugural_nrc
## Joining, by = "word"
inaugural_nrc %>%
group_by(Party) %>%
count(sentiment, sort = TRUE) %>%
ggplot(aes(reorder(sentiment, n),n, fill = Party)) + geom_col() +
coord_flip()
Finally, the ‘AFINN’ lexicon creates a ‘value’ column with the numeric ‘score’ of each word:
inaugural_text %>%
group_by(Party) %>%
count(word, sort = TRUE) %>%
inner_join(get_sentiments('afinn')) -> inaugural_afinn
## Joining, by = "word"
inaugural_afinn %>%
group_by(Party) %>%
count(value, sort = TRUE) %>%
ggplot(aes(value,n ,fill = Party)) + geom_col()
None of these visualizations are very informative - perhaps both political parties give similar inaugural speeches. Or perhaps we can see differences at the margins: which strongly negative words (an AFINN value of -3) does each party use?
inaugural_afinn %>%
filter(value == -3) %>%
head(20) %>%
ggplot(aes(word, n, fill = Party)) + geom_col() +
coord_flip()
The colors for each party are a little confusing - ggplot() chooses them automatically, since we didn’t specify any. Let’s do so - and also use a theme to make it look better:
inaugural_afinn %>%
filter(value == -3) %>%
head(20) %>%
ggplot(aes(word, n, fill = Party)) + geom_col() +
coord_flip() -> inaug
inaug +
scale_fill_manual(values = c("blue", "red")) +
theme_economist()
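One design note: scale_fill_manual() assigns colors in the order of the factor levels present in the data - alphabetical here - so 'blue' maps to Democratic and 'red' to Republican.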