Text Analysis in R

NOTE: Much of this tutorial is adapted or copied from the wonderful (free!) book Text Mining with R: a Tidy Approach by Julia Silge and David Robinson. I also highly recommend going through that book and Julia Silge’s recent Text mining with tidy data principles interactive tutorial if you want to take your tidy text analysis skills further. The tutorial’s exercises are accessible, have a built-in feedback mechanism, and will jumpstart your ability to work with text in R!

I obtained the data for this tutorial using the geniusr Genius API interface for R. Genius is a website that hosts song lyrics and user-contributed analyses of those lyrics. If you want to see how I obtained this data, I’ve provided a (poorly commented) pdf for your convenience.

To go through this workshop, either download the repository as a zip file here, or clone it from github.com/connor-french/intro_text_analysis.

Introduction

Using tidy data principles is a powerful way to make handling data easier and more effective, and this is no less true when it comes to dealing with text. As described by Hadley Wickham (Wickham 2014), tidy data has a specific structure:

Each variable is a column
Each observation is a row
Each type of observational unit is a table

Tidy text format is defined as a table with one token per row. A token is a meaningful unit of text, such as a word, sentence, or n-gram, that we are interested in using for analysis, and tokenization is the process of splitting text into tokens. This format may be new to those who have performed text analysis using other methods, but hopefully by the end you are convinced of the utility of tidy text. The tidytext R package, in concert with the tidyverse series of packages, will help us reach the goal of turning our text into tidy text.
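
To make the idea of a token concrete, here is a minimal sketch that tokenizes two lines (borrowed from the Candle lyrics used near the end of this tutorial) first into words and then into bigrams. The object names toy_lines, toy_words, and toy_bigrams are made up for this illustration.

```r
library(tibble)
library(tidytext)

# Two lines borrowed from the Candle lyrics used later in this tutorial
toy_lines <- tibble(
  line_number = 1:2,
  line = c("The same love I always knew", "Or were they always hazel?")
)

# One word per row (lowercased and stripped of punctuation by default)
toy_words <- unnest_tokens(toy_lines, word, line, token = "words")

# One bigram (pair of adjacent words) per row
toy_bigrams <- unnest_tokens(toy_lines, bigram, line, token = "ngrams", n = 2)

toy_words
toy_bigrams
```

Both results keep the line_number column, so every token can still be traced back to the line it came from.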

A typical text analysis workflow looks like this:

[Figure: Tidytext workflow]

We will follow this workflow to get you up and running with your own text analyses! If we have time at the end, we will also walk through a more involved use-case that you’ll probably see in the wild to turn unstructured text into something that you can analyze.

Get started

Today, we’re going to analyze the lyrics of two very different musical artists: the light and lilting indie-Americana musician Buck Meek and the merciless, pounding deathgrind band Full of Hell. We’re going to see if the music matches up with the words. Are Buck Meek’s lyrics more positive than Full of Hell’s? Or do their musical differences not match up with their lyrical differences? To answer this question, I obtained the lyrics from their most recent albums using the geniusr API. Other than what the API does natively, I’ve performed minimal processing of the data.

To begin, we need to load the essential packages.

# for data manipulation and plotting
library(tidyverse)
# for working with text data
library(tidytext)
# for obtaining the sentiment analysis lexicons
library(textdata)
# for file path management
library(here)

Now, let’s load the data! We’ll call this lyrics. We have a few different variables. The most relevant variables for today’s analysis are:

line: the lyrics, where each row is a line of lyrics
section_name: The section of the song the lyrics are in, which in most cases is something like “Chorus”, “Verse”, etc. but it occasionally diverges
song_name: The name of the song
artist_name: The name of the artist
line_number: The line number each line of the song is associated with. This is a useful identifier for when we split this data set into words!

lyrics <- read_csv(here("data", "lyrics.csv"))
  
glimpse(lyrics)

## Rows: 472
## Columns: 7
## $ line            <chr> "Pareidolia", "With your head upon my lap on the buffa…
## $ section_name    <chr> "Pareidolia", "Pareidolia", "Pareidolia", "Pareidolia"…
## $ section_artist  <chr> "Buck Meek", "Buck Meek", "Buck Meek", "Buck Meek", "B…
## $ song_name       <chr> "Pareidolia", "Pareidolia", "Pareidolia", "Pareidolia"…
## $ artist_name     <chr> "Buck Meek", "Buck Meek", "Buck Meek", "Buck Meek", "B…
## $ song_lyrics_url <chr> "https://genius.com/Buck-meek-pareidolia-lyrics", "htt…
## $ line_number     <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,…

Tidying our data

To work with this as a tidy dataset, we need to restructure it in the one-token-per-row format, which is done with the unnest_tokens() function. After the data frame (supplied here by the pipe), the first argument is the name of the output column, the second is the name of the input column, and the third is the type of token you want to split your data into (there are quite a few options, use ?unnest_tokens to see them!).

tidy_lyrics <- lyrics %>% 
  unnest_tokens(word, 
                line,
                token = "words") 

glimpse(tidy_lyrics)

## Rows: 2,689
## Columns: 7
## $ section_name    <chr> "Pareidolia", "Pareidolia", "Pareidolia", "Pareidolia"…
## $ section_artist  <chr> "Buck Meek", "Buck Meek", "Buck Meek", "Buck Meek", "B…
## $ song_name       <chr> "Pareidolia", "Pareidolia", "Pareidolia", "Pareidolia"…
## $ artist_name     <chr> "Buck Meek", "Buck Meek", "Buck Meek", "Buck Meek", "B…
## $ song_lyrics_url <chr> "https://genius.com/Buck-meek-pareidolia-lyrics", "htt…
## $ line_number     <dbl> 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, …
## $ word            <chr> "pareidolia", "with", "your", "head", "upon", "my", "l…

Notice that our data frame grew quite a bit! Each line was split into its component words, and the line_number variable still tells us which line each word belongs to. You might also notice that there are a lot of not-so-interesting words in the data set. Often in text analysis, we will want to remove these “stop words”: words that are not useful for an analysis, typically extremely common words such as “the”, “of”, and “to” in English. We can remove stop words (kept in the tidytext dataset stop_words) with an anti_join(). anti_join() keeps only the rows of the first data set whose key has no match in the second. In this case, we’re using the word column as our key, so every word in tidy_lyrics that also appears in stop_words is removed. Notice the dramatic reduction in the number of rows in our data set, from 2,689 to 1,195!

lyrics_no_stop <- tidy_lyrics %>% 
  anti_join(stop_words, by = "word")

glimpse(lyrics_no_stop)

## Rows: 1,195
## Columns: 7
## $ section_name    <chr> "Pareidolia", "Pareidolia", "Pareidolia", "Pareidolia"…
## $ section_artist  <chr> "Buck Meek", "Buck Meek", "Buck Meek", "Buck Meek", "B…
## $ song_name       <chr> "Pareidolia", "Pareidolia", "Pareidolia", "Pareidolia"…
## $ artist_name     <chr> "Buck Meek", "Buck Meek", "Buck Meek", "Buck Meek", "B…
## $ song_lyrics_url <chr> "https://genius.com/Buck-meek-pareidolia-lyrics", "htt…
## $ line_number     <dbl> 1, 2, 2, 2, 2, 3, 3, 3, 4, 5, 6, 6, 6, 7, 7, 7, 7, 8, …
## $ word            <chr> "pareidolia", "head", "lap", "buffalo", "grass", "clou…
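
If the behavior of anti_join() is hard to picture, this small sketch may help. It uses a hand-made three-word stop list (toy_stops, an assumption for illustration, not the real stop_words table) and pairs anti_join() with its mirror image, semi_join(), to show which rows were kept and which were dropped.

```r
library(dplyr)
library(tibble)

# A handful of tokenized words with their line numbers (made up for illustration)
toy_words <- tibble(
  line_number = c(1, 1, 1, 2, 2),
  word = c("the", "same", "love", "of", "hazel")
)

# A tiny stand-in for the stop_words data set
toy_stops <- tibble(word = c("the", "of", "to"))

# Rows whose word has NO match in the stop list are kept
kept <- anti_join(toy_words, toy_stops, by = "word")

# Rows whose word DOES match the stop list (what anti_join() threw away)
dropped <- semi_join(toy_words, toy_stops, by = "word")

kept
dropped
```

Checking the dropped rows with semi_join() is a quick way to confirm that a stop list is not removing words you care about.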

Explore our data

One of the most fundamental ways to explore our data is through counting words. Fortunately, dplyr has a function that makes this easy: count(). We tack on the sort = TRUE argument to sort the output from most to least frequent. Let’s explore a few different subsets of our data.

First, let’s count the whole dataset.

lyrics_no_stop %>% 
  count(word, sort = TRUE)

## # A tibble: 711 x 2
##    word      n
##    <chr> <int>
##  1 blue     13
##  2 eyes     13
##  3 love     13
##  4 jewel    10
##  5 time     10
##  6 mind      9
##  7 doors     8
##  8 heart     8
##  9 eye       7
## 10 hold      7
## # … with 701 more rows
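
For readers who want to see what count() is doing, the sketch below compares it with the longer group-and-summarise pipeline it abbreviates. The word vector is made up for illustration.

```r
library(dplyr)
library(tibble)

# A made-up set of words, one per row
toy_words <- tibble(word = c("blue", "eyes", "blue", "love", "blue", "eyes"))

# The shortcut used in this tutorial
by_count <- count(toy_words, word, sort = TRUE)

# The same tally spelled out step by step
by_hand <- toy_words %>%
  group_by(word) %>%
  summarise(n = n()) %>%
  arrange(desc(n))

by_count
by_hand
```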

Now, let’s look only at Buck Meek’s most common words. These are some pleasant words.

lyrics_no_stop %>% 
  filter(artist_name == "Buck Meek") %>% 
  count(word, sort = TRUE)

## # A tibble: 370 x 2
##    word      n
##    <chr> <int>
##  1 blue     13
##  2 eyes     13
##  3 love     11
##  4 mind      9
##  5 hold      7
##  6 time      7
##  7 left      6
##  8 moons     6
##  9 gold      5
## 10 hole      5
## # … with 360 more rows

While numbers are great and all, a quick data visualization makes patterns pop out. Here is a bar plot of the same data as above, with only the words that appear 4 or more times in the album. I rearranged the bars so that they appear in descending order of frequency.

lyrics_no_stop %>%
  filter(artist_name == "Buck Meek") %>% 
  count(word, sort = TRUE) %>%
  filter(n > 3) %>% 
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(n, word)) +
  geom_col() +
  labs(y = NULL)

Now, let’s take a look at the most common words for Full of Hell! These seem quite a bit darker. Without even cracking open our favorite sentiment lexicon, we can see that the words used by the two bands have quite a different vibe.

lyrics_no_stop %>% 
  filter(artist_name == "Full of Hell") %>% 
  count(word, sort = TRUE)

## # A tibble: 389 x 2
##    word         n
##    <chr>    <int>
##  1 jewel       10
##  2 heart        7
##  3 doors        6
##  4 eternal      6
##  5 rapture      6
##  6 dead         5
##  7 obsidian     5
##  8 weeping      5
##  9 cavern       4
## 10 crime        4
## # … with 379 more rows

Sentiment Analysis

When human readers approach a text, we use our understanding of the emotional intent of words to infer whether a section of text is positive or negative, or perhaps characterized by some other more nuanced emotion like surprise or disgust. We can use sentiment analysis to approach the emotional content of text programmatically.

One way to analyze the sentiment of a text is to consider the text as a combination of its individual words and the sentiment content of the whole text as the sum of the sentiment content of the individual words. This isn’t the only way to approach sentiment analysis, but it is an often-used approach, and an approach that naturally takes advantage of the tidy tool ecosystem.

To evaluate the sentiment of a text, we use dictionaries that map words or phrases to a particular sentiment. For instance, the word “sunshine” may be considered a positive word. These are called lexicons. There are many, where each is created for a particular context. When you select an existing lexicon or create your own, it is important to understand its particular biases and nuances. The word “sunshine” may be considered positive when interpreting children’s book texts, but negative when interpreting accounts of the Dust Bowl.

For this workshop, we’re going to use the AFINN and bing lexicons. These lexicons are based on unigrams, i.e., single words. They contain many English words, and each word is assigned a score for positive/negative sentiment. The AFINN lexicon assigns each word a score between -5 and 5, with negative scores indicating negative sentiment and positive scores indicating positive sentiment. The bing lexicon categorizes words in a binary fashion into positive and negative categories. Although we won’t do it here, I encourage you to explore these dictionaries and find places where the sentiment assignments make sense or don’t make sense for the lyrics we’re analyzing.

First, we need to obtain the lexicons. Some lexicons have licenses associated with them, so make sure that the license is appropriate for your project. We don’t need to worry about license permissions for this workshop.

afinn_sent <- get_sentiments("afinn")
bing_sent <- get_sentiments("bing")

The AFINN lexicon has a column for words, word, and the AFINN score, value.

glimpse(afinn_sent)

## Rows: 2,477
## Columns: 2
## $ word  <chr> "abandon", "abandoned", "abandons", "abducted", "abduction", "ab…
## $ value <dbl> -2, -2, -2, -2, -2, -2, -3, -3, -3, -3, 2, 2, 1, -1, -1, 2, 2, 2…

The bing lexicon has a column for words, word, and the binary sentiment, sentiment. There are quite a few more words in the bing lexicon relative to the AFINN lexicon.

glimpse(bing_sent)

## Rows: 6,786
## Columns: 2
## $ word      <chr> "2-faces", "abnormal", "abolish", "abominable", "abominably"…
## $ sentiment <chr> "negative", "negative", "negative", "negative", "negative", …

AFINN Analysis

Let’s take a look at the AFINN data set first. To analyze our data, we need to combine the lyrics with the lexicon. We will do that with an inner_join(), which only keeps rows where the key matches between the two data sets. We’re also adding a unique identifier for each word with the index column and renaming the value column to afinn.

Notice that the number of rows drops dramatically: only 185 of the 1,195 words left after stop-word removal match an entry in the AFINN lexicon. If we were doing research, we might want to investigate the non-overlapping words and see if there is a different, more inclusive lexicon for the lyrics.

afinn_df <- lyrics_no_stop %>% 
  inner_join(afinn_sent, by = "word") %>% 
  # unique identifier for each word
  mutate(index = row_number()) %>% 
  # a more useful name for the afinn score
  rename(afinn = value)

glimpse(afinn_df)

## Rows: 185
## Columns: 9
## $ section_name    <chr> "Pareidolia", "Pareidolia", "Pareidolia", "Pareidolia"…
## $ section_artist  <chr> "Buck Meek", "Buck Meek", "Buck Meek", "Buck Meek", "B…
## $ song_name       <chr> "Pareidolia", "Pareidolia", "Pareidolia", "Pareidolia"…
## $ artist_name     <chr> "Buck Meek", "Buck Meek", "Buck Meek", "Buck Meek", "B…
## $ song_lyrics_url <chr> "https://genius.com/Buck-meek-pareidolia-lyrics", "htt…
## $ line_number     <dbl> 8, 19, 24, 25, 7, 12, 15, 15, 20, 24, 8, 9, 10, 11, 12…
## $ word            <chr> "paradise", "ghost", "hell", "lucky", "heaven", "love"…
## $ afinn           <dbl> 3, -1, -4, 3, 2, 3, -3, -2, 3, 3, -1, 1, 3, 1, 3, -3, …
## $ index           <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,…
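
As a rough illustration of why the join shrinks the data, the sketch below scores a few words against a four-word toy lexicon. The scores are copied by hand from the output above; the artist labels "A" and "B" and all object names are made up for this example.

```r
library(dplyr)
library(tibble)

# A toy lexicon: four words and their scores as they appear in the output above
toy_lexicon <- tibble(
  word = c("paradise", "ghost", "hell", "lucky"),
  value = c(3, -1, -4, 3)
)

# Made-up lyrics; "buffalo" has no entry in the toy lexicon
toy_lyrics <- tibble(
  artist = c("A", "A", "A", "B", "B"),
  word = c("paradise", "buffalo", "lucky", "ghost", "hell")
)

# inner_join() keeps only words present in BOTH tables, so "buffalo" disappears
toy_scored <- inner_join(toy_lyrics, toy_lexicon, by = "word")

# Sum the scores per artist, the "sum of the individual words" idea
toy_totals <- toy_scored %>%
  group_by(artist) %>%
  summarise(total = sum(value), n_scored = n())

toy_scored
toy_totals
```

Words missing from the lexicon contribute nothing to the total, which is why it is worth checking how much of the text a lexicon actually covers.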

There are many ways to visualize the sentiment of our data. Since we have continuous values that have a defined midpoint (0), a diverging bar plot will give us a sense of the frequency and magnitude of positive and negative words in both artists.

Looks like we have some support for our hypothesis! The deathgrind band Full of Hell appears to have more words with negative connotations than Buck Meek. Because artist_index follows row order, it also roughly corresponds to the word’s position in the album, so Full of Hell seems to get more positive as the album progresses.

afinn_df %>% 
  group_by(artist_name) %>% 
  mutate(
    # create a unique index per artist
    artist_index = row_number(),
    # create a binary positive/negative variable to color bars with and emphasize the positive vs negative relationship
    overall_sent = if_else(
      afinn >= 0, "positive", "negative"
      )
    ) %>% 
  ggplot(aes(y = afinn, x = artist_index, fill = overall_sent)) +
  geom_col() +
  # unique pane per artist
  facet_wrap(~artist_name, nrow = 2)

Aggregating single words may not be good enough to get a sense of the sentiment of a body of text. Each song is divided into sections, like the verse, chorus, etc. (although it is not a perfect divide). The overall sentiment of each section may give us a better sense of what feeling the artist is going for. Let’s group the words by their song and section, then summarize the sentiment of each section by taking the sum. We’ll also take a different approach to visualization: a density plot to compare the distributions of sentiment values without retaining their order in the album. This gives us a more direct look at the center and spread of sentiment for the two artists.

This looks like more support for our hypothesis, although the magnitude of the difference is not quite as large as I would have imagined.

afinn_df %>% 
  group_by(artist_name, song_name, section_name) %>% 
  # total AFINN score of each section (a sum, not an average)
  summarize(section_sent = sum(afinn), .groups = "drop") %>% 
  ggplot(aes(x = section_sent, color = artist_name)) +
  geom_density()
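
One thing to keep in mind with section sums is that they grow with the number of scored words in a section. The sketch below, using made-up scores, computes both the sum and the mean so the two summaries can be compared; the mean is offered only as an alternative to consider, not as what this tutorial uses.

```r
library(dplyr)
library(tibble)

# Made-up AFINN scores: a three-word verse and a one-word chorus
toy_scored <- tibble(
  section = c("Verse 1", "Verse 1", "Verse 1", "Chorus"),
  afinn = c(-2, -2, -2, -3)
)

toy_sections <- toy_scored %>%
  group_by(section) %>%
  summarise(
    total_sent = sum(afinn),  # sensitive to how many words were scored
    mean_sent = mean(afinn),  # average score per scored word
    n_words = n()
  )

toy_sections
```

Here the verse has the more negative total, while the chorus has the more negative mean, so the choice of summary can change which section looks darker.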

Bing Analysis

Now let’s take a look at the bing binary lexicon.

We will join the lexicon with the lyrics in a similar manner as earlier.

It looks like the bing lexicon has a few more words in common with the song lyrics than the AFINN lexicon (230 matches versus 185).

bing_df <- lyrics_no_stop %>% 
  inner_join(bing_sent, by = "word") %>% 
  mutate(index = row_number())

glimpse(bing_df)

## Rows: 230
## Columns: 9
## $ section_name    <chr> "Pareidolia", "Pareidolia", "Pareidolia", "Pareidolia"…
## $ section_artist  <chr> "Buck Meek", "Buck Meek", "Buck Meek", "Buck Meek", "B…
## $ song_name       <chr> "Pareidolia", "Pareidolia", "Pareidolia", "Pareidolia"…
## $ artist_name     <chr> "Buck Meek", "Buck Meek", "Buck Meek", "Buck Meek", "B…
## $ song_lyrics_url <chr> "https://genius.com/Buck-meek-pareidolia-lyrics", "htt…
## $ line_number     <dbl> 3, 8, 13, 20, 24, 25, 28, 3, 4, 6, 7, 12, 15, 15, 20, …
## $ word            <chr> "fast", "paradise", "burning", "froze", "hell", "lucky…
## $ sentiment       <chr> "positive", "positive", "negative", "negative", "negat…
## $ index           <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,…

Since we’re dealing with categorical data (a binary “positive”/“negative” label), some sort of frequency chart is appropriate. Let’s count the number of each sentiment associated with the artists.

We can count the sentiments like we counted words earlier, and we can even create a similar bar chart! It definitely looks like Full of Hell has more negative words than Buck Meek, and a higher negative-to-positive ratio. However, the raw counts exaggerate this difference because Full of Hell has more words overall than Buck Meek. To compare the relative number of negative vs. positive words between artists, we need to use proportions!

sent_count_df <- bing_df %>% 
  group_by(artist_name) %>% 
  count(sentiment, sort = TRUE) 

# plot the counts
sent_count_df %>% 
  ggplot(aes(x = artist_name, y = n, fill = sentiment)) +
  geom_col(position = "dodge")

To convert to proportions, we need to divide the count (per artist) of each sentiment by the total. Then we can compare the proportions with a stacked bar plot.

The pattern is as expected, but the relative proportions are clearer.

sent_prop_df <- sent_count_df %>% 
  group_by(artist_name) %>% 
  mutate(prop_sent = n / sum(n)) %>% 
  ungroup() 

sent_prop_df %>% 
  ggplot(aes(x = artist_name, y = prop_sent, fill = sentiment)) +
  geom_col()
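
To check the arithmetic behind the proportions, here is a small sketch with made-up counts for two hypothetical artists, "A" and "B". Within each artist, every count is divided by that artist’s total, so the proportions for each artist add up to 1.

```r
library(dplyr)
library(tibble)

# Made-up sentiment counts for two hypothetical artists
toy_counts <- tibble(
  artist_name = c("A", "A", "B", "B"),
  sentiment = c("negative", "positive", "negative", "positive"),
  n = c(30, 10, 12, 12)
)

toy_props <- toy_counts %>%
  group_by(artist_name) %>%
  # sum(n) is computed within each artist because of group_by()
  mutate(prop_sent = n / sum(n)) %>%
  ungroup()

toy_props
```

Artist A has more negative words in absolute terms (30 versus 12), but the proportions (0.75 versus 0.5) are what make the two artists comparable. As an aside, geom_col(position = "fill") should draw a similar stacked-proportion chart directly from the counts.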

Going further

This is a special section with slightly more advanced code that I won’t take too long to explain. Look at the code comments for a brief explanation!

We can even summarize counts by section. One way is to only take the most frequent sentiment as the overall sentiment of the section.

It looks like there were no positive sections for Full of Hell, while Buck Meek had close to a 50/50 split. Looks about right!

common_sent_df <- bing_df %>% 
  # group by section name
  group_by(artist_name, section_name) %>% 
  # count the number of each sentiment (# positive, # negative)
  count(sentiment) %>% 
  # convert the data frame so each sentiment count has its own column (they are named "positive" and "negative")
  pivot_wider(names_from = sentiment, values_from = n) %>%
  mutate(
    # some sections don't have a particular sentiment, which returns an NA. We want these to show up as 0 instead
    positive = replace_na(positive, 0),
    negative = replace_na(negative, 0),
    # Now, we determine which is most common using an if_else statement.
    # Note that ties (positive == negative) are labeled "negative".
    common_sent = if_else(
      positive > negative, "positive", "negative"
  ))

common_sent_df %>% 
  group_by(artist_name) %>% 
  count(common_sent) %>% 
  ggplot(aes(x = artist_name, y = n, fill = common_sent)) +
  geom_col(position = "dodge")
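
The reshaping step is easier to follow on a tiny table. This sketch uses made-up section counts and, as a variation on the code above, asks pivot_wider() to fill missing combinations with 0 directly through its values_fill argument (assuming a reasonably recent version of tidyr) instead of calling replace_na() afterwards.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Made-up counts: the chorus has no positive words, the bridge is a tie
toy_counts <- tibble(
  section_name = c("Verse 1", "Verse 1", "Chorus", "Bridge", "Bridge"),
  sentiment = c("positive", "negative", "negative", "positive", "negative"),
  n = c(3L, 1L, 2L, 1L, 1L)
)

toy_wide <- toy_counts %>%
  # one row per section, one column per sentiment; missing counts become 0
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0L) %>%
  # ties fall through to "negative", as in the code above
  mutate(common_sent = if_else(positive > negative, "positive", "negative"))

toy_wide
```

The tied bridge ends up labeled "negative", which is worth remembering when reading the bar chart above.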

More serious data wrangling

This is how text may appear if you don’t get to use a fancy API to obtain your data. It’s a single text string with newline characters (\n) and extraneous section labels ([Verse 1], [Instrumental Break]). We want to get this into a tidy format, where each word is an observation and we have the line number for each word. To do this, we will use the powerful stringr package.

These are Buck Meek lyrics from the album we analyzed earlier (the song is Candle)! I scraped these lyrics from the Genius web page using the rvest package. This is effectively what the Genius API does, but the API does some helpful transformation under the hood that we’ll do here!

candle_lyrics <- "[Verse 1]\nInnocence is a light beam, you're doing your thing\nWith your arm out your window up Highway 9\nWhen it's too much to handle, burn me a candle\nIf you don't have a candle, let me burn on your mind\n[Verse 2]\nThe song of the sirens caught up with me downwind\nMy nose started bleeding by the second note\nHeaven is a motel with a telephone seashell\nWell, check-out's at eleven, and don't ask for more time\n[Chorus]\nWell, did your eyes change? I remember them blue\nOr were they always hazel?\nStill the same face with a line or two\nThe same love I always knew\n[Verse 3]\nI try not to call, but I think I'm being followed\nIt's been about an hour or so\nI hate for you to hear me scared, otherwise, I'm well\nI guess you're still the first place I go\n[Chorus]\nDid your eyes change? I remember them blue\nOr were they always hazel?\nStill the same face with a line or two\nThe same love I always knew\n[Instrumental Break]\n[Chorus]\nDid your eyes change? I remember them blue\nOr were they always hazel?\nStill the same face with a line or two\nThe same love I always knew"

I will split the process up into separate steps, then present them as a cohesive flow at the end. There are multiple ways you could parse this text, so don’t feel like this is the “one way” to do it. And if you’re a regex superhero, I would definitely like to hear your better solution.

Since we want a single line of lyrics per row, we split this string at each line break. Fortunately, \n marks where a line break occurs! To split the single string into a vector with one line per element, we will use the str_split() function. Its first argument is a character vector and its second is a pattern to match. Here, we specify \n with an additional backslash in front ("\\n"). The first backslash “escapes” the second, since R treats the backslash as a special character inside strings, so the regex engine receives \n and matches a newline. We also tack on unlist() at the end, because str_split() returns a list of character vectors rather than a single vector.

candle_lyrics %>%
  str_split("\\n") %>% 
  unlist()

##  [1] "[Verse 1]"                                               
##  [2] "Innocence is a light beam, you're doing your thing"      
##  [3] "With your arm out your window up Highway 9"              
##  [4] "When it's too much to handle, burn me a candle"          
##  [5] "If you don't have a candle, let me burn on your mind"    
##  [6] "[Verse 2]"                                               
##  [7] "The song of the sirens caught up with me downwind"       
##  [8] "My nose started bleeding by the second note"             
##  [9] "Heaven is a motel with a telephone seashell"             
## [10] "Well, check-out's at eleven, and don't ask for more time"
## [11] "[Chorus]"                                                
## [12] "Well, did your eyes change? I remember them blue"        
## [13] "Or were they always hazel?"                              
## [14] "Still the same face with a line or two"                  
## [15] "The same love I always knew"                             
## [16] "[Verse 3]"                                               
## [17] "I try not to call, but I think I'm being followed"       
## [18] "It's been about an hour or so"                           
## [19] "I hate for you to hear me scared, otherwise, I'm well"   
## [20] "I guess you're still the first place I go"               
## [21] "[Chorus]"                                                
## [22] "Did your eyes change? I remember them blue"              
## [23] "Or were they always hazel?"                              
## [24] "Still the same face with a line or two"                  
## [25] "The same love I always knew"                             
## [26] "[Instrumental Break]"                                    
## [27] "[Chorus]"                                                
## [28] "Did your eyes change? I remember them blue"              
## [29] "Or were they always hazel?"                              
## [30] "Still the same face with a line or two"                  
## [31] "The same love I always knew"
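
To see why unlist() is needed, the sketch below runs the same split on a shortened, three-line string (an excerpt made up from the lyrics above) and inspects the result before and after flattening. It also checks that the single-backslash pattern "\n" gives the same split here, since R turns it into a literal newline character before the regex engine sees it.

```r
library(stringr)

# A shortened excerpt of the lyrics string, for illustration only
toy_lyrics <- "[Chorus]\nOr were they always hazel?\nThe same love I always knew"

# str_split() returns a list with one element per input string
pieces <- str_split(toy_lyrics, "\\n")

# unlist() flattens that list into a plain character vector of lines
toy_lines <- unlist(pieces)

# A literal newline in the pattern splits in the same places
same_split <- identical(str_split(toy_lyrics, "\n"), pieces)

is.list(pieces)
toy_lines
same_split
```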

This gets us most of the way there! We don’t really want the section headers, like “[Chorus]”, “[Verse 1]”, etc. We could label each section with these headers, but for the sake of this exercise, let’s just remove them.

To remove them, we need regular expressions! Each header is a pair of brackets [] enclosing letters, spaces, and sometimes digits. In a regex, \D+ matches one or more non-digit characters and \d+ matches one or more digits; \[ and \] match literal brackets, which must be escaped because brackets are otherwise special. Inside an R string every backslash is doubled, so the pattern is written "\\[\\D+\\d+\\]". The whole pattern means “an opening bracket, then one or more non-digit characters, then one or more digits, then a closing bracket”. The str_remove() function deletes the first match of this pattern from each line that contains it, leaving an empty string where a header used to be.

candle_lyrics %>%
  str_split("\\n") %>% 
  unlist() %>% 
  str_remove("\\[\\D+\\d+\\]")

##  [1] ""                                                        
##  [2] "Innocence is a light beam, you're doing your thing"      
##  [3] "With your arm out your window up Highway 9"              
##  [4] "When it's too much to handle, burn me a candle"          
##  [5] "If you don't have a candle, let me burn on your mind"    
##  [6] ""                                                        
##  [7] "The song of the sirens caught up with me downwind"       
##  [8] "My nose started bleeding by the second note"             
##  [9] "Heaven is a motel with a telephone seashell"             
## [10] "Well, check-out's at eleven, and don't ask for more time"
## [11] "[Chorus]"                                                
## [12] "Well, did your eyes change? I remember them blue"        
## [13] "Or were they always hazel?"                              
## [14] "Still the same face with a line or two"                  
## [15] "The same love I always knew"                             
## [16] ""                                                        
## [17] "I try not to call, but I think I'm being followed"       
## [18] "It's been about an hour or so"                           
## [19] "I hate for you to hear me scared, otherwise, I'm well"   
## [20] "I guess you're still the first place I go"               
## [21] "[Chorus]"                                                
## [22] "Did your eyes change? I remember them blue"              
## [23] "Or were they always hazel?"                              
## [24] "Still the same face with a line or two"                  
## [25] "The same love I always knew"                             
## [26] "[Instrumental Break]"                                    
## [27] "[Chorus]"                                                
## [28] "Did your eyes change? I remember them blue"              
## [29] "Or were they always hazel?"                              
## [30] "Still the same face with a line or two"                  
## [31] "The same love I always knew"

You may notice that a few headers are left ([Chorus] and [Instrumental Break])! These headers don’t contain digits, so \d+ has nothing to match. Regex patterns are picky, so we need to apply the same pattern again, this time with only non-digits between the brackets.

Great! Now we just need to get rid of those empty lines.

candle_lyrics %>%
  str_split("\\n") %>% 
  unlist() %>% 
  str_remove("\\[\\D+\\d+\\]") %>% 
  str_remove("\\[\\D+\\]")

##  [1] ""                                                        
##  [2] "Innocence is a light beam, you're doing your thing"      
##  [3] "With your arm out your window up Highway 9"              
##  [4] "When it's too much to handle, burn me a candle"          
##  [5] "If you don't have a candle, let me burn on your mind"    
##  [6] ""                                                        
##  [7] "The song of the sirens caught up with me downwind"       
##  [8] "My nose started bleeding by the second note"             
##  [9] "Heaven is a motel with a telephone seashell"             
## [10] "Well, check-out's at eleven, and don't ask for more time"
## [11] ""                                                        
## [12] "Well, did your eyes change? I remember them blue"        
## [13] "Or were they always hazel?"                              
## [14] "Still the same face with a line or two"                  
## [15] "The same love I always knew"                             
## [16] ""                                                        
## [17] "I try not to call, but I think I'm being followed"       
## [18] "It's been about an hour or so"                           
## [19] "I hate for you to hear me scared, otherwise, I'm well"   
## [20] "I guess you're still the first place I go"               
## [21] ""                                                        
## [22] "Did your eyes change? I remember them blue"              
## [23] "Or were they always hazel?"                              
## [24] "Still the same face with a line or two"                  
## [25] "The same love I always knew"                             
## [26] ""                                                        
## [27] ""                                                        
## [28] "Did your eyes change? I remember them blue"              
## [29] "Or were they always hazel?"                              
## [30] "Still the same face with a line or two"                  
## [31] "The same love I always knew"
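
As one possible alternative to the two passes above, a single pattern that targets any line made up entirely of a bracketed label would handle headers with and without digits at once. This is only a sketch on a few sample lines, and it assumes headers always occupy a whole line by themselves.

```r
library(stringr)

# A few sample lines, including headers with and without digits
toy_lines <- c(
  "[Verse 1]",
  "With your arm out your window up Highway 9",
  "[Chorus]",
  "[Instrumental Break]",
  "Or were they always hazel?"
)

# ^ and $ anchor the match to the whole line; .* matches anything between
# the literal opening and closing brackets
cleaned <- str_remove(toy_lines, "^\\[.*\\]$")

cleaned
```

Anchoring the pattern to the whole line also protects lyric lines that merely contain digits, such as the one ending in “Highway 9”.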

To remove the empty lines, we just need to convert the empty strings into NA values, then remove those.

candle_lyrics %>%
  str_split("\\n") %>% 
  unlist() %>% 
  str_remove("\\[\\D+\\d+\\]") %>% 
  str_remove("\\[\\D+\\]") %>% 
  na_if("") %>% 
  na.omit()

##  [1] "Innocence is a light beam, you're doing your thing"      
##  [2] "With your arm out your window up Highway 9"              
##  [3] "When it's too much to handle, burn me a candle"          
##  [4] "If you don't have a candle, let me burn on your mind"    
##  [5] "The song of the sirens caught up with me downwind"       
##  [6] "My nose started bleeding by the second note"             
##  [7] "Heaven is a motel with a telephone seashell"             
##  [8] "Well, check-out's at eleven, and don't ask for more time"
##  [9] "Well, did your eyes change? I remember them blue"        
## [10] "Or were they always hazel?"                              
## [11] "Still the same face with a line or two"                  
## [12] "The same love I always knew"                             
## [13] "I try not to call, but I think I'm being followed"       
## [14] "It's been about an hour or so"                           
## [15] "I hate for you to hear me scared, otherwise, I'm well"   
## [16] "I guess you're still the first place I go"               
## [17] "Did your eyes change? I remember them blue"              
## [18] "Or were they always hazel?"                              
## [19] "Still the same face with a line or two"                  
## [20] "The same love I always knew"                             
## [21] "Did your eyes change? I remember them blue"              
## [22] "Or were they always hazel?"                              
## [23] "Still the same face with a line or two"                  
## [24] "The same love I always knew"                             
## attr(,"na.action")
## [1]  1  6 11 16 21 26 27
## attr(,"class")
## [1] "omit"
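
The extra attr lines in the output above come from na.omit(), which records the positions it removed. If that bookkeeping is not needed, plain logical subsetting is one alternative worth knowing; the sketch below compares the two on a made-up four-element vector.

```r
library(dplyr)

# Two lyric lines separated by empty strings (made up for illustration)
toy_lines <- c("", "Or were they always hazel?", "", "The same love I always knew")

# The approach used above: empty string -> NA -> dropped
via_na <- toy_lines %>%
  na_if("") %>%
  na.omit()

# An alternative: keep only the elements that are not empty strings
via_subset <- toy_lines[toy_lines != ""]

via_na
via_subset
```

Both approaches keep the same two lines; only the na.omit() result carries the na.action attribute.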

This is something we can work with! The last action we need to take is to convert this into a data frame. We can do this with enframe(). Our vector has no names, so enframe() uses each element’s position in the vector as its name. Those positions correspond with the line numbers of the song (with the headers removed), so we’re naming this variable line_number, and the value is the line of lyrics, which we’re calling line.

candle_df <- candle_lyrics %>%
  str_split("\\n") %>% 
  unlist() %>% 
  str_remove("\\[\\D+\\d+\\]") %>% 
  str_remove("\\[\\D+\\]") %>% 
  na_if("") %>% 
  na.omit() %>% 
  enframe(name = "line_number", value = "line")

candle_df

## # A tibble: 24 x 2
##    line_number line                                                    
##          <int> <chr>                                                   
##  1           1 Innocence is a light beam, you're doing your thing      
##  2           2 With your arm out your window up Highway 9              
##  3           3 When it's too much to handle, burn me a candle          
##  4           4 If you don't have a candle, let me burn on your mind    
##  5           5 The song of the sirens caught up with me downwind       
##  6           6 My nose started bleeding by the second note             
##  7           7 Heaven is a motel with a telephone seashell             
##  8           8 Well, check-out's at eleven, and don't ask for more time
##  9           9 Well, did your eyes change? I remember them blue        
## 10          10 Or were they always hazel?                              
## # … with 14 more rows
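
Here is enframe() in isolation on a two-element vector (made up from the lyrics above), showing how an unnamed vector becomes a two-column tibble with the element positions in the first column.

```r
library(tibble)

# An unnamed character vector of lyric lines
toy_lines <- c("Or were they always hazel?", "The same love I always knew")

# With no names on the vector, the name column holds the positions 1, 2, ...
toy_df <- enframe(toy_lines, name = "line_number", value = "line")

toy_df
```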

Now, all you have to do to make this tidy is use unnest_tokens()!

candle_df %>% 
  unnest_tokens(word, line)

## # A tibble: 200 x 2
##    line_number word     
##          <int> <chr>    
##  1           1 innocence
##  2           1 is       
##  3           1 a        
##  4           1 light    
##  5           1 beam     
##  6           1 you're   
##  7           1 doing    
##  8           1 your     
##  9           1 thing    
## 10           2 with     
## # … with 190 more rows
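
Earlier, the section headers were simply removed. If you would rather keep them as labels, one possible approach is sketched below on a shortened excerpt of the lyrics: capture the text inside the brackets, carry it down to the following lines with tidyr’s fill(), and then drop the header rows. The column names header and section_name are choices made for this example.

```r
library(dplyr)
library(stringr)
library(tidyr)
library(tibble)

# A shortened excerpt of the lyrics string, for illustration only
toy_lyrics <- "[Verse 1]\nWith your arm out your window up Highway 9\n[Chorus]\nOr were they always hazel?\nThe same love I always knew"

toy_df <- tibble(line = unlist(str_split(toy_lyrics, "\\n"))) %>%
  mutate(
    # text inside the brackets for header lines, NA for lyric lines
    header = str_match(line, "^\\[(.*)\\]$")[, 2],
    section_name = header
  ) %>%
  # copy each header down onto the lyric lines that follow it
  fill(section_name) %>%
  # keep only the lyric lines, then number them
  filter(is.na(header)) %>%
  select(-header) %>%
  mutate(line_number = row_number())

toy_df
```

From here, unnest_tokens(word, line) would give one word per row while keeping both section_name and line_number.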