library(geniusr) # This package gets lyrics
library(tidyverse)
library(tidytext)
library(wordcloud2)
invisible(genius_token()) # load the Genius API token without printing it; the token is a secret and should not appear in a published document
- I want to look at lyrics by The Vines, an Australian rock band that was briefly successful in the early 2000s, specifically from their debut album Highly Evolved.
To do so, I am using the Genius.com API, which contains a repository of song lyrics.
I remember the name of one song from the album, “Get Free,” so I first want to see which songs share that title and make sure I choose the right one.
search_song("get free")
Okay, I see the song ID I want for the correct artist, so I’ll tell R to show me information about this song, including the album ID.
get_song_meta(897254)
Great, now I know the album ID for the album I want to look at, Highly Evolved. So I’ll tell R to “scrape” a track list from that album.
evolved_tracks <- scrape_tracklist(214505)
evolved_tracks
Now that I have the list of all songs from that album, I can tell R to load the lyrics from each of them.
evolved_lyrics <- map_df(evolved_tracks$song_lyrics_url, scrape_lyrics_url)
evolved_lyrics
As we can see, these lyrics are all in fairly long strings. I want to analyze individual words from these lyrics, so I’ll tell R to unnest them.
evolved_words <- evolved_lyrics %>%
unnest_tokens(word, line) %>%
select(song_name, word)
evolved_words
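To make the unnesting step concrete, here is a minimal sketch on a made-up two-line data frame. The names `toy_lyrics` and `toy_words` and the lyric text are invented for illustration and are not part of the real data; the sketch assumes the `dplyr` and `tidytext` packages are installed.

```r
library(dplyr)
library(tidytext)

# Hypothetical stand-in for evolved_lyrics: one row per lyric line
toy_lyrics <- tibble(
  song_name = c("Song A", "Song A"),
  line = c("Hello hello world", "Goodbye, world!")
)

# unnest_tokens() splits each line into one row per word,
# lowercasing the text and stripping punctuation along the way
toy_words <- toy_lyrics %>%
  unnest_tokens(word, line) %>%
  select(song_name, word)

toy_words
```

Each word ends up in its own row, which is the shape the counting and joining steps expect.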
That’s better! But there are a lot of stop words in there (so, a, etc.), so next I want to…
- Remove all of the stop words from these lyrics.
evolved_words %>%
anti_join(get_stopwords()) %>%
count(word, sort = T)
Joining, by = "word"
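As a rough illustration of what the anti-join does, the sketch below uses a tiny made-up stop word list instead of `get_stopwords()`. Both `toy_words` and `toy_stop` are hypothetical, and the sketch assumes only `dplyr` is installed.

```r
library(dplyr)

# Hypothetical word list, already one word per row
toy_words <- tibble(word = c("i", "wanna", "get", "free", "i", "get", "free"))

# Hypothetical stop word list (the real one is much longer)
toy_stop <- tibble(word = c("i", "a", "the"))

# anti_join() keeps only the rows of toy_words with no match in toy_stop;
# passing by = "word" also silences the "Joining, by" message
toy_counts <- toy_words %>%
  anti_join(toy_stop, by = "word") %>%
  count(word, sort = TRUE)

toy_counts
```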
That looks better! But what would this new list look like in the form of a wordcloud? Let’s tell R to show us…
evolved_words %>%
anti_join(get_stopwords()) %>%
count(word, sort = T) %>%
top_n(200) %>%
wordcloud2(size = .5)
Joining, by = "word"
Selecting by n
- Now I want to use sentiment analysis to see which words, for example, carry positive connotations and which carry negative connotations from these lyrics.
In order to do a sentiment analysis, we need a dictionary that tells us what the sentiment of any given word is likely to be (obviously, this is an imperfect art).
First, I will use the bing sentiment dictionary.
bing <- get_sentiments("bing")
bing
Now that I have loaded in the bing sentiment dictionary (which we can see above), I want to apply its contents to the lyrics.
evolved_words %>%
inner_join(bing) %>%
count(word, sentiment, sort = TRUE)
Joining, by = "word"
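The join against a sentiment dictionary can be sketched with a tiny invented lexicon. `toy_lexicon` below is hypothetical and only mimics the two-column shape (`word`, `sentiment`) of the bing dictionary; the sketch assumes `dplyr` is installed.

```r
library(dplyr)

# Hypothetical lyric words, one per row
toy_words <- tibble(word = c("free", "free", "dead", "love", "gonna"))

# Hypothetical lexicon with the same columns as the bing dictionary
toy_lexicon <- tibble(
  word = c("free", "love", "dead"),
  sentiment = c("positive", "positive", "negative")
)

# inner_join() keeps only the words that appear in the lexicon,
# so words without a sentiment label (here "gonna") drop out
toy_sentiment <- toy_words %>%
  inner_join(toy_lexicon, by = "word") %>%
  count(word, sentiment, sort = TRUE)

toy_sentiment
```

Note that any word missing from the dictionary silently disappears from the result, which is one reason the choice of dictionary matters.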
This looks fairly straightforward to me. Let’s graph it!
evolved_words %>%
inner_join(bing) %>%
count(word, sentiment, sort = TRUE) %>%
group_by(sentiment) %>%
top_n(10) %>%
ungroup() %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(word, n, fill = sentiment)) +
geom_col(show.legend = FALSE) +
facet_wrap(vars(sentiment), scales = "free") +
labs(y = "The Vines's Highly Evolved album: Words that contribute the most to each sentiment",
x = NULL) +
scale_fill_viridis_d() +
coord_flip() +
theme_minimal()
Joining, by = "word"
Selecting by n

Now I want to look at sentiments using a different sentiment dictionary, called NRC.
nrc <- get_sentiments("nrc")
nrc
evolved_words %>%
inner_join(nrc) %>%
count(word, sentiment, sort = TRUE)
Joining, by = "word"
evolved_words %>%
inner_join(nrc) %>%
count(word, sentiment, sort = TRUE) %>%
group_by(sentiment) %>%
top_n(3) %>%
ungroup() %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(word, n, fill = sentiment)) +
geom_col(show.legend = FALSE) +
facet_wrap(vars(sentiment), scales = "free") +
labs(y = "The Vines's Highly Evolved album: Words that contribute the most to each sentiment",
x = NULL) +
scale_fill_viridis_d() +
coord_flip() +
theme_minimal()
Joining, by = "word"
Selecting by n

- Instead of only examining individual words, let’s take a look at pairs of adjacent words (bigrams) in the lyrics.
evolved_lyrics %>%
unnest_tokens(bigram, line, token = "ngrams", n = 2) %>%
select(bigram)
This looks good, except we get stopwords again. Let’s get rid of those.
evolved_lyrics %>%
unnest_tokens(bigram, line, token = "ngrams", n = 2) %>%
select(bigram) -> evolved_bigrams
evolved_bigrams %>%
separate(bigram, c("word1", "word2"), sep = " ") %>%
filter(!word1 %in% stop_words$word) %>%
filter(!word2 %in% stop_words$word) %>%
unite(bigram, word1, word2, sep = " ")
That looks better! Now we can tell R to count how many times each pairing occurs.
evolved_bigrams %>%
separate(bigram, c("word1", "word2"), sep = " ") %>%
filter(!word1 %in% stop_words$word) %>%
filter(!word2 %in% stop_words$word) %>%
unite(bigram, word1, word2, sep = " ") %>%
count(bigram, sort = T)
As we can see, the word pairs “gonna get” and “get free” are among the most common.
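The bigram steps above can be sketched end to end on two made-up lyric lines. The lines, the one-word stop list `toy_stop`, and the other `toy_` names are all hypothetical; the sketch assumes `dplyr`, `tidyr`, and `tidytext` are installed.

```r
library(dplyr)
library(tidyr)
library(tidytext)

# Hypothetical lyric lines
toy_lyrics <- tibble(line = c("gonna get free", "i gonna get free"))

# Hypothetical stop word list, standing in for stop_words$word
toy_stop <- c("i")

toy_bigram_counts <- toy_lyrics %>%
  unnest_tokens(bigram, line, token = "ngrams", n = 2) %>% # overlapping word pairs
  separate(bigram, c("word1", "word2"), sep = " ") %>%     # split each pair in two
  filter(!word1 %in% toy_stop) %>%                         # drop pairs starting with a stop word
  filter(!word2 %in% toy_stop) %>%                         # drop pairs ending with a stop word
  unite(bigram, word1, word2, sep = " ") %>%               # glue the pair back together
  count(bigram, sort = TRUE)

toy_bigram_counts
```

Bigrams are built within each line, so a pair never spans two lines of lyrics.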
- I would now like to see which words tend to immediately follow the word “I” and which tend to immediately follow the word “you,” to see if any interesting difference is apparent.
first_word <- c("i", "you") # these need to be lowercase
evolved_bigrams %>%
count(bigram, sort = T) %>%
separate(bigram, c("word1", "word2"), sep = " ") %>% # separate the two words
filter(word1 %in% first_word) %>% # find first words from our list
count(word1, word2, wt = n, sort = TRUE, name = "total") # name the count column directly; recent dplyr versions no longer create `nn`
Finally, I would like to look at that on a graph.
first_word <- c("i", "you") # these need to be lowercase
evolved_bigrams %>%
count(bigram, sort = T) %>%
separate(bigram, c("word1", "word2"), sep = " ") %>% # separate the two words
filter(word1 %in% first_word) %>% # find first words from our list
count(word1, word2, wt = n, sort = TRUE, name = "total") %>% # name the count column directly instead of renaming `nn`
mutate(word2 = factor(word2, levels = rev(unique(word2)))) %>% # put the words in order
group_by(word1) %>%
top_n(5) %>%
ggplot(aes(word2, total, fill = word1)) + #
scale_fill_viridis_d() + # set the color palette
geom_col(show.legend = FALSE) +
labs(x = NULL, y = NULL, title = "Word following:") +
facet_wrap(~word1, scales = "free") +
coord_flip() +
theme_minimal()
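For reference, the counting step behind this graph can be checked on a few made-up bigrams. `toy_bigrams` and `toy_following` are hypothetical names; the sketch assumes `dplyr` and `tidyr` are installed and a dplyr version in which `count()` accepts a `name` argument.

```r
library(dplyr)
library(tidyr)

# Hypothetical bigrams, one per row
toy_bigrams <- tibble(
  bigram = c("i want", "i want", "i need", "you want", "get free")
)

first_word <- c("i", "you") # lowercase, to match the tokenized text

toy_following <- toy_bigrams %>%
  count(bigram, sort = TRUE) %>%                           # count each pair
  separate(bigram, c("word1", "word2"), sep = " ") %>%     # split the pair
  filter(word1 %in% first_word) %>%                        # keep pairs starting with "i" or "you"
  count(word1, word2, wt = n, sort = TRUE, name = "total") # weighted count, stored as `total`

toy_following
```

Setting `name = "total"` avoids depending on whether `count()` calls its output column `n` or `nn`, which is what the original `rename()` step tripped over.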
