Tidytext NLP using Taylor Swift Lyrics

Loading data

I started this project with this Taylor Swift lyrics dataset up on Kaggle. It was a little out of date, and my girlfriend is a Swiftie, so I had to spend several hours adding a few albums' worth of song lyrics.

```r
library(tidyverse)
library(readr)
library(tidytext)
library(wordcloud)
data(stop_words)

github_proj <- "https://raw.githubusercontent.com/tonythor/cuny-datascience/develop/"
song_lyrics <- read_csv(paste0(github_proj, "data/taylor_swift_lyrics.csv")) %>%
  unnest_tokens(word, lyric)
```
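If you want to update the file the same way, the shape that matters is the three columns the rest of the code leans on: album, track_title, and lyric. A minimal sketch of appending a new album before re-uploading, assuming a hypothetical new_album.csv with those same columns:

```r
# Hypothetical: new_album.csv is a file you prepared with the same
# album / track_title / lyric columns as the source dataset.
new_album <- read_csv("new_album.csv")

updated <- read_csv(paste0(github_proj, "data/taylor_swift_lyrics.csv")) %>%
  bind_rows(new_album)          # stack the new rows under the old ones

write_csv(updated, "taylor_swift_lyrics.csv")
```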
A lyrics word cloud
```r
wc_count <- 20

wc_df <- song_lyrics %>%
  select(word) %>%
  anti_join(stop_words, by = "word") %>%
  count(word) %>%
  filter(n >= wc_count) %>%
  arrange(desc(n))

# View(wc_df) <- and wow!

wc_df %>% with(wordcloud(words = word, freq = n, min.freq = 1,
                         random.order = FALSE, rot.per = 0.35,
                         colors = brewer.pal(8, "Dark2")))
```
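One practical note: wordcloud() places words with some randomness, so the layout changes from run to run. If you want to reproduce the exact cloud you posted, pin the seed first; a minimal sketch (the seed value is arbitrary):

```r
set.seed(1989)  # arbitrary; any fixed seed makes the layout reproducible
wc_df %>% with(wordcloud(words = word, freq = n, min.freq = 1,
                         random.order = FALSE, rot.per = 0.35,
                         colors = brewer.pal(8, "Dark2")))
```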
This word cloud looked so awesome that I immediately posted it on Facebook and asked people if they recognized it. In order, the first three responses were:
- Pop music
- Smokey Robinson & The Miracles
- The queen
Isn’t #2 a fascinating answer?
Sentiment by album
```r
nrc <- get_sentiments("nrc")

song_lyrics %>%
  inner_join(nrc, by = "word", relationship = "many-to-many") %>%
  count(album, sentiment) %>%
  group_by(album) %>%
  mutate(total = sum(n)) %>%
  ungroup() %>%
  mutate(percentage = n / total) %>%
  ggplot(aes(x = album, y = percentage, fill = sentiment)) +
  geom_bar(stat = "identity", position = "fill") +
  scale_y_continuous(labels = scales::percent) +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
  labs(title = "Sentiment Distribution by Album",
       x = "Album",
       y = "Percentage",
       fill = "Sentiment")
```
As the code says, I took the NRC sentiment lexicon [1], inner-joined it with the lyric words, counted sentiment-tagged words per album, and normalized the counts to per-album percentages. Given how roughly equal the distributions are across albums, the question this leaves me with is: is this about the right mix of emotions for a blockbuster pop album?
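"Roughly equal" can be made concrete by reusing the same join and summarizing how much each sentiment's share varies from album to album. A small sketch under the same assumptions as the plot above (the column names share, mean_share, and sd_share are mine):

```r
# How stable is each sentiment's share across albums?
song_lyrics %>%
  inner_join(nrc, by = "word", relationship = "many-to-many") %>%
  count(album, sentiment) %>%
  group_by(album) %>%
  mutate(share = n / sum(n)) %>%   # per-album share of each sentiment
  group_by(sentiment) %>%
  summarise(mean_share = mean(share),
            sd_share   = sd(share)) %>%
  arrange(desc(mean_share))
```

A small sd_share relative to mean_share for every sentiment would back up the "roughly equal" reading.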
The most positive and negative songs
```r
bing <- get_sentiments("bing")
top_n_count <- 10

song_lyrics %>%
  inner_join(bing, by = "word", relationship = "many-to-many") %>%
  filter(sentiment %in% c("positive", "negative")) %>%
  count(track_title, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(net_sentiment = positive - negative) %>%
  arrange(desc(net_sentiment)) %>%
  slice(c(1:top_n_count, (n() - top_n_count + 1):n())) %>%
  ungroup() %>%
  ggplot(aes(x = reorder(track_title, net_sentiment), y = net_sentiment,
             fill = ifelse(net_sentiment >= 0, "Positive", "Negative"))) +
  geom_bar(stat = "identity") +
  scale_fill_manual(values = c("Positive" = "gold", "Negative" = "steelblue")) +
  coord_flip() +
  labs(title = "Most Positive and Negative Songs",
       x = "Song Title",
       y = "Net Positive Sentiment Count",
       fill = "Sentiment") +
  theme_minimal()
```
Though great-looking, this is not the most advanced analysis. As the code suggests, I joined the lyric words with the Bing sentiment lexicon [2], counted positive and negative words per song, and kept the ten highest and ten lowest net-sentiment songs. This does not take into account satire, double negatives, etc.
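Double negatives in particular are approachable with the same tidytext machinery: tokenize the raw lyrics as bigrams instead of single words and flag sentiment words preceded by "not". A rough sketch under that assumption (the object name negated_words is mine, not from the analysis above; separate() comes from tidyr, already attached via tidyverse):

```r
# Re-tokenize the raw lyrics as bigrams so "not good" survives as a pair.
negated_words <- read_csv(paste0(github_proj, "data/taylor_swift_lyrics.csv")) %>%
  unnest_tokens(bigram, lyric, token = "ngrams", n = 2) %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(word1 == "not") %>%                      # keep only negated pairs
  inner_join(bing, by = c("word2" = "word")) %>%  # score the second word
  count(word2, sentiment, sort = TRUE)
```

Each row then counts how often a Bing-scored word shows up negated, which is a cheap way to gauge how much the naive per-word tally is being misled.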