I recently had the urge to create a playlist of my all-time favorite songs, soon followed by another urge to do something more productive with my time. This led me to the obvious conclusion that I could analyze my playlist with R and kill two birds with one stone. In this post I’ll use data analysis to better understand the characteristics of my favorite songs, drawing primarily on the spotifyr and geniusr packages, which provide access to the Spotify and Genius APIs from R. Looking at related work done by others was helpful for getting started.

The first step of this analysis was getting my playlist into R. To do this, I created the playlist in Spotify and then read it into R with the code below. This also let me collect some interesting information that Spotify generates for each song and artist, which I’ll focus on later.

library(spotifyr)
library(tidyverse)
library(janitor)

# first need to authenticate api access keys
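Sys.setenv(SPOTIFY_CLIENT_ID = "xxx") # placeholder, replace with your own client id
Sys.setenv(SPOTIFY_CLIENT_SECRET = "xxx") # placeholder, replace with your own client secret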

# read in playlist and useful attributes
top_playlist <- get_my_playlists() %>%
  filter(str_detect(description, "Top 100"))

playlist_data <- get_playlist_audio_features(
  username = "natebeans",
  playlist_uris = str_remove(top_playlist$uri, "spotify:playlist:")
) %>%
  clean_names() %>%
  mutate(
    artist = map_chr(track_album_artists, function(x) x$name[1]), # this is borrowed from one of the linked articles
    album_cover = map_chr(track_album_images, function(x) x$url[3]),
    song = track_name,
    album = track_album_name,
    popularity = track_popularity,
    release_date = track_album_release_date,
    length = round(track_duration_ms / 1000), # whole seconds, so the seconds part can never round up to 60
    minutes = length %/% 60,
    seconds = length %% 60,
    length = sprintf("%d:%02d", as.integer(minutes), as.integer(seconds)),
    .keep = "unused"
  ) %>%
  select(song, artist, album, release_date, length, danceability:tempo, popularity, album_cover, track_id) %>%
  mutate(
    song = ifelse(song == "Holiday - Live at Milton Keynes National Bowl, Milton Keynes, England, 6/18/05",
      "Holiday (Live)", song
    ),
    song = ifelse(song == "Ghost Love Score - Live, at Wacken, 2013", "Ghost Love Score (Live)", song),
    song = ifelse(song == "You Think I Ain't Worth A Dollar, But I Feel Like A Millionaire",
      "You Think I Ain't Worth A Dollar", song
    ),
    album = ifelse(album == "Rage Against The Machine - XX (20th Anniversary Special Edition)",
      "Rage Against The Machine", album
    ),
    album = ifelse(album ==
      "Astro Creep: 2000 Songs Of Love, Destruction And Other Synthetic Delusions Of The Electric Head",
    "Astro Creep", album
    )
  )


# read in genre data

top_artists <- playlist_data %>%
  select(artist) %>%
  distinct()

artist_ids <- map_df(top_artists$artist, function(x) {
  get_artist_audio_features(x) %>%
    select(artist_name, artist_id) %>%
    distinct()
})

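# get_artists() accepts at most 50 ids per call, so request them in chunks of 50
unique_ids <- unique(artist_ids$artist_id)
id_chunks <- split(unique_ids, ceiling(seq_along(unique_ids) / 50))

genre <- map_df(id_chunks, function(ids) {
  spotifyr::get_artists(ids) %>%
    select(name, genres) %>%
    as.data.frame()
})

genre_vec <- as.vector(genre$genres)

final_genre <- data.frame()

# flatten the list of genres into one row per artist/genre pair
for (x in seq_along(genre_vec)) {
  obs <- data.frame(genre = as.character(pluck(genre_vec, x)))
  obs <- mutate(obs, row = x) # row records which artist the genre came from
  final_genre <- rbind(final_genre, obs)
}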
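As an aside, the flattening loop above could also be written with tidyr. The sketch below uses a small made-up table (`genre_toy` and its artists are invented for illustration) and assumes the genres arrive as a list column, as they appear to from Spotify.

```r
library(dplyr)
library(tidyr)

# Made-up stand-in for the artist/genre table returned by Spotify
genre_toy <- tibble(
  name = c("Artist A", "Artist B", "Artist C"),
  genres = list(c("alternative rock", "punk"), "alternative rock", character(0))
)

# unnest() gives one row per artist/genre pair; artists with no genres drop out
genre_long <- genre_toy %>%
  unnest(genres) %>%
  rename(genre = genres)

# frequency table of the kind a word cloud needs
count(genre_long, genre, sort = TRUE)
```

Artists without any genre information simply disappear from the long table, which matches what the loop does when it binds an empty data frame.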

The table below shows all of the songs that earned a spot in my playlist, which was just as difficult to whittle down as I expected. It ended up being a good mix of my favorite songs when I was younger and favorites from the last few years. I did decide to exclude purely instrumental songs, because their inclusion would have made the selection process even harder.

# add album image
library(kableExtra)
library(reactable)

playlist_data %>%
  mutate(Rank = paste0("#", 1:nrow(playlist_data))) %>%
  select(Rank, song, artist, album, length) %>%
  reactable(outlined = TRUE, striped = TRUE, highlight = TRUE)
# , columns= list(
# Player= colDef(cell= function(value){
# image <- img(src = sprintf("images/%s.jpg", value), height = "24px", alt = value)
# tagList(
# div(style = list(display = "inline-block", width = "45px"), image),
# value)})))

Now, let’s start learning about these songs. A good place to start the exploration is the genres associated with the artists that made the list. Spotify had genre information for nearly all of the artists on the playlist, and typically associated each artist with several different genres. The wordcloud below captures all of the genres that appeared at least twice. I used the PNWColors package throughout this post to create some soft, earthy color palettes. Overall, my list is dominated by different types of rock, which is not surprising to me. It also features some funny and obscure genres such as stomp and holler and bow pop.

library(wordcloud)
library(PNWColors)

word_pal <- pnw_palette("Sailboat", n = 12)

genres <- final_genre %>%
  group_by(genre) %>%
  summarise(freq = n()) %>%
  mutate(genre = ifelse(genre == "alternative rock", "alt rock", genre))

wordcloud(
  words = genres$genre, freq = genres$freq, min.freq = 2,
  max.words = 150, random.order = FALSE, colors = word_pal
)

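Besides genre, Spotify creates several different measures that capture the attributes of the songs it hosts, which I assume are important for its recommendation algorithms. Spotify defines and explains each of these measures on its site. I chose a sample of these attributes that seemed the most interesting to explore in this post, briefly explained below:

Danceability: how suitable a track is for dancing, from 0 to 1.
Energy: a perceptual measure of intensity and activity, from 0 to 1.
Loudness: the overall loudness of a track in decibels.
Speechiness: the presence of spoken words in a track, from 0 to 1.
Tempo: the estimated tempo in beats per minute.
Valence: the musical positivity conveyed by a track, from 0 to 1.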

There are other attributes that are worth exploring as well, but these will provide some interesting insight for this short post. Interesting as they are, it can be hard to tell what these attributes mean in a vacuum. For instance, what does a speechiness score of 0.2 mean? To give these measures a little context, I took the first 100 tracks from a Spotify playlist of the most played songs of 2020 for comparison. The density plots below compare the distribution of each of these measures between my top 100 playlist and the 100 most popular songs from this year.

library(patchwork)
library(extrafont)
# first, format the data

figure_data_me <- playlist_data %>%
  select(song, danceability:energy, loudness, speechiness:tempo) %>%
  pivot_longer(-song, names_to = "attribute", values_to = "score") %>%
  mutate(type = "My Top Songs")

# get comparison playlist
baseline <- get_playlist("64QxysD2w5x5EMLgcoT7fa")

base_features <- get_playlist_audio_features(
  username = "redmusiccompany",
  playlist_uris = str_remove(baseline$uri, "spotify:playlist:")
) %>%
  slice(1:100) %>%
  clean_names() %>%
  mutate(
    song = track_name,
    type = "2020 Top Songs"
  ) %>%
  select(song, type, danceability:tempo) %>%
  pivot_longer(-c(song, type), names_to = "attribute", values_to = "score")

figure_data <- rbind(figure_data_me, base_features)

# make vector of attributes

attributes <- figure_data %>%
  select(attribute) %>%
  distinct()

attributes <- as.vector(attributes$attribute)

# create plot function

# two colors picked by hand from the PNWColors "Sailboat" palette
att_pal <- c("#015b58", "#ba7999")

compare_attributes <- function(att) {
  figure_data %>%
    filter(attribute == att) %>%
    ggplot(aes(x = score, fill = type)) +
    geom_density(alpha = 0.8) +
    scale_y_continuous(expand = expansion(mult = c(0, .1))) +
    scale_fill_manual(values = att_pal) +
    theme_classic() +
    labs(
      y = "",
      x = "",
      fill = "",
      title = str_to_title(att)
    ) +
    theme(
      text = element_text(family = "Roboto Condensed"),
      legend.text = element_text(size = 12),
      legend.position = "bottom"
    )
}

for (att in attributes) {
  plot <- compare_attributes(att)
  assign(paste0("plot_", att), plot)
}

plot_danceability + plot_energy + plot_loudness + plot_speechiness + plot_tempo + plot_valence + plot_layout(ncol = 2, guides = "collect") & theme(legend.position = "bottom")

I felt tremendous vindication seeing these charts for the first time. As they clearly show, my poor dancing is actually a result of listening to less danceable music than average. A huge relief. Beyond danceability, the combination of these measures paints a good picture of the type of music I like. Another very clear difference is that my music is higher energy and faster. My music is also a bit louder and less positive than the 2020 playlist, although the figure alone isn’t enough to tell how significant the difference is.
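One way to put a number on such a difference would be a two-sample test. The sketch below is only an illustration: the scores are simulated with `rnorm()` (the means and spreads are invented, not taken from either playlist), and the variable names are hypothetical.

```r
set.seed(42)

# Simulated scores, NOT the real playlist data
energy_mine <- rnorm(100, mean = 0.75, sd = 0.15)
energy_2020 <- rnorm(100, mean = 0.62, sd = 0.15)

# Welch's t-test compares the two means without assuming equal variances
welch <- t.test(energy_mine, energy_2020)
welch$p.value

# The Wilcoxon rank-sum test is a rank-based alternative for skewed scores
wilcox <- wilcox.test(energy_mine, energy_2020)
wilcox$p.value
```

With the real data, the two vectors would be the `score` values for a single attribute, split by `type`.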

So far, we’ve found out my list is generally made up of fast-paced rock songs. Not a big surprise. Most of these attributes relate to a song’s overall sound and instrumentation, but another important aspect of a song is the content of its lyrics. To learn more about the lyrics of my playlist I used the geniusr package, an R interface to the Genius API, which draws on a huge repository of song lyrics. Once the lyrics were downloaded, I used tidytext and some other basic R packages to analyze the text. A few of the electronic songs on the playlist are entirely instrumental, so these are obviously not included in this analysis.

First, I created a dataframe by requesting the lyrics for each song in the playlist I’ve already loaded in. Some titles were missing or spelled differently than in Spotify, so I needed to manually change the song information for a handful of songs. Genius also didn’t have information for a handful of songs, some because they’re entirely instrumental, so those were excluded.

library(geniusr)

Sys.setenv(GENIUS_API_TOKEN = "xxx") # placeholder; geniusr reads its token from GENIUS_API_TOKEN

song_artist <- playlist_data %>%
  select(song, artist) %>%
  mutate(
    # rename songs and artists whose titles differ between Spotify and Genius
    song = ifelse(song == "Tegernakô", "Tegernako", song),
    song = ifelse(song == "Blodtørst", "blodtrst", song),
    song = ifelse(song == "Legend on Horseback", "Legend on the Horseback", song),
    song = ifelse(song == "Many of Horror", "Many of Horror (When We Collide)", song),
    song = ifelse(song == "Old Thing Back (feat. Ja Rule and Ralph Tresvant)",
      "Old Thing Back", song
    ),
    song = ifelse(song == "Outsider - Apocalypse Remix", "The Outsider", song),
    song = ifelse(song == "A-Punk", "A Punk", song),
    artist = ifelse(song == "Blow Me Away", "Breaking Benjamin", artist),
    artist = ifelse(song == "Medicine", "Daughter", artist),
    artist = ifelse(artist == "blink-182", "blink 182", artist),
    song = ifelse(song == "You Think I Ain't Worth A Dollar",
      "You Think I Ain't Worth A Dollar, But I Feel Like A Millionaire", song
    )
  ) %>%
  # drop songs that Genius has no lyrics for
  filter(!song %in% c(
    "Chant of the Cavalry", "Galloping to the Great Land", "Lucid", "Capture Me",
    "Remi - Essáy Remix", "Regulus", "Hatef--k"
  ))

song_lyrics <- data.frame()

for (x in seq_len(nrow(song_artist))) {
  tryCatch(
    {
      # index the columns by name so plain strings, not 1x1 tibbles, are passed
      lyrics <- get_lyrics_search(
        artist_name = song_artist$artist[x],
        song_title = song_artist$song[x]
      )
      song_lyrics <- rbind(song_lyrics, lyrics)
    },
    # report the failing song and keep going
    error = function(e) {
      print(paste("ERROR:", song_artist$song[x], conditionMessage(e)))
    }
  )
}
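The same skip-on-error pattern can be sketched with purrr instead of `tryCatch()`. In the example below, `fetch_lyrics_stub()` is a hypothetical stand-in for the real lyrics request (it makes no API call and fails for one title on purpose), so the example runs offline.

```r
library(purrr)
library(dplyr)

# Hypothetical stand-in for a lyrics request; it fails for one title on purpose
fetch_lyrics_stub <- function(song) {
  if (song == "Missing Song") stop("no lyrics found")
  data.frame(song_name = song, line = paste("first line of", song))
}

# possibly() returns NULL instead of raising, so one failure doesn't stop the run
safe_fetch <- possibly(fetch_lyrics_stub, otherwise = NULL)

songs <- c("Song A", "Missing Song", "Song B")
results <- map(songs, safe_fetch)

# bind_rows() skips the NULL entries when stacking the data frames
toy_lyrics <- bind_rows(results)

# keep track of which songs failed so they can be fixed by hand
failed <- songs[map_lgl(results, is.null)]
```

Collecting the failed titles in a vector makes it easier to see which songs need their names adjusted before retrying.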

Once the lyrics have been read in, we can do some sentiment analysis. This technique associates words with different emotions or other states, like positivity, anger, or sadness, and thus gives us an additional way to look at the attributes of my playlist. I’ll be looking at the sentiments of individual words here, which is of course an imperfect method because it ignores word combinations and the variety of usage, but it is still a reasonable approximation. Before using the data I removed a variety of common words like “and” or “not” to reduce the words to the most meaningful ones. One difficulty is that many songs repeat the same word quite a few times, which can skew the results based on just a single song. One approach would be to limit each word to only once per song, but I chose a limit of 5 per song as a middle ground. The figure below shows the most common words from my playlist.
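To make the per-song cap concrete, here is a minimal sketch on made-up tokens (the songs and words in `toy_words` are invented). It assumes the lyrics have already been split into one word per row.

```r
library(dplyr)

# Made-up tokens: "fire" appears seven times in Song A
toy_words <- tibble(
  song_name = c(rep("Song A", 8), rep("Song B", 2)),
  word = c(rep("fire", 7), "rain", "fire", "rain")
)

capped <- toy_words %>%
  group_by(song_name, word) %>%
  filter(row_number() <= 5) %>% # keep at most 5 occurrences of a word per song
  ungroup()

# "fire" now counts 5 times for Song A plus once for Song B
count(capped, word, sort = TRUE)
```

Without the cap, Song A alone would contribute seven of the eight "fire" tokens; with it, a single repetitive chorus has less pull on the overall counts.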

library(tidytext)

tidy_lyrics <- song_lyrics %>%
  group_by(song_name) %>%
  mutate(song_line = row_number()) %>%
  ungroup() %>%
  unnest_tokens(word, line) %>%
  anti_join(stop_words, by = "word") %>%
  group_by(song_name, word) %>%
  mutate(word_repeat = row_number()) %>%
  filter(word_repeat <= 5) %>%
  ungroup()

tidy_lyrics %>%
  count(word, sort = TRUE) %>%
  slice_max(n, n = 15, with_ties = FALSE) %>%
  ggplot(aes(x = n, y = reorder(word, n))) +
  geom_col(alpha = 0.8, fill = "#625a94") +
  scale_x_continuous(breaks = c(0, 20, 40, 60, 80), limits = c(0, 80), expand = c(0, 0)) +
  # scale_y_continuous(expand = c(0, 0))+
  theme_minimal() +
  # scale_color_manual(labels= c("Articles", "States"), values = c("#482677ff", "#2d708eff"))+
  labs(
    title = "Most Common Words",
    x = "Count",
    y = ""
  ) +
  theme(
    text = element_text(family = "Roboto Condensed"),
    panel.grid.minor = element_blank(),
    panel.grid.major.y = element_blank(),
    panel.grid.major = element_line(colour = "light grey"),
    legend.text = element_text(size = 14),
    axis.title.y = element_text(size = 13),
    axis.text.y = element_text(size = 14, color = "black"),
    axis.text.x = element_text(size = 12, color = "black"),
    plot.title = element_text(size = 14, hjust = 0.5),
    legend.position = "bottom"
  )

However, the figure doesn’t get into the sentiment of these words. There are several different dictionaries that can be used to map sentiments to words, and I’ll use a few of them to learn a little more. The first, the Bing sentiment lexicon, makes a binary assignment of either positive or negative to each word. This process misses words not in the lexicon, and as a result it reduces the words from 7,440 to just 1,444. This is a big loss of potential meaning, and it would make me hesitant to draw firm conclusions, but it still gives some interesting results for this post. Of the remaining words, just 30% are positive versus 70% negative. Ouch.

library(textdata)

bing_lyrics <- tidy_lyrics %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  group_by(sentiment) %>%
  summarise(count = n())
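For readers who want to see how shares like these can be derived from the counts, here is a small sketch. The lexicon and tokens are made up for illustration (they are not the Bing lexicon or my lyrics), so the resulting numbers mean nothing beyond showing the mechanics.

```r
library(dplyr)

# Made-up mini lexicon and tokens, for illustration only
toy_lexicon <- tibble(
  word = c("love", "happy", "pain", "lost", "dead"),
  sentiment = c("positive", "positive", "negative", "negative", "negative")
)
toy_tokens <- tibble(
  word = c("love", "pain", "lost", "road", "dead", "happy", "pain", "sky")
)

toy_share <- toy_tokens %>%
  inner_join(toy_lexicon, by = "word") %>% # drops "road" and "sky", which have no sentiment
  count(sentiment, name = "count") %>%
  mutate(share = count / sum(count))

# fraction of tokens that survived the join
coverage <- sum(toy_share$count) / nrow(toy_tokens)
```

The `coverage` value is the same kind of figure as the drop from 7,440 to 1,444 words described above: it shows how much of the text the lexicon actually scores.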

One aspect missed by the initial approach is that not all words that express a given sentiment do so equally. Two words can be negative, but one can be much more so than the other. The “afinn” lexicon avoids this obstacle by scoring the positivity of a word from -5 to 5. Using this lexicon results in an almost identical loss of words. I used this dictionary to get a better sense of the most negative and positive songs. The table below shows the 5 songs that were most positive and the 5 that were most negative, based on the average sentiment of their lyrics.

tidy_lyrics %>%
  inner_join(get_sentiments("afinn"), by = "word") %>%
  group_by(song_name) %>%
  summarize(sentiment = mean(value, na.rm = TRUE)) %>% # median is a bit dif
  arrange(desc(sentiment)) %>%
  slice(c(1:5, (n() - 4):n())) %>% # five most positive and five most negative, without hard-coding the row count
  inner_join(tidy_lyrics, by = "song_name") %>%
  select(song_name, artist_name, sentiment) %>%
  distinct() %>%
  reactable()
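The averaging step can be illustrated on a toy example. The scores below are invented values on the same -5 to 5 scale, not entries from the AFINN lexicon, and `slice_max()` / `slice_min()` are used here as one possible way to pick out the extremes.

```r
library(dplyr)

# Made-up word scores on an AFINN-style -5..5 scale, not the real lexicon
toy_scores <- tibble(
  song_name = c("Song A", "Song A", "Song B", "Song B", "Song C", "Song C"),
  value = c(3, 1, -4, -2, 2, -2)
)

# average sentiment per song, plus how many scored words it rests on
song_sentiment <- toy_scores %>%
  group_by(song_name) %>%
  summarise(sentiment = mean(value), n_words = n())

# slice_max()/slice_min() pick the extremes without relying on row positions
most_positive <- slice_max(song_sentiment, sentiment, n = 1, with_ties = FALSE)
most_negative <- slice_min(song_sentiment, sentiment, n = 1, with_ties = FALSE)
```

Keeping `n_words` alongside the mean is a cheap sanity check, since an average built on only a handful of scored words is easy to over-read.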

The table offers some interesting insight. Some of the songs are obviously good fits. Welcome Home, Son; Killing in the Name; and Havoc all stand out as negative songs. Soul Meets Body is definitely a positive one, although I’m less sure about the others. While they may have some more positive lyrics, most of those songs are quite aggressive and it seems odd to place them with Soul Meets Body. Even the negative songs suffer from that lack of specificity. Welcome Home, Son is more of an acoustic, sorrowful song while Killing in the Name is angry.

To make a final distinction, I wanted to combine these sentiment scores with the earlier data about the songs’ musical attributes. Hopefully that will create a better picture of the affect of each song. We looked at several relevant attributes earlier, but energy seems like the most useful. This measure ends up being very similar to the measure for anger in one of the articles linked earlier. The interactive chart below shows how the songs score on these measures.

library(plotly)
# a bunch of songs had name changes or were dropped. fix the name changes
figure_data <- tidy_lyrics %>%
  inner_join(get_sentiments("afinn"), by = "word") %>%
  group_by(song_name) %>%
  summarize(sentiment = mean(value)) %>%
  inner_join(figure_data_me, by = c("song_name" = "song")) %>%
  filter(attribute == "energy") %>%
  mutate(my_var = score * sentiment)


ggplotly(
  ggplot(figure_data, aes(x = sentiment, y = score, text = paste0(
    song_name, "<br>",
    "Energy:", score, "<br>",
    "Positivity:", sentiment
  ))) +
    geom_point() +
    theme_minimal(),
  tooltip = "text"
) %>%
  style(hoverlabel = list(bgcolor = "white"), hoveron = "fill") %>%
  config(displayModeBar = FALSE)