“Number Crunching” Vampire Weekend: Music, Lyrics and Setlists

Introduction to the Author

I’m a PhD candidate in Communication who loves live music. I'm still learning R, and this is meant to be more of a walkthrough than a tutorial, so… please be gentle and send helpful comments if you so desire :)

Introduction to the Band

Vampire Weekend is a Grammy Award-winning band currently composed of Ezra Koenig, Chris Tomson, and Chris Baio. They have released four albums: Vampire Weekend, Contra, Modern Vampires of the City, and most recently, Father of the Bride. Father of the Bride was released in May 2019, and both the album itself and the tour promoting it have received critical acclaim. The album and tour have also received my own personal acclaim (“Stranger” was my Spotify song of the year in 2019 and I was lucky enough to see them perform in Philly, NYC, San Francisco and Portland last year!).

To explore the music of Vampire Weekend, we’ll first look at some of the instrumental qualities of the songs (provided via spotifyr). We’ll also examine the lyrical composition of those songs using Genius. Finally, we’ll take a deeper dive into the setlists from the FOTB tour (excluding the most recent Australia shows, since my data collection began mid-December 2019). Let’s start!

Data Collection and Preparation

[Note: if you’re not interested in the nitty-gritty pre-processing of the data, feel free to skip far ahead to the “Music” section!]

To retrieve setlist data that I could then look up instrumental and lyrical information for, I used repertorio, a setlist.fm API wrapper for Python. Although there are some wrappers for R, I found repertorio the easiest to implement (even with my limited Python skills). The code retrieves the first 5 pages of setlist information from Vampire Weekend’s setlist.fm profile. Since the resulting data is deeply nested, the json_normalize and flat_table.normalize functions are used to access the data in a dataframe format instead of the JSON list. After formatting the 5 pages of data, we concatenate the dataframes vertically and save them as a CSV. You can access the Python code here.

Even with the formatting changes, the raw file was a bit unwieldy. Setlist data pulled from setlist.fm seems to keep every iteration of a setlist (so if users make changes, each version appears as a different setlist).

I manually removed the repetitions from the dataset, but I would love any feedback on how to do so efficiently in R.
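One possible way to automate this, sketched on toy data: if every stored version of a setlist carries an update timestamp, keep only the rows from the latest version of each show. The column names below (`event_id`, `last_updated`, `song`) are hypothetical, and whether the flattened export actually exposes such a timestamp is an assumption I have not verified.

```r
# Toy data: two stored versions of show "a", one version of show "b".
# Column names (event_id, last_updated, song) are hypothetical.
setlists <- data.frame(
  event_id     = c("a", "a", "a", "a", "b", "b"),
  last_updated = as.Date(c("2019-10-01", "2019-10-01",
                           "2019-10-03", "2019-10-03",
                           "2019-10-05", "2019-10-05")),
  song         = c("Harmony Hall", "A-Punk",
                   "Harmony Hall", "Cousins",
                   "Bambina", "Step"),
  stringsAsFactors = FALSE
)

# For every row, look up the most recent update date of its show.
latest <- ave(as.numeric(setlists$last_updated), setlists$event_id, FUN = max)

# Keep only the rows that belong to the most recent version.
deduped <- setlists[as.numeric(setlists$last_updated) == latest, ]
print(deduped)
```

In tidyverse terms this would be roughly `group_by(event_id) %>% filter(last_updated == max(last_updated))`.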

I also added rows for songs that were only mentioned in the setlist notes so that the dataset was as accurate as possible. Once the dataset was a bit cleaner, I imported the file into R using tidyverse’s read_csv function and loaded all the necessary packages. This project was in part an exercise to familiarize myself with the tidyverse, so you’ll see its functions pop up regularly :)

library(tidyverse)
library(spotifyr)
library(genius)
library(tidytext)
library(textdata)
library(scales)
library(ggridges)
library(ggradar)
library(wordcloud)
library(igraph)
library(ggraph)
library(reshape2)
library(proxy)
FOTB_tour = read_csv('/Users/Hannah/Downloads/FOTB_tour.csv')

Next, I assigned a unique value to each show (starting with 1 for the most recent show at the time - in Portland, OR), and then a unique value to each song within that show (starting with 1 for the first song played at the Portland, OR show - “Harmony Hall”). The addition of these variables allowed me to conduct future analyses on setlist structure within each show.

FOTB_tour$show_num <- match(FOTB_tour$id, unique(FOTB_tour$id))

FOTB_tour = FOTB_tour %>% 
  group_by(show_num) %>% 
  mutate(song_num = row_number())
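To make the numbering concrete, here is a small base R sketch on a made-up two-show table (the ids and song order are invented for illustration). It shows why `match()` against `unique()` numbers shows in order of first appearance.

```r
# Toy setlist: rows are in playing order, most recent show first.
toy <- data.frame(
  id   = c("x9", "x9", "x9", "k2", "k2"),
  song = c("Harmony Hall", "Unbelievers", "A-Punk", "Sunflower", "Step"),
  stringsAsFactors = FALSE
)

# match() returns the position of each id in the vector of unique ids,
# so the first id encountered becomes show 1, the next show 2, and so on.
toy$show_num <- match(toy$id, unique(toy$id))

# Number the songs within each show (a base R counterpart of
# group_by(show_num) %>% mutate(song_num = row_number())).
toy$song_num <- ave(seq_along(toy$id), toy$show_num, FUN = seq_along)
print(toy)
```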

I then used the spotifyr package to gather data from the Spotify API. Many thanks to RCharlie for making this publicly available, as it has encouraged many projects like this one! First I gathered song information from Vampire Weekend’s albums and singles, removing duplicates.

#Sys.setenv(SPOTIFY_CLIENT_ID = 'spotify-client-id')
#Sys.setenv(SPOTIFY_CLIENT_SECRET = 'spotify-client-secret')
vw_spotify = get_artist_audio_features('vampire weekend', include_groups = c("album","single"))
vw_spotify = vw_spotify %>% distinct(track_name, .keep_all= TRUE)

I then edited some of the song names so that they matched the names in the setlist.fm data. This is a process that I had to perform regular checks on, since even a lowercase/uppercase change made a difference. I suppose I should have guessed that this would be the case, but it was still a bit annoying.

vw_spotify = vw_spotify %>% 
  mutate(track_name = recode(track_name, 
                             'We Belong Together (feat. Danielle Haim)' = 'We Belong Together', 
                             'Hold You Now (feat. Danielle Haim)' = 'Hold You Now', 
                             'Married in a Gold Rush (feat. Danielle Haim)' = 'Married in a Gold Rush',
                             'Sunflower (feat. Steve Lacy)' = 'Sunflower',
                             'Flower Moon (feat. Steve Lacy)' = 'Flower Moon',
                             "One (Blake's Got A New Face)" = "One (Blake's Got a New Face)",
                             "Diplomat’s Son" = "Diplomat's Son",
                             "The Kids Don't Stand A Chance" = "The Kids Don't Stand a Chance",
                             'Giving Up The Gun' = 'Giving Up the Gun',
                             'Boston (Ladies of Cambridge)' = 'Boston (Ladies of Cambridge)'))
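Since even a capitalization change breaks a match, one alternative to recoding titles one by one is to compare a normalized key instead. This is only a sketch: `normalize_title` is a hypothetical helper, and the three rules it applies (straighten apostrophes, drop "(feat. ...)" suffixes, lowercase) are my guess at the most common mismatches, not an exhaustive list.

```r
# Hypothetical helper: build a normalized key for matching song titles.
normalize_title <- function(x) {
  x <- gsub("\u2019", "'", x, fixed = TRUE)      # curly apostrophe -> straight
  x <- gsub("\\s*\\(feat\\. [^)]*\\)", "", x)    # drop "(feat. ...)" suffixes
  x <- tolower(x)                                # ignore capitalization
  trimws(x)
}

spotify_titles <- c("Giving Up The Gun",
                    "Sunflower (feat. Steve Lacy)",
                    "Diplomat\u2019s Son")
setlist_titles <- c("Giving Up the Gun", "Sunflower", "Diplomat's Son")

# Compare the normalized keys rather than the raw titles.
keys_match <- normalize_title(spotify_titles) == normalize_title(setlist_titles)
print(keys_match)
```

Joining on such a key (while keeping the original title column for display) would avoid having to re-check the recode list every time a new song shows up.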

Because Vampire Weekend has a prolific covers catalog, I also needed to retrieve Spotify data for these songs as well. As such, I retrieved all song information from the following artists, combined the dataframes vertically and removed any duplicates.

ezra = get_artist_audio_features('ezra koenig', include_groups = 'appears_on')

bruce = get_artist_audio_features('bruce springsteen') %>% filter(track_name == "I'm Goin' Down")

bob_d = get_artist_audio_features('bob dylan')

paul_simon = get_artist_audio_features('paul simon')

dusty = get_artist_audio_features('dusty springfield')

sublime = get_artist_audio_features('sublime')

the_doors = get_artist_audio_features('the doors')

seinfeld = get_artist_audio_features('TV Sounds Unlimited')

crowded = get_artist_audio_features('crowded house')

fleetwood = get_artist_audio_features('fleetwood mac')

thin = get_artist_audio_features('thin lizzy')

steve = get_artist_audio_features('steve lacy')

baio = get_artist_audio_features('baio')

neil = get_artist_audio_features('neil young') %>% filter(!track_name == "How Long?")

labi = get_artist_audio_features('Labi Siffre')

branigan = get_artist_audio_features('Laura Branigan')

VU = get_artist_audio_features('the velvet underground')

toots = get_artist_audio_features('Toots & The Maytals')

covers = do.call("rbind", list(bruce, ezra, bob_d, paul_simon, dusty, sublime, the_doors,
                               crowded, fleetwood, thin, steve, baio, neil, labi, branigan, VU,
                               toots, seinfeld))

covers = covers %>% distinct(track_name, .keep_all= TRUE)

I then combined the Spotify information from the Vampire Weekend and covers dataframes before I fixed some of the aforementioned name-matching issues.

total = do.call("rbind", list(vw_spotify, covers))

total = total %>% mutate(track_name = recode(track_name, 
                                             'NEW DORP. NEW YORK' = 'New Dorp. New York',
                                             'Theme from "Seinfeld"' = 'Theme from Seinfeld',
                                             'Everywhere - 2018 Remaster' = 'Everywhere',
                                             'The Boys Are Back In Town' = 'The Boys Are Back in Town',
                                             'I Got The' = 'I Got The...'))

When I tried to join the setlist.fm dataframe and the newly created Spotify dataframe, I ran into some technical difficulties. Although the joining function worked correctly and I removed unnecessary columns successfully, I noticed that there were some songs that did not match properly (even with name checks) or songs with “unretrievable data”. As such, I had to do a few workarounds and a bit more manual work in the csv file (with audio information) itself - copying and pasting the song info from the “extra” songs to the correct song in the main dataframe. Fortunately, there weren’t too many discrepancies, so the matching process didn’t take too long.

FOTB_w_audio = left_join(FOTB_tour, total, by = c("sets.set.song.name" = "track_name")) %>%
  select(-c('album_images', 'available_markets', 'artists'))

write_csv(FOTB_w_audio, '/Users/Hannah/Downloads/FOTB_w_audio_NC.csv')

smooth = get_track_audio_features('0n2SEXB2qoRQg171q7XqeW') %>%
  mutate(track_name = 'Smooth')

saturday = get_track_audio_features('4OJFkrRQqol4FsPesF8eu4') %>% 
  mutate(track_name = 'Saturday in the Park')

solsbury = get_track_audio_features('1CM1wOqD2AIjt2MWd31LV2') %>% 
  mutate(track_name = "Solsbury hill")

martha = get_track_audio_features('1swmf4hFMJYRNA8Rq9PVaW') %>% 
  mutate(track_name = "Martha My Dear")

mountain_brews = get_track_audio_features('2m6DKnju21P9R9WrghyH0h') %>% 
  mutate(track_name = 'Mountain Brews')

here_comes = get_track_audio_features('6dGnYIeXmHdcikdzNNDMm2') %>% 
  mutate(track_name = 'Here Comes the Sun')

diplomat = get_track_audio_features('0ybF6SJTJw1RCpsF0QagFw') %>% 
  mutate(track_name = "Diplomat's Son")

dark_red = get_track_audio_features('37y7iDayfwm3WXn5BiAoRk') %>% 
  mutate(track_name = 'Dark Red')

streets = get_track_audio_features('3fbnbn6A5O5RNb08tlUEgd') %>% 
  mutate(track_name = 'Streets of Philadelphia')

waka = get_track_audio_features('2D0fFvKKxLfXDfdAuwsnWn') %>% 
  mutate(track_name = 'Waka Waka')

extras = do.call("rbind", list(smooth, saturday, solsbury, martha, mountain_brews, here_comes, diplomat, dark_red, streets, waka))
write_csv(extras, '/Users/Hannah/Downloads/extras_NC.csv')
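For spotting which setlist songs came through a join like this without audio data, a quick check is to list the titles whose audio columns are still NA. The sketch below uses toy tables with made-up values and base R's `merge()` in place of `left_join()`.

```r
# Toy stand-ins for the setlist and Spotify tables (values are made up).
setlist <- data.frame(song = c("Harmony Hall", "Smooth", "Sunflower", "Smooth"),
                      stringsAsFactors = FALSE)
spotify <- data.frame(track_name = c("Harmony Hall", "Sunflower"),
                      energy     = c(0.7, 0.5),
                      stringsAsFactors = FALSE)

# Left join: every setlist row is kept; audio columns are NA without a match.
joined <- merge(setlist, spotify, by.x = "song", by.y = "track_name",
                all.x = TRUE)

# Titles that still need manual attention.
unmatched <- unique(joined$song[is.na(joined$energy)])
print(unmatched)
```

With dplyr, `anti_join(FOTB_tour, total, by = c("sets.set.song.name" = "track_name"))` should return the same unmatched rows directly.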

I read the fixed file back into R. Then I removed some of the unnecessary columns and renamed the variable for song titles. I also added a variable that captures how far into a show a given song occurs (e.g. the last song in a set has a value of 1, because the show is 100% complete after that song).

FOTB_fixed = read_csv('/Users/Hannah/Downloads/FOTB_fixed.csv')

FOTB_fixed = FOTB_fixed %>% 
  select(-c('external_urls.spotify', 'track_uri', 'track_preview_url', 'is_local',
            'track_href','explicit', 'disc_number','analysis_url',
            'album_release_date_precision'))

FOTB_fixed = FOTB_fixed %>% rename('track_name' = 'sets.set.song.name')

FOTB_fixed = FOTB_fixed %>% 
  group_by(show_num) %>% 
  mutate(perc_complete = song_num/max(song_num))
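As a quick illustration of the completion variable, here is the same calculation in base R on two made-up shows of different lengths.

```r
# Toy data: a three-song show and a four-song show.
shows <- data.frame(show_num = c(1, 1, 1, 2, 2, 2, 2),
                    song_num = c(1, 2, 3, 1, 2, 3, 4))

# Divide each song's position by the length of its own show.
show_length <- ave(shows$song_num, shows$show_num, FUN = max)
shows$perc_complete <- shows$song_num / show_length
print(shows)
```

The closing song of each show gets 1 regardless of how long the show was, which is what makes positions comparable across shows.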

At this point you might be wondering, holy sh*t, are we done? Not yet, my friends! There are still lyrics to be examined! spotifyr is a great package for this if you’re only looking at one artist’s full discography (you can get the music and lyric info with one function - get_discography!). However, because of the discrepancy issues I was having earlier, I decided to go right to the source - the genius package! Using this package, I retrieved all of the lyrics for the songs in the dataset.

fotb_lyrics = genius_album(artist = "Vampire Weekend", album = 'Father of the Bride')

mvotc_lyrics = genius_album(artist = 'Vampire Weekend', album = 'Modern Vampires of the City')

contra_lyrics = genius_album(artist = 'Vampire Weekend', album = 'Contra')

vw_lyrics = genius_album(artist = 'Vampire Weekend', album = 'Vampire Weekend')

extras_songs <- tribble(
  ~artist, ~track,
  "SBTRKT", "NEW DORP. NEW YORK",
  "Mountain Brews", "Mountain Brews",
  "Bob Dylan", "Jokerman",
  "Paul Simon", "Late in the Evening",
  "Dusty Springfield", "Son of a Preacher Man",
  "Sublime", "Santeria",
  "The Doors", "Peace Frog",
  "The Beatles", "Here Comes the Sun",
  "The Beatles", "Martha My Dear",
  "Santana", "Smooth",
  "Shakira", "Waka Waka (This Time for Africa)",
  "Peter Gabriel", "Solsbury Hill",
  "Chicago", "Saturday in the Park",
  "Crowded House", "Don't Dream It's Over",
  "Fleetwood Mac", "Everywhere",
  "Thin Lizzy", "The Boys are Back in Town",
  "Steve Lacy", "Dark Red",
  "Bruce Springsteen", "Streets of Philadelphia",
  "Bruce Springsteen", "I'm Goin' Down",
  "Baio", "Sister of Pearl",
  "Neil Young", "Vampire Blues",
  "Labi Siffre", "I Got The...",
  "Laura Branigan", "Gloria",
  "The Velvet Underground", "Sunday Morning",
  "Toots & The Maytals", "Pressure Drop",
  "Vampire Weekend", "Giant",
  "Vampire Weekend", "Ottoman",
  "Vampire Weekend", "Ladies of Cambridge",
  "Vampire Weekend", "Houston, Dubai",
  "L'Homme Run", "Pizza Party",
  "Vampire Weekend", "California English Pt. 2",
  "Vampire Weekend", "Arrows",
  "Major Lazer", "Jessica",
  "Vampire Weekend", "Jonathan Low",
  "Bob Dylan", "Little Drummer Boy"
)
extras_songs = extras_songs %>%
  add_genius(artist, track, type = "lyrics")

I added artist information to all of the Vampire Weekend songs, which I concatenated vertically into their own dataframe. I removed extra columns that were not in both the Vampire Weekend and “Extras” datasets in order to bind them together. Finally, I removed rows with a lyric value of NA.

all_vw_lyrics = do.call("rbind", list(fotb_lyrics, mvotc_lyrics, contra_lyrics, vw_lyrics))

all_vw_lyrics = all_vw_lyrics %>% mutate(artist = "Vampire Weekend") %>% select(-c(track_n))

extras_songs = extras_songs %>% select(-c(track))

all_lyrics = do.call("rbind", list(all_vw_lyrics, extras_songs))

all_lyrics = all_lyrics %>% drop_na()

Unfortunately, I ran into a few more issues when trying to change the names of songs in the Genius dataset so that they matched the original dataset. For some odd reason, the recode method that I used previously did not work for all song titles. I resorted to a bit of grepl-in’, which seemed to get the job done.

all_lyrics = all_lyrics %>% 
  mutate(track_title = recode(track_title, 
                              "Ladies of Cambridge" = "Boston (Ladies of Cambridge)",
                              'NEW DORP. NEW YORK.' = 'New Dorp. New York',
                              'The Boys Are Back In Town' = 'The Boys Are Back in Town',
                              'I Got The' = 'I Got The...',
                              "Late In The Evening" = "Late in the Evening"))

all_lyrics$track_title[grepl("Hold You Now",all_lyrics$track_title)]<-"Hold You Now"

all_lyrics$track_title[grepl("Married in a Gold Rush",all_lyrics$track_title)]<-"Married in a Gold Rush"

all_lyrics$track_title[grepl("We Belong Together",all_lyrics$track_title)]<-"We Belong Together"

all_lyrics$track_title[grepl("Sunflower",all_lyrics$track_title)]<-"Sunflower"

all_lyrics$track_title[grepl("Flower Moon",all_lyrics$track_title)]<-"Flower Moon"

all_lyrics$track_title[grepl("Down",all_lyrics$track_title)]<-"I'm Goin' Down"

all_lyrics$track_title[grepl("Waka",all_lyrics$track_title)]<-"Waka Waka"

all_lyrics$track_title[grepl("Santeria",all_lyrics$track_title)]<-"Santeria"

all_lyrics$track_title[grepl("Don't Dream It's Over",all_lyrics$track_title)]<-"Don't Dream It's Over"

We are almost able to join our lyric data with the rest of the data! In the dataset’s current form, each lyric of the song is a different row, and songs are repeated multiple times. This makes it more difficult to conduct text analysis, so first I aggregated all the lyrics from one song into one row.

lyric_only_analysis = aggregate(all_lyrics$lyric, list(all_lyrics$track_title, all_lyrics$artist), paste, collapse=" ")

lyric_only_analysis = lyric_only_analysis %>% rename("Song"= "Group.1") %>% rename("Artist" = "Group.2") %>% rename("Lyrics" = "x")
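Here is what that aggregation does on a tiny made-up table (placeholder text, not real lyrics): the lines of each song are pasted together into a single string.

```r
# Toy lyric table: one row per lyric line (placeholder text, not real lyrics).
lyric_lines <- data.frame(
  track_title = c("Song A", "Song A", "Song B"),
  artist      = c("Artist 1", "Artist 1", "Artist 2"),
  lyric       = c("first line", "second line", "only line"),
  stringsAsFactors = FALSE
)

# Paste all lines belonging to the same song and artist into one string.
per_song <- aggregate(lyric_lines$lyric,
                      list(Song = lyric_lines$track_title,
                           Artist = lyric_lines$artist),
                      paste, collapse = " ")
names(per_song)[names(per_song) == "x"] <- "Lyrics"
print(per_song)
```

Naming the grouping list (`Song = ...`, `Artist = ...`) saves the separate renaming of `Group.1` and `Group.2` afterwards.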

Then I performed some basic sentiment analysis on the dataset. This website provides an excellent resource for anyone interested in text analysis - I think I refer to it on a weekly basis. Essentially, this code takes each word from a song and places it into its own row. Then I remove common stop words, and some unnecessary words that are specific to these lyrics.

tidy_lyrics = lyric_only_analysis %>% unnest_tokens(word, Lyrics)

tidy_lyrics = tidy_lyrics %>% anti_join(stop_words)

custom_stop_words <- bind_rows(tibble(word = c("ooh", "la", "ah", "pum", NA), 
                                      lexicon = c("custom")), stop_words)

tidy_lyrics = tidy_lyrics %>% anti_join(custom_stop_words)
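For anyone curious what these two steps amount to, here is a rough base R imitation on a placeholder sentence (not a real lyric). The splitting rule and the three-word stop list are simplifications I made up for the example; tidytext's tokenizer and the real stop_words table are more thorough.

```r
# Placeholder text, not a real lyric.
text <- "Ooh the river runs and the river sings"

# Split into lowercase words, roughly what unnest_tokens() does by default.
words <- strsplit(tolower(text), "[^a-z']+")[[1]]
words <- words[words != ""]

# A tiny illustrative stop list; the real stop_words table is far longer.
stop_list <- c("the", "and", "ooh")

# Dropping stop words is the vector version of anti_join().
kept <- words[!words %in% stop_list]
print(kept)
```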

Next, I use the NRC sentiment database to extract words in each of its 10 categories: negative, positive, fear, anger, trust, sadness, disgust, anticipation, surprise, and joy.

sentiments = get_sentiments("nrc")

angry_words <- sentiments %>%
  filter(sentiment == "anger") %>%
  select(word) %>%
  mutate(anger = TRUE)

positive_words <- sentiments %>%
  filter(sentiment == "positive") %>%
  select(word) %>%
  mutate(positive = TRUE)

negative_words <- sentiments %>%
  filter(sentiment == "negative") %>%
  select(word) %>%
  mutate(negative = TRUE)

anticipation_words <- sentiments %>%
  filter(sentiment == "anticipation") %>%
  select(word) %>%
  mutate(anticipation = TRUE)

disgust_words <- sentiments %>%
  filter(sentiment == "disgust") %>%
  select(word) %>%
  mutate(disgust = TRUE)

fear_words <- sentiments %>%
  filter(sentiment == "fear") %>%
  select(word) %>%
  mutate(fear = TRUE)

joy_words <- sentiments %>%
  filter(sentiment == "joy") %>%
  select(word) %>%
  mutate(joy = TRUE)

sadness_words <- sentiments %>%
  filter(sentiment == "sadness") %>%
  select(word) %>%
  mutate(sadness = TRUE)

surprise_words <- sentiments %>%
  filter(sentiment == "surprise") %>%
  select(word) %>%
  mutate(surprise = TRUE)

trust_words <- sentiments %>%
  filter(sentiment == "trust") %>%
  select(word) %>%
  mutate(trust = TRUE)
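The ten blocks above all follow the same pattern, so they could probably be collapsed into a single step that spreads the lexicon into one column per sentiment. Below is a sketch in base R on a toy lexicon; the word/sentiment pairs are illustrative and not taken from NRC. One difference from the blocks above: words outside a category get FALSE here instead of NA, which should not matter for sums.

```r
# Toy lexicon in the same long shape as get_sentiments("nrc");
# the word/sentiment pairs here are illustrative only.
lexicon <- data.frame(
  word      = c("storm", "storm", "gift", "gift", "gift"),
  sentiment = c("fear", "negative", "joy", "positive", "surprise"),
  stringsAsFactors = FALSE
)

# Cross-tabulate to get one row per word and one TRUE/FALSE column
# per sentiment.
flags <- as.data.frame(unclass(table(lexicon$word, lexicon$sentiment)) > 0)
flags$word <- rownames(flags)
rownames(flags) <- NULL
print(flags)
```

A single join of the lyric words against a table like `flags` would then replace the ten separate joins.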

Then I join the lyric data with each of the sentiment word lists. Because these are left joins, every lyric word is kept, and each sentiment column is TRUE when the word belongs to that category and NA otherwise.

lyrics_sentiment = tidy_lyrics %>% 
  left_join(angry_words, by = "word")

lyrics_sentiment = lyrics_sentiment %>% 
  left_join(positive_words, by = "word")

lyrics_sentiment = lyrics_sentiment %>% 
  left_join(negative_words, by = "word")

lyrics_sentiment = lyrics_sentiment %>% 
  left_join(anticipation_words, by = "word")

lyrics_sentiment = lyrics_sentiment %>% 
  left_join(disgust_words, by = "word")

lyrics_sentiment = lyrics_sentiment %>% 
  left_join(fear_words, by = "word")

lyrics_sentiment = lyrics_sentiment %>% 
  left_join(joy_words, by = "word")

lyrics_sentiment = lyrics_sentiment %>% 
  left_join(sadness_words, by = "word")

lyrics_sentiment = lyrics_sentiment %>% 
  left_join(surprise_words, by = "word")

lyrics_sentiment = lyrics_sentiment %>% 
  left_join(trust_words, by = "word")

Lastly, I calculate the proportion of each sentiment type in each song. Proportions are a more robust measure than raw counts because songs differ in length: a song with 50 angry words out of 200 in total is only 25% “angry”, whereas a song with just 5 words, 4 of them angry, is 80% “angry.” Note that the totals here count the words that remain after stop words are removed. I repurposed this method from this tutorial.

lyrics_sentiment = lyrics_sentiment %>% 
  group_by(Song, Artist) %>% 
  summarize(percent_angry = sum(anger, na.rm = TRUE) / n(), 
            percent_positive = sum(positive, na.rm = TRUE) / n(),
            percent_negative = sum(negative, na.rm = TRUE) / n(),
            percent_anticipation = sum(anticipation, na.rm = TRUE) / n(),
            percent_disgust = sum(disgust, na.rm = TRUE) / n(),
            percent_fear = sum(fear, na.rm = TRUE) / n(),
            percent_joy = sum(joy, na.rm = TRUE) / n(),
            percent_sadness = sum(sadness, na.rm = TRUE) / n(),
            percent_surprise = sum(surprise, na.rm = TRUE) / n(),
            percent_trust = sum(trust, na.rm = TRUE) / n(),
            word_count = n())
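The two songs from the example above can be reproduced in a few lines. The vectors are synthetic and mirror the TRUE/NA flags that the joins produce.

```r
# One entry per lyric word; TRUE when the word is in the anger list,
# NA otherwise (mirroring the result of the left joins).
long_song  <- c(rep(TRUE, 50), rep(NA, 150))   # 50 angry words out of 200
short_song <- c(rep(TRUE, 4), NA)              # 4 angry words out of 5

# Same arithmetic as sum(anger, na.rm = TRUE) / n() in the summarize call.
percent_angry <- function(flag) sum(flag, na.rm = TRUE) / length(flag)

print(percent_angry(long_song))   # 0.25
print(percent_angry(short_song))  # 0.8
```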

Finally, I join the lyrical sentiment data with the main dataset. Felt like this day would never come!

FOTB_sentiment = left_join(FOTB_fixed, lyrics_sentiment, by = c("track_name" = "Song"))

We now have all of the setlist data, music data from Spotify, lyrics data from Genius and a bit of NRC sentiment data. I think that should be enough for any analyses we might want to do ;) So let’s (finally) see what story the data has to tell!

Music

For our music analyses, we only want certain variables and we’ll focus specifically on songs that are on one of Vampire Weekend’s four albums. A bit of data-tidying (because Diplomat’s Son continues to be deviant), and re-scaling of Spotify’s output variables so that they’re all on the same scale, and we’re good to go. If you want to know more about the variables Spotify provides, check them out here.

vw_only_music = FOTB_sentiment %>% 
  filter(artist_name == "Vampire Weekend") %>% 
  filter(album_name == "Vampire Weekend"|
           album_name == "Contra" |
           album_name == "Modern Vampires of the City" |
           album_name == "Father of the Bride" |
           track_name == "Diplomat's Son") %>% 
  mutate(album_name = recode(album_name, 'Modern Vampires of the City' = 'MVOTC')) %>% 
  distinct(track_name, .keep_all = TRUE) %>% 
  select(track_name, artist_name, album_name, danceability, energy, loudness, 
         speechiness, acousticness, instrumentalness, liveness, tempo, valence,
         duration_ms, time_signature, key_name, mode_name, key_mode, track_number) %>% 
  mutate(track_number=replace(track_number, track_name=="Diplomat's Son", 8.5)) %>% 
  mutate(album_name=replace(album_name, track_name=="Diplomat's Son", "Contra"))

vw_only_music$tempo_rescale = rescale(vw_only_music$tempo, to = c(0, 1), from = range(vw_only_music$tempo, na.rm = TRUE, finite = TRUE))

vw_only_music$energy_rescale = rescale(vw_only_music$energy, to = c(0, 1), from = range(vw_only_music$energy, na.rm = TRUE, finite = TRUE))

vw_only_music$danceability_rescale = rescale(vw_only_music$danceability, to = c(0, 1), from = range(vw_only_music$danceability, na.rm = TRUE, finite = TRUE))

vw_only_music$loudness_rescale = rescale(vw_only_music$loudness, to = c(0, 1), from = range(vw_only_music$loudness, na.rm = TRUE, finite = TRUE))

vw_only_music$valence_rescale = rescale(vw_only_music$valence, to = c(0, 1), from = range(vw_only_music$valence, na.rm = TRUE, finite = TRUE))

vw_only_music$speechiness_rescale = rescale(vw_only_music$speechiness, to = c(0, 1), from = range(vw_only_music$speechiness, na.rm = TRUE, finite = TRUE))

vw_only_music$acousticness_rescale = rescale(vw_only_music$acousticness, to = c(0, 1), from = range(vw_only_music$acousticness, na.rm = TRUE, finite = TRUE))

vw_only_music$instrumentalness_rescale = rescale(vw_only_music$instrumentalness, to = c(0, 1), from = range(vw_only_music$instrumentalness, na.rm = TRUE, finite = TRUE))

vw_only_music$liveness_rescale = rescale(vw_only_music$liveness, to = c(0, 1), from = range(vw_only_music$liveness, na.rm = TRUE, finite = TRUE))
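With `to = c(0, 1)` and `from` set to the observed range, the rescaling above is plain min-max scaling: (x - min) / (max - min). The sketch below writes that out as a hypothetical helper, `min_max`, and applies it to several columns in a loop rather than one line per variable. The feature values are made up.

```r
# Min-max rescaling: the smallest value maps to 0 and the largest to 1.
min_max <- function(x) {
  rng <- range(x, na.rm = TRUE, finite = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

# Made-up audio features for three songs.
features <- data.frame(tempo = c(90, 120, 150), loudness = c(-10, -7, -4))

# Apply the same rescaling to every column in one step.
for (col in names(features)) {
  features[[paste0(col, "_rescale")]] <- min_max(features[[col]])
}
print(features)
```

Keep in mind that the result depends on which songs are in the table: a value of 1 only means "highest in this dataset".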

First, let’s look at the estimated distributions of different variables across the four albums. In this plot, made with the ggridges package, the higher the ridge, the more songs there are with danceability scores of that value. For example, the ridge is highest around .40 for the self-titled album, which indicates that most of the songs on the album are decently easy to dance to (an inference I would agree with).

vw_only_music %>% 
  arrange(album_name) %>%
  mutate(album = factor(album_name, levels=c("Vampire Weekend", "Contra", "MVOTC", "Father of the Bride"))) %>%
  ggplot(aes(danceability_rescale, album, fill = after_stat(x))) +
  geom_density_ridges_gradient(scale = 1.5, rel_min_height = 0.01) +
  scale_fill_viridis_c(name = "Danceability", option = "C") +
  labs(title = 'Distribution of Danceability in VW Albums') +
  ylab("Album") +
  xlab("Danceability")

We can also look at how danceability changes over the course of the albums. Through these plots, we can see that Contra apparently becomes harder to dance to as the album progresses. “I Think Ur A Contra” will always be one of my favorites though, regardless of how easily I can groove to it.

vw_only_music %>% 
  ggplot(aes(track_number, danceability_rescale, color = album_name)) +
  geom_point() +
  geom_line() +
  facet_grid(rows = vars(album_name)) +
  labs(title = "Danceability Patterns in VW Albums", color = "Album Name") +
  xlab("Track Number") +
  ylab("Danceability")

If we’re interested in comparing multiple musical variables at once for all of the albums, we can generate an average value for each album and use a radar plot. Surprisingly, according to Spotify’s data, the albums seem to follow similar patterns, with the self-titled having generally higher mean values on all the variables. Starting off strong in 2008!

vw_only_music %>% 
  group_by(album_name) %>% 
  summarise(tempo = mean(tempo_rescale),
            energy = mean(energy_rescale),
            danceability = mean(danceability_rescale),
            valence = mean(valence_rescale),
            loudness = mean(loudness_rescale)) %>% 
  ggradar(label.gridline.max = FALSE, label.gridline.min = FALSE, label.gridline.mid = FALSE, base.size = 3)

Spotify also provides data about the key a song is played in. My music theory knowledge is a bit weak (read: non-existent), so unfortunately it is difficult for me to confirm the accuracy of this. If you notice something funky, let me know!

vw_only_music %>% 
  arrange(album_name) %>%
  mutate(Album = factor(album_name, levels=c("Vampire Weekend", "Contra", "MVOTC", "Father of the Bride"))) %>%
  filter(!is.na(key_name)) %>% 
  ggplot() + 
  geom_bar(aes(x = Album, fill = key_name), colour = "white", 
           width = 1, position = "fill") +
  scale_y_continuous(expand = c(0,0)) +
  scale_x_discrete(expand = c(0,0)) +
  theme(axis.text.y = element_blank(),
        axis.ticks = element_blank(),
        axis.title.y = element_blank()) +
  labs(title = "Keys in VW Albums", fill = "Key Name")

Inspired by RCharlie yet again, and also the conversations I have with my close friend Emma (a die-hard VW-head) I have also created my own variable called the “Banger Index.” Yes, I really called it that. Everyone knows a banger when they hear it… but what do genuine bangers consist of? Well, my best guess is…

vw_only_music = vw_only_music %>% mutate(banger_index = tempo_rescale + loudness_rescale + energy_rescale)

vw_only_music$banger_index_rescale = rescale(vw_only_music$banger_index, to = c(0, 1), from = range(vw_only_music$banger_index, na.rm = TRUE, finite = TRUE))

A nice mix of loud, energetic and high tempo music generally feels like a banger to me. And when I test it out on the data, it seems to fit pretty well, seeing as A-Punk and Cousins are right at the top of the list.

vw_only_music %>% 
  group_by(track_name) %>%
  arrange(desc(banger_index_rescale)) %>% 
  ungroup() %>%
  distinct(track_name, .keep_all = TRUE) %>% 
  select(track_name, banger_index_rescale)

## # A tibble: 50 x 2
##    track_name        banger_index_rescale
##    <chr>                            <dbl>
##  1 A-Punk                           1    
##  2 Cousins                          0.971
##  3 Diplomat's Son                   0.961
##  4 Bryn                             0.961
##  5 Worship You                      0.909
##  6 Mansard Roof                     0.890
##  7 Holiday                          0.861
##  8 Giving Up the Gun                0.858
##  9 Don't Lie                        0.835
## 10 This Life                        0.823
## # … with 40 more rows
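To show how the index behaves, here is the same recipe on three made-up songs whose inputs are already on a 0 to 1 scale. Each ingredient gets equal weight, so a song only tops the list if it scores high on tempo, loudness and energy together.

```r
# Made-up rescaled features for three songs.
songs <- data.frame(
  track    = c("Song A", "Song B", "Song C"),
  tempo    = c(1.0, 0.5, 0.0),
  loudness = c(0.8, 0.5, 0.1),
  energy   = c(0.9, 0.4, 0.2),
  stringsAsFactors = FALSE
)

# Equal-weight sum of the three ingredients.
songs$banger_index <- songs$tempo + songs$loudness + songs$energy

# Rescale the sum to 0-1 and rank from most to least banger-like.
rng <- range(songs$banger_index)
songs$banger_index_rescale <- (songs$banger_index - rng[1]) / diff(rng)
ranked <- songs[order(-songs$banger_index_rescale), c("track", "banger_index_rescale")]
print(ranked)
```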

We’ll return to bangers later. For now, let’s move on to some lyrical analyses!

Lyrics

First, let’s see which words are most common in Vampire Weekend songs (and give the word cloud a nice tasteful color palette of the 1970s).

pal <- brewer.pal(8,"Dark2")
tidy_lyrics %>% 
  filter(Artist == "Vampire Weekend") %>% 
  count(word, sort = TRUE) %>% 
  with(wordcloud(word, n, random.order = FALSE, max.words = 100, colors = pal))

Baby, baby, baby, baby right on time…

Let’s also check out what words are commonly found with other words. I chose “trigrams” (or groups of three words) for the purpose of showing more lyrical patterns, but you could choose any n-gram (bigrams, quadrigrams(?), etc). I removed the same common and custom stop words from earlier, and then filtered out extremely uncommon trigrams. This analysis is inspired by Tidy Text Mining, again. Told you I relied on them often!

lyric_trigrams <- lyric_only_analysis %>%
  filter(Artist == "Vampire Weekend") %>% 
  unnest_tokens(trigram, Lyrics, token = "ngrams", n = 3)

trigrams_separated <- lyric_trigrams %>%
  separate(trigram, c("word1", "word2", "word3"), sep = " ")

trigrams_filtered <- trigrams_separated %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word) %>% 
  filter(!word3 %in% stop_words$word) %>% 
  filter(!word2 %in% custom_stop_words$word) %>% 
  filter(!word1 %in% custom_stop_words$word) %>% 
  filter(!word3 %in% custom_stop_words$word)

trigram_counts <- trigrams_filtered %>% 
  count(word1, word2, word3, sort = TRUE)

trigram_graph <- trigram_counts %>%
  filter(n > 1) %>% 
  graph_from_data_frame()

set.seed(2021)
a <- grid::arrow(type = "closed", length = unit(.15, "inches"))
ggraph(trigram_graph, layout = "fr") +
  geom_edge_link(aes(edge_alpha = n), show.legend = FALSE,
                 arrow = a, end_cap = circle(.07, 'inches')) +
  geom_node_point(color = "lightblue", size = 5) +
  geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
  theme_void()

I got a laugh out of the “unbearably” “white” “women” combo. We’ve also got a nice little Ya Hey circle. There are also some stragglers (“boy”, for example) that apparently occur so often next to the same word that they’re off in their own corner.
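If the trigram step feels abstract, this base R sketch builds and counts trigrams by hand on a placeholder sentence (not a real lyric). It skips the stop-word filtering and is only meant to show the sliding three-word window.

```r
# Placeholder sentence, not a real lyric.
text  <- "the sun goes down the sun comes up the sun goes down"
words <- strsplit(text, " ")[[1]]

# Slide a window of three words across the sentence.
n <- length(words)
trigrams <- paste(words[1:(n - 2)], words[2:(n - 1)], words[3:n])

# Count each trigram and sort from most to least common.
counts <- sort(table(trigrams), decreasing = TRUE)
print(counts)
```

In the graph, each trigram that survives the `n > 1` filter contributes an arrow from its first word to its second, which is why repeated phrases show up as chains.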

Next, I wanted to conduct some similar analyses to the music ones above, but with lyrical data instead. To do so, I chose the relevant variables from the main dataset and made sure it was tidy.

vw_only_lyrics = FOTB_sentiment %>% 
  filter(artist_name == "Vampire Weekend") %>% 
  filter(album_name == "Vampire Weekend"|
           album_name == "Contra" |
           album_name == "Modern Vampires of the City" |
           album_name == "Father of the Bride" |
           track_name == "Diplomat's Son") %>% 
  mutate(album_name = recode(album_name, 'Modern Vampires of the City' = 'MVOTC')) %>% 
  distinct(track_name, .keep_all = TRUE) %>% 
  select(track_name, artist_name, album_name, percent_angry, percent_anticipation, percent_positive,percent_negative,
         percent_disgust, percent_trust, percent_surprise, percent_joy, percent_fear, percent_sadness, track_number) %>% 
  mutate(track_number=replace(track_number, track_name=="Diplomat's Son", 8.5)) %>% 
  mutate(album_name=replace(album_name, track_name=="Diplomat's Son", "Contra"))

I was first interested in what kind of lyrical patterns occurred within each album, so I chose 3 of the variables that show what percentage of words in each song reflect a certain sentiment (in this case, I chose anticipation, fear and surprise).

vw_only_lyrics %>% 
  ggplot(aes(x = track_number)) +
  geom_point(aes(y= percent_fear, color= "Fear")) +
  geom_line(aes(y= percent_fear, color= "Fear")) +
  geom_point(aes(y= percent_anticipation, color= "Anticipation")) +
  geom_line(aes(y= percent_anticipation, color= "Anticipation")) +
  geom_point(aes(y= percent_surprise, color= "Surprise")) +
  geom_line(aes(y= percent_surprise, color= "Surprise")) +
  facet_grid(rows = vars(album_name)) +
  labs(title = "Emotional Word Patterns in VW Albums", color = "Word Category") +
  xlab("Track Number") +
  ylab("Percentage of Emotion Words")

When we overlay these 3 variables, we can look at patterns in comparison to each other. For example, the percentage of surprise words per song in Father of the Bride stays relatively constant, but the percentage of fear words changes more from song to song. Obvious Bicycle (one of my all-time favorite VW songs) has a much greater percentage of anticipation words (e.g., “wait”) than all of the other songs. This is also indicative of how much I anticipate hearing it relative to other songs - I’ve also requested it at two shows… because I really like it.

Moving on to a final radar plot, we can again see some surprising results. I’ve always thought of MVOTC as a more contemplative and moody album, but apparently it has the highest average percentage of positive and joy words, whereas Father of the Bride has the highest average percentage of negative words. How my mental tables have turned.

vw_only_lyrics %>% 
  group_by(album_name) %>% 
  summarise(angry = mean(percent_angry),
            anticipation = mean(percent_anticipation),
            positive = mean(percent_positive),
            negative = mean(percent_negative),
            disgust = mean(percent_disgust),
            trust = mean(percent_trust),
            surprise = mean(percent_surprise),
            joy = mean(percent_joy),
            fear = mean(percent_fear),
            sadness = mean(percent_sadness)) %>% 
  ggradar(grid.max = .25, label.gridline.max = FALSE, label.gridline.min = FALSE)

‘Father of the Bride’ Tour Setlists

Now for the moment we’ve all been waiting for/the original reason I embarked on this project! If you’ve stuck with me this far, thank you.

Let’s see what the setlist data has to say (after we add the rescaled variables to the dataset like we did before… can’t forget about the banger index!)

FOTB_sentiment$tempo_rescale = rescale(FOTB_sentiment$tempo, to = c(0, 1), from = range(FOTB_sentiment$tempo, na.rm = TRUE, finite = TRUE))

FOTB_sentiment$energy_rescale = rescale(FOTB_sentiment$energy, to = c(0, 1), from = range(FOTB_sentiment$energy, na.rm = TRUE, finite = TRUE))

FOTB_sentiment$danceability_rescale = rescale(FOTB_sentiment$danceability, to = c(0, 1), from = range(FOTB_sentiment$danceability, na.rm = TRUE, finite = TRUE))

FOTB_sentiment$loudness_rescale = rescale(FOTB_sentiment$loudness, to = c(0, 1), from = range(FOTB_sentiment$loudness, na.rm = TRUE, finite = TRUE))

FOTB_sentiment$valence_rescale = rescale(FOTB_sentiment$valence, to = c(0, 1), from = range(FOTB_sentiment$valence, na.rm = TRUE, finite = TRUE))

FOTB_sentiment$speechiness_rescale = rescale(FOTB_sentiment$speechiness, to = c(0, 1), from = range(FOTB_sentiment$speechiness, na.rm = TRUE, finite = TRUE))

FOTB_sentiment$acousticness_rescale = rescale(FOTB_sentiment$acousticness, to = c(0, 1), from = range(FOTB_sentiment$acousticness, na.rm = TRUE, finite = TRUE))

FOTB_sentiment$instrumentalness_rescale = rescale(FOTB_sentiment$instrumentalness, to = c(0, 1), from = range(FOTB_sentiment$instrumentalness, na.rm = TRUE, finite = TRUE))

FOTB_sentiment$liveness_rescale = rescale(FOTB_sentiment$liveness, to = c(0, 1), from = range(FOTB_sentiment$liveness, na.rm = TRUE, finite = TRUE))

FOTB_sentiment = FOTB_sentiment %>% mutate(banger_index = tempo_rescale + loudness_rescale + energy_rescale)

FOTB_sentiment$banger_index_rescale = rescale(FOTB_sentiment$banger_index, to = c(0, 1), from = range(FOTB_sentiment$banger_index, na.rm = TRUE, finite = TRUE))
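If the rescaling step feels a bit opaque, here is a small base R sketch of the same idea on made-up numbers. The `minmax()` helper and the `toy` data frame are hypothetical; for a numeric vector with more than one distinct value, `minmax()` should give the same result as `rescale(x, to = c(0, 1))`.

```r
# Min-max rescaling to [0, 1]: subtract the minimum, divide by the range.
# Assumes x has at least two distinct finite values.
minmax <- function(x) {
  rng <- range(x, na.rm = TRUE, finite = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

# Made-up audio features for three hypothetical songs.
toy <- data.frame(
  track_name = c("Song A", "Song B", "Song C"),
  tempo      = c(90, 120, 150),     # beats per minute
  loudness   = c(-12, -8, -4),      # decibels
  energy     = c(0.30, 0.90, 0.60),
  stringsAsFactors = FALSE
)

# Rescale each feature so they are comparable before adding them up.
features <- c("tempo", "loudness", "energy")
for (f in features) {
  toy[[paste0(f, "_rescale")]] <- minmax(toy[[f]])
}

# Banger index as defined in this post: tempo + loudness + energy (rescaled).
toy$banger_index <- toy$tempo_rescale + toy$loudness_rescale + toy$energy_rescale
toy$banger_index_rescale <- minmax(toy$banger_index)

toy[, c("track_name", "banger_index", "banger_index_rescale")]
```

Rescaling first matters because tempo (in BPM) would otherwise swamp energy (already between 0 and 1) in the sum.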

First, let’s look at which songs were the most common “openers” or first songs of the shows. We can see quite a bit of variety here, with Bambina, Sympathy, Sunflower and Harmony Hall being the most common. Gotta love a Sympathy opener!

FOTB_sentiment %>% 
  group_by(track_name) %>% 
  filter(song_num == 1) %>% 
  ungroup() %>% 
  count(track_name, sort = TRUE) %>% 
  mutate(track_name = reorder(track_name, n)) %>%
  ggplot(aes(track_name, n, fill = track_name)) +
  geom_col() +
  xlab(NULL) +
  coord_flip() +
  theme(legend.position = "none") +
  ylab("# of Times Played") +
  ggtitle("FOTB Tour Openers")

Next, let’s look at which songs were the most common “closers” or final songs on the setlists. For anyone who has seen Vampire Weekend on this tour, “Walcott” and “Ya Hey” clearly dominate the end of the set. Personally, I prefer the coveted Worship You -> Ya Hey -> Walcott, but you can’t always get what you want.

FOTB_sentiment %>% 
  group_by(track_name) %>% 
  filter(perc_complete == 1) %>%
  ungroup() %>% 
  count(track_name, sort = TRUE) %>% 
  mutate(track_name = reorder(track_name, n)) %>%
  ggplot(aes(track_name, n, fill = track_name)) +
  geom_col() +
  xlab(NULL) +
  coord_flip() +
  theme(legend.position = "none") +
  ylab("# of Times Played") +
  ggtitle("FOTB Tour Closers")
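To make the opener and closer filters concrete, here is a rough base R sketch on two made-up setlists. It assumes `perc_complete` is a song’s position divided by the number of songs played that night (so the last song of every show is exactly 1); the `toy` data and the song names are hypothetical.

```r
# Made-up setlists for two hypothetical shows.
toy <- data.frame(
  eventDate  = c(rep("show 1", 4), rep("show 2", 3)),
  song_num   = c(1, 2, 3, 4, 1, 2, 3),
  track_name = c("Song A", "Song B", "Song C", "Song D",
                 "Song A", "Song C", "Song D"),
  stringsAsFactors = FALSE
)

# Assumed definition: position in the set divided by the set's length,
# so the final song of every show gets exactly 1.
toy$perc_complete <- toy$song_num / ave(toy$song_num, toy$eventDate, FUN = max)

# Openers are the first song of each show; closers are the last.
openers <- table(toy$track_name[toy$song_num == 1])
closers <- table(toy$track_name[toy$perc_complete == 1])

sort(openers, decreasing = TRUE)
sort(closers, decreasing = TRUE)
```

In this toy data “Song A” opens both shows and “Song D” closes both, even though the two sets have different lengths.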

Now let’s focus on patterns throughout the setlists. First we’ll look at how the loudness metric tends to change over the course of the show. This plot shows us that the first half of a VW show is pretty loud, with a slight dip after the 75% mark. Our ears all need a good rest sometimes!

FOTB_sentiment %>% 
  ggplot(aes(perc_complete, loudness_rescale)) +
  geom_point(aes(color = eventDate)) +
  geom_smooth() +
  theme(legend.position = "none") +
  ylab("Loudness of Songs") +
  xlab("Duration of Show [0 = Beginning, 1 = End]") +
  ggtitle("Loudness throughout VW Shows")

We can also look at how the show tends to change lyrically. Unlike loudness, which takes a bit of a hit later in the show, positive lyrical vibes tend to fluctuate throughout the whole show.

FOTB_sentiment %>% 
  ggplot(aes(perc_complete, percent_positive)) +
  geom_point(aes(color = eventDate)) +
  geom_smooth() +
  theme(legend.position = "none") +
  ylab("Positivity of Songs") +
  xlab("Duration of Show [0 = Beginning, 1 = End]") +
  ggtitle("Percentage of Positive Words throughout VW Shows")

As a self-proclaimed Phish head, I also had to know the answer to the question of how similar shows are to one another. One way to calculate this is through cosine similarity, which you can read about here if you are interested. Essentially, the higher the cosine similarity, the more similar two setlists are to one another. With 92 shows in our dataset, that’s a lot of comparisons! However, we can just look at the mean value, as well as the range of values.

fotb_wide = FOTB_sentiment %>% 
  group_by(track_name) %>% 
  select(track_name, song_num, eventDate) %>% 
  ungroup() %>% 
  mutate(id = 1:n()) 
fotb_wide = fotb_wide %>% 
  spread(eventDate, song_num, fill = NA) %>% 
  group_by(track_name) %>% 
  summarise_all(~first(na.omit(.))) %>% 
  ungroup()
w_o_names = fotb_wide %>% select(-track_name)
cosine = as.matrix(simil(w_o_names, method="cosine"))
mean(cosine, na.rm = TRUE)

## [1] 0.8300675

range(cosine, na.rm = TRUE)

## [1] 0.04686676 1.00000000

The mean value is 0.83, with a range of roughly 0.05 to 1.0. This provides evidence that most shows are fairly similar, but remember that comparing a show to itself results in a value of 1. Therefore this mean value might be a bit of an overestimation. Vampire Weekend certainly switches up their setlist enough that I always have something unexpected to look forward to :)
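For a feel of what cosine similarity between setlists means, here is a small base R sketch on made-up data. It treats each show as a vector of 0s and 1s (was the song played or not), which is an assumption and may not match how the matrix above was built, and it drops the show-versus-itself comparisons before averaging.

```r
# Made-up setlists for three hypothetical shows.
setlists <- list(
  show_1 = c("Song A", "Song B", "Song C"),
  show_2 = c("Song A", "Song B", "Song D"),
  show_3 = c("Song E", "Song F")
)

# One row per show, one column per song; 1 if the song was played.
songs <- sort(unique(unlist(setlists)))
presence <- t(sapply(setlists, function(s) as.numeric(songs %in% s)))
colnames(presence) <- songs

# Cosine similarity between rows: dot product over the product of lengths.
norms <- sqrt(rowSums(presence^2))
cosine <- (presence %*% t(presence)) / (norms %o% norms)

# Keep only distinct pairs of shows (drops the diagonal of 1s).
pairwise <- cosine[upper.tri(cosine)]
mean(pairwise)
range(pairwise)
```

Shows 1 and 2 share two of their three songs and score 2/3, while show 3 shares nothing with either and scores 0, so the mean over the three pairs is 2/9.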

Last but not least, let us not forget the infamous Banger Index. This plot shows that Vampire Weekend likes to pull out their biggest bangers in the “third quarter” (between the 50% and 75% marks) and then give us a bit of a break afterwards. Personally, I’m not surprised at all by these results - my feet are rarely both on the ground during a VW show. #bangersonly

FOTB_sentiment %>% 
  ggplot(aes(perc_complete, banger_index_rescale)) +
  geom_point(aes(color = eventDate)) +
  geom_smooth() +
  theme(legend.position = "none") +
  ylab("Banger Index of Songs") +
  xlab("Duration of Show [0 = Beginning, 1 = End]") +
  ggtitle("Banger Index throughout VW Shows")

Who amongst us was lucky enough to be at the show with the highest average Banger Index? If you were at the 5/25/2019 show at Stewart Park, consider yourself blessed to have heard an 8-song heater.

FOTB_sentiment %>% 
  group_by(eventDate) %>% 
  summarise(mean_banger = mean(banger_index_rescale)) %>% 
  arrange(desc(mean_banger))

## # A tibble: 90 x 2
##    eventDate  mean_banger
##    <chr>            <dbl>
##  1 25-05-2019       0.779
##  2 05-12-2019       0.735
##  3 04-12-2019       0.720
##  4 03-07-2019       0.692
##  5 17-05-2019       0.690
##  6 22-09-2019       0.688
##  7 15-09-2019       0.684
##  8 19-10-2019       0.684
##  9 18-10-2019       0.678
## 10 05-11-2019       0.671
## # … with 80 more rows

FOTB_sentiment %>% 
  filter(eventDate == "25-05-2019") %>% 
  select(eventDate, venue.name, song_num,track_name)

## # A tibble: 8 x 5
## # Groups:   show_num [1]
##   show_num eventDate  venue.name   song_num track_name  
##      <dbl> <chr>      <chr>           <dbl> <chr>       
## 1       75 25-05-2019 Stewart Park        1 Harmony Hall
## 2       75 25-05-2019 Stewart Park        2 Sunflower   
## 3       75 25-05-2019 Stewart Park        3 Holiday     
## 4       75 25-05-2019 Stewart Park        4 Diane Young 
## 5       75 25-05-2019 Stewart Park        5 Cousins     
## 6       75 25-05-2019 Stewart Park        6 A-Punk      
## 7       75 25-05-2019 Stewart Park        7 This Life   
## 8       75 25-05-2019 Stewart Park        8 Unbelievers

That’s all for now, folks :) Hope you enjoyed this data-driven deep dive into Vampire Weekend’s music, lyrics and setlists from this past tour. If you have any comments, questions, or suggestions, feel free to toss ’em my way @hnm1231 on Twitter. Hope to see you at a show!