In recent years, if you’ve had access to any form of social media, the internet, cable, or even public television, you’ve surely seen reports and studies broadcast about how the Millennial generation is to blame for ruining a multitude of things: cash, marriage, and the napkin industry, to name a few. But an argument also exists out there on the interwebs claiming that we’ve ruined music, too. The commonplace sentiment that “all music sounds the same” is coupled with nostalgia for the variety and perceived higher quality of the music of the past.
So the question has to be asked: has music really changed all that much since the onset of musical recordings? Are artists simply doing what they’ve always done, albeit now in the presence of improved technology and omnipresent social media? And are the people crucifying Millennials for the current musical state of the world really just curmudgeons bemoaning a new iteration of “You kids get off my lawn”? Music changes through the ages, without a doubt. But is music today simply a reflection of the past, or are times really changing, for better or worse?
Data for this analysis was scraped from the music streaming service Spotify. Each song in the dataset will be assigned a year and a decade based on its release date. Variation in key, tempo, speed, and other assorted musical metrics will be analyzed for each decade as a whole and for each genre within a decade. Finally, the most popular artists from each decade will be examined to see whether there are any themes to musical popularity (a “recipe for success,” if you will) or whether what was once popular has dwindled, much like the supposedly decreased Millennial attention span and its potential correlation with shortened song duration.
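As a rough sketch of the year and decade assignment described above: the sample dates and the names `release_dates`, `release_year`, and `release_decade` are illustrative assumptions, not columns of the dataset, and the sketch assumes every date starts with a four-digit year.

```r
# Hypothetical release dates in three formats a release-date field might take
release_dates <- c("1976", "1999-06", "2019-12-13")

# Keep the leading four digits as the year
release_year <- as.integer(sub("^(\\d{4}).*$", "\\1", release_dates))

# Integer-divide by 10 to drop the final digit, then scale back up
release_decade <- (release_year %/% 10L) * 10L

data.frame(release_dates, release_year, release_decade)
```

Integer division keeps the decade as a plain number (1970, 1990, 2010), which is convenient for grouping and plotting later on.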
The following four packages will be used for the overall analysis:
library(dplyr)
library(stringr)
library(ggplot2)
library(readr)
Spotify, as previously stated, is a massive online music streaming platform based out of Sweden that launched in 2008 and has more than 50 million tracks housed within its service. The data we’ll be using for analysis was scraped from Spotify using the spotifyr package created by Charlie Thompson, Josiah Parry, Donal Phipps, and Tom Wolff, and limits the collected songs to 6 different genres (EDM, Pop, Latin, R&B, Rap, Rock) throughout the ages. This data can be found here.
A brief overview of the variables housed within the dataset (also found at the location above) is listed below:
spotify_songs <- read_csv("spotify_songs.csv")
While there are only 28356 distinct track IDs in the dataset, there are 32833 observations. Furthermore, there are 23449. This indicates that, in this data structure, a track (track_id) can be assigned to multiple playlists, and that the identifier may correspond to a unique combination of album and track.
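A sketch of how counts like these can be compared, using a made-up toy table rather than the real data (the rows and the object names `toy_songs` and `counts` are assumptions for illustration):

```r
library(dplyr)

# Toy data: track "t1" appears on two playlists, so it occupies two rows,
# and the title "Song A" exists under two different track IDs
toy_songs <- data.frame(
  track_id    = c("t1", "t1", "t2", "t3"),
  track_name  = c("Song A", "Song A", "Song A", "Song B"),
  playlist_id = c("p1", "p2", "p1", "p1"),
  stringsAsFactors = FALSE
)

counts <- toy_songs %>%
  summarise(
    n_rows        = n(),
    n_track_ids   = n_distinct(track_id),
    n_track_names = n_distinct(track_name)
  )

counts
```

Fewer distinct IDs than rows means some tracks repeat across playlists; fewer distinct names than IDs means the same title exists under more than one track_id.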
spotify_songs %>%
select(track_name, track_album_id, playlist_id) %>%
group_by(track_name) %>%
summarise(num_albums = n_distinct(track_album_id), num_playlists = n_distinct(playlist_id)) %>%
filter(num_albums > 1 | num_playlists > 1) %>%
head(10)
## # A tibble: 10 x 3
## track_name num_albums num_playlists
## <chr> <int> <int>
## 1 '39 - 2011 Mix 2 2
## 2 'Till I Collapse 2 4
## 3 #1 Stunna 2 2
## 4 $ave Dat Money (feat. Fetty Wap & Rich Homie Quan) 1 2
## 5 (Don't Fear) The Reaper 3 6
## 6 (Don't Fear) The Reaper - Single Version 1 2
## 7 (Feels Like) Heaven 2 2
## 8 (I Can't Get No) Satisfaction - Mono Version 2 3
## 9 (I Can't Get No) Satisfaction - Mono Version / Rema~ 2 2
## 10 (I Just) Died In Your Arms 2 2
Clearly, “(Don’t Fear) The Reaper” is getting some play time in recent history. Now let’s look at its track_id, track_album_id, track_album_name, and track_album_release_date to confirm the hypothesis that track_id is really a unique identifier not just for a track, but for a track and album combination.
spotify_songs %>%
filter(track_name == "(Don't Fear) The Reaper") %>%
select(track_id, track_album_name, track_album_id, track_album_release_date)
## # A tibble: 6 x 4
## track_id track_album_name track_album_id track_album_releas~
## <chr> <chr> <chr> <chr>
## 1 6NTqBHONQqmud0~ Agents of Fortune 6YOzCPyuPC92Eg44a~ 1976
## 2 5QTxFnGygVM4jF~ Agents Of Fortune 6C9WzlQANeoD0GW5B~ 1976
## 3 6NTqBHONQqmud0~ Agents of Fortune 6YOzCPyuPC92Eg44a~ 1976
## 4 5QTxFnGygVM4jF~ Agents Of Fortune 6C9WzlQANeoD0GW5B~ 1976
## 5 2NL4BBBSgypHnx~ The Essential Blue Öys~ 6NNrQJ8ojvbfFzoUj~ 1972
## 6 2NL4BBBSgypHnx~ The Essential Blue Öys~ 6NNrQJ8ojvbfFzoUj~ 1972
This uncovers something unexpected about our data format: the same track_album_name appearing with different capitalization leads to multiple album IDs for what is clearly the same album (as indicated by name and release year), all due to a subjective capitalization of “of”.
For the purpose of this analysis, we will not be investigating playlist names or subgenres, so these variables can be dropped. Moving forward, we need to be careful to note that songs do appear on multiple albums, but we will keep these albums, since each is a distinct identifier for our tracks and an additional release can be a cause of a song’s increased popularity.
Also, after some digging, only 5 records contain null values, all related to the lack of a track name:
NA_tracks <- spotify_songs %>%
filter(is.na(track_name)) %>%
select(track_id, track_name, track_popularity, playlist_name)
NA_tracks
## # A tibble: 5 x 4
## track_id track_name track_popularity playlist_name
## <chr> <chr> <dbl> <chr>
## 1 69gRFGOWY9OMpFJgFol1u0 <NA> 0 "HIP&HOP"
## 2 5cjecvX0CmC9gK0Laf5EMQ <NA> 0 "GANGSTA Rap"
## 3 5TTzhRSWQS4Yu8xTgAuq6D <NA> 0 "GANGSTA Rap"
## 4 3VKFip3OdAvv4OfNTgFWeQ <NA> 0 "Reggaeton viejito\U0001f5~
## 5 69gRFGOWY9OMpFJgFol1u0 <NA> 0 "latin hip hop"
We’ll get to these later (thankfully, we’ll be dropping the playlist variables, so we won’t have to deal with the pesky fire emoji). For now, let’s get to cleaning: removing duplicated albums and dropping some unnecessary variables as well:
spotify_mult_album_ids <- spotify_songs %>%
  filter(!is.na(track_album_name)) %>%
  select(track_name, track_artist, track_album_id, track_album_name, track_album_release_date) %>%
  ## We are only going to go off of year moving forward, so we're going to parse out year from release date
  ## The reasons are two-fold:
  ## 1) Ease of use
  ## 2) Differing date formats within the variable
  mutate(album_release_year = sub(".*(\\d{4}).*$", "\\1", track_album_release_date)) %>%
  select(-track_album_release_date) %>%
  mutate(track_album_name = tolower(track_album_name)) %>%
  group_by(track_name, track_artist, track_album_name, album_release_year) %>%
  summarise(num_album_ids = n_distinct(track_album_id)) %>%
  ungroup() %>%
  ## Keep only track/album combinations that map to more than one album ID
  filter(num_album_ids > 1)
rank_album_by_row <- spotify_songs %>%
  mutate(track_album_name = tolower(track_album_name)) %>%
  mutate(album_release_year = sub(".*(\\d{4}).*$", "\\1", track_album_release_date)) %>%
  select(-track_album_release_date) %>%
  inner_join(spotify_mult_album_ids, by = c("track_name","track_artist","track_album_name","album_release_year")) %>%
  select(track_album_name, track_album_id, duration_ms) %>%
  unique() %>%
  group_by(track_album_name) %>%
  ## order(order(...)) converts the sort order into a rank within each album name
  mutate(album_id_rank = order(order(track_album_id, duration_ms, decreasing = TRUE))) %>%
  ungroup() %>%
  ## Bring in all ranks > 1 so we can antijoin and remove from list
  filter(album_id_rank > 1)
## Let's add the new release date formatting in here permanently with a lower case album:
spotify_songs_new <- spotify_songs %>%
  # Lowercase only the descriptive name columns; the ID columns must keep their original case
  mutate(track_album_name = tolower(track_album_name),
         track_artist = tolower(track_artist),
         track_name = tolower(track_name)) %>%
  mutate(album_release_year = sub(".*(\\d{4}).*$", "\\1", track_album_release_date)) %>%
  select(-track_album_release_date)
spotify_no_dupes <- spotify_songs_new %>%
anti_join(rank_album_by_row, by = c("track_album_name", "track_album_id"))
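The ranking step above relies on the `order(order(...))` idiom, which can look opaque at first glance. A minimal sketch of what it does, using three made-up album IDs (the vector `album_ids` is an assumption for illustration):

```r
# Hypothetical album IDs sharing one album name
album_ids <- c("6C9W", "6YOz", "6NNr")

# Inner order(): positions of the IDs when sorted in descending order
# Outer order(): inverts that permutation, giving each ID its rank
album_id_rank <- order(order(album_ids, decreasing = TRUE))

album_id_rank
```

Here `6YOz` sorts first in descending order and receives rank 1, so a filter on `album_id_rank > 1` would flag only the other two IDs for removal.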
Let’s just check to see how The Reaper fares now:
spotify_no_dupes %>%
filter(track_name == "(don't fear) the reaper") %>%
select(track_id, track_album_name, track_album_id, album_release_year) %>%
unique()
## # A tibble: 2 x 4
## track_id track_album_name track_album_id album_release_ye~
## <chr> <chr> <chr> <chr>
## 1 6NTqBHONQqmud0O~ agents of fortune 6YOzCPyuPC92Eg44a~ 1976
## 2 2NL4BBBSgypHnxU~ the essential blue öyst~ 6NNrQJ8ojvbfFzoUj~ 1972
To be safe, let’s also see if within tracks there is different versioning of a song based on its respective length:
spotify_no_dupes %>%
select(track_id, duration_ms) %>%
group_by(track_id) %>%
summarise(diff_times = n_distinct(duration_ms)) %>%
filter(diff_times > 1)
## # A tibble: 0 x 2
## # ... with 2 variables: track_id <chr>, diff_times <int>
No duplicates come up here either.
Returning to the five unnamed tracks: since they make up such a small share of our sample and are clearly neither popular nor plentiful, we’re going to leave the “GANGSTA Rap” at home and remove them from the dataset, as the genres are still well represented without these five unknown tracks.
spotify_songs_no_NAs <- spotify_no_dupes %>%
anti_join(NA_tracks, by = "track_id") %>%
select(-playlist_name, -playlist_subgenre) %>%
unique()
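For readers less familiar with `anti_join()`, here is a small sketch of the removal pattern used above, on toy stand-ins for the song table and the table of rows to discard (`songs`, `to_drop`, and `kept` are hypothetical names):

```r
library(dplyr)

# Toy song table: track "b" has no name, track "c" is duplicated
songs <- data.frame(
  track_id   = c("a", "b", "c", "c"),
  track_name = c("first", NA, "third", "third"),
  stringsAsFactors = FALSE
)
to_drop <- data.frame(track_id = "b", stringsAsFactors = FALSE)

# anti_join() keeps only the rows of `songs` with no match in `to_drop`;
# unique() then collapses rows that are identical in every column
kept <- songs %>%
  anti_join(to_drop, by = "track_id") %>%
  unique()

kept
```

Note that matching on track_id alone removes every row carrying that ID, regardless of which playlist the row came from.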
If you look closely at the dataset, special characters and punctuation marks are rife within artist names and song titles (looking at you, Ty Dolla $ign). For the most part, song titles, album names, and artist names are still relatively easy to comprehend without these characters, so the initial thought was to keep only alphanumeric characters for the purpose of this analysis. HOWEVER, since non-English characters are represented in this dataset, we’ll use the IDs moving forward for general classification and only search for particular artists as examples down the road. Therefore, these string descriptors will remain as-is.
We still have songs with multiple genres assigned to them, though, based upon the playlists they were placed in.
spotify_songs_no_NAs %>%
select(track_name, playlist_genre) %>%
group_by(track_name) %>%
summarise(diff_genres = n_distinct(playlist_genre)) %>%
ungroup() %>%
filter(diff_genres > 1) %>%
head(10)
## # A tibble: 10 x 2
## track_name diff_genres
## <chr> <int>
## 1 'till i collapse 2
## 2 $ave dat money (feat. fetty wap & rich homie quan) 2
## 3 (no one knows me) like the piano 2
## 4 ...baby one more time 2
## 5 ¿cual es tu plan? 2
## 6 ¿quien tu eres? 2
## 7 + 2
## 8 <U+30AC><U+30E9><U+30B9><U+306E>palm tree 2
## 9 <U+30DC><U+30A4><U+30B9><U+30E1><U+30E2> no. 5 2
## 10 <U+541B><U+306E><U+30CF><U+30FC><U+30C8><U+306F><U+30DE><U+30EA><U+30F3><U+30D6><U+30EB><U+30FC> 2
One track in particular, “10,000 hours (with justin bieber)”, is clearly a cross-genre bridge:
spotify_songs_no_NAs %>%
select(track_name, playlist_genre) %>%
filter(track_name == "10,000 hours (with justin bieber)")
## # A tibble: 5 x 2
## track_name playlist_genre
## <chr> <chr>
## 1 10,000 hours (with justin bieber) pop
## 2 10,000 hours (with justin bieber) latin
## 3 10,000 hours (with justin bieber) r&b
## 4 10,000 hours (with justin bieber) r&b
## 5 10,000 hours (with justin bieber) edm
The main point of the above is awareness. For the purposes of this analysis, we’re more interested in a song’s dynamics through the ages than in categorizing the underlying music trends by genre. As time allows, we will revisit this to potentially map a single genre to each track based on how frequently each genre appears among the playlists the track has been assigned to.
Cleaning and reformatting the data removed 2005 observations, which overall will make our dataset easier to manipulate and ultimately pull insights from.
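One possible way to do that frequency-based mapping is sketched below. It is not part of the analysis itself: the toy rows and the names `track_genres`, `n_playlists`, and `primary_genre` are assumptions for illustration.

```r
library(dplyr)

# Toy track-to-playlist-genre assignments (made up for illustration)
track_genres <- data.frame(
  track_id       = c("t1", "t1", "t1", "t2", "t2"),
  playlist_genre = c("r&b", "r&b", "pop", "rock", "rock"),
  stringsAsFactors = FALSE
)

# Count how often each genre occurs per track, then keep the most frequent one
primary_genre <- track_genres %>%
  count(track_id, playlist_genre, name = "n_playlists") %>%
  group_by(track_id) %>%
  slice_max(n_playlists, n = 1, with_ties = FALSE) %>%
  ungroup()

primary_genre
```

With `with_ties = FALSE`, a track whose top genres are tied keeps just one of them, so a real implementation would likely want an explicit tie-breaking rule.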
summary(spotify_songs_no_NAs)
A few key things to note from the data summary before diving into the proposed EDA:
# Pitch / Key
summary(spotify_songs_no_NAs$key)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 2.000 6.000 5.376 9.000 11.000
# Spoken Word
summary(spotify_songs_no_NAs$speechiness)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0409 0.0626 0.1073 0.1320 0.9180
# Acoustics
summary(spotify_songs_no_NAs$acousticness)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0153 0.0826 0.1783 0.2600 0.9940
# Instrumentals
summary(spotify_songs_no_NAs$instrumentalness)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000000 0.0000000 0.0000167 0.0865119 0.0050025 0.9940000
# Loudness
summary(spotify_songs_no_NAs$loudness)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -46.448 -8.194 -6.185 -6.742 -4.665 1.275
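Because key is an integer pitch class rather than a continuous measure, its mean and quartiles are hard to interpret directly, and a frequency table may be more informative. The sketch below assumes Spotify's pitch-class encoding (0 = C, 1 = C#, and so on up to 11 = B) and uses a made-up vector `keys` in place of the real column:

```r
# Hypothetical key values standing in for spotify_songs_no_NAs$key
keys <- c(0, 2, 2, 7, 9, 11, 7, 7)

pitch_classes <- c("C", "C#", "D", "D#", "E", "F",
                   "F#", "G", "G#", "A", "A#", "B")

# Treat key as a labelled category rather than a number
key_factor <- factor(keys, levels = 0:11, labels = pitch_classes)

# Tabulate how many tracks fall in each pitch class
key_counts <- table(key_factor)
key_counts
```

The same idea could be applied per decade to see whether certain keys have become more or less common over time.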