Project Introduction

In recent years, if you’ve had access to any form of social media, the internet, cable, or even public television, you’ve surely seen reports and studies broadcast about how the Millennial generation is to blame for ruining a multitude of things: cash, marriage, and the napkin industry, to name a few. But an argument also exists out there on the interwebs claiming that we’ve ruined music, too. The commonplace sentiment that “all music sounds the same” is coupled with nostalgia for the variety and perceived higher quality of music of the past.

So the question must be asked: has music really changed all that much since the onset of musical recordings? Are artists simply doing what they’ve always done, albeit now in the presence of improved technology and omnipresent social media? And are the people crucifying Millennials for the current musical state of the world really just curmudgeons bemoaning a new iteration of “You kids get off my lawn”? Music changes through the ages, without a doubt. But is music today simply a reflection of the past, or are times really changing, for better or worse?

Data for this analysis was scraped from the music streaming service Spotify. Using each track’s release date, a year and decade will be assigned to each song in the dataset. Variation in key, tempo, and other assorted musical metrics will be analyzed for each decade as a whole, and also for each genre within a decade. Finally, the most popular artists from each decade will be examined to see if there are any themes to musical popularity (a “recipe for success,” if you will) or if what was once popular has dwindled…much like the supposedly shrinking Millennial attention span and its potential correlation with shortened song duration.
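As a rough sketch of the decade assignment described above, a release year can be bucketed with integer division. The years below are illustrative values rather than rows from the dataset, and the variable names are assumptions made for this example.

```r
# Toy release years; illustrative values, not taken from the dataset.
release_year <- c(1976, 1984, 1999, 2008, 2019)

# Integer-divide by 10 to drop the final digit, then scale back up.
decade <- (release_year %/% 10) * 10

# A readable label such as "1970s" for grouping and plotting.
decade_label <- paste0(decade, "s")

data.frame(release_year, decade, decade_label)
```

Keeping both the numeric decade and the text label is a convenience: the number sorts correctly, while the label reads better on a plot axis.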

Package Requirements

The following four packages will be used for the overall analysis:

  1. dplyr: general data manipulation and pipe operator
  2. stringr: string manipulation and data cleaning
  3. ggplot2: basic graph and visualization formatting for future EDA
  4. readr: read in CSVs with strings kept as characters rather than converted to factors
library(dplyr)
library(stringr)
library(ggplot2)
library(readr)

Data Prep

Data Overview

Spotify, as previously stated, is a massive online music streaming platform based out of Sweden that launched in 2008 and has more than 50 million tracks housed within its service. The data we’ll be using for analysis was scraped from Spotify using the spotifyr package created by Charlie Thompson, Josiah Parry, Donal Phipps, and Tom Wolff, and limits the collected songs to 6 different genres (EDM, Pop, Latin, R&B, Rap, Rock) throughout the ages. This data can be found here.

A brief overview of the variables housed within the dataset (also found at the location above) is listed below:

  • track_id: unique song identifier
  • track_name: song name
  • track_artist: singer
  • track_popularity: a 0-100 index to gauge song popularity (higher = more popular)
  • track_album_id: unique album identifier
  • track_album_name: respective song’s album it can be found on
  • track_album_release_date: date album was released
  • playlist_name: name of playlist respective song can be found on
  • playlist_id: unique playlist identifier
  • playlist_genre: main genre of playlist
  • playlist_subgenre: secondary genre of playlist
  • danceability: combination of tempo, rhythm, beat, and regularity to determine suitability for dancing, 0 being low and 1 being high
  • energy: activity and intensity measures of a song, 0 being low and 1 being high
  • key: track’s general key, starting at 0 (C) and increasing according to pitch class notation (no key is -1)
  • loudness: average loudness of a track in decibels (dB), typically between -60 and 0
  • mode: major (1) or minor (0) track
  • speechiness: 0 to 1 scale indicating presence of speech, with values over 0.66 suggesting a track made almost entirely of spoken word and values under 0.33 most likely representing music
  • acousticness: 0 to 1 (near-certain) scale for confidence in a track being acoustic
  • instrumentalness: 0 to 1 (near-certain) scale for confidence in a track being instrumental
  • liveness: 0 to 1 (near-certain) scale for confidence in a track being a live recording
  • valence: 0 to 1 scale of musical positivity, where 0 is more “sad” and 1 is more “happy”
  • tempo: beats per minute (BPM) estimate for a track
  • duration_ms: duration of a song in milliseconds
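To make the coded variables above easier to read, the sketch below translates key and mode codes into a label and converts a duration to minutes. It assumes the 0-based pitch class coding (0 = C); `key_label` and `pitch_classes` are hypothetical names introduced for this example, and the input values are made up.

```r
# Pitch class names in order, so index 1 holds key 0 (C).
pitch_classes <- c("C", "C#/Db", "D", "D#/Eb", "E", "F",
                   "F#/Gb", "G", "G#/Ab", "A", "A#/Bb", "B")

# Hypothetical helper: translate key and mode codes into a readable label.
key_label <- function(key, mode) {
  # key == -1 means no key was detected, so those entries become NA
  name <- ifelse(key < 0, NA_character_, pitch_classes[pmax(key, 0) + 1])
  quality <- ifelse(mode == 1, "major", "minor")
  ifelse(is.na(name), NA_character_, paste(name, quality))
}

# Toy values for illustration only
key_label(key = c(0, 5, 9, -1), mode = c(1, 0, 0, 1))

# Convert durations from milliseconds to minutes
duration_ms <- c(210000, 185500)
duration_min <- duration_ms / 60000
duration_min
```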

Data Cleaning

spotify_songs <- read_csv("spotify_songs.csv")

While there are only 28356 distinct track IDs in the dataset, there are 32833 observations. Furthermore, there are 23449 distinct track names. This indicates that in this data structure, a track (track_id) can be assigned to multiple playlists, and that the identifier reflects a unique combination of album and track.

spotify_songs %>% 
  select(track_name, track_album_id, playlist_id) %>% 
  group_by(track_name) %>% 
  summarise(num_albums = n_distinct(track_album_id), num_playlists = n_distinct(playlist_id)) %>% 
  filter(num_albums > 1 | num_playlists > 1) %>% 
  head(10)
## # A tibble: 10 x 3
##    track_name                                           num_albums num_playlists
##    <chr>                                                     <int>         <int>
##  1 '39 - 2011 Mix                                                2             2
##  2 'Till I Collapse                                              2             4
##  3 #1 Stunna                                                     2             2
##  4 $ave Dat Money (feat. Fetty Wap & Rich Homie Quan)            1             2
##  5 (Don't Fear) The Reaper                                       3             6
##  6 (Don't Fear) The Reaper - Single Version                      1             2
##  7 (Feels Like) Heaven                                           2             2
##  8 (I Can't Get No) Satisfaction - Mono Version                  2             3
##  9 (I Can't Get No) Satisfaction - Mono Version / Rema~          2             2
## 10 (I Just) Died In Your Arms                                    2             2
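The comparison of observations against distinct identifiers can be reproduced on a small scale. This is a minimal sketch on made-up data (the `toy_songs` data frame is an assumption for illustration) using base R; `dplyr::n_distinct()` would give the same distinct counts.

```r
# Toy stand-in for the songs table; IDs and names are made up.
toy_songs <- data.frame(
  track_id    = c("a1", "a1", "b2", "c3"),
  track_name  = c("Song X", "Song X", "Song X", "Song Y"),
  playlist_id = c("p1", "p2", "p1", "p3")
)

n_rows  <- nrow(toy_songs)                       # observations
n_ids   <- length(unique(toy_songs$track_id))    # distinct track IDs
n_names <- length(unique(toy_songs$track_name))  # distinct track names

c(rows = n_rows, ids = n_ids, names = n_names)
```

More rows than IDs points to a track sitting on several playlists, while more IDs than names points to the same song title carrying several IDs.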

Clearly, “(Don’t Fear) The Reaper” is getting some play time in recent history. Now let’s look at its track_id, track_album_name, track_album_id, and track_album_release_date to confirm the hypothesis that track_id is really a unique identifier not just for a track, but for a track and album combination.

spotify_songs %>% 
  filter(track_name == "(Don't Fear) The Reaper") %>% 
  select(track_id, track_album_name, track_album_id, track_album_release_date)
## # A tibble: 6 x 4
##   track_id        track_album_name        track_album_id     track_album_releas~
##   <chr>           <chr>                   <chr>              <chr>              
## 1 6NTqBHONQqmud0~ Agents of Fortune       6YOzCPyuPC92Eg44a~ 1976               
## 2 5QTxFnGygVM4jF~ Agents Of Fortune       6C9WzlQANeoD0GW5B~ 1976               
## 3 6NTqBHONQqmud0~ Agents of Fortune       6YOzCPyuPC92Eg44a~ 1976               
## 4 5QTxFnGygVM4jF~ Agents Of Fortune       6C9WzlQANeoD0GW5B~ 1976               
## 5 2NL4BBBSgypHnx~ The Essential Blue Öys~ 6NNrQJ8ojvbfFzoUj~ 1972               
## 6 2NL4BBBSgypHnx~ The Essential Blue Öys~ 6NNrQJ8ojvbfFzoUj~ 1972

This uncovers something unexpected about our data format: the same track_album_name appearing in differently capitalized forms carries multiple album IDs when it’s clearly the same album (name and release year as indicators), all due to a subjective capitalization of “of”.
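One way to surface this kind of capitalization mismatch is to normalize case before counting album IDs. The sketch below uses base R on made-up records; the IDs and the `name_key` column are assumptions for illustration.

```r
# Toy album records mirroring the capitalization mismatch; IDs are made up.
albums <- data.frame(
  track_album_id   = c("id_A", "id_B", "id_C"),
  track_album_name = c("Agents of Fortune", "Agents Of Fortune", "Some Other Album"),
  release_year     = c("1976", "1976", "1972")
)

# Normalize case so names differing only in capitalization compare equal.
albums$name_key <- tolower(albums$track_album_name)

# Count distinct album IDs per normalized name and release year.
ids_per_album <- aggregate(track_album_id ~ name_key + release_year,
                           data = albums,
                           FUN = function(x) length(unique(x)))

# Rows with more than one ID are candidate duplicates.
dupes <- ids_per_album[ids_per_album$track_album_id > 1, ]
dupes
```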

For the purpose of this analysis, we will not be investigating playlist names or subgenres, so these variables can be dropped. Moving forward, we need to be careful to note that songs do appear on varying albums, but we will keep these multiple albums, as they are not only unique indicators for our tracks but can also be a cause of a song’s increased popularity.

Also, after some digging, there are only 5 records that contain missing (NA) values, all related to a missing track name:

NA_tracks <- spotify_songs %>% 
  filter(is.na(track_name)) %>% 
  select(track_id, track_name, track_popularity, playlist_name)
NA_tracks
## # A tibble: 5 x 4
##   track_id               track_name track_popularity playlist_name              
##   <chr>                  <chr>                 <dbl> <chr>                      
## 1 69gRFGOWY9OMpFJgFol1u0 <NA>                      0 "HIP&HOP"                  
## 2 5cjecvX0CmC9gK0Laf5EMQ <NA>                      0 "GANGSTA Rap"              
## 3 5TTzhRSWQS4Yu8xTgAuq6D <NA>                      0 "GANGSTA Rap"              
## 4 3VKFip3OdAvv4OfNTgFWeQ <NA>                      0 "Reggaeton viejito\U0001f5~
## 5 69gRFGOWY9OMpFJgFol1u0 <NA>                      0 "latin hip hop"

We’ll get to these later (and thankfully we’ll be dropping playlists, so we won’t have to deal with the pesky fire emoji). For now, let’s get to cleaning by removing duplicated albums and dropping some unnecessary variables as well:

spotify_mult_album_ids <- spotify_songs %>% 
  filter(!is.na(track_album_name)) %>% 
  select(track_name, track_artist, track_album_id, track_album_name, track_album_release_date) %>% 
  ## We are only going to go off of year moving forward, so we're going to parse out year from release date
  ## The reasons are two-fold: 
  ##    1) Ease of use 
  ##    2) Differing date formats within the variable
  mutate(album_release_year = sub(".*(\\d{4}).*$", "\\1", track_album_release_date)) %>% 
  select(-track_album_release_date) %>% 
  mutate(track_album_name = tolower(track_album_name)) %>% 
  group_by(track_name, track_artist, track_album_name, album_release_year) %>% 
  summarise(num_album_ids = n_distinct(track_album_id)) %>% 
  ungroup() %>% 
  filter(num_album_ids > 1)

rank_album_by_row <- spotify_songs %>% 
  mutate(track_album_name = tolower(track_album_name)) %>% 
  mutate(album_release_year = sub(".*(\\d{4}).*$", "\\1", track_album_release_date)) %>% 
  select(-track_album_release_date) %>% 
  inner_join(spotify_mult_album_ids, by = c("track_name","track_artist","track_album_name","album_release_year")) %>% 
  select(track_album_name, track_album_id, duration_ms) %>% 
  unique() %>% 
  group_by(track_album_name) %>% 
  mutate(album_id_rank = order(order(track_album_id, duration_ms, decreasing = TRUE))) %>% 
  ungroup() %>% 
  ## Bring in all ranks > 1 so we can antijoin and remove from list
  filter(album_id_rank > 1)

## Let's add the new release date formatting in here permanently with a lower case album:
spotify_songs_new <- spotify_songs %>% 
  # Lower-case only the descriptive text columns; the identifier columns must keep their original case
  mutate(track_album_name = tolower(track_album_name), 
         track_artist = tolower(track_artist),
         track_name = tolower(track_name)) %>% 
  mutate(album_release_year = sub(".*(\\d{4}).*$", "\\1", track_album_release_date)) %>% 
  select(-track_album_release_date)

spotify_no_dupes <- spotify_songs_new %>% 
  anti_join(rank_album_by_row, by = c("track_album_name", "track_album_id"))

Let’s just check to see how The Reaper fares now:

spotify_no_dupes %>% 
  filter(track_name == "(don't fear) the reaper") %>% 
  select(track_id, track_album_name, track_album_id, album_release_year) %>% 
  unique()
## # A tibble: 2 x 4
##   track_id         track_album_name         track_album_id     album_release_ye~
##   <chr>            <chr>                    <chr>              <chr>            
## 1 6NTqBHONQqmud0O~ agents of fortune        6YOzCPyuPC92Eg44a~ 1976             
## 2 2NL4BBBSgypHnxU~ the essential blue öyst~ 6NNrQJ8ojvbfFzoUj~ 1972
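The year-parsing step used in the cleaning code can be checked in isolation. This sketch assumes every release date begins with a four-digit year, whether or not a month and day follow; the example strings are illustrative rather than rows from the dataset, and the pattern is anchored at the start of the string, which is a slightly stricter variant of the one used above.

```r
# Example release dates in mixed formats (illustrative values only).
release_dates <- c("1976", "2012-05", "2019-06-14")

# Capture the four digits at the start of the string and discard the rest.
release_year <- sub("^(\\d{4}).*$", "\\1", release_dates)
release_year

# Convert to integer for arithmetic such as decade bucketing.
release_year_int <- as.integer(release_year)
release_year_int
```

If a value does not start with four digits, `sub()` returns it unchanged, so a quick check that every result has exactly four characters is a cheap safeguard.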

To be safe, let’s also see if within tracks there is different versioning of a song based on its respective length:

spotify_no_dupes %>% 
  select(track_id, duration_ms) %>% 
  group_by(track_id) %>% 
  summarise(diff_times = n_distinct(duration_ms)) %>% 
  filter(diff_times > 1)
## # A tibble: 0 x 2
## # ... with 2 variables: track_id <chr>, diff_times <int>

No dupes come up here either.

Returning to the five records with missing track names: since they are such a small portion of our sample overall and clearly neither popular nor plentiful, we’re going to leave the “GANGSTA Rap” at home and remove them from the dataset, as the genres are still well-represented without these five unknown records.

spotify_songs_no_NAs <- spotify_no_dupes %>% 
  anti_join(NA_tracks, by = "track_id") %>% 
  select(-playlist_name, -playlist_subgenre) %>% 
  unique()
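The effect of removing records by track_id and then de-duplicating can be seen on a small made-up table. This is a base R sketch of the same idea rather than the pipeline itself; the data and the names `songs`, `na_ids`, `kept`, and `cleaned` are assumptions for illustration.

```r
# Toy data: two rows share an ID whose name is missing (values are made up).
songs <- data.frame(
  track_id      = c("t1", "t2", "t2", "t3"),
  track_name    = c("song a", NA, NA, "song c"),
  playlist_name = c("p1", "p2", "p3", "p4")
)

# Collect the IDs of every record with a missing track name.
na_ids <- unique(songs$track_id[is.na(songs$track_name)])

# Keep only rows whose ID is not in that set; this mirrors the effect of
# an anti-join on track_id.
kept <- songs[!songs$track_id %in% na_ids, ]

# Drop the playlist column, then de-duplicate the remaining rows.
cleaned <- unique(kept[, c("track_id", "track_name")])
cleaned
```

Note that filtering on the ID removes every row sharing that ID, which is the intended behavior here because the same unnamed track can appear on more than one playlist.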

If you look closely at the dataset, special characters and punctuation marks are rife within artist names and song titles (looking at you, Ty Dolla $ign). For the most part, song titles, album names, and artist names are still relatively easy to comprehend without these characters, so the initial thought was to keep only alphanumeric characters for the purpose of this analysis. However, since non-English characters are also represented in this dataset, we’ll use the IDs moving forward for general classification and only search for particular artists by name as examples down the road. Therefore, these string descriptors will remain as-is.

But we still have songs with multiple genres assigned to them, based upon the playlists they were placed in.

spotify_songs_no_NAs %>% 
  select(track_name, playlist_genre) %>% 
  group_by(track_name) %>% 
  summarise(diff_genres = n_distinct(playlist_genre)) %>% 
  ungroup() %>% 
  filter(diff_genres > 1) %>% 
  head(10)
## # A tibble: 10 x 2
##    track_name                                         diff_genres
##    <chr>                                                    <int>
##  1 'till i collapse                                             2
##  2 $ave dat money (feat. fetty wap & rich homie quan)           2
##  3 (no one knows me) like the piano                             2
##  4 ...baby one more time                                        2
##  5 ¿cual es tu plan?                                            2
##  6 ¿quien tu eres?                                              2
##  7 +                                                            2
##  8 <U+30AC><U+30E9><U+30B9><U+306E>palm tree                                            2
##  9 <U+30DC><U+30A4><U+30B9><U+30E1><U+30E2> no. 5                                             2
## 10 <U+541B><U+306E><U+30CF><U+30FC><U+30C8><U+306F><U+30DE><U+30EA><U+30F3><U+30D6><U+30EB><U+30FC>                                     2

One example is “10,000 hours (with justin bieber)”, which is clearly a cross-genre bridge.

spotify_songs_no_NAs %>% 
  select(track_name, playlist_genre) %>% 
  filter(track_name == "10,000 hours (with justin bieber)")
## # A tibble: 5 x 2
##   track_name                        playlist_genre
##   <chr>                             <chr>         
## 1 10,000 hours (with justin bieber) pop           
## 2 10,000 hours (with justin bieber) latin         
## 3 10,000 hours (with justin bieber) r&b           
## 4 10,000 hours (with justin bieber) r&b           
## 5 10,000 hours (with justin bieber) edm

The main point of the above is awareness. For the purposes of this analysis, we’re more interested in how a song’s characteristics change through the ages than in categorizing underlying music trends by genre. As time allows, we will revisit this to potentially map a single genre to each track based on how frequently each genre appears among the playlists the track has been assigned to.
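The frequency-based genre mapping mentioned above could look something like the following. This is only a sketch on made-up assignments; `most_common` is a hypothetical helper, and its tie-breaking rule (the alphabetically first genre wins) is an arbitrary assumption that a real analysis might want to replace.

```r
# Toy track-to-playlist-genre assignments (illustrative values only).
track_genres <- data.frame(
  track_id       = c("t1", "t1", "t1", "t2", "t2", "t3"),
  playlist_genre = c("pop", "r&b", "r&b", "edm", "pop", "rock")
)

# Hypothetical helper: return the most frequent value. Ties resolve to the
# alphabetically first genre because table() sorts its names.
most_common <- function(x) {
  counts <- table(x)
  names(counts)[which.max(counts)]
}

# Collapse to one genre per track.
primary_genre <- aggregate(playlist_genre ~ track_id,
                           data = track_genres,
                           FUN = most_common)
primary_genre
```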

From cleaning and reformatting this data, we removed 2005 observations, which overall will make our dataset easier to manipulate and ultimately pull insights from.

Data Summary

summary(spotify_songs_no_NAs)

A few key things to note from the data summary before diving into the proposed EDA:

  1. We seem to float around the key of F a decent amount (remember: 0 corresponds to the key of C, increasing by 1 with each step up in pitch class).
# Pitch / Key
summary(spotify_songs_no_NAs$key)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   2.000   6.000   5.376   9.000  11.000
  2. We’re really taken with our electronic and vocal music and don’t have a wealth of spoken, instrumental, or acoustic tracks at our disposal. Shocking, really, given the world’s fascination with coffee shops.
# Spoken Word
summary(spotify_songs_no_NAs$speechiness)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0409  0.0626  0.1073  0.1320  0.9180
# Acoustics
summary(spotify_songs_no_NAs$acousticness)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0153  0.0826  0.1783  0.2600  0.9940
# Instrumentals
summary(spotify_songs_no_NAs$instrumentalness)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 0.0000000 0.0000000 0.0000167 0.0865119 0.0050025 0.9940000
  3. A lot of music doesn’t take on an aggressively loud tonality.
# Loudness
summary(spotify_songs_no_NAs$loudness)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -46.448  -8.194  -6.185  -6.742  -4.665   1.275
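Because key is a coded category rather than a true quantity, a frequency table can complement the numeric summary shown for it. The sketch below tabulates made-up key codes, assuming 0 = C through 11 = B; the vector names are assumptions for illustration.

```r
# Toy key codes (0 = C ... 11 = B); illustrative values only.
key <- c(0, 1, 1, 5, 7, 7, 7, 11)

pitch_classes <- c("C", "C#/Db", "D", "D#/Eb", "E", "F",
                   "F#/Gb", "G", "G#/Ab", "A", "A#/Bb", "B")

# Store the key as a factor so unused keys still appear with a count of 0.
key_names <- factor(pitch_classes[key + 1], levels = pitch_classes)

# Frequency of each key
key_counts <- table(key_names)
key_counts

# The most frequent key in the toy data
modal_key <- names(key_counts)[which.max(key_counts)]
modal_key
```

A mean or median of the codes can land between two keys that are musically unrelated, so the modal key is often the more interpretable notion of a typical value.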

Next Steps and Proposed EDA