Billboard 100 and Spotify Exploratory Data Analysis
12/07/2021
CIS 331 Section 01
Amela Aganovic
Jacob Heath
Tuyetnhu Pham
Stella Rodriguez
In this data analytics project, our team analyzed songs and the data that Spotify keeps on them in its backend. The three datasets we used were:
- Spotify_songs
- Billboard
- Audio_features
All data was sourced from TidyTuesday.
Our primary question of interest is: what makes a song popular? To explore this, we conducted exploratory data analysis with a focus on the variables that measure song popularity and on any other variables that were correlated with them.
Our key findings showed that some factors, such as loudness and danceability, correlate with Spotify track popularity. We also offer business recommendations for Spotify and for musical performers on using models that predict track popularity to generate more revenue.
Spotify is a digital music streaming service that gives a user access to millions of songs, podcasts, and videos from artists all over the world. It offers a free tier as well as a paid Premium subscription plan. It can be accessed on multiple devices such as personal computers, mobile browsers, phone applications, and smart devices such as an Amazon Echo or Google Home.
For our project, we are looking at the music streaming industry. Music streaming services operate massive databases and are sources of big data that can be mined and analyzed to unlock insights and identify areas of business value. More specifically, we focus on Spotify’s audio streaming service. We also investigate data on songs from the Billboard Top 100 Chart so that we can compare them with Spotify’s data.
Our data source is the TidyTuesday project on GitHub. We used three datasets:
- Spotify_songs from the Spotify package: audio track info from Spotify (songs from 1957 - 2020)
- Billboard dataset from the Billboard 100 (info from 1958 to 2021)
- Audio_features: Spotify track information for Billboard dataset observations
Of the three datasets we used, spotify_songs is very clean out of the box, while audio_features and the billboard dataset are fairly messy and contain a large number of null values. The three datasets contain many variables, all of which are either character strings or numeric. The spotify_songs dataset that we analyze covers songs released from 1957 to 2020. The Billboard 100 dataset contains data from 1958 to 2021.
Our Spotify data was collected with the spotifyr package, which pulls audio track information from Spotify’s Web API in bulk. The Spotify Web API is an interface that programs can use to get and manage Spotify data over the internet. It provides artist, album, and track data, as well as audio features like acousticness, liveness, and instrumentalness (Hughes, 2015).
One article that relates to our questions of interest in music and Spotify is “What Makes a Song Trend? Cluster Analysis of Musical Attributes for Spotify Top Trending Songs,” where the authors perform cluster analysis on the Top 100 Trending songs on Spotify for 2017 and 2018. They used attributes of the songs, like danceability, loudness, and speechiness, to see whether the most popular songs shared trends in those attributes (Al-Beitawi et al., 2020). Another article is “A Robust Approach to Predict the Popularity of Songs by Identifying Appropriate Properties,” where the author used machine learning to predict a song’s popularity. The author extracted data from a song listing website and analyzed audio characteristics, like beats per minute, of popular and unpopular songs (Anjana, 2015).
Our main goal of this project is to explore what aspects of music make particular songs more popular than others. In order to determine these key factors we will explore the questions listed below to get a better understanding of our data and what makes people enjoy music. Hopefully after research and experimentation we will be able to identify the major components of music that make songs popular.
Initial questions of interest:
Our initial hypotheses are:
We began by reading our three datasets from their TidyTuesday GitHub repositories. The biggest challenge was combining the Billboard 100 dataset with its related audio features dataset so that each observation had Spotify song data. Luckily, the two datasets share a common key column, song_id. It took some time to decide what type of “join” or “merge” to use on the tables. Once the tables were merged, we removed rows with NULL values for Spotify data, duplicate data, and columns that were not needed for analysis.
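The choice of join can be illustrated on toy data. The following is a minimal sketch, not the project code: the `chart` and `features` tables and their values are made up, and only the `song_id` key name comes from our datasets. It shows that a left join keeps every chart row and fills unmatched songs with NA, and that dropping those NA rows afterwards leaves the same songs an inner join would return.

```r
library(dplyr)
library(tidyr)

# Toy stand-ins for the real tables; all values are hypothetical.
chart <- data.frame(
  song_id = c("a", "b", "c"),
  week_position = c(1, 5, 9)
)
features <- data.frame(
  song_id = c("a", "b"),
  danceability = c(0.8, 0.6)
)

# left_join keeps every chart row; songs without features get NA.
left_joined <- chart %>% left_join(features, by = "song_id")

# Dropping the NA rows afterwards keeps only the matched songs,
# the same songs an inner_join returns directly.
left_then_dropped <- left_joined %>% drop_na(danceability)
inner_joined <- chart %>% inner_join(features, by = "song_id")
```

Using a left join first makes it possible to count how many chart rows lack Spotify data before those rows are removed.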
In the combined Billboard 100 dataset with audio features, we replaced the 0-11 values in the “key” column with the corresponding names of the musical keys (0 = C, 1 = C#, and so on) for readability and easier visualization. For the same reason, we replaced the numeric values 0 and 1 in the “mode” column with “minor” and “major” respectively.
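As an alternative to replacing each code one at a time, the same mapping can be sketched with lookup vectors. This is a compact illustration under assumptions: `key_codes` and `mode_codes` are hypothetical sample inputs, and the key names follow the 0 = C, 1 = C# ordering described above.

```r
# Key names indexed by the 0-11 key codes, and mode names by 0/1.
key_names <- c("C", "C#/Db", "D", "D#/Eb", "E", "F",
               "F#/Gb", "G", "G#/Ab", "A", "A#/Bb", "B")
mode_names <- c("minor", "major")

# Hypothetical sample of numeric codes.
key_codes <- c(0, 1, 11, 7)
mode_codes <- c(1, 0, 0, 1)

# R vectors are 1-indexed, so shift each code by one before indexing.
key_labels <- key_names[key_codes + 1]
mode_labels <- mode_names[mode_codes + 1]
```

A lookup vector keeps the whole mapping in one place, which makes a mislabeled key easier to spot than in twelve separate assignments.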
if (!require("flexdashboard")) install.packages("flexdashboard")
library(dplyr)
library(tidyr)
library(readr)
library(stringr)
library(magrittr)
library(ggplot2)
library(flexdashboard)

# Loading the data
spotify_songs <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-01-21/spotify_songs.csv')
billboard <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-09-14/billboard.csv')
audio_features <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-09-14/audio_features.csv')

# Adding a year column to the billboard dataset (last four characters of week_id)
billboard$year <- as.numeric(str_sub(billboard$week_id, -4))
# Getting rid of columns not needed for analysis
audio_features$spotify_track_id <- NULL
audio_features$spotify_track_preview_url <- NULL
billboard$url <- NULL
billboard$instance <- NULL

# Joining billboard and audio_features tables using common song_id column
billboard_full <- billboard %>% left_join(audio_features, by = "song_id")
# Dropping rows with no Spotify data
billboard_full <- billboard_full %>%
  drop_na(spotify_track_duration_ms, spotify_track_explicit, spotify_track_album,
          instrumentalness, liveness, valence, tempo, time_signature,
          spotify_track_popularity)
# Checking for inconsistencies in song names / performer names
billboard_full %>%
  filter(performer.x != performer.y) %>%
  select(performer.x, performer.y, song.x, song.y)
billboard_full %>%
  filter(song.x != song.y) %>%
  select(song.x, song.y, performer.x, performer.y)
# Remove duplicate / unneeded columns
billboard_full$performer.y <- NULL
billboard_full$song.y <- NULL
billboard_full$song_id <- NULL
# Changing key numbers to key names in billboard dataset
billboard_full$key[billboard_full$key == 0] <- "C"
billboard_full$key[billboard_full$key == 1] <- "C#/Db"
billboard_full$key[billboard_full$key == 2] <- "D"
billboard_full$key[billboard_full$key == 3] <- "D#/Eb"
billboard_full$key[billboard_full$key == 4] <- "E"
billboard_full$key[billboard_full$key == 5] <- "F"
billboard_full$key[billboard_full$key == 6] <- "F#/Gb"
billboard_full$key[billboard_full$key == 7] <- "G"
billboard_full$key[billboard_full$key == 8] <- "G#/Ab"
billboard_full$key[billboard_full$key == 9] <- "A"
billboard_full$key[billboard_full$key == 10] <- "A#/Bb"
billboard_full$key[billboard_full$key == 11] <- "B"
# Changing mode numbers to mode names in billboard dataset
billboard_full$mode[billboard_full$mode == 0] <- "minor"
billboard_full$mode[billboard_full$mode == 1] <- "major"
# Rename columns
billboard_full <- billboard_full %>% rename(song = song.x, performer = performer.x)
billboard_full %>% head(5)

Our process for analyzing the data was exploratory in nature. We mainly computed averages of different variables and visualized them in relation to other variables to see if any correlations were apparent. We used our initial questions of interest to guide our analysis. We conducted analysis on both the Billboard 100 dataset and the Spotify Songs random sample dataset.
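The kind of grouped average described above can be sketched as follows. This is an illustration on made-up rows rather than our actual analysis code; the column name `track_popularity` is an assumption about the spotify_songs schema, and the values are hypothetical.

```r
library(dplyr)

# Hypothetical rows standing in for spotify_songs; values are made up.
songs <- data.frame(
  playlist_genre = c("pop", "pop", "rock", "rock", "latin"),
  track_popularity = c(80, 60, 50, 40, 65)
)

# Mean popularity per genre, sorted from most to least popular.
genre_popularity <- songs %>%
  group_by(playlist_genre) %>%
  summarise(mean_popularity = mean(track_popularity, na.rm = TRUE)) %>%
  arrange(desc(mean_popularity))
```

The resulting table has one row per genre and can be passed straight to ggplot2 for a bar chart.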
Fig. 1 shows that, in our Spotify Songs random sample, tracks in Pop playlists have the highest average popularity. This supports one of our initial hypotheses, that Pop is the most popular Spotify genre.
However, we did not expect Latin to come in a close second.
Again in our Spotify Songs random sample, Fig. 2 shows that many sub-genres related to pop, such as post-teen pop, hip pop, and dance pop, have higher Spotify popularity. As in Fig. 1, Latin sub-genres like reggaeton and latin pop are also among the most popular.
Fig. 3 shows the mean Spotify popularity for each musical key. There is little variation between the values, so key does not appear to be strongly associated with how popular a song is. This contradicts our initial prediction that “C” would be the most popular key, even though it is one of the most commonly used keys in music.
Surprisingly, in our Billboard 100 data, songs in minor keys had a higher average Spotify popularity than songs in major keys (Fig. 4).
Fig. 5 shows the top 10 performers with the highest average Spotify track popularity.
Fig. 6 shows the top 10 songs with the most weeks at number 1 on the Billboard 100 Chart. Many of these songs are classified as Pop songs.
Fig. 7 shows the top 10 artists with the most weeks at number 1 on the Billboard 100 Chart. A majority of these performers are Pop performers.
Fig. 8 and Fig. 9 show the average danceability and loudness at each value of Spotify track popularity. Both features trend generally upward as popularity increases, especially in the Billboard dataset. There is some interesting variation at the higher popularity values that could be investigated further. There is also a discrepancy between the values for our Spotify random sample and the Billboard dataset, which may be influenced by the number of observations in each (the Billboard dataset has 200,000+ observations while spotify_songs has 32,000+).
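The per-popularity averages behind figures of this kind can be sketched as below. This is a minimal illustration on made-up values, not the code that produced our figures; the data frame and its numbers are hypothetical. The `n_tracks` column is included because averages based on only a few tracks are noisier, which may be one source of the variation at high popularity values.

```r
library(dplyr)

# Hypothetical tracks; values are made up for illustration.
songs <- data.frame(
  track_popularity = c(10, 10, 50, 50, 90, 90),
  danceability = c(0.40, 0.50, 0.55, 0.65, 0.70, 0.80),
  loudness = c(-12, -10, -9, -7, -6, -4)
)

# Average each audio feature at every popularity value,
# keeping the number of tracks behind each average.
feature_by_popularity <- songs %>%
  group_by(track_popularity) %>%
  summarise(
    mean_danceability = mean(danceability),
    mean_loudness = mean(loudness),
    n_tracks = n()
  )

# A single-number summary of the trend on the raw rows.
dance_cor <- cor(songs$track_popularity, songs$danceability)
```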
The next tables display statistics on Spotify track song durations.
We can see that the artist with the longest song duration on average is Newcleus, with an average of 516760 milliseconds, which is about 8.61 minutes.
# A tibble: 5 x 2
  track_artist                    mean_dur
  <chr>                              <dbl>
1 Newcleus                          516760
2 Don McLean                        516209
3 V-Sag                             515703
4 The Joneses                       515680
5 Dele Sosimi Afrobeat Orchestra    512100
The Spotify playlist genre with the longest song durations on average is rock with an average duration of 248576.5 milliseconds, which is about 4.1 minutes.
# A tibble: 6 x 2
  playlist_genre mean_dur
  <chr>             <dbl>
1 rock            248577.
2 r&b             237599.
3 edm             222541.
4 pop             217768.
5 latin           216863.
6 rap             214164.
The Spotify playlist subgenre with the longest song durations on average is new jack swing with an average duration of 275128.6 milliseconds, which is about 4.6 minutes.
# A tibble: 5 x 2
  playlist_subgenre         mean_dur
  <chr>                        <dbl>
1 new jack swing             275129.
2 classic rock               256667.
3 album rock                 255363.
4 progressive electro house  254006.
5 southern hip hop           247029.
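The conversions from milliseconds to minutes quoted with the duration tables can be checked with a short sketch. The helper `ms_to_minutes` is a hypothetical name introduced here for illustration; the three inputs are the top mean durations reported above.

```r
# One minute is 60,000 milliseconds.
ms_to_minutes <- function(ms) ms / 60000

# Top mean durations reported in the tables above, in milliseconds.
reported_ms <- c(artist = 516760, genre = 248576.5, subgenre = 275128.6)

# Durations in minutes, rounded to two decimal places.
reported_min <- round(ms_to_minutes(reported_ms), 2)
```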
# A tibble: 1 x 1
avg_len
<dbl>
1 69.7
Fig. 10 shows the average Spotify track popularity by release year. In general, average popularity increases with release year. However, the release year with the highest average popularity is 2019, not 2020 or 2021.
The results of our findings in relation to our initial hypotheses are as follows:
None of the results from our study were highly significant or conclusive, and it was difficult to make any certain conclusions from the analysis we performed. We can say that there are a lot of variables that go into what makes a song popular.
One limitation of our analysis was that it was wide in scope but shallow in depth. More investigation could be done on specific songs according to their features (such as loudness, speechiness, etc.).
In the future, we would have to look at more data and perform more advanced visualization and trend analysis to see if any more insights could be extracted from it. It would be easier to make more definitive conclusions with a much larger volume of data. Spotify itself has data on around 70 million tracks, and our analysis was only performed on a very small percentage of that.
Our business recommendations are:
Al-Beitawi, Z., Salehan, M., & Zhang, S. (2020). What Makes a Song Trend? Cluster Analysis of Musical Attributes for Spotify Top Trending Songs. Journal of Marketing Development and Competitiveness, 14(3), 79-91. http://search.proquest.com.ezproxy.gvsu.edu/scholarly-journals/what-makes-song-trend-cluster-analysis-musical/docview/2444523984/se-2?accountid=39473
Anjana, S. (2015). A Robust Approach to Predict the Popularity of Songs by Identifying Appropriate Properties. Retrieved November 19, 2021, from https://dl.ucsc.cmb.ac.lk/jspui/bitstream/123456789/4157/1/2015%20CS%20013.pdf.
Hughes, C. (2015, March 9). Understanding the Spotify Web API. Spotify Engineering. Retrieved November 19, 2021, from https://engineering.atspotify.com/2015/03/09/understanding-spotify-web-api/#:~:text=What%20is%20the%20Spotify%20Web,used%20by%20every%20internet%20browser.