Executive Summary

Billboard 100 and Spotify Exploratory Data Analysis

12/07/2021

CIS 331 Section 01

Amela Aganovic

Jacob Heath

Tuyetnhu Pham

Stella Rodriguez

In this data analytics project, our team analyzed song data, including the audio and popularity data that Spotify keeps on each track. The three datasets we used were:

a random sample of songs on Spotify
a dataset containing Billboard Top 100 song data from 1958-2021
a dataset containing Spotify data that corresponds to the Billboard Top 100 songs

All data was sourced from TidyTuesday.

Our primary question of interest is: what makes a song popular? To explore this, we conducted exploratory data analysis with a focus on variables related to song popularity and any other variables correlated with them.

Our key findings showed that some factors, such as loudness and danceability, correlate with Spotify track popularity. We also offer business recommendations for Spotify and musical performers on using models that predict track popularity to generate more revenue.

Introduction

Spotify is a digital music streaming service that gives users access to millions of songs, podcasts, and videos from artists all over the world. It offers a free, ad-supported tier as well as a paid Premium subscription plan. It can be accessed on multiple devices such as personal computers, mobile browsers, phone applications, and smart devices such as an Amazon Echo or Google Home.

For our project, we are looking at the music streaming industry. Music streaming services operate massive databases and are sources of big data that can be analyzed to unlock insights and identify areas of business value. More specifically, we focus on Spotify’s audio streaming service. We are also interested in data on songs from the Billboard Top 100 chart, which we compare with Spotify’s data.

Our data source is the TidyTuesday project on GitHub. We used three datasets:

spotify_songs, collected with the spotifyr package: audio track info from Spotify (songs from 1957 to 2020)
billboard: the Billboard 100 dataset (info from 1958 to 2021)
audio_features: Spotify track information for the billboard dataset observations

One of the three datasets, spotify_songs, is very clean out of the box, while audio_features and billboard are fairly messy, with a large number of null values. The three datasets contain many variables, each stored as either a character string or a number. The spotify_songs dataset covers songs released from 1957 to 2020. The Billboard 100 dataset contains data from 1958 to 2021.

The Spotify data was collected with the spotifyr package, which pulls audio track information from Spotify’s Web API in bulk. The Spotify Web API is an interface that programs can use to get and manage Spotify data over the internet. It provides artist, album, and track data, as well as audio features like acousticness, liveness, and instrumentalness (Hughes, 2015).

One article that relates to our questions of interest in music and Spotify is “What Makes a Song Trend? Cluster Analysis of Musical Attributes for Spotify Top Trending Songs,” where the authors perform cluster analysis on the Top 100 Trending songs on Spotify in 2017 and 2018. They used attributes of the songs, like danceability, loudness, and speechiness, to see if there were trends in the attributes of the most popular songs (Al-Beitawi et al., 2020). Another article is “A Robust Approach to Predict the Popularity of Songs by Identifying Appropriate Properties,” where the author used machine learning to predict a song’s popularity (Anjana, 2015). The author extracted data from a song listing website and analyzed audio characteristics, like beats per minute, of popular and unpopular songs.

Initial Hypotheses

The main goal of this project is to explore what aspects of music make particular songs more popular than others. To determine these key factors, we explore the questions listed below to get a better understanding of our data and of what makes people enjoy music. After research and experimentation, we hope to identify the major components of music that make songs popular.

Initial questions of interest:

What makes a song popular? What do popular songs have in common?
Do songs with higher tempo and danceability have higher popularity?
What is the most listened to genre/subgenre/artist?
What key is used for the songs with the highest popularity?
Do songs with older release dates have the same popularity as songs released recently?
What genre/subgenre has the longest song durations on average?
What artist has the longest song duration on average?
How many songs are in a playlist on average (based on playlist_name)?

Our initial hypotheses are:

Popular songs have higher danceability on average.
Pop and pop-related song genres have more popular songs.
The key of C is a popular musical key.
Songs in the major mode are more popular than songs in the minor mode.
Songs released more recently (within the last 5 years) will have higher popularity than older songs.

Data Preparation

We began by reading our three datasets from their TidyTuesday GitHub repositories. The biggest challenge was combining the Billboard 100 dataset with its related audio features dataset in order to have Spotify song data for each observation. Luckily, the two datasets were linked by a common key, song_id. It took some time to decide what type of “join” or “merge” to use on the tables. Once the tables were merged, we removed NULL values for Spotify data, duplicate data, and columns that were not needed for analysis.

In the combined Billboard 100 dataset with audio features, we replaced the 0-11 values in the “key” column with the corresponding names of the musical keys (0 = C, 1 = C#, and so on) for readability and easier visualization. For the same reason, we replaced the numeric values 0 and 1 in the “mode” column with “minor” and “major” respectively.

if (!require("flexdashboard")) install.packages("flexdashboard")

library(dplyr)
library(tidyr)
library(readr)
library(stringr)
library(magrittr)
library(ggplot2)
library(flexdashboard)

# Loading the data

spotify_songs <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-01-21/spotify_songs.csv')

billboard <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-09-14/billboard.csv')

audio_features <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-09-14/audio_features.csv')

# Adding a numeric year column to the billboard dataset (last 4 characters of week_id)
billboard$year <- as.numeric(str_sub(billboard$week_id, -4))

# Getting rid of columns not needed for analysis
audio_features$spotify_track_id <- NULL
audio_features$spotify_track_preview_url <- NULL
billboard$url <- NULL
billboard$instance <- NULL

# Joining billboard and audio_features tables using common song_id column
billboard_full <- billboard %>% left_join(audio_features, by = "song_id")

# Dropping rows with no Spotify data
billboard_full <- billboard_full %>% 
  drop_na(spotify_track_duration_ms, spotify_track_explicit, spotify_track_album, instrumentalness, liveness, valence, tempo, time_signature, spotify_track_popularity)

# Checking for inconsistencies in song names / performer names
billboard_full %>%
  filter(performer.x != performer.y) %>% 
  select(performer.x, performer.y, song.x, song.y)

billboard_full %>% 
  filter(song.x != song.y) %>% 
  select(song.x, song.y, performer.x, performer.y)

# Remove duplicate / unneeded columns
billboard_full$performer.y <- NULL
billboard_full$song.y <- NULL
billboard_full$song_id <- NULL

# Changing key numbers (0-11) to key names in billboard dataset
key_names <- c("C", "C#/Db", "D", "D#/Eb", "E", "F",
               "F#/Gb", "G", "G#/Ab", "A", "A#/Bb", "B")
billboard_full$key <- as.character(
  factor(billboard_full$key, levels = 0:11, labels = key_names)
)

# Changing mode numbers to mode names in billboard dataset
billboard_full$mode[billboard_full$mode == 0] <- "minor"
billboard_full$mode[billboard_full$mode == 1] <- "major"

# Rename columns
billboard_full <- billboard_full %>% rename(song = song.x, performer = performer.x)

billboard_full %>% head(5)
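The join decision described above can be illustrated on a toy example. This is a minimal sketch with made-up rows, not the project's data. It suggests that a left join followed by dropping rows with missing Spotify values keeps the same rows an inner join would, as long as the Spotify column has no missing values of its own. The `chart` and `features` tables and their values are hypothetical; only the `song_id` key and the column names `week_position` and `danceability` come from the datasets.

```r
library(dplyr)
library(tidyr)

# Hypothetical chart rows: one row per song per chart week
chart <- tibble(
  song_id = c("a", "a", "b", "c"),
  week_position = c(1, 2, 5, 9)
)

# Hypothetical Spotify features: no row for song "c"
features <- tibble(
  song_id = c("a", "b"),
  danceability = c(0.71, 0.55)
)

# left_join keeps every chart row; unmatched rows get NA features
left_then_drop <- chart %>%
  left_join(features, by = "song_id") %>%
  drop_na(danceability)

# inner_join keeps only chart rows that have a match in features
inner <- chart %>% inner_join(features, by = "song_id")

# distinct() removes exact duplicate rows, if any remain after the join
left_then_drop <- left_then_drop %>% distinct()
```

One reason to prefer the left join is that the unmatched rows can be inspected before they are dropped.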

Exploratory Data Analysis

Our process for analyzing the data was exploratory in nature. We mainly computed averages of different variables and visualized them in relation to other variables to see if any correlations were apparent. We used our initial questions of interest to guide our analysis. We conducted analysis on both the Billboard 100 dataset and the Spotify Songs random sample dataset.

Spotify Songs Average Track Popularity by Playlist Genre

In our Spotify Songs random sample, tracks in Pop playlists have the highest average popularity. This supports our initial hypothesis that Pop and pop-related genres have more popular songs.

However, we did not expect Latin to come in a close second.

Figure 1
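A chart like Figure 1 could be produced roughly as follows. This is a minimal sketch on a made-up stand-in for spotify_songs, not the code behind the figure; the `songs` table and its values are hypothetical, while `playlist_genre` and `track_popularity` are column names from the dataset.

```r
library(dplyr)
library(ggplot2)

# Hypothetical stand-in for spotify_songs (values are made up)
songs <- tibble(
  playlist_genre = c("pop", "pop", "latin", "latin", "rock", "rock"),
  track_popularity = c(60, 50, 52, 48, 40, 44)
)

# Mean popularity per genre, highest first
genre_means <- songs %>%
  group_by(playlist_genre) %>%
  summarise(mean_popularity = mean(track_popularity), .groups = "drop") %>%
  arrange(desc(mean_popularity))

# Bar chart with genres ordered by their mean popularity
p <- ggplot(genre_means,
            aes(x = reorder(playlist_genre, -mean_popularity),
                y = mean_popularity)) +
  geom_col() +
  labs(x = "Playlist genre", y = "Mean track popularity")
```

Grouping by `playlist_subgenre` instead of `playlist_genre` would give the sub-genre version of the same summary.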

Spotify Songs Average Track Popularity by Playlist Sub-genre

Again in our Spotify Songs random sample, Fig. 2 shows that many sub-genres related to pop, like post-teen pop, hip pop, and dance pop, have greater Spotify popularity. As in Fig. 1, Latin sub-genres like reggaeton and latin pop are among the most popular.

Figure 2

Mean Spotify Popularity by Musical Key and Mode

Fig. 3 shows the mean Spotify popularity for each musical key. There is little variation between the values, so we see no strong relationship between key and how popular a song is. This does not support our initial prediction that the key of C would be among the most popular keys, even though it is one of the most commonly used keys in music.

Figure 3

Surprisingly, we can see that in our Billboard 100 data, songs in Minor keys had higher average Spotify popularity than songs in Major keys.

Figure 4
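One point to keep in mind when averaging by key or mode: if the joined Billboard table holds one row per song per chart week, a mean over all rows weights each song by its number of weeks on the chart. The following minimal sketch, on made-up rows, contrasts that with a mean over unique songs. The `weekly` table and its values are hypothetical, and the sketch does not assume which of the two approaches the figures above use.

```r
library(dplyr)

# Hypothetical weekly chart rows (values are made up):
# song "x" charted for three weeks, song "y" for one week
weekly <- tibble(
  song = c("x", "x", "x", "y"),
  performer = c("P1", "P1", "P1", "P2"),
  mode = c("minor", "minor", "minor", "minor"),
  spotify_track_popularity = c(80, 80, 80, 40)
)

# Mean over all rows: song "x" counts three times
row_mean <- weekly %>%
  group_by(mode) %>%
  summarise(mean_popularity = mean(spotify_track_popularity), .groups = "drop")

# Mean over unique songs: each song counts once
song_mean <- weekly %>%
  distinct(song, performer, .keep_all = TRUE) %>%
  group_by(mode) %>%
  summarise(mean_popularity = mean(spotify_track_popularity), .groups = "drop")
```

In this toy case the row-level mean is 70 and the song-level mean is 60, so the choice can shift a comparison between groups.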

Top Mean Popularity by Performer

Fig. 5 shows the top 10 performers with the highest average Spotify track popularity.

Figure 5

Songs with Most #1 Week Positions

Fig. 6 shows the 10 songs with the most weeks at the number 1 position on the Billboard 100 Chart. Many of these songs are classified as Pop songs.

Figure 6

Artists with the Most Billboard #1 Songs

Fig. 7 shows the 10 artists with the most number 1 week positions on the Billboard 100 Chart. A majority of these performers are Pop performers.

Figure 7
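Rankings like those in Fig. 6 and Fig. 7 can be computed by filtering to the number 1 position and counting. This is a minimal sketch on made-up chart rows. The `chart` table and its values are hypothetical, `week_position`, `song`, and `performer` are column names from the billboard dataset, and counting distinct number 1 songs per performer is only one possible reading of Fig. 7.

```r
library(dplyr)

# Hypothetical chart rows (values are made up)
chart <- tibble(
  song = c("A", "A", "B", "A", "C", "B"),
  performer = c("P1", "P1", "P2", "P1", "P1", "P2"),
  week_position = c(1, 1, 1, 3, 1, 2)
)

# Weeks at number 1 per song, top 10
top_songs <- chart %>%
  filter(week_position == 1) %>%
  count(song, performer, sort = TRUE, name = "weeks_at_1") %>%
  slice_head(n = 10)

# Distinct number 1 songs per performer, top 10
top_performers <- chart %>%
  filter(week_position == 1) %>%
  distinct(song, performer) %>%
  count(performer, sort = TRUE, name = "number_1_songs") %>%
  slice_head(n = 10)
```

Dropping the `distinct()` step would instead count the total weeks each performer spent at number 1.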

Billboard Spotify Data Averages

Fig. 8 and Fig. 9 show average danceability and loudness for each value of Spotify track popularity. There is a general upward trend in danceability and loudness for songs with higher popularity, especially in the Billboard dataset. There is some interesting variation at the higher popularity values that provides potential for further investigation. There is also a discrepancy between the values for our Spotify random sample and Billboard datasets, which may be influenced by the number of observations in each (the Billboard dataset has 200,000+ observations while spotify_songs has 32,000+).

Figure 8

Figure 9
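The per-popularity averages can be sketched as a grouped summary, with a correlation coefficient on the individual tracks as a complementary check. This is a minimal sketch on made-up values, not the project's data; the `tracks` table is hypothetical, while the column names follow the datasets.

```r
library(dplyr)

# Hypothetical tracks (values are made up)
tracks <- tibble(
  spotify_track_popularity = c(10, 10, 50, 50, 90, 90),
  danceability = c(0.40, 0.50, 0.55, 0.65, 0.70, 0.80),
  loudness = c(-12, -10, -9, -7, -6, -4)
)

# Average feature values at each popularity value
by_popularity <- tracks %>%
  group_by(spotify_track_popularity) %>%
  summarise(
    mean_danceability = mean(danceability),
    mean_loudness = mean(loudness),
    .groups = "drop"
  )

# Pearson correlation on the unaggregated tracks
r_dance <- cor(tracks$spotify_track_popularity, tracks$danceability)
r_loud <- cor(tracks$spotify_track_popularity, tracks$loudness)
```

Averaging within each popularity value hides the track-level spread, so a trend that looks clear in the averaged plot can correspond to a weaker correlation on the individual tracks.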

Song Durations

The next tables display statistics on Spotify track song durations.

We can see that the artist with the longest song duration on average is Newcleus, with an average of 516760 milliseconds, which is about 8.61 minutes.

# A tibble: 5 x 2
  track_artist                   mean_dur
  <chr>                             <dbl>
1 Newcleus                         516760
2 Don McLean                       516209
3 V-Sag                            515703
4 The Joneses                      515680
5 Dele Sosimi Afrobeat Orchestra   512100

The Spotify playlist genre with the longest song durations on average is rock with an average duration of 248576.5 milliseconds, which is about 4.1 minutes.

# A tibble: 6 x 2
  playlist_genre mean_dur
  <chr>             <dbl>
1 rock            248577.
2 r&b             237599.
3 edm             222541.
4 pop             217768.
5 latin           216863.
6 rap             214164.

The Spotify playlist subgenre with the longest song durations on average is new jack swing with an average duration of 275128.6 milliseconds, which is about 4.6 minutes.

# A tibble: 5 x 2
  playlist_subgenre         mean_dur
  <chr>                        <dbl>
1 new jack swing             275129.
2 classic rock               256667.
3 album rock                 255363.
4 progressive electro house  254006.
5 southern hip hop           247029.
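The minute values quoted in this section follow from dividing milliseconds by 60,000. A short sketch of that arithmetic, using the top mean durations from the three tables above (the helper name `ms_to_min` is hypothetical):

```r
# Convert a duration in milliseconds to minutes (60,000 ms per minute)
ms_to_min <- function(ms) ms / 60000

# Top mean durations from the artist, genre, and subgenre tables
mean_dur_ms <- c(Newcleus = 516760, rock = 248576.5, `new jack swing` = 275128.6)

# Rounded to two decimals: 8.61, 4.14, and 4.59 minutes
mean_dur_min <- round(ms_to_min(mean_dur_ms), 2)
```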

Average Number of Songs in a Playlist

From the following table we see that there is an average of 69.7 songs per playlist.

# A tibble: 1 x 1
  avg_len
    <dbl>
1    69.7
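An average like this can be obtained by counting rows per playlist and then taking the mean of those counts. This is a minimal sketch on a made-up stand-in for spotify_songs; the `songs` table and its values are hypothetical, while `playlist_name` and `track_id` are column names from the dataset.

```r
library(dplyr)

# Hypothetical stand-in for spotify_songs (values are made up)
songs <- tibble(
  playlist_name = c("Mix A", "Mix A", "Mix A", "Mix B"),
  track_id = c("t1", "t2", "t3", "t1")
)

# Songs per playlist, then the mean of those counts
avg_len <- songs %>%
  count(playlist_name, name = "n_songs") %>%
  summarise(avg_len = mean(n_songs))
```

If two different playlists share a name, grouping by `playlist_name` merges them; grouping by `playlist_id` would be an alternative in that case.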

Spotify Track Popularity by Release Year

Fig. 10 shows the average Spotify track popularity by release year. In general, average Spotify track popularity increases with release year. However, the release year with the greatest average Spotify popularity is not 2021 or 2020 but 2019.

Figure 10

Summary

The results of our findings in relation to our initial hypotheses are as follows:

Popular songs do have higher danceability on average.
Pop and pop-related song genres do have more popular songs (especially in playlists).
We cannot decisively say that the key of C is among the most popular.
Our analysis showed that songs in the minor mode were more popular on average.
Songs released more recently (within the last 5 years) do generally have higher popularity than older songs.

None of the results from our study were highly significant or conclusive, and it was difficult to draw firm conclusions from the analysis we performed. What we can say is that many variables go into making a song popular.

One limitation of our analysis is that it was wide in scope but shallow in depth. More investigation could be done on specific songs according to their features (such as loudness, speechiness, etc.).

In the future, we would have to look at more data and perform more advanced visualization and trend analysis to see if any more insights could be extracted from it. It would be easier to make more definitive conclusions with a much larger volume of data. Spotify itself has data on around 70 million tracks, and our analysis was only performed on a very small percentage of that.

Our business recommendations are:

Performers/artists should use models to predict Spotify track popularity and/or Billboard 100 chart positions based on Spotify track attributes (to create more popular songs which would mean more revenue).
Spotify should place more advertisements on playlists with popular tracks and/or tracks likely to be popular based on their attributes (because advertising is one of Spotify’s revenue sources).
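As an illustration of the kind of model these recommendations refer to, the following is a minimal sketch of a linear regression of popularity on two audio features. It is not a model fitted in this project: the data are simulated, the coefficients used to generate them are invented, and `sim`, `fit`, and `new_track` are hypothetical names.

```r
set.seed(1)

# Simulated tracks: these numbers are invented for illustration only
n <- 200
sim <- data.frame(
  danceability = runif(n, 0, 1),
  loudness = runif(n, -20, 0)
)
sim$track_popularity <- 30 + 20 * sim$danceability + 1 * sim$loudness +
  rnorm(n, sd = 5)

# Linear model of popularity on two audio features
fit <- lm(track_popularity ~ danceability + loudness, data = sim)

# Predict popularity for a hypothetical new track
new_track <- data.frame(danceability = 0.8, loudness = -5)
predicted <- predict(fit, newdata = new_track)
```

On real data, such a model would need to be checked on held-out tracks before any prediction is relied on.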

References

Al-Beitawi, Z., Salehan, M., & Zhang, S. (2020). What Makes a Song Trend? Cluster Analysis of Musical Attributes for Spotify Top Trending Songs. Journal of Marketing Development and Competitiveness, 14(3), 79-91. http://search.proquest.com.ezproxy.gvsu.edu/scholarly-journals/what-makes-song-trend-cluster-analysis-musical/docview/2444523984/se-2?accountid=39473

Anjana, S. (2015). A Robust Approach to Predict the Popularity of Songs by Identifying Appropriate Properties. Retrieved November 19, 2021, from https://dl.ucsc.cmb.ac.lk/jspui/bitstream/123456789/4157/1/2015%20CS%20013.pdf.

Hughes, C. (2015, March 9). Understanding the Spotify Web API. Spotify Engineering. Retrieved November 19, 2021, from https://engineering.atspotify.com/2015/03/09/understanding-spotify-web-api/