Popularity in Time and Tonality

Introduction

We aim to explore how the makeup of a song, such as its danceability, affects its popularity, and how that relationship has evolved over the decades, using quantitative data analysis. By leveraging music datasets (e.g., Spotify’s music database), we will investigate trends in popularity scores across genres, decades, and cultural shifts.

For this research, we will use R and RStudio to analyze the popularity of songs across different decades, working with data queried from Spotify’s API. First, we will import the .csv file containing track metadata and audio features, including popularity scores, drawn from curated playlists that span various genres and time periods. We will extract release dates to categorize tracks by decade, then use packages like dplyr to calculate and visualize average popularity trends over time. Additionally, we will examine which genres of playlists these songs appear in to identify patterns in how danceable tracks are grouped, such as workout, party, or chill playlists, and whether certain playlist types consistently feature higher popularity scores. This approach will allow us to uncover both temporal and contextual shifts in musical rhythm and movement.
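
As an illustration of the querying step, spotifyr can pull track metadata and audio features for a playlist once API credentials are set; the playlist owner and ID below are placeholders, not necessarily the ones we used:

####assumes SPOTIFY_CLIENT_ID and SPOTIFY_CLIENT_SECRET are set as environment variables
library(spotifyr)

access_token <- get_spotify_access_token()

####pull track metadata and audio features for one playlist (placeholder owner and ID)
playlist_features <- get_playlist_audio_features(
  username = "spotify",
  playlist_uris = "37i9dQZF1DXcBWIGoYBM5M",
  authorization = access_token
)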

To ensure meaningful analysis, the first step in this research involves tidying and standardizing the data. This includes cleaning inconsistencies in release dates, converting them into a uniform format, and categorizing songs by decade. Duplicate entries and missing values in key variables like popularity scores must be addressed to maintain data integrity. Standardizing playlist names and genres also helps in grouping and comparing across categories. Once the dataset is clean, visualizations become powerful tools: histograms will be especially useful to show trends in average popularity over time, while boxplots can highlight the distribution and variability of popularity within each decade. Heatmaps may also reveal correlations between playlist types and popularity scores, offering a layered view of how musical movement has evolved both temporally and contextually.
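
A minimal sketch of the duplicate and missing-value handling described above, assuming track_id uniquely identifies a song:

library(dplyr)

spotify_songs <- spotify_songs %>%
  distinct(track_id, .keep_all = TRUE) %>%   ####drop duplicate tracks
  filter(!is.na(track_popularity))           ####drop rows missing popularity scores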

Packages Required

We load the spotifyr package to access Spotify data through its API. The tidyverse, dplyr, ggplot2, rpart, rpart.plot, and lubridate packages are also loaded, as they provide a comprehensive set of tools for efficient data manipulation, visualization, and analysis.
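
A minimal setup sketch covering the packages listed above:

library(spotifyr)    ####Spotify API access
library(tidyverse)   ####includes dplyr and ggplot2
library(rpart)       ####decision tree models
library(rpart.plot)  ####plotting rpart trees
library(lubridate)   ####date handling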

Data Preparation

Our data comes directly from Spotify via the spotifyr package.

####disabling scientific notation so large numbers print as plain integers
options(scipen = 999)

####turning loudness values over zero into NA (loudness should range between -60 and 0 dB)
spotify_songs$loudness[spotify_songs$loudness > 0] <- NA

####turning mode to major or minor, key from numeric to note names, and combining both into a key signature
spotify_songs <- spotify_songs %>%
  mutate(
    mode = ifelse(mode == 1, "major", "minor"),
    key = case_when(
      key == 0  ~ "C",
      key == 1  ~ "C♯/D♭",
      key == 2  ~ "D",
      key == 3  ~ "D♯/E♭",
      key == 4  ~ "E",
      key == 5  ~ "F",
      key == 6  ~ "F♯/G♭",
      key == 7  ~ "G",
      key == 8  ~ "G♯/A♭",
      key == 9  ~ "A",
      key == 10 ~ "A♯/B♭",
      key == 11 ~ "B",
      TRUE      ~ as.character(key)
    ),
    key_signature = paste(key, mode)
  )

####using the date column to create a year column and then a decade column
spotify_songs$year <- as.numeric(substr(spotify_songs$track_album_release_date, 1, 4))
spotify_songs$decade <- floor(spotify_songs$year / 10) * 10
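
Since lubridate is among our loaded packages, here is an equivalent sketch of the same derivation (ymd() with truncated = 2 also parses year-only release dates like "1999"):

spotify_songs <- spotify_songs %>%
  mutate(
    year   = year(ymd(track_album_release_date, truncated = 2)),
    decade = (year %/% 10) * 10   ####integer division rounds down to the decade
  )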

Now that our data is tidied, we can create a new data frame containing only the columns we are going to use:

spotify_songs_project <- spotify_songs %>%
  select(track_name, track_artist, playlist_genre, track_popularity, decade, key_signature)

Below are the first 10 rows of our tidied data:

head(spotify_songs_project, 10)
## # A tibble: 10 × 6
##    track_name  track_artist playlist_genre track_popularity decade key_signature
##    <chr>       <chr>        <chr>                     <dbl>  <dbl> <chr>        
##  1 I Don't Ca… Ed Sheeran   pop                          66   2010 F♯/G♭ minor  
##  2 Memories -… Maroon 5     pop                          67   2010 B minor      
##  3 All the Ti… Zara Larsson pop                          70   2010 C♯/D♭ minor  
##  4 Call You M… The Chainsm… pop                          60   2010 G minor      
##  5 Someone Yo… Lewis Capal… pop                          69   2010 C♯/D♭ minor  
##  6 Beautiful … Ed Sheeran   pop                          67   2010 G♯/A♭ minor  
##  7 Never Real… Katy Perry   pop                          62   2010 F minor      
##  8 Post Malon… Sam Feldt    pop                          69   2010 E minor      
##  9 Tough Love… Avicii       pop                          68   2010 G♯/A♭ minor  
## 10 If I Can't… Shawn Mendes pop                          67   2010 D minor

Some interesting tidbits about our data: only six genres are represented! Most of our data set also consists of songs from the 2010s, which may skew our results. The frequency tables below break the data down by genre and by decade.
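
Counts like these can be produced with base R’s table(), for example:

table(spotify_songs_project$playlist_genre)
table(spotify_songs_project$decade)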

## 
##   edm latin   pop   r&b   rap  rock 
##  6043  5155  5507  5431  5746  4951
## 
##  1950  1960  1970  1980  1990  2000  2010  2020 
##     3   172   966  1306  2310  4077 23214   785

Data Dictionary

variable class description
track_id character Song unique ID
track_name character Song Name
track_artist character Song Artist
track_popularity double Song Popularity (0-100) where higher is better
track_album_id character Album unique ID
track_album_name character Song album name
track_album_release_date character Date when album released
playlist_name character Name of playlist
playlist_id character Playlist ID
playlist_genre character Playlist genre
playlist_subgenre character Playlist subgenre
danceability double Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.
energy double Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.
key double The estimated overall key of the track. Integers map to pitches using standard Pitch Class notation. E.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1.
loudness double The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB.
mode double Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.
speechiness double Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.
acousticness double A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.
instrumentalness double Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.
liveness double Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.
valence double A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).
tempo double The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.
duration_ms double Duration of song in milliseconds

Exploratory Data Analysis

For our analysis, we have already created two new variables, decade and key_signature. The decade variable chunks up our original track_album_release_date column so that we have fewer unique observations and can track popularity across eras rather than individual years. We also created a new data frame from our original that includes only the columns we are interested in, to keep our information tidy and succinct.

We are planning to use histograms, stacked bar charts, and scatterplots to help us find our answers.
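
As a sketch of what one of these plots might look like (a stacked histogram of popularity by decade; the exact aesthetics are illustrative):

library(ggplot2)

####distribution of popularity scores, split by decade
ggplot(spotify_songs_project, aes(x = track_popularity, fill = factor(decade))) +
  geom_histogram(binwidth = 5, position = "stack") +
  labs(
    x = "Track popularity (0-100)",
    y = "Number of songs",
    fill = "Decade",
    title = "Popularity distribution by decade"
  )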

Summary

As we stated at the beginning of our project, we aim to explore how the makeup of a song, such as its danceability, affects its popularity, and how that relationship has evolved over the decades, using quantitative data analysis. By leveraging music datasets (e.g., Spotify’s music database), we are addressing whether songs have become more or less danceable over time, and how that affects popularity.

Spotify is a large streaming service with many consumers. First and foremost, our target audience is the CEO, executives, and shareholders of Spotify. Spotify only recently had its first full year of profitability, in 2024, a major milestone for a company that has been in business since 2008. It did this by increasing subscription costs in different markets, creating royalty deals with record labels, expanding its podcast and audiobook catalogs, and giving rightsholders the option to give up 30% of royalties to get pushed by the algorithm. These moves benefit the top execs and shareholders because Spotify previously only earned 30% of the total generated revenue. When a label utilizes the discovery feature, Spotify earns an extra 17% on its gross margin. Our analysis can be used to see where trends lie and how to act on them to better the company.

This leads to the other consumers who will benefit from our data: record label executives who want to see what types of artists perform well on the app. They can use it when choosing artists to sign and during the songwriting process. Songs have patterns in the way they are constructed to appeal to the human brain, for example the standard four-chord progression found in many pop songs. In addition, our data can be used to look at past music and see what appeals to people now and is popular. This means labels can push a certain sound, for example, the ’80s retro sound we have had for the past five years.

One insight we found is that the popularity data is somewhat skewed. Whilst Spotify has been around since 2008, it did not see a huge increase in listeners until the mid-2010s: in Q1 2015 Spotify had approximately 68 million monthly active listeners, but by Q4 2020 it had 345 million. We noticed that the most popular songs, those scoring 80 and over, were mostly from 2017 to 2020, with a very small number from previous years. We also noticed that rock was the only genre mostly absent from the most popular songs. Moreover, danceability may have peaked in the mid-2010s as we moved away from certain sounds and new trends and forms of social media emerged.

For example, as Spotify and other services grew, Billboard created its first Streaming Songs chart in 2013. The top two songs on the first chart posted, “Thrift Shop” and “Locked Out of Heaven”, both have popularity scores between 70 and 79 and danceability scores of .781 and .726, respectively. The most popular songs mostly being from 2017-2020 also tracks with the sudden rise in streaming in 2018, which contributed to Billboard further increasing the weight streams carry on its charts. Streams have since overtaken radio in how Billboard tallies its points, which helps explain why Spotify has become profitable over the past couple of years through its newer tactics. Labels and artists need streams to chart well, and sacrificing some royalties for a streaming push and a higher chart placement is worth it to them.

With these insights, the implication for our target consumers is that executives can use the data to tune the Spotify algorithm. With labels and artists sacrificing a percentage of their royalties to get a push, executives can use the popularity and danceability of certain tracks to earn more profit by pushing a popular genre or already well-performing songs to a wider audience. Of course, one negative implication is that listeners may grow weary of Spotify doing this and look to other music streaming services. It is not unusual for some Spotify users to complain that autoplayed songs, recommendations, or playlists include newer songs that feel out of place or do not fit what they usually listen to.

The main limitation of our analysis is that Spotify, and streaming generally, is still relatively new and seeing user growth daily. As such, our data suffers from recency bias: songs released more recently are naturally going to score higher in popularity, while older songs score lower.

Models

#first, we will create new variables that encode playlist_genre and decade as factors

spotify_songs_project$decade_model <- as.factor(spotify_songs_project$decade)
spotify_songs_project$genre_model <- as.factor(spotify_songs_project$playlist_genre)


#second, we must build the tree model, using anova since track_popularity is a numeric variable

tree_model <-  rpart(track_popularity ~ decade_model + genre_model,
                     data = spotify_songs_project,
                     method = "anova")

#third, we will plot the tree 

rpart.plot(tree_model,
           type = 3,
           extra = 101,
           fallen.leaves = TRUE,
           main = "Predicting Song Popularity")

#What does this mean? Firstly, EDM songs in our data frame are most likely to be unpopular. Next, among
#all the other genres, a song from the 1990s or 2000s will be less popular than songs from all other decades.

#we now need to create a new binary variable; we have selected 80 as the threshold for a song being popular or not

spotify_songs_project <- spotify_songs_project %>%
  mutate(popular = ifelse(track_popularity >= 80, "Popular", "Not Popular"))

#build a tree model to predict whether a song is popular or not, rather than predicting its numeric popularity as in the previous model

tree_model <- rpart(popular ~ genre_model + decade_model, data = spotify_songs_project, method = "class")
rpart.plot(tree_model, type = 3, extra = 104)

#this model predicts that about 95% of songs will be unpopular and only 5% popular, which makes sense given how high our 80-point threshold is
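
To see where that 95/5 split comes from, here is a quick sanity-check sketch comparing the class balance with the tree’s in-sample predictions:

####class balance in the data: most songs fall below the 80-point threshold
prop.table(table(spotify_songs_project$popular))

####compare the tree's in-sample predictions against the actual labels
pred <- predict(tree_model, type = "class")
table(predicted = pred, actual = spotify_songs_project$popular)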