We aim to explore how the makeup of a song, such as its danceability, affects its popularity, and how that relationship has evolved over the decades, using quantitative data analysis. By leveraging music datasets (e.g., Spotify's music database), we will investigate trends in popularity scores across genres, decades, and cultural shifts.
For this research, we will use R and RStudio to analyze song popularity across decades, querying data from Spotify's API. First, we will import the .csv file and retrieve track metadata and audio features, including popularity scores, from curated playlists spanning various genres and time periods. We will extract release dates to categorize tracks by decade, then use packages like dplyr to calculate and visualize average popularity trends over time. Additionally, we will examine which genres of playlists these songs appear in to identify patterns in how danceable tracks are grouped (such as workout, party, or chill playlists) and whether certain playlist types consistently feature higher popularity scores. This approach will allow us to uncover both temporal and contextual shifts in musical rhythm and movement.
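As a minimal sketch of that import step (the file name here is a placeholder, since the write-up does not specify one):

library(readr)

# Read the exported track data; "spotify_songs.csv" is a placeholder name
spotify_songs <- read_csv("spotify_songs.csv")

# Quick sanity checks on what came in
dim(spotify_songs)
names(spotify_songs)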
To ensure meaningful analysis, the first step in this research involves tidying and standardizing the data. This includes cleaning inconsistencies in release dates, converting them into a uniform format, and categorizing songs by decade. Duplicate entries and missing values in key variables like popularity scores must be addressed to maintain data integrity. Standardizing playlist names and genres also helps in grouping and comparing across categories. Once the dataset is clean, visualizations become powerful tools: histograms will be especially useful for showing how popularity scores are distributed, while boxplots can highlight the distribution and variability of popularity within each decade. Heatmaps may also reveal correlations between playlist types and popularity scores, offering a layered view of how musical movement has evolved both temporally and contextually.
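A sketch of how the duplicate and missing-value handling might look with dplyr, assuming track_id uniquely identifies a song (spotify_songs_clean is a hypothetical name; the counts reported later are on the full data set):

library(dplyr)

# Illustrative only: one row per track, and drop rows missing popularity
spotify_songs_clean <- spotify_songs %>%
  distinct(track_id, .keep_all = TRUE) %>%   # de-duplicate repeated playlist appearances
  filter(!is.na(track_popularity))           # remove rows missing the key variable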
We load the spotifyr package to access Spotify data through its API. The tidyverse, dplyr, ggplot2, rpart, and lubridate packages are also loaded, as they provide a comprehensive set of tools for efficient data manipulation, visualization, and analysis. We also load rpart.plot, which is needed for plotting the decision trees built later. Our data comes directly from Spotify via the spotifyr package.
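A minimal sketch of that setup chunk (note that dplyr and ggplot2 ship with the tidyverse, and that rpart.plot is a separate package from rpart):

library(spotifyr)   # access to the Spotify Web API
library(tidyverse)  # data manipulation and plotting (includes dplyr and ggplot2)
library(rpart)      # decision tree models
library(rpart.plot) # plotting rpart trees
library(lubridate)  # date handling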
#### Suppressing scientific notation in printed output
options(scipen = 999)
#### Loudness values above 0 dB fall outside the expected range (-60 to 0), so we treat them as missing
spotify_songs$loudness[spotify_songs$loudness > 0] <- NA
#### Recoding key from numeric pitch-class codes to note names, mode to major or minor,
#### and combining them into a key signature (converting mode inside this one mutate
#### avoids accidentally recoding it twice)
spotify_songs <- spotify_songs %>%
  mutate(
    key = case_when(
      key == 0 ~ "C",
      key == 1 ~ "C♯/D♭",
      key == 2 ~ "D",
      key == 3 ~ "D♯/E♭",
      key == 4 ~ "E",
      key == 5 ~ "F",
      key == 6 ~ "F♯/G♭",
      key == 7 ~ "G",
      key == 8 ~ "G♯/A♭",
      key == 9 ~ "A",
      key == 10 ~ "A♯/B♭",
      key == 11 ~ "B",
      TRUE ~ as.character(key)
    ),
    mode = ifelse(mode == 1, "major", "minor"),
    key_signature = paste(key, mode)
  )
#### Using the release date column to create a year column and then a decade column
spotify_songs$year <- as.numeric(substr(spotify_songs$track_album_release_date, 1, 4))
spotify_songs$decade <- floor(spotify_songs$year / 10) * 10
Now that our data is tidied, we can create a new data frame with only the columns we are going to use:
spotify_songs_project <- spotify_songs %>%
  select(track_name, track_artist, playlist_genre, track_popularity, decade, key_signature)
Below are the first 10 rows of our tidied data:
head(spotify_songs_project,10)
## # A tibble: 10 × 6
## track_name track_artist playlist_genre track_popularity decade key_signature
## <chr> <chr> <chr> <dbl> <dbl> <chr>
## 1 I Don't Ca… Ed Sheeran pop 66 2010 F♯/G♭ minor
## 2 Memories -… Maroon 5 pop 67 2010 B minor
## 3 All the Ti… Zara Larsson pop 70 2010 C♯/D♭ minor
## 4 Call You M… The Chainsm… pop 60 2010 G minor
## 5 Someone Yo… Lewis Capal… pop 69 2010 C♯/D♭ minor
## 6 Beautiful … Ed Sheeran pop 67 2010 G♯/A♭ minor
## 7 Never Real… Katy Perry pop 62 2010 F minor
## 8 Post Malon… Sam Feldt pop 69 2010 E minor
## 9 Tough Love… Avicii pop 68 2010 G♯/A♭ minor
## 10 If I Can't… Shawn Mendes pop 67 2010 D minor
Some interesting tidbits about our data: only six genres are represented! Also, most of our data set consists of songs from the 2010s, which may skew our results.
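Counts like the ones below come from simple frequency tables on the tidied data (a sketch; the output shown was produced when the report was knit):

table(spotify_songs_project$playlist_genre)
table(spotify_songs_project$decade)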
##
## edm latin pop r&b rap rock
## 6043 5155 5507 5431 5746 4951
##
## 1950 1960 1970 1980 1990 2000 2010 2020
## 3 172 966 1306 2310 4077 23214 785
variable | class | description |
---|---|---|
track_id | character | Song unique ID |
track_name | character | Song Name |
track_artist | character | Song Artist |
track_popularity | double | Song Popularity (0-100) where higher is better |
track_album_id | character | Album unique ID |
track_album_name | character | Song album name |
track_album_release_date | character | Date when album released |
playlist_name | character | Name of playlist |
playlist_id | character | Playlist ID |
playlist_genre | character | Playlist genre |
playlist_subgenre | character | Playlist subgenre |
danceability | double | Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable. |
energy | double | Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy. |
key | double | The estimated overall key of the track. Integers map to pitches using standard Pitch Class notation. E.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1. |
loudness | double | The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB. |
mode | double | Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0. |
speechiness | double | Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks. |
acousticness | double | A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic. |
instrumentalness | double | Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0. |
liveness | double | Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live. |
valence | double | A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry). |
tempo | double | The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration. |
duration_ms | double | Duration of song in milliseconds |
For our analysis, we have already created two new variables, decade and key_signature. The decade variable chunks up the original track_album_release_date column so that we have fewer unique values and can track popularity across eras rather than individual years. We also created a new data frame from our original that includes only the columns we are interested in, which keeps our information tidy and succinct.
We plan to use histograms, stacked bar charts, and scatterplots to explore these questions.
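As a rough sketch of what those plots might look like with ggplot2 (binwidths, alpha, and labels here are placeholder choices):

# Histogram: distribution of popularity scores across all tracks
ggplot(spotify_songs_project, aes(x = track_popularity)) +
  geom_histogram(binwidth = 5) +
  labs(x = "Popularity (0-100)", y = "Number of tracks")

# Stacked bar chart: genre makeup of each decade
ggplot(spotify_songs_project, aes(x = factor(decade), fill = playlist_genre)) +
  geom_bar() +
  labs(x = "Decade", y = "Number of tracks", fill = "Genre")

# Scatterplot: danceability against popularity, using the full data set
ggplot(spotify_songs, aes(x = danceability, y = track_popularity)) +
  geom_point(alpha = 0.1) +
  labs(x = "Danceability", y = "Popularity")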
As we stated at the beginning of our project, we aim to explore how the makeup of a song, such as its danceability, affects its popularity, and how that has evolved over decades, using quantitative data analysis. By leveraging music datasets (e.g., Spotify's music database), we are addressing whether songs have become more or less danceable over time, and how that affects popularity.
Spotify is a large streaming service with many consumers. First and foremost, our target audience is the CEO, executives, and shareholders of Spotify. Spotify only recently had its first full year of profitability in 2024, a major milestone for a company that has been in business since 2008. They achieved this by increasing subscription costs in different markets, creating royalty deals with record labels, expanding their podcast and audiobook catalogs, and giving rightsholders the option to give up 30% of royalties in exchange for a push by the algorithm. These moves benefit the top executives and shareholders because Spotify previously earned only 30% of the total generated revenue; when a label uses the discovery feature, Spotify earns an extra 17% on its gross margin. Our data can be used to see where trends lie and how to leverage them to better the company.
This leads to the other consumers who will benefit from our data. Our data will be beneficial to record-label executives who want to see what types of artists perform well on the app. They can use it when choosing artists to sign and during the songwriting process. Songs have patterns in the way they are constructed to appeal to the human brain, for example the standard four-chord progression found in many pop songs. In addition, our data can be used to look at past music and see what appeals to people now and is popular, which means labels can push a certain sound, for example, the '80s retro sound we have had for the past five years.
One insight we found is that the popularity data is somewhat skewed. While Spotify has been around since 2008, it did not see a huge increase in listeners until the mid-2010s: in Q1 2015 Spotify had approximately 68 million monthly active listeners, but by Q4 2020 it had 345 million. We noticed that the most popular songs, those scoring 80 and over, were mostly from 2017 to 2020, with very few from earlier years. We also noticed that rock was the only genre mostly absent from the most popular songs. Moreover, danceability may have peaked in the mid-2010s as we moved away from certain sounds and new trends and forms of social media emerged.
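One way to inspect that skew is to count the highly popular songs (80 and over) by decade and genre; a sketch against our tidied data frame:

# Which decade/genre combinations dominate the most popular songs?
spotify_songs_project %>%
  filter(track_popularity >= 80) %>%
  count(decade, playlist_genre) %>%
  arrange(desc(n))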
For example, as Spotify and other services grew, Billboard created its first Streaming Songs chart in 2013. The top two songs on the first chart, "Thrift Shop" and "Locked Out of Heaven", both have popularity scores between 70 and 79 and danceability scores of .781 and .726, respectively. The most popular songs being mostly from 2017-2020 also tracks with the sudden rise in streaming in 2018, which contributed to Billboard further increasing the weight streams carry on its charts. Streams have since overtaken radio in how Billboard tallies its points, which helps explain why Spotify has become profitable over the past couple of years through its newer tactics. Labels and artists need streams to chart well, and sacrificing some royalties for a streaming push and a higher chart placement is worth it to them.
With these insights, the implication for our target consumer is that executives can use the data to shape the Spotify algorithm. With labels and artists sacrificing a percentage of their royalties to get a push, executives can use the popularity and danceability of certain tracks to earn more profit by pushing a popular genre, or songs that already perform well, to a wider audience. Of course, one negative implication is that listeners may grow weary of Spotify doing this and look to other music streaming services. It is not unusual for Spotify users to complain that autoplayed or recommended songs, or songs in curated playlists, feel out of place or do not fit what they usually listen to.
The main limitation of our analysis is that Spotify and streaming generally are still relatively new and seeing user growth daily. As such, our data is subject to recency bias: songs released more recently are naturally going to score as more popular, while older songs score as less popular.
# First, create factor versions of decade and playlist_genre for modeling
spotify_songs_project$decade_model <- as.factor(spotify_songs_project$decade)
spotify_songs_project$genre_model <- as.factor(spotify_songs_project$playlist_genre)
# Second, build the regression tree, using method = "anova" since track_popularity is numeric
tree_model <- rpart(track_popularity ~ decade_model + genre_model,
data = spotify_songs_project,
method = "anova")
# Third, plot the tree (this uses the rpart.plot package)
rpart.plot(tree_model,
           type = 3,
           extra = 101,
           fallen.leaves = TRUE,
           main = "Predicting Song Popularity")
# What does this mean? Firstly, EDM songs in our data frame are the most
# likely to be unpopular. Next, among the remaining genres, a song from the
# 1990s or 2000s is predicted to be less popular than songs from the other decades.
# Create a binary variable; we selected a popularity of 80 as the threshold for a song counting as popular
spotify_songs_project <- spotify_songs_project %>%
mutate(popular = ifelse(track_popularity >= 80, "Popular", "Not Popular"))
# Build a classification tree to predict whether a song is popular or not,
# rather than predicting numeric popularity like the previous model
tree_model <- rpart(popular ~ genre_model + decade_model, data = spotify_songs_project, method = "class")
rpart.plot(tree_model, type = 3, extra = 104)
# The tree predicts that about 95% of songs are "Not Popular" and only about 5%
# are "Popular", which makes sense given how few tracks score 80 or above
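To check that reading of the tree, one could compare its in-sample predictions against the actual labels (a sketch only, not a proper holdout evaluation):

# Confusion table of predicted vs. actual popularity labels (in-sample)
predicted <- predict(tree_model, newdata = spotify_songs_project, type = "class")
table(Predicted = predicted, Actual = spotify_songs_project$popular)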