We aim to explore how the danceability of popular songs has evolved over time using quantitative data analysis. By leveraging music datasets (e.g., Spotify's music database), we will investigate trends in danceability scores across genres, decades, and cultural shifts.
For this research, we will use R and RStudio to analyze the danceability of songs across different decades, drawing on data originally retrieved from Spotify's API. First, we will import the .csv file containing track metadata and audio features, including danceability scores, for tracks drawn from curated playlists that span various genres and time periods. We will extract release dates to categorize tracks by decade, then use packages like dplyr to calculate and visualize average danceability trends over time. Additionally, we will examine which playlists these songs appear in to identify patterns in how danceable tracks are grouped, such as workout, party, or chill playlists, and whether certain playlist types consistently feature higher danceability scores. This approach will allow us to uncover both temporal and contextual shifts in musical rhythm and movement.
To ensure meaningful analysis, the first step in this research involves tidying and standardizing the data. This includes cleaning inconsistencies in release dates, converting them into a uniform format, and categorizing songs by decade. Duplicate entries and missing values in key variables like danceability scores must be addressed to maintain data integrity. Standardizing playlist names and genres also helps in grouping and comparing across categories. Once the dataset is clean, visualizations become powerful tools: line graphs will be especially useful to show trends in average danceability over time, while boxplots can highlight the distribution and variability of danceability within each decade. Heatmaps may also reveal correlations between playlist types and danceability scores, offering a layered view of how musical movement has evolved both temporally and contextually.
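As a minimal sketch of this cleaning step (assuming the data have already been read into a data frame named spotify_songs, and assuming duplicates are identified by track_id, both of which are our assumptions for illustration), the deduplication and missing-value handling could look like this:

# Sketch of the cleaning step: drop duplicate tracks and rows missing a danceability score
# (assumes the data frame is named spotify_songs and duplicates share a track_id)
library(dplyr)
spotify_songs <- spotify_songs %>%
  distinct(track_id, .keep_all = TRUE) %>%
  filter(!is.na(danceability))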
We load the spotifyr package to access Spotify data through its API. The tidyverse, dplyr, and lubridate packages are also loaded, as they provide a comprehensive set of tools for efficient data manipulation, visualization, and analysis.
Our data comes directly from Spotify via the spotifyr package.
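The setup chunk itself is not shown; a minimal sketch, assuming the data have been exported to a local file named spotify_songs.csv (the file name is our assumption), would be:

# Load the packages described above
library(spotifyr)   # wrapper around the Spotify Web API
library(tidyverse)  # dplyr, ggplot2, readr, and friends
library(lubridate)  # date handling

# Authenticate with the Spotify API (assumes SPOTIFY_CLIENT_ID and
# SPOTIFY_CLIENT_SECRET are set as environment variables)
access_token <- get_spotify_access_token()

# Read the exported track data; the file name is an assumption
spotify_songs <- read_csv("spotify_songs.csv")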
variable | class | description |
---|---|---|
track_id | character | Song unique ID |
track_name | character | Song Name |
track_artist | character | Song Artist |
track_popularity | double | Song Popularity (0-100) where higher is better |
track_album_id | character | Album unique ID |
track_album_name | character | Song album name |
track_album_release_date | character | Date when album released |
playlist_name | character | Name of playlist |
playlist_id | character | Playlist ID |
playlist_genre | character | Playlist genre |
playlist_subgenre | character | Playlist subgenre |
danceability | double | Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable. |
energy | double | Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy. |
key | double | The estimated overall key of the track. Integers map to pitches using standard Pitch Class notation. E.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1. |
loudness | double | The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB. |
mode | double | Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0. |
speechiness | double | Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks. |
acousticness | double | A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic. |
instrumentalness | double | Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0. |
liveness | double | Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live. |
valence | double | A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry). |
tempo | double | The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration. |
duration_ms | double | Duration of song in milliseconds |
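A quick structural check (not included in the original write-up) can confirm that the imported columns match the classes listed above:

# Inspect column names and types of the loaded data
glimpse(spotify_songs)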
Next, we transform and tidy our data using the code below.
#### Disable scientific notation when printing numbers
options(scipen = 999)
#### Replace loudness values above zero (outside the expected -60 to 0 dB range) with NA
spotify_songs$loudness[spotify_songs$loudness > 0] <- NA
#### Recode key and mode from Spotify's numeric codes to note names and major/minor,
#### then combine them into a key signature
spotify_songs <- spotify_songs %>%
  mutate(
    key = case_when(
      key == 0 ~ "C",
      key == 1 ~ "C♯/D♭",
      key == 2 ~ "D",
      key == 3 ~ "D♯/E♭",
      key == 4 ~ "E",
      key == 5 ~ "F",
      key == 6 ~ "F♯/G♭",
      key == 7 ~ "G",
      key == 8 ~ "G♯/A♭",
      key == 9 ~ "A",
      key == 10 ~ "A♯/B♭",
      key == 11 ~ "B",
      TRUE ~ as.character(key)
    ),
    mode = ifelse(mode == 1, "major", "minor"),
    key_signature = paste(key, mode)
  )
#### Use the release date column to create year and then decade columns
spotify_songs$year <- as.numeric(substr(spotify_songs$track_album_release_date, 1, 4))
spotify_songs$decade <- floor(spotify_songs$year / 10) * 10
Now that our data is tidy, we can create a new data frame containing only the columns we will use:
spotify_songs_project <- spotify_songs %>%
select(track_name, track_artist, playlist_genre, danceability, decade, key_signature)
Below are the first 10 rows of our tidied data:
head(spotify_songs_project, 10)
## # A tibble: 10 × 6
## track_name track_artist playlist_genre danceability decade key_signature
## <chr> <chr> <chr> <dbl> <dbl> <chr>
## 1 I Don't Care (… Ed Sheeran pop 0.748 2010 F♯/G♭ minor
## 2 Memories - Dil… Maroon 5 pop 0.726 2010 B minor
## 3 All the Time -… Zara Larsson pop 0.675 2010 C♯/D♭ minor
## 4 Call You Mine … The Chainsm… pop 0.718 2010 G minor
## 5 Someone You Lo… Lewis Capal… pop 0.65 2010 C♯/D♭ minor
## 6 Beautiful Peop… Ed Sheeran pop 0.675 2010 G♯/A♭ minor
## 7 Never Really O… Katy Perry pop 0.449 2010 F minor
## 8 Post Malone (f… Sam Feldt pop 0.542 2010 E minor
## 9 Tough Love - T… Avicii pop 0.594 2010 G♯/A♭ minor
## 10 If I Can't Hav… Shawn Mendes pop 0.642 2010 D minor
Some interesting tidbits about our data: only six genres are represented, and most of the data set consists of songs from the 2010s, which may skew our results.
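The chunk that produced the counts below is not shown in the original document; a sketch using base R's table() on the tidied data would be:

# Frequency of tracks by playlist genre and by decade
table(spotify_songs_project$playlist_genre)
table(spotify_songs_project$decade)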
##
## edm latin pop r&b rap rock
## 6043 5155 5507 5431 5746 4951
##
## 1950 1960 1970 1980 1990 2000 2010 2020
## 3 172 966 1306 2310 4077 23214 785
For our analysis, we have already created two new variables, decade and key_signature. The decade variable chunks the original track_album_release_date column into eras, giving us fewer unique values and letting us track danceability across decades rather than individual years. We also created a new data frame containing only the columns we are interested in, to keep our information tidy and succinct.
We plan to use line graphs, box plots, and heatmaps to help us answer these questions.
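As one example, a line graph of average danceability by decade could be sketched as follows (this is illustrative only; the final figures may differ):

# Sketch: average danceability per decade as a line graph
spotify_songs_project %>%
  group_by(decade) %>%
  summarise(mean_danceability = mean(danceability, na.rm = TRUE)) %>%
  ggplot(aes(x = decade, y = mean_danceability)) +
  geom_line() +
  geom_point() +
  labs(x = "Decade", y = "Average danceability",
       title = "Average danceability by decade")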
As for machine learning, we are considering a binary classification model to predict whether a song in a given key and genre will have high danceability.
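A rough sketch of such a model, assuming "high danceability" is defined as a score above 0.7 (that cutoff is our assumption for illustration, not part of the plan), could use logistic regression:

# Sketch: logistic regression for predicting high danceability
# (the 0.7 cutoff is an assumption made for illustration)
model_data <- spotify_songs_project %>%
  mutate(high_dance = as.integer(danceability > 0.7))

dance_model <- glm(high_dance ~ key_signature + playlist_genre,
                   data = model_data, family = binomial())
summary(dance_model)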