Spotify’s vast streaming catalog offers an ideal sandbox for probing how musical style interacts with commercial success. This technical report documents the assembly of an analysis-ready dataset that merges two complementary slices, “High Popularity” and “Low Popularity”, from the Kaggle Spotify Music Data repository. After appending a categorical popularity label, we condense more than thirty playlist-level genre tags into six headline genres plus an aggregated “others” bucket to streamline visualisation and modelling. Duplicate records are flagged at the track-artist-playlist level using flexible separator rules that detect comma, ampersand, and semicolon delimiters within multi-artist fields. Because the borderline popularity score of 68 sometimes appears in both classes, ambiguous low-popularity rows are excluded to guarantee mutually exclusive groups. The resulting table contains 13,342 unique tracks spanning electronic, pop, Latin, hip-hop, ambient, and rock, together with the “others” bucket. Every transformation is scripted in R and encapsulated within a Quarto document, so the pipeline can be rerun on fresh dumps or forked datasets. The cleaned corpus supports subsequent inquiries into genre-specific audience reach, collaboration networks, and the quantitative drivers of streaming attention. Practitioners can reuse the workflow as a template for playlist analytics, while researchers gain a transparent foundation for reproducible music-industry studies. The code is concise, documented, and easy to share.
Keywords
Spotify, R language
Introduction
# ===== 0. Libraries ==========================================================
library(tidyverse)   # dplyr, ggplot2, readr, forcats...
library(janitor)     # clean_names()
library(lubridate)   # ymd(), year()
library(tidymodels)  # recipes + workflows + parsnip + yardstick
library(yardstick)   # (tidymodels loads this, explicit for clarity)
library(pROC)        # threshold optimisation & extra ROC tools

# ===== 1. Load & initial inspect ============================================
# * Please place the data file in the same working directory
df <- read_csv("dat_with_country_all.csv") %>%
  clean_names() %>%
  mutate(
    # Convert the target label to a factor; set the level order to "low" -> "high"
    popularity = factor(popularity, levels = c("low", "high")),
    # Convert dates to Date objects and extract only the year
    track_album_release_date = ymd(track_album_release_date),
    release_year = year(track_album_release_date)
  ) %>%
  select(-track_album_release_date) # Drop the original date column

glimpse(df)
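The abstract describes two cleaning rules: flagging duplicates at the track-artist-playlist level while tolerating comma, ampersand, and semicolon separators in multi-artist fields, and dropping low-popularity rows that carry the borderline score of 68. The following is a minimal sketch of how such rules might look on toy data. It is not the report's actual implementation: the column names (`track_name`, `track_artist`, `playlist_name`, `track_popularity`), the helper `normalise_artists()`, and the toy rows are all assumptions made for illustration.

```r
library(dplyr)
library(stringr)
library(tibble)

# Toy data; column names and values are hypothetical, not taken from the real file.
toy <- tribble(
  ~track_name, ~track_artist,         ~playlist_name, ~track_popularity, ~popularity,
  "Song A",    "Artist X, Artist Y",  "Mix 1",        80,                "high",
  "Song A",    "Artist X & Artist Y", "Mix 1",        80,                "high",
  "Song B",    "Artist Z",            "Mix 2",        68,                "high",
  "Song B",    "Artist Z",            "Mix 3",        68,                "low",
  "Song C",    "Artist W; Artist V",  "Mix 2",        40,                "low"
)

# Hypothetical helper: split on comma, ampersand, or semicolon, then sort the
# names so that separator choice and artist order do not affect the key.
normalise_artists <- function(x) {
  parts <- str_split(str_to_lower(x), "\\s*[,&;]\\s*")
  vapply(parts, function(p) paste(sort(str_trim(p)), collapse = "|"), character(1))
}

cleaned <- toy %>%
  mutate(artist_key = normalise_artists(track_artist)) %>%
  group_by(track_name, artist_key, playlist_name) %>%
  mutate(is_duplicate = row_number() > 1) %>%   # flag every repeat after the first
  ungroup() %>%
  filter(!is_duplicate) %>%
  # Assumed reading of the borderline rule: drop "low" rows scored exactly 68.
  filter(!(popularity == "low" & track_popularity == 68)) %>%
  select(-artist_key, -is_duplicate)

print(cleaned)
```

Sorting the split names makes the key insensitive to artist order as well as to the separator used. That is a design choice of this sketch and may be stricter or looser than the rules in the report.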