Spotify01
Spotify’s vast streaming catalog is a useful sandbox for studying how musical style relates to commercial success. This technical report documents the assembly of an analysis-ready dataset that merges two complementary slices, “High Popularity” and “Low Popularity”, from the Kaggle Spotify Music Data repository. After appending a categorical popularity label, we condense over thirty playlist-level genre tags into six headline genres plus an aggregated “others” bucket to streamline visualisation and modelling. Potential duplicate records are flagged at the track, artist, and playlist level, using a separator pattern that splits multi-artist fields on commas, ampersands, and semicolons. Because the borderline popularity score of 68 appears in both slices, ambiguous low-popularity rows are excluded to guarantee mutually exclusive groups. The resulting table contains 13,342 unique tracks spanning electronic, pop, Latin, hip-hop, ambient, rock, and “others”. Every transformation is scripted in R inside a Quarto document, so the pipeline can be rerun on fresh dumps or forked datasets. The cleaned corpus supports subsequent inquiries into genre-specific audience reach, collaboration networks, and the quantitative drivers of streaming attention. Practitioners can reuse the workflow as a template for playlist analytics, while researchers gain a transparent foundation for reproducible music-industry studies. The code is concise, documented, and easily shareable.
Spotify, R language
All analyses in this document use the Spotify Music Data collection hosted on Kaggle (https://www.kaggle.com/datasets/pavansanagapati/spotify-music-data/data). From that repository we downloaded two curated CSV files—high_popularity_spotify_data.csv and low_popularity_spotify_data.csv—each containing playlist metadata, audio features, and Spotify’s proprietary track_popularity score. After loading the files into R, we added a popularity flag (“high” or “low”) to preserve provenance, then combined the tables with dplyr::bind_rows() to produce a single tidy frame (spotify_raw). All subsequent cleaning steps—genre consolidation, duplicate detection, and removal of ambiguous popularity rows—are applied to this unified source, ensuring a consistent reference point for every downstream visualisation, model, or descriptive statistic.
0 Load the libraries
# Load required libraries (hide output in final render)
library(dplyr)
library(tidyr)
library(stringr)
1 Load the source data
# Read the high- and low-popularity track files downloaded from Kaggle
hg <- read.csv("high_popularity_spotify_data.csv")
lw <- read.csv("low_popularity_spotify_data.csv")
2 Combine the data sets and tag popularity level
# Add a categorical label and stack both tables
spotify_raw <- bind_rows(
  mutate(hg, popularity = "high"),
  mutate(lw, popularity = "low")
)
# Persist the initial combined table (optional)
write.csv(spotify_raw, "spotify_data.csv", row.names = FALSE)
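The tagging step can be checked on a small scale. The sketch below applies the same mutate() plus bind_rows() pattern to two tiny made-up tables and counts rows per label. The toy rows and the names hg_toy, lw_toy, and toy_raw are assumptions for illustration only, not values from the Kaggle files.

```r
library(dplyr)

# Toy stand-ins for the two CSV files (hypothetical values)
hg_toy <- data.frame(track_name = c("A", "B"),
                     track_popularity = c(90, 75),
                     stringsAsFactors = FALSE)
lw_toy <- data.frame(track_name = c("C", "D", "E"),
                     track_popularity = c(40, 68, 12),
                     stringsAsFactors = FALSE)

# Same pattern as the main pipeline: label each slice, then stack
toy_raw <- bind_rows(
  mutate(hg_toy, popularity = "high"),
  mutate(lw_toy, popularity = "low")
)

# Row counts per label should match the sizes of the two inputs
label_counts <- count(toy_raw, popularity)
print(label_counts)
```

Running the same count() on spotify_raw is a cheap way to confirm that no rows were lost or mislabeled when the two files were stacked.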
3 Collapse playlist genres into six major groups
# Define the six focus genres; everything else becomes "others"
top_genres <- c("electronic", "pop", "latin", "hip-hop", "ambient", "rock")
spotify_genre6 <- spotify_raw %>%
  mutate(
    genre6 = ifelse(playlist_genre %in% top_genres,
                    playlist_genre,
                    "others")
  )
write.csv(spotify_genre6, "spotify_data_genre6.csv", row.names = FALSE)
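To see the effect of the collapse, the following sketch runs the same ifelse() rule on a handful of invented genre tags and tabulates the result. The toy tags ("jazz", "k-pop") and the object names are assumptions, not taken from the dataset.

```r
library(dplyr)

# Same six focus genres as in the main pipeline
top_genres_toy <- c("electronic", "pop", "latin", "hip-hop", "ambient", "rock")

# Hypothetical playlist tags; "jazz" and "k-pop" are not in the focus list
toy_genres <- data.frame(
  playlist_genre = c("pop", "rock", "jazz", "k-pop", "latin", "pop"),
  stringsAsFactors = FALSE
)

# Keep focus genres as they are and relabel everything else as "others"
toy_genre6 <- toy_genres %>%
  mutate(genre6 = ifelse(playlist_genre %in% top_genres_toy,
                         playlist_genre,
                         "others"))

# Tabulate the collapsed labels, largest group first
genre_counts <- count(toy_genre6, genre6, sort = TRUE)
print(genre_counts)
```

The toy table sets stringsAsFactors = FALSE on purpose: ifelse() returns integer codes when the genre column is a factor, which is the default for read.csv() in R versions before 4.0.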
4 Check for duplicated track records (same name × artist × playlist)
# Split multiple artists listed in one cell and examine duplicates
artist_expanded <- spotify_genre6 %>%
  separate_rows(track_artist, sep = ",|&|;") %>% # allow multiple separators
  mutate(track_artist = str_trim(track_artist))
duplicates_tracks <- artist_expanded %>%
  group_by(track_name, track_artist, playlist_name) %>%
  filter(n() > 1) %>%
  arrange(track_name, track_artist, playlist_name)
n_dupes <- nrow(duplicates_tracks)
print(paste("Potential duplicate rows:", n_dupes))
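The behaviour of the separator pattern is easiest to see on a small example. The sketch below uses three invented rows (all track, artist, and playlist names are hypothetical) to show how multi-artist cells are expanded and which rows are then flagged.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Hypothetical rows: two entries of "Song X" share the artist "Bob"
toy_tracks <- data.frame(
  track_name    = c("Song X", "Song X", "Song Y"),
  track_artist  = c("Ann & Bob", "Bob; Cy", "Ann, Bob"),
  playlist_name = c("Mix 1", "Mix 1", "Mix 1"),
  stringsAsFactors = FALSE
)

# Split on comma, ampersand, or semicolon, then trim stray whitespace
toy_expanded <- toy_tracks %>%
  separate_rows(track_artist, sep = ",|&|;") %>%
  mutate(track_artist = str_trim(track_artist))

# Flag name x artist x playlist combinations that occur more than once
toy_dupes <- toy_expanded %>%
  group_by(track_name, track_artist, playlist_name) %>%
  filter(n() > 1) %>%
  ungroup()

print(toy_dupes)
```

In this toy case the two "Song X" rows are flagged because they share one artist, even though their original artist cells differ. This is why the count is reported as potential duplicates: a flagged row calls for inspection and is not necessarily an exact repeat.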
5 Refine the popularity split
The borderline popularity score of 68 occasionally appears in both the “low” and “high” lists. To make the classes mutually exclusive, we keep only low-slice rows with a score below 68 and leave the high slice unchanged.
spotify_clean <- bind_rows(
  spotify_genre6 %>% filter(popularity == "low" & track_popularity < 68),
  spotify_genre6 %>% filter(popularity == "high")
)
write.csv(spotify_clean, "spotify_data_clean.csv", row.names = FALSE)
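A simple sanity check for this step is to compare the score ranges of the two classes after filtering. The sketch below does so on four invented rows. The values and object names are assumptions, and the check presumes that the high slice never scores below 68, which should be confirmed on the real data.

```r
library(dplyr)

# Hypothetical rows: the score 68 appears under both labels
toy_scores <- data.frame(
  track_name       = c("A", "B", "C", "D"),
  track_popularity = c(68, 80, 68, 30),
  popularity       = c("high", "high", "low", "low"),
  stringsAsFactors = FALSE
)

# Same rule as the main pipeline: drop low rows scoring 68 or more
toy_clean <- bind_rows(
  filter(toy_scores, popularity == "low" & track_popularity < 68),
  filter(toy_scores, popularity == "high")
)

# Score range per class after cleaning
score_ranges <- toy_clean %>%
  group_by(popularity) %>%
  summarise(min_score = min(track_popularity),
            max_score = max(track_popularity))

# The classes are mutually exclusive when the ranges do not overlap
low_max  <- score_ranges$max_score[score_ranges$popularity == "low"]
high_min <- score_ranges$min_score[score_ranges$popularity == "high"]
ranges_disjoint <- low_max < high_min
print(ranges_disjoint)
```

Note that filter() also drops rows where track_popularity is NA, so low-slice rows with a missing score would be removed by the same rule.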