Spotify01

Author

Takafumi Kubota

Published

June 3, 2025

Abstract

Spotify’s vast streaming catalog offers an ideal sandbox for probing how musical style interacts with commercial success. This technical report chronicles the assembly of an analysis-ready dataset that merges two complementary slices, “High Popularity” and “Low Popularity”, from the Kaggle Spotify Music Data repository. After appending a categorical popularity label, we condense over thirty playlist-level genre tags into six headline genres plus an aggregated “others” bucket to streamline visualisation and modelling. Duplicate records are flagged at the track-artist-playlist level using a separator rule that splits multi-artist fields on commas, ampersands, and semicolons. Because the borderline popularity score of sixty-eight appears in both slices, low-popularity rows scoring sixty-eight or higher are excluded to guarantee mutually exclusive groups. The resulting table contains 13,342 unique tracks spanning electronic, pop, Latin, hip-hop, ambient, rock, and “others”. Every transformation is scripted in R and encapsulated within a Quarto document, so the pipeline can be rerun on fresh dumps or forked datasets. The cleaned corpus supports subsequent inquiries into genre-specific audience reach, collaboration networks, and the quantitative drivers of streaming attention. Practitioners can reuse the workflow as a template for playlist analytics, while researchers gain a transparent foundation for reproducible music-industry studies.

Keywords

Spotify, R language

All analyses in this document use the Spotify Music Data collection hosted on Kaggle (https://www.kaggle.com/datasets/pavansanagapati/spotify-music-data/data). From that repository we downloaded two curated CSV files—high_popularity_spotify_data.csv and low_popularity_spotify_data.csv—each containing playlist metadata, audio features, and Spotify’s proprietary track_popularity score. After loading the files into R, we added a popularity flag (“high” or “low”) to preserve provenance, then combined the tables with dplyr::bind_rows() to produce a single tidy frame (spotify_raw). All subsequent cleaning steps—genre consolidation, duplicate detection, and removal of ambiguous popularity rows—are applied to this unified source, ensuring a consistent reference point for every downstream visualisation, model, or descriptive statistic.

0 Load the libraries

# Load required libraries (hide output in final render)
library(dplyr)
library(tidyr)
library(stringr)

1 Load the source data

# Read the high- and low-popularity CSV files downloaded from Kaggle
hg <- read.csv("high_popularity_spotify_data.csv")
lw <- read.csv("low_popularity_spotify_data.csv")

2 Combine the data sets and tag popularity level

# Add a categorical label and stack both tables
spotify_raw <- bind_rows(
  mutate(hg, popularity = "high"),
  mutate(lw, popularity = "low")
)

# Persist the initial combined table (optional)
write.csv(spotify_raw, "spotify_data.csv", row.names = FALSE)
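
The stacking step can be sanity-checked on toy data before it is run on the real files. The sketch below is illustrative only: `hg_toy`, `lw_toy`, and their values are made-up stand-ins, and it uses the `.id` argument of `bind_rows()` as an alternative way to attach the label, which should give the same result as the two `mutate()` calls.

```r
library(dplyr)

# Toy stand-ins for the two CSV files (hypothetical values, not real data)
hg_toy <- data.frame(track_name = c("A", "B"), track_popularity = c(90, 75))
lw_toy <- data.frame(track_name = c("C", "D", "E"), track_popularity = c(40, 55, 68))

# Naming the list elements lets bind_rows() write the label column itself
combined_toy <- bind_rows(list(high = hg_toy, low = lw_toy), .id = "popularity")

# Row counts should add up, and every row should carry exactly one label
stopifnot(nrow(combined_toy) == nrow(hg_toy) + nrow(lw_toy))
print(count(combined_toy, popularity))
```

If the two source files ever differ in their columns, `bind_rows()` fills the missing ones with `NA` rather than failing, so comparing `names(hg)` and `names(lw)` beforehand may be worthwhile.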

3 Collapse playlist genres into six major groups

# Define the six focus genres; everything else becomes "others"
top_genres <- c("electronic", "pop", "latin", "hip-hop", "ambient", "rock")

spotify_genre6 <- spotify_raw %>%
  mutate(
    genre6 = ifelse(playlist_genre %in% top_genres,
                    playlist_genre,
                    "others")
  )

write.csv(spotify_genre6, "spotify_data_genre6.csv", row.names = FALSE)
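
Because `%in%` matches labels exactly, it can help to inspect how raw genre tags map onto `genre6`. The following is a minimal sketch on made-up labels (the values in `genre_toy` are hypothetical and not taken from the dataset); it shows that differently cased or missing tags fall into "others".

```r
library(dplyr)

top_genres <- c("electronic", "pop", "latin", "hip-hop", "ambient", "rock")

# Hypothetical playlist_genre values; "Pop", "k-pop" and NA are deliberately off-list
genre_toy <- data.frame(
  playlist_genre = c("pop", "Pop", "k-pop", "rock", "jazz", NA)
)

genre_toy <- genre_toy %>%
  mutate(genre6 = ifelse(playlist_genre %in% top_genres, playlist_genre, "others"))

# Cross-tabulate raw tag against collapsed tag to review the mapping
print(count(genre_toy, playlist_genre, genre6))
```

Running the same `count(playlist_genre, genre6)` on `spotify_genre6` would list every raw tag next to its collapsed group, which makes unexpected assignments easy to spot.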

4 Check for duplicated track records (same name × artist × playlist)

# Split multiple artists listed in one cell and examine duplicates
artist_expanded <- spotify_genre6 %>%
  separate_rows(track_artist, sep = ",|&|;") %>%  # allow multiple separators
  mutate(track_artist = str_trim(track_artist))

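duplicates_tracks <- artist_expanded %>%
  group_by(track_name, track_artist, playlist_name) %>%
  filter(n() > 1) %>%
  ungroup() %>%  # drop grouping so later steps see a plain table
  arrange(track_name, track_artist, playlist_name)

# The count is per expanded artist row, so a multi-artist track contributes several rows
n_dupes <- nrow(duplicates_tracks)
print(paste("Potential duplicate rows:", n_dupes))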
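
To see what the duplicate count actually measures, the check can be traced on a tiny example. This sketch uses invented rows (`dupe_toy` and its contents are hypothetical) in which one track is listed twice in the same playlist with differently delimited artist fields.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Hypothetical rows: the first two are the same track listed twice in one playlist
dupe_toy <- data.frame(
  track_name    = c("Song X", "Song X", "Song Y"),
  track_artist  = c("Artist A & Artist B", "Artist A, Artist B", "Artist C"),
  playlist_name = c("Mix 1", "Mix 1", "Mix 1")
)

# One row per artist, using the same separator pattern as the step above
expanded_toy <- dupe_toy %>%
  separate_rows(track_artist, sep = ",|&|;") %>%
  mutate(track_artist = str_trim(track_artist))

# Rows whose name x artist x playlist combination occurs more than once
flagged_toy <- expanded_toy %>%
  group_by(track_name, track_artist, playlist_name) %>%
  filter(n() > 1) %>%
  ungroup()

# Collapse back to track level to count duplicated tracks rather than artist rows
dupe_tracks_toy <- distinct(flagged_toy, track_name, playlist_name)
```

Here one duplicated two-artist track produces four flagged rows, so the printed row count is an upper bound on the number of affected tracks; the `distinct()` call gives the track-level figure.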

5 Refine the popularity split

The borderline popularity score of 68 appears in both the “low” and “high” lists. To make the classes mutually exclusive, we drop every row in the low slice whose track_popularity is 68 or higher and keep the high slice unchanged.

# Keep low rows strictly below the 68 threshold; keep all high rows
spotify_clean <- bind_rows(
  spotify_genre6 %>% filter(popularity == "low"  & track_popularity < 68),
  spotify_genre6 %>% filter(popularity == "high")
)

write.csv(spotify_clean, "spotify_data_clean.csv", row.names = FALSE)
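
Whether the two classes really are mutually exclusive after this filter can be confirmed by comparing score ranges per class. The sketch below runs the same filter on invented scores (`split_toy` is hypothetical) and asserts that the highest "low" score sits below the lowest "high" score.

```r
library(dplyr)

# Hypothetical scores; 68 appears under both labels, as described above
split_toy <- data.frame(
  track_name       = c("A", "B", "C", "D", "E"),
  track_popularity = c(68, 82, 68, 50, 67),
  popularity       = c("high", "high", "low", "low", "low")
)

# Same rule as the step above: trim the low slice, keep the high slice whole
clean_toy <- bind_rows(
  split_toy %>% filter(popularity == "low", track_popularity < 68),
  split_toy %>% filter(popularity == "high")
)

# Score range and row count per class
ranges_toy <- clean_toy %>%
  group_by(popularity) %>%
  summarise(min_score = min(track_popularity),
            max_score = max(track_popularity),
            n = n())

low_max  <- ranges_toy$max_score[ranges_toy$popularity == "low"]
high_min <- ranges_toy$min_score[ranges_toy$popularity == "high"]
stopifnot(low_max < high_min)
```

Applied to `spotify_clean`, the same assertion would fail if the high file contained any score below 68, which is a case the filter above does not handle.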