spotify02
This notebook enriches a cleaned Spotify track list with artist-country metadata to support geographic analyses of streaming popularity. Starting from spotify_data_rmdup.csv—a 13 k-row table of unique tracks and curated audio-features—we query the public MusicBrainz API in 1 000-row batches, assigning one or more ISO-3166 country codes to every artist credited on each track. A lightweight in-memory cache plus an on-disk RDS file prevent redundant requests, while an adjustable one-second delay respects MusicBrainz rate limits. All steps—chunk creation, safe API calls, duplicate handling, progress reporting, and final export—are fully scripted in R and wrapped in a Quarto document for reproducibility. The resulting file, dat_with_country_all.rds, adds a pipe-delimited country field that can be merged back into genre or popularity studies, enabling questions such as “Which countries dominate high-popularity electronic playlists?” or “How do cross-border collaborations affect reach?” Researchers can reuse or extend the workflow with minimal edits.
Spotify, R language
The global nature of streaming means that a single Spotify playlist often features artists scattered across continents. Understanding where those artists come from is crucial when modelling cultural diffusion, collaboration networks, or the regionality of taste. Unfortunately, Spotify’s public metadata does not expose consistent country-of-origin fields. To bridge this gap, the present Quarto notebook augments an already deduplicated track table (spotify_data_rmdup.csv) with country codes harvested from the crowd-sourced MusicBrainz knowledge graph.
We begin by loading four core tidyverse helpers (dplyr, stringr, purrr, and readr) alongside musicbrainz and httr for HTTP calls, plus progress for the progress bar. Because MusicBrainz enforces strict user-agent and rate-limit policies, we set an explicit agent string (spotify-country-batch/1.1) and a modest RATE_WAIT delay of one second after each row. This caps throughput at 3 600 rows per hour, and request latency lowers the real figure further. Note that the pause is applied once per row, so a row with several credited artists issues its look-ups back to back.
Next, we define chunk boundaries with seq(1, nrow(dat), by = 1000), adding a sentinel at nrow(dat)+1 so that the last block ends on the final row. Each chunk is processed independently, and row-level API errors are caught rather than aborting the run. As written, however, the enriched table and the cache are written to disk only in the final step, so surviving a mid-run crash requires saving each block as it completes. The cache itself is a simple R environment, stored as a named list in artist_country_cache.rds, that maps artist names (optionally with album and track for disambiguation) to previously discovered ISO-3166 codes. On reruns the cache is reloaded, so a get_country() implementation that consults it can skip look-ups for artists it has already resolved.
The critical yet user-dependent function, get_country(), encapsulates the MusicBrainz query. You should implement it to (1) search for the artist by name, (2) optionally filter by release group or recording that matches the album or track, and (3) return the primary or earliest known country code, or NA_character_ when none is found. The row-wise helper that calls get_country() for every artist on a track is wrapped in purrr::safely() to yield safe_country(), which returns a list with a result and an error element instead of aborting the run. process_block() then logs the error message and records NA for that row.
process_block() loops row-by-row through a chunk and calls safe_country(), which splits multi-artist strings on commas, fetches a code for each artist, drops missing values and duplicates with unique(), and joins the remaining codes with "|". A progress_bar from the progress package provides real-time feedback (percent and ETA). After all blocks are mapped and row-bound with purrr::map2_dfr(), we write two artefacts: dat_with_country_all.rds, containing the enriched dataset, and the updated cache.
Because every transformation is staged in discrete, well-commented code cells—with computationally expensive cells defaulted to eval = FALSE—the notebook is easy to audit, rerun, or adapt. Swap in a different chunk size, tighten the rate limit, or repoint get_country() toward your own knowledge base; the surrounding scaffolding remains intact. Researchers thus gain a turnkey, reproducible path from raw Spotify exports to geography-aware analyses of streaming behaviour.
Runtime note: Running the full 13 k‑row batch through the MusicBrainz API typically takes 6–7 hours overnight on a standard home internet connection.
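As a rough cross-check of that estimate, the arithmetic below uses only the figures quoted in this notebook (about 13 000 rows, a one-second pause per row, 6 to 7 hours observed). The variable names are illustrative and the row count is approximate.

```r
# Back-of-envelope runtime check using the figures quoted above
n_rows    <- 13000   # approximate row count of spotify_data_rmdup.csv
rate_wait <- 1       # seconds of Sys.sleep() per row (RATE_WAIT)

# Time spent sleeping alone, in hours
sleep_hours <- n_rows * rate_wait / 3600

# Average seconds per row implied by a 6 to 7 hour run
implied_per_row <- c(6, 7) * 3600 / n_rows

# What remains after the pause is spent on requests and parsing
implied_overhead <- implied_per_row - rate_wait

round(sleep_hours, 1)
round(implied_overhead, 2)
```

The pause alone accounts for roughly 3.6 hours, so the reported runtime implies a little under one second of request and parsing time per row on top of the delay.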
0 Load the libraries
# ----------------------------------------------------------------------------
# batch_country_lookup.R — enrich tracks with artist country codes in batches
# ----------------------------------------------------------------------------
# Core libraries
library(dplyr)
library(stringr)
library(purrr)
# MusicBrainz + HTTP utilities
library(musicbrainz)
library(httr)
# Progress bar
library(progress)
# I/O helper
library(readr)
# ---------------------------------------------------------------------------
# Configuration
# ---------------------------------------------------------------------------
httr::set_config(user_agent("spotify-country-batch/1.1 (your_email@example.com)"))
RATE_WAIT <- 1 # seconds between API calls to respect rate limits
1 Load the cleaned Spotify data
# `spotify_data_rmdup.csv` should be the deduplicated file created earlier
# (13 k rows, no borderline popularity duplicates)
dat <- read.csv("spotify_data_rmdup.csv", stringsAsFactors = FALSE)  # keep text columns as character on R < 4.0
2 Chunk the data frame into 1 000‑row blocks
# Compute split indices
split_points <- c(seq(1, nrow(dat), by = 1000), nrow(dat) + 1)
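To see how the sentinel turns split points into inclusive row ranges, here is a small worked example on a toy size of 2 500 rows. The `ranges` data frame and the toy numbers are illustrative only and are not part of the pipeline.

```r
# Toy example: 2 500 rows split into 1 000-row blocks
n_rows     <- 2500
block_size <- 1000

split_points <- c(seq(1, n_rows, by = block_size), n_rows + 1)

# Each block runs from one split point to just before the next
ranges <- data.frame(
  block = seq_len(length(split_points) - 1),
  from  = head(split_points, -1),
  to    = tail(split_points, -1) - 1
)
ranges
```

This gives three blocks covering rows 1 to 1000, 1001 to 2000, and 2001 to 2500; the last block is shorter, and the sentinel ensures it still ends on the final row.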
3 Prepare an artist‑to‑country cache
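cache_file <- "artist_country_cache.rds"
# The cache is saved as a named list (step 8), so convert it back to an environment on reload
cache <- if (file.exists(cache_file)) list2env(readRDS(cache_file), envir = new.env()) else new.env()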
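The notebook leaves it to get_country() to consult the cache. One possible way to do that is a small memoising wrapper, sketched below under assumptions: `cached_lookup()` and `fake_lookup()` are hypothetical names, the key is simply the lower-cased artist name, and `fake_lookup()` stands in for a real MusicBrainz query.

```r
# Environment-backed memoisation sketch (helper names are hypothetical)
cache <- new.env(parent = emptyenv())

cached_lookup <- function(artist, lookup, cache) {
  key <- tolower(trimws(artist))            # "ABBA " and "abba" share one entry
  if (is.na(key) || !nzchar(key)) return(NA_character_)  # unusable as a key
  if (exists(key, envir = cache, inherits = FALSE)) {
    return(get(key, envir = cache, inherits = FALSE))    # cache hit: no API call
  }
  value <- lookup(artist)                   # cache miss: ask the real lookup once
  assign(key, value, envir = cache)         # NA is stored too, so misses are not retried
  value
}

# Stand-in lookup that counts how often it is actually called
n_calls <- 0
fake_lookup <- function(artist) {
  n_calls <<- n_calls + 1
  if (tolower(trimws(artist)) == "abba") "SE" else NA_character_
}

first  <- cached_lookup("ABBA", fake_lookup, cache)
second <- cached_lookup("abba ", fake_lookup, cache)  # served from the cache

# Round trip through disk, mirroring the as.list() / readRDS() pattern
tmp <- tempfile(fileext = ".rds")
saveRDS(as.list(cache), tmp)
restored <- list2env(readRDS(tmp), envir = new.env(parent = emptyenv()))
```

Caching NA results avoids repeating look-ups for artists MusicBrainz does not know, at the cost of never retrying a look-up that failed for a transient reason; whether that trade-off is acceptable depends on the run.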
4 Helper: get_country()
# Implement your MusicBrainz query here. It must return a single two-letter
# ISO 3166-1 code, or NA_character_ when no country is found.
get_country <- function(artist, album = NULL, track = NULL) {
  # TODO: Replace with actual MusicBrainz API call + parsing logic
  NA_character_
}
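As a starting point for the TODO above, here is a minimal sketch that queries the MusicBrainz artist search endpoint directly with httr and takes the country of the best-scoring match. It is not the author's implementation: `extract_country()` and `get_country_http()` are hypothetical names, the `album` and `track` arguments are accepted but ignored, and picking the top-scoring hit is a simplifying assumption that can misattribute artists who share a name.

```r
# Pure parsing step: country of the best-scoring artist in a parsed search response.
# `parsed` is the nested list obtained from the JSON body of an artist search.
extract_country <- function(parsed) {
  artists <- parsed$artists
  if (length(artists) == 0) return(NA_character_)
  scores <- vapply(artists, function(a) {
    if (is.null(a$score)) NA_real_ else as.numeric(a$score)
  }, numeric(1))
  idx <- which.max(scores)              # ignores NA scores
  if (length(idx) == 0) idx <- 1        # no usable score: fall back to the first hit
  best <- artists[[idx]]
  if (is.null(best$country)) NA_character_ else toupper(best$country)
}

# Network step (assumes httr is installed and a user agent has been configured)
get_country_http <- function(artist, album = NULL, track = NULL) {
  quoted <- gsub('"', '\\"', artist, fixed = TRUE)   # escape quotes inside the name
  resp <- httr::GET(
    "https://musicbrainz.org/ws/2/artist",
    query = list(query = paste0('artist:"', quoted, '"'), fmt = "json", limit = 5)
  )
  httr::stop_for_status(resp)           # turn HTTP errors into R errors
  extract_country(httr::content(resp, as = "parsed", type = "application/json"))
}

# Offline check of the parsing step with a hand-made response
mock <- list(artists = list(
  list(name = "Tribute Act", score = 80,  country = "GB"),
  list(name = "Original",    score = 100, country = "SE")
))
extract_country(mock)
```

Keeping the parsing separate from the HTTP call makes it possible to test the selection logic without touching the network.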
5 Safe row‑wise wrapper
safe_country <- purrr::safely(function(row) {
  arts  <- str_trim(str_split(row$track_artist, ",")[[1]])
  album <- row$track_album_name
  trk   <- row$track_name
  codes <- map_chr(arts, get_country, album = album, track = trk)
  codes <- unique(discard(codes, is.na))
  if (length(codes) == 0) NA_character_ else str_c(codes, collapse = "|")
})
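For readers unfamiliar with purrr::safely(), the toy example below shows the shape of what a wrapped function returns. The `risky()` function is made up for illustration and has nothing to do with MusicBrainz.

```r
library(purrr)

# A function that fails for some inputs
risky <- function(x) {
  if (x < 0) stop("negative input")
  sqrt(x)
}
safe_risky <- safely(risky)

ok  <- safe_risky(9)    # success: result is set, error is NULL
bad <- safe_risky(-1)   # failure: result is NULL, error holds the condition

ok$result
is.null(ok$error)
is.null(bad$result)
conditionMessage(bad$error)
```

A wrapped call never throws; it always returns a list with `result` and `error`, exactly one of which is NULL. This is why process_block() checks `res$error` before reading `res$result`.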
6 Process one block at a time
process_block <- function(df, block_id) {
  pb <- progress_bar$new(
    format = sprintf("Block %d [:bar] :percent ETA: :eta", block_id),
    total = nrow(df), clear = FALSE, width = 60
  )
  out <- vector("character", nrow(df))
  for (i in seq_len(nrow(df))) {
    res <- safe_country(df[i, ])
    if (!is.null(res$error)) {
      message("⚠️ row ", i, " error: ", res$error$message)
      out[i] <- NA_character_
    } else {
      out[i] <- res$result
    }
    pb$tick()
    Sys.sleep(RATE_WAIT)
  }
  df$country <- out
  df
}
7 Iterate through all blocks
# Both sequences are the block index 1..(number of blocks); only .x is used
blocks <- map2_dfr(
  seq_along(split_points[-length(split_points)]),
  seq_len(length(split_points) - 1),
  ~ {
    from <- split_points[.x]
    to   <- split_points[.x + 1] - 1
    process_block(dat[from:to, ], .x)
  }
)
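Because the loop above keeps all results in memory until the final step, a crash partway through loses every completed block. One possible remedy is to checkpoint each block to its own file and skip blocks that already exist on rerun. The sketch below shows the idea on toy data: `run_block_checkpointed()`, the `block_%03d.rds` naming, and `fake_process()` (a stand-in for process_block()) are all hypothetical.

```r
# Per-block checkpointing sketch (function and file names are hypothetical)
run_block_checkpointed <- function(block_id, df, process_fn, dir) {
  path <- file.path(dir, sprintf("block_%03d.rds", block_id))
  if (file.exists(path)) {
    return(readRDS(path))        # finished in an earlier run: reuse it
  }
  res <- process_fn(df, block_id)
  saveRDS(res, path)             # persist before moving to the next block
  res
}

# Toy demonstration with a stand-in for process_block()
toy <- data.frame(id = 1:5)
split_points <- c(seq(1, nrow(toy), by = 2), nrow(toy) + 1)
fake_process <- function(df, block_id) {
  df$country <- "XX"
  df
}

ckpt_dir <- tempfile("blocks_")
dir.create(ckpt_dir)

pieces <- lapply(seq_len(length(split_points) - 1), function(b) {
  rows <- split_points[b]:(split_points[b + 1] - 1)
  run_block_checkpointed(b, toy[rows, , drop = FALSE], fake_process, ckpt_dir)
})
result <- do.call(rbind, pieces)
```

On a rerun with the same directory, completed blocks are read back from disk and only the missing ones call the processing function again.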
8 Persist results and cache
write_rds(blocks, "dat_with_country_all.rds")
# write.csv(blocks, "dat_with_country_all.csv", row.names = FALSE)
saveRDS(as.list(cache), cache_file)
Change eval=FALSE to eval=TRUE in the relevant code cells when you are ready to execute the long-running MusicBrainz queries.
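Once dat_with_country_all.rds exists, the pipe-delimited country field usually has to be expanded to one row per track and country before counting or joining. The sketch below shows one way to do that in base R. The `tracks` table and its values are made up for illustration, and only the `track_name` and `country` columns mirror the notebook.

```r
# Toy stand-in for the enriched table; values are invented for illustration
tracks <- data.frame(
  track_name = c("Song A", "Song B", "Song C"),
  country    = c("US|GB", "SE", NA),
  stringsAsFactors = FALSE
)

# "|" is a regex metacharacter, so split on it literally
codes <- strsplit(tracks$country, "|", fixed = TRUE)

# One row per (track, country) pair; tracks without a code keep a single NA row
long <- data.frame(
  track_name = rep(tracks$track_name, lengths(codes)),
  country    = unlist(codes),
  stringsAsFactors = FALSE
)

table(long$country, useNA = "ifany")
```

In the long format a collaboration between artists from two countries counts once for each country, which suits questions about which countries appear on popular tracks.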