soptify02

Author

Takafumi Kubota

Published

June 3, 2025

Abstract

This notebook enriches a cleaned Spotify track list with artist-country metadata to support geographic analyses of streaming popularity. Starting from spotify_data_rmdup.csv—a 13 k-row table of unique tracks and curated audio-features—we query the public MusicBrainz API in 1 000-row batches, assigning one or more ISO-3166 country codes to every artist credited on each track. A lightweight in-memory cache plus an on-disk RDS file prevent redundant requests, while an adjustable one-second delay respects MusicBrainz rate limits. All steps—chunk creation, safe API calls, duplicate handling, progress reporting, and final export—are fully scripted in R and wrapped in a Quarto document for reproducibility. The resulting file, dat_with_country_all.rds, adds a pipe-delimited country field that can be merged back into genre or popularity studies, enabling questions such as “Which countries dominate high-popularity electronic playlists?” or “How do cross-border collaborations affect reach?” Researchers can reuse or extend the workflow with minimal edits.

Keywords

Spotify, R language


The global nature of streaming means that a single Spotify playlist often features artists scattered across continents. Understanding where those artists come from is crucial when modelling cultural diffusion, collaboration networks, or the regionality of taste. Unfortunately, Spotify’s public metadata does not expose consistent country-of-origin fields. To bridge this gap, the present Quarto notebook augments an already deduplicated track table (spotify_data_rmdup.csv) with country codes harvested from the crowd-sourced MusicBrainz knowledge graph.

We begin by loading four core tidy-verse helpers—dplyr, stringr, purrr, and readr—alongside musicbrainz and httr for HTTP calls. Because MusicBrainz enforces strict user-agent and rate-limit policies, we set an explicit agent string (spotify-country-batch/1.1) and a modest RATE_WAIT delay of one second between API hits. This keeps our process compliant while still moving through roughly 60 000 look-ups per hour.

Next, we define chunk boundaries with seq(1, nrow(dat), by = 1000), adding a sentinel at nrow(dat)+1. Each chunk is processed independently, ensuring that even if an API outage occurs the batch already completed is preserved and cached. The cache itself is a simple R environment serialised to artist_country_cache.rds, mapping artist names (optionally album and track for disambiguation) to previously discovered ISO-3166 codes. On reruns the cache is reloaded, eliminating duplicate look-ups for the same artist.

The critical yet user-dependent function, get_country(), encapsulates the MusicBrainz query. You should implement it to (1) search for the artist by name, (2) optionally filter by release group or recording that matches the album or track, and (3) return the primary or earliest known country code. Wrapping this call in purrr::safely() yields safe_country(), which converts any API error into a harmless NA while preserving the error message for logging.

process_block() loops row-by-row through a chunk, exploding multi-artist strings, fetching codes for each artist, collapsing duplicates with unique(), and concatenating multiple codes via "|". A sleek progress_bar from the progress package provides real-time feedback (percent and ETA). After all blocks are mapped with purrr::map2_dfr(), we write two artefacts: dat_with_country_all.rds containing the enriched dataset and the updated cache.

Because every transformation is staged in discrete, well-commented code cells—with computationally expensive cells defaulted to eval = FALSE—the notebook is easy to audit, rerun, or adapt. Swap in a different chunk size, tighten the rate limit, or repoint get_country() toward your own knowledge base; the surrounding scaffolding remains intact. Researchers thus gain a turnkey, reproducible path from raw Spotify exports to geography-aware analyses of streaming behaviour.

Runtime note: Running the full 13 k‑row batch through the MusicBrainz API typically takes 6–7 hours overnight on a standard home internet connection.

0  Load the library

# ----------------------------------------------------------------------------
# batch_country_lookup.R — enrich tracks with artist country codes in batches
# ----------------------------------------------------------------------------

# Core libraries
library(dplyr)
library(stringr)
library(purrr)

# MusicBrainz + HTTP utilities
library(musicbrainz)
library(httr)

# Progress bar
library(progress)

# I/O helper
library(readr)

# ---------------------------------------------------------------------------
# Configuration
# ---------------------------------------------------------------------------
httr::set_config(user_agent("spotify-country-batch/1.1 (your_email@example.com)"))
RATE_WAIT <- 1  # seconds between API calls to respect rate limits

1  Load the cleaned Spotify data

# `spotify_data_rmdup.csv` should be the deduplicated file created earlier
# (13 k rows, no borderline popularity duplicates)
dat <- read.csv("spotify_data_rmdup.csv")

2  Chunk the data frame into 1 000‑row blocks

# Compute split indices
split_points <- c(seq(1, nrow(dat), by = 1000), nrow(dat) + 1)

3  Prepare an artist‑to‑country cache

cache_file <- "artist_country_cache.rds"
cache <- if (file.exists(cache_file)) readRDS(cache_file) else new.env()

4  Helper: get_country()

# Implement your MusicBrainz query here — must return a two‑letter ISO‑3166 code
get_country <- function(artist, album = NULL, track = NULL) {
  # TODO: Replace with actual MusicBrainz API call + parsing logic
  NA_character_
}

5  Safe row‑wise wrapper

safe_country <- purrr::safely(function(row) {
  arts  <- str_trim(str_split(row$track_artist, ",")[[1]])
  album <- row$track_album_name
  trk   <- row$track_name
  codes <- map_chr(arts, get_country, album = album, track = trk)
  codes <- unique(discard(codes, is.na))
  if (length(codes) == 0) NA_character_ else str_c(codes, collapse = "|")
})

6  Process one block at a time

process_block <- function(df, block_id) {
  pb <- progress_bar$new(
    format = sprintf("Block %d [:bar] :percent ETA: :eta", block_id),
    total  = nrow(df), clear = FALSE, width = 60
  )
  out <- vector("character", nrow(df))
  for (i in seq_len(nrow(df))) {
    res <- safe_country(df[i, ])
    if (!is.null(res$error)) {
      message("⚠️  row ", i, " error: ", res$error$message)
      out[i] <- NA_character_
    } else {
      out[i] <- res$result
    }
    pb$tick()
    Sys.sleep(RATE_WAIT)
  }
  df$country <- out
  df
}

7  Iterate through all blocks

blocks <- map2_dfr(
  seq_along(split_points[-length(split_points)]),
  seq_len(length(split_points) - 1),
  ~ {
    from <- split_points[.x]
    to   <- split_points[.x + 1] - 1
    process_block(dat[from:to, ], .x)
  }
)

8  Persist results and cache

write_rds(blocks, "dat_with_country_all.rds")
# write.csv(blocks, "dat_with_country_all.csv", row.names = FALSE)

saveRDS(as.list(cache), cache_file)

Set eval=FALSE to eval=TRUE when you are ready to execute the long‑running MusicBrainz queries.