Summary

In this project, nodes (vertices) represent individual artists, and edges represent collaborations, specifically two artists being featured together on a song. The dataset includes information on approximately 20,000 primary artists whose songs appeared on the Spotify weekly charts, along with around 136,000 additional artists who were featured on songs with at least one charting artist.

The dataset also captures the frequency and structure of these collaborations, allowing us to generate a large-scale network comprising over 135,000 artists as nodes and more than 300,000 edges representing collaborative links between them.

Data Sources:
- Aggregated weekly Spotify chart data was collected from Kworb
- Artist and feature data were scraped from the Spotify API

Temporal Coverage:
- Start Date: September 28, 2013
- End Date: October 9, 2022

Load required libraries

library(tidyverse)
library(igraph)
library(tidygraph)
library(ggraph)

Step 1: Read the data

e <- read.csv("edges.csv", header = TRUE, stringsAsFactors = FALSE)
n <- read.csv("nodes.csv", header = TRUE, stringsAsFactors = FALSE)

head(e)
##                     id_0                   id_1
## 1 76M2Ekj8bG8W7X2nbx2CpF 7sfl4Xt5KmfyDs2T3SVSMK
## 2 0hk4xVujcyOr6USD95wcWb 7Do8se3ZoaVqUt3woqqSrD
## 3 38jpuy3yt3QIxQ8Fn1HTeJ 4csQIMQm6vI2A2SCVDuM2z
## 4 6PvcxssrQ0QaJVaBWHD07l 6UCQYrcJ6wab6gnQ89OJFh
## 5 2R1QrQqWuw3IjoP5dXRFjt 4mk1ScvOUkuQzzCZpT6bc0
## 6 0k70gnDBLPirCltbTzoxuM 5FK3qokBQYxr7ZLkr8GVFn
head(n)
##               spotify_id               name followers popularity
## 1 48WvrUGoijadXXCsGocwM4          Byklubben      1738         24
## 2 4lDiJcOJ2GLCK6p9q5BgfK           Kontra K   1999676         72
## 3 652XIvIBNGg3C0KIGEJWit              Maxim     34596         36
## 4 3dXC1YPbnQPsfHPVkm1ipj Christopher Martin    249233         52
## 5 74terC9ol9zMo8rfzhSOiG      Jakob Hellman     21193         39
## 6 0FQMb3mVrAKlyU4H5mQOJh               Madh     26677         19
##                                                           genres
## 1                                 ['nordic house', 'russelater']
## 2                         ['christlicher rap', 'german hip hop']
## 3                                                             []
## 4 ['dancehall', 'lovers rock', 'modern reggae', 'reggae fusion']
## 5     ['classic swedish pop', 'norrbotten indie', 'swedish pop']
## 6                                                             []
##                                                chart_hits
## 1                                              ['no (3)']
## 2 ['at (44)', 'de (111)', 'lu (22)', 'ch (31)', 'vn (1)']
## 3                                              ['de (1)']
## 4                                    ['at (1)', 'de (1)']
## 5                                              ['se (6)']
## 6                                              ['it (2)']
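The `genres` and `chart_hits` columns are stored as text that looks like a Python list, not as R vectors. The sketch below shows one way such a string could be split into a small table. It assumes each entry has the form `cc (n)`, which appears to be a country code followed by a count; the names `entries` and `parsed` are made up for illustration.

```r
# One value of the chart_hits column, copied from the head(n) output above
chart_hits <- "['at (44)', 'de (111)', 'lu (22)', 'ch (31)', 'vn (1)']"

# Pull out every "cc (n)" entry from the string
entries <- regmatches(chart_hits, gregexpr("[a-z]+ \\([0-9]+\\)", chart_hits))[[1]]

# Split each entry into its code and its count
parsed <- data.frame(
  country = sub(" .*$", "", entries),
  hits    = as.integer(gsub("[^0-9]", "", entries)),
  stringsAsFactors = FALSE
)
print(parsed)
```

An empty value such as `[]` yields no matches, so `parsed` would simply have zero rows.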

Step 2: Make sure node IDs match

n_subset <- n %>%
  rename(id = spotify_id) %>%
  distinct(id, .keep_all = TRUE) %>%
  slice(1:1000)  # keep only the first 1,000 artists so the analysis runs faster

valid_ids <- n_subset$id

edges_filtered <- e %>%
  filter(id_0 %in% valid_ids & id_1 %in% valid_ids)
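
To see what the filter in this step does, here is a minimal sketch on toy data. The IDs `a` to `d` and the data frames `toy_nodes` and `toy_edges` are invented for illustration and are not part of the Spotify files. It keeps only edges whose two endpoints are both in the node table, and lists endpoints that have no matching node.

```r
library(dplyr)

# Toy data, invented for illustration only
toy_nodes <- data.frame(id = c("a", "b", "c"), stringsAsFactors = FALSE)
toy_edges <- data.frame(
  id_0 = c("a", "a", "b"),
  id_1 = c("b", "d", "c"),   # "d" does not appear in toy_nodes
  stringsAsFactors = FALSE
)

# Keep only edges whose endpoints both appear in the node table
toy_filtered <- toy_edges %>%
  filter(id_0 %in% toy_nodes$id & id_1 %in% toy_nodes$id)

# Endpoints that appear in the edge list but not in the node table
missing_ids <- setdiff(c(toy_edges$id_0, toy_edges$id_1), toy_nodes$id)

print(toy_filtered)
print(missing_ids)
```

This matters because `graph_from_data_frame()` stops with an error if the edge list contains vertex names that are missing from the `vertices` data frame.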

Step 3: Create graph from edge list and join with node data

g <- graph_from_data_frame(d = edges_filtered, vertices = n_subset, directed = FALSE)

# The graph is undirected, so arrow settings have no effect; color the edges directly
plot(g,
     edge.color = "pink",
     vertex.color = "green",
     vertex.label = NA,
     vertex.size = 7)

The graph shows a tightly connected group of artists at the center of the plot, which appears to include some of the biggest names. Most artists in this sample are connected to only a few others. A small number of artists, however, link the rest together through many collaborations.
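
A reading of the plot like this can be checked numerically with a degree table. The sketch below uses a small made-up graph (`toy`, not the Spotify data) to show the idea; on the real network the same calls would be applied to `g`.

```r
library(igraph)

# Toy undirected graph: vertex 1 is linked to 2, 3 and 4; vertices 5 and 6 are isolated
toy <- make_graph(edges = c(1, 2, 1, 3, 1, 4), n = 6, directed = FALSE)

deg <- degree(toy)

# How many vertices have each degree value
deg_table <- table(deg)
print(deg_table)

# Share of vertices with no links at all inside the sample
isolated_share <- mean(deg == 0)
print(isolated_share)
```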

Step 4: Basic network metrics

V(g)$degree <- degree(g)
V(g)$betweenness <- betweenness(g)
V(g)$pagerank <- page_rank(g)$vector

top_degree <- V(g)[order(-degree)][1:10]
top_degree_df <- data.frame(
  name = V(g)$name[top_degree],
  degree = V(g)$degree[top_degree]
)
print("Top 10 artists by degree:")
## [1] "Top 10 artists by degree:"
print(top_degree_df)
##           name degree
## 1      The Him      8
## 2      Cardi B      7
## 3     Lil Baby      7
## 4  Nicki Minaj      7
## 5     Dua Lipa      6
## 6    Epik High      6
## 7  Cheat Codes      6
## 8   Bebe Rexha      6
## 9      Juicy J      6
## 10        Kygo      5
top_betweenness <- V(g)[order(-betweenness)][1:10]
top_betweenness_df <- data.frame(
  name = V(g)$name[top_betweenness],
  betweenness = V(g)$betweenness[top_betweenness]
)
print("Top 10 artists by betweenness:")
## [1] "Top 10 artists by betweenness:"
print(top_betweenness_df)
##           name betweenness
## 1      Cardi B    933.2667
## 2   Bebe Rexha    874.6476
## 3      Juicy J    641.8952
## 4      The Him    635.7429
## 5     Kontra K    572.0000
## 6  Nicki Minaj    526.0143
## 7        Logic    395.0000
## 8      Olexesh    390.0000
## 9         Kygo    345.9286
## 10 Cheat Codes    337.6429
top_pagerank <- V(g)[order(-pagerank)][1:10]
top_pagerank_df <- data.frame(
  name = V(g)$name[top_pagerank],
  pagerank = V(g)$pagerank[top_pagerank]
)
print("Top 10 artists by PageRank:")
## [1] "Top 10 artists by PageRank:"
print(top_pagerank_df)
##             name    pagerank
## 1      Epik High 0.007897324
## 2        The Him 0.007835590
## 3  Gusttavo Lima 0.007436214
## 4  Carlos Rivera 0.007003816
## 5    Nicki Minaj 0.006894586
## 6       Dua Lipa 0.006655025
## 7        Cardi B 0.006384164
## 8       Lil Baby 0.006348447
## 9         Common 0.005889330
## 10       Juicy J 0.005764036

- Top degree artist: The Him. This artist has the most collaboration links in the sample (degree 8) and is very active in collaborations.
- Top betweenness artist: Cardi B. This artist lies on the most shortest paths between other artists, so they connect groups of artists who don't usually work together.
- Top PageRank artist: Epik High. PageRank gives more weight to links with well-connected artists, so this artist is influential in the network and is connected to other important artists.
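
The three measures can rank artists differently, which is why the top names above are not the same in each table. As an illustration only, the sketch below builds a made-up graph (vertices `A` to `G`, not real artists) in which two triangles are joined through a single bridge vertex `D`. `D` has a low degree but the highest betweenness, because every path between the two triangles passes through it.

```r
library(igraph)

# Toy edge list: triangle A-B-C, triangle E-F-G, and D bridging C and E
toy_edges <- data.frame(
  from = c("A", "A", "B", "C", "D", "E", "E", "F"),
  to   = c("B", "C", "C", "D", "E", "F", "G", "G")
)
toy <- graph_from_data_frame(toy_edges, directed = FALSE)

# Compute the same three measures used in Step 4
centrality <- data.frame(
  name        = V(toy)$name,
  degree      = degree(toy),
  betweenness = betweenness(toy),
  pagerank    = page_rank(toy)$vector
)

# Sort by betweenness to see the bridge vertex come out on top
print(centrality[order(-centrality$betweenness), ])
```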

Step 5: Community detection using Louvain algorithm

louvain_clusters <- cluster_louvain(g)
V(g)$community <- louvain_clusters$membership

community_df <- data.frame(
  name = V(g)$name,
  community = V(g)$community
)
print("Sample of artists and their community:")
## [1] "Sample of artists and their community:"
print(head(community_df, 10))
##                  name community
## 1           Byklubben         1
## 2            Kontra K         2
## 3               Maxim         2
## 4  Christopher Martin         3
## 5       Jakob Hellman         4
## 6                Madh         5
## 7               Juice         6
## 8              Nehuda         7
## 9         VovaZiLvova         8
## 10        Nata Record         9

We used the Louvain algorithm, which groups nodes so as to maximize modularity, to find communities in the artist network. Each community is a group of artists who collaborate with each other more often than with artists outside the group. This helps us see how the network is organized into smaller groups, which may correspond to genres or regional scenes, and which artists belong to which group.
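
To make the idea of a community concrete, here is a minimal sketch on a made-up graph (not the Spotify data): two fully connected groups of four vertices joined by a single edge. Louvain should place each group in its own community. The seed value is arbitrary and is set only because the algorithm visits vertices in random order.

```r
library(igraph)

set.seed(42)  # arbitrary seed, for reproducible output

# Two 4-cliques (vertices 1-4 and 5-8) joined by one bridge edge between 4 and 5
toy <- disjoint_union(make_full_graph(4), make_full_graph(4))
toy <- add_edges(toy, c(4, 5))

toy_clusters <- cluster_louvain(toy)

# Community label of each vertex, community sizes, and the modularity score
toy_membership <- as.integer(membership(toy_clusters))
print(toy_membership)
print(sizes(toy_clusters))
print(modularity(toy_clusters))
```

On the artist graph, `sizes(louvain_clusters)` would show in the same way how many artists fall into each community.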

Step 6: Top artists by centrality

top_artists <- V(g)[order(-V(g)$degree)][1:10]

top_artists_info <- data.frame(
  name = V(g)$name[top_artists],
  degree = V(g)$degree[top_artists],
  popularity = V(g)$popularity[top_artists],
  followers = V(g)$followers[top_artists]
)


print(top_artists_info)
##           name degree popularity followers
## 1      The Him      8         56    107823
## 2      Cardi B      7         80  20361435
## 3     Lil Baby      7         89  11530234
## 4  Nicki Minaj      7         87  26039960
## 5     Dua Lipa      6         88  36163788
## 6    Epik High      6         56    630279
## 7  Cheat Codes      6         71   2087675
## 8   Bebe Rexha      6         81   7531004
## 9      Juicy J      6         73   2891371
## 10        Kygo      5         80   8134874

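This table shows the 10 artists with the highest degree in the sampled network, alongside their Spotify popularity score and follower count. These artists are the most connected within the sample, although a high degree does not always coincide with a large following: The Him ranks first by degree with about 108,000 followers, while Dua Lipa has more than 36 million.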
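
One way to put a number on how closely degree tracks popularity is a rank correlation. The sketch below is an illustration only: it uses just the ten rows printed above, which were selected for having a high degree, so the result says little about the network as a whole. On the full vertex set the same call could be applied to `V(g)$degree` and `V(g)$followers`.

```r
# Values copied from the top-10 table above
top10 <- data.frame(
  name = c("The Him", "Cardi B", "Lil Baby", "Nicki Minaj", "Dua Lipa",
           "Epik High", "Cheat Codes", "Bebe Rexha", "Juicy J", "Kygo"),
  degree = c(8, 7, 7, 7, 6, 6, 6, 6, 6, 5),
  popularity = c(56, 80, 89, 87, 88, 56, 71, 81, 73, 80),
  followers = c(107823, 20361435, 11530234, 26039960, 36163788,
                630279, 2087675, 7531004, 2891371, 8134874)
)

# Spearman rank correlation copes with ties and with the skewed follower counts
rho_popularity <- cor(top10$degree, top10$popularity, method = "spearman")
rho_followers  <- cor(top10$degree, top10$followers,  method = "spearman")
print(c(popularity = rho_popularity, followers = rho_followers))
```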

Conclusion

This analysis highlights the structure and key players in a 1,000-artist sample of the Spotify artist collaboration network. Using community detection and centrality metrics, we gain insight into how interconnected the sampled artists are and which of them play central roles in linking collaborations together. Because only the first 1,000 artists were used, the rankings may change when the full network is analyzed.