Final Project Code

Introduction

Spotify is one of the most widely used music streaming platforms in the world, serving hundreds of millions of users with personalized listening experiences built on sophisticated recommendation algorithms. Features like Daily Mixes and Discover Weekly are central to what makes Spotify feel personal, as they surface songs that feel tailored to individual taste rather than just popularity. For this assignment, the goal is to analyze Spotify’s recommender system through a scenario design lens, reverse engineer what we can about how it works using publicly available information and data, and then extend that analysis with our own implementation using real track data. Specifically, we pull audio feature data from a Kaggle CSV dataset containing 114,000 tracks spanning 114 genres, supplement it with a scraped Wikipedia table of the world’s most streamed Spotify songs, and apply k-means clustering across nine audio features: danceability, energy, loudness, speechiness, acousticness, instrumentalness, liveness, valence, and tempo. The clusters that emerge from this analysis are then used to reason about how Spotify likely groups songs internally when building personalized features like Daily Mixes. The analysis follows an OSEMN workflow, moving from data acquisition through transformation, clustering, visualization, and finally a set of concrete recommendations for how Spotify could improve its recommendation capabilities going forward.

Code

Source 1

For the first source I’m using a Kaggle data set of Spotify track data, loaded from a CSV file hosted on GitHub

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.2.1     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.2     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.2
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

url <- "https://raw.githubusercontent.com/JZunaRepo/Spring27Projects/refs/heads/main/FinaProject/data.csv"

spotify_df <- read.csv(
  file = url
)
glimpse(spotify_df)

## Rows: 114,000
## Columns: 21
## $ X                <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,…
## $ track_id         <chr> "5SuOikwiRyPMVoIQDJUgSV", "4qPNDBW1i3p13qLCt0Ki3A", "…
## $ artists          <chr> "Gen Hoshino", "Ben Woodward", "Ingrid Michaelson;ZAY…
## $ album_name       <chr> "Comedy", "Ghost (Acoustic)", "To Begin Again", "Craz…
## $ track_name       <chr> "Comedy", "Ghost - Acoustic", "To Begin Again", "Can'…
## $ popularity       <int> 73, 55, 57, 71, 82, 58, 74, 80, 74, 56, 74, 69, 52, 6…
## $ duration_ms      <int> 230666, 149610, 210826, 201933, 198853, 214240, 22940…
## $ explicit         <chr> "False", "False", "False", "False", "False", "False",…
## $ danceability     <dbl> 0.676, 0.420, 0.438, 0.266, 0.618, 0.688, 0.407, 0.70…
## $ energy           <dbl> 0.4610, 0.1660, 0.3590, 0.0596, 0.4430, 0.4810, 0.147…
## $ key              <int> 1, 1, 0, 0, 2, 6, 2, 11, 0, 1, 8, 4, 7, 3, 2, 4, 2, 1…
## $ loudness         <dbl> -6.746, -17.235, -9.734, -18.515, -9.681, -8.807, -8.…
## $ mode             <int> 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0,…
## $ speechiness      <dbl> 0.1430, 0.0763, 0.0557, 0.0363, 0.0526, 0.1050, 0.035…
## $ acousticness     <dbl> 0.0322, 0.9240, 0.2100, 0.9050, 0.4690, 0.2890, 0.857…
## $ instrumentalness <dbl> 1.01e-06, 5.56e-06, 0.00e+00, 7.07e-05, 0.00e+00, 0.0…
## $ liveness         <dbl> 0.3580, 0.1010, 0.1170, 0.1320, 0.0829, 0.1890, 0.091…
## $ valence          <dbl> 0.7150, 0.2670, 0.1200, 0.1430, 0.1670, 0.6660, 0.076…
## $ tempo            <dbl> 87.917, 77.489, 76.332, 181.740, 119.949, 98.017, 141…
## $ time_signature   <int> 4, 4, 4, 3, 4, 4, 3, 4, 4, 4, 4, 3, 4, 4, 4, 3, 4, 4,…
## $ track_genre      <chr> "acoustic", "acoustic", "acoustic", "acoustic", "acou…

Exploring the dimensions and first rows of the data

dim(spotify_df)

## [1] 114000     21

head(spotify_df)

##   X               track_id                artists
## 1 0 5SuOikwiRyPMVoIQDJUgSV            Gen Hoshino
## 2 1 4qPNDBW1i3p13qLCt0Ki3A           Ben Woodward
## 3 2 1iJBSr7s7jYXzM8EGcbK5b Ingrid Michaelson;ZAYN
## 4 3 6lfxq3CG4xtTiEg7opyCyx           Kina Grannis
## 5 4 5vjLSffimiIP26QG5WcN2K       Chord Overstreet
## 6 5 01MVOl9KtVTNfFiBU9I7dc           Tyrone Wells
##                                               album_name
## 1                                                 Comedy
## 2                                       Ghost (Acoustic)
## 3                                         To Begin Again
## 4 Crazy Rich Asians (Original Motion Picture Soundtrack)
## 5                                                Hold On
## 6                                   Days I Will Remember
##                   track_name popularity duration_ms explicit danceability
## 1                     Comedy         73      230666    False        0.676
## 2           Ghost - Acoustic         55      149610    False        0.420
## 3             To Begin Again         57      210826    False        0.438
## 4 Can't Help Falling In Love         71      201933    False        0.266
## 5                    Hold On         82      198853    False        0.618
## 6       Days I Will Remember         58      214240    False        0.688
##   energy key loudness mode speechiness acousticness instrumentalness liveness
## 1 0.4610   1   -6.746    0      0.1430       0.0322         1.01e-06   0.3580
## 2 0.1660   1  -17.235    1      0.0763       0.9240         5.56e-06   0.1010
## 3 0.3590   0   -9.734    1      0.0557       0.2100         0.00e+00   0.1170
## 4 0.0596   0  -18.515    1      0.0363       0.9050         7.07e-05   0.1320
## 5 0.4430   2   -9.681    1      0.0526       0.4690         0.00e+00   0.0829
## 6 0.4810   6   -8.807    1      0.1050       0.2890         0.00e+00   0.1890
##   valence   tempo time_signature track_genre
## 1   0.715  87.917              4    acoustic
## 2   0.267  77.489              4    acoustic
## 3   0.120  76.332              4    acoustic
## 4   0.143 181.740              3    acoustic
## 5   0.167 119.949              4    acoustic
## 6   0.666  98.017              4    acoustic

Checking for any missing (NA) values

colSums(is.na(spotify_df))

##                X         track_id          artists       album_name 
##                0                0                0                0 
##       track_name       popularity      duration_ms         explicit 
##                0                0                0                0 
##     danceability           energy              key         loudness 
##                0                0                0                0 
##             mode      speechiness     acousticness instrumentalness 
##                0                0                0                0 
##         liveness          valence            tempo   time_signature 
##                0                0                0                0 
##      track_genre 
##                0
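One caveat with the check above: `is.na()` only counts true `NA` values, and `read.csv()` by default reads an empty text field as an empty string rather than `NA`. The sketch below shows one way to count both kinds of gap. It runs on a small made-up data frame (`toy` and its values are hypothetical stand-ins, not rows from the Kaggle file), so treat it as an illustration rather than a result for `spotify_df`.

```r
# Toy data frame standing in for spotify_df (hypothetical values)
toy <- data.frame(
  track_name = c("Comedy", "", "Hold On"),
  artists    = c("Gen Hoshino", "Ben Woodward", NA),
  popularity = c(73, 55, 82),
  stringsAsFactors = FALSE
)

# NA values per column: this is what colSums(is.na()) reports
na_counts <- colSums(is.na(toy))

# Empty or whitespace-only strings per column (character columns only)
blank_counts <- sapply(toy, function(col) {
  if (is.character(col)) sum(trimws(col) == "", na.rm = TRUE) else 0L
})

print(na_counts)
print(blank_counts)
```

Running the same two counts on `spotify_df` would confirm whether any text column hides empty strings that the NA check cannot see.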

Now checking which genres are in the data set and how many there are. Knowing the genre labels up front will make the clusters easier to interpret later

unique(spotify_df$track_genre)

##   [1] "acoustic"          "afrobeat"          "alt-rock"         
##   [4] "alternative"       "ambient"           "anime"            
##   [7] "black-metal"       "bluegrass"         "blues"            
##  [10] "brazil"            "breakbeat"         "british"          
##  [13] "cantopop"          "chicago-house"     "children"         
##  [16] "chill"             "classical"         "club"             
##  [19] "comedy"            "country"           "dance"            
##  [22] "dancehall"         "death-metal"       "deep-house"       
##  [25] "detroit-techno"    "disco"             "disney"           
##  [28] "drum-and-bass"     "dub"               "dubstep"          
##  [31] "edm"               "electro"           "electronic"       
##  [34] "emo"               "folk"              "forro"            
##  [37] "french"            "funk"              "garage"           
##  [40] "german"            "gospel"            "goth"             
##  [43] "grindcore"         "groove"            "grunge"           
##  [46] "guitar"            "happy"             "hard-rock"        
##  [49] "hardcore"          "hardstyle"         "heavy-metal"      
##  [52] "hip-hop"           "honky-tonk"        "house"            
##  [55] "idm"               "indian"            "indie-pop"        
##  [58] "indie"             "industrial"        "iranian"          
##  [61] "j-dance"           "j-idol"            "j-pop"            
##  [64] "j-rock"            "jazz"              "k-pop"            
##  [67] "kids"              "latin"             "latino"           
##  [70] "malay"             "mandopop"          "metal"            
##  [73] "metalcore"         "minimal-techno"    "mpb"              
##  [76] "new-age"           "opera"             "pagode"           
##  [79] "party"             "piano"             "pop-film"         
##  [82] "pop"               "power-pop"         "progressive-house"
##  [85] "psych-rock"        "punk-rock"         "punk"             
##  [88] "r-n-b"             "reggae"            "reggaeton"        
##  [91] "rock-n-roll"       "rock"              "rockabilly"       
##  [94] "romance"           "sad"               "salsa"            
##  [97] "samba"             "sertanejo"         "show-tunes"       
## [100] "singer-songwriter" "ska"               "sleep"            
## [103] "songwriter"        "soul"              "spanish"          
## [106] "study"             "swedish"           "synth-pop"        
## [109] "tango"             "techno"            "trance"           
## [112] "trip-hop"          "turkish"           "world-music"

Checking the distribution of tracks across genres

spotify_df %>% 
  count(track_genre, sort = TRUE)

##           track_genre    n
## 1            acoustic 1000
## 2            afrobeat 1000
## 3            alt-rock 1000
## 4         alternative 1000
## 5             ambient 1000
## 6               anime 1000
## 7         black-metal 1000
## 8           bluegrass 1000
## 9               blues 1000
## 10             brazil 1000
## 11          breakbeat 1000
## 12            british 1000
## 13           cantopop 1000
## 14      chicago-house 1000
## 15           children 1000
## 16              chill 1000
## 17          classical 1000
## 18               club 1000
## 19             comedy 1000
## 20            country 1000
## 21              dance 1000
## 22          dancehall 1000
## 23        death-metal 1000
## 24         deep-house 1000
## 25     detroit-techno 1000
## 26              disco 1000
## 27             disney 1000
## 28      drum-and-bass 1000
## 29                dub 1000
## 30            dubstep 1000
## 31                edm 1000
## 32            electro 1000
## 33         electronic 1000
## 34                emo 1000
## 35               folk 1000
## 36              forro 1000
## 37             french 1000
## 38               funk 1000
## 39             garage 1000
## 40             german 1000
## 41             gospel 1000
## 42               goth 1000
## 43          grindcore 1000
## 44             groove 1000
## 45             grunge 1000
## 46             guitar 1000
## 47              happy 1000
## 48          hard-rock 1000
## 49           hardcore 1000
## 50          hardstyle 1000
## 51        heavy-metal 1000
## 52            hip-hop 1000
## 53         honky-tonk 1000
## 54              house 1000
## 55                idm 1000
## 56             indian 1000
## 57              indie 1000
## 58          indie-pop 1000
## 59         industrial 1000
## 60            iranian 1000
## 61            j-dance 1000
## 62             j-idol 1000
## 63              j-pop 1000
## 64             j-rock 1000
## 65               jazz 1000
## 66              k-pop 1000
## 67               kids 1000
## 68              latin 1000
## 69             latino 1000
## 70              malay 1000
## 71           mandopop 1000
## 72              metal 1000
## 73          metalcore 1000
## 74     minimal-techno 1000
## 75                mpb 1000
## 76            new-age 1000
## 77              opera 1000
## 78             pagode 1000
## 79              party 1000
## 80              piano 1000
## 81                pop 1000
## 82           pop-film 1000
## 83          power-pop 1000
## 84  progressive-house 1000
## 85         psych-rock 1000
## 86               punk 1000
## 87          punk-rock 1000
## 88              r-n-b 1000
## 89             reggae 1000
## 90          reggaeton 1000
## 91               rock 1000
## 92        rock-n-roll 1000
## 93         rockabilly 1000
## 94            romance 1000
## 95                sad 1000
## 96              salsa 1000
## 97              samba 1000
## 98          sertanejo 1000
## 99         show-tunes 1000
## 100 singer-songwriter 1000
## 101               ska 1000
## 102             sleep 1000
## 103        songwriter 1000
## 104              soul 1000
## 105           spanish 1000
## 106             study 1000
## 107           swedish 1000
## 108         synth-pop 1000
## 109             tango 1000
## 110            techno 1000
## 111            trance 1000
## 112          trip-hop 1000
## 113           turkish 1000
## 114       world-music 1000

Exploring the genres and the number of tracks per genre led me to the decision to work with a subset of the tracks. Running k-means repeatedly on all 114,000 rows is slow, so I’m taking a stratified sample of 20 tracks per genre. This keeps representation balanced across all 114 genres while keeping the dataset at a manageable size (114 × 20 = 2,280 rows)

set.seed(42)

spotify_sample <- spotify_df %>%
  group_by(track_genre) %>%
  slice_sample(n = 20) %>%
  ungroup()

dim(spotify_sample)

## [1] 2280   21
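The row count alone does not prove that every genre contributed exactly 20 tracks, so a quick per-group check is worth having. The sketch below repeats the same `group_by()` plus `slice_sample()` pattern on a made-up three-genre data frame (`toy` is hypothetical) and verifies the balance. As far as I know, `slice_sample()` silently returns fewer rows when a group is smaller than `n`, which is the case this check would catch.

```r
library(dplyr)

set.seed(42)

# Toy stand-in for spotify_df: 3 genres with 50 tracks each (hypothetical)
toy <- data.frame(
  track_id    = 1:150,
  track_genre = rep(c("acoustic", "jazz", "techno"), each = 50)
)

# Same stratified sampling pattern as in the analysis
toy_sample <- toy %>%
  group_by(track_genre) %>%
  slice_sample(n = 20) %>%
  ungroup()

# Every genre should contribute exactly 20 rows, with no track drawn twice
per_genre <- count(toy_sample, track_genre)
print(per_genre)
stopifnot(all(per_genre$n == 20), anyDuplicated(toy_sample$track_id) == 0)
```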

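Next, I select only the nine continuous audio features for clustering, keeping the track name, artist, and genre as identifiers. Spotify computes these features to characterize songs, so they are a reasonable proxy for the signals it uses internally to compare them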

audio_features <- spotify_sample %>%
  select(track_name, artists, track_genre, danceability, energy, 
         loudness, speechiness, acousticness, instrumentalness, 
         liveness, valence, tempo)

glimpse(audio_features)

## Rows: 2,280
## Columns: 12
## $ track_name       <chr> "Nothing Like You and I", "Corinna - From \"The Natch…
## $ artists          <chr> "The Perishers", "Taj Mahal", "Zack Tabudlo", "Howie …
## $ track_genre      <chr> "acoustic", "acoustic", "acoustic", "acoustic", "acou…
## $ danceability     <dbl> 0.598, 0.621, 0.592, 0.636, 0.607, 0.647, 0.560, 0.59…
## $ energy           <dbl> 0.2910, 0.2940, 0.4900, 0.6250, 0.4730, 0.5670, 0.832…
## $ loudness         <dbl> -8.627, -13.468, -8.508, -7.895, -8.555, -8.971, -5.5…
## $ speechiness      <dbl> 0.0328, 0.1100, 0.0263, 0.0277, 0.0340, 0.0328, 0.035…
## $ acousticness     <dbl> 0.46600, 0.55800, 0.39500, 0.22200, 0.71200, 0.03180,…
## $ instrumentalness <dbl> 6.52e-05, 2.54e-05, 0.00e+00, 5.35e-05, 0.00e+00, 0.0…
## $ liveness         <dbl> 0.1200, 0.0946, 0.3330, 0.1190, 0.6600, 0.1330, 0.232…
## $ valence          <dbl> 0.3800, 0.7280, 0.4370, 0.3420, 0.5400, 0.3740, 0.802…
## $ tempo            <dbl> 123.978, 78.599, 99.994, 93.931, 119.698, 125.871, 10…

Before clustering I need to scale all the audio features so that each one contributes equally to the distance calculations. K-means works by measuring how far apart songs are from each other, and without scaling a feature like tempo would dominate simply because its numbers are much larger than something like danceability

audio_scaled <- audio_features %>%
  select(danceability, energy, loudness, speechiness, 
         acousticness, instrumentalness, liveness, valence, tempo) %>%
  scale()

round(colMeans(audio_scaled), 4)

##     danceability           energy         loudness      speechiness 
##                0                0                0                0 
##     acousticness instrumentalness         liveness          valence 
##                0                0                0                0 
##            tempo 
##                0

round(apply(audio_scaled, 2, sd), 4)

##     danceability           energy         loudness      speechiness 
##                1                1                1                1 
##     acousticness instrumentalness         liveness          valence 
##                1                1                1                1 
##            tempo 
##                1
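To make the scaling argument concrete, the sketch below measures how much of the total squared distance each feature accounts for before and after `scale()`. The data are simulated (the uniform ranges for `danceability` and `tempo` are assumptions chosen to resemble the real columns, and `share()` is a helper written for this example), so the exact figures are illustrative only.

```r
set.seed(42)
n <- 200

# Simulated features: danceability bounded in 0-1, tempo in beats per minute
toy <- cbind(
  danceability = runif(n, 0, 1),
  tempo        = runif(n, 60, 180)
)

# Share of the total squared distance from the mean contributed by each column
share <- function(m) {
  centered <- sweep(m, 2, colMeans(m))
  colSums(centered^2) / sum(centered^2)
}

raw_share    <- share(toy)         # tempo accounts for nearly everything
scaled_share <- share(scale(toy))  # each feature contributes equally

print(round(raw_share, 4))
print(round(scaled_share, 4))
```

On unscaled data, k-means would effectively cluster on tempo alone; after scaling, both columns carry the same weight.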

To find a suitable number of clusters I’m running k-means for k values from 1 to 10 and recording the total within-cluster sum of squares (wss) for each one. The wss tells us how tightly packed the clusters are: a lower wss means songs within each cluster are more similar to each other. Because wss keeps shrinking as k grows, the goal is not the lowest value but the point where adding another cluster stops buying much improvement

set.seed(42)

wss <- map_dbl(1:10, function(k) {
  kmeans(audio_scaled, centers = k, nstart = 25)$tot.withinss
})

# Now let's plot the elbow curve
elbow_df <- tibble(k = 1:10, wss = wss)

ggplot(elbow_df, aes(x = k, y = wss)) +
  geom_line(color = "steelblue", linewidth = 1) +
  geom_point(size = 3, color = "steelblue") +
  scale_x_continuous(breaks = 1:10) +
  labs(
    title = "Elbow Plot: Finding the Optimal Number of Clusters",
    x = "Number of Clusters (k)",
    y = "Total Within-Cluster Sum of Squares"
  ) +
  theme_minimal()
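Reading an elbow by eye is subjective, so it can help to put a number on it. The sketch below computes the percentage drop in wss for each additional cluster; the elbow is roughly where that drop collapses. It uses simulated data with three obvious groups (an assumption made so the answer is known in advance), not the Spotify sample, and `pct_drop` is a name introduced here.

```r
set.seed(42)

# Simulated data: three well-separated 2-D blobs (hypothetical)
toy <- rbind(
  matrix(rnorm(100, mean = 0,  sd = 0.5), ncol = 2),
  matrix(rnorm(100, mean = 5,  sd = 0.5), ncol = 2),
  matrix(rnorm(100, mean = 10, sd = 0.5), ncol = 2)
)

# Total within-cluster sum of squares for k = 1..6
wss <- sapply(1:6, function(k) {
  kmeans(toy, centers = k, nstart = 10)$tot.withinss
})

# Percentage reduction in wss gained by moving from k to k + 1
pct_drop <- -diff(wss) / wss[-length(wss)] * 100
names(pct_drop) <- paste0(1:5, "->", 2:6)
print(round(pct_drop, 1))
```

Applied to the `wss` vector from the analysis, the same two lines would show how sharply the gain falls off around the chosen k.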

Based on the elbow plot I’m choosing k=6 as my number of clusters. The curve flattens noticeably after 6 meaning adding more clusters beyond that point doesn’t meaningfully improve how tightly grouped the songs are within each cluster

set.seed(42)

kmeans_result <- kmeans(audio_scaled, centers = 6, nstart = 25)

Now I’ll attach the cluster assignments back to our original dataframe so we can see which song landed in which cluster and interpret what each cluster actually represents musically

audio_clustered <- audio_features %>%
  mutate(cluster = factor(kmeans_result$cluster))
audio_clustered %>%
  count(cluster, sort = FALSE)

## # A tibble: 6 × 2
##   cluster     n
##   <fct>   <int>
## 1 1         834
## 2 2         165
## 3 3          25
## 4 4         505
## 5 5         576
## 6 6         175

Now I want to understand what each cluster actually represents in musical terms. I’ll do this by calculating the average value of each audio feature per cluster. High energy + high danceability might suggest a party/EDM cluster while high acousticness + low energy might suggest a quiet acoustic or classical cluster.

cluster_profiles <- audio_clustered %>%
  group_by(cluster) %>%
  summarise(
    danceability = round(mean(danceability), 3),
    energy = round(mean(energy), 3),
    loudness = round(mean(loudness), 3),
    speechiness = round(mean(speechiness), 3),
    acousticness = round(mean(acousticness), 3),
    instrumentalness = round(mean(instrumentalness), 3),
    liveness = round(mean(liveness), 3),
    valence = round(mean(valence), 3),
    tempo = round(mean(tempo), 3),
    track_count = n()
  )

print(cluster_profiles)

## # A tibble: 6 × 11
##   cluster danceability energy loudness speechiness acousticness instrumentalness
##   <fct>          <dbl>  <dbl>    <dbl>       <dbl>        <dbl>            <dbl>
## 1 1              0.687  0.729    -6.57       0.095        0.191            0.061
## 2 2              0.552  0.757    -6.47       0.083        0.283            0.046
## 3 3              0.585  0.716   -10.6        0.856        0.725            0.014
## 4 4              0.541  0.391   -10.4        0.054        0.659            0.045
## 5 5              0.477  0.802    -6.12       0.084        0.062            0.248
## 6 6              0.366  0.189   -20.6        0.053        0.831            0.769
## # ℹ 4 more variables: liveness <dbl>, valence <dbl>, tempo <dbl>,
## #   track_count <int>
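The nine near-identical `round(mean(...))` lines can be collapsed with `across()`, and the printed table above hides four columns that `print(..., width = Inf)` would show. The sketch below demonstrates both on a four-row made-up data frame (`toy` and its values are hypothetical); it is an equivalent formulation, not a change to the results.

```r
library(dplyr)

# Toy stand-in for audio_clustered with two features (hypothetical values)
toy <- data.frame(
  cluster      = factor(c(1, 1, 2, 2)),
  danceability = c(0.60, 0.80, 0.30, 0.50),
  energy       = c(0.70, 0.90, 0.20, 0.40)
)

# One across() call applies the same summary to every listed feature
profiles <- toy %>%
  group_by(cluster) %>%
  summarise(
    across(c(danceability, energy), ~ round(mean(.x), 3)),
    track_count = n()
  )

# width = Inf prints every column instead of truncating the tibble
print(profiles, width = Inf)
```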

To visualize the cluster profiles I need to reshape the data from wide to long format so ggplot can map each feature as a position on the x axis and color by cluster. This is a standard wide to long transformation using pivot_longer

cluster_long <- cluster_profiles %>%
  select(-track_count, -loudness, -tempo) %>%
  pivot_longer(cols = -cluster,
               names_to = "feature",
               values_to = "value")

ggplot(cluster_long, aes(x = feature, y = value, fill = cluster)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(
    title = "Average Audio Feature Values by Cluster",
    x = "Audio Feature",
    y = "Average Value",
    fill = "Cluster"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

This scatter plot maps each song by its energy and danceability values and colors it by cluster. These two features are among the most human-interpretable axes for music: high energy plus high danceability is party music, while low on both is quiet or ambient. Because the clustering used nine features, some overlap between clusters is expected in this two-dimensional view. The plot serves as a visual check that the clusters separate songs in a meaningful way

ggplot(audio_clustered, aes(x = danceability, y = energy, color = cluster)) +
  geom_point(alpha = 0.4, size = 1.5) +
  labs(
    title = "Song Clusters by Energy and Danceability",
    subtitle = "Each point represents one track, colored by k-means cluster assignment",
    x = "Danceability",
    y = "Energy",
    color = "Cluster"
  ) +
  theme_minimal()

Now that I’ve profiled each cluster I want to attach a human readable label to each one so the analysis tells a clear story. These labels are based on the average feature values we saw in the profile table and the bar chart above.

audio_clustered <- audio_clustered %>%
  mutate(cluster_label = case_when(
    cluster == 1 ~ "Energetic & Danceable (Pop/Dance)",
    cluster == 2 ~ "Moderate Energy (Mainstream)",
    cluster == 3 ~ "Spoken Word / Comedy",
    cluster == 4 ~ "Acoustic & Mellow (Folk/Indie)",
    cluster == 5 ~ "High Intensity (Metal/EDM)",
    cluster == 6 ~ "Calm & Acoustic (Classical/Ambient)"
  ))

audio_clustered %>%
  count(cluster_label, track_genre) %>%
  arrange(cluster_label, desc(n)) %>%
  group_by(cluster_label) %>%
  slice_head(n = 3) %>%
  print(n = 30)

## # A tibble: 18 × 3
## # Groups:   cluster_label [6]
##    cluster_label                       track_genre     n
##    <chr>                               <chr>       <int>
##  1 Acoustic & Mellow (Folk/Indie)      honky-tonk     18
##  2 Acoustic & Mellow (Folk/Indie)      cantopop       15
##  3 Acoustic & Mellow (Folk/Indie)      jazz           15
##  4 Calm & Acoustic (Classical/Ambient) sleep          17
##  5 Calm & Acoustic (Classical/Ambient) new-age        16
##  6 Calm & Acoustic (Classical/Ambient) piano          16
##  7 Energetic & Danceable (Pop/Dance)   latino         19
##  8 Energetic & Danceable (Pop/Dance)   reggaeton      18
##  9 Energetic & Danceable (Pop/Dance)   latin          17
## 10 High Intensity (Metal/EDM)          death-metal    20
## 11 High Intensity (Metal/EDM)          black-metal    19
## 12 High Intensity (Metal/EDM)          grindcore      18
## 13 Moderate Energy (Mainstream)        sertanejo      11
## 14 Moderate Energy (Mainstream)        samba           8
## 15 Moderate Energy (Mainstream)        electro         7
## 16 Spoken Word / Comedy                comedy         18
## 17 Spoken Word / Comedy                grindcore       2
## 18 Spoken Word / Comedy                kids            2

This final plot redoes the scatter plot but now uses our human readable cluster labels instead of numbers. This is the chart I’ll use to support my conclusions in the write up because it connects the data science output back to something a general audience can understand intuitively

ggplot(audio_clustered, aes(x = danceability, y = energy, color = cluster_label)) +
  geom_point(alpha = 0.4, size = 1.5) +
  labs(
    title = "Spotify Song Clusters by Energy and Danceability",
    subtitle = "K-means clustering (k=6) applied to audio features across 114 genres",
    x = "Danceability",
    y = "Energy",
    color = "Cluster"
  ) +
  theme_minimal() +
  theme(legend.position = "bottom",
        legend.text = element_text(size = 7)) +
  guides(color = guide_legend(nrow = 3))

Source 2

For my second data source I’m scraping a web page to satisfy the requirement of having two different source types. I’m using rvest to pull the Wikipedia list of most streamed Spotify songs which gives me a real world popularity ranking that I can join back to my clustered dataset

library(rvest)

## 
## Attaching package: 'rvest'

## The following object is masked from 'package:readr':
## 
##     guess_encoding

url <- "https://en.wikipedia.org/wiki/List_of_most-streamed_songs_on_Spotify"

wiki_page <- read_html(url)

top_streamed <- wiki_page %>%
  html_element("table.wikitable") %>%
  html_table()

glimpse(top_streamed)

## Rows: 101
## Columns: 6
## $ Rank                <chr> "1", "2", "3", "4", "5", "6", "7", "8", "9", "10",…
## $ Song                <chr> "\"Blinding Lights\"", "\"Shape of You\"", "\"Swea…
## $ `Artist(s)`         <chr> "The Weeknd", "Ed Sheeran", "The Neighbourhood", "…
## $ `Streams(billions)` <chr> "5.406", "4.903", "4.591", "4.525", "4.402", "4.31…
## $ `Release date`      <chr> "29 November 2019", "6 January 2017", "3 December …
## $ Ref.                <chr> "[1]", "[2]", "[3]", "[4]", "[5]", "[6]", "[7]", "…

head(top_streamed)

## # A tibble: 6 × 6
##   Rank  Song                `Artist(s)` `Streams(billions)` `Release date` Ref. 
##   <chr> <chr>               <chr>       <chr>               <chr>          <chr>
## 1 1     "\"Blinding Lights… The Weeknd  5.406               29 November 2… [1]  
## 2 2     "\"Shape of You\""  Ed Sheeran  4.903               6 January 2017 [2]  
## 3 3     "\"Sweater Weather… The Neighb… 4.591               3 December 20… [3]  
## 4 4     "\"Starboy\""       The Weeknd… 4.525               21 September … [4]  
## 5 5     "\"As It Was\""     Harry Styl… 4.402               1 April 2022   [5]  
## 6 6     "\"Someone You Lov… Lewis Capa… 4.315               8 November 20… [6]

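First I need to clean up the scraped table column names since they came in with special characters and spaces, which makes them hard to work with in R. The rank and streams columns also arrive as text, so I convert them to numbers. Any row whose rank or stream count is not numeric becomes NA in that step, which is what the coercion warning below reports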

top_streamed <- top_streamed %>%
  rename(
    rank = Rank,
    track_name = Song,
    artists = `Artist(s)`,
    streams_billions = `Streams(billions)`,
    release_date = `Release date`
  ) %>%
  select(rank, track_name, artists, streams_billions, release_date) %>%
  mutate(
    rank = as.integer(rank),
    streams_billions = as.numeric(streams_billions),
    track_name = str_remove_all(track_name, '"')
  )

## Warning: There were 2 warnings in `mutate()`.
## The first warning was:
## ℹ In argument: `rank = as.integer(rank)`.
## Caused by warning:
## ! NAs introduced by coercion
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 1 remaining warning.

glimpse(top_streamed)

## Rows: 101
## Columns: 5
## $ rank             <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16…
## $ track_name       <chr> "Blinding Lights", "Shape of You", "Sweater Weather",…
## $ artists          <chr> "The Weeknd", "Ed Sheeran", "The Neighbourhood", "The…
## $ streams_billions <dbl> 5.406, 4.903, 4.591, 4.525, 4.402, 4.315, 4.235, 4.21…
## $ release_date     <chr> "29 November 2019", "6 January 2017", "3 December 201…
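The coercion warning means at least one of the 101 scraped rows is not a ranked song. The sketch below shows one way to drop such rows after conversion. The toy table and the text of its last row are assumptions (I have not confirmed what the non-numeric row on the Wikipedia page actually contains), so this illustrates the pattern rather than the exact fix.

```r
library(dplyr)

# Toy stand-in for the scraped table; the last row mimics a non-data
# footer line (its text is hypothetical)
toy <- data.frame(
  rank             = c("1", "2", "As of 1 January"),
  streams_billions = c("5.406", "4.903", "As of 1 January"),
  stringsAsFactors = FALSE
)

cleaned <- toy %>%
  mutate(
    # Non-numeric text becomes NA; the warning is expected, so silence it
    rank             = suppressWarnings(as.integer(rank)),
    streams_billions = suppressWarnings(as.numeric(streams_billions))
  ) %>%
  # Keep only rows that carry a real rank
  filter(!is.na(rank))

print(cleaned)
```

Adding the same `filter(!is.na(rank))` step to `top_streamed` would leave only ranked songs in the table.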

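Now I’ll join the scraped streaming data to our clustered dataset to see whether any of our sampled songs appear on the most streamed list and which clusters they belong to. The join matches on track_name only, so a cover or an unrelated song that shares a title with a chart entry will also match, and a title that occurs more than once in the sample produces more than one row. With that caveat, this connects our audio feature clustering to real world streaming success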

clustered_with_streams <- audio_clustered %>%
  inner_join(top_streamed, by = "track_name")

clustered_with_streams %>%
  select(track_name, cluster_label, streams_billions) %>%
  arrange(desc(streams_billions))

## # A tibble: 16 × 3
##    track_name       cluster_label                       streams_billions
##    <chr>            <chr>                                          <dbl>
##  1 Heat Waves       Energetic & Danceable (Pop/Dance)               3.75
##  2 Señorita         Energetic & Danceable (Pop/Dance)               3.31
##  3 Señorita         Energetic & Danceable (Pop/Dance)               3.31
##  4 Watermelon Sugar Calm & Acoustic (Classical/Ambient)             3.28
##  5 Die For You      Energetic & Danceable (Pop/Dance)               3.26
##  6 Wake Me Up       Energetic & Danceable (Pop/Dance)               3.10
##  7 Shallow          Acoustic & Mellow (Folk/Indie)                  3.08
##  8 Without Me       Energetic & Danceable (Pop/Dance)               3.08
##  9 All of Me        Acoustic & Mellow (Folk/Indie)                  3.07
## 10 Beautiful Things Energetic & Danceable (Pop/Dance)               2.93
## 11 The Scientist    Calm & Acoustic (Classical/Ambient)             2.85
## 12 Take On Me       Energetic & Danceable (Pop/Dance)               2.76
## 13 Numb             High Intensity (Metal/EDM)                      2.73
## 14 Save Your Tears  Acoustic & Mellow (Folk/Indie)                  2.71
## 15 Happier          Energetic & Danceable (Pop/Dance)               2.61
## 16 Happier          Energetic & Danceable (Pop/Dance)               2.61
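Because the join above uses the title alone, some of these rows may be covers or different songs with the same name, and two titles appear twice. A stricter variant matches on title plus the first credited artist and then removes duplicates. The sketch below runs on made-up rows: the artist strings, the "Some Cover Band" entry, and the `first_artist()` helper are all hypothetical, and the delimiter pattern is an assumption about how each source separates collaborators (the Kaggle file appears to use `;`).

```r
library(dplyr)

# Toy stand-ins for audio_clustered and top_streamed (hypothetical rows)
clustered <- data.frame(
  track_name = c("Happier", "Happier", "Numb", "Numb"),
  artists    = c("Marshmello;Bastille", "Some Cover Band",
                 "Linkin Park", "Linkin Park"),
  cluster    = c(1, 4, 5, 5),
  stringsAsFactors = FALSE
)
streamed <- data.frame(
  track_name       = c("Happier", "Numb"),
  artists          = c("Marshmello and Bastille", "Linkin Park"),
  streams_billions = c(2.61, 2.73),
  stringsAsFactors = FALSE
)

# Crude key: keep the text before the first separator, lower-cased
first_artist <- function(x) {
  tolower(trimws(sub("(;| and |,| featuring ).*$", "", x)))
}

matched <- clustered %>%
  mutate(artist_key = first_artist(artists)) %>%
  inner_join(
    streamed %>%
      mutate(artist_key = first_artist(artists)) %>%
      select(-artists),
    by = c("track_name", "artist_key")
  ) %>%
  # A track sampled more than once should count only once
  distinct(track_name, artist_key, .keep_all = TRUE)

print(matched)
```

The key is deliberately simple and would mishandle band names that contain " and ", so a real version would need a more careful artist normalization.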

The join returns 16 rows covering 14 distinct titles (Señorita and Happier each appear twice), so I can now visualize which clusters the most popular songs tend to fall into. With so few matches this is a rough sanity check rather than a formal validation, but if Spotify’s most streamed songs cluster together it suggests our audio feature groupings are capturing something real about what makes music resonate with listeners at scale

clustered_with_streams %>%
  group_by(cluster_label) %>%
  summarise(
    song_count = n(),
    avg_streams = round(mean(streams_billions), 3)
  ) %>%
  arrange(desc(song_count)) %>%
  ggplot(aes(x = reorder(cluster_label, song_count), 
             y = song_count, fill = cluster_label)) +
  geom_col(show.legend = FALSE) +
  geom_text(aes(label = paste0(song_count, " songs\n", 
                               avg_streams, "B avg streams")), 
            hjust = -0.1, size = 3) +
  coord_flip() +
  # Extend the axis past the longest bar so its text label is not clipped
  scale_y_continuous(limits = c(0, 14)) +
  labs(
    title = "Most Streamed Spotify Songs by Cluster",
    subtitle = "Where do the world's most streamed tracks fall in our clustering?",
    x = NULL,
    y = "Number of Top Streamed Songs"
  ) +
  theme_minimal()

Conclusion

This analysis set out to understand how Spotify’s recommender system works by reverse engineering an audio feature based approach through k-means clustering. Using a stratified sample of 2,280 tracks drawn equally across 114 genres, we scaled nine audio features and applied k-means clustering with k=6 selected via elbow plot analysis. The resulting clusters mapped well onto recognizable musical categories, with the High Intensity cluster capturing death metal, black metal, and grindcore, the Calm and Acoustic cluster grouping sleep, new-age, and piano tracks, the Spoken Word cluster isolating comedy recordings, and the Energetic and Danceable cluster drawing in latino, reggaeton, and latin.

When we joined our clustered dataset to a scraped list of the world’s most streamed Spotify songs, ten of the sixteen matching rows fell into the Energetic and Danceable cluster with an average of roughly 3.1 billion streams. This is consistent with high danceability combined with high energy being a common sonic profile among mainstream hits. The evidence is limited, though: the join matched on title alone, two titles appear twice, and this cluster is also the largest in the sample (834 of 2,280 tracks), so the result is suggestive rather than conclusive.

This finding still points to ways Spotify could improve its recommender system going forward. Rather than relying solely on collaborative filtering based on what similar users have listened to, Spotify could weight audio feature proximity more heavily when seeding new Daily Mix sessions, particularly for users whose listening history skews toward the Energetic and Danceable cluster. Additionally, the existence of a small but distinct Spoken Word cluster with only 25 tracks suggests that Spotify’s genre taxonomy may conflate audio content types that listeners experience very differently, and separating spoken audio from musical audio earlier in the recommendation pipeline could meaningfully improve the relevance of suggestions for users who mix both content types in their daily listening.