Spotify is one of the most widely used music streaming platforms in the world, serving hundreds of millions of users with personalized listening experiences built on sophisticated recommendation algorithms. Features like Daily Mixes and Discover Weekly are central to what makes Spotify feel personal, as they surface songs that feel tailored to individual taste rather than just popularity. For this assignment, the goal is to analyze Spotify’s recommender system through a scenario design lens, reverse engineer what we can about how it works using publicly available information and data, and then extend that analysis with our own implementation using real track data. Specifically, we pull audio feature data from a Kaggle CSV dataset containing 114,000 tracks spanning 114 genres, supplement it with a scraped Wikipedia table of the world’s most streamed Spotify songs, and apply k-means clustering across nine audio features: danceability, energy, loudness, speechiness, acousticness, instrumentalness, liveness, valence, and tempo. The clusters that emerge from this analysis are then used to reason about how Spotify likely groups songs internally when building personalized features like Daily Mixes. The analysis follows an OSEMN workflow, moving from data acquisition through transformation, clustering, visualization, and finally a set of concrete recommendations for how Spotify could improve its recommendation capabilities going forward.
First, we load a Kaggle dataset of Spotify track data.
library(tidyverse)
## Warning: package 'ggplot2' was built under R version 4.5.2
## Warning: package 'tidyr' was built under R version 4.5.2
## Warning: package 'dplyr' was built under R version 4.5.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.2.1 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.2 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.2
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
url <- "https://raw.githubusercontent.com/JZunaRepo/Spring27Projects/refs/heads/main/FinaProject/data.csv"
spotify_df <- read.csv(
file = url
)
glimpse(spotify_df)
## Rows: 114,000
## Columns: 21
## $ X <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,…
## $ track_id <chr> "5SuOikwiRyPMVoIQDJUgSV", "4qPNDBW1i3p13qLCt0Ki3A", "…
## $ artists <chr> "Gen Hoshino", "Ben Woodward", "Ingrid Michaelson;ZAY…
## $ album_name <chr> "Comedy", "Ghost (Acoustic)", "To Begin Again", "Craz…
## $ track_name <chr> "Comedy", "Ghost - Acoustic", "To Begin Again", "Can'…
## $ popularity <int> 73, 55, 57, 71, 82, 58, 74, 80, 74, 56, 74, 69, 52, 6…
## $ duration_ms <int> 230666, 149610, 210826, 201933, 198853, 214240, 22940…
## $ explicit <chr> "False", "False", "False", "False", "False", "False",…
## $ danceability <dbl> 0.676, 0.420, 0.438, 0.266, 0.618, 0.688, 0.407, 0.70…
## $ energy <dbl> 0.4610, 0.1660, 0.3590, 0.0596, 0.4430, 0.4810, 0.147…
## $ key <int> 1, 1, 0, 0, 2, 6, 2, 11, 0, 1, 8, 4, 7, 3, 2, 4, 2, 1…
## $ loudness <dbl> -6.746, -17.235, -9.734, -18.515, -9.681, -8.807, -8.…
## $ mode <int> 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0,…
## $ speechiness <dbl> 0.1430, 0.0763, 0.0557, 0.0363, 0.0526, 0.1050, 0.035…
## $ acousticness <dbl> 0.0322, 0.9240, 0.2100, 0.9050, 0.4690, 0.2890, 0.857…
## $ instrumentalness <dbl> 1.01e-06, 5.56e-06, 0.00e+00, 7.07e-05, 0.00e+00, 0.0…
## $ liveness <dbl> 0.3580, 0.1010, 0.1170, 0.1320, 0.0829, 0.1890, 0.091…
## $ valence <dbl> 0.7150, 0.2670, 0.1200, 0.1430, 0.1670, 0.6660, 0.076…
## $ tempo <dbl> 87.917, 77.489, 76.332, 181.740, 119.949, 98.017, 141…
## $ time_signature <int> 4, 4, 4, 3, 4, 4, 3, 4, 4, 4, 4, 3, 4, 4, 4, 3, 4, 4,…
## $ track_genre <chr> "acoustic", "acoustic", "acoustic", "acoustic", "acou…
Exploring the dimensions and first rows of the data
dim(spotify_df)
## [1] 114000 21
head(spotify_df)
## X track_id artists
## 1 0 5SuOikwiRyPMVoIQDJUgSV Gen Hoshino
## 2 1 4qPNDBW1i3p13qLCt0Ki3A Ben Woodward
## 3 2 1iJBSr7s7jYXzM8EGcbK5b Ingrid Michaelson;ZAYN
## 4 3 6lfxq3CG4xtTiEg7opyCyx Kina Grannis
## 5 4 5vjLSffimiIP26QG5WcN2K Chord Overstreet
## 6 5 01MVOl9KtVTNfFiBU9I7dc Tyrone Wells
## album_name
## 1 Comedy
## 2 Ghost (Acoustic)
## 3 To Begin Again
## 4 Crazy Rich Asians (Original Motion Picture Soundtrack)
## 5 Hold On
## 6 Days I Will Remember
## track_name popularity duration_ms explicit danceability
## 1 Comedy 73 230666 False 0.676
## 2 Ghost - Acoustic 55 149610 False 0.420
## 3 To Begin Again 57 210826 False 0.438
## 4 Can't Help Falling In Love 71 201933 False 0.266
## 5 Hold On 82 198853 False 0.618
## 6 Days I Will Remember 58 214240 False 0.688
## energy key loudness mode speechiness acousticness instrumentalness liveness
## 1 0.4610 1 -6.746 0 0.1430 0.0322 1.01e-06 0.3580
## 2 0.1660 1 -17.235 1 0.0763 0.9240 5.56e-06 0.1010
## 3 0.3590 0 -9.734 1 0.0557 0.2100 0.00e+00 0.1170
## 4 0.0596 0 -18.515 1 0.0363 0.9050 7.07e-05 0.1320
## 5 0.4430 2 -9.681 1 0.0526 0.4690 0.00e+00 0.0829
## 6 0.4810 6 -8.807 1 0.1050 0.2890 0.00e+00 0.1890
## valence tempo time_signature track_genre
## 1 0.715 87.917 4 acoustic
## 2 0.267 77.489 4 acoustic
## 3 0.120 76.332 4 acoustic
## 4 0.143 181.740 3 acoustic
## 5 0.167 119.949 4 acoustic
## 6 0.666 98.017 4 acoustic
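Checking for missing (NA) values in each column. Note that is.na() does not count empty strings, so this confirms there are no NAs rather than no blank text fields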
colSums(is.na(spotify_df))
## X track_id artists album_name
## 0 0 0 0
## track_name popularity duration_ms explicit
## 0 0 0 0
## danceability energy key loudness
## 0 0 0 0
## mode speechiness acousticness instrumentalness
## 0 0 0 0
## liveness valence tempo time_signature
## 0 0 0 0
## track_genre
## 0
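Because read.csv() leaves empty text fields as "" rather than NA, a blank-string check can complement the NA count above. The following is a minimal sketch on made-up data; `toy_df` and `count_missing_or_blank()` are illustrative names and are not part of the original analysis.

```r
# Hypothetical toy data: `artists` holds one empty string and one NA
toy_df <- data.frame(
  track_name = c("A", "B", "C"),
  artists    = c("X", "", NA),
  energy     = c(0.5, NA, 0.7)
)

# Count values that are NA or, for character columns, empty after trimming
count_missing_or_blank <- function(df) {
  sapply(df, function(col) {
    blank <- if (is.character(col)) trimws(col) == "" else FALSE
    sum(is.na(col) | blank, na.rm = TRUE)
  })
}

missing_counts <- count_missing_or_blank(toy_df)
print(missing_counts)
```

Applied to the real data, the call would be count_missing_or_blank(spotify_df).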
Next, I list the genres in the dataset. Knowing which genres are present, and how many, makes the clusters easier to interpret later
unique(spotify_df$track_genre)
## [1] "acoustic" "afrobeat" "alt-rock"
## [4] "alternative" "ambient" "anime"
## [7] "black-metal" "bluegrass" "blues"
## [10] "brazil" "breakbeat" "british"
## [13] "cantopop" "chicago-house" "children"
## [16] "chill" "classical" "club"
## [19] "comedy" "country" "dance"
## [22] "dancehall" "death-metal" "deep-house"
## [25] "detroit-techno" "disco" "disney"
## [28] "drum-and-bass" "dub" "dubstep"
## [31] "edm" "electro" "electronic"
## [34] "emo" "folk" "forro"
## [37] "french" "funk" "garage"
## [40] "german" "gospel" "goth"
## [43] "grindcore" "groove" "grunge"
## [46] "guitar" "happy" "hard-rock"
## [49] "hardcore" "hardstyle" "heavy-metal"
## [52] "hip-hop" "honky-tonk" "house"
## [55] "idm" "indian" "indie-pop"
## [58] "indie" "industrial" "iranian"
## [61] "j-dance" "j-idol" "j-pop"
## [64] "j-rock" "jazz" "k-pop"
## [67] "kids" "latin" "latino"
## [70] "malay" "mandopop" "metal"
## [73] "metalcore" "minimal-techno" "mpb"
## [76] "new-age" "opera" "pagode"
## [79] "party" "piano" "pop-film"
## [82] "pop" "power-pop" "progressive-house"
## [85] "psych-rock" "punk-rock" "punk"
## [88] "r-n-b" "reggae" "reggaeton"
## [91] "rock-n-roll" "rock" "rockabilly"
## [94] "romance" "sad" "salsa"
## [97] "samba" "sertanejo" "show-tunes"
## [100] "singer-songwriter" "ska" "sleep"
## [103] "songwriter" "soul" "spanish"
## [106] "study" "swedish" "synth-pop"
## [109] "tango" "techno" "trance"
## [112] "trip-hop" "turkish" "world-music"
Checking the distribution of tracks across genres
spotify_df %>%
count(track_genre, sort = TRUE)
## track_genre n
## 1 acoustic 1000
## 2 afrobeat 1000
## 3 alt-rock 1000
## 4 alternative 1000
## 5 ambient 1000
## 6 anime 1000
## 7 black-metal 1000
## 8 bluegrass 1000
## 9 blues 1000
## 10 brazil 1000
## 11 breakbeat 1000
## 12 british 1000
## 13 cantopop 1000
## 14 chicago-house 1000
## 15 children 1000
## 16 chill 1000
## 17 classical 1000
## 18 club 1000
## 19 comedy 1000
## 20 country 1000
## 21 dance 1000
## 22 dancehall 1000
## 23 death-metal 1000
## 24 deep-house 1000
## 25 detroit-techno 1000
## 26 disco 1000
## 27 disney 1000
## 28 drum-and-bass 1000
## 29 dub 1000
## 30 dubstep 1000
## 31 edm 1000
## 32 electro 1000
## 33 electronic 1000
## 34 emo 1000
## 35 folk 1000
## 36 forro 1000
## 37 french 1000
## 38 funk 1000
## 39 garage 1000
## 40 german 1000
## 41 gospel 1000
## 42 goth 1000
## 43 grindcore 1000
## 44 groove 1000
## 45 grunge 1000
## 46 guitar 1000
## 47 happy 1000
## 48 hard-rock 1000
## 49 hardcore 1000
## 50 hardstyle 1000
## 51 heavy-metal 1000
## 52 hip-hop 1000
## 53 honky-tonk 1000
## 54 house 1000
## 55 idm 1000
## 56 indian 1000
## 57 indie 1000
## 58 indie-pop 1000
## 59 industrial 1000
## 60 iranian 1000
## 61 j-dance 1000
## 62 j-idol 1000
## 63 j-pop 1000
## 64 j-rock 1000
## 65 jazz 1000
## 66 k-pop 1000
## 67 kids 1000
## 68 latin 1000
## 69 latino 1000
## 70 malay 1000
## 71 mandopop 1000
## 72 metal 1000
## 73 metalcore 1000
## 74 minimal-techno 1000
## 75 mpb 1000
## 76 new-age 1000
## 77 opera 1000
## 78 pagode 1000
## 79 party 1000
## 80 piano 1000
## 81 pop 1000
## 82 pop-film 1000
## 83 power-pop 1000
## 84 progressive-house 1000
## 85 psych-rock 1000
## 86 punk 1000
## 87 punk-rock 1000
## 88 r-n-b 1000
## 89 reggae 1000
## 90 reggaeton 1000
## 91 rock 1000
## 92 rock-n-roll 1000
## 93 rockabilly 1000
## 94 romance 1000
## 95 sad 1000
## 96 salsa 1000
## 97 samba 1000
## 98 sertanejo 1000
## 99 show-tunes 1000
## 100 singer-songwriter 1000
## 101 ska 1000
## 102 sleep 1000
## 103 songwriter 1000
## 104 soul 1000
## 105 spanish 1000
## 106 study 1000
## 107 swedish 1000
## 108 synth-pop 1000
## 109 tango 1000
## 110 techno 1000
## 111 trance 1000
## 112 trip-hop 1000
## 113 turkish 1000
## 114 world-music 1000
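Since every genre has the same count, the long listing above can be condensed into a single summary row. This is a small sketch on a made-up stand-in for spotify_df; `toy_tracks` and `genre_summary` are illustrative names.

```r
library(dplyr)

# Hypothetical stand-in for spotify_df with three genres of four tracks each
toy_tracks <- data.frame(
  track_genre = rep(c("acoustic", "jazz", "techno"), times = c(4, 4, 4))
)

# One row: how many genres there are and how balanced they are
genre_summary <- toy_tracks %>%
  count(track_genre, name = "n_tracks") %>%
  summarise(
    n_genres   = n(),
    min_tracks = min(n_tracks),
    max_tracks = max(n_tracks)
  )

print(genre_summary)
```

When min_tracks equals max_tracks, the genres are perfectly balanced, which is what the full listing shows for the real data (114 genres with 1,000 tracks each).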
Exploring the genres and the number of tracks per genre led me to work with a subset of the data. To keep the repeated k-means runs fast, I take a stratified sample of 20 tracks per genre (114 genres × 20 tracks = 2,280 tracks). This keeps every genre equally represented while keeping the dataset at a manageable size
set.seed(42)
spotify_sample <- spotify_df %>%
group_by(track_genre) %>%
slice_sample(n = 20) %>%
ungroup()
dim(spotify_sample)
## [1] 2280 21
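Next, I select the nine continuous audio features used for clustering, keeping the track name, artists, and genre as identifiers. These features are computed by Spotify itself, so they are a reasonable proxy for how Spotify might characterize and compare songs internally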
audio_features <- spotify_sample %>%
select(track_name, artists, track_genre, danceability, energy,
loudness, speechiness, acousticness, instrumentalness,
liveness, valence, tempo)
glimpse(audio_features)
## Rows: 2,280
## Columns: 12
## $ track_name <chr> "Nothing Like You and I", "Corinna - From \"The Natch…
## $ artists <chr> "The Perishers", "Taj Mahal", "Zack Tabudlo", "Howie …
## $ track_genre <chr> "acoustic", "acoustic", "acoustic", "acoustic", "acou…
## $ danceability <dbl> 0.598, 0.621, 0.592, 0.636, 0.607, 0.647, 0.560, 0.59…
## $ energy <dbl> 0.2910, 0.2940, 0.4900, 0.6250, 0.4730, 0.5670, 0.832…
## $ loudness <dbl> -8.627, -13.468, -8.508, -7.895, -8.555, -8.971, -5.5…
## $ speechiness <dbl> 0.0328, 0.1100, 0.0263, 0.0277, 0.0340, 0.0328, 0.035…
## $ acousticness <dbl> 0.46600, 0.55800, 0.39500, 0.22200, 0.71200, 0.03180,…
## $ instrumentalness <dbl> 6.52e-05, 2.54e-05, 0.00e+00, 5.35e-05, 0.00e+00, 0.0…
## $ liveness <dbl> 0.1200, 0.0946, 0.3330, 0.1190, 0.6600, 0.1330, 0.232…
## $ valence <dbl> 0.3800, 0.7280, 0.4370, 0.3420, 0.5400, 0.3740, 0.802…
## $ tempo <dbl> 123.978, 78.599, 99.994, 93.931, 119.698, 125.871, 10…
Before clustering, I need to scale the audio features so that each one contributes equally to the distance calculations. K-means works by measuring how far apart songs are from each other, and without scaling, a feature like tempo (often above 100) would dominate simply because its numbers are much larger than those of a 0-to-1 feature like danceability
audio_scaled <- audio_features %>%
select(danceability, energy, loudness, speechiness,
acousticness, instrumentalness, liveness, valence, tempo) %>%
scale()
round(colMeans(audio_scaled), 4)
## danceability energy loudness speechiness
## 0 0 0 0
## acousticness instrumentalness liveness valence
## 0 0 0 0
## tempo
## 0
round(apply(audio_scaled, 2, sd), 4)
## danceability energy loudness speechiness
## 1 1 1 1
## acousticness instrumentalness liveness valence
## 1 1 1 1
## tempo
## 1
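To make the scaling step concrete, here is a small worked sketch on made-up values (the `toy` data frame is hypothetical). It computes z-scores by hand, checks them against scale(), and compares the distance between two tracks before and after scaling.

```r
# Hypothetical toy values for two features on very different scales
toy <- data.frame(
  tempo        = c(80, 120, 160),
  danceability = c(0.2, 0.5, 0.8)
)

# z-score by hand: subtract the column mean, divide by the column sd
manual_z <- sapply(toy, function(x) (x - mean(x)) / sd(x))

# scale() applies the same centering and scaling column by column
auto_z <- scale(toy)

# Euclidean distance between track 1 and track 3, before and after scaling
raw_dist    <- sqrt(sum((toy[1, ] - toy[3, ])^2))
scaled_dist <- sqrt(sum((auto_z[1, ] - auto_z[3, ])^2))

print(c(raw = raw_dist, scaled = scaled_dist))
```

Before scaling, the distance is driven almost entirely by tempo. After scaling, both features contribute equally.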
To choose the number of clusters, I run k-means for k values from 1 to 10 and record the total within-cluster sum of squares (WSS) for each one. The WSS measures how tightly packed the clusters are: a lower WSS means songs within each cluster are more similar to each other. Because WSS always decreases as k grows, the goal is not the lowest value but the "elbow", the point after which adding clusters yields only small improvements
set.seed(42)
wss <- map_dbl(1:10, function(k) {
kmeans(audio_scaled, centers = k, nstart = 25)$tot.withinss
})
# Now let's plot the elbow curve
elbow_df <- tibble(k = 1:10, wss = wss)
ggplot(elbow_df, aes(x = k, y = wss)) +
geom_line(color = "steelblue", linewidth = 1) +
geom_point(size = 3, color = "steelblue") +
scale_x_continuous(breaks = 1:10) +
labs(
title = "Elbow Plot: Finding the Optimal Number of Clusters",
x = "Number of Clusters (k)",
y = "Total Within-Cluster Sum of Squares"
) +
theme_minimal()
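The elbow can be hard to read by eye, so a second criterion is sometimes used as a cross-check. The sketch below computes the average silhouette width for several values of k on made-up data. It assumes the cluster package, which is not loaded in the original analysis, and `toy_points` is hypothetical.

```r
library(cluster)  # provides silhouette(); an assumption, not used in the original analysis

set.seed(42)
# Hypothetical toy data: three well-separated 2-D blobs of 30 points each
blob_centers <- rbind(c(0, 0), c(10, 10), c(20, 0))
toy_points <- do.call(rbind, lapply(1:3, function(i) {
  cbind(rnorm(30, blob_centers[i, 1], 0.5), rnorm(30, blob_centers[i, 2], 0.5))
}))

# Average silhouette width for k = 2..6 (it is undefined for k = 1)
d <- dist(toy_points)
avg_sil <- sapply(2:6, function(k) {
  km <- kmeans(toy_points, centers = k, nstart = 25)
  mean(silhouette(km$cluster, d)[, "sil_width"])
})
names(avg_sil) <- 2:6

# The k with the highest average silhouette width
best_k <- as.integer(names(which.max(avg_sil)))
print(round(avg_sil, 3))
print(best_k)
```

On the real data, audio_scaled would replace toy_points. The silhouette choice may or may not agree with the elbow, and a disagreement is itself useful information about how clear the cluster structure is.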
Based on the elbow plot, I’m choosing k = 6 as the number of clusters. The curve flattens noticeably after 6, meaning that adding more clusters beyond that point doesn’t meaningfully improve how tightly grouped the songs are within each cluster
set.seed(42)
kmeans_result <- kmeans(audio_scaled, centers = 6, nstart = 25)
Now I’ll attach the cluster assignments to the original data frame so we can see which cluster each song landed in and interpret what each cluster represents musically
audio_clustered <- audio_features %>%
mutate(cluster = factor(kmeans_result$cluster))
audio_clustered %>%
count(cluster, sort = FALSE)
## # A tibble: 6 × 2
## cluster n
## <fct> <int>
## 1 1 834
## 2 2 165
## 3 3 25
## 4 4 505
## 5 5 576
## 6 6 175
Now I want to understand what each cluster actually represents in musical terms. I’ll do this by calculating the average value of each audio feature per cluster. High energy + high danceability might suggest a party/EDM cluster while high acousticness + low energy might suggest a quiet acoustic or classical cluster.
cluster_profiles <- audio_clustered %>%
group_by(cluster) %>%
summarise(
danceability = round(mean(danceability), 3),
energy = round(mean(energy), 3),
loudness = round(mean(loudness), 3),
speechiness = round(mean(speechiness), 3),
acousticness = round(mean(acousticness), 3),
instrumentalness = round(mean(instrumentalness), 3),
liveness = round(mean(liveness), 3),
valence = round(mean(valence), 3),
tempo = round(mean(tempo), 3),
track_count = n()
)
print(cluster_profiles)
## # A tibble: 6 × 11
## cluster danceability energy loudness speechiness acousticness instrumentalness
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 0.687 0.729 -6.57 0.095 0.191 0.061
## 2 2 0.552 0.757 -6.47 0.083 0.283 0.046
## 3 3 0.585 0.716 -10.6 0.856 0.725 0.014
## 4 4 0.541 0.391 -10.4 0.054 0.659 0.045
## 5 5 0.477 0.802 -6.12 0.084 0.062 0.248
## 6 6 0.366 0.189 -20.6 0.053 0.831 0.769
## # ℹ 4 more variables: liveness <dbl>, valence <dbl>, tempo <dbl>,
## # track_count <int>
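The printed tibble above truncates four columns (liveness, valence, tempo, and track_count), which are needed to interpret the clusters fully. The sketch below shows one way to compute the same kind of profile more compactly with across() and to print every column. The `toy_clustered` data frame is a made-up stand-in for audio_clustered.

```r
library(dplyr)

# Hypothetical toy data standing in for audio_clustered
toy_clustered <- data.frame(
  cluster      = factor(c(1, 1, 2, 2)),
  danceability = c(0.6, 0.8, 0.2, 0.4),
  energy       = c(0.7, 0.9, 0.1, 0.3)
)

# across() applies the same rounded mean to every numeric column;
# track_count is defined afterwards so it is not averaged itself
toy_profiles <- toy_clustered %>%
  group_by(cluster) %>%
  summarise(
    across(where(is.numeric), ~ round(mean(.x), 3)),
    track_count = n()
  )

# as.data.frame() prints every column instead of truncating to the console width
print(as.data.frame(toy_profiles))
```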
To visualize the cluster profiles, I reshape the data from wide to long format with pivot_longer so that ggplot can place each feature on the x axis and color the bars by cluster. I leave loudness and tempo out of the chart because they are not on the 0-to-1 scale of the other features and would dwarf them
cluster_long <- cluster_profiles %>%
select(-track_count, -loudness, -tempo) %>%
pivot_longer(cols = -cluster,
names_to = "feature",
values_to = "value")
ggplot(cluster_long, aes(x = feature, y = value, fill = cluster)) +
geom_bar(stat = "identity", position = "dodge") +
labs(
title = "Average Audio Feature Values by Cluster",
x = "Audio Feature",
y = "Average Value",
fill = "Cluster"
) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
This scatter plot maps each song by its energy and danceability values and colors the points by cluster. These two features are among the most human-interpretable axes for music: high energy plus high danceability is party music, while low values on both suggest quiet or ambient tracks. Because the clustering used nine features and the plot shows only two, some overlap between clusters is expected, but the plot still helps us check visually whether the clusters separate songs in a meaningful way
ggplot(audio_clustered, aes(x = danceability, y = energy, color = cluster)) +
geom_point(alpha = 0.4, size = 1.5) +
labs(
title = "Song Clusters by Energy and Danceability",
subtitle = "Each point represents one track, colored by k-means cluster assignment",
x = "Danceability",
y = "Energy",
color = "Cluster"
) +
theme_minimal()
Now that I’ve profiled each cluster I want to attach a human readable label to each one so the analysis tells a clear story. These labels are based on the average feature values we saw in the profile table and the bar chart above.
audio_clustered <- audio_clustered %>%
mutate(cluster_label = case_when(
cluster == 1 ~ "Energetic & Danceable (Pop/Dance)",
cluster == 2 ~ "Moderate Energy (Mainstream)",
cluster == 3 ~ "Spoken Word / Comedy",
cluster == 4 ~ "Acoustic & Mellow (Folk/Indie)",
cluster == 5 ~ "High Intensity (Metal/EDM)",
cluster == 6 ~ "Calm & Acoustic (Classical/Ambient)"
))
audio_clustered %>%
count(cluster_label, track_genre) %>%
arrange(cluster_label, desc(n)) %>%
group_by(cluster_label) %>%
slice_head(n = 3) %>%
print(n = 30)
## # A tibble: 18 × 3
## # Groups: cluster_label [6]
## cluster_label track_genre n
## <chr> <chr> <int>
## 1 Acoustic & Mellow (Folk/Indie) honky-tonk 18
## 2 Acoustic & Mellow (Folk/Indie) cantopop 15
## 3 Acoustic & Mellow (Folk/Indie) jazz 15
## 4 Calm & Acoustic (Classical/Ambient) sleep 17
## 5 Calm & Acoustic (Classical/Ambient) new-age 16
## 6 Calm & Acoustic (Classical/Ambient) piano 16
## 7 Energetic & Danceable (Pop/Dance) latino 19
## 8 Energetic & Danceable (Pop/Dance) reggaeton 18
## 9 Energetic & Danceable (Pop/Dance) latin 17
## 10 High Intensity (Metal/EDM) death-metal 20
## 11 High Intensity (Metal/EDM) black-metal 19
## 12 High Intensity (Metal/EDM) grindcore 18
## 13 Moderate Energy (Mainstream) sertanejo 11
## 14 Moderate Energy (Mainstream) samba 8
## 15 Moderate Energy (Mainstream) electro 7
## 16 Spoken Word / Comedy comedy 18
## 17 Spoken Word / Comedy grindcore 2
## 18 Spoken Word / Comedy kids 2
This final plot redraws the scatter plot with the human-readable cluster labels instead of cluster numbers. This is the chart I’ll use to support my conclusions in the write-up, because it connects the clustering output to something a general audience can understand intuitively
ggplot(audio_clustered, aes(x = danceability, y = energy, color = cluster_label)) +
geom_point(alpha = 0.4, size = 1.5) +
labs(
title = "Spotify Song Clusters by Energy and Danceability",
subtitle = "K-means clustering (k=6) applied to audio features across 114 genres",
x = "Danceability",
y = "Energy",
color = "Cluster"
) +
theme_minimal() +
theme(legend.position = "bottom",
legend.text = element_text(size = 7)) +
guides(color = guide_legend(nrow = 3))
For my second data source, I scrape a web page, which satisfies the requirement of using two different source types. I use rvest to pull the Wikipedia list of most-streamed Spotify songs, which gives me a real-world popularity ranking that I can join to my clustered dataset
library(rvest)
##
## Attaching package: 'rvest'
## The following object is masked from 'package:readr':
##
## guess_encoding
url <- "https://en.wikipedia.org/wiki/List_of_most-streamed_songs_on_Spotify"
wiki_page <- read_html(url)
top_streamed <- wiki_page %>%
html_element("table.wikitable") %>%
html_table()
glimpse(top_streamed)
## Rows: 101
## Columns: 6
## $ Rank <chr> "1", "2", "3", "4", "5", "6", "7", "8", "9", "10",…
## $ Song <chr> "\"Blinding Lights\"", "\"Shape of You\"", "\"Swea…
## $ `Artist(s)` <chr> "The Weeknd", "Ed Sheeran", "The Neighbourhood", "…
## $ `Streams(billions)` <chr> "5.406", "4.903", "4.591", "4.525", "4.402", "4.31…
## $ `Release date` <chr> "29 November 2019", "6 January 2017", "3 December …
## $ Ref. <chr> "[1]", "[2]", "[3]", "[4]", "[5]", "[6]", "[7]", "…
head(top_streamed)
## # A tibble: 6 × 6
## Rank Song `Artist(s)` `Streams(billions)` `Release date` Ref.
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 1 "\"Blinding Lights… The Weeknd 5.406 29 November 2… [1]
## 2 2 "\"Shape of You\"" Ed Sheeran 4.903 6 January 2017 [2]
## 3 3 "\"Sweater Weather… The Neighb… 4.591 3 December 20… [3]
## 4 4 "\"Starboy\"" The Weeknd… 4.525 21 September … [4]
## 5 5 "\"As It Was\"" Harry Styl… 4.402 1 April 2022 [5]
## 6 6 "\"Someone You Lov… Lewis Capa… 4.315 8 November 20… [6]
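First, I clean up the scraped table: the column names contain spaces and special characters that are awkward to work with in R, the numeric columns were read as text, and the song titles are wrapped in quotation marks. The coercion warnings below mean that at least one value could not be parsed as a number and became NA. Since the table has 101 rows, this is most likely a single non-song row, such as a footer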
top_streamed <- top_streamed %>%
rename(
rank = Rank,
track_name = Song,
artists = `Artist(s)`,
streams_billions = `Streams(billions)`,
release_date = `Release date`
) %>%
select(rank, track_name, artists, streams_billions, release_date) %>%
mutate(
rank = as.integer(rank),
streams_billions = as.numeric(streams_billions),
track_name = str_remove_all(track_name, '"')
)
## Warning: There were 2 warnings in `mutate()`.
## The first warning was:
## ℹ In argument: `rank = as.integer(rank)`.
## Caused by warning:
## ! NAs introduced by coercion
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 1 remaining warning.
glimpse(top_streamed)
## Rows: 101
## Columns: 5
## $ rank <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16…
## $ track_name <chr> "Blinding Lights", "Shape of You", "Sweater Weather",…
## $ artists <chr> "The Weeknd", "Ed Sheeran", "The Neighbourhood", "The…
## $ streams_billions <dbl> 5.406, 4.903, 4.591, 4.525, 4.402, 4.315, 4.235, 4.21…
## $ release_date <chr> "29 November 2019", "6 January 2017", "3 December 201…
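If the NA values do come from a non-song row, that row can be dropped right after the type conversion. The sketch below illustrates the idea on made-up data. The footer text in `toy_scraped` is hypothetical and was not verified against the actual Wikipedia table.

```r
library(dplyr)

# Hypothetical toy version of the scraped table with a trailing footer row
toy_scraped <- data.frame(
  rank             = c("1", "2", "As of 1 January"),
  streams_billions = c("5.406", "4.903", "As of 1 January")
)

toy_clean <- toy_scraped %>%
  mutate(
    # suppressWarnings() silences the expected coercion warning for the footer row
    rank             = suppressWarnings(as.integer(rank)),
    streams_billions = suppressWarnings(as.numeric(streams_billions))
  ) %>%
  filter(!is.na(rank))  # keep only rows that carry a numeric rank

print(toy_clean)
```

It is worth inspecting the dropped rows first (for example with filter(is.na(rank))) to confirm that no real songs are lost.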
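Now I’ll join the scraped streaming data to our clustered dataset to see whether any of our sampled songs appear on the most-streamed list and which clusters they belong to. This connects our audio feature clustering to real-world streaming success. One caveat: the join matches on track_name alone, so a cover or an unrelated song that shares a title with a chart hit will also match, and a title that occurs more than once in the sample will produce repeated rows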
clustered_with_streams <- audio_clustered %>%
inner_join(top_streamed, by = "track_name")
clustered_with_streams %>%
select(track_name, cluster_label, streams_billions) %>%
arrange(desc(streams_billions))
## # A tibble: 16 × 3
## track_name cluster_label streams_billions
## <chr> <chr> <dbl>
## 1 Heat Waves Energetic & Danceable (Pop/Dance) 3.75
## 2 Señorita Energetic & Danceable (Pop/Dance) 3.31
## 3 Señorita Energetic & Danceable (Pop/Dance) 3.31
## 4 Watermelon Sugar Calm & Acoustic (Classical/Ambient) 3.28
## 5 Die For You Energetic & Danceable (Pop/Dance) 3.26
## 6 Wake Me Up Energetic & Danceable (Pop/Dance) 3.10
## 7 Shallow Acoustic & Mellow (Folk/Indie) 3.08
## 8 Without Me Energetic & Danceable (Pop/Dance) 3.08
## 9 All of Me Acoustic & Mellow (Folk/Indie) 3.07
## 10 Beautiful Things Energetic & Danceable (Pop/Dance) 2.93
## 11 The Scientist Calm & Acoustic (Classical/Ambient) 2.85
## 12 Take On Me Energetic & Danceable (Pop/Dance) 2.76
## 13 Numb High Intensity (Metal/EDM) 2.73
## 14 Save Your Tears Acoustic & Mellow (Folk/Indie) 2.71
## 15 Happier Energetic & Danceable (Pop/Dance) 2.61
## 16 Happier Energetic & Danceable (Pop/Dance) 2.61
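Because the join above matches on title alone, a stricter match can help filter out covers and same-title songs. The sketch below normalizes titles and then keeps a row only when the dataset's first listed artist appears in the chart's artist text. All data here is made up for illustration: "Lullaby Covers Band" is a hypothetical artist, and `toy_tracks`, `toy_top`, and `norm_title()` are illustrative names.

```r
library(dplyr)
library(stringr)

# Hypothetical toy rows: one true match and one same-title cover
toy_tracks <- data.frame(
  track_name = c("Watermelon Sugar", "Watermelon Sugar"),
  artists    = c("Harry Styles", "Lullaby Covers Band")
)
toy_top <- data.frame(
  track_name       = "Watermelon Sugar",
  top_artists      = "Harry Styles",
  streams_billions = 3.28
)

# Normalize titles so case and stray whitespace do not block a match
norm_title <- function(x) str_squish(str_to_lower(x))

toy_matched <- toy_tracks %>%
  mutate(title_key = norm_title(track_name)) %>%
  inner_join(
    toy_top %>%
      mutate(title_key = norm_title(track_name)) %>%
      select(-track_name),
    by = "title_key"
  ) %>%
  # the Kaggle data separates multiple artists with ";", so take the first one
  mutate(primary_artist = str_split_fixed(artists, ";", 2)[, 1]) %>%
  # keep the row only if that artist appears in the chart's artist text
  filter(str_detect(str_to_lower(top_artists),
                    fixed(str_to_lower(primary_artist)))) %>%
  # one row per title, in case the same track was sampled more than once
  distinct(title_key, .keep_all = TRUE)

print(toy_matched)
```

This is a heuristic, not an exact identifier match. Artist spellings can differ between the two sources, so a stricter filter trades false matches for possible missed ones.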
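The join returns 16 rows, covering 14 distinct titles because Señorita and Happier each appear twice. I can now visualize which clusters these popular songs tend to fall into. This is a rough real-world check on the clustering rather than a validation: if the most-streamed songs concentrate in one cluster, it suggests the audio feature groupings capture something about what resonates with listeners at scale. Some matches should be read with caution, though. Watermelon Sugar landing in the Calm & Acoustic cluster, for example, hints that at least one matched row may be a different recording that shares the title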
clustered_with_streams %>%
group_by(cluster_label) %>%
summarise(
song_count = n(),
avg_streams = round(mean(streams_billions), 3)
) %>%
arrange(desc(song_count)) %>%
ggplot(aes(x = reorder(cluster_label, song_count),
y = song_count, fill = cluster_label)) +
geom_col(show.legend = FALSE) +
geom_text(aes(label = paste0(song_count, " songs\n",
avg_streams, "B avg streams")),
hjust = -0.1, size = 3) +
coord_flip() +
scale_y_continuous(limits = c(0, 14), breaks = seq(0, 10, 2)) + # headroom so the text labels past the longest bar are not clipped
labs(
title = "Most Streamed Spotify Songs by Cluster",
subtitle = "Where do the world's most streamed tracks fall in our clustering?",
x = NULL,
y = "Number of Top Streamed Songs"
) +
theme_minimal()
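When reading the chart, it helps to compare each cluster's share of matched songs with its share of the whole sample, since a large cluster will collect more matches by chance alone. The sketch below uses the counts reported earlier (cluster sizes from the k-means output, and 10 of 16 matched rows in cluster 1). The binomial test is an illustrative addition, not part of the original analysis.

```r
# Counts taken from the cluster size table and the join output above
cluster_sizes <- c(834, 165, 25, 505, 576, 175)  # clusters 1 to 6

# Share of cluster 1 in the full sample versus among the matched rows
sample_share  <- cluster_sizes[1] / sum(cluster_sizes)
matched_share <- 10 / 16

print(round(c(sample_share = sample_share, matched_share = matched_share), 3))

# Exact binomial test of 10 out of 16 against the sample base rate.
# Caveat: the 16 rows include repeated titles and title-only matches,
# so they are not independent draws and the p-value is only indicative.
bt <- binom.test(x = 10, n = 16, p = sample_share)
print(bt$p.value)
```

Cluster 1 holds roughly 37% of the sample but about 63% of the matched rows, so it is over-represented among the top-streamed matches. With only 16 rows, that gap should be treated as suggestive.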
This analysis set out to understand how Spotify’s recommender system works by reverse engineering an audio-feature-based approach through k-means clustering. Using a stratified sample of 2,280 tracks drawn equally across 114 genres, we scaled nine audio features and applied k-means clustering with k=6 selected via elbow plot analysis. The resulting clusters mapped well onto recognizable musical categories, with the High Intensity cluster capturing death metal and black metal, the Calm and Acoustic cluster grouping sleep, new-age, and piano tracks, the Spoken Word cluster isolating comedy recordings, and the Energetic and Danceable cluster drawing in latino, reggaeton, and latin.
When we joined our clustered dataset to a scraped list of the world’s most streamed Spotify songs, ten of the sixteen matched rows fell into the Energetic and Danceable cluster, with an average of about 3.1 billion streams. That cluster is also the largest in the sample (834 of 2,280 tracks, about 37%), and the matches were made on title alone and include repeated titles, so the result is suggestive rather than conclusive. Even so, the cluster’s share of matched rows (about 63%) is well above its share of the sample, which suggests that high danceability combined with high energy is a common profile among mainstream streaming hits.
This finding has implications for how Spotify could improve its recommender system going forward. Rather than relying solely on collaborative filtering based on what similar users have listened to, Spotify could weight audio feature proximity more heavily when seeding new Daily Mix sessions, particularly for users whose listening history skews toward the Energetic and Danceable cluster. Additionally, the existence of a small but distinct Spoken Word cluster with only 25 tracks suggests that Spotify’s genre taxonomy may conflate audio content types that listeners experience very differently, and separating spoken audio from musical audio earlier in the recommendation pipeline could meaningfully improve the relevance of suggestions for users who mix both content types in their daily listening.