Spotify has become the most popular and widely used music streaming platform today, with approximately 345 million monthly active users. It offers a wide collection of songs, genres and artists from around the globe for listeners to enjoy. In this report, we analyse songs on Spotify and try to cluster them based on their audio features.
library(readr)
spotify <- read_csv("SpotifyFeatures.csv")
head(spotify)
library(dplyr)
glimpse(spotify)
## Rows: 232,725
## Columns: 18
## $ genre <chr> "Movie", "Movie", "Movie", "Movie", "Movie", "Movie",~
## $ artist_name <chr> "Henri Salvador", "Martin & les fées", "Joseph Willia~
## $ track_name <chr> "C'est beau de faire un Show", "Perdu d'avance (par G~
## $ track_id <chr> "0BRjO6ga9RKCKjfDqeFgWV", "0BjC1NfoEOOusryehmNudP", "~
## $ popularity <dbl> 0, 1, 3, 0, 4, 0, 2, 15, 0, 10, 0, 2, 4, 3, 0, 0, 0, ~
## $ acousticness <dbl> 0.61100, 0.24600, 0.95200, 0.70300, 0.95000, 0.74900,~
## $ danceability <dbl> 0.389, 0.590, 0.663, 0.240, 0.331, 0.578, 0.703, 0.41~
## $ duration_ms <dbl> 99373, 137373, 170267, 152427, 82625, 160627, 212293,~
## $ energy <dbl> 0.9100, 0.7370, 0.1310, 0.3260, 0.2250, 0.0948, 0.270~
## $ instrumentalness <dbl> 0.00e+00, 0.00e+00, 0.00e+00, 0.00e+00, 1.23e-01, 0.0~
## $ key <chr> "C#", "F#", "C", "C#", "F", "C#", "C#", "F#", "C", "G~
## $ liveness <dbl> 0.3460, 0.1510, 0.1030, 0.0985, 0.2020, 0.1070, 0.105~
## $ loudness <dbl> -1.828, -5.559, -13.879, -12.178, -21.150, -14.970, -~
## $ mode <chr> "Major", "Minor", "Minor", "Major", "Major", "Major",~
## $ speechiness <dbl> 0.0525, 0.0868, 0.0362, 0.0395, 0.0456, 0.1430, 0.953~
## $ tempo <dbl> 166.969, 174.003, 99.488, 171.758, 140.576, 87.479, 8~
## $ time_signature <chr> "4/4", "4/4", "5/4", "4/4", "4/4", "4/4", "4/4", "4/4~
## $ valence <dbl> 0.8140, 0.8160, 0.3680, 0.2270, 0.3900, 0.3580, 0.533~
summary(spotify)
## genre artist_name track_name track_id
## Length:232725 Length:232725 Length:232725 Length:232725
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## popularity acousticness danceability duration_ms
## Min. : 0.00 Min. :0.0000 Min. :0.0569 Min. : 15387
## 1st Qu.: 29.00 1st Qu.:0.0376 1st Qu.:0.4350 1st Qu.: 182857
## Median : 43.00 Median :0.2320 Median :0.5710 Median : 220427
## Mean : 41.13 Mean :0.3686 Mean :0.5544 Mean : 235122
## 3rd Qu.: 55.00 3rd Qu.:0.7220 3rd Qu.:0.6920 3rd Qu.: 265768
## Max. :100.00 Max. :0.9960 Max. :0.9890 Max. :5552917
## energy instrumentalness key liveness
## Min. :2.03e-05 Min. :0.0000000 Length:232725 Min. :0.00967
## 1st Qu.:3.85e-01 1st Qu.:0.0000000 Class :character 1st Qu.:0.09740
## Median :6.05e-01 Median :0.0000443 Mode :character Median :0.12800
## Mean :5.71e-01 Mean :0.1483012 Mean :0.21501
## 3rd Qu.:7.87e-01 3rd Qu.:0.0358000 3rd Qu.:0.26400
## Max. :9.99e-01 Max. :0.9990000 Max. :1.00000
## loudness mode speechiness tempo
## Min. :-52.457 Length:232725 Min. :0.0222 Min. : 30.38
## 1st Qu.:-11.771 Class :character 1st Qu.:0.0367 1st Qu.: 92.96
## Median : -7.762 Mode :character Median :0.0501 Median :115.78
## Mean : -9.570 Mean :0.1208 Mean :117.67
## 3rd Qu.: -5.501 3rd Qu.:0.1050 3rd Qu.:139.05
## Max. : 3.744 Max. :0.9670 Max. :242.90
## time_signature valence
## Length:232725 Min. :0.0000
## Class :character 1st Qu.:0.2370
## Mode :character Median :0.4440
## Mean :0.4549
## 3rd Qu.:0.6600
## Max. :1.0000
Check for missing values.
anyNA(spotify)
## [1] FALSE
There are no missing values in our dataset.
Drop the track_id column, then convert the genre, mode, key and time_signature columns to factors.
library(dplyr)
spotify <- spotify %>%
select(-track_id) %>%
mutate_at(c("genre", "mode", "key","time_signature"), as.factor)
spotify
We define popularity as a binary variable with a threshold of 57. Tracks with a popularity of 57 or higher are classified as “popular” and encoded as 1, while tracks that score below 57 are labelled 0.
# songs with a popularity score of 57 or higher are labelled "popular" ("1")
# songs that score below 57 are labelled "0"
spotify <- spotify %>%
mutate(popularity.conv = if_else(popularity >= 57, "1", "0"))
spotify
Filter the data to keep only the popular songs.
# keep only the popular songs
popular <- spotify %>% filter(popularity.conv == "1")
popular
Count how often each genre appears in our list of popular songs to find the most popular genres.
library(ggplot2)
# most popular genre
pop.genre <- popular %>%
count(genre) %>%
rename("total" = "n") %>%
arrange(desc(total))
pop.genre %>%
head(10) %>%
ggplot(aes(y = reorder(genre, total), x = total)) +
geom_bar(aes(fill = total), stat = "identity") +
scale_fill_gradient(low = "#F9CCDA", high = "#5A3DDA") +
labs(title = "Top 10 most popular genres",
y = "genre",
x = "total of popular songs") +
theme_minimal()
Most common musical key in popular songs.
pop.key <- popular %>%
count(key) %>%
rename("total" = "n") %>%
arrange(desc(total))
# visualise it
pop.key %>%
ggplot(aes(x = key, y = total)) +
geom_bar(aes(fill = total), stat = "identity") +
scale_fill_gradient(low = "#F9CCDA", high = "#5A3DDA") +
labs(title = "Most common key to use in popular songs",
x = "key",
y = "total of songs") +
theme_minimal() +
theme(legend.position = "none")
An overwhelming proportion of popular songs use a 4/4 time signature, with somewhat more than 4,500 songs.
Most common mode to use in popular songs.
pop.mode <- popular %>%
count(mode) %>%
rename("total" = "n") %>%
arrange(desc(total))
# visualise it
pop.mode %>%
ggplot(aes(x = mode, y = total)) +
geom_bar(aes(fill = total), stat = "identity") +
scale_fill_gradient(low = "#F9CCDA", high = "#5A3DDA") +
labs(title = "Most common mode to use in popular songs",
x = "mode",
y = "total of songs") +
theme_minimal() +
theme(legend.position = "none")
Next, we check the correlations between the numerical variables. If some variables are strongly correlated, PCA (Principal Component Analysis) can be used to reduce the dimensionality of our Spotify dataset.
# correlations between numerical variables
library(GGally)
## Warning: package 'GGally' was built under R version 4.1.2
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
ggcorr(spotify, label=T, label_size = 2.9, hjust = 1)
## Warning in ggcorr(spotify, label = T, label_size = 2.9, hjust = 1): data in
## column(s) 'genre', 'artist_name', 'track_name', 'key', 'mode', 'time_signature',
## 'popularity.conv' are not numeric and were ignored
Several variables are indeed moderately or strongly correlated with each other, which means we can apply PCA to reduce the dimensions of our Spotify dataset.
Positively associated variables:
* Loudness & Energy (strong)
* Valence & Danceability (moderate)
* Speechiness & Liveness (moderate)
* Loudness & Danceability (moderate)
Negatively associated variables:
* Energy & Acousticness (strong)
* Loudness & Acousticness (strong)
* Loudness & Instrumentalness (moderate)
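To illustrate why correlated variables make PCA worthwhile, here is a minimal sketch on simulated data rather than the Spotify dataset. The variables `energy`, `loud` and `noise` are hypothetical stand-ins: two of them are strongly correlated by construction, so most of the variance should be captured by fewer components than there are variables.

```r
# Simulated stand-ins for audio features (hypothetical data, not the Spotify set)
set.seed(1)
n <- 500
energy <- rnorm(n)
loud   <- 0.9 * energy + rnorm(n, sd = 0.3)  # strongly tied to energy
noise  <- rnorm(n)                           # unrelated feature
toy <- data.frame(energy, loud, noise)

# Correlation matrix: energy and loud should be strongly correlated
round(cor(toy), 2)

# PCA on standardised variables
pca <- prcomp(toy, scale. = TRUE)

# Share of variance explained by each component, and the running total
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
round(cumsum(var_explained), 2)
```

If the first two components already account for nearly all of the variance, the third dimension adds little and can be dropped.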
When it comes to clustering, one of the most widely used algorithms is K-means. K-means works best with numerical variables, and since we are only interested in clustering the songs based on their audio features, K-means seems an appropriate choice for our clustering problem.
For data preprocessing, there are several things that need to be done. First, we change the popularity.conv column to a factor.
library(tidyverse)
spotify_clean <- spotify %>%
mutate(popularity.conv = as.factor(popularity.conv))
head(spotify_clean)
Non-numerical variables in the spotify_clean dataset.
cat.var <- spotify_clean %>%
select_if(negate(is.numeric))
head(cat.var)
Check the range of the numerical variables.
spotify_clean %>%
select(where(is.numeric),
-popularity) %>%
summary()
## acousticness danceability duration_ms energy
## Min. :0.0000 Min. :0.0569 Min. : 15387 Min. :2.03e-05
## 1st Qu.:0.0376 1st Qu.:0.4350 1st Qu.: 182857 1st Qu.:3.85e-01
## Median :0.2320 Median :0.5710 Median : 220427 Median :6.05e-01
## Mean :0.3686 Mean :0.5544 Mean : 235122 Mean :5.71e-01
## 3rd Qu.:0.7220 3rd Qu.:0.6920 3rd Qu.: 265768 3rd Qu.:7.87e-01
## Max. :0.9960 Max. :0.9890 Max. :5552917 Max. :9.99e-01
## instrumentalness liveness loudness speechiness
## Min. :0.0000000 Min. :0.00967 Min. :-52.457 Min. :0.0222
## 1st Qu.:0.0000000 1st Qu.:0.09740 1st Qu.:-11.771 1st Qu.:0.0367
## Median :0.0000443 Median :0.12800 Median : -7.762 Median :0.0501
## Mean :0.1483012 Mean :0.21501 Mean : -9.570 Mean :0.1208
## 3rd Qu.:0.0358000 3rd Qu.:0.26400 3rd Qu.: -5.501 3rd Qu.:0.1050
## Max. :0.9990000 Max. :1.00000 Max. : 3.744 Max. :0.9670
## tempo valence
## Min. : 30.38 Min. :0.0000
## 1st Qu.: 92.96 1st Qu.:0.2370
## Median :115.78 Median :0.4440
## Mean :117.67 Mean :0.4549
## 3rd Qu.:139.05 3rd Qu.:0.6600
## Max. :242.90 Max. :1.0000
K-means uses Euclidean distance to measure the similarity between observations, so we need to scale the numerical variables before running k-means clustering. This is because the variables have very different ranges: duration_ms, for example, is measured in the hundreds of thousands while most other features lie between 0 and 1.
# scaling
num.var <- spotify_clean %>%
select(where(is.numeric),
-popularity) %>%
scale()
head(num.var)
## acousticness danceability duration_ms energy instrumentalness
## [1,] 0.6833748 -0.8909329 -1.1413655 1.2869052 -0.48981747
## [2,] -0.3454664 0.1919933 -0.8218657 0.6302479 -0.48981747
## [3,] 1.6445663 0.5852948 -0.5452965 -1.6699502 -0.48981747
## [4,] 0.9426992 -1.6936990 -0.6952933 -0.9297874 -0.48981747
## [5,] 1.6389288 -1.2034190 -1.2821808 -1.3131538 -0.08356631
## [6,] 1.0723614 0.1273410 -0.6263486 -1.8073548 -0.48981747
## liveness loudness speechiness tempo valence
## [1,] 0.66065975 1.2907007 -0.3679692 1.5956039 1.3807413
## [2,] -0.32283477 0.6686811 -0.1830817 1.8232495 1.3884316
## [3,] -0.56492573 -0.7184009 -0.4558311 -0.5883245 -0.3342114
## [4,] -0.58762176 -0.4348159 -0.4380431 1.7505932 -0.8763826
## [5,] -0.06561313 -1.9305971 -0.4051623 0.7414313 -0.2496173
## [6,] -0.54475148 -0.9002886 0.1198533 -0.9769791 -0.3726633
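The effect of scaling on Euclidean distance can be sketched with a few made-up tracks. The values below are hypothetical and only borrow the column names `duration_ms` and `energy` from the dataset. Before scaling, the distance is driven almost entirely by `duration_ms`. After scaling, both features contribute on a comparable footing.

```r
# Four hypothetical tracks: duration in milliseconds, energy on a 0-1 scale
tracks <- data.frame(
  duration_ms = c(200000, 230000, 201000, 300000),
  energy      = c(0.10, 0.12, 0.95, 0.50)
)
rownames(tracks) <- c("A", "B", "C", "D")

# Raw distances: duration_ms dominates, so A looks closest to C
raw_d <- as.matrix(dist(tracks))
round(raw_d, 2)

# Distances after z-scoring each column with scale()
scaled_d <- as.matrix(dist(scale(tracks)))
round(scaled_d, 2)
```

On the raw values, track A is nearest to C because their durations are almost equal, even though their energy differs greatly. On the scaled values, A is nearest to B, whose energy is almost the same.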
After scaling the numerical variables, we reduce the number of observations in our dataset by randomly selecting the songs that we keep for further analysis. The main purpose of this is to shorten the computation time in later stages, particularly when we try to find the optimum number of clusters k.
RNGkind(sample.kind = "Rounding")
set.seed(205)
reduction <- sample(x = nrow(spotify_clean), size = nrow(spotify_clean)*0.015)
spotify_keep <- spotify_clean[reduction,]
reduction2 <- sample(x = nrow(spotify_keep), size = nrow(spotify_keep)*0.2)
spotify_keep2 <- spotify_keep[reduction2,]
K-means works best with numerical data, so we drop the categorical (factor) columns from spotify_keep2 and use this dataset to find the optimum k for our clustering.
spotify_keep2 <- spotify_keep2 %>%
select(-where(is.factor))
spotify_keep2
Select the numeric variables and drop the “popularity” column.
spotify_num <- spotify_keep2 %>%
select(where(is.numeric),
-popularity) %>%
scale()
head(spotify_num)
## acousticness danceability duration_ms energy instrumentalness
## [1,] 0.4900276 -1.2386902 -0.5672322 -1.5987966 2.55478985
## [2,] 1.7695260 -0.5039828 -0.4813279 -1.6398613 2.54505027
## [3,] 1.2327750 -2.2905374 -0.7884015 -1.8265189 2.31454681
## [4,] 0.9049217 -1.0272636 1.5254906 -1.1881498 -0.49687337
## [5,] 1.8362572 -1.3972601 1.7463485 -1.8776631 -0.06191139
## [6,] -0.9874303 1.1292875 -0.2625976 0.8538846 -0.49594616
## liveness loudness speechiness tempo valence
## [1,] -0.4707458 -1.379297138 -0.4860337 -0.7353178 -1.6833010
## [2,] -0.5499596 -1.365667027 -0.4724891 1.2324887 -0.4178244
## [3,] -0.5861009 -2.074765223 -0.4599478 -0.9548168 -1.6778631
## [4,] -0.6420456 0.003161768 -0.4413866 -0.8316688 -1.0587195
## [5,] -0.4905493 -1.494654416 -0.3857032 -0.6414298 -1.6122199
## [6,] -0.7232397 0.871333329 0.5127564 1.2620443 1.2951134
Choosing the optimum k for the spotify_num dataset.
RNGkind(sample.kind = "Rounding")
set.seed(123)
library(factoextra)
fviz_nbclust(x = spotify_num,
FUNcluster = kmeans,
method = "wss")
The elbow plot above suggests that k = 3 is a reasonable choice. Let’s take 3 as our optimum k and apply it to the k-means clustering.
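The curve behind an elbow plot can also be computed by hand, which makes it clearer what is being compared. The sketch below uses simulated data with three well-separated groups rather than the Spotify sample. The choices of `nstart = 10` and the range `1:8` are assumptions made for illustration.

```r
# Simulated data with three well-separated groups (hypothetical)
set.seed(42)
toy <- rbind(
  matrix(rnorm(100, mean = 0), ncol = 2),
  matrix(rnorm(100, mean = 6), ncol = 2),
  cbind(rnorm(50, mean = 0), rnorm(50, mean = 6))
)

# Total within-cluster sum of squares for k = 1..8
wss <- sapply(1:8, function(k) {
  kmeans(toy, centers = k, nstart = 10)$tot.withinss
})

# The "elbow" is the k after which the curve stops dropping sharply
plot(1:8, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

For this toy data the drop from k = 2 to k = 3 is large and the drop from k = 3 to k = 4 is small, which is the pattern to look for in the plot.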
# clustering with optimum k
RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(100)
spotify_k <- kmeans(spotify_num, 3)
spotify_k
## K-means clustering with 3 clusters of sizes 156, 504, 38
##
## Cluster means:
## acousticness danceability duration_ms energy instrumentalness liveness
## 1 1.2654487 -0.975911340 0.23950268 -1.3787957 0.9931063 -0.30279430
## 2 -0.4875179 0.301986643 -0.04078901 0.4065285 -0.2699248 -0.09661905
## 3 1.2710271 0.001076343 -0.44223043 0.2684674 -0.4969072 2.52452396
## loudness speechiness tempo valence
## 1 -1.3092436 -0.4378847 -0.4190837 -0.9301845
## 2 0.4423671 -0.1361938 0.1936066 0.2994787
## 3 -0.4923951 3.6039917 -0.8473860 -0.1533811
##
## Clustering vector:
## [1] 1 1 1 1 1 2 2 1 2 2 1 2 3 2 2 3 2 2 2 2 2 2 2 1 2 1 2 2 2 2 2 1 1 2 1 2 2
## [38] 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 3 2 2 1 2 1 2 2 2 1 2 2 2 1 1 3 3 2 1 1 2
## [75] 2 2 2 1 2 2 1 1 2 2 2 2 2 2 1 1 2 2 2 2 2 2 1 3 2 2 2 2 2 3 3 1 2 1 2 2 1
## [112] 2 1 2 3 1 2 2 2 2 2 2 2 1 2 2 2 2 1 2 2 1 2 2 2 2 2 1 2 1 2 3 2 2 2 2 3 2
## [149] 2 2 2 2 2 1 2 1 2 2 1 2 2 2 2 2 1 2 2 2 2 2 2 2 2 1 1 2 2 2 2 2 1 2 2 3 2
## [186] 1 1 1 2 3 1 2 1 2 2 3 2 1 2 2 2 2 2 2 2 2 2 2 1 1 2 2 1 2 1 1 2 2 2 2 2 2
## [223] 2 2 1 2 2 2 2 2 2 1 2 1 2 2 2 2 1 2 2 2 2 2 1 2 2 1 1 2 1 2 1 2 2 3 2 3 2
## [260] 1 2 2 3 1 2 1 2 2 2 2 1 2 2 1 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 2 3 2
## [297] 1 2 2 2 1 2 1 2 3 1 2 1 2 2 2 2 2 2 2 2 2 1 2 1 2 3 2 2 2 2 1 2 1 1 2 1 3
## [334] 2 1 1 2 1 1 1 2 2 2 2 2 2 2 1 2 3 1 2 2 3 2 1 2 2 2 1 2 2 2 2 2 2 2 2 2 2
## [371] 1 2 2 2 2 2 2 1 2 2 3 2 1 1 1 2 1 2 2 2 2 2 2 1 2 2 2 2 2 2 2 3 1 2 1 2 2
## [408] 2 2 2 2 2 1 2 2 2 2 1 2 2 2 1 2 2 1 2 2 3 1 2 2 2 2 2 2 2 1 1 2 2 2 2 2 2
## [445] 1 2 1 2 2 2 2 2 2 2 2 2 1 2 2 1 2 2 2 2 2 2 2 2 2 2 2 3 2 1 2 1 2 2 2 2 1
## [482] 2 1 2 2 2 2 2 2 1 1 2 2 2 2 2 2 2 2 1 2 1 1 2 2 1 1 2 2 1 2 1 1 2 2 2 2 2
## [519] 2 2 2 2 3 2 2 2 1 2 2 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 3 1
## [556] 2 1 1 2 2 2 2 1 3 2 2 2 2 2 1 1 2 2 2 1 2 2 1 2 1 2 2 2 2 1 2 2 2 2 1 1 2
## [593] 2 2 2 3 1 1 2 2 2 1 2 2 2 2 1 2 2 3 2 1 2 2 2 2 3 2 1 2 2 2 2 1 2 2 1 2 2
## [630] 1 2 2 2 3 2 2 2 2 2 2 1 2 1 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 1 2 2 2 2 2 2 2
## [667] 2 3 2 2 2 2 2 2 2 1 2 2 2 2 2 2 1 2 2 2 2 3 2 2 2 2 1 1 2 2 2 2
##
## Within cluster sum of squares by cluster:
## [1] 1313.8981 2723.2050 283.6731
## (between_SS / total_SS = 38.0 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
Summary from the clustering results:
spotify_keep2$cluster <- as.factor(spotify_k$cluster)
spotify_keep2
spotify_keep2 %>%
select(c(artist_name, track_name, cluster))
Frédéric Chopin’s Nocturnes, Op. 9: No. 2 in E-Flat Major falls in cluster 1, which is characterised by high acousticness and instrumentalness, a comparatively longer duration, and the lowest danceability, speechiness and valence. “Best Part (feat. Daniel Caesar)” by H.E.R. is in the same cluster.
Meanwhile, ’97 Bonnie & Clyde by Eminem falls in cluster 2, which is characterised by high danceability, loudness, energy and valence.
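A cluster profile like the ones described above can be read off by averaging each feature within each cluster. The following is a minimal sketch on simulated data. It borrows the column names `danceability` and `energy`, but the values and the choice of two clusters are hypothetical.

```r
# Two simulated groups of tracks (hypothetical values)
set.seed(7)
toy <- data.frame(
  danceability = c(rnorm(30, 0.3, 0.05), rnorm(30, 0.7, 0.05)),
  energy       = c(rnorm(30, 0.2, 0.05), rnorm(30, 0.8, 0.05))
)

# Cluster on scaled features, then attach the labels to the original rows
km <- kmeans(scale(toy), centers = 2, nstart = 10)
toy$cluster <- as.factor(km$cluster)

# Mean of each feature per cluster, in the original (unscaled) units
profile <- aggregate(cbind(danceability, energy) ~ cluster,
                     data = toy, FUN = mean)
profile
```

Reporting the means in the original units, rather than as scaled centres, tends to make the clusters easier to describe in words.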
fviz_cluster(object = spotify_k,
data = spotify_num)