Overview

Spotify is one of the most popular and widely used music streaming platforms today, with approximately 345 million monthly active users. It offers a wide collection of songs, genres and artists from around the globe. In this report, we analyse songs on Spotify and cluster them based on their audio features.

Import file

library(readr)

spotify <- read_csv("SpotifyFeatures.csv")
head(spotify)
library(dplyr)
glimpse(spotify)
## Rows: 232,725
## Columns: 18
## $ genre            <chr> "Movie", "Movie", "Movie", "Movie", "Movie", "Movie",~
## $ artist_name      <chr> "Henri Salvador", "Martin & les fées", "Joseph Willia~
## $ track_name       <chr> "C'est beau de faire un Show", "Perdu d'avance (par G~
## $ track_id         <chr> "0BRjO6ga9RKCKjfDqeFgWV", "0BjC1NfoEOOusryehmNudP", "~
## $ popularity       <dbl> 0, 1, 3, 0, 4, 0, 2, 15, 0, 10, 0, 2, 4, 3, 0, 0, 0, ~
## $ acousticness     <dbl> 0.61100, 0.24600, 0.95200, 0.70300, 0.95000, 0.74900,~
## $ danceability     <dbl> 0.389, 0.590, 0.663, 0.240, 0.331, 0.578, 0.703, 0.41~
## $ duration_ms      <dbl> 99373, 137373, 170267, 152427, 82625, 160627, 212293,~
## $ energy           <dbl> 0.9100, 0.7370, 0.1310, 0.3260, 0.2250, 0.0948, 0.270~
## $ instrumentalness <dbl> 0.00e+00, 0.00e+00, 0.00e+00, 0.00e+00, 1.23e-01, 0.0~
## $ key              <chr> "C#", "F#", "C", "C#", "F", "C#", "C#", "F#", "C", "G~
## $ liveness         <dbl> 0.3460, 0.1510, 0.1030, 0.0985, 0.2020, 0.1070, 0.105~
## $ loudness         <dbl> -1.828, -5.559, -13.879, -12.178, -21.150, -14.970, -~
## $ mode             <chr> "Major", "Minor", "Minor", "Major", "Major", "Major",~
## $ speechiness      <dbl> 0.0525, 0.0868, 0.0362, 0.0395, 0.0456, 0.1430, 0.953~
## $ tempo            <dbl> 166.969, 174.003, 99.488, 171.758, 140.576, 87.479, 8~
## $ time_signature   <chr> "4/4", "4/4", "5/4", "4/4", "4/4", "4/4", "4/4", "4/4~
## $ valence          <dbl> 0.8140, 0.8160, 0.3680, 0.2270, 0.3900, 0.3580, 0.533~
summary(spotify)
##     genre           artist_name         track_name          track_id        
##  Length:232725      Length:232725      Length:232725      Length:232725     
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##    popularity      acousticness     danceability     duration_ms     
##  Min.   :  0.00   Min.   :0.0000   Min.   :0.0569   Min.   :  15387  
##  1st Qu.: 29.00   1st Qu.:0.0376   1st Qu.:0.4350   1st Qu.: 182857  
##  Median : 43.00   Median :0.2320   Median :0.5710   Median : 220427  
##  Mean   : 41.13   Mean   :0.3686   Mean   :0.5544   Mean   : 235122  
##  3rd Qu.: 55.00   3rd Qu.:0.7220   3rd Qu.:0.6920   3rd Qu.: 265768  
##  Max.   :100.00   Max.   :0.9960   Max.   :0.9890   Max.   :5552917  
##      energy         instrumentalness        key               liveness      
##  Min.   :2.03e-05   Min.   :0.0000000   Length:232725      Min.   :0.00967  
##  1st Qu.:3.85e-01   1st Qu.:0.0000000   Class :character   1st Qu.:0.09740  
##  Median :6.05e-01   Median :0.0000443   Mode  :character   Median :0.12800  
##  Mean   :5.71e-01   Mean   :0.1483012                      Mean   :0.21501  
##  3rd Qu.:7.87e-01   3rd Qu.:0.0358000                      3rd Qu.:0.26400  
##  Max.   :9.99e-01   Max.   :0.9990000                      Max.   :1.00000  
##     loudness           mode            speechiness         tempo       
##  Min.   :-52.457   Length:232725      Min.   :0.0222   Min.   : 30.38  
##  1st Qu.:-11.771   Class :character   1st Qu.:0.0367   1st Qu.: 92.96  
##  Median : -7.762   Mode  :character   Median :0.0501   Median :115.78  
##  Mean   : -9.570                      Mean   :0.1208   Mean   :117.67  
##  3rd Qu.: -5.501                      3rd Qu.:0.1050   3rd Qu.:139.05  
##  Max.   :  3.744                      Max.   :0.9670   Max.   :242.90  
##  time_signature        valence      
##  Length:232725      Min.   :0.0000  
##  Class :character   1st Qu.:0.2370  
##  Mode  :character   Median :0.4440  
##                     Mean   :0.4549  
##                     3rd Qu.:0.6600  
##                     Max.   :1.0000

Check for missing values

anyNA(spotify)
## [1] FALSE

There are no missing values in our dataset.

Data Wrangling

Drop the track_id column, then convert the genre, mode, key and time_signature columns to factors.

library(dplyr)

spotify <- spotify %>% 
  select(-track_id) %>% 
  mutate_at(c("genre", "mode", "key","time_signature"), as.factor)

spotify

Exploratory Data Analysis

Feature engineering

We define popularity as a binary variable. Tracks with a popularity score of 57 or higher are classified as “popular” and encoded as 1, while tracks that score below 57 are labelled 0.

# songs with a popularity score of 57 or higher are labelled "popular" (1)
# songs that score below 57 are labelled "0"

spotify <- spotify %>%
  mutate(popularity.conv = if_else(popularity >= 57, "1", "0"))

spotify

Filter the data to keep only the popular songs.

# filter data with all popular songs
popular <- spotify %>% filter(popularity.conv == "1")
popular

Count how often each genre appears in our list of popular songs to find the most popular genres.

library(ggplot2)

# most popular genre
pop.genre <- popular %>% 
  count(genre) %>% 
  rename("total" = "n") %>% 
  arrange(desc(total))

pop.genre %>% 
  head(10) %>%
  ggplot(aes(y = reorder(genre, total), x = total)) +
  geom_bar(aes(fill = total), stat = "identity") +
  scale_fill_gradient(low = "#F9CCDA", high = "#5A3DDA") +
  labs(title = "Top 10 most popular genres",
       y = "genre",
       x = "total of popular songs") +
  theme_minimal()

# (low = "#F9CCDA", high = "#5A3DDA")

Most common musical keys in popular songs.

pop.key <- popular %>% 
  count(key) %>% 
  rename("total" = "n") %>% 
  arrange(desc(total))

# visualized it
pop.key %>% 
  ggplot(aes(x = key, y = total)) +
  geom_bar(aes(fill = total), stat = "identity") +
  scale_fill_gradient(low = "#F9CCDA", high = "#5A3DDA") +
  labs(title = "Most common key to use in popular songs",
       x = "key",
       y = "total of songs") +
  theme_minimal() +
  theme(legend.position = "none")

An overwhelming proportion of popular songs use a 4/4 time signature: more than roughly 4,500 songs.

Most common mode in popular songs.

pop.mode <- popular %>% 
  count(mode) %>% 
  rename("total" = "n") %>% 
  arrange(desc(total))

# visualized it
pop.mode %>% 
  ggplot(aes(x = mode, y = total)) +
  geom_bar(aes(fill = total), stat = "identity") +
  scale_fill_gradient(low = "#F9CCDA", high = "#5A3DDA") +
  labs(title = "Most common mode to use in popular songs",
       x = "mode",
       y = "total of songs") +
  theme_minimal() +
  theme(legend.position = "none")

Next, we’ll check the correlations between the numerical variables. If there are strong correlations between variables, we will be able to use PCA (Principal Component Analysis) to reduce the dimensionality of our Spotify dataset.

# correlations between numerical variables
library(GGally)
## Warning: package 'GGally' was built under R version 4.1.2
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
ggcorr(spotify, label=T, label_size = 2.9, hjust = 1)
## Warning in ggcorr(spotify, label = T, label_size = 2.9, hjust = 1): data in
## column(s) 'genre', 'artist_name', 'track_name', 'key', 'mode', 'time_signature',
## 'popularity.conv' are not numeric and were ignored

Several variables are indeed moderately or strongly correlated with each other, which means we can apply PCA to reduce the dimensionality of our Spotify dataset.

Positively associated variables:

  • Loudness & Energy (strong)
  • Valence & Danceability (moderate)
  • Speechiness & Liveness (moderate)
  • Loudness & Danceability (moderate)

Negatively associated variables:

  • Energy & Acousticness (strong)
  • Loudness & Acousticness (strong)
  • Loudness & Instrumentalness (moderate)
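As a rough illustration of why correlated features make PCA worthwhile, the sketch below runs base R's prcomp() on a small synthetic dataset. The variables loud, energy and noise are invented for illustration and are not taken from the Spotify data. When two variables are strongly correlated, the first principal component should capture most of their shared variance.

```r
set.seed(1)
n <- 200

# Synthetic stand-ins (assumptions, not Spotify columns)
loud   <- rnorm(n)
energy <- loud + rnorm(n, sd = 0.3)  # strongly correlated with loud
noise  <- rnorm(n)                   # unrelated to the other two
toy <- data.frame(loud, energy, noise)

# PCA on centred and scaled variables
pca <- prcomp(toy, center = TRUE, scale. = TRUE)

# Proportion of total variance explained by each component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
round(var_explained, 2)
```

If the first component explains well over a third of the variance in a three-variable dataset, the variables share structure that PCA can compress.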

Data preprocessing

When it comes to clustering, one of the most widely used algorithms is K-means. K-means works with numerical variables, and since we’re only interested in clustering the songs based on their audio features, it is an appropriate choice for our clustering problem.

For data preprocessing, there are several things that need to be done. First, we will change the popularity.conv column to a factor type.

library(tidyverse)

spotify_clean <- spotify %>% 
  mutate(popularity.conv = as.factor(popularity.conv))

head(spotify_clean)

Non-numerical variables in spotify_clean dataset.

cat.var <- spotify_clean %>%
  select_if(negate(is.numeric))

head(cat.var)

Check range for numerical variables.

spotify_clean %>%
  select(where(is.numeric),
         -popularity) %>% 
  summary()
##   acousticness     danceability     duration_ms          energy        
##  Min.   :0.0000   Min.   :0.0569   Min.   :  15387   Min.   :2.03e-05  
##  1st Qu.:0.0376   1st Qu.:0.4350   1st Qu.: 182857   1st Qu.:3.85e-01  
##  Median :0.2320   Median :0.5710   Median : 220427   Median :6.05e-01  
##  Mean   :0.3686   Mean   :0.5544   Mean   : 235122   Mean   :5.71e-01  
##  3rd Qu.:0.7220   3rd Qu.:0.6920   3rd Qu.: 265768   3rd Qu.:7.87e-01  
##  Max.   :0.9960   Max.   :0.9890   Max.   :5552917   Max.   :9.99e-01  
##  instrumentalness       liveness          loudness        speechiness    
##  Min.   :0.0000000   Min.   :0.00967   Min.   :-52.457   Min.   :0.0222  
##  1st Qu.:0.0000000   1st Qu.:0.09740   1st Qu.:-11.771   1st Qu.:0.0367  
##  Median :0.0000443   Median :0.12800   Median : -7.762   Median :0.0501  
##  Mean   :0.1483012   Mean   :0.21501   Mean   : -9.570   Mean   :0.1208  
##  3rd Qu.:0.0358000   3rd Qu.:0.26400   3rd Qu.: -5.501   3rd Qu.:0.1050  
##  Max.   :0.9990000   Max.   :1.00000   Max.   :  3.744   Max.   :0.9670  
##      tempo           valence      
##  Min.   : 30.38   Min.   :0.0000  
##  1st Qu.: 92.96   1st Qu.:0.2370  
##  Median :115.78   Median :0.4440  
##  Mean   :117.67   Mean   :0.4549  
##  3rd Qu.:139.05   3rd Qu.:0.6600  
##  Max.   :242.90   Max.   :1.0000

K-means uses Euclidean distance to measure the similarity between observations, so we need to scale the numerical variables before running k-means clustering. This is because the variables are on very different ranges: duration_ms runs from the tens of thousands into the millions, while most other features lie between 0 and 1.

# scaling
num.var <- spotify_clean %>%
  select(where(is.numeric),
         -popularity) %>% 
  scale()

head(num.var)
##      acousticness danceability duration_ms     energy instrumentalness
## [1,]    0.6833748   -0.8909329  -1.1413655  1.2869052      -0.48981747
## [2,]   -0.3454664    0.1919933  -0.8218657  0.6302479      -0.48981747
## [3,]    1.6445663    0.5852948  -0.5452965 -1.6699502      -0.48981747
## [4,]    0.9426992   -1.6936990  -0.6952933 -0.9297874      -0.48981747
## [5,]    1.6389288   -1.2034190  -1.2821808 -1.3131538      -0.08356631
## [6,]    1.0723614    0.1273410  -0.6263486 -1.8073548      -0.48981747
##         liveness   loudness speechiness      tempo    valence
## [1,]  0.66065975  1.2907007  -0.3679692  1.5956039  1.3807413
## [2,] -0.32283477  0.6686811  -0.1830817  1.8232495  1.3884316
## [3,] -0.56492573 -0.7184009  -0.4558311 -0.5883245 -0.3342114
## [4,] -0.58762176 -0.4348159  -0.4380431  1.7505932 -0.8763826
## [5,] -0.06561313 -1.9305971  -0.4051623  0.7414313 -0.2496173
## [6,] -0.54475148 -0.9002886   0.1198533 -0.9769791 -0.3726633
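To see why scaling matters for Euclidean distance, here is a minimal sketch on three made-up songs (the values and the row names a, b and c are hypothetical). Before scaling, the distances are driven almost entirely by duration_ms; after scaling, energy contributes on a comparable footing.

```r
# Three hypothetical songs: a and b have similar energy,
# a and c have similar duration
toy <- data.frame(
  duration_ms = c(180000, 240000, 181000),
  energy      = c(0.90, 0.88, 0.10)
)
rownames(toy) <- c("a", "b", "c")

raw_dist    <- as.matrix(dist(toy))         # dominated by duration_ms
scaled_dist <- as.matrix(dist(scale(toy)))  # both columns contribute

raw_dist
scaled_dist
```

On the raw data, song c looks far closer to a than b does, even though their energy differs greatly. After scaling, the two distances are of similar size.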

After scaling the numerical variables, we reduce the number of observations in our dataset by randomly sampling the songs that we keep for further analysis. The main purpose is to cut computation time in later stages, particularly when we search for the optimum number of clusters k.

RNGkind(sample.kind = "Rounding")
set.seed(205)

reduction <- sample(x = nrow(spotify_clean), size = nrow(spotify_clean)*0.015)
spotify_keep <- spotify_clean[reduction,]

reduction2 <- sample(x = nrow(spotify_keep), size = nrow(spotify_keep)*0.2)
spotify_keep2 <- spotify_keep[reduction2,]
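The two sampling steps keep 1.5% of the rows and then 20% of those, or about 0.3% overall. The sketch below works out the resulting sample sizes from the row count reported by glimpse(). It assumes that sample() truncates a fractional size to an integer, which floor() makes explicit here.

```r
n_total  <- 232725                   # rows reported by glimpse()
n_first  <- floor(n_total * 0.015)   # size of spotify_keep
n_second <- floor(n_first * 0.2)     # size of spotify_keep2

c(first = n_first, second = n_second)
# first = 3490, second = 698
```

The final figure of 698 agrees with the cluster sizes reported by k-means later on (156 + 504 + 38).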

K-means works with numerical data, so we drop the factor columns from spotify_keep2 and use this dataset to find the optimum k for our clustering.

spotify_keep2 <- spotify_keep2 %>% 
  select(-where(is.factor))
spotify_keep2

Select the numeric variables, drop the popularity column and scale the result.

spotify_num <- spotify_keep2 %>%
  select(where(is.numeric),
         -popularity) %>% 
  scale()

head(spotify_num)
##      acousticness danceability duration_ms     energy instrumentalness
## [1,]    0.4900276   -1.2386902  -0.5672322 -1.5987966       2.55478985
## [2,]    1.7695260   -0.5039828  -0.4813279 -1.6398613       2.54505027
## [3,]    1.2327750   -2.2905374  -0.7884015 -1.8265189       2.31454681
## [4,]    0.9049217   -1.0272636   1.5254906 -1.1881498      -0.49687337
## [5,]    1.8362572   -1.3972601   1.7463485 -1.8776631      -0.06191139
## [6,]   -0.9874303    1.1292875  -0.2625976  0.8538846      -0.49594616
##        liveness     loudness speechiness      tempo    valence
## [1,] -0.4707458 -1.379297138  -0.4860337 -0.7353178 -1.6833010
## [2,] -0.5499596 -1.365667027  -0.4724891  1.2324887 -0.4178244
## [3,] -0.5861009 -2.074765223  -0.4599478 -0.9548168 -1.6778631
## [4,] -0.6420456  0.003161768  -0.4413866 -0.8316688 -1.0587195
## [5,] -0.4905493 -1.494654416  -0.3857032 -0.6414298 -1.6122199
## [6,] -0.7232397  0.871333329   0.5127564  1.2620443  1.2951134

Choosing the optimum k for the spotify_num dataset.

RNGkind(sample.kind = "Rounding")
set.seed(123)

library(factoextra)
fviz_nbclust(x = spotify_num,
             FUNcluster = kmeans,
             method = "wss")

The elbow plot above suggests that k = 3 is a reasonable choice. Let’s take 3 as our number of clusters and apply it to k-means clustering.
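The "wss" method plots the total within-cluster sum of squares against k. A minimal sketch of the same idea using only base R is shown below. It uses synthetic blobs rather than the Spotify data, and the object names toy, ks and wss are illustrative.

```r
set.seed(42)

# Synthetic data: three well-separated 2-D blobs (not Spotify data)
toy <- rbind(
  matrix(rnorm(100, mean = 0),  ncol = 2),
  matrix(rnorm(100, mean = 5),  ncol = 2),
  matrix(rnorm(100, mean = 10), ncol = 2)
)

# Total within-cluster sum of squares for k = 1..8
ks  <- 1:8
wss <- sapply(ks, function(k) {
  kmeans(toy, centers = k, nstart = 10)$tot.withinss
})

# The "elbow" is where the curve stops dropping sharply
plot(ks, wss, type = "b",
     xlab = "k", ylab = "total within-cluster sum of squares")
```

On data like this the curve drops steeply up to k = 3 and flattens afterwards. Real data such as spotify_num rarely shows so clean an elbow, which is why the choice of k remains a judgment call.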

K-means

# clustering with optimum k
RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(100)

spotify_k <- kmeans(spotify_num, 3)
spotify_k
## K-means clustering with 3 clusters of sizes 156, 504, 38
## 
## Cluster means:
##   acousticness danceability duration_ms     energy instrumentalness    liveness
## 1    1.2654487 -0.975911340  0.23950268 -1.3787957        0.9931063 -0.30279430
## 2   -0.4875179  0.301986643 -0.04078901  0.4065285       -0.2699248 -0.09661905
## 3    1.2710271  0.001076343 -0.44223043  0.2684674       -0.4969072  2.52452396
##     loudness speechiness      tempo    valence
## 1 -1.3092436  -0.4378847 -0.4190837 -0.9301845
## 2  0.4423671  -0.1361938  0.1936066  0.2994787
## 3 -0.4923951   3.6039917 -0.8473860 -0.1533811
## 
## Clustering vector:
##   [1] 1 1 1 1 1 2 2 1 2 2 1 2 3 2 2 3 2 2 2 2 2 2 2 1 2 1 2 2 2 2 2 1 1 2 1 2 2
##  [38] 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 3 2 2 1 2 1 2 2 2 1 2 2 2 1 1 3 3 2 1 1 2
##  [75] 2 2 2 1 2 2 1 1 2 2 2 2 2 2 1 1 2 2 2 2 2 2 1 3 2 2 2 2 2 3 3 1 2 1 2 2 1
## [112] 2 1 2 3 1 2 2 2 2 2 2 2 1 2 2 2 2 1 2 2 1 2 2 2 2 2 1 2 1 2 3 2 2 2 2 3 2
## [149] 2 2 2 2 2 1 2 1 2 2 1 2 2 2 2 2 1 2 2 2 2 2 2 2 2 1 1 2 2 2 2 2 1 2 2 3 2
## [186] 1 1 1 2 3 1 2 1 2 2 3 2 1 2 2 2 2 2 2 2 2 2 2 1 1 2 2 1 2 1 1 2 2 2 2 2 2
## [223] 2 2 1 2 2 2 2 2 2 1 2 1 2 2 2 2 1 2 2 2 2 2 1 2 2 1 1 2 1 2 1 2 2 3 2 3 2
## [260] 1 2 2 3 1 2 1 2 2 2 2 1 2 2 1 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 2 3 2
## [297] 1 2 2 2 1 2 1 2 3 1 2 1 2 2 2 2 2 2 2 2 2 1 2 1 2 3 2 2 2 2 1 2 1 1 2 1 3
## [334] 2 1 1 2 1 1 1 2 2 2 2 2 2 2 1 2 3 1 2 2 3 2 1 2 2 2 1 2 2 2 2 2 2 2 2 2 2
## [371] 1 2 2 2 2 2 2 1 2 2 3 2 1 1 1 2 1 2 2 2 2 2 2 1 2 2 2 2 2 2 2 3 1 2 1 2 2
## [408] 2 2 2 2 2 1 2 2 2 2 1 2 2 2 1 2 2 1 2 2 3 1 2 2 2 2 2 2 2 1 1 2 2 2 2 2 2
## [445] 1 2 1 2 2 2 2 2 2 2 2 2 1 2 2 1 2 2 2 2 2 2 2 2 2 2 2 3 2 1 2 1 2 2 2 2 1
## [482] 2 1 2 2 2 2 2 2 1 1 2 2 2 2 2 2 2 2 1 2 1 1 2 2 1 1 2 2 1 2 1 1 2 2 2 2 2
## [519] 2 2 2 2 3 2 2 2 1 2 2 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 3 1
## [556] 2 1 1 2 2 2 2 1 3 2 2 2 2 2 1 1 2 2 2 1 2 2 1 2 1 2 2 2 2 1 2 2 2 2 1 1 2
## [593] 2 2 2 3 1 1 2 2 2 1 2 2 2 2 1 2 2 3 2 1 2 2 2 2 3 2 1 2 2 2 2 1 2 2 1 2 2
## [630] 1 2 2 2 3 2 2 2 2 2 2 1 2 1 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 1 2 2 2 2 2 2 2
## [667] 2 3 2 2 2 2 2 2 2 1 2 2 2 2 2 2 1 2 2 2 2 3 2 2 2 2 1 1 2 2 2 2
## 
## Within cluster sum of squares by cluster:
## [1] 1313.8981 2723.2050  283.6731
##  (between_SS / total_SS =  38.0 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

Summary of the clustering results:

  • Cluster sizes:
    • 156 songs in cluster 1
    • 504 songs in cluster 2
    • 38 songs in cluster 3
  • Audio feature characteristics of each cluster:
    • Cluster 1: songs with high acousticness and instrumentalness and a comparatively longer duration, but the lowest danceability, speechiness and valence.
    • Cluster 2: songs with high danceability, loudness, energy and valence.
    • Cluster 3: songs with high liveness and speechiness.
  • BSS/TSS = 0.38
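The BSS/TSS ratio can be read directly from a kmeans object as betweenss divided by totss. The sketch below shows this on synthetic data (toy and fit are illustrative names, not objects from this report), along with the identity that the total sum of squares equals the between-cluster plus the total within-cluster sum of squares.

```r
set.seed(7)

# Synthetic data: two 2-D blobs (not Spotify data)
toy <- rbind(
  matrix(rnorm(60, mean = 0), ncol = 2),
  matrix(rnorm(60, mean = 4), ncol = 2)
)
fit <- kmeans(toy, centers = 2, nstart = 10)

# Share of total variation explained by the cluster assignment
ratio <- fit$betweenss / fit$totss
ratio

# totss = betweenss + tot.withinss
all.equal(fit$totss, fit$betweenss + fit$tot.withinss)
```

A ratio closer to 1 means the clusters are compact relative to the spread between them. The 0.38 obtained above indicates that a large share of the variation remains within clusters.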
spotify_keep2$cluster <- as.factor(spotify_k$cluster)
spotify_keep2
spotify_keep2 %>% 
  select(c(artist_name, track_name, cluster))

Frédéric Chopin’s Nocturnes, Op. 9: No. 2 in E-Flat Major is in cluster 1, which is characterised by high acousticness and instrumentalness, a comparatively longer duration, and the lowest danceability, speechiness and valence. It is in the same cluster as “Best Part (feat. Daniel Caesar)” by H.E.R.

Meanwhile, ’97 Bonnie & Clyde by Eminem falls into cluster 2, which is characterised by high danceability, loudness, energy and valence.

fviz_cluster(object = spotify_k,
             data = spotify_num)