Preface



With the development of the digital music industry, the need for a music recommendation system is becoming more and more relevant. There are times when someone listens to a song, she/he wants the next song to have the same character. These surely can be overcome by making several groups of songs based on their characteristics or audio features. In this project, we will use K-means to perform clustering. Based on groups created by K-means, when someone listens to a certain song, the system will choose and recommend the next song from the same group. We will also utilise PCA (Principal Component Analysis) to visualize our clusters that have been created.


Dependencies

# load library
library(dplyr)
library(GGally)
library(factoextra)
library(ggiraphExtra)
library(FactoMineR)
library(plotly)


About Dataset



The dataset in this project was obtained from kaggle, where it contains the audio features (and other attributes such as genre, track name, author name, etc.) of 232725 songs in Spotify.

# read data
song <- read.csv("data_input/spotify.csv", header = T, sep = ";")
colnames(song)[1] <- "genre"
# change unappropiate data type
song[,c("genre","key","mode","time_signature")] <- lapply(song[,c("genre","key","mode","time_signature")], as.factor)
# data glimpse
str(song)
## 'data.frame':    232725 obs. of  18 variables:
##  $ genre           : Factor w/ 27 levels "A Capella","Alternative",..: 16 16 16 16 16 16 16 16 16 16 ...
##  $ artist_name     : chr  "Henri Salvador" "Martin & les fées" "Joseph Williams" "Henri Salvador" ...
##  $ track_name      : chr  "C'est beau de faire un Show" "Perdu d'avance (par Gad Elmaleh)" "Don't Let Me Be Lonely Tonight" "Dis-moi Monsieur Gordon Cooper" ...
##  $ track_id        : chr  "0BRjO6ga9RKCKjfDqeFgWV" "0BjC1NfoEOOusryehmNudP" "0CoSDzoNIKCRs124s9uTVy" "0Gc6TVm52BwZD07Ki6tIvf" ...
##  $ popularity      : int  0 1 3 0 4 0 2 15 0 10 ...
##  $ acousticness    : num  0.611 0.246 0.952 0.703 0.95 0.749 0.344 0.939 0.00104 0.319 ...
##  $ danceability    : num  0.389 0.59 0.663 0.24 0.331 0.578 0.703 0.416 0.734 0.598 ...
##  $ duration_ms     : int  99373 137373 170267 152427 82625 160627 212293 240067 226200 152694 ...
##  $ energy          : num  0.91 0.737 0.131 0.326 0.225 0.0948 0.27 0.269 0.481 0.705 ...
##  $ instrumentalness: num  0 0 0 0 0.123 0 0 0 0.00086 0.00125 ...
##  $ key             : Factor w/ 12 levels "A","A#","B","C",..: 5 10 4 5 9 5 5 10 4 11 ...
##  $ liveness        : num  0.346 0.151 0.103 0.0985 0.202 0.107 0.105 0.113 0.0765 0.349 ...
##  $ loudness        : num  -1.83 -5.56 -13.88 -12.18 -21.15 ...
##  $ mode            : Factor w/ 2 levels "Major","Minor": 1 2 2 1 1 1 1 1 1 1 ...
##  $ speechiness     : num  0.0525 0.0868 0.0362 0.0395 0.0456 0.143 0.953 0.0286 0.046 0.0281 ...
##  $ tempo           : num  167 174 99.5 171.8 140.6 ...
##  $ time_signature  : Factor w/ 5 levels "0/4","1/4","3/4",..: 4 4 5 4 4 4 4 4 4 4 ...
##  $ valence         : num  0.814 0.816 0.368 0.227 0.39 0.358 0.533 0.274 0.765 0.718 ...

This dataset contains 18 variables:

  • genre: the genre of a track
  • artist_name: the composer/singer of a track
  • track_name: the name of a track
  • track_id: the id of a track
  • popularity: popularity scale created by spotify from 0 to 100; the higher the popularity index, the more likely the song will be recommended to new listeners
  • acousticness: a confidence measure from 0.0 to 1.0 of whether the track is acoustic
  • danceability: a confidence measure from 0.0 to 1.0 of whether the track is suitable for dancing
  • duration_ms: the duration of the track in milliseconds
  • energy: a measure from 0.0 to 1.0 which represents a perceptual measure of intensity and activity
  • instrumentalness: a confidence measure from 0.0 to 1.0 of whether the track has no vocals
  • key: the key the track is in
  • liveness: a confidence measure from 0.0 to 1.0 of whether the track was performed live
  • loudness: the overall loudness of a track in decibels (dB)
  • mode: the modality (major or minor) of a track
  • speechiness: a confidence measure from 0.0 to 1.0 of whether the track has the spoken words
  • tempo: the overall estimated tempo of a track in beats per minute (BPM)
  • time_signature: a notational convention to specify how many beats are in each bar of tracks
  • valence: a measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track


Missing Value and Duplicates
# check missing values
colSums(is.na(song))
##            genre      artist_name       track_name         track_id 
##                0                0                0                0 
##       popularity     acousticness     danceability      duration_ms 
##                0                0                0                0 
##           energy instrumentalness              key         liveness 
##                0                0                0                0 
##         loudness             mode      speechiness            tempo 
##                0                0                0                0 
##   time_signature          valence 
##                0                0

The dataset has no missing values.

# remove duplicated tracks
song <- song %>% 
  group_by(track_id) %>% 
  filter (! duplicated(track_id)) %>% 
  as.data.frame()
song$index <- 1:nrow(song)
rownames(song) <- song$index
song <-  song %>% 
  select(-index)
head(song)


Filter

K-means only uses numeric variables. Hence, we can omit non-numeric variables.

# omit non-numeric variables
song_num <- song %>% 
  select_if(is.numeric)
song_num$index <- 1:nrow(song_num)
rownames(song_num) <- song_num$index
song_num <-  song_num %>% 
  select(-index)


Scale
# data summary
summary(song_num)
##    popularity      acousticness     danceability     duration_ms     
##  Min.   :  0.00   Min.   :0.0000   Min.   :0.0569   Min.   :  15387  
##  1st Qu.: 25.00   1st Qu.:0.0456   1st Qu.:0.4150   1st Qu.: 178253  
##  Median : 37.00   Median :0.2880   Median :0.5580   Median : 219453  
##  Mean   : 36.27   Mean   :0.4041   Mean   :0.5411   Mean   : 236127  
##  3rd Qu.: 49.00   3rd Qu.:0.7910   3rd Qu.:0.6830   3rd Qu.: 268547  
##  Max.   :100.00   Max.   :0.9960   Max.   :0.9890   Max.   :5552917  
##      energy         instrumentalness       liveness          loudness      
##  Min.   :2.03e-05   Min.   :0.0000000   Min.   :0.00967   Min.   :-52.457  
##  1st Qu.:3.44e-01   1st Qu.:0.0000000   1st Qu.:0.09750   1st Qu.:-12.851  
##  Median :5.92e-01   Median :0.0000704   Median :0.13000   Median : -8.191  
##  Mean   :5.57e-01   Mean   :0.1720729   Mean   :0.22453   Mean   :-10.138  
##  3rd Qu.:7.89e-01   3rd Qu.:0.0908000   3rd Qu.:0.27700   3rd Qu.: -5.631  
##  Max.   :9.99e-01   Max.   :0.9990000   Max.   :1.00000   Max.   :  3.744  
##   speechiness         tempo           valence      
##  Min.   :0.0222   Min.   : 30.38   Min.   :0.0000  
##  1st Qu.:0.0368   1st Qu.: 92.01   1st Qu.:0.2220  
##  Median :0.0494   Median :115.01   Median :0.4400  
##  Mean   :0.1274   Mean   :117.20   Mean   :0.4516  
##  3rd Qu.:0.1020   3rd Qu.:138.80   3rd Qu.:0.6670  
##  Max.   :0.9670   Max.   :242.90   Max.   :1.0000

The numerical variables in the dataset appear to have varying scales. This can interfere analysis steps of K-means and PCA (see picture below: PC1 is considered to be able to capture the most variance, meanwhile the next PC doesn’t seem to be able to catch any information). Therefore, we have to do some scaling on the data.

The popularity variable will not be included as it won’t be used in the K-means and PCA analysis.

# before scaling
plot(prcomp(song_num %>% select(-popularity)))

# scaling
song_scale <- song_num %>% 
  select(-popularity) %>% 
  scale() %>% 
  as.data.frame()
song_scale$index <- 1:nrow(song_scale)
rownames(song_scale) <- song_scale$index
song_scale <-  song_scale %>% 
  select(-index)

Now, PC1 no longer dominates the other PCs in terms of the amount of variance captured (see picture below).

# after scaling
plot(prcomp(song_scale))

Analysis


Music Recommendation System

Music Recommendation System


Exploratory Data Analysis

1. Correlation between numerical variables

# check correlation between numerical variables
ggcorr(song_num, hjust = 1, layout.exp = 2, label = TRUE)

The heatmap above shows that many variables having no correlation or weak correlation with one another. Only a few variables have a strong correlation, such as energy and loudness; acousticness and energy; acousticness and loudness; danceability and valence; liveness and speechiness.

With this condition, there is a possibility that this dataset has no multicollinearity and might be suitable for classification algorithms which have non-multicollinearity as their assumption. However, PCA can still be freely applied in any analysis, although the goal is no longer to eliminate multicollinearity, but to reduce data dimensions.

2. Correlation test between popularity and other numerical variables

# check correlation between popularity and other variables
# a starred value means that there is a significant correlation
ggduo(song_num, 
      c("acousticness","danceability","duration_ms","energy","instrumentalness"), "popularity", 
      types = list(continuous = "cor"))

# check correlation between popularity and other variables
# a starred value means that there is a significant correlation
ggduo(song_num, 
      c("liveness","loudness","speechiness","tempo","valence"), "popularity", 
      types = list(continuous = "cor"))

Based on the correlation test above, all numerical variables having correlation with popularity.

3. Correlation test between popularity and categorical variables

# check correlation between popularity and other variable
ggduo(song, 
      "key",
      "popularity")

# check correlation between popularity and other variable
ggduo(song, 
      "mode",
      "popularity")

# check correlation between popularity and other variable
ggduo(song, 
      "time_signature",
      "popularity")

According to the boxplots above, it can be concluded that there is no significant difference of popularity in between the categorical levels. So the conclusion is: there is probably no significant relationship between popularity and categorical variables.

4. Identifying variables that affect popularity

If we only use the variables related to popularity as the predictor variables, then the linear regression model will be:

# built linear regression model
summary(lm(popularity~.,data = song_num))
## 
## Call:
## lm(formula = popularity ~ ., data = song_num)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -52.908 -10.326   0.853  10.550  60.547 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       4.707e+01  3.785e-01 124.353  < 2e-16 ***
## acousticness     -1.066e+01  1.725e-01 -61.778  < 2e-16 ***
## danceability      1.680e+01  2.715e-01  61.901  < 2e-16 ***
## duration_ms       3.282e-06  2.877e-07  11.405  < 2e-16 ***
## energy           -2.451e+00  3.093e-01  -7.924 2.31e-15 ***
## instrumentalness -1.811e+00  1.399e-01 -12.948  < 2e-16 ***
## liveness         -7.987e+00  2.197e-01 -36.356  < 2e-16 ***
## loudness          5.259e-01  1.200e-02  43.822  < 2e-16 ***
## speechiness      -7.736e+00  2.460e-01 -31.441  < 2e-16 ***
## tempo            -4.376e-03  1.246e-03  -3.513 0.000443 ***
## valence          -1.342e+01  1.863e-01 -72.015  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.56 on 176763 degrees of freedom
## Multiple R-squared:  0.1997, Adjusted R-squared:  0.1996 
## F-statistic:  4410 on 10 and 176763 DF,  p-value: < 2.2e-16

From the linear regression, it can be seen that the acousticness, duration_ms, danceability, energy, instrumentalness, liveness, loudness, speechiness, tempo, valence have significant effect on popularity.


Clustering

1. Finding the Optimal Number of Clusters

# set.seed(123)

# wss <- function(k) {
#   kmeans(song_scale, k, nstart = 10)$tot.withinss
#   }

k.values <- 1:10

# wss_values <- map_dbl(k.values, wss)

wss_values <- c(1767730,1308303.3,1076345.7,977496.3,891803.7,838118.1,786962.3,742764.8,707008.2,668068.3)

plot(k.values, wss_values,
       type="b", pch = 19, frame = FALSE, 
       xlab="Number of Clusters",
       ylab="Total Within Sum of Squares")

Based on the elbow plot above, we can see that 5 is the optimum number of k. After k = 5, increasing the number of k does not result in a considerable decrease of the total within sum of squares (strong internal cohesion).

2. K-means Clustering

# k-means clustering
RNGkind(sample.kind = "Rounding")
set.seed(123)
km <- kmeans(song_scale, 5)
# cluster size
km$size
## [1] 69077  9943 41984   739 55031
# k-means iteration
km$iter
## [1] 5

The clustering process forms 5 clusters, with the size of each cluster: 69077; 9943; 41984; 739; 55031, and by using 5 iterations (this iteration represents the number of times the k-means algorithm process is repeated until it produces a stable grouping). Looking at the cluster size, it can be seen that the k-means form groupings with an unproportional size.

3. Goodness of Fit

# within sum of square (wss)
# wss value is expected to be small
km$withinss
## [1] 283643.41  55289.87 282451.98  34769.85 269875.35
# between sum of square (bss)
km$betweenss
## [1] 841699.5
# total sum of square (bss)
km$totss
## [1] 1767730
# ratio of bss/tss
# a ratio value close to 1 indicates a good clustering as its result can represent the true distribution of data
# a good clustering doesn't mean an optimal clustering
km$betweenss / km$totss
## [1] 0.4761471

4. Profiling

# add the clustering result to the dataframe
song$cluster <- km$cluster
song$cluster <- as.factor(song$cluster)
song_num$cluster <- km$cluster
song_num$cluster <- as.factor(song_num$cluster)
song_scale$cluster <- km$cluster
song_scale$cluster <- as.factor(song_scale$cluster)
# profiling with radar chart
ggRadar(data = song_num %>% select(-popularity), aes(colour = cluster), legend.position = 'left')   

# profiling with centers
km$centers
##   acousticness danceability duration_ms     energy instrumentalness   liveness
## 1   -0.2905351    0.7865894 -0.12856349  0.1849447       -0.3228912 -0.2663324
## 2    1.0673302    0.1069751 -0.13166323  0.3900697       -0.5289218  2.4191457
## 3    1.1996687   -1.0463695  0.01283230 -1.3186526        0.9436266 -0.2813642
## 4    0.9574224   -1.0075082  8.68601551 -0.8906499        0.8782040  0.1792343
## 5   -0.7562573   -0.1948640  0.05873401  0.7153534       -0.2308296  0.1094695
##     loudness speechiness      tempo     valence
## 1  0.3414617  -0.1511265 -0.2830981  0.63442351
## 2 -0.2922395   3.6661326 -0.6140478 -0.14497882
## 3 -1.2915896  -0.3800776 -0.3899895 -0.94842320
## 4 -1.0544855   0.4066059 -0.5497694 -0.78077720
## 5  0.6237202  -0.1881902  0.7712134 -0.03610609

Profiling:

  • Cluster 1: the tracks have high danceability, high valence (the tracks mostly have happy vibe and are suitable for dancing)
  • Cluster 2: the tracks have short duration_ms, low instrumentalness, high liveness, high speechiness, low tempo (the tracks were mostly performed live, probably a lot of them are in the form of speech/talkshow)
  • Cluster 3: the tracks have high acousticness, low danceability, low energy, high instrumentalness, low liveness, low loudness, low speechiness, low valence (the tracks are mostly slow, soft, and have sad-vibe)
  • Cluster 4: the tracks have long duration_ms
  • Cluster 5: the tracks have low acousticness, high energy, high loudness, high tempo (the tracks are loud and fast, probably a lot of them are in rock genre)

5. Song Recommender

1] Example 1

When someone listens to a song by Rihanna entitled Umbrella,

song %>% filter(artist_name == "Rihanna" & track_name == "Umbrella")

The recommendation system will suggest the following 3 songs for the next songs (because they have same characteristics as “Umbrella” by Rihanna and also have a high popularity).

song_5_umbrella <- song %>% filter(cluster == 5 & genre == "Dance" & mode == "Major" & key == "C#" & time_signature == "4/4") %>% filter(track_id != "49FYlytm3dAAraYgpoJZux")
song_5_umbrella[order(song_5_umbrella$popularity,decreasing=T),] %>% 
  head(3)

2] Example 2

When someone listens to a song by Imagine Dragons entitled Believer,

song %>% filter(artist_name == "Imagine Dragons" & track_name == "Believer")

The recommendation system will suggest the following 3 songs for the next songs (because they have same characteristics as “Believer by” Imagine Dragons and also have a high popularity).

song_1_believer <- song %>% filter(cluster == 1 & genre == "Rock" & mode == "Minor" & key == "A#" & time_signature == "4/4") %>% filter(track_id != "0pqnGHJpmpxLKifKRmU6WP")
song_1_believer[order(song_1_believer$popularity,decreasing=T),] %>% 
  head(3)

3] Example 3

When someone listens to a song by Billie Eilish entitled come out and play,

song %>% filter(artist_name == "Billie Eilish" & track_name == "come out and play")

The recommendation system will suggest the following 3 songs for the next songs (because they have same characteristics as “come out and play” by Billie Eilish and also have a high popularity).

song_3_comeout <- song %>% filter(cluster == 3 & genre == "Pop" & mode == "Major" & key == "C" & time_signature == "4/4") %>% filter(track_id != "7wC5eZcFS1Q1BsQ35DU6H4")
song_3_comeout[order(song_3_comeout$popularity,decreasing=T),] %>% 
  head(2)


Principal Component Analysis (PCA)
# choose numerical and categorical variable
quanti <- song_scale %>% 
  select_if(is.numeric) %>% 
  colnames()

quantivar <- which(colnames(song_scale) %in% quanti)

quali <- song_scale %>% 
  select_if(is.factor) %>% 
  colnames()

qualivar <- which(colnames(song_scale) %in% quali)

1. PCA using FactoMiner

# pca using factominer
song_pca <- PCA(X = song_scale, 
                scale.unit = F,
                quali.sup = qualivar, 
                graph = F, 
                ncp = 10) 
song_pca$eig
##         eigenvalue percentage of variance cumulative percentage of variance
## comp 1   3.5415319              35.415520                          35.41552
## comp 2   1.6946614              16.946710                          52.36223
## comp 3   1.1524565              11.524630                          63.88686
## comp 4   0.9376440               9.376493                          73.26335
## comp 5   0.7828602               7.828647                          81.09200
## comp 6   0.6747342               6.747380                          87.83938
## comp 7   0.4780340               4.780367                          92.61975
## comp 8   0.3612205               3.612226                          96.23197
## comp 9   0.2679926               2.679941                          98.91191
## comp 10  0.1088082               1.088088                         100.00000

With PCA, we can reduce the dimension of the dataset while also retaining as much information as possible. For example, if we want to retain at least 81% of the information from our data, we can take only PC1-PC5, hence we are able to reduce ~50% of dimension from the original data while retaining 81.09200% of the information from the data.

We can extract the values of PC1 to PC5 from all of the observations and put it into a new data frame. This data frame can later be analyzed using supervised learning classification technique or for other purposes (see below for the illustration)

# make a new df from pca
df_pca <- data.frame(song_pca$ind$coord[, 1:5]) %>% bind_cols(cluster = as.factor(km$cluster)) %>% 
    select(cluster, 1:5)
head(df_pca)

However, in visualization, usually use only PC1-PC2 (2 dimensions) or PC1-PC3 (3 dimensions) so the graph won’t be confusing.

2. Individual Factor Map

# individual factor map
plot.PCA(x = song_pca,
         choix = "ind",
         invisible = "quali",
         habillage = "cluster",
         select = "contrib 3")

Based on Individual Factor Map above, we can find out 3 outliers, one of them is in observation of 169722. We can also conclude that: if we use 2 PCs, we will have very little dimensions to represent our large dataset. So, on the visualization above, our cluster looks like intersecting with each other because we don’t have enough dimensions to represent them.

3. Variable Factor Map and Variable Contribution

# variable factor map
plot.PCA(song_pca, choix = "var")

# variable contribution with PC1
fviz_contrib(X = song_pca,
             choice = "var",
             axes = 1)

# variable contribution with PC2
fviz_contrib(X = song_pca,
             choice = "var",
             axes = 2)

From the graphs above, we can conclude that:

  • 3 variables that contribute greatly to PC1: loudness, energy, acousticness
  • 2 variables that contribute greatly to PC2: speechiness, liveness
  • speechiness and liveness have strong positive correlation
  • energy and danceability have strong positive correlation
  • valence and loudness have strong positive correlation
  • speechiness and valence have no correlation
  • speechiness and loudness have no correlation
  • liveness and valence have no correlation
  • energy and acousticness have strong negative correlation
  • danceability and acousticness have strong negative correlation
  • loudness and acousticness have strong negative correlation
  • energy and instrumentalness have strong negative correlation
  • danceability and instrumentalness have strong negative correlation
  • loudness and instrumentalness have strong negative correlation

4. Combining Clustering and PCA

fviz_cluster(object = km,
             data = song_scale %>% select(-cluster),
             labelsize = 3) + theme_minimal()

Same problem like the Individual Factor Map: it looks like we don’t have enough dimensions to represent our large dataset, as on the visualization above, our cluster looks like intersecting with each other.

To overcome this, we may add 1 more dimensions using plotly to see if our clusters is still clumped together.

plot_ly(df_pca, x = ~Dim.1, y = ~Dim.2, z = ~Dim.3, color = ~cluster, colors = c("black", 
    "red", "green", "blue")) %>% add_markers() %>% layout(scene = list(xaxis = list(title = "Dim.1"), 
    yaxis = list(title = "Dim.2"), zaxis = list(title = "Dim.3")))

Conclusion

  • Dimensional reduction can be done using this dataset. To perform dimensionality reduction, we can select several PCs from a total of 10 PCs according to the total information we want to store.
  • Music recommendation system can be well made using k-means algorithm
  • If you want to improve clustering performance, you can use k-modes or k-prototype in which they also take categorical variables into account.