With the growth of the digital music industry, the need for music recommendation systems is becoming more and more relevant. Often, when someone listens to a song, they want the next song to have a similar character. This can be addressed by grouping songs based on their characteristics, i.e. their audio features. In this project, we will use K-means to perform the clustering. Based on the groups created by K-means, when someone listens to a certain song, the system will choose and recommend the next song from the same group. We will also use PCA (Principal Component Analysis) to visualize the clusters that have been created.
# load library
library(dplyr)
library(GGally)
library(factoextra)
library(ggiraphExtra)
library(FactoMineR)
library(plotly)

The dataset for this project was obtained from Kaggle. It contains the audio features (along with other attributes such as genre, track name, and artist name) of 232,725 songs on Spotify.
# read data
song <- read.csv("data_input/spotify.csv", header = T, sep = ";")
colnames(song)[1] <- "genre"

# change inappropriate data types
song[,c("genre","key","mode","time_signature")] <- lapply(song[,c("genre","key","mode","time_signature")], as.factor)

# data glimpse
str(song)

## 'data.frame': 232725 obs. of 18 variables:
## $ genre : Factor w/ 27 levels "A Capella","Alternative",..: 16 16 16 16 16 16 16 16 16 16 ...
## $ artist_name : chr "Henri Salvador" "Martin & les fées" "Joseph Williams" "Henri Salvador" ...
## $ track_name : chr "C'est beau de faire un Show" "Perdu d'avance (par Gad Elmaleh)" "Don't Let Me Be Lonely Tonight" "Dis-moi Monsieur Gordon Cooper" ...
## $ track_id : chr "0BRjO6ga9RKCKjfDqeFgWV" "0BjC1NfoEOOusryehmNudP" "0CoSDzoNIKCRs124s9uTVy" "0Gc6TVm52BwZD07Ki6tIvf" ...
## $ popularity : int 0 1 3 0 4 0 2 15 0 10 ...
## $ acousticness : num 0.611 0.246 0.952 0.703 0.95 0.749 0.344 0.939 0.00104 0.319 ...
## $ danceability : num 0.389 0.59 0.663 0.24 0.331 0.578 0.703 0.416 0.734 0.598 ...
## $ duration_ms : int 99373 137373 170267 152427 82625 160627 212293 240067 226200 152694 ...
## $ energy : num 0.91 0.737 0.131 0.326 0.225 0.0948 0.27 0.269 0.481 0.705 ...
## $ instrumentalness: num 0 0 0 0 0.123 0 0 0 0.00086 0.00125 ...
## $ key : Factor w/ 12 levels "A","A#","B","C",..: 5 10 4 5 9 5 5 10 4 11 ...
## $ liveness : num 0.346 0.151 0.103 0.0985 0.202 0.107 0.105 0.113 0.0765 0.349 ...
## $ loudness : num -1.83 -5.56 -13.88 -12.18 -21.15 ...
## $ mode : Factor w/ 2 levels "Major","Minor": 1 2 2 1 1 1 1 1 1 1 ...
## $ speechiness : num 0.0525 0.0868 0.0362 0.0395 0.0456 0.143 0.953 0.0286 0.046 0.0281 ...
## $ tempo : num 167 174 99.5 171.8 140.6 ...
## $ time_signature : Factor w/ 5 levels "0/4","1/4","3/4",..: 4 4 5 4 4 4 4 4 4 4 ...
## $ valence : num 0.814 0.816 0.368 0.227 0.39 0.358 0.533 0.274 0.765 0.718 ...
This dataset contains 18 variables:
- genre: the genre of a track
- artist_name: the composer/singer of a track
- track_name: the name of a track
- track_id: the ID of a track
- popularity: a popularity scale created by Spotify, from 0 to 100; the higher the popularity index, the more likely the song will be recommended to new listeners
- acousticness: a confidence measure from 0.0 to 1.0 of whether the track is acoustic
- danceability: a confidence measure from 0.0 to 1.0 of whether the track is suitable for dancing
- duration_ms: the duration of the track in milliseconds
- energy: a measure from 0.0 to 1.0 which represents a perceptual measure of intensity and activity
- instrumentalness: a confidence measure from 0.0 to 1.0 of whether the track has no vocals
- key: the key the track is in
- liveness: a confidence measure from 0.0 to 1.0 of whether the track was performed live
- loudness: the overall loudness of a track in decibels (dB)
- mode: the modality (major or minor) of a track
- speechiness: a confidence measure from 0.0 to 1.0 of whether the track contains spoken words
- tempo: the overall estimated tempo of a track in beats per minute (BPM)
- time_signature: a notational convention to specify how many beats are in each bar of a track
- valence: a measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track

# check missing values
colSums(is.na(song))

## genre artist_name track_name track_id
## 0 0 0 0
## popularity acousticness danceability duration_ms
## 0 0 0 0
## energy instrumentalness key liveness
## 0 0 0 0
## loudness mode speechiness tempo
## 0 0 0 0
## time_signature valence
## 0 0
The dataset has no missing values.
# remove duplicated tracks
song <- song %>%
group_by(track_id) %>%
filter (! duplicated(track_id)) %>%
as.data.frame()
song$index <- 1:nrow(song)
rownames(song) <- song$index
song <- song %>%
select(-index)
head(song)

K-means works only with numeric variables, so we can omit the non-numeric ones.
# omit non-numeric variables
song_num <- song %>%
select_if(is.numeric)
song_num$index <- 1:nrow(song_num)
rownames(song_num) <- song_num$index
song_num <- song_num %>%
select(-index)

# data summary
summary(song_num)

## popularity acousticness danceability duration_ms
## Min. : 0.00 Min. :0.0000 Min. :0.0569 Min. : 15387
## 1st Qu.: 25.00 1st Qu.:0.0456 1st Qu.:0.4150 1st Qu.: 178253
## Median : 37.00 Median :0.2880 Median :0.5580 Median : 219453
## Mean : 36.27 Mean :0.4041 Mean :0.5411 Mean : 236127
## 3rd Qu.: 49.00 3rd Qu.:0.7910 3rd Qu.:0.6830 3rd Qu.: 268547
## Max. :100.00 Max. :0.9960 Max. :0.9890 Max. :5552917
## energy instrumentalness liveness loudness
## Min. :2.03e-05 Min. :0.0000000 Min. :0.00967 Min. :-52.457
## 1st Qu.:3.44e-01 1st Qu.:0.0000000 1st Qu.:0.09750 1st Qu.:-12.851
## Median :5.92e-01 Median :0.0000704 Median :0.13000 Median : -8.191
## Mean :5.57e-01 Mean :0.1720729 Mean :0.22453 Mean :-10.138
## 3rd Qu.:7.89e-01 3rd Qu.:0.0908000 3rd Qu.:0.27700 3rd Qu.: -5.631
## Max. :9.99e-01 Max. :0.9990000 Max. :1.00000 Max. : 3.744
## speechiness tempo valence
## Min. :0.0222 Min. : 30.38 Min. :0.0000
## 1st Qu.:0.0368 1st Qu.: 92.01 1st Qu.:0.2220
## Median :0.0494 Median :115.01 Median :0.4400
## Mean :0.1274 Mean :117.20 Mean :0.4516
## 3rd Qu.:0.1020 3rd Qu.:138.80 3rd Qu.:0.6670
## Max. :0.9670 Max. :242.90 Max. :1.0000
The numerical variables in the dataset have widely varying scales. This can interfere with the K-means and PCA analysis (see the "before scaling" plot below: PC1 appears to capture almost all of the variance, while the remaining PCs barely capture any information). Therefore, we have to scale the data.
The popularity variable will not be included as it won’t be used in the K-means and PCA analysis.
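As a quick check of why PC1 dominates before scaling, a minimal sketch below looks at the raw variances of the numeric features (assuming song_num still contains only the original numeric columns at this point):

# raw variances of the numeric features (popularity excluded);
# duration_ms sits on a vastly larger scale than everything else
sort(sapply(song_num %>% select(-popularity), var), decreasing = TRUE)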
# before scaling
plot(prcomp(song_num %>% select(-popularity)))

# scaling
song_scale <- song_num %>%
select(-popularity) %>%
scale() %>%
as.data.frame()
song_scale$index <- 1:nrow(song_scale)
rownames(song_scale) <- song_scale$index
song_scale <- song_scale %>%
select(-index)

Now, PC1 no longer dominates the other PCs in terms of the amount of variance captured (see the plot below).
# after scaling
plot(prcomp(song_scale))

Music Recommendation System
1. Correlation between numerical variables
# check correlation between numerical variables
ggcorr(song_num, hjust = 1, layout.exp = 2, label = TRUE)

The heatmap above shows that most variables have little or no correlation with one another. Only a few pairs are strongly correlated, such as energy and loudness; acousticness and energy; acousticness and loudness; danceability and valence; liveness and speechiness.
Given this, the dataset likely has little multicollinearity and might be suitable for classification algorithms that assume no multicollinearity. PCA can still be applied in any case, although the goal is then not to eliminate multicollinearity but to reduce the dimensionality of the data.
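To back up the heatmap reading with numbers, a small sketch like the following lists every pair of numeric variables whose absolute correlation exceeds a threshold (0.5 here, chosen purely for illustration):

# pairs of numeric variables with |correlation| > 0.5
corr_mat <- cor(song_num)
strong <- which(abs(corr_mat) > 0.5 & upper.tri(corr_mat), arr.ind = TRUE)
data.frame(var1 = rownames(corr_mat)[strong[, 1]],
           var2 = colnames(corr_mat)[strong[, 2]],
           corr = round(corr_mat[strong], 2))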
2. Correlation test between popularity and other numerical variables
# check correlation between popularity and other variables
# a starred value means that there is a significant correlation
ggduo(song_num,
c("acousticness","danceability","duration_ms","energy","instrumentalness"), "popularity",
types = list(continuous = "cor"))# check correlation between popularity and other variables
# a starred value means that there is a significant correlation
ggduo(song_num,
c("liveness","loudness","speechiness","tempo","valence"), "popularity",
types = list(continuous = "cor"))Based on the correlation test above, all numerical variables having
correlation with popularity.
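The stars in the plots above come from a correlation test; as a sketch, the same test can be run directly for any single pair, for example danceability against popularity:

# correlation test for one pair of variables
cor.test(song_num$danceability, song_num$popularity)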
3. Correlation test between popularity and categorical variables
# check correlation between popularity and other variable
ggduo(song,
"key",
"popularity")# check correlation between popularity and other variable
ggduo(song,
"mode",
"popularity")# check correlation between popularity and other variable
ggduo(song,
"time_signature",
"popularity")According to the boxplots above, it can be concluded that there is no
significant difference of popularity in between the categorical levels.
So the conclusion is: there is probably no significant relationship
between popularity and categorical variables.
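As a quick numeric companion to the boxplots, a minimal sketch comparing the median popularity per level of each categorical variable:

# median popularity for each level of the categorical variables
aggregate(popularity ~ key, data = song, FUN = median)
aggregate(popularity ~ mode, data = song, FUN = median)
aggregate(popularity ~ time_signature, data = song, FUN = median)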
4. Identifying variables that affect popularity
If we only use the variables related to popularity as
the predictor variables, then the linear regression model will be:
# build a linear regression model
summary(lm(popularity ~ ., data = song_num))

##
## Call:
## lm(formula = popularity ~ ., data = song_num)
##
## Residuals:
## Min 1Q Median 3Q Max
## -52.908 -10.326 0.853 10.550 60.547
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.707e+01 3.785e-01 124.353 < 2e-16 ***
## acousticness -1.066e+01 1.725e-01 -61.778 < 2e-16 ***
## danceability 1.680e+01 2.715e-01 61.901 < 2e-16 ***
## duration_ms 3.282e-06 2.877e-07 11.405 < 2e-16 ***
## energy -2.451e+00 3.093e-01 -7.924 2.31e-15 ***
## instrumentalness -1.811e+00 1.399e-01 -12.948 < 2e-16 ***
## liveness -7.987e+00 2.197e-01 -36.356 < 2e-16 ***
## loudness 5.259e-01 1.200e-02 43.822 < 2e-16 ***
## speechiness -7.736e+00 2.460e-01 -31.441 < 2e-16 ***
## tempo -4.376e-03 1.246e-03 -3.513 0.000443 ***
## valence -1.342e+01 1.863e-01 -72.015 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.56 on 176763 degrees of freedom
## Multiple R-squared: 0.1997, Adjusted R-squared: 0.1996
## F-statistic: 4410 on 10 and 176763 DF, p-value: < 2.2e-16
From the linear regression, it can be seen that acousticness, danceability, duration_ms, energy, instrumentalness, liveness, loudness, speechiness, tempo, and valence all have a significant effect on popularity.
1. Finding the Optimal Number of Clusters
# the wss values below are hard-coded; the commented-out code shows how they
# can be recomputed (map_dbl() comes from the purrr package, which is not loaded here)
# set.seed(123)
# wss <- function(k) {
#   kmeans(song_scale, k, nstart = 10)$tot.withinss
# }
k.values <- 1:10
# wss_values <- map_dbl(k.values, wss)
wss_values <- c(1767730, 1308303.3, 1076345.7, 977496.3, 891803.7, 838118.1, 786962.3, 742764.8, 707008.2, 668068.3)
plot(k.values, wss_values,
type="b", pch = 19, frame = FALSE,
xlab="Number of Clusters",
ylab="Total Within Sum of Squares")Based on the elbow plot above, we can see that 5 is the optimum number of k. After k = 5, increasing the number of k does not result in a considerable decrease of the total within sum of squares (strong internal cohesion).
2. K-means Clustering
# k-means clustering
RNGkind(sample.kind = "Rounding")
set.seed(123)
km <- kmeans(song_scale, 5)

# cluster size
km$size

## [1] 69077 9943 41984 739 55031

# k-means iteration
km$iter

## [1] 5
The clustering process forms 5 clusters with sizes 69,077; 9,943; 41,984; 739; and 55,031, using 5 iterations (the number of times the k-means algorithm repeats its assignment and update steps until the grouping stabilises). Looking at the cluster sizes, it is clear that k-means produces groups of very unequal size (see the proportions below).
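A quick sketch of how unequal the groups are, expressed as a share of all tracks:

# proportion of observations in each cluster
round(km$size / sum(km$size), 3)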
3. Goodness of Fit
# within sum of square (wss)
# wss value is expected to be small
km$withinss

## [1] 283643.41 55289.87 282451.98 34769.85 269875.35
# between sum of square (bss)
km$betweenss

## [1] 841699.5

# total sum of square (tss)
km$totss

## [1] 1767730
# ratio of bss/tss
# a ratio value close to 1 indicates a good clustering as its result can represent the true distribution of data
# a good clustering doesn't mean an optimal clustering
km$betweenss / km$totss

## [1] 0.4761471
4. Profiling
# add the clustering result to the dataframe
song$cluster <- km$cluster
song$cluster <- as.factor(song$cluster)
song_num$cluster <- km$cluster
song_num$cluster <- as.factor(song_num$cluster)
song_scale$cluster <- km$cluster
song_scale$cluster <- as.factor(song_scale$cluster)

# profiling with radar chart
ggRadar(data = song_num %>% select(-popularity), aes(colour = cluster), legend.position = 'left')

# profiling with centers
km$centers

## acousticness danceability duration_ms energy instrumentalness liveness
## 1 -0.2905351 0.7865894 -0.12856349 0.1849447 -0.3228912 -0.2663324
## 2 1.0673302 0.1069751 -0.13166323 0.3900697 -0.5289218 2.4191457
## 3 1.1996687 -1.0463695 0.01283230 -1.3186526 0.9436266 -0.2813642
## 4 0.9574224 -1.0075082 8.68601551 -0.8906499 0.8782040 0.1792343
## 5 -0.7562573 -0.1948640 0.05873401 0.7153534 -0.2308296 0.1094695
## loudness speechiness tempo valence
## 1 0.3414617 -0.1511265 -0.2830981 0.63442351
## 2 -0.2922395 3.6661326 -0.6140478 -0.14497882
## 3 -1.2915896 -0.3800776 -0.3899895 -0.94842320
## 4 -1.0544855 0.4066059 -0.5497694 -0.78077720
## 5 0.6237202 -0.1881902 0.7712134 -0.03610609
Profiling:

- Cluster 1: high danceability, high valence (the tracks mostly have a happy vibe and are suitable for dancing)
- Cluster 2: low duration_ms, low instrumentalness, high liveness, high speechiness, low tempo (the tracks were mostly performed live; many of them are probably speech/talk-show recordings)
- Cluster 3: high acousticness, low danceability, low energy, high instrumentalness, low liveness, low loudness, low speechiness, low valence (the tracks are mostly slow, soft, and have a sad vibe)
- Cluster 4: very high duration_ms (extremely long tracks)
- Cluster 5: low acousticness, high energy, high loudness, high tempo (the tracks are loud and fast, probably many of them are rock tracks)
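The same profiles can be read off km$centers programmatically; a minimal sketch listing the three most extreme (scaled) features of each cluster:

# for each cluster, the three features with the largest absolute scaled center values
apply(km$centers, 1, function(ctr) names(sort(abs(ctr), decreasing = TRUE))[1:3])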
5. Song Recommender
1] Example 1
When someone listens to a song by Rihanna entitled Umbrella,
song %>% filter(artist_name == "Rihanna" & track_name == "Umbrella")The recommendation system will suggest the following 3 songs for the next songs (because they have same characteristics as “Umbrella” by Rihanna and also have a high popularity).
song_5_umbrella <- song %>% filter(cluster == 5 & genre == "Dance" & mode == "Major" & key == "C#" & time_signature == "4/4") %>% filter(track_id != "49FYlytm3dAAraYgpoJZux")
song_5_umbrella[order(song_5_umbrella$popularity,decreasing=T),] %>%
head(3)

2] Example 2
When someone listens to a song by Imagine Dragons entitled Believer,
song %>% filter(artist_name == "Imagine Dragons" & track_name == "Believer")The recommendation system will suggest the following 3 songs for the next songs (because they have same characteristics as “Believer by” Imagine Dragons and also have a high popularity).
song_1_believer <- song %>% filter(cluster == 1 & genre == "Rock" & mode == "Minor" & key == "A#" & time_signature == "4/4") %>% filter(track_id != "0pqnGHJpmpxLKifKRmU6WP")
song_1_believer[order(song_1_believer$popularity,decreasing=T),] %>%
head(3)

3] Example 3
When someone listens to a song by Billie Eilish entitled come out and play,
song %>% filter(artist_name == "Billie Eilish" & track_name == "come out and play")The recommendation system will suggest the following 3 songs for the next songs (because they have same characteristics as “come out and play” by Billie Eilish and also have a high popularity).
song_3_comeout <- song %>% filter(cluster == 3 & genre == "Pop" & mode == "Major" & key == "C" & time_signature == "4/4") %>% filter(track_id != "7wC5eZcFS1Q1BsQ35DU6H4")
song_3_comeout[order(song_3_comeout$popularity,decreasing=T),] %>%
head(2)
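All three examples follow the same pattern: find the seed track, then pick the most popular tracks from the same cluster that share its genre, mode, key, and time signature. A minimal sketch of a helper that wraps this pattern (recommend_next() is a hypothetical function name, not part of the original analysis):

# hypothetical helper: recommend the n most popular tracks that share the seed
# track's cluster, genre, mode, key and time signature
recommend_next <- function(data, seed_artist, seed_track, n = 3) {
  seed <- data %>%
    filter(artist_name == seed_artist & track_name == seed_track) %>%
    head(1)
  if (nrow(seed) == 0) return(NULL)  # seed track not found
  data %>%
    filter(cluster == seed$cluster,
           genre == seed$genre,
           mode == seed$mode,
           key == seed$key,
           time_signature == seed$time_signature,
           track_id != seed$track_id) %>%
    arrange(desc(popularity)) %>%
    head(n)
}

# reproduce Example 1
recommend_next(song, "Rihanna", "Umbrella")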
# choose numerical and categorical variables
quanti <- song_scale %>%
select_if(is.numeric) %>%
colnames()
quantivar <- which(colnames(song_scale) %in% quanti)
quali <- song_scale %>%
select_if(is.factor) %>%
colnames()
qualivar <- which(colnames(song_scale) %in% quali)

1. PCA using FactoMineR
# pca using factominer
song_pca <- PCA(X = song_scale,
scale.unit = F,
quali.sup = qualivar,
graph = F,
ncp = 10)
song_pca$eig

## eigenvalue percentage of variance cumulative percentage of variance
## comp 1 3.5415319 35.415520 35.41552
## comp 2 1.6946614 16.946710 52.36223
## comp 3 1.1524565 11.524630 63.88686
## comp 4 0.9376440 9.376493 73.26335
## comp 5 0.7828602 7.828647 81.09200
## comp 6 0.6747342 6.747380 87.83938
## comp 7 0.4780340 4.780367 92.61975
## comp 8 0.3612205 3.612226 96.23197
## comp 9 0.2679926 2.679941 98.91191
## comp 10 0.1088082 1.088088 100.00000
With PCA, we can reduce the dimensionality of the dataset while retaining as much information as possible. For example, if we want to retain at least 81% of the information in the data, we can keep only PC1-PC5, reducing the number of dimensions by about 50% while still retaining 81.09% of the information. This choice can also be made programmatically, as sketched below.
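A small sketch for picking the number of components from song_pca$eig given a variance target (80% chosen purely for illustration):

# smallest number of PCs whose cumulative explained variance reaches the target
target <- 80
which(song_pca$eig[, "cumulative percentage of variance"] >= target)[1]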
We can extract the values of PC1 to PC5 for all observations and put them into a new data frame. This data frame can later be used with a supervised classification technique or for other purposes (see the illustration below).
# make a new df from pca
df_pca <- data.frame(song_pca$ind$coord[, 1:5]) %>% bind_cols(cluster = as.factor(km$cluster)) %>%
select(cluster, 1:5)
head(df_pca)

For visualization, however, we usually use only PC1-PC2 (2 dimensions) or PC1-PC3 (3 dimensions) so that the graph doesn't become confusing.
2. Individual Factor Map
# individual factor map
plot.PCA(x = song_pca,
choix = "ind",
invisible = "quali",
habillage = "cluster",
select = "contrib 3")Based on Individual Factor Map above, we can find out 3 outliers, one of them is in observation of 169722. We can also conclude that: if we use 2 PCs, we will have very little dimensions to represent our large dataset. So, on the visualization above, our cluster looks like intersecting with each other because we don’t have enough dimensions to represent them.
3. Variable Factor Map and Variable Contribution
# variable factor map
plot.PCA(song_pca, choix = "var")# variable contribution with PC1
fviz_contrib(X = song_pca,
choice = "var",
axes = 1)

# variable contribution with PC2
fviz_contrib(X = song_pca,
choice = "var",
axes = 2)

From the graphs above, we can conclude that:

- loudness, energy, and acousticness contribute the most to PC1
- speechiness and liveness contribute the most to PC2
- speechiness and liveness have a strong positive correlation
- energy and danceability have a strong positive correlation
- valence and loudness have a strong positive correlation
- speechiness and valence have no correlation
- speechiness and loudness have no correlation
- liveness and valence have no correlation
- energy and acousticness have a strong negative correlation
- danceability and acousticness have a strong negative correlation
- loudness and acousticness have a strong negative correlation
- energy and instrumentalness have a strong negative correlation
- danceability and instrumentalness have a strong negative correlation
- loudness and instrumentalness have a strong negative correlation

4. Combining Clustering and PCA
fviz_cluster(object = km,
data = song_scale %>% select(-cluster),
labelsize = 3) + theme_minimal()

We see the same problem as with the Individual Factor Map: there aren't enough dimensions to represent such a large dataset, so in the visualization above the clusters appear to overlap.
To get around this, we can add one more dimension using plotly to see whether the clusters are still clumped together.
plot_ly(df_pca, x = ~Dim.1, y = ~Dim.2, z = ~Dim.3, color = ~cluster, colors = c("black",
"red", "green", "blue")) %>% add_markers() %>% layout(scene = list(xaxis = list(title = "Dim.1"),
yaxis = list(title = "Dim.2"), zaxis = list(title = "Dim.3")))