The rapid growth of technology has driven a great shift in the streaming services industry. Among online music streaming platforms, Spotify is undeniably one of the biggest contributors to the music industry. Spotify provides a freemium digital music, podcast, and video service, giving users access to millions of songs and other content from creators all over the world. In this document, I use the Spotify Songs data obtained from TidyTuesday. The goal is to apply unsupervised learning to cluster the songs based on their music components.
The following packages are required in this notebook. Use install.packages() to install any packages that are not already installed, and load them with the library() function.
library(tidyverse)
library(plotly)
library(GGally)
library(cowplot)
library(FactoMineR)
library(factoextra)
spotify_songs <- read.csv("data_input/spotify_songs.csv")
head(spotify_songs)
glimpse(spotify_songs)
## Rows: 32,833
## Columns: 23
## $ track_id <chr> "6f807x0ima9a1j3VPbc7VN", "0r7CVbZTWZgbTCYdfa~
## $ track_name <chr> "I Don't Care (with Justin Bieber) - Loud Lux~
## $ track_artist <chr> "Ed Sheeran", "Maroon 5", "Zara Larsson", "Th~
## $ track_popularity <int> 66, 67, 70, 60, 69, 67, 62, 69, 68, 67, 58, 6~
## $ track_album_id <chr> "2oCs0DGTsRO98Gh5ZSl2Cx", "63rPSO264uRjW1X5E6~
## $ track_album_name <chr> "I Don't Care (with Justin Bieber) [Loud Luxu~
## $ track_album_release_date <chr> "2019-06-14", "2019-12-13", "2019-07-05", "20~
## $ playlist_name <chr> "Pop Remix", "Pop Remix", "Pop Remix", "Pop R~
## $ playlist_id <chr> "37i9dQZF1DXcZDD7cfEKhW", "37i9dQZF1DXcZDD7cf~
## $ playlist_genre <chr> "pop", "pop", "pop", "pop", "pop", "pop", "po~
## $ playlist_subgenre <chr> "dance pop", "dance pop", "dance pop", "dance~
## $ danceability <dbl> 0.748, 0.726, 0.675, 0.718, 0.650, 0.675, 0.4~
## $ energy <dbl> 0.916, 0.815, 0.931, 0.930, 0.833, 0.919, 0.8~
## $ key <int> 6, 11, 1, 7, 1, 8, 5, 4, 8, 2, 6, 8, 1, 5, 5,~
## $ loudness <dbl> -2.634, -4.969, -3.432, -3.778, -4.672, -5.38~
## $ mode <int> 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, ~
## $ speechiness <dbl> 0.0583, 0.0373, 0.0742, 0.1020, 0.0359, 0.127~
## $ acousticness <dbl> 0.10200, 0.07240, 0.07940, 0.02870, 0.08030, ~
## $ instrumentalness <dbl> 0.00000000, 0.00421000, 0.00002330, 0.0000094~
## $ liveness <dbl> 0.0653, 0.3570, 0.1100, 0.2040, 0.0833, 0.143~
## $ valence <dbl> 0.518, 0.693, 0.613, 0.277, 0.725, 0.585, 0.1~
## $ tempo <dbl> 122.036, 99.972, 124.008, 121.956, 123.976, 1~
## $ duration_ms <int> 194754, 162600, 176616, 169093, 189052, 16304~
There are 32,833 observations and 23 features in this dataset. From glimpse() we can also see that some of the data types are incorrect, and that some features, such as the IDs, carry no useful signal. Later on, we are going to cluster songs based on their music components. Album and playlist membership should not matter, since an album or playlist does not necessarily consist of songs of the same genre. Furthermore, clustering songs by their music components amounts to grouping songs by genre, so track popularity and duration will also be removed, since they say nothing about the type of a song. I will therefore remove all IDs, dates, track popularity, and everything related to the album. There is one exception, though: some playlists are intentionally built from songs of the same genre. With those cases in mind, I will keep the playlist genre and drop the remaining playlist-related features.
spotify_songs <- spotify_songs %>%
select(-c(track_id, track_popularity, track_album_id, track_album_name,
track_album_release_date, playlist_name, playlist_id,
playlist_subgenre, duration_ms))
head(spotify_songs)
We are left with 14 features related to the musical elements of the songs. Two of these features still have incorrect types.
spotify_songs <- spotify_songs %>%
mutate(playlist_genre = as.factor(playlist_genre),
mode = as.factor(mode)) %>%
relocate(mode, .after = playlist_genre)
Now, let's check whether there are any missing or duplicated values.
First, counting the missing values in each feature.
colSums(is.na(spotify_songs))
## track_name track_artist playlist_genre mode
## 5 5 0 0
## danceability energy key loudness
## 0 0 0 0
## speechiness acousticness instrumentalness liveness
## 0 0 0 0
## valence tempo
## 0 0
There are only a few missing values in the data: five each in track_name and track_artist. Since there are so few, we can simply drop those rows.
spotify_songs <- na.omit(spotify_songs)
Now, let's check whether there are any duplicated rows in the data.
sum(duplicated(spotify_songs))
## [1] 2962
spotify_songs[duplicated(spotify_songs) | duplicated(spotify_songs, fromLast = T), ] %>%
arrange(track_name)
It is understandable that we have many duplicated rows: the data were gathered from different albums and playlists, and a song may appear in one album and in multiple playlists. I will remove these duplicates for the time being; below is the final cleaned data.
spotify_songs <- spotify_songs[!duplicated(spotify_songs), ]
spotify_songs
K-Means clustering is an algorithm that groups data into k clusters. The main objective is to find a number of groups k for which all observations within a group are similar to each other. The algorithm works by iteratively moving centroids and assigning each observation to the cluster whose centroid is nearest in Euclidean distance. It is therefore necessary to scale the data first; otherwise the algorithm will favor features on larger scales.
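Formally, with clusters \( C_1, \dots, C_k \) and centroids \( \mu_1, \dots, \mu_k \), K-Means minimizes the total within-cluster sum of squares \( \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2 \). Because every feature enters this squared Euclidean distance directly, an unscaled high-variance feature would dominate the objective.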
spotify_z <- spotify_songs %>%
mutate_if(is.numeric, scale)
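One caveat: scale() returns a one-column matrix, so mutate_if(is.numeric, scale) leaves matrix columns inside the data frame. kmeans() and PCA() tolerate this, but if it causes trouble downstream, a variant that keeps plain numeric vectors (an alternative formulation, not the original code) is:
# Alternative: coerce scale()'s one-column matrix output back to a numeric vector
spotify_z <- spotify_songs %>%
mutate(across(where(is.numeric), ~ as.numeric(scale(.x))))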
Let's see how many unique playlist genres there are in the original dataset.
unique(spotify_songs$playlist_genre)
## [1] pop rap rock latin r&b edm
## Levels: edm latin pop r&b rap rock
Assuming that the songs consist of only these 6 genres, let's see whether there are any patterns in how the features are distributed across genres. This way, we may be able to gauge whether the data are suitable for clustering.
# Function for plotting
plot_func <- function(start, end){
plots <- list()
j <- start - 1
for(i in start:end){
plots[[i-j]] <- ggplot(spotify_songs,
mapping = aes_string(x = "playlist_genre",
y = names(spotify_songs)[i],
col = "playlist_genre")) +
geom_boxplot() +
labs(title = str_to_title(str_replace_all(names(spotify_songs)[i],
"_", " "))) +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5),
axis.title.x = element_blank(),
legend.position = "none")
}
plot_grid(plotlist = plots,
ncol = 2,
nrow = 2)
}
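The grids of boxplots are produced by calls along the lines of the sketch below; the column indices are assumptions based on the layout of the cleaned data, where columns 5 through 14 hold the numeric features.
# Boxplots of the numeric features by playlist genre, four panels per grid
plot_func(5, 8)   # danceability, energy, key, loudness
plot_func(9, 12)  # speechiness, acousticness, instrumentalness, liveness
plot_func(13, 14) # valence, tempo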
Looking at the plots above, the music components look broadly similar across genres. Although some components may be a distinguishing mark of a particular genre, the differences from the other genres are small. Additionally, most of the features have many outliers. I think this data might not be well suited for clustering. Still, there is a possibility that the music components only look similar across genres because the data provide the playlist genre rather than the track genre; as mentioned before, a playlist may be composed of songs of different genres.
Using K-Means clustering, we will see whether the music components can be used to create clusters that resemble the music genres. Below we visualize the total within sum of squares, the between sum of squares, and the ratio of between sum of squares to total sum of squares for different numbers of clusters.
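Note that wss(), bss(), and btratio() are not base R or factoextra functions; they appear to be helper utilities not shown in this notebook. A minimal sketch of what wss() might look like, assuming it plots the total within sum of squares for k = 1 up to some maximum (the name, signature, and defaults here are assumptions):
# Sketch of a wss()-style helper: total within sum of squares for k = 1..max_k
wss <- function(data, max_k = 10, seed = 147) {
tot_wss <- sapply(1:max_k, function(k) {
set.seed(seed)
kmeans(data, centers = k)$tot.withinss
})
plot(1:max_k, tot_wss, type = "b",
xlab = "Number of clusters (k)",
ylab = "Total within sum of squares")
}
bss() and btratio() would follow the same pattern using kmeans()'s betweenss and betweenss/totss.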
wss(select(spotify_z, where(is.numeric)))
bss(select(spotify_z, where(is.numeric)))
btratio(select(spotify_z, where(is.numeric)))
From the plots above, I think the optimal number of clusters is 5: increasing k beyond 5 does not lead to a significant decrease in the total within sum of squares, nor a significant increase in the between sum of squares or the \(\frac{\text{between sum of squares}}{\text{total sum of squares}}\) ratio. Using \(k = 5\) and visualizing the clustering result.
set.seed(147)
spotify_km <- kmeans(x = select(spotify_z, where(is.numeric)),
centers = 5)
fviz_cluster(object = spotify_km,
data = select(spotify_z, where(is.numeric)),
labelsize = 0,
ggtheme = theme_minimal())
The plot above is built on principal components obtained from PCA; we will discuss PCA later on. For now, note that it uses 2 principal components to represent our data. That is not enough, since those 2 dimensions capture only about 36.7% of the variance (information) in the data. I will add one more dimension later, in the PCA section, to provide a better visualization. For now, let's move on to cluster profiling.
Now, we will see the characteristics of each cluster.
# Add cluster into spotify data
spotify_songs$cluster <- as.factor(spotify_km$cluster)
spotify_z$cluster <- as.factor(spotify_km$cluster)
spotify_songs %>%
select(-c(track_name, track_artist, playlist_genre, mode)) %>%
group_by(cluster) %>%
summarise_all(.funs = "mean")
From these per-cluster averages we can read off the distinguishing characteristics of each cluster.
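As an additional profiling check (not part of the original output), we can cross-tabulate the clusters against the playlist genres to see how strongly the two align:
# Cross-tabulate clusters against playlist genres to check their alignment
table(genre = spotify_songs$playlist_genre, cluster = spotify_songs$cluster)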
Principal Component Analysis (PCA) is commonly used to handle high dimensionality, reducing the number of dimensions in the data while retaining most of the information. The main purpose of this PCA section is to combine it with K-Means clustering: that way we can visualize the clusters and judge whether the clustering method is suitable. First, we want to see whether there are strong correlations between the numerical features, since strong correlations indicate that PCA can meaningfully reduce the number of features.
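For reference, each principal component \( i \) accounts for a fraction \( \lambda_i / \sum_{j=1}^{p} \lambda_j \) of the total variance, where \( \lambda_i \) is its eigenvalue and \( p \) is the number of active variables; keeping the first \( m \) components therefore retains \( \sum_{i=1}^{m} \lambda_i / \sum_{j=1}^{p} \lambda_j \) of the information.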
ggcorr(spotify_songs[, 5:15],
label = T,
label_size = 3,
hjust = 0.9,
size = 3,
layout.exp = 2)
From the correlation matrix, it can be seen that the correlations between the numerical features are relatively weak. This suggests that, for this data, PCA can drop only a few features while retaining most of the information. Let's see how many features we can drop while keeping at least 80% of the information in the data.
spotify_pca <- PCA(spotify_z[, 4:15],
quali.sup = c(1, 12),
ncp = 10,
graph = F)
spotify_pca$eig
## eigenvalue percentage of variance cumulative percentage of variance
## comp 1 2.1635808 21.635808 21.63581
## comp 2 1.5033304 15.033304 36.66911
## comp 3 1.0930163 10.930163 47.59928
## comp 4 1.0097256 10.097256 57.69653
## comp 5 0.9772211 9.772211 67.46874
## comp 6 0.9705000 9.705000 77.17374
## comp 7 0.8583970 8.583970 85.75771
## comp 8 0.6324344 6.324344 92.08206
## comp 9 0.5665472 5.665472 97.74753
## comp 10 0.2252471 2.252471 100.00000
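A quick programmatic way to find the cutoff (a convenience added here, not part of the original output) is to take the first component at which the cumulative variance reaches 80%:
# Smallest number of components whose cumulative variance reaches 80%
which(spotify_pca$eig[, "cumulative percentage of variance"] >= 80)[1]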
Using 7 principal components, we reduce the number of active dimensions from 10 to 7 while retaining ~85.76% of the information (the 12 columns passed to PCA() include mode and cluster, which are qualitative supplementary variables and do not enter the decomposition). This is to be expected: PCA can shed only a few dimensions because the correlations between the numerical features are relatively weak. I will store these 7 principal components in a new data frame first; later on, this new data frame can be analyzed for other purposes.
spotify_df_pca <- spotify_songs %>%
select(c(track_name, track_artist, cluster)) %>%
cbind(spotify_pca$ind$coord[, 1:7]) %>%
relocate(cluster, .after = Dim.7)
spotify_df_pca
We have now seen that PCA can be integrated with K-Means clustering to help visualize our result. We have also seen that, for this data in particular, 2 dimensions are too few to represent the data, which is why the clusters appear to overlap in the plot (below is the same plot that was used to visualize the K-Means result).
fviz_cluster(object = spotify_km,
data = select(spotify_z, where(is.numeric)),
labelsize = 0,
ggtheme = theme_minimal())
Let's see whether adding one more dimension helps visualize the clusters better.
spotify_df_pca %>% plot_ly(
x = ~Dim.1,
y = ~Dim.2,
z = ~Dim.3,
color = ~cluster,
colors = c("magenta", "red", "blue", "green", "black")) %>%
add_markers() %>%
layout(scene = list(xaxis = list(title = "Dim.1"),
yaxis = list(title = "Dim.2"),
zaxis = list(title = "Dim.3")))
Adding one more principal component to the plot, while still not enough for a truly “good” visualization of the clusters, shows them noticeably better separated. This indicates that our data can still be clustered reasonably well, as the algorithm can put boundaries between the clusters so that they are not clumped together.
For this Spotify Songs dataset, we can group songs into 5 clusters based on their music components. We have also seen that PCA, despite its limited effectiveness at reducing the number of dimensions for this dataset, can still help us visualize the clustering. Based on the clustering plot using 3 principal components, the separation between the 5 clusters looks reasonably clear, indicating that the clustering did a decent job. PCA itself, however, cannot reduce the dimensionality of this dataset very effectively, since the correlations between the numerical features are relatively weak: to retain about 85.76% of the information, the number of active dimensions dropped only from 10 to 7.