In this day and age, we are pampered with technology within our reach, such as the smartphone. Not only can it be used for calling and chatting with loved ones and acquaintances around the world, it also serves education, utilities, entertainment, and much more. Entertainment itself comes in many forms, such as gaming, social media, video, and audio. We used to walk around with a radio, a Walkman, or a portable digital audio player when we wanted to listen to music on the way somewhere. However, this was not very practical: with a radio we cannot choose what to listen to, with a disc player we have to carry the discs we want (and we cannot carry them all), and with a digital audio player we can only select music that is already saved on it, so we have to make sure we fill it with the music we want beforehand.
With how advanced our smartphones are now, a music application such as Spotify lets us listen to music on the go without having to sit through songs we do not like, carry a stack of discs, or load our device with music before a trip. The application lets us stream music through the internet or save it to our device when we want to listen offline. With the online service, when we create a playlist and listen through all of its songs, we are automatically recommended songs that share characteristics with the ones in our playlist.
The way Spotify recommends music based on the songs in our playlist can be approached with Unsupervised Machine Learning. An unsupervised method such as K-means can cluster numerical data without any target variable. Since my goal for this project is to use a Spotify song dataset, with many variables describing each song, to recommend songs similar to the ones we frequently listen to, K-means is a suitable method for this case.
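Before touching the real dataset, here is a minimal sketch (on synthetic data, not the Spotify data) of the idea: kmeans() groups numeric observations without being given any labels.
# Two obvious groups in two numeric variables, with no labels provided.
set.seed(100)
toy <- data.frame(x = c(rnorm(50, mean = 0), rnorm(50, mean = 5)),
                  y = c(rnorm(50, mean = 0), rnorm(50, mean = 5)))
km_toy <- kmeans(toy, centers = 2)
table(km_toy$cluster) # K-means recovers the two groups from the numbers alone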
The data that we obtain from open sources may contain some “noise” that we have to clean up before we can start to cluster the songs. In this section, we will process the data to remove that noise, starting by inspecting the raw data and working through the steps needed to clean it, after which we can prepare it for modeling.
library(stats)
library(FactoMineR)
library(factoextra)
library(ggplot2)
library(ggthemes)
library(GGally)
library(ggiraphExtra)
library(wordcloud)
library(tidyverse)
library(dplyr)

spotify <- read.csv("data/SpotifyFeatures.csv")
glimpse(spotify)
## Rows: 232,725
## Columns: 18
## $ ï..genre <chr> "Movie", "Movie", "Movie", "Movie", "Movie", "Movie",~
## $ artist_name <chr> "Henri Salvador", "Martin & les fées", "Joseph Willi~
## $ track_name <chr> "C'est beau de faire un Show", "Perdu d'avance (par G~
## $ track_id <chr> "0BRjO6ga9RKCKjfDqeFgWV", "0BjC1NfoEOOusryehmNudP", "~
## $ popularity <int> 0, 1, 3, 0, 4, 0, 2, 15, 0, 10, 0, 2, 4, 3, 0, 0, 0, ~
## $ acousticness <dbl> 0.61100, 0.24600, 0.95200, 0.70300, 0.95000, 0.74900,~
## $ danceability <dbl> 0.389, 0.590, 0.663, 0.240, 0.331, 0.578, 0.703, 0.41~
## $ duration_ms <int> 99373, 137373, 170267, 152427, 82625, 160627, 212293,~
## $ energy <dbl> 0.9100, 0.7370, 0.1310, 0.3260, 0.2250, 0.0948, 0.270~
## $ instrumentalness <dbl> 0.00000000, 0.00000000, 0.00000000, 0.00000000, 0.123~
## $ key <chr> "C#", "F#", "C", "C#", "F", "C#", "C#", "F#", "C", "G~
## $ liveness <dbl> 0.3460, 0.1510, 0.1030, 0.0985, 0.2020, 0.1070, 0.105~
## $ loudness <dbl> -1.828, -5.559, -13.879, -12.178, -21.150, -14.970, -~
## $ mode <chr> "Major", "Minor", "Minor", "Major", "Major", "Major",~
## $ speechiness <dbl> 0.0525, 0.0868, 0.0362, 0.0395, 0.0456, 0.1430, 0.953~
## $ tempo <dbl> 166.969, 174.003, 99.488, 171.758, 140.576, 87.479, 8~
## $ time_signature <chr> "4/4", "4/4", "5/4", "4/4", "4/4", "4/4", "4/4", "4/4~
## $ valence <dbl> 0.8140, 0.8160, 0.3680, 0.2270, 0.3900, 0.3580, 0.533~
unique(spotify$ï..genre)
## [1] "Movie" "R&B" "A Capella"
## [4] "Alternative" "Country" "Dance"
## [7] "Electronic" "Anime" "Folk"
## [10] "Blues" "Opera" "Hip-Hop"
## [13] "Children's Music" "Childrenâ\200\231s Music" "Rap"
## [16] "Indie" "Classical" "Pop"
## [19] "Reggae" "Reggaeton" "Jazz"
## [22] "Rock" "Ska" "Comedy"
## [25] "Soul" "Soundtrack" "World"
From the data preview above, we can see the previously mentioned “noise” in one of the variable names, which is supposed to be “genre”. Inside that variable, we can also see noise in one of the observation values, which is supposed to be “Children’s Music”. Fortunately, we can simply rename the variable and recode the value.
spotify <- spotify %>%
rename("genre" = ï..genre) %>%
mutate(genre = recode(genre, "Children’s Music" = "Children's Music"))

Due to the nature of the centroid-based K-means clustering algorithm, which calculates the distances between data points and groups points that are close to each other, it can only process numerical data. As a result, we have to remove the categorical variables present in our data. However, “track_id” will not be removed, as it will be used as the row names of the data. In addition, I will remove the “duration_ms” variable, as it is not considered a song characteristic.
spotify_clean <- spotify %>%
select(-c(genre, artist_name, track_name, key, mode, time_signature, duration_ms))
glimpse(spotify_clean)
## Rows: 232,725
## Columns: 11
## $ track_id <chr> "0BRjO6ga9RKCKjfDqeFgWV", "0BjC1NfoEOOusryehmNudP", "~
## $ popularity <int> 0, 1, 3, 0, 4, 0, 2, 15, 0, 10, 0, 2, 4, 3, 0, 0, 0, ~
## $ acousticness <dbl> 0.61100, 0.24600, 0.95200, 0.70300, 0.95000, 0.74900,~
## $ danceability <dbl> 0.389, 0.590, 0.663, 0.240, 0.331, 0.578, 0.703, 0.41~
## $ energy <dbl> 0.9100, 0.7370, 0.1310, 0.3260, 0.2250, 0.0948, 0.270~
## $ instrumentalness <dbl> 0.00000000, 0.00000000, 0.00000000, 0.00000000, 0.123~
## $ liveness <dbl> 0.3460, 0.1510, 0.1030, 0.0985, 0.2020, 0.1070, 0.105~
## $ loudness <dbl> -1.828, -5.559, -13.879, -12.178, -21.150, -14.970, -~
## $ speechiness <dbl> 0.0525, 0.0868, 0.0362, 0.0395, 0.0456, 0.1430, 0.953~
## $ tempo <dbl> 166.969, 174.003, 99.488, 171.758, 140.576, 87.479, 8~
## $ valence <dbl> 0.8140, 0.8160, 0.3680, 0.2270, 0.3900, 0.3580, 0.533~
One of the data pre-processing steps is to make sure there are no duplicated or empty entries in our dataset. In this case, duplicates will be checked based on “track_id”, as we will not be able to set it as the row names if the data contains the same id more than once.
anyDuplicated(spotify_clean$track_id)
## [1] 1349
It appears that we do have duplicated data based on “track_id”. We can remove the duplicates, keeping one row per id to make “track_id” unique, and then set it as our row names.
spotify_clean <- spotify_clean %>%
distinct(track_id, .keep_all = T) %>%
column_to_rownames(var = "track_id")
spotify_clean
colSums(is.na(spotify_clean))
## popularity acousticness danceability energy
## 0 0 0 0
## instrumentalness liveness loudness speechiness
## 0 0 0 0
## tempo valence
## 0 0
As seen above, our data does not contain any NA or empty values. At this point, our data can be considered clean, so we do not have to clean it any further.
In Unsupervised Machine Learning we usually deal with a large number of variables, which makes it hard to observe the relations between every pair of variables in our data. That is where PCA, which stands for Principal Component Analysis, comes in. The drawback of this method is that our model becomes less interpretable, because it reduces the dimensionality of the information in our data; nonetheless, it helps us find patterns in data that has no target variable.
ggcorr(spotify_clean, label = T, hjust = 1, label_alpha = T)

As seen in the correlation plot above, where a darker color means a higher correlation between variables, any strong correlation, whether strongly positive or strongly negative, is a sign that dimensionality reduction is preferable, as it removes the correlation between variables. Before we can perform dimensionality reduction using PCA, we have to make sure that our variables are on the same scale.
summary(spotify_clean)
## popularity acousticness danceability energy
## Min. : 0.00 Min. :0.0000 Min. :0.0569 Min. :0.0000203
## 1st Qu.: 25.00 1st Qu.:0.0456 1st Qu.:0.4150 1st Qu.:0.3440000
## Median : 37.00 Median :0.2880 Median :0.5580 Median :0.5920000
## Mean : 36.27 Mean :0.4041 Mean :0.5411 Mean :0.5570245
## 3rd Qu.: 49.00 3rd Qu.:0.7910 3rd Qu.:0.6830 3rd Qu.:0.7890000
## Max. :100.00 Max. :0.9960 Max. :0.9890 Max. :0.9990000
## instrumentalness liveness loudness speechiness
## Min. :0.0000000 Min. :0.00967 Min. :-52.457 Min. :0.0222
## 1st Qu.:0.0000000 1st Qu.:0.09750 1st Qu.:-12.851 1st Qu.:0.0368
## Median :0.0000704 Median :0.13000 Median : -8.191 Median :0.0494
## Mean :0.1720729 Mean :0.22453 Mean :-10.138 Mean :0.1274
## 3rd Qu.:0.0908000 3rd Qu.:0.27700 3rd Qu.: -5.631 3rd Qu.:0.1020
## Max. :0.9990000 Max. :1.00000 Max. : 3.744 Max. :0.9670
## tempo valence
## Min. : 30.38 Min. :0.0000
## 1st Qu.: 92.01 1st Qu.:0.2220
## Median :115.01 Median :0.4400
## Mean :117.20 Mean :0.4516
## 3rd Qu.:138.80 3rd Qu.:0.6670
## Max. :242.90 Max. :1.0000
As we can see above, the ranges differ quite a lot between variables. As a result, we have to scale our data to balance the minimum-to-maximum range across variables. After scaling, we can project our data onto new axes using Principal Component Analysis.
scaled_spotify <- spotify_clean %>%
scale() %>%
as.data.frame()
scaled_spotify
pca_spotify <- prcomp(scaled_spotify)
pca_spotify
## Standard deviations (1, .., p=10):
## [1] 1.9097239 1.3356420 1.0300322 0.9751636 0.8553537 0.8071697 0.6714870
## [8] 0.5886305 0.5176219 0.3296038
##
## Rotation (n x k) = (10 x 10):
## PC1 PC2 PC3 PC4 PC5
## popularity 0.20661930 -0.29553837 0.10311164 -0.642439663 0.47470374
## acousticness -0.41759986 0.20334215 -0.20748955 0.079690579 0.31334570
## danceability 0.34364557 0.05424717 -0.56305078 0.005183667 0.16769931
## energy 0.45058074 0.09437894 0.26368773 -0.020930766 -0.33676587
## instrumentalness -0.31668975 -0.20186447 0.10199292 0.067567578 -0.43773437
## liveness 0.04052679 0.61334250 0.29133801 -0.070256514 -0.05159956
## loudness 0.46554476 -0.02385385 0.15021171 -0.073289577 -0.16307211
## speechiness 0.03251146 0.64153976 0.03478623 -0.088124731 0.20164467
## tempo 0.16922789 -0.16865187 0.45903733 0.647620087 0.51638745
## valence 0.33681109 0.04398753 -0.47957207 0.372028575 -0.08217996
## PC6 PC7 PC8 PC9 PC10
## popularity -0.34697638 0.25047905 -0.193966279 -0.01563097 -0.01617270
## acousticness 0.01974197 0.12727484 -0.264932755 0.68645252 -0.28049820
## danceability -0.28869187 -0.24694679 0.578117254 0.14836558 -0.18229435
## energy -0.05604515 -0.09142724 -0.245103389 0.11109688 -0.72164455
## instrumentalness -0.77793443 -0.07279872 0.003870636 0.17237742 0.11031068
## liveness -0.15433155 0.58703531 0.397860929 0.03037786 0.04524844
## loudness 0.14865660 -0.10860680 -0.046387218 0.63582027 0.53652420
## speechiness -0.26245949 -0.52276261 -0.346064163 -0.18873364 0.20046595
## tempo -0.19147266 -0.04641350 0.081905838 0.02150373 -0.01218401
## valence -0.18996801 0.46322694 -0.459307568 -0.15230475 0.15699677
From here we have obtained the data processed by the PCA method. PC1 contains the most information of all the PCs, as can be seen from the standard deviations in the PCA output above. The fewer PCs we keep, the more information we lose. We can take the transformed values of every observation and create a correlation plot to compare the correlations in the transformed data with those in the original data.
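To quantify how the information is spread across the PCs, we can compute each PC's share of the total variance directly from the standard deviations returned by prcomp(). A small sketch using the pca_spotify object from above; prop_var is just an illustrative name.
# The variance of a PC is its standard deviation squared; dividing by the
# total variance gives the share of information each PC carries.
prop_var <- pca_spotify$sdev^2 / sum(pca_spotify$sdev^2)
round(prop_var, 4)
round(cumsum(prop_var), 4) # cumulative information retained as PCs are added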
ggcorr(pca_spotify$x %>% as.data.frame, label = T, hjust = 1)

We can see that there is no trace of correlation left between the variables. However, the variable names have changed, so we cannot interpret the transformed data directly. We can still visualize it using the biplot() or plot.PCA() function; since our data contains a large number of observations, it is faster to use plot.PCA().
pca_f <- PCA(X = spotify_clean,
scale.unit = T,
graph = F,
ncp = 11)
plot.PCA(x = pca_f, choix = "var")

In the plot above, arrows point from the center of the circle out toward its edge. The longer an arrow, the higher the contribution of that variable. Arrows that are close to each other represent variables with similar characteristics, while arrows at an angle close to 90 degrees represent variables with no similarity between them. The majority of the variables are represented by PC1, in this case Dim 1, while speechiness and liveness are represented more by PC2, in this case Dim 2, as they are aligned more toward the Dim 2 axis.
fviz_contrib(X = pca_f,
choice = "var",
axes = 1)

The contribution of the variables can be clarified further with the fviz_contrib() function, which displays a 2D bar chart that is easier to read. When we set the axes parameter to 1, it displays the contribution bar chart of PC1, as seen above. As previously mentioned, most of the variables are represented by PC1. The variables whose arrows are longest and most parallel to the PC1 (Dim 1) axis in the previous chart are the ones shown to contribute most to PC1 in the bar chart above.
fviz_contrib(X = pca_f,
choice = "var",
axes = 2)

When we set the axes parameter to 2, it displays the contribution bar chart of PC2. If we would like to observe every single PC, we can simply plot them one by one by changing the axes parameter to the number of the PC we want to inspect. Looking back at the PCA graph of variables, the speechiness and liveness variables lean toward PC2 (Dim 2), as displayed in this chart. Both contribute strongly, which can also be seen in the PCA graph of variables, where they are drawn with long arrows. To determine how much information loss we want to tolerate, we can check our PCA data using the summary() function.
summary(pca_spotify)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.9097 1.3356 1.0300 0.97516 0.85535 0.80717 0.67149
## Proportion of Variance 0.3647 0.1784 0.1061 0.09509 0.07316 0.06515 0.04509
## Cumulative Proportion 0.3647 0.5431 0.6492 0.74429 0.81745 0.88260 0.92769
## PC8 PC9 PC10
## Standard deviation 0.58863 0.51762 0.32960
## Proportion of Variance 0.03465 0.02679 0.01086
## Cumulative Proportion 0.96234 0.98914 1.00000
Take a look at the Cumulative Proportion: the more PCs we include, the higher the cumulative proportion of variance retained. In this case, I would like to keep 90% of the information after the reduction, which means accepting roughly 10% information loss from the dimensionality reduction. This means I have to take at least 7 PCs, that is, up to PC7.
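Instead of reading the cutoff off the table, we can also pick it programmatically. A small sketch, where cum_var and n_pc are illustrative names; with this data it should return 7, matching the summary above.
cum_var <- cumsum(pca_spotify$sdev^2) / sum(pca_spotify$sdev^2)
n_pc <- which(cum_var >= 0.90)[1] # smallest number of PCs reaching 90%
n_pc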
pca_x_kept <- pca_spotify$x[,1:7]
pca_x_kept %>% as.data.frame()

Now that our data has gone through dimensionality reduction, the total information loss is around 7.23%. However, the overall structure of the data is still the same as before, so we do not have to worry about missing rows or columns. Although unnecessary, if we want to verify the changes that have happened to our data, we can project it back to its original form.
pca_rot_kept <- pca_spotify$rotation[,1:7]
pca_return <- pca_x_kept %*% t(pca_rot_kept)
pca_return <- scale(pca_return, scale = FALSE)
pca_return %>% as.data.frame()
scaled_spotify

The number of columns, the number of rows, and the variable names of the data that went through dimensionality reduction do not change compared to the original data; the difference lies in the observation values, each of which has changed slightly.
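We can confirm this with a quick sanity check (a small sketch, not part of the original analysis): the dimensions of the reconstruction match the scaled data, and the average reconstruction error is in line with the roughly 7% of variance we discarded.
dim(pca_return)     # same number of rows and columns...
dim(scaled_spotify) # ...as the scaled data before the reduction
# mean squared difference per cell; roughly 0.07 on unit-variance data
mean((as.matrix(scaled_spotify) - pca_return)^2)
Now that the reconstruction is confirmed to have the same shape as the original data, we can try to model our data.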
In this Unsupervised Machine Learning setting we have no target variable in the data, but we do have a business target: recommending music similar to the user's taste. We can use K-means clustering to group the Spotify tracks based on their audio characteristics, and with the help of the elbow method we can determine the optimal number of clusters to create.
RNGkind(sample.kind = "Rounding")
set.seed(126)
# fviz_nbclust(x = scaled_spotify, FUNcluster = kmeans, method = "wss")
wss <- function(k) {
kmeans(pca_x_kept, k, nstart = 10)$tot.withinss
}
k.values <- 1:15
wss_values <- map_dbl(k.values, wss)
plot(k.values, wss_values,
type="b", pch = 19, frame = FALSE,
xlab="Number of clusters K",
ylab="Total within-clusters sum of squares")The optimum number of cluster for our K-means model are either three of four clusters according to the elbow plot that we have created above. Although there are variables which have strong correlations between each other, I would like to try and create a clustering model based on the original data as a comparison to the PCA data. Similar to the PCA data, the optimum number of the cluster can be found using the elbot method.
RNGkind(sample.kind = "Rounding")
set.seed(126)
# fviz_nbclust(x = scaled_spotify, FUNcluster = kmeans, method = "wss")
wss <- function(k) {
kmeans(spotify_clean, k, nstart = 10 )$tot.withinss
}
k.values <- 1:15
wss_values <- map_dbl(k.values, wss)
plot(k.values, wss_values,
type="b", pch = 19, frame = FALSE,
xlab="Number of clusters K",
ylab="Total within-clusters sum of squares")The elbow plot above from the original data shown us that the optimum number of the cluster for our K-means model fall under the amount of two or three clusters. To know which performs best for our model, we shall try to model our data based on both of the possible numbers of cluster.
RNGkind(sample.kind = "Rounding")
set.seed(126)
km_spotify_pca3 <- kmeans(pca_x_kept, centers = 3)
km_spotify_pca4 <- kmeans(pca_x_kept, centers = 4)
km_spotify_ori2 <- kmeans(spotify_clean, centers = 2)
km_spotify_ori3 <- kmeans(spotify_clean, centers = 3)

I create a total of four different K-means clusterings according to the optimal cluster counts from the elbow method. However, only two truly different datasets will be compared with each other: the data processed with PCA and the original data that was not. The other two K-means models are there to tell us which number of clusters to take, by comparing the two models built on the same data.
fviz_cluster(object = km_spotify_pca3, data = spotify_clean)+
  theme_pander()
fviz_cluster(object = km_spotify_pca4, data = spotify_clean)+
  theme_pander()
fviz_cluster(object = km_spotify_ori2, data = spotify_clean)+
  theme_pander()
fviz_cluster(object = km_spotify_ori3, data = spotify_clean)+
  theme_pander()

According to the cluster plots above, the K-means model with more clusters than the count at the “elbow” of the elbow plot does not separate the data as well as the model that uses the number recommended by the elbow method. In the case of the original data, neither K-means model separates the data appropriately. From this result, I decided to use the K-means model built on the PCA data, with the cluster count obtained from the elbow method.
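This visual impression can be cross-checked with a simple number (a small sketch; the list names are illustrative): each kmeans object stores its between-cluster and total sum of squares, and a ratio closer to 1 is a rough indicator of better-separated clusters.
# between_SS / total_SS for each model; higher means the clusters capture
# more of the spread in the data.
sapply(list(pca_k3 = km_spotify_pca3, pca_k4 = km_spotify_pca4,
            ori_k2 = km_spotify_ori2, ori_k3 = km_spotify_ori3),
       function(km) km$betweenss / km$totss)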
spotify_clean$cluster <- km_spotify_pca3$cluster
spotify_clean %>%
group_by(cluster) %>%
summarise_all(mean)

From the table above, we can interpret what each cluster represents among the audio tracks we can listen to on Spotify. However, we can also plot this data for an easier interpretation.
ggRadar(spotify_clean, aes(color = cluster), interactive = T)

From here, we can see the characteristics of each cluster by observing the plot above. The genre makeup of each cluster is examined one by one below.
set.seed(126)
track_cluster1 <- spotify_clean %>%
filter(cluster == 1) %>%
rownames_to_column(var = "track_id")
spotify %>%
filter(track_id %in% track_cluster1$track_id) %>%
group_by(genre) %>%
summarise(freq = n()) %>%
arrange(desc(freq)) %>%
head(10) %>%
ggplot(aes(reorder(genre, freq), freq))+
geom_col(aes(fill=freq))+
coord_flip()+
scale_fill_gradient(high="turquoise3", low="midnightblue")+
theme_pander()+
theme(panel.grid=element_line(linetype="dotdash"))

In accordance with the characteristics of cluster number one, we can see that the audio in this cluster is dominated by podcast-style spoken audio, which explains why the speechiness score of this cluster is far higher than that of the other clusters.
set.seed(126)
track_cluster2 <- spotify_clean %>%
filter(cluster == 2) %>%
rownames_to_column(var = "track_id")
spotify %>%
filter(track_id %in% track_cluster2$track_id) %>%
group_by(genre) %>%
summarise(freq = n()) %>%
arrange(desc(freq)) %>%
head(10) %>%
ggplot(aes(reorder(genre, freq), freq))+
geom_col(aes(fill=freq))+
coord_flip()+
scale_fill_gradient(high="turquoise3", low="midnightblue")+
theme_pander()+
theme(panel.grid=element_line(linetype="dotdash"))

The majority of the audio in the second cluster belongs to danceable music genres, such as the Dance and Electronic music we know from clubs. The other genres match the remaining characteristics of this cluster: Children's Music with its upbeat and joyful vibe, and Alternative, Country, Blues, and Anime with their energy, loudness, and fast tempo.
set.seed(126)
track_cluster3 <- spotify_clean %>%
filter(cluster == 3) %>%
rownames_to_column(var = "track_id")
spotify %>%
filter(track_id %in% track_cluster3$track_id) %>%
group_by(genre) %>%
summarise(freq = n()) %>%
arrange(desc(freq)) %>%
head(10) %>%
ggplot(aes(reorder(genre, freq), freq))+
geom_col(aes(fill=freq))+
coord_flip()+
scale_fill_gradient(high="turquoise3", low="midnightblue")+
theme_pander()+
theme(panel.grid=element_line(linetype="dotdash"))

Most of the audio in the third cluster is music we enjoy for the harmonies produced by combinations of many musical instruments, such as Classical music and Disney-style theatrical Children's music. The other genres in this cluster lean more on minor keys and/or a calmer, sadder sound, bringing the tragic feeling that leads to sadness and an emotional ache in our hearts.
With the help of Principal Component Analysis, which slightly reduces the dimensionality of the data to remove the correlations between its variables, the clustering of the Spotify data separates better than the clustering without it. When clustering with K-means, the elbow plot plays an important role in determining the best number of clusters whenever our business case does not dictate a specific number.
Although the separation can be considered good, genres such as Children's Music appear among the top 5 most dominant genres in every cluster we created. We would have to dig deeper into the data to find out why this happens; it could be that the variety within Children's Music is wide enough for it to fall into all three clusters of our model. The same goes for dance music: we usually think of the kind played in clubs, but in many countries dance music can be as simple as a guitar and strings producing a smooth melody you can dance to, as in the songs we find in most Latin countries.
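Finally, to sketch how this clustering could serve the original recommendation goal: given a track a user likes, we can suggest other tracks from the same cluster. This is only an illustrative helper (recommend_similar is not part of the original analysis), and head(n) simply takes the first matches rather than ranking them.
# Hypothetical helper: recommend tracks from the same cluster as a liked track.
recommend_similar <- function(liked_track_id, n = 5) {
  candidates <- spotify_clean %>%
    rownames_to_column(var = "track_id") %>%
    filter(cluster == spotify_clean[liked_track_id, "cluster"],
           track_id != liked_track_id)
  spotify %>%
    filter(track_id %in% candidates$track_id) %>%
    select(track_name, artist_name, genre) %>%
    head(n) # first n matches; a real system might rank by distance instead
}
recommend_similar("0BRjO6ga9RKCKjfDqeFgWV") # a track_id from the data preview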