Spotify is a digital music streaming service that provides access to millions of songs from artists around the world. With so many songs available, it can be hard to choose what to listen to.
This article shows how to cluster songs on Spotify using machine learning with the K-Means clustering method, so that songs are grouped according to what we want to listen to.
Data Description:
Acousticness: Whether the track is acoustic (a higher value means the track is more likely acoustic)
Danceability: How suitable a track is for dancing (a higher value means more danceable)
Energy: A perceptual measure of intensity and activity (for example, death metal has high energy)
Instrumentalness: Whether a track contains no vocals (a higher value means the track is more likely instrumental)
Liveness: Presence of an audience in the recording (a higher value means the track was more likely performed live)
Loudness: Overall loudness of a track in decibels (dB)
Speechiness: Presence of spoken words in a track (a track may contain both music and speech)
Valence: Musical positiveness conveyed by a track (tracks with high valence sound happy or cheerful)
I obtained the data from Kaggle at the following link:
https://www.kaggle.com/zaheenhamidani/ultimate-spotify-tracks-db
Load Libraries
library(dplyr) #for wrangling data
library(FactoMineR) #for pca
library(factoextra) #for plot
Import Data
rawspotify <- read.csv("SpotifyFeatures.csv")
rawspotify
Filter Popular Songs
Select songs with a popularity of 75 or higher
spotifypolular <- rawspotify %>%
filter(popularity >= 75)
spotifypolular
Filter Necessary Data
Select the variables that are relevant to the analysis
spotify_clean <- spotifypolular %>%
select(c(acousticness,danceability,energy,instrumentalness,liveness,loudness,speechiness,valence))
spotify_clean
Check Data Type
glimpse(spotify_clean)
## Rows: 3,593
## Columns: 8
## $ acousticness <dbl> 0.023300, 0.422000, 0.544000, 0.619000, 0.031900, ...
## $ danceability <dbl> 0.845, 0.552, 0.515, 0.672, 0.731, 0.818, 0.559, 0...
## $ energy <dbl> 0.709, 0.650, 0.479, 0.588, 0.861, 0.705, 0.345, 0...
## $ instrumentalness <dbl> 0.00e+00, 2.75e-04, 5.98e-03, 2.41e-01, 0.00e+00, ...
## $ liveness <dbl> 0.0940, 0.3720, 0.1910, 0.0992, 0.0829, 0.6130, 0....
## $ loudness <dbl> -4.547, -7.199, -7.458, -9.573, -5.881, -6.679, -1...
## $ speechiness <dbl> 0.0714, 0.1280, 0.0261, 0.1330, 0.0323, 0.1770, 0....
## $ valence <dbl> 0.6200, 0.3160, 0.2840, 0.2040, 0.7800, 0.7720, 0....
All variables have an appropriate data type
Check missing values
colSums(is.na(spotify_clean))
## acousticness danceability energy instrumentalness
## 0 0 0 0
## liveness loudness speechiness valence
## 0 0 0 0
No variable has missing values
Check range data
summary(spotify_clean)
## acousticness danceability energy instrumentalness
## Min. :0.0000183 Min. :0.217 Min. :0.0511 Min. :0.0000000
## 1st Qu.:0.0302000 1st Qu.:0.588 1st Qu.:0.5280 1st Qu.:0.0000000
## Median :0.1180000 Median :0.690 Median :0.6560 Median :0.0000000
## Mean :0.2024467 Mean :0.680 Mean :0.6426 Mean :0.0069187
## 3rd Qu.:0.3070000 3rd Qu.:0.775 3rd Qu.:0.7690 3rd Qu.:0.0000332
## Max. :0.9890000 Max. :0.965 Max. :0.9890 Max. :0.9280000
## liveness loudness speechiness valence
## Min. :0.0215 Min. :-22.320 Min. :0.0228 Min. :0.0352
## 1st Qu.:0.0949 1st Qu.: -7.513 1st Qu.:0.0417 1st Qu.:0.3190
## Median :0.1200 Median : -5.934 Median :0.0676 Median :0.4840
## Mean :0.1681 Mean : -6.347 Mean :0.1126 Mean :0.4911
## 3rd Qu.:0.2010 3rd Qu.: -4.706 3rd Qu.:0.1470 3rd Qu.:0.6570
## Max. :0.8320 Max. : 0.175 Max. :0.6810 Max. :0.9800
var(spotify_clean)
## acousticness danceability energy instrumentalness
## acousticness 0.0486669591 -0.0029457106 -0.0190052942 0.0006582834
## danceability -0.0029457106 0.0194399385 -0.0026170937 -0.0003472753
## energy -0.0190052942 -0.0026170937 0.0286155400 -0.0003566577
## instrumentalness 0.0006582834 -0.0003472753 -0.0003566577 0.0022049200
## liveness -0.0015510822 -0.0004202905 0.0024352975 0.0003036391
## loudness -0.2130338207 0.0134905267 0.2963591703 -0.0119807231
## speechiness -0.0015862877 0.0041648203 -0.0015152587 -0.0003637197
## valence -0.0054057076 0.0059469199 0.0151149328 -0.0009154053
## liveness loudness speechiness valence
## acousticness -0.0015510822 -0.21303382 -0.0015862877 -0.0054057076
## danceability -0.0004202905 0.01349053 0.0041648203 0.0059469199
## energy 0.0024352975 0.29635917 -0.0015152587 0.0151149328
## instrumentalness 0.0003036391 -0.01198072 -0.0003637197 -0.0009154053
## liveness 0.0145596145 0.01880372 0.0007722376 0.0022851267
## loudness 0.0188037160 5.88518536 -0.0143229921 0.1289770325
## speechiness 0.0007722376 -0.01432299 0.0102833668 0.0006709665
## valence 0.0022851267 0.12897703 0.0006709665 0.0485241808
plot(prcomp(spotify_clean))
After checking the summary values and the variance plot, we can see that the variables have different means and that loudness has a much higher variance than the other variables.
Data with large scale differences between variables is not good for clustering because it biases the result. The variable with the largest variance dominates the distance calculation, and the other variables are treated as if they provide almost no information.
Therefore, we must scale the data before clustering.
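The effect described above can be illustrated with a small sketch. The data here is made up (it is not the Spotify data), and the helper name var_share is only an illustration: one column lives on a 0 to 1 scale, the other on a decibel-like scale, and we compare each column's share of the total variance before and after scale().

```r
# Toy data (made up for illustration): two features on very different scales.
set.seed(1)
toy <- data.frame(
  energy   = runif(100, min = 0, max = 1),     # 0-1 scale, like most features
  loudness = rnorm(100, mean = -6, sd = 2.5)   # decibel-like scale, larger spread
)

# Hypothetical helper: share of the total variance contributed by each column.
var_share <- function(df) {
  v <- sapply(df, var)
  v / sum(v)
}

share_raw    <- var_share(toy)
share_scaled <- var_share(as.data.frame(scale(toy)))

print(round(share_raw, 3))     # loudness takes almost all of the variance
print(round(share_scaled, 3))  # after scaling, both columns contribute equally
```

Because k-means works with Euclidean distances, a column that holds nearly all of the variance also decides nearly all of the cluster assignments, which is why the scaling step matters.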
Scaling data
Scaling
spotify_scale <-
scale(spotify_clean) %>%
as.data.frame()
Check range data after scaling
summary(spotify_scale)
## acousticness danceability energy instrumentalness
## Min. :-0.9176 Min. :-3.32076 Min. :-3.49675 Min. :-0.1473
## 1st Qu.:-0.7808 1st Qu.:-0.65987 1st Qu.:-0.67755 1st Qu.:-0.1473
## Median :-0.3828 Median : 0.07169 Median : 0.07912 Median :-0.1473
## Mean : 0.0000 Mean : 0.00000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.4739 3rd Qu.: 0.68133 3rd Qu.: 0.74713 3rd Qu.:-0.1466
## Max. : 3.5654 Max. : 2.04405 Max. : 2.04766 Max. :19.6156
## liveness loudness speechiness valence
## Min. :-1.2151 Min. :-6.5843 Min. :-0.8859 Min. :-2.06955
## 1st Qu.:-0.6068 1st Qu.:-0.4807 1st Qu.:-0.6995 1st Qu.:-0.78120
## Median :-0.3988 Median : 0.1702 Median :-0.4441 Median :-0.03216
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.00000
## 3rd Qu.: 0.2725 3rd Qu.: 0.6764 3rd Qu.: 0.3388 3rd Qu.: 0.75319
## Max. : 5.5019 Max. : 2.6884 Max. : 5.6048 Max. : 2.21950
Check variance after scaling
var(spotify_scale)
## acousticness danceability energy instrumentalness
## acousticness 1.00000000 -0.09576913 -0.50927988 0.06354763
## danceability -0.09576913 1.00000000 -0.11096113 -0.05304323
## energy -0.50927988 -0.11096113 1.00000000 -0.04490081
## instrumentalness 0.06354763 -0.05304323 -0.04490081 1.00000000
## liveness -0.05826970 -0.02498200 0.11930980 0.05359032
## loudness -0.39806260 0.03988425 0.72216648 -0.10517355
## speechiness -0.07090832 0.29456501 -0.08833202 -0.07638406
## valence -0.11123881 0.19362683 0.40562633 -0.08849891
## liveness loudness speechiness valence
## acousticness -0.05826970 -0.39806260 -0.07090832 -0.11123881
## danceability -0.02498200 0.03988425 0.29456501 0.19362683
## energy 0.11930980 0.72216648 -0.08833202 0.40562633
## instrumentalness 0.05359032 -0.10517355 -0.07638406 -0.08849891
## liveness 1.00000000 0.06423751 0.06311147 0.08597184
## loudness 0.06423751 1.00000000 -0.05822185 0.24135328
## speechiness 0.06311147 -0.05822185 1.00000000 0.03003683
## valence 0.08597184 0.24135328 0.03003683 1.00000000
plot(prcomp(spotify_scale))
After scaling, every variable has a mean of 0 and a variance of 1, so no single variable dominates the others.
Principal Component Analysis (PCA) reduces the dimensions of the data while keeping as much of the original information as possible, by creating new axes that capture as much variance as they can. Each new axis is called a Principal Component (PC): PC1 captures the most information, followed by PC2, and so on.
# scale.unit = FALSE because the data has already been scaled above
spotify_pca <- PCA(spotify_scale,
                   scale.unit = FALSE,
                   graph = FALSE)
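To make "how much information each PC captures" concrete, here is a minimal sketch using base R's prcomp() on made-up data (not the Spotify data). It computes the proportion of variance per PC and the smallest number of PCs that keeps at least 80% of it. The 80% threshold and the toy columns are assumptions chosen only for this example.

```r
set.seed(1)
# Made-up data with 3 columns, two of them correlated, standing in for real features.
x1  <- rnorm(200)
toy <- data.frame(a = x1, b = x1 + rnorm(200, sd = 0.3), c = rnorm(200))
toy_pca <- prcomp(toy, center = TRUE, scale. = TRUE)

# Proportion of total variance captured by each PC, and the running total.
prop_var <- toy_pca$sdev^2 / sum(toy_pca$sdev^2)
cum_var  <- cumsum(prop_var)
print(round(cum_var, 3))

# Smallest number of PCs that keeps at least 80% of the information
# (the 80% threshold is an arbitrary choice for this sketch).
n_keep <- which(cum_var >= 0.80)[1]
print(n_keep)
```

For the FactoMineR object created above, the same kind of table should be available in spotify_pca$eig, which lists the eigenvalue and the cumulative percentage of variance for each component.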
Individual Factor Map
Plot the distribution of observations to find data points that may be outliers
plot.PCA(spotify_pca,
choix = "ind",
select = "contrib 3",
habillage = 1)
From the plot above, we see 3 outlying observations, in rows 477, 1557 and 2438.
Variables Factor Map
To find each variable's contribution to each PC and the correlation between variables
plot.PCA(spotify_pca,
choix = "var")
fviz_contrib(X = spotify_pca,
choice = "var",
axes = 1)
fviz_contrib(X = spotify_pca,
choice = "var",
axes = 2)
From plot above, we get insight:
Clustering groups data based on shared characteristics. K-means is a centroid-based clustering algorithm, which means each cluster has one centroid that represents it.
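The centroid idea can be sketched in a few lines. The function below is a simplified Lloyd-style k-means written only for illustration, run on made-up data. It is not what kmeans() does internally (kmeans() uses the Hartigan-Wong algorithm by default), and the name simple_kmeans is hypothetical.

```r
# Minimal Lloyd-style k-means, for illustration only.
simple_kmeans <- function(x, k, iters = 20) {
  x <- as.matrix(x)
  centers <- x[sample(nrow(x), k), , drop = FALSE]  # random starting centroids
  for (i in seq_len(iters)) {
    # Squared Euclidean distance from every row to every centroid (n x k matrix).
    d <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centers[j, ])^2))
    cluster <- max.col(-d, ties.method = "first")   # index of the nearest centroid
    # Move each centroid to the mean of the points assigned to it.
    for (j in seq_len(k)) {
      if (any(cluster == j)) {
        centers[j, ] <- colMeans(x[cluster == j, , drop = FALSE])
      }
    }
  }
  list(cluster = cluster, centers = centers)
}

set.seed(1)
# Two made-up, well-separated groups of 50 points each.
toy <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
             matrix(rnorm(100, mean = 8), ncol = 2))
fit <- simple_kmeans(toy, k = 2)
print(fit$centers)
```

Each pass repeats two steps: assign every observation to its nearest centroid, then move each centroid to the mean of its members.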
To cluster with K-means, the first step is to find the optimal number of clusters for our model. Use the fviz_nbclust() function with method = "wss" to find the optimal K using the Elbow method.
RNGkind(sample.kind = "Rounding")
set.seed(1616)
fviz_nbclust(spotify_scale, kmeans, method = "wss")
Based on the elbow method, 8 clusters is good enough, since the total within-cluster sum of squares does not decline significantly for higher numbers of clusters.
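What the elbow plot measures can be reproduced with base R alone. The sketch below uses made-up data with three obvious groups (not the Spotify data) and computes the total within-cluster sum of squares for k = 1 to 8. The use of nstart = 10 is an assumption made here to keep the result stable across random starts.

```r
set.seed(1)
# Made-up data with three obvious groups of 50 points each.
toy <- rbind(matrix(rnorm(100, mean = 0),  ncol = 2),
             matrix(rnorm(100, mean = 6),  ncol = 2),
             matrix(rnorm(100, mean = 12), ncol = 2))

# Total within-cluster sum of squares for k = 1..8.
wss <- sapply(1:8, function(k) {
  kmeans(toy, centers = k, nstart = 10)$tot.withinss
})

# The "elbow" is the k after which the curve flattens out.
plot(1:8, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

On this toy data the curve drops sharply up to k = 3 and then flattens. Real data rarely shows such a clean bend, so choosing K from the plot is a judgment call.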
In this step, the chosen K is used in the clustering process, and a new cluster column is created to label each observation.
Make clustering
RNGkind(sample.kind = "Rounding")
set.seed(1616)
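# k-means clustering
# NOTE: this call clusters the unscaled spotify_clean, and the outputs below come from it.
# To cluster the scaled data prepared earlier, pass spotify_scale instead (the numbers will differ).
spotify_clust <- kmeans(spotify_clean, centers = 8)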
Clustering results can be assessed from 3 values:
$withinss: sum of squared distances from each observation to the centroid of its cluster.
$betweenss: sum of squared distances from each centroid to the global mean, weighted by the number of observations in the cluster.
$totss: sum of squared distances from each observation to the global mean.
Within Sum of Squares (WSS)
spotify_clust$withinss
## [1] 191.9313 112.2179 120.9976 142.4012 178.3662 130.6765 211.1565 166.0838
Between Sum of Squares (BSS)
spotify_clust$betweenss
## [1] 20504.64
Total Sum of Squares (TSS)
spotify_clust$totss
## [1] 21758.47
Check the BSS/TSS Ratio
((spotify_clust$betweenss)/(spotify_clust$totss))*100
## [1] 94.2375
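The BSS/TSS ratio is 94.2%, which means most of the total variation lies between clusters rather than within them, so the clusters are compact and well separated. This ratio measures cluster separation; it is not a prediction accuracy. Because kmeans() was run on the unscaled spotify_clean, the ratio is driven mostly by loudness, which has by far the largest variance. Songs in the same cluster can then be used to pick music that matches your mood.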
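The three sums of squares are linked by the identity TSS = WSS + BSS, where WSS is the sum of all values in $withinss. The sketch below recomputes each quantity by hand on made-up data (not the Spotify data) and compares them with what kmeans() reports. The variable names are chosen here for illustration.

```r
set.seed(1)
# Two made-up groups of 50 points each.
toy <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
             matrix(rnorm(100, mean = 5), ncol = 2))
fit <- kmeans(toy, centers = 2, nstart = 10)

grand_mean <- colMeans(toy)

# TSS: squared distance from every observation to the global mean.
tss <- sum(sweep(toy, 2, grand_mean)^2)
# WSS: squared distance from every observation to its own centroid.
wss <- sum((toy - fit$centers[fit$cluster, ])^2)
# BSS: squared distance from each centroid to the global mean, weighted by cluster size.
bss <- sum(fit$size * rowSums(sweep(fit$centers, 2, grand_mean)^2))

ratio <- bss / tss * 100
print(c(tss = tss, wss = wss, bss = bss, ratio = ratio))
```

A ratio close to 100% means that little variation is left inside the clusters. Since the ratio can only grow as K increases, it is best read together with the elbow plot rather than on its own.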
Clustering Plot
fviz_cluster(object=spotify_clust,
data = spotify_clean,
labelsize = 7)
Clustering Data
spotifypolular$cluster <- spotify_clust$cluster
spotifypolular %>%
select(cluster, acousticness, danceability, energy, instrumentalness, liveness, loudness, speechiness, valence) %>%
group_by(cluster) %>%
summarise_all(mean)
Profiling:
If you are listening to “Numb” by “Linkin Park” and don’t yet know what to play next, this model can suggest songs with a similar character and composition.
spotifypolular %>%
filter(artist_name == "Linkin Park" & track_name == "Numb")
For the artist “Linkin Park” and the track “Numb”, the result has 3 rows, one per genre, and all 3 are in “cluster 4”. Because the song is listed under 3 genres, you have more genres to choose from.
Let’s say you choose the genre “Alternative”. Which songs will be suggested next?
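# `ï..genre` is the genre column; its name is mangled by a byte-order mark in the CSV.
# Reading the file with read.csv(..., fileEncoding = "UTF-8-BOM") would give a plain `genre` column.
spotifypolular %>%
filter(cluster == 4 & ï..genre == "Alternative")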
Filtering on “cluster 4” and genre “Alternative” returns 5 songs with a similar character and composition.
If you are listening to the track “Just the Way You Are” with a tempo above 100 and don’t yet know what to play next, this model can suggest songs with a similar character and composition.
spotifypolular %>%
filter(track_name == "Just the Way You Are" & tempo > 100)
For the track “Just the Way You Are” with a tempo above 100, the result has 2 rows, one per genre, and both are in “cluster 8”. Because the song is listed under 2 genres, you have more genres to choose from.
Let’s say you choose the genre “Dance”. Which songs will be suggested next?
spotifypolular %>%
filter(cluster == 8 & ï..genre == "Dance")
Filtering on “cluster 8” and genre “Dance” returns 115 songs with a similar character and composition.
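The two lookups above follow the same pattern: find the cluster of the current track, then list other songs in that cluster, optionally within one genre. A minimal sketch of that pattern as a reusable function is shown below. The function name recommend_similar and the toy rows are hypothetical, and the sketch assumes the genre column is named genre rather than the mangled name that appears in the imported data.

```r
# Hypothetical helper: songs that share a cluster with the given track,
# optionally restricted to one genre.
recommend_similar <- function(songs, track, genre = NULL) {
  target_clusters <- unique(songs$cluster[songs$track_name == track])
  keep <- songs$cluster %in% target_clusters & songs$track_name != track
  if (!is.null(genre)) {
    keep <- keep & songs$genre == genre
  }
  songs[keep, , drop = FALSE]
}

# Made-up rows for illustration; not taken from the real dataset.
toy_songs <- data.frame(
  genre      = c("Rock", "Rock", "Pop", "Pop"),
  track_name = c("Song A", "Song B", "Song C", "Song D"),
  cluster    = c(1, 1, 1, 2),
  stringsAsFactors = FALSE
)

result <- recommend_similar(toy_songs, track = "Song A", genre = "Rock")
print(result)
```

With the article's data, a call could look like recommend_similar(spotifypolular, "Numb", "Alternative"), once the genre column has been renamed to genre.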
From the unsupervised learning analysis above, we can summarize that:
Dimensionality reduction can be performed on this dataset. To do so, we pick a subset of the 8 PCs according to how much of the total information we want to retain.
We can separate the data into 8 clusters based on the numerical features, with a between-cluster to total sum of squares ratio of 94.2%.