Background

Spotify is a digital music streaming service that provides access to millions of songs from artists around the world. With so many songs available, it can be hard to choose what to listen to.

This article clusters songs on Spotify using machine learning, specifically the K-Means clustering method, so that songs are grouped by their audio characteristics and we can pick the kind of music we want to hear.

Data Description:

  • Acousticness: Whether the track is acoustic (a higher value means the track is more likely acoustic)

  • Danceability: How suitable a track is for dancing (a higher value means more danceable)

  • Energy: A perceptual measure of intensity and activity (for example, death metal has high energy)

  • Instrumentalness: Whether a track contains no vocals (a higher value means the track is more likely instrumental)

  • Liveness: Presence of an audience in the recording (a higher value means the track was more likely performed live)

  • Loudness: Overall loudness of a track in decibels (dB)

  • Speechiness: Presence of spoken words in a track (tracks may contain both music and speech)

  • Valence: Musical positiveness conveyed by a track (tracks with high valence sound happy or cheerful)

I obtained the data from Kaggle at the following link:

https://www.kaggle.com/zaheenhamidani/ultimate-spotify-tracks-db

Set Up

Load Libraries

library(dplyr) #for wrangling data
library(FactoMineR) #for pca
library(factoextra) #for plot

Import Data

rawspotify <- read.csv("SpotifyFeatures.csv")
rawspotify

Filter Popular Songs

Select songs with a popularity of 75 or higher

spotifypolular <- rawspotify %>% 
  filter(popularity >= 75) 
  
spotifypolular

Filter necessary data

Select the variables relevant to the analysis

spotify_clean <- spotifypolular %>% 
 select(c(acousticness,danceability,energy,instrumentalness,liveness,loudness,speechiness,valence))

spotify_clean

Exploratory Data Analysis

Check Data Type

glimpse(spotify_clean)
## Rows: 3,593
## Columns: 8
## $ acousticness     <dbl> 0.023300, 0.422000, 0.544000, 0.619000, 0.031900, ...
## $ danceability     <dbl> 0.845, 0.552, 0.515, 0.672, 0.731, 0.818, 0.559, 0...
## $ energy           <dbl> 0.709, 0.650, 0.479, 0.588, 0.861, 0.705, 0.345, 0...
## $ instrumentalness <dbl> 0.00e+00, 2.75e-04, 5.98e-03, 2.41e-01, 0.00e+00, ...
## $ liveness         <dbl> 0.0940, 0.3720, 0.1910, 0.0992, 0.0829, 0.6130, 0....
## $ loudness         <dbl> -4.547, -7.199, -7.458, -9.573, -5.881, -6.679, -1...
## $ speechiness      <dbl> 0.0714, 0.1280, 0.0261, 0.1330, 0.0323, 0.1770, 0....
## $ valence          <dbl> 0.6200, 0.3160, 0.2840, 0.2040, 0.7800, 0.7720, 0....

All variables have the appropriate data type.

Check missing value

colSums(is.na(spotify_clean))
##     acousticness     danceability           energy instrumentalness 
##                0                0                0                0 
##         liveness         loudness      speechiness          valence 
##                0                0                0                0

No variable has missing values.

Check range data

summary(spotify_clean)
##   acousticness        danceability       energy       instrumentalness   
##  Min.   :0.0000183   Min.   :0.217   Min.   :0.0511   Min.   :0.0000000  
##  1st Qu.:0.0302000   1st Qu.:0.588   1st Qu.:0.5280   1st Qu.:0.0000000  
##  Median :0.1180000   Median :0.690   Median :0.6560   Median :0.0000000  
##  Mean   :0.2024467   Mean   :0.680   Mean   :0.6426   Mean   :0.0069187  
##  3rd Qu.:0.3070000   3rd Qu.:0.775   3rd Qu.:0.7690   3rd Qu.:0.0000332  
##  Max.   :0.9890000   Max.   :0.965   Max.   :0.9890   Max.   :0.9280000  
##     liveness         loudness        speechiness        valence      
##  Min.   :0.0215   Min.   :-22.320   Min.   :0.0228   Min.   :0.0352  
##  1st Qu.:0.0949   1st Qu.: -7.513   1st Qu.:0.0417   1st Qu.:0.3190  
##  Median :0.1200   Median : -5.934   Median :0.0676   Median :0.4840  
##  Mean   :0.1681   Mean   : -6.347   Mean   :0.1126   Mean   :0.4911  
##  3rd Qu.:0.2010   3rd Qu.: -4.706   3rd Qu.:0.1470   3rd Qu.:0.6570  
##  Max.   :0.8320   Max.   :  0.175   Max.   :0.6810   Max.   :0.9800
var(spotify_clean)
##                   acousticness  danceability        energy instrumentalness
## acousticness      0.0486669591 -0.0029457106 -0.0190052942     0.0006582834
## danceability     -0.0029457106  0.0194399385 -0.0026170937    -0.0003472753
## energy           -0.0190052942 -0.0026170937  0.0286155400    -0.0003566577
## instrumentalness  0.0006582834 -0.0003472753 -0.0003566577     0.0022049200
## liveness         -0.0015510822 -0.0004202905  0.0024352975     0.0003036391
## loudness         -0.2130338207  0.0134905267  0.2963591703    -0.0119807231
## speechiness      -0.0015862877  0.0041648203 -0.0015152587    -0.0003637197
## valence          -0.0054057076  0.0059469199  0.0151149328    -0.0009154053
##                       liveness    loudness   speechiness       valence
## acousticness     -0.0015510822 -0.21303382 -0.0015862877 -0.0054057076
## danceability     -0.0004202905  0.01349053  0.0041648203  0.0059469199
## energy            0.0024352975  0.29635917 -0.0015152587  0.0151149328
## instrumentalness  0.0003036391 -0.01198072 -0.0003637197 -0.0009154053
## liveness          0.0145596145  0.01880372  0.0007722376  0.0022851267
## loudness          0.0188037160  5.88518536 -0.0143229921  0.1289770325
## speechiness       0.0007722376 -0.01432299  0.0102833668  0.0006709665
## valence           0.0022851267  0.12897703  0.0006709665  0.0485241808
plot(prcomp(spotify_clean))

After checking the summary and the variance plot, we can see that the variable means differ and that the variance of loudness (about 5.89) is far higher than that of any other variable (all below 0.05).

Data whose variables are on very different scales is not good for clustering analysis because the result can be biased: the variable with the highest variance dominates the distance calculation, and the other variables are treated as if they carry little information.

Therefore, we must scale the data before clustering.
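
As a rough illustration of why scale matters, the sketch below compares Euclidean distances before and after z-score scaling. The three rows are made-up values, not taken from the Spotify data, and only base R is used.

```r
# Toy data: two features on very different scales (values are made up).
songs <- data.frame(
  energy   = c(0.90, 0.20, 0.85),
  loudness = c(-5.0, -5.5, -12.0)
)

# Raw Euclidean distances: the loudness column dominates.
raw_dist <- as.matrix(dist(songs))

# Distances after z-score scaling: both features contribute comparably.
scaled_dist <- as.matrix(dist(scale(songs)))

raw_dist[1, 2]     # songs 1 and 2 differ mostly in energy
raw_dist[1, 3]     # songs 1 and 3 differ mostly in loudness
scaled_dist[1, 2]
scaled_dist[1, 3]
```

On the raw values, the pair that differs in loudness looks much farther apart than the pair that differs in energy; after scaling, the two pairs are about equally distant.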

Scaling data

Scaling

spotify_scale <- 
  scale(spotify_clean) %>% 
  as.data.frame()

Check range data after scaling

summary(spotify_scale)
##   acousticness      danceability          energy         instrumentalness 
##  Min.   :-0.9176   Min.   :-3.32076   Min.   :-3.49675   Min.   :-0.1473  
##  1st Qu.:-0.7808   1st Qu.:-0.65987   1st Qu.:-0.67755   1st Qu.:-0.1473  
##  Median :-0.3828   Median : 0.07169   Median : 0.07912   Median :-0.1473  
##  Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.4739   3rd Qu.: 0.68133   3rd Qu.: 0.74713   3rd Qu.:-0.1466  
##  Max.   : 3.5654   Max.   : 2.04405   Max.   : 2.04766   Max.   :19.6156  
##     liveness          loudness        speechiness         valence        
##  Min.   :-1.2151   Min.   :-6.5843   Min.   :-0.8859   Min.   :-2.06955  
##  1st Qu.:-0.6068   1st Qu.:-0.4807   1st Qu.:-0.6995   1st Qu.:-0.78120  
##  Median :-0.3988   Median : 0.1702   Median :-0.4441   Median :-0.03216  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.00000  
##  3rd Qu.: 0.2725   3rd Qu.: 0.6764   3rd Qu.: 0.3388   3rd Qu.: 0.75319  
##  Max.   : 5.5019   Max.   : 2.6884   Max.   : 5.6048   Max.   : 2.21950

Check variance after scaling

var(spotify_scale)
##                  acousticness danceability      energy instrumentalness
## acousticness       1.00000000  -0.09576913 -0.50927988       0.06354763
## danceability      -0.09576913   1.00000000 -0.11096113      -0.05304323
## energy            -0.50927988  -0.11096113  1.00000000      -0.04490081
## instrumentalness   0.06354763  -0.05304323 -0.04490081       1.00000000
## liveness          -0.05826970  -0.02498200  0.11930980       0.05359032
## loudness          -0.39806260   0.03988425  0.72216648      -0.10517355
## speechiness       -0.07090832   0.29456501 -0.08833202      -0.07638406
## valence           -0.11123881   0.19362683  0.40562633      -0.08849891
##                     liveness    loudness speechiness     valence
## acousticness     -0.05826970 -0.39806260 -0.07090832 -0.11123881
## danceability     -0.02498200  0.03988425  0.29456501  0.19362683
## energy            0.11930980  0.72216648 -0.08833202  0.40562633
## instrumentalness  0.05359032 -0.10517355 -0.07638406 -0.08849891
## liveness          1.00000000  0.06423751  0.06311147  0.08597184
## loudness          0.06423751  1.00000000 -0.05822185  0.24135328
## speechiness       0.06311147 -0.05822185  1.00000000  0.03003683
## valence           0.08597184  0.24135328  0.03003683  1.00000000
plot(prcomp(spotify_scale))

After scaling, every variable has a mean of 0 and a variance of 1, so no single variable dominates the others.

Principal Component Analysis

Principal Component Analysis (PCA) reduces the dimensions of the data while keeping as much of the original information as possible, by creating new axes that capture the largest share of the variance. Each new axis is called a Principal Component (PC); PC1 captures the most information, followed by PC2, and so on.
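
To make "information captured by each PC" concrete, here is a minimal sketch that computes the share of variance per component. It uses base R's prcomp() on the built-in USArrests data as a stand-in for the Spotify features, and the 80% threshold is an arbitrary choice for illustration.

```r
# USArrests ships with base R; it stands in for the Spotify features here.
pca <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# The variance of each PC is the square of its standard deviation.
pc_var   <- pca$sdev^2
prop_var <- pc_var / sum(pc_var)   # share of total variance per PC
cum_var  <- cumsum(prop_var)       # running total across PCs

round(data.frame(PC = seq_along(prop_var),
                 proportion = prop_var,
                 cumulative = cum_var), 3)

# Smallest number of PCs that retains at least 80% of the variance
# (the threshold is an assumption, not a rule).
n_keep <- which(cum_var >= 0.80)[1]
n_keep
```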

Create PCA

spotify_pca <- PCA(spotify_scale,
                   scale.unit = FALSE, # data is already scaled
                   graph = FALSE)

Visualization PCA

Individual Factor Map

Plot the distribution of observations to find which ones can be considered outliers

plot.PCA(spotify_pca, 
         choix = "ind",
         select = "contrib 3",
         habillage = 1)

From the plot above, we can identify 3 outliers: the observations in rows 477, 1557 and 2438.

Variables Factor Map

To find each variable's contribution to each PC and the correlations between variables

plot.PCA(spotify_pca,
         choix = "var")

fviz_contrib(X = spotify_pca,
             choice = "var",
             axes = 1)

fviz_contrib(X = spotify_pca,
             choice = "var",
             axes = 2)

From the plots above, together with the correlation matrix of the scaled data, we get these insights:

  • Two variables most summarized by PC1: energy & loudness
  • Two variables most summarized by PC2: danceability & speechiness
  • Variables with the strongest positive correlation:
    • energy & loudness (0.72)
    • energy & valence (0.41)
  • Variables with the strongest negative correlation:
    • acousticness & energy (-0.51)
    • acousticness & loudness (-0.40)

K-Means Clustering

Clustering groups data based on shared characteristics. K-means is a centroid-based clustering algorithm, which means each cluster is represented by one centroid.
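
The idea of a centroid can be checked directly: once k-means has converged, each centroid is just the mean of the observations assigned to its cluster. The sketch below shows this on the built-in iris data (used as a stand-in, with 3 clusters chosen only for illustration).

```r
set.seed(1)
features <- scale(iris[, 1:4])
fit <- kmeans(features, centers = 3, nstart = 10)

# Recompute the centroids by hand: mean of each feature per cluster.
manual_centers <- aggregate(as.data.frame(features),
                            by = list(cluster = fit$cluster),
                            FUN = mean)

# Compare the hand-computed means with the centroids returned by kmeans().
centers_match <- all.equal(unname(as.matrix(manual_centers[, -1])),
                           unname(fit$centers))
centers_match
```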

Find K Optimum

To cluster with K-Means, the first thing to do is find the optimal number of clusters for our model. We use the fviz_nbclust() function from factoextra to find the optimal K with the elbow method.

RNGkind(sample.kind = "Rounding")
set.seed(1616)

fviz_nbclust(spotify_scale, kmeans, method = "wss")

Based on the elbow method, 8 clusters is good enough, since adding more clusters brings no significant further decline in the total within-cluster sum of squares.
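
For readers who want to see what the elbow plot is built from, the following sketch computes the total within-cluster sum of squares for k = 1 to 10 with base R. It uses the built-in iris data as a stand-in; the range of k and nstart = 10 are assumptions made for the example.

```r
set.seed(1616)
features <- scale(iris[, 1:4])
k_values <- 1:10

# Total within-cluster sum of squares for each candidate k.
wss <- sapply(k_values, function(k) {
  kmeans(features, centers = k, nstart = 10)$tot.withinss
})

# The "elbow" is the k after which the curve flattens out.
plot(k_values, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

With a single cluster the value equals the total sum of squares, and it can only shrink as k grows, so the useful signal is where the decline slows down.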

Clustering

In this step, the chosen K is used in the clustering process, and a new cluster column is created to label each observation.

Make clustering

RNGkind(sample.kind = "Rounding")
set.seed(1616)

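# k-means clustering
# Note: this call uses the unscaled spotify_clean, so the results below come
# from unscaled data; pass spotify_scale to cluster on the scaled features
spotify_clust <- kmeans(spotify_clean, centers = 8)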

Goodness of Fit

Clustering results can be assessed from 3 values:

  • Within Sum of Squares ($withinss): the sum of squared distances from each observation to the centroid of its cluster.
  • Between Sum of Squares ($betweenss): the sum of squared distances from each centroid to the global mean, weighted by the number of observations in the cluster.
  • Total Sum of Squares ($totss): the sum of squared distances from each observation to the global mean.
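
The three quantities are linked: the total sum of squares equals the within sum of squares (added over all clusters) plus the between sum of squares. The sketch below recomputes them by hand on the built-in iris data (a stand-in with 3 clusters chosen for illustration) and compares them with the values returned by kmeans().

```r
set.seed(1)
features <- scale(iris[, 1:4])
fit <- kmeans(features, centers = 3, nstart = 10)

global_mean <- colMeans(features)

# TSS: squared distance from every observation to the global mean.
tss <- sum(sweep(features, 2, global_mean)^2)

# BSS: squared distance from each centroid to the global mean,
# weighted by the number of observations in that cluster.
bss <- sum(fit$size * rowSums(sweep(fit$centers, 2, global_mean)^2))

# WSS added over all clusters.
wss_total <- sum(fit$withinss)

all.equal(tss, fit$totss)
all.equal(bss, fit$betweenss)
all.equal(tss, wss_total + bss)

# Share of total variation explained by the clustering, in percent.
bss / tss * 100
```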

Within Sum of Squares (WSS)

spotify_clust$withinss
## [1] 191.9313 112.2179 120.9976 142.4012 178.3662 130.6765 211.1565 166.0838

Between Sum of Squares (BSS)

spotify_clust$betweenss
## [1] 20504.64

Total Sum of Squares (TSS)

spotify_clust$totss
## [1] 21758.47

Check Ratio Clustering

((spotify_clust$betweenss)/(spotify_clust$totss))*100
## [1] 94.2375

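The ratio of between-cluster to total sum of squares is 94.2%. This is not an accuracy score; it means that about 94.2% of the total variation in the data is explained by the differences between clusters. Songs within a cluster are therefore close to each other, which should help you find music that fits your mood.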

Cluster Profiling

Clustering Plot

fviz_cluster(object=spotify_clust,
             data = spotify_clean,
             labelsize = 7)

Clustering Data

spotifypolular$cluster <- spotify_clust$cluster

spotifypolular %>% 
  select(cluster, acousticness, danceability, energy, instrumentalness, liveness, loudness, speechiness, valence) %>% 
  group_by(cluster) %>% 
  summarise_all(mean)

Profiling:

  • Cluster 1: Songs with high danceability and energy, but low instrumentalness and loudness
  • Cluster 2: Songs with high energy and valence, but low instrumentalness and speechiness
  • Cluster 3: Songs with high acousticness and instrumentalness, but low energy and loudness
  • Cluster 4: Songs with high energy and liveness, but low acousticness and instrumentalness
  • Cluster 5: Songs with high instrumentalness and acousticness, but low energy and loudness
  • Cluster 6: Songs with high acousticness and danceability, but low liveness and loudness
  • Cluster 7: Songs with high danceability and speechiness, but low instrumentalness and valence
  • Cluster 8: Songs with high energy and valence, but low acousticness and instrumentalness

Try to Find a Song Recommendation

Example Case 1

Suppose you are listening to “Numb” by “Linkin Park” and don’t know what to play next. This model can show you songs with a similar feel and composition.

spotifypolular %>% 
  filter(artist_name == "Linkin Park" & track_name == "Numb")

The query for artist “Linkin Park” and track name “Numb” returns 3 rows, one for each genre the song is listed under. All 3 rows fall in the same cluster, “cluster 4”. Because the song is listed under 3 genres, you have more options for the genre you want to hear.

Let’s say you choose the genre “Alternative”. Which songs will be suggested next?

# ï..genre is the genre column; the byte-order mark in the CSV header mangled its name
spotifypolular %>% 
  filter(cluster == 4 & ï..genre == "Alternative")

Filtering on “cluster 4” and genre “Alternative” gives 5 songs with a similar feel and composition.

Example Case 2

Suppose you are listening to the track “Just the Way You Are” with a tempo above 100 and don’t know what to play next. This model can show you songs with a similar feel and composition.

spotifypolular %>% 
  filter(track_name == "Just the Way You Are" & tempo > 100)

The query for track name “Just the Way You Are” with a tempo above 100 returns 2 rows, one for each genre the song is listed under. Both rows fall in the same cluster, “cluster 8”. Because the song is listed under 2 genres, you have more options for the genre you want to hear.

Let’s say you choose the genre “Dance”. Which songs will be suggested next?

spotifypolular %>% 
  filter(cluster == 8 & ï..genre == "Dance")

Filtering on “cluster 8” and genre “Dance” gives 115 songs with a similar feel and composition.
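
The two cases above follow the same steps, so they could be wrapped in a small helper. The sketch below is one possible version in base R: recommend_songs() is a hypothetical function, the toy data frame is made up, and the genre column is assumed to be named genre (in the data frame used in this article it is ï..genre).

```r
# Hypothetical helper, not part of any package.
recommend_songs <- function(songs, artist, track, genre_wanted) {
  is_seed <- songs$artist_name == artist & songs$track_name == track
  seed <- songs[is_seed, ]
  if (nrow(seed) == 0) {
    stop("Track not found in the data.")
  }
  # The same track can appear once per genre; use its most frequent cluster.
  seed_cluster <- as.integer(names(which.max(table(seed$cluster))))
  # Keep songs in the same cluster and genre, excluding the seed track itself.
  is_match <- songs$cluster == seed_cluster & songs$genre == genre_wanted
  songs[is_match & !is_seed, ]
}

# Made-up toy data, only to show the call.
toy <- data.frame(
  artist_name = c("Artist A", "Artist B", "Artist C", "Artist D"),
  track_name  = c("Song 1", "Song 2", "Song 3", "Song 4"),
  genre       = c("Rock", "Rock", "Rock", "Pop"),
  cluster     = c(4L, 4L, 2L, 4L),
  stringsAsFactors = FALSE
)

result <- recommend_songs(toy, "Artist A", "Song 1", "Rock")
result
```

Excluding the seed track is a design choice made here so that the song you are already playing is not recommended back to you.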

Conclusion

From the unsupervised learning analysis above, we can summarize that:

  • Dimensionality reduction can be performed on this dataset. To do so, we pick as many of the 8 PCs as needed to retain the amount of information we want.

  • We can separate the data into 8 clusters based on all of the numerical features, with a between-cluster to total sum of squares ratio of about 94.2%.