Library

library(tidyverse)
library(GGally)
library(FactoMineR)
library(caret)

Read Data

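# stringsAsFactors = TRUE reproduces the factor columns shown below
# (the default changed to FALSE in R 4.0)
spotify <- read.csv("SpotifyFeatures.csv", stringsAsFactors = TRUE)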
glimpse(spotify)

Observations: 232,725
Variables: 18
$ genre            <fct> Movie, Movie, Movie, Movie, Movie, Movie, Movie, Mov…
$ artist_name      <fct> Henri Salvador, Martin & les fées, Joseph Williams, …
$ track_name       <fct> "C'est beau de faire un Show", "Perdu d'avance (par …
$ track_id         <fct> 0BRjO6ga9RKCKjfDqeFgWV, 0BjC1NfoEOOusryehmNudP, 0CoS…
$ popularity       <int> 0, 1, 3, 0, 4, 0, 2, 15, 0, 10, 0, 2, 4, 3, 0, 0, 0,…
$ acousticness     <dbl> 0.61100, 0.24600, 0.95200, 0.70300, 0.95000, 0.74900…
$ danceability     <dbl> 0.389, 0.590, 0.663, 0.240, 0.331, 0.578, 0.703, 0.4…
$ duration_ms      <int> 99373, 137373, 170267, 152427, 82625, 160627, 212293…
$ energy           <dbl> 0.9100, 0.7370, 0.1310, 0.3260, 0.2250, 0.0948, 0.27…
$ instrumentalness <dbl> 0.00e+00, 0.00e+00, 0.00e+00, 0.00e+00, 1.23e-01, 0.…
$ key              <fct> C#, F#, C, C#, F, C#, C#, F#, C, G, E, C, F#, D#, G,…
$ liveness         <dbl> 0.3460, 0.1510, 0.1030, 0.0985, 0.2020, 0.1070, 0.10…
$ loudness         <dbl> -1.828, -5.559, -13.879, -12.178, -21.150, -14.970, …
$ mode             <fct> Major, Minor, Minor, Major, Major, Major, Major, Maj…
$ speechiness      <dbl> 0.0525, 0.0868, 0.0362, 0.0395, 0.0456, 0.1430, 0.95…
$ tempo            <dbl> 166.969, 174.003, 99.488, 171.758, 140.576, 87.479, …
$ time_signature   <fct> 4/4, 4/4, 5/4, 4/4, 4/4, 4/4, 4/4, 4/4, 4/4, 4/4, 4/…
$ valence          <dbl> 0.8140, 0.8160, 0.3680, 0.2270, 0.3900, 0.3580, 0.53…

Explanation of some of the variables:
- key: The estimated overall key of the track. The Spotify API encodes it as an integer in standard Pitch Class notation (-1 if no key was detected); in this dataset it is stored as a pitch name such as C# or F#
- liveness: Detects the presence of an audience in the recording
- loudness: The overall loudness of a track in decibels (dB)
- mode: Indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. The Spotify API represents major as 1 and minor as 0; in this dataset the values are "Major" and "Minor"
- speechiness: Detects the presence of spoken words in a track
- tempo: The overall estimated tempo of a track in beats per minute (BPM)
- time_signature: The estimated time signature of the track, i.e. how many beats are in each bar (for example 4/4)
- valence: A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track

NA Checking

After loading the data, let’s check whether it contains any NA values.

# check for NA values in each column
as.data.frame(colSums(is.na(spotify)))

                 colSums(is.na(spotify))
genre                                  0
artist_name                            0
track_name                             0
track_id                               0
popularity                             0
acousticness                           0
danceability                           0
duration_ms                            0
energy                                 0
instrumentalness                       0
key                                    0
liveness                               0
loudness                               0
mode                                   0
speechiness                            0
tempo                                  0
time_signature                         0
valence                                0

There are no NA values in the Spotify data.

Filter: songs with popularity of 80 and above only

If you look back at the earlier part, we have a whopping 232,725 observations! To make our further analysis lighter to process (and to digest), let’s filter the data to only include songs with a popularity score of 80 or higher.

spotify_80 <- spotify %>% 
  filter(popularity >= 80) 

str(spotify_80)

'data.frame':   1239 obs. of  18 variables:
 $ genre           : Factor w/ 27 levels "A Capella","Alternative",..: 2 2 2 10 10 10 10 10 10 10 ...
 $ artist_name     : Factor w/ 14564 levels "!!!","¡MAYDAY!",..: 6441 6441 7684 828 828 5090 828 828 828 828 ...
 $ track_name      : Factor w/ 148615 levels "____45_____",..: 105185 110794 60020 17027 2269 142863 84535 84203 121260 15380 ...
 $ track_id        : Factor w/ 176774 levels "00021Wy6AyMbLP2tqij86e",..: 88810 16327 136389 102494 24164 128356 40457 109930 61946 54466 ...
 $ popularity      : int  83 81 80 99 100 97 92 91 95 91 ...
 $ acousticness    : num  0.422 0.544 0.0103 0.0421 0.578 0.297 0.78 0.451 0.28 0.0815 ...
 $ danceability    : num  0.552 0.515 0.542 0.726 0.725 0.752 0.647 0.747 0.724 0.758 ...
 $ duration_ms     : int  180019 209274 216933 190440 178640 201661 171573 182000 207333 216893 ...
 $ energy          : num  0.65 0.479 0.853 0.554 0.321 0.488 0.309 0.458 0.647 0.665 ...
 $ instrumentalness: num  2.75e-04 5.98e-03 0.00 0.00 0.00 9.11e-06 7.41e-06 0.00 0.00 1.57e-04 ...
 $ key             : Factor w/ 12 levels "A","A#","B","C",..: 5 7 7 9 5 10 11 10 5 6 ...
 $ liveness        : num  0.372 0.191 0.108 0.106 0.0884 0.0936 0.202 0.252 0.102 0.216 ...
 $ loudness        : num  -7.2 -7.46 -6.41 -5.29 -10.74 ...
 $ mode            : Factor w/ 2 levels "Major","Minor": 1 1 2 2 2 1 2 1 1 2 ...
 $ speechiness     : num  0.128 0.0261 0.0498 0.0917 0.323 0.0705 0.0366 0.303 0.0658 0.0774 ...
 $ tempo           : num  167.8 89 105.3 170 70.1 ...
 $ time_signature  : Factor w/ 5 levels "0/4","1/4","3/4",..: 4 4 4 4 4 4 4 4 4 4 ...
 $ valence         : num  0.316 0.284 0.37 0.335 0.319 0.533 0.195 0.47 0.435 0.643 ...

Yes! We filtered it down to only 1,239 songs!

Question

Can we classify these songs into groups? If so, what are the characteristics of each group?

EDA

Since we have a lot of variables, let’s do some exploratory data analysis (EDA).

Filter

# Selecting only a few of variables to experiment with
spot.num <- spotify_80 %>% 
  select_if(is.numeric) %>% 
  select(-popularity, -duration_ms, -tempo, -liveness, -loudness)


str(spot.num)

'data.frame':   1239 obs. of  6 variables:
 $ acousticness    : num  0.422 0.544 0.0103 0.0421 0.578 0.297 0.78 0.451 0.28 0.0815 ...
 $ danceability    : num  0.552 0.515 0.542 0.726 0.725 0.752 0.647 0.747 0.724 0.758 ...
 $ energy          : num  0.65 0.479 0.853 0.554 0.321 0.488 0.309 0.458 0.647 0.665 ...
 $ instrumentalness: num  2.75e-04 5.98e-03 0.00 0.00 0.00 9.11e-06 7.41e-06 0.00 0.00 1.57e-04 ...
 $ speechiness     : num  0.128 0.0261 0.0498 0.0917 0.323 0.0705 0.0366 0.303 0.0658 0.0774 ...
 $ valence         : num  0.316 0.284 0.37 0.335 0.319 0.533 0.195 0.47 0.435 0.643 ...

summary(spot.num)

  acousticness       danceability       energy       instrumentalness   
 Min.   :0.000147   Min.   :0.258   Min.   :0.1470   Min.   :0.0000000  
 1st Qu.:0.035550   1st Qu.:0.609   1st Qu.:0.5380   1st Qu.:0.0000000  
 Median :0.125000   Median :0.698   Median :0.6560   Median :0.0000000  
 Mean   :0.206544   Mean   :0.692   Mean   :0.6459   Mean   :0.0038650  
 3rd Qu.:0.305500   3rd Qu.:0.780   3rd Qu.:0.7720   3rd Qu.:0.0000134  
 Max.   :0.922000   Max.   :0.964   Max.   :0.9530   Max.   :0.4330000  
  speechiness        valence      
 Min.   :0.0232   Min.   :0.0379  
 1st Qu.:0.0441   1st Qu.:0.3210  
 Median :0.0739   Median :0.4720  
 Mean   :0.1175   Mean   :0.4837  
 3rd Qu.:0.1535   3rd Qu.:0.6390  
 Max.   :0.5650   Max.   :0.9690

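Based on the summary above, the variables in spot.num sit on different ranges (for example, instrumentalness peaks at 0.433 while valence reaches 0.969), so we should scale them. If we skip scaling, the later PCA (Principal Component Analysis) will be biased toward the variables with the largest variance: PC1 from unscaled data will appear to summarize most of the information simply because of those variables’ scale.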
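To make the scaling argument concrete, here is a minimal sketch on made-up data (not the Spotify set). It assumes two independent variables, one on a 0-1 scale and one in the hundreds of thousands (roughly like duration_ms); the names and distributions are illustrative assumptions only.

```r
# Synthetic illustration only: these are not Spotify values
set.seed(42)
toy <- data.frame(
  small_feature = runif(500, min = 0, max = 1),           # 0-1 scale, like energy
  large_feature = rnorm(500, mean = 200000, sd = 50000)   # ms scale, like duration_ms
)

# Proportion of variance captured by PC1, without and with scaling
pc1_unscaled <- summary(prcomp(toy))$importance["Proportion of Variance", "PC1"]
pc1_scaled   <- summary(prcomp(toy, scale. = TRUE))$importance["Proportion of Variance", "PC1"]

print(c(unscaled = pc1_unscaled, scaled = pc1_scaled))
```

Without scaling, PC1 is essentially just the large-scale variable; with scaling, the two unrelated variables share the variance roughly evenly.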

Since we don’t have a target variable, this analysis uses an unsupervised learning method, which focuses on exploring and analyzing patterns in the data.

# Check the variance of each PC with `prcomp()` (from base R's stats package)
plot(prcomp(spot.num))

While this seems okay, the variable instrumentalness has much smaller values than the other variables. So let’s try scaling them.

Data Pre-Processing/Scaling

# scaling
spot.num.scale <- scale(spot.num)

# check the PCA again
plot(prcomp(spot.num.scale))
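For reference, scale() with its default arguments centers each column to mean 0 and divides it by its standard deviation. The sketch below checks that on a tiny made-up matrix; the values are assumptions standing in for two audio features, not rows from the dataset.

```r
# Made-up values standing in for two audio features on different scales
x <- cbind(instrumentalness = c(0, 0.001, 0.4, 0.002),
           energy           = c(0.65, 0.48, 0.85, 0.55))

x_scaled <- scale(x)   # defaults: center = TRUE, scale = TRUE

# Manual equivalent for one column: (value - mean) / standard deviation
manual_energy <- (x[, "energy"] - mean(x[, "energy"])) / sd(x[, "energy"])

print(round(colMeans(x_scaled), 10))   # column means after scaling
print(apply(x_scaled, 2, sd))          # column standard deviations after scaling
print(all.equal(unname(x_scaled[, "energy"]), unname(manual_energy)))
```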

PCA

# use the already-scaled data, so PCA() should not scale it again
pca_spot <- PCA(spot.num.scale, scale.unit = FALSE)

summary(pca_spot)


Call:
PCA(X = spot.num.scale, scale.unit = F) 


Eigenvalues
                       Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6
Variance               1.783   1.296   0.977   0.865   0.720   0.355
% of var.             29.742  21.610  16.304  14.420  12.011   5.914
Cumulative % of var.  29.742  51.351  67.656  82.076  94.086 100.000

Individuals (the 10 first)
                     Dist    Dim.1    ctr   cos2    Dim.2    ctr   cos2  
1                |  1.666 | -1.149  0.060  0.476 | -0.588  0.022  0.125 |
2                |  2.641 | -2.316  0.243  0.769 | -1.098  0.075  0.173 |
3                |  2.124 |  0.682  0.021  0.103 | -1.592  0.158  0.562 |
4                |  1.233 | -0.207  0.002  0.028 |  0.219  0.003  0.032 |
5                |  3.401 | -2.304  0.240  0.459 |  2.242  0.313  0.435 |
                  Dim.3    ctr   cos2  
1                -0.513  0.022  0.095 |
2                -0.505  0.021  0.037 |
3                -0.229  0.004  0.012 |
4                -0.043  0.000  0.001 |
5                -0.425  0.015  0.016 |
 [ reached getOption("max.print") -- omitted 5 rows ]

Variables
                    Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3    ctr
acousticness     | -0.712 28.465  0.508 |  0.003  0.001  0.000 | -0.120  1.485
danceability     |  0.328  6.017  0.107 |  0.717 39.719  0.515 |  0.228  5.323
energy           |  0.794 35.321  0.630 | -0.396 12.113  0.157 |  0.067  0.456
instrumentalness | -0.230  2.966  0.053 | -0.151  1.750  0.023 |  0.952 92.724
speechiness      |  0.072  0.290  0.005 |  0.775 46.405  0.602 |  0.008  0.007
                   cos2  
acousticness      0.015 |
danceability      0.052 |
energy            0.004 |
instrumentalness  0.907 |
speechiness       0.000 |
 [ reached getOption("max.print") -- omitted 1 row ]

Variables summarized by each dimension, based on the loadings printed above (the valence row is cut off in the printed output, so it is not assessed here):
- PC1 (Dim.1): energy and acousticness. The latter is strongly negatively correlated with the former.
- PC2 (Dim.2): speechiness and danceability, with a weaker negative contribution from energy.
- PC3 (Dim.3): instrumentalness, almost exclusively.
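The "% of var." and "Cumulative % of var." rows in the Eigenvalues table can be reproduced from the variances alone. The sketch below copies the eigenvalues from the table; the 80% cut-off used at the end is just a common rule of thumb I am assuming for illustration, not something this analysis depends on.

```r
# Eigenvalues (variances) copied from the "Eigenvalues" table above
eig <- c(Dim.1 = 1.783, Dim.2 = 1.296, Dim.3 = 0.977,
         Dim.4 = 0.865, Dim.5 = 0.720, Dim.6 = 0.355)

pct_var <- eig / sum(eig) * 100   # share of total variance per dimension
cum_pct <- cumsum(pct_var)        # running total

print(round(rbind(pct_var, cum_pct), 1))

# Smallest number of dimensions that keeps at least 80% of the variance
# (the 80% threshold is an assumed rule of thumb)
n_keep <- which(cum_pct >= 80)[1]
print(n_keep)
```

Small differences in the last decimal compared with the table come from the eigenvalues being printed with only three decimals.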

Choosing Optimum K

After scaling the data and running PCA, the next step is K-means clustering, which first requires choosing the number of clusters. Use fviz_nbclust() from the factoextra library with method = "wss" to find the optimum K using the Elbow method.

library(factoextra)
fviz_nbclust(spot.num.scale, kmeans, method = "wss")

Based on the plot above, I would say the optimal number of clusters (K) is… 7.
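Conceptually, the Elbow method compares the total within-cluster sum of squares across candidate values of K. Here is a minimal base-R sketch of that idea on synthetic data with three obvious groups; the data, the range of K, and nstart = 25 are all assumptions for illustration.

```r
# Synthetic data with three well-separated groups (illustration only)
set.seed(101)
toy <- rbind(
  matrix(rnorm(100, mean = 0),  ncol = 2),
  matrix(rnorm(100, mean = 5),  ncol = 2),
  matrix(rnorm(100, mean = 10), ncol = 2)
)

# Total within-cluster sum of squares for K = 1..8
k_values <- 1:8
wss <- sapply(k_values, function(k) {
  kmeans(toy, centers = k, nstart = 25)$tot.withinss
})

# The "elbow" is where the curve stops dropping sharply
plot(k_values, wss, type = "b",
     xlab = "Number of clusters K",
     ylab = "Total within-cluster sum of squares")
```

On data like this the curve flattens right after K = 3. On real data such as spot.num.scale the bend is usually much less sharp, which is why picking K involves some judgment.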

Building Cluster

With K=7, let’s implement K-means clustering on the spot.num data and store the result as spot_cluster. Then extract the cluster information from the resulting K-means object using spot_cluster$cluster and add it as a new column named cluster to the spotify_80 dataset.

# set.seed to ensure reproducible example
set.seed(101)

# use kmeans with centers = 7 clusters
# note: this runs on the unscaled spot.num, whereas the elbow plot above used spot.num.scale
spot_cluster <- kmeans(spot.num, centers = 7)

# show how many observations on each cluster
data.frame(cbind(cluster=c(1:7), observation=spot_cluster$size))

  cluster observation
1       1         151
2       2         185
3       3         135
4       4         186
5       5         156
6       6         217
7       7         209

Extract cluster information and save it into the original spotify_80 with the column name cluster.

spotify_80$cluster <- spot_cluster$cluster
tail(spotify_80)

       genre artist_name      track_name               track_id popularity
1234    Soul Alicia Keys          No One 6IwKcFdiRQZOWeYNhUiWIv         80
1235    Soul Leona Lewis   Bleeding Love 7wZUrN8oemZfsEd1CGkbXE         80
1236 Country  Luke Combs Beautiful Crazy 2rxQMGVafnNaRaXlRMWPde         82
     acousticness danceability duration_ms energy instrumentalness key liveness
1234       0.0209        0.644      253813  0.548         8.68e-06  C#   0.1340
1235       0.1880        0.638      262467  0.656         0.00e+00   F   0.1460
1236       0.6760        0.552      193200  0.402         0.00e+00   B   0.0928
     loudness  mode speechiness   tempo time_signature valence cluster
1234   -5.416 Minor      0.0286  90.042            4/4   0.166       4
1235   -5.886 Major      0.0357 104.036            4/4   0.225       4
1236   -7.431 Major      0.0262 103.313            4/4   0.382       3
 [ reached 'max' / getOption("max.print") -- omitted 3 rows ]

Cluster visualization

fviz_cluster(object=spot_cluster,
             data = spot.num)

Goodness of Fit

We can check it from three values:

- Within Sum of Squares (tot.withinss): the sum of squared distances from each observation to the centroid of its cluster

# tot.withinss
spot_cluster$tot.withinss

[1] 63.97722

- Total Sum of Squares (totss): the sum of squared distances from each observation to the global sample mean

# totss
spot_cluster$totss

[1] 184.2524

- Between Sum of Squares (betweenss): the sum of squared distances from each cluster centroid to the global sample mean, weighted by the number of observations in the cluster

spot_cluster$betweenss

[1] 120.2752

Another ‘goodness’ measure is the ratio betweenss/totss (the closer the value is to 1, or 100%, the better):

# betweenss / totss, as a percentage
((spot_cluster$betweenss)/(spot_cluster$totss))*100

[1] 65.27741
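The three values are tied together by totss = tot.withinss + betweenss, which is why betweenss/totss can be read as the share of total variation captured by the cluster assignment. A quick check using the numbers printed above:

```r
# Values copied from the outputs above
tot_withinss <- 63.97722
totss        <- 184.2524
betweenss    <- 120.2752

# Decomposition: totss = tot.withinss + betweenss (up to print rounding)
print(tot_withinss + betweenss)
print(all.equal(tot_withinss + betweenss, totss, tolerance = 1e-6))

# Share of total variation explained by the cluster assignment, in percent
ratio_pct <- betweenss / totss * 100
print(round(ratio_pct, 2))
```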

Disclaimer

While 65% definitely does not look good, I had previously tried several adjustments that produced results I found even less satisfactory (around 85%-96% betweenss/totss):
- the number of variables: it seems that when I included every numeric/integer variable, betweenss/totss kept getting higher. Hence, I omitted some of the variables at the beginning of my analysis.
- filtered/not filtered: while analyzing around 200,000 observations sounds sophisticated, I found that too much data can clutter the data visualization (plus, in this instance, I would like to get familiar with just the popular songs, so I filtered it down).

Answer

So, to answer the initial question about the song characteristics of each cluster:

# keep only the clustering features first, so mean() is never applied to factor columns
spotify_80 %>% 
  select(cluster, acousticness, danceability, energy, instrumentalness, speechiness, valence) %>% 
  group_by(cluster) %>% 
  summarise_all(mean)

# A tibble: 7 x 7
  cluster acousticness danceability energy instrumentalness speechiness valence
    <int>        <dbl>        <dbl>  <dbl>            <dbl>       <dbl>   <dbl>
1       1       0.0853        0.840  0.511         0.00449       0.189    0.341
2       2       0.0779        0.597  0.797         0.00327       0.0745   0.445
3       3       0.645         0.595  0.410         0.0124        0.103    0.276
4       4       0.144         0.608  0.610         0.00492       0.0833   0.238
5       5       0.433         0.716  0.606         0.000922      0.114    0.553
6       6       0.148         0.715  0.774         0.00294       0.102    0.808
7       7       0.0726        0.765  0.691         0.000665      0.163    0.586

The characteristics:
- cluster 1: Highest danceability, highest speechiness
- cluster 2: Highest energy, lowest speechiness
- cluster 3: Highest acousticness, highest instrumentalness, lowest energy
- cluster 4: Lowest valence
- cluster 5: Second-highest acousticness, with mid-range energy and valence
- cluster 6: Highest valence
- cluster 7: Lowest acousticness, lowest instrumentalness
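Rather than reading the extremes off the table by eye, they can be computed. The sketch below re-types the cluster means from the tibble above into a data frame (the names centers and extremes are my own) and finds, for each feature, the cluster with the highest and lowest mean.

```r
# Cluster means copied from the summary table above
centers <- data.frame(
  cluster          = 1:7,
  acousticness     = c(0.0853, 0.0779, 0.645, 0.144, 0.433, 0.148, 0.0726),
  danceability     = c(0.840, 0.597, 0.595, 0.608, 0.716, 0.715, 0.765),
  energy           = c(0.511, 0.797, 0.410, 0.610, 0.606, 0.774, 0.691),
  instrumentalness = c(0.00449, 0.00327, 0.0124, 0.00492, 0.000922, 0.00294, 0.000665),
  speechiness      = c(0.189, 0.0745, 0.103, 0.0833, 0.114, 0.102, 0.163),
  valence          = c(0.341, 0.445, 0.276, 0.238, 0.553, 0.808, 0.586)
)

features <- setdiff(names(centers), "cluster")

# For every feature, which cluster has the highest and the lowest mean?
extremes <- data.frame(
  feature = features,
  highest = sapply(features, function(f) centers$cluster[which.max(centers[[f]])]),
  lowest  = sapply(features, function(f) centers$cluster[which.min(centers[[f]])]),
  row.names = NULL
)
print(extremes)
```

With the live data, the same idea works directly on the summarised tibble instead of hand-typed values.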

Usage

# Find out the cluster of favorite song
spotify_80 %>% 
  filter(artist_name == "AC/DC", track_name == "Highway to Hell")

  genre artist_name      track_name               track_id popularity
1  Rock       AC/DC Highway to Hell 2zYzyRzz6pRmhPzyfMEC8s         84
  acousticness danceability duration_ms energy instrumentalness key liveness
1       0.0591        0.573      208400  0.913          0.00173  F#    0.156
  loudness  mode speechiness   tempo time_signature valence cluster
1   -4.793 Minor       0.132 115.715            4/4   0.422       2

Say, if I like “Highway to Hell” and I want to branch out to another genre (e.g. R&B), that means I should try these songs:

spotify_80 %>% 
  filter(cluster == 2, genre == "R&B") %>% 
  sample_n(5)

  genre  artist_name                                             track_name
1   R&B Jason Derulo            Goodbye (feat. Nicki Minaj & Willy William)
2   R&B     Rita Ora                                               Anywhere
3   R&B   Tory Lanez TAlk tO Me (with Rich The Kid feat. Lil Wayne) - Remix
                track_id popularity acousticness danceability duration_ms
1 5Xn4IyTtW6FUGIUyWjbUHG         82       0.0776        0.643      195419
2 7EI6Iki24tBHAMxtb4xQN2         80       0.0364        0.628      215064
3 51w9Jbat1pLeWINGCpnQUR         84       0.0226        0.698      248521
  energy instrumentalness key liveness loudness  mode speechiness   tempo
1  0.904                0   C   0.1890   -3.694 Major      0.0739 103.028
2  0.797                0   B   0.1040   -3.953 Minor      0.0596 106.930
3  0.660                0   C   0.0622   -7.883 Major      0.0520 159.949
  time_signature valence cluster
1            4/4   0.481       2
2            4/4   0.321       2
3            4/4   0.451       2
 [ reached 'max' / getOption("max.print") -- omitted 2 rows ]

Assuming the model is right, now I have some new popular R&B songs to listen to!
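The lookup-then-filter steps above could be wrapped into a small helper. This is only a sketch: recommend_same_cluster() is a hypothetical function of my own, written in base R, and the songs data frame holds just a few rows copied from the outputs above so the example runs on its own.

```r
# A few rows copied from the outputs above (only the columns needed here)
songs <- data.frame(
  genre       = c("Rock", "R&B", "R&B", "R&B", "Soul", "Soul", "Country"),
  artist_name = c("AC/DC", "Jason Derulo", "Rita Ora", "Tory Lanez",
                  "Alicia Keys", "Leona Lewis", "Luke Combs"),
  track_name  = c("Highway to Hell",
                  "Goodbye (feat. Nicki Minaj & Willy William)",
                  "Anywhere",
                  "TAlk tO Me (with Rich The Kid feat. Lil Wayne) - Remix",
                  "No One", "Bleeding Love", "Beautiful Crazy"),
  cluster     = c(2, 2, 2, 2, 4, 4, 3),
  stringsAsFactors = FALSE
)

# Hypothetical helper: songs in `target_genre` that share the seed song's cluster
recommend_same_cluster <- function(data, artist, track, target_genre) {
  seed <- data[data$artist_name == artist & data$track_name == track, ]
  if (nrow(seed) == 0) stop("Seed song not found in the data")
  # If the seed matches several rows, use the first match's cluster
  data[data$cluster == seed$cluster[1] & data$genre == target_genre,
       c("artist_name", "track_name")]
}

print(recommend_same_cluster(songs, "AC/DC", "Highway to Hell", "R&B"))
```

Passing spotify_80 instead of the small songs data frame should give the full candidate list, which could then be sampled as in the sample_n(5) call above.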
