Machine Learning/Clustering on Spotify Data
Read Data
Observations: 232,725
Variables: 18
$ genre <fct> Movie, Movie, Movie, Movie, Movie, Movie, Movie, Mov…
$ artist_name <fct> Henri Salvador, Martin & les fées, Joseph Williams, …
$ track_name <fct> "C'est beau de faire un Show", "Perdu d'avance (par …
$ track_id <fct> 0BRjO6ga9RKCKjfDqeFgWV, 0BjC1NfoEOOusryehmNudP, 0CoS…
$ popularity <int> 0, 1, 3, 0, 4, 0, 2, 15, 0, 10, 0, 2, 4, 3, 0, 0, 0,…
$ acousticness <dbl> 0.61100, 0.24600, 0.95200, 0.70300, 0.95000, 0.74900…
$ danceability <dbl> 0.389, 0.590, 0.663, 0.240, 0.331, 0.578, 0.703, 0.4…
$ duration_ms <int> 99373, 137373, 170267, 152427, 82625, 160627, 212293…
$ energy <dbl> 0.9100, 0.7370, 0.1310, 0.3260, 0.2250, 0.0948, 0.27…
$ instrumentalness <dbl> 0.00e+00, 0.00e+00, 0.00e+00, 0.00e+00, 1.23e-01, 0.…
$ key <fct> C#, F#, C, C#, F, C#, C#, F#, C, G, E, C, F#, D#, G,…
$ liveness <dbl> 0.3460, 0.1510, 0.1030, 0.0985, 0.2020, 0.1070, 0.10…
$ loudness <dbl> -1.828, -5.559, -13.879, -12.178, -21.150, -14.970, …
$ mode <fct> Major, Minor, Minor, Major, Major, Major, Major, Maj…
$ speechiness <dbl> 0.0525, 0.0868, 0.0362, 0.0395, 0.0456, 0.1430, 0.95…
$ tempo <dbl> 166.969, 174.003, 99.488, 171.758, 140.576, 87.479, …
$ time_signature <fct> 4/4, 4/4, 5/4, 4/4, 4/4, 4/4, 4/4, 4/4, 4/4, 4/4, 4/…
$ valence <dbl> 0.8140, 0.8160, 0.3680, 0.2270, 0.3900, 0.3580, 0.53…
Explanation of some of the variables:
- key: The estimated overall key of the track. The Spotify API encodes keys as integers in standard Pitch Class notation (-1 if no key was detected); in this dataset they are stored as pitch names such as C# or F#
- liveness: Detects the presence of an audience in the recording
- loudness: The overall loudness of a track in decibels (dB)
- mode: Indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. The Spotify API represents major as 1 and minor as 0; in this dataset the values are stored as Major and Minor
- speechiness: detects the presence of spoken words in a track
- tempo: The overall estimated tempo of a track in beats per minute (BPM)
- time_signature: The estimated time signature (meter) of a track, e.g. 4/4
- valence: A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track
N/A Checking
After loading the data, let’s check whether it contains any N/A (missing) values.
colSums(is.na(spotify))
genre 0
artist_name 0
track_name 0
track_id 0
popularity 0
acousticness 0
danceability 0
duration_ms 0
energy 0
instrumentalness 0
key 0
liveness 0
loudness 0
mode 0
speechiness 0
tempo 0
time_signature 0
valence 0
There are no N/A values in the Spotify data.
Filter: songs with popularity of 80 and above only
If you look back at the earlier part, we have a whopping 232,725 observations! To make our further analysis lighter to process (and to digest), let’s filter the data to only include songs with a popularity score of 80 or higher.
'data.frame': 1239 obs. of 18 variables:
$ genre : Factor w/ 27 levels "A Capella","Alternative",..: 2 2 2 10 10 10 10 10 10 10 ...
$ artist_name : Factor w/ 14564 levels "!!!","¡MAYDAY!",..: 6441 6441 7684 828 828 5090 828 828 828 828 ...
$ track_name : Factor w/ 148615 levels "____45_____",..: 105185 110794 60020 17027 2269 142863 84535 84203 121260 15380 ...
$ track_id : Factor w/ 176774 levels "00021Wy6AyMbLP2tqij86e",..: 88810 16327 136389 102494 24164 128356 40457 109930 61946 54466 ...
$ popularity : int 83 81 80 99 100 97 92 91 95 91 ...
$ acousticness : num 0.422 0.544 0.0103 0.0421 0.578 0.297 0.78 0.451 0.28 0.0815 ...
$ danceability : num 0.552 0.515 0.542 0.726 0.725 0.752 0.647 0.747 0.724 0.758 ...
$ duration_ms : int 180019 209274 216933 190440 178640 201661 171573 182000 207333 216893 ...
$ energy : num 0.65 0.479 0.853 0.554 0.321 0.488 0.309 0.458 0.647 0.665 ...
$ instrumentalness: num 2.75e-04 5.98e-03 0.00 0.00 0.00 9.11e-06 7.41e-06 0.00 0.00 1.57e-04 ...
$ key : Factor w/ 12 levels "A","A#","B","C",..: 5 7 7 9 5 10 11 10 5 6 ...
$ liveness : num 0.372 0.191 0.108 0.106 0.0884 0.0936 0.202 0.252 0.102 0.216 ...
$ loudness : num -7.2 -7.46 -6.41 -5.29 -10.74 ...
$ mode : Factor w/ 2 levels "Major","Minor": 1 1 2 2 2 1 2 1 1 2 ...
$ speechiness : num 0.128 0.0261 0.0498 0.0917 0.323 0.0705 0.0366 0.303 0.0658 0.0774 ...
$ tempo : num 167.8 89 105.3 170 70.1 ...
$ time_signature : Factor w/ 5 levels "0/4","1/4","3/4",..: 4 4 4 4 4 4 4 4 4 4 ...
$ valence : num 0.316 0.284 0.37 0.335 0.319 0.533 0.195 0.47 0.435 0.643 ...
Yes! We filtered it down to only 1,239 songs!
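The filtering code itself is not shown above. A minimal sketch of how such a filter could look, using base R and a made-up toy data frame in place of the real spotify data (the names spotify_toy and spotify_80_toy are hypothetical):

```r
# Toy stand-in for the full `spotify` data frame (values are made up).
spotify_toy <- data.frame(
  track_name = c("a", "b", "c", "d"),
  popularity = c(79L, 80L, 95L, 12L),
  stringsAsFactors = FALSE
)

# Keep songs whose popularity is 80 or above. The bound is inclusive because
# the filtered output above contains songs with popularity exactly 80.
spotify_80_toy <- subset(spotify_toy, popularity >= 80)

nrow(spotify_80_toy)
```

With dplyr loaded, the same step would be a single filter() call on the popularity column.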
Question
Can we group these songs into clusters? If so, what are the characteristics of each cluster?
EDA
Since we have a lot of variables, let’s do some exploratory data analysis (EDA).
Filter
# Select only a few of the numeric variables to experiment with
spot.num <- spotify_80 %>%
  select_if(is.numeric) %>%
  select(-popularity, -duration_ms, -tempo, -liveness, -loudness)
str(spot.num)
'data.frame': 1239 obs. of 6 variables:
$ acousticness : num 0.422 0.544 0.0103 0.0421 0.578 0.297 0.78 0.451 0.28 0.0815 ...
$ danceability : num 0.552 0.515 0.542 0.726 0.725 0.752 0.647 0.747 0.724 0.758 ...
$ energy : num 0.65 0.479 0.853 0.554 0.321 0.488 0.309 0.458 0.647 0.665 ...
$ instrumentalness: num 2.75e-04 5.98e-03 0.00 0.00 0.00 9.11e-06 7.41e-06 0.00 0.00 1.57e-04 ...
$ speechiness : num 0.128 0.0261 0.0498 0.0917 0.323 0.0705 0.0366 0.303 0.0658 0.0774 ...
$ valence : num 0.316 0.284 0.37 0.335 0.319 0.533 0.195 0.47 0.435 0.643 ...
acousticness danceability energy instrumentalness
Min. :0.000147 Min. :0.258 Min. :0.1470 Min. :0.0000000
1st Qu.:0.035550 1st Qu.:0.609 1st Qu.:0.5380 1st Qu.:0.0000000
Median :0.125000 Median :0.698 Median :0.6560 Median :0.0000000
Mean :0.206544 Mean :0.692 Mean :0.6459 Mean :0.0038650
3rd Qu.:0.305500 3rd Qu.:0.780 3rd Qu.:0.7720 3rd Qu.:0.0000134
Max. :0.922000 Max. :0.964 Max. :0.9530 Max. :0.4330000
speechiness valence
Min. :0.0232 Min. :0.0379
1st Qu.:0.0441 1st Qu.:0.3210
Median :0.0739 Median :0.4720
Mean :0.1175 Mean :0.4837
3rd Qu.:0.1535 3rd Qu.:0.6390
Max. :0.5650 Max. :0.9690
Based on the summary above, we need to scale the variables in spot.num before running PCA (Principal Component Analysis). PCA is sensitive to the variance of each variable: on unscaled data, the variables with the largest variance dominate PC1, which then appears to summarize most of the information.
Since we don’t have a target variable, this analysis uses unsupervised learning, which focuses on exploring and analyzing patterns.
All six variables lie between 0 and 1, which seems okay, but instrumentalness has much smaller values than the other variables. So let’s re-scale them.
Data Pre-Processing/Scaling
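The scaling code is not shown; only the name of the result, spot.num.scale, appears in the PCA call. A minimal sketch of standardization with base R's scale(), assuming that is roughly what was done (the toy data frame and its values are made up):

```r
# Toy stand-in for spot.num: two columns on very different scales.
toy <- data.frame(
  energy           = c(0.65, 0.48, 0.85, 0.55),
  instrumentalness = c(2.75e-04, 5.98e-03, 0, 1.57e-04)
)

# scale() centers each column to mean 0 and rescales it to standard deviation 1,
# so no single variable dominates just because of its units or range.
toy_scaled <- scale(toy)

colMeans(toy_scaled)       # approximately 0 for every column
apply(toy_scaled, 2, sd)   # 1 for every column
```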
PCA
Call:
PCA(X = spot.num.scale, scale.unit = F)
Eigenvalues
Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6
Variance 1.783 1.296 0.977 0.865 0.720 0.355
% of var. 29.742 21.610 16.304 14.420 12.011 5.914
Cumulative % of var. 29.742 51.351 67.656 82.076 94.086 100.000
Individuals (the 10 first)
Dist Dim.1 ctr cos2 Dim.2 ctr cos2
1 | 1.666 | -1.149 0.060 0.476 | -0.588 0.022 0.125 |
2 | 2.641 | -2.316 0.243 0.769 | -1.098 0.075 0.173 |
3 | 2.124 | 0.682 0.021 0.103 | -1.592 0.158 0.562 |
4 | 1.233 | -0.207 0.002 0.028 | 0.219 0.003 0.032 |
5 | 3.401 | -2.304 0.240 0.459 | 2.242 0.313 0.435 |
Dim.3 ctr cos2
1 -0.513 0.022 0.095 |
2 -0.505 0.021 0.037 |
3 -0.229 0.004 0.012 |
4 -0.043 0.000 0.001 |
5 -0.425 0.015 0.016 |
[ reached getOption("max.print") -- omitted 5 rows ]
Variables
Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3 ctr
acousticness | -0.712 28.465 0.508 | 0.003 0.001 0.000 | -0.120 1.485
danceability | 0.328 6.017 0.107 | 0.717 39.719 0.515 | 0.228 5.323
energy | 0.794 35.321 0.630 | -0.396 12.113 0.157 | 0.067 0.456
instrumentalness | -0.230 2.966 0.053 | -0.151 1.750 0.023 | 0.952 92.724
speechiness | 0.072 0.290 0.005 | 0.775 46.405 0.602 | 0.008 0.007
cos2
acousticness 0.015 |
danceability 0.052 |
energy 0.004 |
instrumentalness 0.907 |
speechiness 0.000 |
[ reached getOption("max.print") -- omitted 1 row ]
Variables summarized by each component (the valence row is cut off in the printed output, so it is not assessed here):
- PC1: energy and acousticness, which point in opposite directions (coordinates 0.794 and -0.712).
- PC2: speechiness and danceability (coordinates 0.775 and 0.717), with a smaller negative contribution from energy.
- PC3: almost entirely instrumentalness (contribution 92.7%).
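To make the eigenvalue table easier to read, here is a small sketch of how the "% of var." and cumulative rows are derived. It uses base R's prcomp() as a stand-in for the FactoMineR PCA() call shown above, on synthetic data, so the numbers will not match the Spotify results:

```r
set.seed(1)
# Synthetic stand-in for spot.num.scale: 100 rows, 3 already-scaled columns.
x <- scale(matrix(rnorm(300), ncol = 3,
                  dimnames = list(NULL, c("a", "b", "c"))))

# Centering and scaling are switched off because x is already scaled
# (this mirrors scale.unit = F in the call above).
pca <- prcomp(x, center = FALSE, scale. = FALSE)

variance   <- pca$sdev^2                       # eigenvalues
pct_var    <- 100 * variance / sum(variance)   # "% of var."
cumulative <- cumsum(pct_var)                  # "Cumulative % of var."

round(rbind(variance, pct_var, cumulative), 3)
pca$rotation   # loadings: which variables drive each component
```

Reading the loadings column by column is the same exercise as reading the Variables table above: large absolute values mark the variables that a component mostly summarizes.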
Choosing Optimum K
After scaling the data and running PCA, the next step is to find the optimum number of clusters for K-means. Use the kmeansTunning() function to find the optimum K with the elbow method.
Based on the elbow plot, I would say the optimal number of clusters (K) is... 7.
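The kmeansTunning() function is not shown in this write-up. As an illustration of the elbow method only, here is a hypothetical helper, elbow_wss(), which is not the author's function. It records the total within-cluster sum of squares for each K on synthetic data with three obvious groups:

```r
# Hypothetical helper (not the author's kmeansTunning()): total within-cluster
# sum of squares for K = 1..max_k.
elbow_wss <- function(data, max_k = 10, seed = 101) {
  sapply(seq_len(max_k), function(k) {
    set.seed(seed)
    kmeans(data, centers = k, nstart = 10)$tot.withinss
  })
}

# Synthetic data: three well-separated groups of 50 points each.
set.seed(1)
toy <- rbind(
  matrix(rnorm(100, mean = 0,  sd = 0.3), ncol = 2),
  matrix(rnorm(100, mean = 5,  sd = 0.3), ncol = 2),
  matrix(rnorm(100, mean = 10, sd = 0.3), ncol = 2)
)

wss <- elbow_wss(toy, max_k = 6)

# The "elbow" is the K after which adding clusters stops reducing wss much.
plot(seq_along(wss), wss, type = "b",
     xlab = "Number of clusters (K)", ylab = "Total within-cluster SS")
```

On real data the bend is rarely this sharp, which is why picking K from the plot involves some judgment.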
Building Cluster
With K=7, let’s run K-means clustering on the spot.num data and store the result as spot_cluster. Then extract the cluster assignments from the resulting K-means object using spot_cluster$cluster and add them as a new column named cluster to the spotify_80 dataset.
# set.seed to ensure reproducible example
set.seed(101)
# use kmeans (centers = number of clusters, which is 7)
spot_cluster <- kmeans(spot.num, centers = 7)
# show how many observations on each cluster
data.frame(cbind(cluster=c(1:7), observation=spot_cluster$size))
cluster observation
1 1 151
2 2 185
3 3 135
4 4 186
5 5 156
6 6 217
7 7 209
Extract cluster information and save it into the original spotify_80 with the column name cluster.
genre artist_name track_name track_id popularity
1234 Soul Alicia Keys No One 6IwKcFdiRQZOWeYNhUiWIv 80
1235 Soul Leona Lewis Bleeding Love 7wZUrN8oemZfsEd1CGkbXE 80
1236 Country Luke Combs Beautiful Crazy 2rxQMGVafnNaRaXlRMWPde 82
acousticness danceability duration_ms energy instrumentalness key liveness
1234 0.0209 0.644 253813 0.548 8.68e-06 C# 0.1340
1235 0.1880 0.638 262467 0.656 0.00e+00 F 0.1460
1236 0.6760 0.552 193200 0.402 0.00e+00 B 0.0928
loudness mode speechiness tempo time_signature valence cluster
1234 -5.416 Minor 0.0286 90.042 4/4 0.166 4
1235 -5.886 Major 0.0357 104.036 4/4 0.225 4
1236 -7.431 Major 0.0262 103.313 4/4 0.382 3
[ reached 'max' / getOption("max.print") -- omitted 3 rows ]
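The assignment step described above is not shown as code. A minimal sketch on toy data, where songs plays the role of spotify_80 and feats the role of spot.num (both names and all values are made up):

```r
set.seed(101)
# Toy stand-ins for the song table and its numeric features.
songs <- data.frame(track_name = paste0("song_", 1:6),
                    stringsAsFactors = FALSE)
feats <- data.frame(energy  = c(0.10, 0.15, 0.20, 0.80, 0.85, 0.90),
                    valence = c(0.20, 0.10, 0.15, 0.90, 0.80, 0.85))

fit <- kmeans(feats, centers = 2, nstart = 5)

# kmeans() returns one cluster label per input row, in row order,
# so the labels can be attached to the original table as a new column.
songs$cluster <- fit$cluster

tail(songs, 3)
```

This only works because the feature table and the song table have the same rows in the same order; filtering or sorting one of them in between would misalign the labels.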
Goodness of Fit
We can check it from 3 values:
- Within Sum of Squares (tot.withinss): the sum of squared distances from each observation to the centroid of its cluster
[1] 63.97722
- Total Sum of Squares (totss): the sum of squared distances from each observation to the global sample mean
[1] 184.2524
- Between Sum of Squares (betweenss): the sum of squared distances from each cluster centroid to the global sample mean, weighted by the number of observations in the cluster
[1] 120.2752
Another ‘goodness’ measure is the ratio betweenss/totss, shown here as a percentage (the closer the value is to 1, or 100%, the better):
[1] 65.27741
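The three quantities are linked: the total sum of squares splits into a within-cluster part and a between-cluster part. A short sketch that checks this on R's built-in iris data (a stand-in, not the Spotify data) and then recomputes the ratio from the values printed above:

```r
set.seed(101)
# Built-in iris measurements used purely as stand-in data.
fit <- kmeans(iris[, 1:4], centers = 3)

# totss = tot.withinss + betweenss holds for any kmeans fit.
all.equal(fit$totss, fit$tot.withinss + fit$betweenss)

# Share of the total variation explained by the clustering, in percent.
100 * fit$betweenss / fit$totss

# The same ratio from the values reported for the Spotify clustering:
ratio <- 100 * 120.2752 / 184.2524
ratio   # about 65.28
```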
Disclaimer
While 65% definitely does not look good, I previously tried adjusting several things, which produced results I found even less satisfactory (around 85%-96% betweenss/totss):
- the number of variables: it seems that when I included every numeric/integer variable, betweenss/totss kept getting higher. Hence, I omitted some of the variables at the beginning of my analysis.
- filtered/not filtered: while analyzing around 200,000 observations sounds sophisticated, I found that too much data clutters the visualization (plus, in this instance, I wanted to get familiar with just the popular songs, so I filtered the data down).
Answer
So, to answer the initial question about the song characteristics of each cluster:
spotify_80 %>%
group_by(cluster) %>%
summarise_all(mean) %>%
select(cluster, acousticness, danceability, energy, instrumentalness, speechiness, valence)
# A tibble: 7 x 7
cluster acousticness danceability energy instrumentalness speechiness valence
<int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 0.0853 0.840 0.511 0.00449 0.189 0.341
2 2 0.0779 0.597 0.797 0.00327 0.0745 0.445
3 3 0.645 0.595 0.410 0.0124 0.103 0.276
4 4 0.144 0.608 0.610 0.00492 0.0833 0.238
5 5 0.433 0.716 0.606 0.000922 0.114 0.553
6 6 0.148 0.715 0.774 0.00294 0.102 0.808
7 7 0.0726 0.765 0.691 0.000665 0.163 0.586
The characteristics:
- cluster 1: Highest danceability, highest speechiness
- cluster 2: Highest energy, lowest speechiness
- cluster 3: Highest acousticness, lowest energy, highest instrumentalness
- cluster 4: Lowest valence
- cluster 5: Second-highest acousticness, very low instrumentalness
- cluster 6: Highest valence
- cluster 7: Lowest acousticness, lowest instrumentalness
Usage
# Find out the cluster of a favorite song
spotify_80 %>%
  filter(artist_name == "AC/DC", track_name == "Highway to Hell")
genre artist_name track_name track_id popularity
1 Rock AC/DC Highway to Hell 2zYzyRzz6pRmhPzyfMEC8s 84
acousticness danceability duration_ms energy instrumentalness key liveness
1 0.0591 0.573 208400 0.913 0.00173 F# 0.156
loudness mode speechiness tempo time_signature valence cluster
1 -4.793 Minor 0.132 115.715 4/4 0.422 2
Say I like “Highway to Hell” and want to branch out to another genre (e.g. R&B). That means I should try these R&B songs from the same cluster:
genre artist_name track_name
1 R&B Jason Derulo Goodbye (feat. Nicki Minaj & Willy William)
2 R&B Rita Ora Anywhere
3 R&B Tory Lanez TAlk tO Me (with Rich The Kid feat. Lil Wayne) - Remix
track_id popularity acousticness danceability duration_ms
1 5Xn4IyTtW6FUGIUyWjbUHG 82 0.0776 0.643 195419
2 7EI6Iki24tBHAMxtb4xQN2 80 0.0364 0.628 215064
3 51w9Jbat1pLeWINGCpnQUR 84 0.0226 0.698 248521
energy instrumentalness key liveness loudness mode speechiness tempo
1 0.904 0 C 0.1890 -3.694 Major 0.0739 103.028
2 0.797 0 B 0.1040 -3.953 Minor 0.0596 106.930
3 0.660 0 C 0.0622 -7.883 Major 0.0520 159.949
time_signature valence cluster
1 4/4 0.481 2
2 4/4 0.321 2
3 4/4 0.451 2
[ reached 'max' / getOption("max.print") -- omitted 2 rows ]
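The lookup behind this recommendation is not shown. One way it could be sketched, using base R and a made-up toy table in place of spotify_80 (all rows and names here are hypothetical):

```r
# Toy stand-in for spotify_80 with its cluster column (made-up rows).
songs <- data.frame(
  genre      = c("Rock", "R&B", "R&B", "Pop"),
  track_name = c("seed song", "candidate 1", "candidate 2", "candidate 3"),
  cluster    = c(2L, 2L, 5L, 2L),
  stringsAsFactors = FALSE
)

# Look up the cluster of the song I already like...
seed_cluster <- songs$cluster[songs$track_name == "seed song"]

# ...then keep songs from the target genre that share that cluster.
recommendations <- subset(songs, genre == "R&B" & cluster == seed_cluster)
recommendations$track_name
```

Matching on cluster alone treats every song in a cluster as equally similar; ranking candidates by distance to the seed song's features would be a natural refinement.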
Assuming the model is right, now I have some new popular R&B songs to listen to!