General Brief

This is a Learning By Building project for the Unsupervised Learning (UL) course. The data used for this analysis is the Spotify Tracks dataset, which is available on Kaggle. It contains several columns that describe the profile of each track, and we will use them to build clusters. The analysis is split into 6 parts:

  1. Pre-processing Data
  2. Data Exploration
  3. Clustering
  4. Interpretation
  5. Model Evaluation
  6. Conclusion

Let’s get started!


Main Objective

In this report, we role-play as a data scientist who wants to cluster popular songs. Our main objective is to build playlists of songs with similar characteristics for our users.

Library Setup

In this report, we will use the following libraries:

# import libraries
library(tidyverse)
library(glue)
library(plotly)
library(FactoMineR)
library(factoextra)

1 Pre-processing Data

In this part, we read our data and turn it into “clean data”.

1.1 Read The Data

First, let’s read our data and assign it to spotify.

spotify <- read.csv("data_input/SpotifyFeatures.csv")
spotify

The columns are described below. The descriptions come from the Spotify for Developers website; please refer to it for more detail.

  • acousticness: A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.
  • danceability: Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.
  • duration_ms: The duration of the track in milliseconds.
  • energy: Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity.
  • liveness: Detects the presence of an audience in the recording.
  • instrumentalness: Predicts whether a track contains no vocals.
  • loudness: The overall loudness of a track in decibels (dB).
  • speechiness: Speechiness detects the presence of spoken words in a track.
  • valence: A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track.
  • time_signature: An estimated time signature.
  • tempo: The overall estimated tempo of a track in beats per minute (BPM).
  • key: The key the track is in.

1.2 Data Cleansing

1.2.1 Data Type

Check for mismatched data types in our data.

# check data types for each column
str(spotify)
#> 'data.frame':    232725 obs. of  18 variables:
#>  $ ï..genre        : chr  "Movie" "Movie" "Movie" "Movie" ...
#>  $ artist_name     : chr  "Henri Salvador" "Martin & les fées" "Joseph Williams" "Henri Salvador" ...
#>  $ track_name      : chr  "C'est beau de faire un Show" "Perdu d'avance (par Gad Elmaleh)" "Don't Let Me Be Lonely Tonight" "Dis-moi Monsieur Gordon Cooper" ...
#>  $ track_id        : chr  "0BRjO6ga9RKCKjfDqeFgWV" "0BjC1NfoEOOusryehmNudP" "0CoSDzoNIKCRs124s9uTVy" "0Gc6TVm52BwZD07Ki6tIvf" ...
#>  $ popularity      : int  0 1 3 0 4 0 2 15 0 10 ...
#>  $ acousticness    : num  0.611 0.246 0.952 0.703 0.95 0.749 0.344 0.939 0.00104 0.319 ...
#>  $ danceability    : num  0.389 0.59 0.663 0.24 0.331 0.578 0.703 0.416 0.734 0.598 ...
#>  $ duration_ms     : int  99373 137373 170267 152427 82625 160627 212293 240067 226200 152694 ...
#>  $ energy          : num  0.91 0.737 0.131 0.326 0.225 0.0948 0.27 0.269 0.481 0.705 ...
#>  $ instrumentalness: num  0 0 0 0 0.123 0 0 0 0.00086 0.00125 ...
#>  $ key             : chr  "C#" "F#" "C" "C#" ...
#>  $ liveness        : num  0.346 0.151 0.103 0.0985 0.202 0.107 0.105 0.113 0.0765 0.349 ...
#>  $ loudness        : num  -1.83 -5.56 -13.88 -12.18 -21.15 ...
#>  $ mode            : chr  "Major" "Minor" "Minor" "Major" ...
#>  $ speechiness     : num  0.0525 0.0868 0.0362 0.0395 0.0456 0.143 0.953 0.0286 0.046 0.0281 ...
#>  $ tempo           : num  167 174 99.5 171.8 140.6 ...
#>  $ time_signature  : chr  "4/4" "4/4" "5/4" "4/4" ...
#>  $ valence         : num  0.814 0.816 0.368 0.227 0.39 0.358 0.533 0.274 0.765 0.718 ...

Several columns have the wrong data type. We will also rename ï..genre to genre to make the analysis easier.

  • genre (read in as ï..genre), key, and mode should be factors.
# rename genre column name

spotify <- spotify %>% 
  rename(genre = ï..genre)

# changing data type
spotify_clean <- spotify %>%
  mutate_at(
    vars(
      genre,
      key,
      mode
    ),
    as.factor
  )

str(spotify_clean)
#> 'data.frame':    232725 obs. of  18 variables:
#>  $ genre           : Factor w/ 27 levels "A Capella","Alternative",..: 16 16 16 16 16 16 16 16 16 16 ...
#>  $ artist_name     : chr  "Henri Salvador" "Martin & les fées" "Joseph Williams" "Henri Salvador" ...
#>  $ track_name      : chr  "C'est beau de faire un Show" "Perdu d'avance (par Gad Elmaleh)" "Don't Let Me Be Lonely Tonight" "Dis-moi Monsieur Gordon Cooper" ...
#>  $ track_id        : chr  "0BRjO6ga9RKCKjfDqeFgWV" "0BjC1NfoEOOusryehmNudP" "0CoSDzoNIKCRs124s9uTVy" "0Gc6TVm52BwZD07Ki6tIvf" ...
#>  $ popularity      : int  0 1 3 0 4 0 2 15 0 10 ...
#>  $ acousticness    : num  0.611 0.246 0.952 0.703 0.95 0.749 0.344 0.939 0.00104 0.319 ...
#>  $ danceability    : num  0.389 0.59 0.663 0.24 0.331 0.578 0.703 0.416 0.734 0.598 ...
#>  $ duration_ms     : int  99373 137373 170267 152427 82625 160627 212293 240067 226200 152694 ...
#>  $ energy          : num  0.91 0.737 0.131 0.326 0.225 0.0948 0.27 0.269 0.481 0.705 ...
#>  $ instrumentalness: num  0 0 0 0 0.123 0 0 0 0.00086 0.00125 ...
#>  $ key             : Factor w/ 12 levels "A","A#","B","C",..: 5 10 4 5 9 5 5 10 4 11 ...
#>  $ liveness        : num  0.346 0.151 0.103 0.0985 0.202 0.107 0.105 0.113 0.0765 0.349 ...
#>  $ loudness        : num  -1.83 -5.56 -13.88 -12.18 -21.15 ...
#>  $ mode            : Factor w/ 2 levels "Major","Minor": 1 2 2 1 1 1 1 1 1 1 ...
#>  $ speechiness     : num  0.0525 0.0868 0.0362 0.0395 0.0456 0.143 0.953 0.0286 0.046 0.0281 ...
#>  $ tempo           : num  167 174 99.5 171.8 140.6 ...
#>  $ time_signature  : chr  "4/4" "4/4" "5/4" "4/4" ...
#>  $ valence         : num  0.814 0.816 0.368 0.227 0.39 0.358 0.533 0.274 0.765 0.718 ...

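genre, key, and mode are now stored as factors. time_signature is still a character column, but the character columns are dropped before the PCA and clustering below.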
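
The ï.. prefix on the genre column is what a UTF-8 byte-order mark (BOM) at the start of a CSV file typically looks like when the file is read in a non-UTF-8 locale. As an alternative to renaming the column afterwards, the sketch below reads a small BOM-prefixed file with fileEncoding = "UTF-8-BOM". The temporary file and its two columns are made up for illustration, and whether SpotifyFeatures.csv needs this depends on the machine it is read on.

```r
# Build a tiny CSV that starts with a UTF-8 BOM (EF BB BF); illustrative data only
tmp <- tempfile(fileext = ".csv")
con <- file(tmp, open = "wb")
writeBin(as.raw(c(0xEF, 0xBB, 0xBF)), con)
writeBin(charToRaw("genre,popularity\nMovie,0\nPop,80\n"), con)
close(con)

# "UTF-8-BOM" tells R to drop the byte-order mark while reading
demo <- read.csv(tmp, fileEncoding = "UTF-8-BOM")
names(demo)
```

With this option the first column should come back as genre, so the rename step would not be needed.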

1.2.2 Missing Value

# check missing value
colSums(is.na(spotify_clean))
#>            genre      artist_name       track_name         track_id 
#>                0                0                0                0 
#>       popularity     acousticness     danceability      duration_ms 
#>                0                0                0                0 
#>           energy instrumentalness              key         liveness 
#>                0                0                0                0 
#>         loudness             mode      speechiness            tempo 
#>                0                0                0                0 
#>   time_signature          valence 
#>                0                0

There are no missing values in our data.

2 Data Exploration

Now that we have clean data, we can explore it. We start with a general overview of the data using summary.

2.1 Data Summary

# data summary
summary(spotify_clean)
#>         genre        artist_name         track_name          track_id        
#>  Comedy    :  9681   Length:232725      Length:232725      Length:232725     
#>  Soundtrack:  9646   Class :character   Class :character   Class :character  
#>  Indie     :  9543   Mode  :character   Mode  :character   Mode  :character  
#>  Jazz      :  9441                                                           
#>  Pop       :  9386                                                           
#>  Electronic:  9377                                                           
#>  (Other)   :175651                                                           
#>    popularity      acousticness     danceability     duration_ms     
#>  Min.   :  0.00   Min.   :0.0000   Min.   :0.0569   Min.   :  15387  
#>  1st Qu.: 29.00   1st Qu.:0.0376   1st Qu.:0.4350   1st Qu.: 182857  
#>  Median : 43.00   Median :0.2320   Median :0.5710   Median : 220427  
#>  Mean   : 41.13   Mean   :0.3686   Mean   :0.5544   Mean   : 235122  
#>  3rd Qu.: 55.00   3rd Qu.:0.7220   3rd Qu.:0.6920   3rd Qu.: 265768  
#>  Max.   :100.00   Max.   :0.9960   Max.   :0.9890   Max.   :5552917  
#>                                                                      
#>      energy          instrumentalness         key           liveness      
#>  Min.   :0.0000203   Min.   :0.0000000   C      :27583   Min.   :0.00967  
#>  1st Qu.:0.3850000   1st Qu.:0.0000000   G      :26390   1st Qu.:0.09740  
#>  Median :0.6050000   Median :0.0000443   D      :24077   Median :0.12800  
#>  Mean   :0.5709577   Mean   :0.1483012   C#     :23201   Mean   :0.21501  
#>  3rd Qu.:0.7870000   3rd Qu.:0.0358000   A      :22671   3rd Qu.:0.26400  
#>  Max.   :0.9990000   Max.   :0.9990000   F      :20279   Max.   :1.00000  
#>                                          (Other):88524                    
#>     loudness          mode         speechiness         tempo       
#>  Min.   :-52.457   Major:151744   Min.   :0.0222   Min.   : 30.38  
#>  1st Qu.:-11.771   Minor: 80981   1st Qu.:0.0367   1st Qu.: 92.96  
#>  Median : -7.762                  Median :0.0501   Median :115.78  
#>  Mean   : -9.570                  Mean   :0.1208   Mean   :117.67  
#>  3rd Qu.: -5.501                  3rd Qu.:0.1050   3rd Qu.:139.05  
#>  Max.   :  3.744                  Max.   :0.9670   Max.   :242.90  
#>                                                                    
#>  time_signature        valence      
#>  Length:232725      Min.   :0.0000  
#>  Class :character   1st Qu.:0.2370  
#>  Mode  :character   Median :0.4440  
#>                     Mean   :0.4549  
#>                     3rd Qu.:0.6600  
#>                     Max.   :1.0000  
#> 

From the summary above, we know that:

  • There are more songs in Major than in Minor mode (151,744 vs. 80,981).
  • Our data has 232,725 rows and 18 columns. This is a large dataset, so we will reduce its dimension while keeping as much information as possible.

2.3 PCA

The data used from here on (observation) has 5128 rows and 18 columns. We want to reduce its dimension while losing as little information as possible, and Principal Component Analysis (PCA) lets us do that. We want to keep at least 85% of the information (variance) in our data.

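# names of the character columns
character <- observation %>%
  select_if(is.character) %>%
  colnames()

# positions of the character columns
character_var <- which(colnames(observation) %in% character)

# observation without character columns
# all_of() makes it explicit that character_var is an external vector
observation_nochr <- observation %>% 
  select(-all_of(character_var))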

# numeric data
observation_numeric <- 
observation_nochr %>% 
  select_if(is.numeric)

# select numeric column
quanti <- observation_nochr %>%
  select_if(is.numeric) %>%
  colnames()

quantivar <- which(colnames(observation_nochr) %in% quanti)

# select categorical column
quali <- observation_nochr %>% 
  select_if(is.factor) %>% 
  colnames()

qualivar <- which(colnames(observation_nochr) %in% quali)
# PCA with FactoMineR
spotify_pca <- PCA(X = observation_nochr,
    scale.unit = T,
    quali.sup = qualivar,
    graph = F)
summary(spotify_pca)
#> 
#> Call:
#> PCA(X = observation_nochr, scale.unit = T, quali.sup = qualivar,  
#>      graph = F) 
#> 
#> 
#> Eigenvalues
#>                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6   Dim.7
#> Variance               2.324   1.387   1.184   1.040   1.023   0.955   0.891
#> % of var.             21.124  12.611  10.763   9.454   9.297   8.678   8.099
#> Cumulative % of var.  21.124  33.735  44.498  53.952  63.249  71.927  80.026
#>                        Dim.8   Dim.9  Dim.10  Dim.11
#> Variance               0.816   0.656   0.502   0.223
#> % of var.              7.420   5.965   4.564   2.025
#> Cumulative % of var.  87.446  93.411  97.975 100.000
#> 
#> Individuals (the 10 first)
#>                      Dist    Dim.1    ctr   cos2    Dim.2    ctr   cos2  
#> 1                |  5.283 |  0.101  0.000  0.000 |  1.597  0.036  0.091 |
#> 2                |  6.596 | -3.447  0.100  0.273 |  3.270  0.150  0.246 |
#> 3                |  4.648 | -0.843  0.006  0.033 |  1.968  0.054  0.179 |
#> 4                |  5.294 | -3.250  0.089  0.377 |  1.231  0.021  0.054 |
#> 5                |  4.390 | -1.383  0.016  0.099 |  2.407  0.081  0.301 |
#> 6                |  4.117 | -0.177  0.000  0.002 |  1.438  0.029  0.122 |
#> 7                |  3.570 |  0.977  0.008  0.075 |  1.150  0.019  0.104 |
#> 8                |  3.930 |  0.431  0.002  0.012 |  1.154  0.019  0.086 |
#> 9                |  4.475 | -1.458  0.018  0.106 |  1.380  0.027  0.095 |
#> 10               |  4.522 | -0.925  0.007  0.042 |  0.477  0.003  0.011 |
#>                   Dim.3    ctr   cos2  
#> 1                -0.410  0.003  0.006 |
#> 2                -0.913  0.014  0.019 |
#> 3                -1.227  0.025  0.070 |
#> 4                -1.997  0.066  0.142 |
#> 5                -0.340  0.002  0.006 |
#> 6                -1.740  0.050  0.179 |
#> 7                -0.323  0.002  0.008 |
#> 8                -0.724  0.009  0.034 |
#> 9                -0.394  0.003  0.008 |
#> 10                0.520  0.004  0.013 |
#> 
#> Variables (the 10 first)
#>                     Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3    ctr
#> popularity       | -0.002  0.000  0.000 |  0.344  8.531  0.118 | -0.240  4.852
#> acousticness     | -0.680 19.894  0.462 |  0.003  0.001  0.000 | -0.147  1.826
#> danceability     |  0.148  0.946  0.022 |  0.787 44.608  0.619 |  0.049  0.200
#> duration_ms      | -0.070  0.212  0.005 | -0.509 18.677  0.259 |  0.104  0.914
#> energy           |  0.891 34.149  0.793 | -0.198  2.824  0.039 | -0.036  0.108
#> instrumentalness | -0.106  0.485  0.011 | -0.204  3.011  0.042 | -0.037  0.113
#> liveness         |  0.205  1.802  0.042 | -0.229  3.795  0.053 |  0.375 11.895
#> loudness         |  0.827 29.402  0.683 | -0.094  0.631  0.009 | -0.145  1.770
#> speechiness      | -0.050  0.107  0.002 |  0.363  9.496  0.132 |  0.761 48.932
#> tempo            |  0.123  0.647  0.015 | -0.146  1.546  0.021 |  0.578 28.245
#>                    cos2  
#> popularity        0.057 |
#> acousticness      0.022 |
#> danceability      0.002 |
#> duration_ms       0.011 |
#> energy            0.001 |
#> instrumentalness  0.001 |
#> liveness          0.141 |
#> loudness          0.021 |
#> speechiness       0.579 |
#> tempo             0.334 |
#> 
#> Supplementary categories (the 10 first)
#>                      Dist    Dim.1   cos2 v.test    Dim.2   cos2 v.test  
#> Dance            |  0.501 |  0.351  0.490  8.028 | -0.132  0.069 -3.902 |
#> Pop              |  0.192 | -0.072  0.141 -3.741 | -0.109  0.323 -7.325 |
#> Rap              |  0.628 | -0.106  0.029 -2.960 |  0.329  0.274 11.818 |
#> A                |  0.234 |  0.056  0.056  0.713 |  0.060  0.065  0.995 |
#> A#               |  0.196 | -0.120  0.374 -1.566 |  0.089  0.206  1.502 |
#> B                |  0.276 |  0.161  0.342  2.392 |  0.125  0.204  2.394 |
#> C                |  0.324 | -0.166  0.262 -2.737 | -0.203  0.393 -4.338 |
#> C#               |  0.370 |  0.058  0.025  1.160 |  0.271  0.537  6.976 |
#> D                |  0.301 |  0.004  0.000  0.058 | -0.216  0.517 -3.935 |
#> D#               |  0.560 | -0.100  0.032 -0.726 | -0.221  0.156 -2.081 |
#>                   Dim.3   cos2 v.test  
#> Dance            -0.312  0.387 -9.988 |
#> Pop              -0.103  0.289 -7.497 |
#> Rap               0.449  0.511 17.478 |
#> A                 0.059  0.064  1.065 |
#> A#                0.004  0.000  0.079 |
#> B                 0.097  0.124  2.020 |
#> C                -0.162  0.251 -3.753 |
#> C#                0.164  0.197  4.578 |
#> D                 0.118  0.154  2.324 |
#> D#               -0.424  0.574 -4.322 |
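
The 85% target can be checked against the eigenvalue table in the summary above. The following sketch copies the eleven printed variances by hand (so they are rounded to three decimals) and looks for the first dimension at which the cumulative share of variance reaches 85%. The variable names are illustrative.

```r
# Variances (eigenvalues) copied from the "Eigenvalues" table above
eig <- c(2.324, 1.387, 1.184, 1.040, 1.023, 0.955,
         0.891, 0.816, 0.656, 0.502, 0.223)

# cumulative percentage of variance explained
cum_pct <- 100 * cumsum(eig) / sum(eig)
round(cum_pct, 1)

# first dimension whose cumulative share reaches the 85% target
n_keep <- which(cum_pct >= 85)[1]
n_keep  # 8
```

Seven dimensions explain about 80% of the variance and eight explain about 87%, so eight components would be needed to meet the target.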

2.3.1 PCA Observation

plot.PCA(x = spotify_pca,
         choix = "ind",
         invisible = "quali",
         select = "contrib 5",
         habillage = 1)

The plot labels the five observations that contribute most to the first two dimensions: 3703, 367, 4695, 2881, and 4332. We treat them as outliers and remove them before clustering.
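
As a rough illustration of the idea, the sketch below ranks observations by their squared distance from the origin in the plane of the first two components and drops the five most extreme ones by position. It is only a stand-in: it uses the built-in USArrests data and base-R prcomp instead of the report's observation data and FactoMineR object, and it is not claimed to reproduce exactly how the plot picks its labels.

```r
# Illustrative only: USArrests and prcomp stand in for the report's data
pca_demo <- prcomp(USArrests, scale. = TRUE)

# squared distance of each observation from the origin on PC1 and PC2
dist2 <- rowSums(pca_demo$x[, 1:2]^2)

# positions of the five most extreme observations in that plane
candidate <- order(dist2, decreasing = TRUE)[1:5]
rownames(USArrests)[candidate]

# drop them by position, in the same way as observation[-outlier, ]
usarrests_trimmed <- USArrests[-candidate, ]
nrow(usarrests_trimmed)
```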

fviz_contrib(X = spotify_pca,choice = "var", axes = 1)

energy, loudness, acousticness, and valence contribute the most to PC1.

fviz_contrib(X = spotify_pca,choice = "var", axes = 2)

danceability, duration_ms, and speechiness contribute the most to PC2.
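
The contributions can be cross-checked against the PCA summary. A variable's contribution to a dimension is its squared coordinate divided by that dimension's eigenvalue, expressed as a percentage. The sketch below applies this to the Dim.1 coordinates of the ten variables printed in the summary (valence is not printed there, so it is left out) and compares the result with 100 / 11, the contribution each of the 11 variables would have if they all contributed equally. Values are copied by hand, so small rounding differences are expected.

```r
# Dim.1 coordinates copied from the "Variables" table of the PCA summary
dim1_coord <- c(popularity = -0.002, acousticness = -0.680,
                danceability = 0.148, duration_ms = -0.070,
                energy = 0.891, instrumentalness = -0.106,
                liveness = 0.205, loudness = 0.827,
                speechiness = -0.050, tempo = 0.123)
eig1 <- 2.324  # variance of Dim.1 from the eigenvalue table

# contribution (%) = 100 * coord^2 / eigenvalue
contrib1 <- 100 * dim1_coord^2 / eig1
round(sort(contrib1, decreasing = TRUE), 1)

# variables above the equal-contribution level for 11 variables
names(contrib1)[contrib1 > 100 / 11]
```

Among the printed variables, energy, loudness, and acousticness exceed that level, which agrees with the ctr column of the summary.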

3 Clustering

To answer our main objective, we will cluster the songs.

# take out outlier
outlier <- c(3703, 367, 4695, 2881, 4332)
observation <- observation[-outlier,]
observation_numeric_clean <- observation_numeric[-outlier,]

# scale data
observation_numeric_scale <- 
  observation_numeric_clean %>%
    scale() %>%
    as.data.frame()
head(observation_numeric_scale)
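
Scaling matters here because k-means relies on Euclidean distance, and a column such as duration_ms (hundreds of thousands) would otherwise dominate columns that range from 0 to 1. The sketch below uses a made-up three-row data frame to show that scale() subtracts each column's mean and divides by its standard deviation.

```r
# Made-up values for illustration only
toy <- data.frame(duration_ms = c(180000, 240000, 300000),
                  energy      = c(0.2, 0.5, 0.8))

toy_scaled <- as.data.frame(scale(toy))

# the same result computed by hand for one column
manual <- (toy$duration_ms - mean(toy$duration_ms)) / sd(toy$duration_ms)
all.equal(toy_scaled$duration_ms, manual)

# after scaling, both columns are on the same footing
toy_scaled
```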

3.1 K-Means Clustering

RNGkind(sample.kind = "Rounding")
set.seed(100)

# k-optimum
fviz_nbclust(x = observation_numeric_scale,
             FUNcluster = kmeans,
             method = "wss",
             print.summary = T)

The largest reduction in total within-cluster sum of squares (WSS) is from 1 to 2 clusters. However, we want more than two track profiles, so we will use 3 clusters.
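
The quantity behind the elbow plot can also be computed by hand, which makes the size of each drop explicit. The sketch below is an assumption-laden stand-in: it uses the built-in iris measurements instead of the Spotify data, and nstart = 10 is an illustrative choice that is not part of the original analysis.

```r
set.seed(100)

# Illustrative data: the four numeric iris columns, scaled
x <- scale(iris[, 1:4])

# total within-cluster sum of squares for k = 1..6
wss <- sapply(1:6, function(k) {
  kmeans(x, centers = k, nstart = 10)$tot.withinss
})
round(wss, 1)

# reduction in WSS gained by each additional cluster
wss_drop <- -diff(wss)
names(wss_drop) <- paste0(1:5, "->", 2:6)
round(wss_drop, 1)
```

The elbow is the point after which the reductions become small compared with the earlier ones.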

RNGkind(sample.kind = "Rounding")
set.seed(150)
# fit k-means with 3 clusters
spotify_clustering <-  kmeans(x = observation_numeric_scale,
                              centers = 3)

# visualization
fviz_cluster(object = spotify_clustering,
             data = observation_numeric_scale)+
  theme_minimal()

observation$cluster <- spotify_clustering$cluster
observation_numeric_clean$cluster <- spotify_clustering$cluster

4 Interpretation

agg_observation <- observation_numeric_clean %>%
  mutate(cluster = as.factor(cluster)) %>% 
  group_by(cluster) %>% 
  summarise_all(mean)
agg_observation

Cluster Profiling:

  • Cluster 1: low acousticness, medium energy, high danceability, low instrumentalness, medium liveness, and high speechiness. Songs in cluster 1 are not acoustic. Their high speechiness suggests lyric-heavy tracks (for example rap), and they may suit dance music.
  • Cluster 2: low acousticness, high energy, medium danceability, medium instrumentalness, high liveness, and high valence. Cluster 2 is filled with intense, live-sounding tracks and happy, cheerful songs.
  • Cluster 3: high acousticness, low energy, low danceability, low speechiness, high instrumentalness, and low valence. Cluster 3 is filled with calming instrumental music. These songs fit if you want calming music without a lot of lyrics.

We can use these cluster profiles as a guideline for creating playlists. For example, suppose I want a group of cheerful pop songs.

observation %>% 
  filter(cluster == "2") %>% 
  filter(genre == "Pop")

And now you have 1443 song recommendations to create your own playlist.

5 Model Evaluation

To evaluate our model, we use the within-cluster sum of squares (withinss) and the ratio of the between-cluster sum of squares to the total sum of squares (betweenss / totss). A good clustering has low withinss and a betweenss to totss ratio close to 1.

spotify_clustering$withinss
#> [1] 12895.34 21434.82 10859.95
spotify_clustering$betweenss / spotify_clustering$totss
#> [1] 0.1979321

Our withinss is high, and the ratio (about 0.20) is far from 1, so there is still a lot to improve in our clustering model.
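
The two printed results are linked by the identity totss = tot.withinss + betweenss, so the total sum of squares can be recovered from them. The sketch below does that arithmetic and compares it with (n - 1) * p, the total sum of squares of a scaled data set. It assumes n = 5128 - 5 = 5123 rows after outlier removal and p = 11 numeric variables, which is inferred from the text and the PCA output rather than stated directly.

```r
# values copied from the output above
withinss <- c(12895.34, 21434.82, 10859.95)
ratio    <- 0.1979321          # betweenss / totss

tot_withinss <- sum(withinss)

# totss = tot.withinss + betweenss  =>  totss = tot.withinss / (1 - ratio)
totss     <- tot_withinss / (1 - ratio)
betweenss <- totss - tot_withinss
round(c(tot_withinss = tot_withinss, betweenss = betweenss, totss = totss))

# assumed dimensions: 5123 rows and 11 scaled numeric columns
n <- 5128 - 5
p <- 11
(n - 1) * p
```

Both routes give a total of about 56342, so roughly 80% of the variation in the scaled data remains inside the clusters.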

6 Conclusion

To sum up our analysis, let’s return to our main objective. We have clustered popular tracks based on their sound characteristics, and we can build a list of song recommendations based on our clustering model.

According to the visualization, the clusters are not well separated (they still overlap), and the quantitative check (withinss and the betweenss to totss ratio) shows that our clustering is far from good.

Here are some recommendations that may improve our model:

  • Spend more time on data preparation; there may be more we can do to prepare the data before using it for clustering.
  • Try tuning the model with different K values (elbow method).