1 Background

1.1 Brief

A huge growth in technology development has led to a great shift in the streaming services industry. When it comes to the online music streaming platform, Spotify is undeniably one of the highest contributors in the music industry. Spotify is providing freemium services of digital music, podcast, and video service; providing user accesses to millions of songs and other content from creators all over the world. In this document, I am using Spotify Songs data obtained from TidyTuesday. The goal is to apply unsupervised learning so that we are able to cluster the songs based on its music components.

1.2 Libraries and Setup

These following packages are required in this notebook. Use install.packages() to install any packages that are not already downloaded and load them using library() function.

library(tidyverse)
library(plotly)
library(GGally)
library(cowplot)
library(FactoMineR)
library(factoextra)

2 Data Wrangling

2.1 Data Inspection

spotify_songs <- read.csv("data_input/spotify_songs.csv")
head(spotify_songs)
glimpse(spotify_songs)
## Rows: 32,833
## Columns: 23
## $ track_id                 <chr> "6f807x0ima9a1j3VPbc7VN", "0r7CVbZTWZgbTCYdfa~
## $ track_name               <chr> "I Don't Care (with Justin Bieber) - Loud Lux~
## $ track_artist             <chr> "Ed Sheeran", "Maroon 5", "Zara Larsson", "Th~
## $ track_popularity         <int> 66, 67, 70, 60, 69, 67, 62, 69, 68, 67, 58, 6~
## $ track_album_id           <chr> "2oCs0DGTsRO98Gh5ZSl2Cx", "63rPSO264uRjW1X5E6~
## $ track_album_name         <chr> "I Don't Care (with Justin Bieber) [Loud Luxu~
## $ track_album_release_date <chr> "2019-06-14", "2019-12-13", "2019-07-05", "20~
## $ playlist_name            <chr> "Pop Remix", "Pop Remix", "Pop Remix", "Pop R~
## $ playlist_id              <chr> "37i9dQZF1DXcZDD7cfEKhW", "37i9dQZF1DXcZDD7cf~
## $ playlist_genre           <chr> "pop", "pop", "pop", "pop", "pop", "pop", "po~
## $ playlist_subgenre        <chr> "dance pop", "dance pop", "dance pop", "dance~
## $ danceability             <dbl> 0.748, 0.726, 0.675, 0.718, 0.650, 0.675, 0.4~
## $ energy                   <dbl> 0.916, 0.815, 0.931, 0.930, 0.833, 0.919, 0.8~
## $ key                      <int> 6, 11, 1, 7, 1, 8, 5, 4, 8, 2, 6, 8, 1, 5, 5,~
## $ loudness                 <dbl> -2.634, -4.969, -3.432, -3.778, -4.672, -5.38~
## $ mode                     <int> 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, ~
## $ speechiness              <dbl> 0.0583, 0.0373, 0.0742, 0.1020, 0.0359, 0.127~
## $ acousticness             <dbl> 0.10200, 0.07240, 0.07940, 0.02870, 0.08030, ~
## $ instrumentalness         <dbl> 0.00000000, 0.00421000, 0.00002330, 0.0000094~
## $ liveness                 <dbl> 0.0653, 0.3570, 0.1100, 0.2040, 0.0833, 0.143~
## $ valence                  <dbl> 0.518, 0.693, 0.613, 0.277, 0.725, 0.585, 0.1~
## $ tempo                    <dbl> 122.036, 99.972, 124.008, 121.956, 123.976, 1~
## $ duration_ms              <int> 194754, 162600, 176616, 169093, 189052, 16304~

Here are some informations about the features:

  • track_id: Song unique ID
  • track_name: Song name
  • track_artist: Song artist
  • track_popularity: Song popularity (0-100) where higher is better
  • track_album_id: Album unique ID
  • track_album_name: Song album name
  • track_album_release_date: Date when album released
  • playlist_name: Name of playlist
  • playlist_id: Playlist ID
  • playlist_genre: Playlist genre
  • playlist_subgenre: Playlist subgenre
  • danceability: How suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.
  • energy: A measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy.
  • key: The estimated overall key of the track. Integers map to pitches using standard Pitch Class notation . E.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1.
  • loudness: The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typical range between -60 and 0 db.
  • mode: Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.
  • speechiness: Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.
  • acousticness: A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.
  • instrumentalness: Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.
  • liveness: Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.
  • valence: A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).
  • tempo: The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.
  • duration_ms: Duration of song in milliseconds

We can see that there are 32.833 observations and 23 features in this dataset. We can see using glimpse(), some of the data types are incorrect, and there are some features that are insignificant such as IDs. Later on, we are going to cluster songs based on its music components. I do not think that album and playlist matters, since an album/playlist does not necessarily consist of songs with same genre. Furthermore, clustering songs based on its music components is like grouping songs based on its genre, so track popularity and duration will be removed since it has nothing to do with the type of the songs. So, I will remove all IDs, dates, track popularity, and everything related to album. There are some exceptional cases though, that a playlist was intentionally created by songs of same genres. So, considering these cases, I will keep it and drop the remaining features related to playlist.

spotify_songs <- spotify_songs %>% 
  select(-c(track_id, track_popularity, track_album_id, track_album_name, 
            track_album_release_date, playlist_name, playlist_id, 
            playlist_subgenre, duration_ms))

head(spotify_songs)

We are left with 14 features related to elements of songs. We can also see that 2 features are of incorrect types.

spotify_songs <- spotify_songs %>% 
  mutate(playlist_genre = as.factor(playlist_genre),
         mode = as.factor(mode)) %>% 
  relocate(mode, .after = playlist_genre)

Now, let’s check whether there is any missing and duplicated value.

2.2 Missing and Duplicated Values

Checking missing values in each features.

colSums(is.na(spotify_songs))
##       track_name     track_artist   playlist_genre             mode 
##                5                5                0                0 
##     danceability           energy              key         loudness 
##                0                0                0                0 
##      speechiness     acousticness instrumentalness         liveness 
##                0                0                0                0 
##          valence            tempo 
##                0                0

There are only 5 missing values in the data. Since there are only few missing values, we can just drop them.

spotify_songs <- na.omit(spotify_songs)

Now, checking whether there is any duplicated value in the data.

sum(duplicated(spotify_songs))
## [1] 2962
spotify_songs[duplicated(spotify_songs) | duplicated(spotify_songs, fromLast = T), ] %>% 
  arrange(track_name)

It is understandable that we have lots of duplicated values. The data should be obtained from different albums and playlists, while a song may exist in an album and multiple playlists. I will remove those duplicated values for the time being, below is the final cleaned data.

spotify_songs <- spotify_songs[!duplicated(spotify_songs), ]
spotify_songs

3 K - Means Clustering

3.1 Possibility for Clustering

K - Means Clustering is an algorithm to group data into k clusters. The main objective is to find the number of groups (k), for which all observations within a group are similar to each other. This algorithm works by iteratively moving centroids and assigning observations into a cluster based on their Euclidean Distance. Thus, it is necessary to scale the data, else the algorithm will favor features with higher scales.

spotify_z <- spotify_songs %>% 
  mutate_if(is.numeric, scale)

Let’s see how many unique playlist genres are there in the original dataset.

unique(spotify_songs$playlist_genre)
## [1] pop   rap   rock  latin r&b   edm  
## Levels: edm latin pop r&b rap rock

Assuming that the songs consist of only those 6 genres, let’s see whether there are any patterns in the distribution of data. This way, we may be able to see whether the data is suitable for clustering or not.

# Function for plotting 

plot_func <- function(start, end){
  plots <- list()
  j <- start - 1
  
  for(i in start:end){
    plots[[i-j]] <- ggplot(spotify_songs, 
                           mapping = aes_string(x = "playlist_genre",
                                                y = names(spotify_songs)[i],
                                                col = "playlist_genre")) +
      geom_boxplot() +
      labs(title = str_to_title(str_replace_all(names(spotify_songs)[i],
                                                "_", " "))) +
      theme_minimal() +
      theme(plot.title = element_text(hjust = 0.5),
            axis.title.x = element_blank(),
            legend.position = "none")
  }
  
  plot_grid(plotlist = plots, 
            ncol = 2,
            nrow = 2)
}

Looking at the plots above, in overall the music components in every genres look similar. Although there are some music components that might be the key difference in a genre, the difference to the others is insignificant. Additionally, most of the features have lots of outliers. I think that this data might not be suitable for clustering. But still, there are possibilities, that the music components in each genre are similar to the other genres because the data provide playlist genre instead of track genre. As I have mentioned before, a playlist may be composed of different genres of musics.

3.2 Optimum Number of Clusters

Using K - Means Clustering, we will see whether the the music components can be used to create clusters that resembles the music genres or not. Visualizing the number of total within sum of squares, between sum of squares, and the ratio of between sum of squares and total sum of squares for different number of clusters.

wss(select(spotify_z, is.numeric))

bss(select(spotify_z, is.numeric))

btratio(select(spotify_z, is.numeric))

From the plots above, I think the optimum number of k is 5, since increasing the number of k beyond 5 will not lead to a significant decrease in total within sum of squares, nor a significant increase in between sum of squares and \(\frac{Between Sum of Squares}{Total Sum of Squares}\) ratio. Using \(k = 5\) and visualizing the clustering result.

set.seed(147)
spotify_km <- kmeans(x = select(spotify_z, is.numeric),
                     centers = 5)

fviz_cluster(object = spotify_km,
             data = select(spotify_z, is.numeric),
             labelsize = 0,
             ggtheme = theme_minimal())

The plot above was constructed using principal components obtained from PCA. We will see about PCA later on. For now, just say that it actually uses 2 principle components to represent our data. However, it is not enough since those 2 dimensions can only represent 36.6% variance (informations) in the data. I will try to add 1 more dimension later on in the PCA section to provide better visualization. For now let’s move on to cluster profiling.

3.3 Cluster Profiling

Now, we will see the characteristics of each cluster.

# Add cluster into spotify data
spotify_songs$cluster <- as.factor(spotify_km$cluster)
spotify_z$cluster <- as.factor(spotify_km$cluster)

spotify_songs %>% 
  select(-c(track_name, track_artist, playlist_genre, mode)) %>% 
  group_by(cluster) %>% 
  summarise_all(.funs = "mean")

We can get some insights:

  • Cluster 1: loud and energetic tracks
  • Cluster 2: acoustic tracks
  • Cluster 3: instrumental tracks (tracks with little vocals/spoken words)
  • Cluster 4: positive (happy, cheerful) tracks suitable for dancing
  • Cluster 5: tracks that contain lots of spoken words, suitable for dancing

4 Principal Component Analysis

Principal Component Analysis (PCA) is commonly used to handle dimensionality problem, reducing the dimensionality of the data while retaining most of the informations. The main purpose of this PCA section, is actually to combine it with K-Means Clustering. This way, we can see the visualization of the clusters, and thus see whether the clustering method is suitable or not. Firstly, we want to see whether there are high correlations between numerical features. Strong correlation between numerical features indicates that we can reduce the number of features using PCA.

ggcorr(spotify_songs[, 5:15], 
       label = T, 
       label_size = 3, 
       hjust = 0.9, 
       size = 3, 
       layout.exp = 2)

From the correlation matrix, it can be seen that the correlations between numerical features are relatively small. This might indicates that for this data, PCA might reduce only a few features to retain as much information as possible. Let’s see how many features can we reduce to be able to retain at least 80% information in the data.

spotify_pca <- PCA(spotify_z[, 4:15],
                   quali.sup = c(1, 12),
                   ncp = 10,
                   graph = F)

spotify_pca$eig
##         eigenvalue percentage of variance cumulative percentage of variance
## comp 1   2.1635808              21.635808                          21.63581
## comp 2   1.5033304              15.033304                          36.66911
## comp 3   1.0930163              10.930163                          47.59928
## comp 4   1.0097256              10.097256                          57.69653
## comp 5   0.9772211               9.772211                          67.46874
## comp 6   0.9705000               9.705000                          77.17374
## comp 7   0.8583970               8.583970                          85.75771
## comp 8   0.6324344               6.324344                          92.08206
## comp 9   0.5665472               5.665472                          97.74753
## comp 10  0.2252471               2.252471                         100.00000

By using 7 principal components, I was able to reduce the number of dimensions from 12 into 7 while retaining ~85.76% information. This is to be expected, that PCA is can only remove only a few features since the correlation between numerical features are relatively low. I will store these 7 principal components in a new dataframe first. Later on, this new dataframe can be analyzed for other purposes.

spotify_df_pca <- spotify_songs %>% 
  select(c(track_name, track_artist, cluster)) %>% 
  cbind(spotify_pca$ind$coord[, 1:7]) %>% 
  relocate(cluster, .after = Dim.7)

spotify_df_pca

4.1 Combining PCA with K-Means Clustering

Actually, we have seen that PCA can be integrated with K - Means clustering to help visualize our result. We have also seen that especially for this data, 2 dimensions are too little to represent the data, and that was why in the plot, the clusters seem like intersecting with each other (below is the same plot that was used to visualize K - Means clustering).

fviz_cluster(object = spotify_km,
             data = select(spotify_z, is.numeric),
             labelsize = 0,
             ggtheme = theme_minimal())

Let’s see whether adding 1 more dimension can help visualize the clusters better or not.

spotify_df_pca %>% plot_ly(
  x = ~Dim.1, 
  y = ~Dim.2,
  z = ~Dim.3,
  color = ~cluster,
  colors = c("magenta", "red", "blue", "green", "black")) %>% 
  add_markers() %>% 
  layout(scene = list(xaxis = list(title = "Dim.1"), 
                      yaxis = list(title = "Dim.2"), 
                      zaxis = list(title = "Dim.3")))

By adding 1 more principle component to the plot, although it is still not enough to provide us a “good” visualization on the data clusters, we can see that the clusters seem to be better separated. This indicates that our data might still be clustered well, as the algorithm can put boundaries between each clusters so that they are not clumped together.

5 Conclusion

For this Spotify Songs dataset, we can group songs into 5 clusters based on its music components. We have also seen that PCA, despite its effectiveness in reducing the number of dimensions for this dataset, can help us visualize the clustering well. Based on the clustering plot using 3 principal components, the distinction between those 5 clusters seems pretty clear, indicating that we have done a good job at clustering the data. However, PCA cannot effectively reduce the number of dimensions on this dataset, since the correlations between numerical features are relatively weak. To retain around 85.76% information of the data, the number of features were only reduced by 5, which is not that significant.