1 Prelude

1.1 Fluorescent Adolescent

The cultural shift in how one consumes music has given birth to some poignant analytical features, such as Spotify Wrapped. For those who are not familiar with the term, it is basically a platform that animates some insight into the user’s listening habit¹. The insight is commonly known for its top 5 most streamed artists and songs over the past year. However, the 2019 version differs from its predecessors as Spotify gives us a glance on what we’ve been listening to for the past decade².

Chronologically, Spotify has not been around globally for exactly ten years. In fact, the streaming service first came to prominence in the United States around mid-2011³. Its ability to evoke nostalgia, however, attracts so many people that they began to subcribe as a user. By the end of 2019, Spotify has amassed up to 248 million listeners worldwide. With its catalog ranging from remastered tracks from the 90s to the current bop of the month, we can see why time ceases to exist with this one.

Like the other millions of dedicated users, I too partake in the decade wrapped hype and nostalgia. Although I did not question much regarding the decade result, the current 2019 rewind uncovered some new insight regarding my sentiment towards coming of age music.

This sentiment came as a bit of a shock for someone who was born after the peak period of the band known as Oasis. Still, looking back at the era of pre-teen music, their songs Don’t Look Back in Anger and Champagne Supernova really made an impression on my existentialism, so much so that I have a modern playlist dedicated to 4 of their songs. This playlist went on to include some of my personal favorites from childhood, middle school, and some current ones that defined my last years of coming of age. Safe to say, I went through a period of alternative rock with a little bit of pop, but mostly emo, with this one. You can click on and follow the ‘skating to adulthood’ playlist here.

1.2 thank u, what’s next?

Despite leaning towards a particular sound, my genre up to this point seems to be fluid. In 2019, this claim became apparent when Spotify Wrapped confirmed my assumption.

Looking at the chart, it seems reasonable that pop is my most liked genre considering how popular they have become. Several studies indicate that, psychologically, the repetitive aspect of pop music creates a safe and comfortable feeling that humans are often longing for⁴. Repetition invites the listeners as active participants as it does not take long until one’s feet begin to move following the rhythm of a song. Thus, no matter how quirky you are, one’s taste in music cannot beat the enjoyable repetition of any chart-topping pop song.

Personally, mainstream pop is best enjoyed as a form of guilty pleasure. Whether it’s to replace the radio in your car or to jam together with your friends, it never fails to disappoint. Even if the lyrics are sad, they are happy sad which could unconsciously elevate one’s mood. I relate to this common sense and took an obsessive liking during college. As a form of nostalgia, I decided a few months prior to indulge and compile some enjoyable bops into a ‘pumpkin spice latte’-titled playlist.

The name of the playlist refers to the basic drink that most caucasian women would stereotypically order. It should be noted that the name itself is not to be taken seriously as it is a sarcastic remark of how I, a yellow-brown dude, am embracing my inner “white girl”. A snippet of the playlist can be found, and be followed, here.

1.3 Am I Bored Yet?

One of the conscious reasons to treat mainstream music as a guilty pleasure is the lack of idiosyncrasy. Repetitiveness creates this looping earworm, but it could damage one’s conception of what good music is. Songs are now being valued based on how many people are familiar with the rhythm and by how smart the marketing team is in keeping the song “relevant”. Limiting oneself to mainstream pop music could lead someone to becoming an even more pretentious listener when it comes to artistry.

An alternative that unplugs the concept of mainstream pop, while being a derivation of it, is bedroom pop. This alternative was first noticed as a DIY genre by the Shows & Editorial team at Spotify⁵. Around June 2017, John Stein of Spotify identified some similarities within the DIY ethos. Most of the independent artists on his radar were creating lo-fi psychedelic music that somehow translates into relatable songs without having to step out of their bedroom. It was that level of authenticity that inspires him to support the culture and create a proper genre movement.

By the end of the 2017, Spotify had decided to make a playlist dedicated to bedroom pop. The playlist managed to snag 100,000 followers within a few short months and catapulted the genre into stardom. Currently, it opens with Are You Bored Yet? by Wallows and features 100 songs from other acts like Clairo, Rex Orange County, Dayglow, boy pablo, Peach Pit, Cuco, Cavetown, Still Woozy, SALES, and Gus Dapperton.

Fast forward to 2019, and yours truly decided to join the growing genre. The bedroom pop playlist inspired me to curate my own favorite songs with some slight adjustments. Personally, the artistry of bedroom pop is best enjoyed with some combination of indie pop sounds. The two are not mutually exclusively and they differ in production scope, but the genres coincide in their simplistic approach to creating melancholic rhythm. Thus, an overlapping playlist is justifiable to create. The result of this is ‘bedroom melancholia’, which features mainsteam indie melancholy and bedroom pop tracks. You can find and follow the playlist here.

Mixing bedroom pop with studio-produced indie tracks gave me perspective on the segregation of genres. I would like to think that I have overlooked the insane rigidity of genre matching by focusing more on smooth transition. However, it remains a difficult task to accomplish as I have been selfishly grouping songs into a genre-specific playlist; all this for a potential correlation belief between smooth transitions and genre.

1.4 Spotify Unwrapped

Indeed, my conflicted belief was acknowledged by the folks over at Spotify. Earlier on September 7, 2018, the music streaming giant quietly unveiled the Pollen playlist with a radical premise: it was not organized nor was it clustered by genre. Rather, the approach was consumer-based and very analytical⁶.

The same person listening to Clairo is also likely to be listening to Jpegmafia or Brockhampton. -Connor Lawrence, Chief Marketing Officer for the music analytics company Indify

The Pollen playlist marks the next step in Spotify’s attempts to connect audiences without genre, but it did present some challenges. Part of those challenges is getting used to the jump between tempos or drum programming, in which our ears have been conditioned to think are jarring. However, these challenges were quickly dismissed with Pollen’s rapid growth in listening time and followers. As of December 26, 2019, the playlist has over 820,000 followers and has given quite the exposure for smaller acts like Hope Tala and Still Woozy.

Moving between genre is not something we normally see from programmed offerings, so in the beginning, it did feel like there were some walls there. Now it feels natural. -John Stein, Co-Creator of Spotify’s Pollen Playlist

In an attempt to observe a more genre-less era of music, this publication will perform a set of clustering using the k-means method, as well as the principal component analysis (PCA). The objective is to answer whether a pattern exists within the clustered audio features of a chosen set of songs. This could provide an overall insight on how one would like their songs curated, especially for listeners who enjoy switching genres between songs. Additionally, the PCA parameter could give clustering insight on how the genres are spread amongst each clustered audio features.

2 Verse

Obtaining our internal data could take quite a bit of a toll for beginners. I, a non-programmer myself, found the following subsection to be the most challenging. However, with the help of the internet, those challenges were able to be reduced to mere complexity. In light of this accomplishment, I welcome you to copy and learn the following data preprocessing procedure.

2.1 Intro

Running installed packages is a necessary first step to create any R outcome. Thus, the following chunk displays the packages that will be called out for data preprocessing, with the # symbol explaining each use.

library(spotifyr)   #to connect R studio with the developer features
library(tidyr)      #to change and solve structural issues
library(tidyverse)  #an extension of the tidy package with an API design
library(dplyr)      #to provide flexible grammar for data manipulation
library(DT)         #to present data frame results as a data table

For the spotifyr package, there is something that needs to be addressed. Later on, one might encounter an error problem with the words ‘illegal scope’ popping up. In order to avoid this, it is best to install the package by writing on the R console. Code devtools::install_github('charlie86/spotifyr') should be generated in your console so R studio will install the package from the github network.

Once the installation has finished, the spotifyr package can now be utilized as these next few steps will heavily rely on the package. Perhaps the first noticeable value of this package is its ability to grant an access token. This token works as if the API is granting the Spotify user an access to extract whatever internal data that we need. Of course, the data might be limited to the songs and artists which they have listened to.

In order to get an access token approved, one must set up a developer account with Spotify to access their Web API⁷. This will give one a ‘Client ID’, which is different than your typical Spotify user ID or username, and ‘Client Secret’. The two values, especially Client Secret, should be kept under EXTREME LOCKDOWN. This means that the string of values should not be published to anyone. Once the two client values are obtained, you can pull your access token into R with the get_spotify_access_token() function.

To set up a developer account, one must log in to the website attached within these words. Once you are on the dashboard, click ‘CREATE A CLIENT ID` and state your project name. The ’What are you building?’ section needs to be filled so complete it with whatever option that you desire. I would recommend ‘Website’ and suggest anyone to avoid answering ‘I don’t know’. Then, the result of what you just made will appear in the dashboard. Click that green square and you will be redirected to the content. Within this content, you will get the ‘Client ID’ and ‘Client Secret’ similar to the following red rectangles.

After getting the ‘Client ID’ and ‘Client Secret’, it is best to store each of them as separate objects named id and secret respectively. To do so, the following # symbols can be deleted in order for the syntax to run.

# id <- "######of course i'm censoring this"
# secret <- "######lmao like i'm gonna publish this as well"

Before moving on to get the access token, go back to the developer dashboard and click ‘EDIT SETTINGS’. There is a section called ‘Redirect URIs’ which is apparently needed as a callback URL. If you are a pro in deconstructing URLs, you can pretty much set this to whatever you want. However, if you don’t know anything about how the world wide web works, it is best to put in http://localhost:1410/ like the image below. More information and explanation can be found on the offical Spotify Developer Guide.

Another thing before getting the access token, it is also wise to put or set our defined credentials in the system environment. This means that both id and secret needs to be stored as a value of the System Environment variables ‘SPOTIFY_CLIENT_ID’ and ‘SPOTIFY_CLIENT_SECRET’. The purpose of this is to speed up the authentication process, as well as efficiently running the get_spotify_access_token() argument later on.

Sys.setenv(SPOTIFY_CLIENT_ID = id)
Sys.setenv(SPOTIFY_CLIENT_SECRET = secret)

As teased in the previous paragraph, we will use get_spotify_access_token() to get the access token. Granted, the code will look like the following and you can run the object afterwards to obtain the token value.

access_token <- get_spotify_access_token()

I took an additional step, which could be useful later, by defining my user ID or my username. In other words, this is a subtle hint for the readers to follow me on Spotify.

my_username <- 'valdywiratama'

2.2 First Verse

Audio features are believed to be the main driver of a popular track and that is exactly what we are aiming to get. The spotifyr package has a great function called get_track_audio_features() which allows the user to extract the musicality behind each track. However, these audio features are not easily obtained.

As a start, the package has a maximum number of tracks. The function to get tracks from specific playlists is already included within spotifyr, but the maximum number of tracks you can get at once is set to 100. Therefore, we need to make a custom function that would allow us to get more than 100 tracks from the desired playlists. In order to create the function, the following codes should be generated, with each step being explained in the # symbol.

get_playlist_tracks_custom <- function(playlist_id_1) {
  
  # Step 1: Create a function to save the playlist information
  #         that we later want to extract.
  pp <- get_playlist(playlist_id_1)
    
    # Step 1a: Create an object that would help in saving the number
    #          of tracks within the playlist(s) that we've set.
    number_of_songs <- pp$tracks$total  

  # Step 2: Create a sequence from 0 to the total of `number_of_songs`
  #         while making sure that the limit is set to 100. The number
  #         of limit is given since Spotify only allows a maximum number of
  #         100 tracks in one extraction. Granted, a function that will
  #         loosen this regulation is needed.
  s <- seq(0,number_of_songs, 100)
  temp_s <- list()
  ind <- 1
  
  # Step 3: Finish the function by defining the offset. Offset itself
  #         is a pretty complicated term to elaborate, but it generally
  #         defines the index from which the number of tracks should be
  #         downloaded. In our scenario, the syntax in step 2 will allow
  #         the following index to download the first 100 tracks and then
  #         the rest should there be more than 100 tracks in total.
  for (i in s){
    temp_s[[ind]] <- get_playlist_tracks(playlist_id_1,
                                         offset = i)
    ind <- ind + 1
  }
  return(bind_rows(temp_s))
}

Function get_playlist_tracks_custom has now been stored, but it is not the only function that one needs to build. Another useful function can be developed in order to put all of the songs from multiple playlists into one data frame. The following syntax will show you how the coding works.

put_songs_from_playlists_into_df <- function(user_id, pl_names){
  
  # Step 1: Get all your playlists and then build a filter to select
  #         specific ones for later on.
  my_plists <- as_tibble(get_user_playlists(user_id, limit = 50))
  my_plists.filtered <- my_plists %>%
    filter(name %in% pl_names)
  
  # Step 2: Use the `map iterate` concept, which I am not familiar with,
  #         over `my_plists.filtered` from step 1 and get tracks for
  #         each playlist.
  tracks_from_chosen_playlists <-
    as_tibble(map_df(my_plists.filtered$id,
                     get_playlist_tracks_custom)) 

  # Step 3: Remove duplicates, if there's any, based on the track's
  #         resource identifier, also known as a variable of `track.uri`.
  tracks_from_chosen_playlists_filtered <-
    tracks_from_chosen_playlists %>%
    group_by(track.uri) %>%
    mutate(number = n()) %>% 
    dplyr::filter(number == 1)
  
  return(ungroup(tracks_from_chosen_playlists_filtered))
}

The functions are now ready to use, with the current objective being focused on storing the tracks. To conduct this, the following syntax can be written to save the tracks from your selected playlists into a data frame. Do note, however, that the following content that comes after the put_songs_from_playlists_into_df() function is my own Spotify information so adjust it to your will.

tracks_from_skating <- put_songs_from_playlists_into_df(
  my_username, "skating to adulthood")
tracks_from_skating$track.playlist <- "skating to adulthood"
tracks_from_skating$playlist.external_urls.spotify <-
  "https://open.spotify.com/playlist/5Fjv5Uv5EZB7Y4k3cAQPzE"

tracks_from_pumpkin <- put_songs_from_playlists_into_df(
  my_username, "pumpkin spice latte")
tracks_from_pumpkin$track.playlist <- "pumpkin spice latte"
tracks_from_pumpkin$playlist.external_urls.spotify <-
  "https://open.spotify.com/playlist/2YNWpK4WCnkY2HxSLkkE0F"

tracks_from_melancholy <- put_songs_from_playlists_into_df(
  my_username, "bedroom melancholia")
tracks_from_melancholy$track.playlist <- "bedroom melancholia"
tracks_from_melancholy$playlist.external_urls.spotify <-
  "https://open.spotify.com/playlist/3QxYwuTIGMLxy2bdTeKJE4"

If we take a peek at each table, the number of tracks seems to showcase an unprecedented pattern. Adding the total tracks from tracks_from_skating and tracks_from_pumpkin and the result will amount to almost 114 tracks - the number of tracks in tracks_from_melancholy. This indicates that we could split the data 50:50 if necessary. However, this could also be a passing observation as it might just be another redundant insight.

Regardless of which playlist, an alternative to storing the tracks is to immediately save each into one data frame. However, I decided in the above chunk to not proceed with this. The decision, mostly, has to do with completing the playlist-specific information that the Spotify API does not provide when one extracts the internal content. Therefore, I chose to store it separately and merge it later on like the following.

tracks_from_playlists <- merge(tracks_from_skating,
                               tracks_from_pumpkin, all=TRUE,
                               sort = FALSE)
tracks_from_playlists <- merge(tracks_from_playlists,
                               tracks_from_melancholy, all=TRUE,
                               sort = FALSE)

datatable(tracks_from_playlists,
          extensions = 'FixedColumns',
          options = list(scrollY = "200px",
                         scrollX = TRUE,
                         fixedColumns = TRUE))

From the table above, we can see that there are columns missing which should have desribed audio features. According to the Spotify Developer homepage, these features should include objects like acousticness, danceability, and other scoring variables that make up a song. More explanation on this will be unraveled in subsection 2.3.

The problem with getting audio features rely heavily on Spotify’s limit to extract songs. As explained in the previous chunks, this situation can be solved by allowing an offset to enter the function. In order to proceed with this, one must construct a new function to extract the overall audio features at once.

playlist_audio_features <- function(full_songs_df) {
  
  # Step 1: Create an object that would inform the function to save
  #         the number of songs, with the final result being saved
  #         as a data frame. Make sure that only the important columns
  #         are saved from the previous `tracks_from_playlists`.
  number_of_songs <- nrow(tracks_from_playlists)
  final_df <- data.frame()
  songs <- dplyr::select(full_songs_df, track.id, track.name,
                         track.popularity, track.external_urls.spotify,
                         track.artists, track.playlist,
                         playlist.external_urls.spotify)
  
  # Step 2: Instruct the function download the audio features with the
  #         spotifyr package function called `get_track_audio_features`.
  #         The same logic for the following syntax should resemble the
  #         ones that were established before with `offset`.
  for (i in 1:number_of_songs) {
    
    s <- get_track_audio_features(songs[i,1])
    ss <- bind_cols(songs[i,], s)
    
    # Step 2a: Make sure that the function is instructed to combine the
    #          extraction results into one data frame.
    final_df <- bind_rows(final_df, ss)
    print(ss$track.name)
    
  }
  return(final_df)
}

With playlist_audio_features set in motion, the audio features can now be obtained. In order to do that, the following code might be useful to simplify what we are aiming for.

track_features <- playlist_audio_features(tracks_from_playlists)

track_features <- track_features %>%
  dplyr::select(-type, -id, -track_href, -analysis_url,
                -duration_ms, -time_signature, -key, -mode)

There is a huge debacle whether time_signature, duration_ms, and key should be kept or not, but it is up to the respective user. For this clustering analysis, the variables will not be needed. Therefore, it is best to omit the three.

These next few steps can be skipped altogether, but I would advise to run the following chunk in order to understand the external data that we have obtained. Going back to the previous tables, we can see that track.artists does not have a simplistic set of values. The column is filled with [object Object] which indicates that the value of each cell is an object, most likely to be another data frame. If one encounters this situation, fear not because the set of values is just another case of nested table.

In order to open this nested table and spread out the internal variables, the following chunk could be used. It should be noted, however, that parameter .name_repair is placed simply to avoid an error. You can type the value of said parameter with whatever you like.

track_features_unnest <- track_features %>% 
  unnest(track.artists, .name_repair = "uri1 is for artists")

Once the column is unnested, it would be best if one cross-checks the data through a tabular format. As we can see from the table above, the number of entries, or ‘tracks’ if you will, seems to have increased. This is due to the fact that some tracks are a collaboration effort between many artists.

As an example, the track Let Me Go from pumpkin spice latte credits 4 artists. By unnesting the track.artists cell, the user is essentially registering the song as 4 separate tracks, despite each row having the same track.uri. In order to solve this, duplicates need to be removed with only the top artist being displayed. What transpires afterwards is a question regarding where credit is due, but this is not an SJW forum on artistry. We shall have that discussion some other time.

To remove duplicates, the following syntax can be used.

track_features_unnest1 <-
  track_features_unnest[!duplicated(
    track_features_unnest$track.id),]

The data frame is now set with some slight problems that need to be addressed. Redundant columns are still attached and would be useless to keep. Furthermore, the variable names may not be suitable for other data processing software, namely python, due to the presence of the . symbol. To solve this, we can pipe the code with the previous problem simultaneously. The following syntax will show the procedure, as well as adding the process of changing the track.playlist structural type.

track_features_renamed <- track_features_unnest1 %>%
  rename(track_id = track.id, track_name = track.name,
         track_popularity = track.popularity, track_uri = uri,
         track_ext_url = track.external_urls.spotify,
         artist_id = id, track_artist = name, artist_uri = uri1,
         artist_ext_url = external_urls.spotify,
         track_playlist = track.playlist,
         playlist_ext_url = playlist.external_urls.spotify) %>% 
  select(-href, -type, -.name_repair) %>% 
  mutate(track_playlist = as.factor(track_playlist))

To simplify the readings for outsiders, one can also rearrange the columns which will get the following result.

final_track_features <- track_features_renamed[,c(2,6,3,9,11:19,
                                                  1,7,4:5,20,8,10)]

The data frame is now set and ready for clustering. Before we proceed, make sure that the data frame is saved in a .CSV format to provide a safety net.

write_csv(final_track_features,
          "saved_data_fromR/section_2/final_track_features.csv")

2.3 Second Verse

The table above is stored as a data frame named final_track_features with some important audio features. Each feature will be defined in the following summary.

track_popularity: The popularity of a track which assigns a value between 0 and 100, with 100 being the most popular. It is calculated by algorithm and, for the most part, is based on the total number of plays that each track has had and how recent those plays are.
danceability: An assigned value based on a combination of musical elements such as tempo, rhythm stability, beat strength, and overall regularity. This feature describes how suitable a track is for dancing, with the value of 0.0 being the least danceable and 1.0 being the most danceable.
energy: A measurement of value between 0.0 and 1.0 which represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores lower on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.
loudness: The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness between tracks. Loudness is the quality of a sound that becomes the primary psychological correlation of physical strength or amplitude. The values typically range between -60 and 0 db.
speechiness: The presence of spoken words within a track. The more exclusively speech-like the recording is - maybe involving talk shows, audio books, and poetries for example, the closer to 1.0 the attribute value is. Values above 0.66 describe tracks that are probably made entirely out of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered. Values below 0.33 are most likely to represent music that are non-speech-like tracks.
acousticness: A confidence measurement between 0.0 and 1.0 which assigns whether the track is acoustic enough or not. The value of 1.0 represents high confidence that the track is an enjoyable acoustic.
instrumentalness: A predicted value to indicate the non-vocal context of a track content. Sounds like “ooh” and “aah” are treated as instrumental in this scenario. Rap or spoken word tracks are clearly grouped as “vocal”. The closer the value is to 1.0, the greater the likelihood of the track containing no vocal content. Values above 0.5 are intended to represent instrumental tracks, with confidence becoming higher as the value approaches 1.0.
liveness: The presence of an audience in the recording. Higher values of liveness represent an increased probability that the track was performed live. Additionally, a value above 0.8 provides strong likelihood that the track is indeed live.
valence: A measurement which describes the musical positiveness conveyed by a track. The values ranges between 0.0 and 1.0. Tracks with higher value of valence sound more positive, which could generate a happy, cheerful, as well as euphoric energy. Meanwhile, tracks with lower value of valence sound more negative, which exude feelings of sadness, anger, and depression.
tempo: The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.

3 Chorus

This third section focuses on how one could break the barrier in curating playlists by using the unsupervised learning model of clustering. As mentioned in subsection 1.4, the process will involve the clustering method of k-means as well as principal component analysis (PCA).

3.1 Pre-chorus

Statistically speaking, clustering refers to the practice of finding meaningful ways to group data, or create subgroups, within a dataset. The objective is usually to have a number of partitions where the observations that fall into each partition are similar to others in that group. One of the ways in which these simalirities can be caught and grouped into an optimum number of clusters is by utilizing the k-means method.

K-means is a centroid-based clustering algorithm that follows a simple procedure of classifying a given dataset into a pre-determined number of clusters. The procedure involves a series of iterations which begins with finding cluster centers, also known as ‘centroids’. These centroids should theoretically be measured with statistics even if the user knows the correct number of clusters. For example, the songs that I have chosen to use were already clustered into 3 playlists so subjectively, I can immediately claim k = 3 and move on to PCA. However, in the spirit of being objective and genre-less, I will re-measure the number of clusters by using what is known as the elbow method.

Elbow method serves as a gateway to finding the possible numbers of an optimum cluster. As the name suggests, these numbers will lie somewhere within the pivoting point of a curve - presumably when it begins to shape like an elbow. The numbers are subjected to a calculation of within sum of squares (WSS), which statistically calculates the variability of the observations within each cluster.

Since the method deals with numbers, final_track_features needs to be subset to obtain only the numerical values of each audio feature. The subset will ultimately look like the following.

feature_range_check <- as.data.frame(final_track_features
                                     [,c(3,5:13)])

summary(feature_range_check)

##  track_popularity  danceability        energy          loudness      
##  Min.   : 5.00    Min.   :0.2170   Min.   :0.0848   Min.   :-15.099  
##  1st Qu.:54.00    1st Qu.:0.4998   1st Qu.:0.4572   1st Qu.: -8.937  
##  Median :66.50    Median :0.5740   Median :0.6185   Median : -7.074  
##  Mean   :60.33    Mean   :0.5876   Mean   :0.5976   Mean   : -7.258  
##  3rd Qu.:73.00    3rd Qu.:0.6910   3rd Qu.:0.7147   3rd Qu.: -5.394  
##  Max.   :87.00    Max.   :0.9120   Max.   :0.9350   Max.   : -2.264  
##   speechiness       acousticness      instrumentalness       liveness      
##  Min.   :0.02340   Min.   :0.000289   Min.   :0.0000000   Min.   :0.03350  
##  1st Qu.:0.03175   1st Qu.:0.027575   1st Qu.:0.0000000   1st Qu.:0.09433  
##  Median :0.04075   Median :0.150500   Median :0.0000232   Median :0.11150  
##  Mean   :0.05852   Mean   :0.259349   Mean   :0.0481866   Mean   :0.14825  
##  3rd Qu.:0.06200   3rd Qu.:0.393000   3rd Qu.:0.0053100   3rd Qu.:0.15450  
##  Max.   :0.42400   Max.   :0.913000   Max.   :0.9100000   Max.   :0.66800  
##     valence           tempo       
##  Min.   :0.0384   Min.   : 71.99  
##  1st Qu.:0.2467   1st Qu.: 99.96  
##  Median :0.4280   Median :117.02  
##  Mean   :0.4476   Mean   :120.98  
##  3rd Qu.:0.5930   3rd Qu.:146.08  
##  Max.   :0.9670   Max.   :203.91

If we summarize the content of each audio feature, we can see that the numeric values have an immensely different levels of range. This means that one variable does not have an equal spread values when compared to the other variables. The most noticeable example is instrumentalness. It has a higher range in its values, which can be seen by the relatively small number of Min. and Median. The numeric scale of tempo, for example, is also not comparable with loudness and instrumentalness. Therefore, the values within these audio features should be scaled first like the following.

library(scales)

scaled_track_features <- scale(dplyr::select(
  final_track_features, -track_name, -track_artist, -track_playlist,
  -track_id, -track_uri, -track_ext_url, -artist_id, -artist_uri,
  -artist_ext_url, -playlist_ext_url), center = TRUE, scale = TRUE)

scaled_final_track_features <- bind_cols(
  dplyr::select(final_track_features, track_name, track_artist,
                track_playlist, track_id, track_uri, track_ext_url,
                artist_id, artist_uri, artist_ext_url, playlist_ext_url),
  as_tibble(scaled_track_features)
  )

Taking a look once again at the summarized range of our newly-scaled values, we can clearly see that the values of each audio feature have been standardized.

feature_range_check2 <- as.data.frame(scaled_final_track_features1
                                     [,c(3,5:13)])

summary(feature_range_check2)

##  track_popularity   danceability          energy           loudness       
##  Min.   :-2.8357   Min.   :-2.60735   Min.   :-2.7965   Min.   :-3.09246  
##  1st Qu.:-0.3246   1st Qu.:-0.61786   1st Qu.:-0.7655   1st Qu.:-0.66215  
##  Median : 0.3160   Median :-0.09542   Median : 0.1138   Median : 0.07229  
##  Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.00000  
##  3rd Qu.: 0.6491   3rd Qu.: 0.72782   3rd Qu.: 0.6387   3rd Qu.: 0.73505  
##  Max.   : 1.3666   Max.   : 2.28282   Max.   : 1.8397   Max.   : 1.96948  
##   speechiness        acousticness     instrumentalness     liveness       
##  Min.   :-0.68523   Min.   :-0.9571   Min.   :-0.2837   Min.   :-1.14178  
##  1st Qu.:-0.52232   1st Qu.:-0.8563   1st Qu.:-0.2837   1st Qu.:-0.53658  
##  Median :-0.34672   Median :-0.4021   Median :-0.2836   Median :-0.36570  
##  Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.00000  
##  3rd Qu.: 0.06788   3rd Qu.: 0.4938   3rd Qu.:-0.2525   3rd Qu.: 0.06214  
##  Max.   : 7.13069   Max.   : 2.4149   Max.   : 5.0743   Max.   : 5.17135  
##     valence             tempo        
##  Min.   :-1.78189   Min.   :-1.6181  
##  1st Qu.:-0.87465   1st Qu.:-0.6943  
##  Median :-0.08542   Median :-0.1308  
##  Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.63306   3rd Qu.: 0.8292  
##  Max.   : 2.26160   Max.   : 2.7391

The calculation of optimum k can now be processed by using the feature_range_check2 data frame. As teased earlier, this data frame is used because WSS and k-means solely deals with numeric values. To obtain the result of the calculation, we can code the following function.

library(ggplot2)
library(plotly)
library(factoextra)

wss <- function(data, maxCluster = 9) {
    SSw <- (nrow(data) - 1) * sum(apply(data, 2, var))
    SSw <- vector()
    for (i in 2:maxCluster) {
        SSw[i] <- sum(kmeans(data, centers = i)$withinss)
    }
    plot(1:maxCluster, SSw, type = "o", pch=19)
}

Once the function is set, we can build the plot by marking the method as wss.

set.seed(100)
fviz_nbclust(feature_range_check2, kmeans, method = "wss") + 
  labs(x = "Number of Clusters (k)",
       y = "Within groups sum of squares",
       title = "Exhibit 3.1 Elbow Method for Finding The Optimum
Number of Clusters")

Obtaining the optimum k is highly subjective so one’s ability to deduct results from visualization is challenged here. It seems like the initial bend happened at k = 2. However, the curve really starts to get crooked like an actual broken elbow at k = 5.

In order to solve the conflicting judgement, we can plot the distribution of observations for k = 2, as well as the subsequent k = 3 until k = 5. The plot can be coded using the following syntax.

set.seed(100)
spotify_k2 <- kmeans(feature_range_check2, 2)

fviz_cluster(object = spotify_k2, data =
                           feature_range_check2) + 
  labs(title = "Exhibit 3.2 Clustering Spotify Tracks with k = 2")

For the rest of the numbers of clusters, repeat the aforementioned code chunk by replacing the value of 2 with the targeted value of k until k = 5.

Based on the five exhibits above, we can clearly see that k = 2 is the most optimum number of clusters. Visually, this could be confirm by the lack of overlaping observations. What this means is that each cluster is truly separate, with little to no values sharing the same similarities between the two colored clusters. Exhibit 3.3 seems to suspiciously confirm that k = 3 might also be an optimum number of clusters. However, judging by the number of observations plotted within the green cluster, one should not be tempted to immediately conclude that this might the best option for our analysis.

We hypothesized previously that k = 5 could also be the optimum number of clusters, but this turns out to be incorrect. There are way too many overlapping observations, most notably between the blue and the pink segregation. Thus, for the following PCA, we are going to use k = 2.

There are some additional process that one can extend from clustering, namely displaying the proportion of each cluster, as well as integrating the cluster into the data frame. Through the following code chunk, the proportion of each cluster can be obtained.

spotify_k2$size

## [1] 72 42

Meanwhile, obtaining the information regarding which cluster is which can be conducted by coding the following syntax.

final_track_clustered <- scaled_final_track_features1
final_track_clustered$cluster <- as.factor(spotify_k2$cluster)

The cluster for each track has now been integrated into the data frame. Additionally, we can extend the process by finding the mean values of each cluster, subjected by the audio feature.

final_track_clustered1 <- final_track_clustered[,c(3,5:13,21)]
final_track_clustered1 %>% 
  group_by(cluster) %>% 
  summarize_all("mean")

## # A tibble: 2 x 11
##   cluster track_popularity danceability energy loudness speechiness acousticness
##   <fct>              <dbl>        <dbl>  <dbl>    <dbl>       <dbl>        <dbl>
## 1 1                 0.0520       -0.103  0.602    0.504      -0.141       -0.490
## 2 2                -0.0891        0.177 -1.03    -0.864       0.241        0.840
## # … with 4 more variables: instrumentalness <dbl>, liveness <dbl>,
## #   valence <dbl>, tempo <dbl>

3.2 Bridge

If we analyze the audio features using the Spotify terminology, we could clearly see that some variables migh correlate with one another. A clear example of this is danceability. Its definition in subsection 2.3 cites tempo as one of the variables that determines whether a song is “danceable” enough or not.

library(GGally)
track_k2 <- final_track_clustered1[,-11]
ggcorr(track_k2) +
  labs(title = "Exhibit 3.7 Statistical Correlation between Audio Features")

Further proof can be found from the correlation plot above. While danceability and tempo might not follow the theoretical correlation - with possibilities leaning more towards causation, energy and loudness as well as energy and acousticness show a concerning level of correlation. With this in mind, there is a possibility that grouping these variables as predictors might prove to be insignificant for future regression.

Principal Component Analysis (PCA) shares the same objective, but differs in terms of statistical implementation. In the previous example, the possibility of a correlation was made under the impression that the two were going to be predictors within a regression model. With PCA, however, the analysis looks for correlation within the data and uses that redundancy to create a new matrix Z - essentially a 3 dimensional mathematical form of the standard ‘variable’. This matrix will include just about enough dimensions to explain most of the variance in the original data so features that add little value, or may just represent “noise”, are eliminated and deemed as redundant. The variables inside matrix Z are transformed into principal components or dimensions like the x and y axis in Exhibit 3.2 - Exhibit 3.6.

pca_k2 <- prcomp(track_k2,center = T)
summary(pca_k2)

## Importance of components:
##                           PC1    PC2    PC3     PC4     PC5     PC6    PC7
## Standard deviation     1.6166 1.2014 1.1415 0.99537 0.98796 0.91115 0.8479
## Proportion of Variance 0.2614 0.1443 0.1303 0.09908 0.09761 0.08302 0.0719
## Cumulative Proportion  0.2614 0.4057 0.5360 0.63508 0.73269 0.81571 0.8876
##                            PC8    PC9    PC10
## Standard deviation     0.74510 0.6340 0.40837
## Proportion of Variance 0.05552 0.0402 0.01668
## Cumulative Proportion  0.94312 0.9833 1.00000

Earlier in track_k2, there were ten variables and now there are ten principal components (PCs). Each of the PC stores the same ten variables, but capture different information. As the number of PC increases, the total information that the higher PCs gain are more significant and thus, the increasing value of cumulative proportion. The highest value of cumulative proportion in PC10 represents the highest number of information retained, which is 100%.

Now that we have information regarding how much information that each PC explains, we can proceed to combine the result with our k-means clustering from spotify_k2. The following code can be generated.

pca_k2_combined <- data.frame(pca_k2$x,
                              cluster = factor(spotify_k2$cluster))

In addition, correlations between each PC can be examined to strengthen the utilization of PCA. To do this, the same ggcorr function can be created.

ggcorr(pca_k2_combined)  +
  labs(title = "Exhibit 3.8 Statistical Correlation between PCs")

Exhibit 3.8 shows that every correlation is exactly zero. This exhibit justifies the efficiency of using PCs instead of variables as the matrices that each PC creates elminate the possibility of a correlation.

PCA itself is a very simple analysis and, in fact, all procedures are almost complete. To better understand the analysis, it is best to visualize the result as a biplot. This plot is a figure which visualizes the distribution of our observations against 2 PCs. In a standardized 2-dimensional curve, one might be familiar with the placement of x and y. Biplot essentially has a similar idea, with x and y being replaced by the two PCs of your choosing - normally PC1 and PC2.

The simplest way to generate a biplot is by using the biplot() function. However, this function does not leave much for the imagination. It plots the observations and the variables into this one big giant knitting mess. An alternative would involve a more complicated function-generating formula like the following.

data_pca <- data.frame(varnames = rownames(pca_k2$rotation),
                       pca_k2$rotation)
x <- "PC1"
y <- "PC2"

data <- data.frame(obsnames=seq(nrow(pca_k2$x)), pca_k2$x)

mult <- min(
  (max(data[,y]) - min(data[,y])/(max(data_pca[,y])
                                  -min(data_pca[,y]))),
  (max(data[,x]) - min(data[,x])/(max(data_pca[,x])
                                  -min(data_pca[,x]))))

data_pca <- transform(data_pca,
                    v1 = .9 * mult * (get(x)),
                    v2 = .9 * mult * (get(y)))

If you don’t understand anything about the function, congratulations because I don’t either! It’s quite difficult to comprehend what the function specifically does, but it ultimately creates an easier segmenting process for plotting the audio features later on. To learn more about this function, I’d recommend checking out this publication as a reference⁸.

The biplot will eventually be coded using the standard ggplot mode with some unusual modifications. Said modifications can be seen by running the following syntax.

library(ggrepel)
ggplot(pca_k2_combined, aes(x = PC1, y = PC2)) +
  geom_hline(aes(yintercept = 0), size = .2) + 
  geom_vline(aes(xintercept = 0), size = .2) +
  coord_equal() +
  geom_point(aes(color = cluster),size = 5) +
  geom_segment(data = data_pca, aes(x = 0, y = 0, xend = v1,
                                    yend = v2), arrow = arrow(
                                      length = unit(0.2, "cm"))) +
  geom_text_repel(data = data_pca, aes(label = str_to_title(varnames)),
                  point.padding = -10, segment.size = 0.5) +
  scale_color_brewer(palette = "Set2")+
  guides(colour = guide_legend(override.aes = list(size=3)))+
  labs(title = "Exhibit 3.9 Biplot for PCA with 2 Clustered Playlists",
       color = "Cluster")

Exhibit 3.9 plots the distribution of observations while displaying the information regarding each cluster’s characteristics in audio features. PC 1 and PC 2 are commonly used because the first two principal components still retain enough “information” within the data or in statistical terms, they retain most of the variance in our data.

Although the axis label only shows up on the bottom and left part of the curve, PC1 is also represented on the right axis while PC2 is similarly represented on the top axis. The position of the arrows represent its contribution or significance to each PC. For PC1, energy contributes the most as it is the closest one to the value of zero and for PC2, acousticness contributes the most using the same logic. In addition, the length of each arrow represents the magnitude or how much audio feature information that both PCs were able to capture collectively.

The biplot is commonly used to analyze the two PCs and its variables, but in this case, we can also analyze the two clusters. In cluster 2, it seems that the songs were grouped as a playlist due to its higher value in acousticness and speechiness. Meanwhile, cluster 1 seems to have relatively higher levels in the rest of the audio features, though the variable of danceability remains ambiguous. From the perspective of its direction, danceability seems to be pointing at cluster 2. However, the overlapping line seems to indicate that it is a higher representative of cluster 1. This confusion can be confirmed by plotting each audio feature against each cluster.

Before we proceed with visualizing each audio feature, it is best to normalize the scaled values. The purpose of this is to get a clearer index for when the values are compared, as well as matching the threshold that Spotify has set in subsection 2.3 for analysis. Normalizing the values can be done by going through the following function first and, again, take it with a grain of salt as it is another mathematical expression.

normalize <- function(x){
  return(
    (x - min(x))/(max(x) - min(x))
    )
}

track_normal <- normalize(final_track_clustered[,c(3,5:13)])
track_normal1 <- cbind(track_normal,
                       final_track_clustered[,-c(3,5:13)])

To visualize every audio feature in one form of presentation, the function facet_wrap() can be utilized. In short, this function allows one to wrap all the visualization results from different variables as long as the variables in question are gathered first as one column⁹. The syntax is pretty long so one can just adopt the following.

track_normal1 %>% 
  group_by(cluster) %>% 
  summarize(Popularity = mean(track_popularity),
            Danceability = mean(danceability),
            Energy = mean(energy),
            Loudness = mean(loudness),
            Speechiness = mean(speechiness),
            Acousticness = mean(acousticness),
            Instrumentalness = mean(instrumentalness),
            Liveness = mean(liveness),
            Valence = mean(valence),
            Tempo = mean(tempo)) %>% 
  select(Popularity, Energy, Loudness, Speechiness, Acousticness, 
         Instrumentalness, Liveness, Valence, Tempo, cluster) %>% 
  gather("Name", "Value", -cluster) %>% 
  ggplot(aes(y = Value, x = cluster, col = Name, group = Name)) +
  geom_point(size = 2)+
  geom_line(size = 0.8)+
  facet_wrap(~Name,scales = "free_y")+
  scale_color_brewer(palette = "Set3")+
  labs(x="Cluster", col = "Audio Feature",
       title = "Exhibit 3.10 Audio Feature Comparison Between Clusters")

From Exhibit 3.10, it is confirmed that cluster 1 dominates in its higher values of almost every audio feature, with the most outstanding feature being energy. It is also statistically confirmed that acousticness is higher in cluster 2. The rest of the features in both clusters remain in tact with Exhibit 3.9 even when the values were adjusted to range between 0 and 1.

track_normal1 %>% 
  group_by(cluster) %>% 
  summarize(Danceability = mean(danceability)) %>% 
  select(Danceability, cluster) %>% 
  gather("name", "value", -cluster) %>% 
  ggplot(aes(y = value, x = cluster, group = name))+
  geom_point(size = 2.5)+
  geom_line(size = 1)+
  labs(y = "Value of Danceability", x="Cluster",
       title = "Exhibit 3.11 Danceability Comparison Between Clusters")

Regarding the ambiguous feature of danceability, we can now confirm that it is relatively higher in cluster 2. This is an unprecedented anomaly considering that the feature is pretty much a polar opposite of acousticness. One usually listens to songs with higher acoustic levels during recent heartbreaks while saving up on dance songs for when the good times occur. However, I must say that some people use the excuse of having a good time (i.e. partying) as a medicine for melancholy. In the spirit of clustering a genre-less playlist, I’m keeping an open mind with this stastical result as the cluster might just work.

If we summarize the result, each cluster can now be defined by its specific characteristic or pattern in audio features. The following is a summary of it.

Cluster 1 is high in energy which gets your blood pumping every time you listen to the songs. The energetic charm is complemented with its relatively upbeat tempo as well as the euphoric state of valence. This playlist is perfect for someone who is longing to come alive.
Cluster 2 is high in accousticness which renders the playlist as a relatively easy-listen, though it may not be represented in the lyrics. This characteristic is balanced out by the higher ability to make the listeners dance simultaneously, especially since the songs are relatively more popular and are suitable for sing alongs as well as karaoke.

Additionally, our analysis can be extended to figure out the most representative songs. As mentioned before, k-means method toys around with its calculation using centroids. Therefore, we can obtain the centroid values of each cluster first.

set.seed(100)
spotify_k2_centroid <- data.frame(spotify_k2$centers,
                                  cluster = rownames(spotify_k2$centers))

Once we’ve obtained the values of centroid, we can create a function to perform some distancing. The purpose of this is to generate a number based on the gap between the variable value of each song and the variable’s centroid. The closer the value to the centroid is, the more representative the song becomes.

track_dist1 <- cbind(track_normal1, index = 1:nrow(track_normal1))
dist1 <- list()

for (i in 1:2){
  d1 <- data.frame(track_dist1[track_dist1$cluster == i,])
  d1 <- data.frame(d1 %>% 
    mutate(dist1 = sqrt((d1[,1] - spotify_k2_centroid[i,1])^2 +
                       (d1[,2] - spotify_k2_centroid[i,2])^2 +
                       (d1[,3] - spotify_k2_centroid[i,3])^2 +
                       (d1[,4] - spotify_k2_centroid[i,4])^2 +
                       (d1[,5] - spotify_k2_centroid[i,5])^2 +
                       (d1[,6] - spotify_k2_centroid[i,6])^2 +
                       (d1[,7] - spotify_k2_centroid[i,7])^2 +
                       (d1[,8] - spotify_k2_centroid[i,8])^2 + 
                       (d1[,9] - spotify_k2_centroid[i,9])^2 +
                       (d1[,10] - spotify_k2_centroid[i,10])^2)) %>% 
    arrange(dist1) %>% 
    head(1))
 dist1[[i]] <- d1
}

track_distance1 <- do.call(rbind, dist1)

track_distance1 %>%
  select(track_name, track_artist, cluster, track_popularity)

##            track_name  track_artist cluster track_popularity
## 1        Read My Mind   The Killers       1       0.05519616
## 2 you were good to me Jeremy Zucker       2       0.41110674

In my opinion, both songs are pretty close to what each clustered playlist represents. If one listens to Read My Mind by The Killers, then one can hear the song opening with a transcendental synthesizer sound that continues on to radiate energy with the use of its drums and electric guitar. This instrumental mix eventually creates an upbeat tempo in the process. The same thing applies to you were good to me by Jeremy Zucker and Chelsea Cutler. The song heavily relies on piano as its main support. It’s stripped and it’s vocal-based, though the centroid gap in danceability is subjectively questionable.

Further statistical proof of my subjective hypothesis can also be represented in the following facet_wrap. The syntax will also be shown as this will be the first of many.

track_distance1 %>% 
  gather("name", "value", 1:10) %>% 
  mutate(label = as.character(
    paste0(cluster, ": ", track_name, " - ", track_artist)),
          text = round(value, 2)) %>%
  arrange(track_artist) %>% 
  ggplot(aes(x = name, y = value, fill = str_to_title(name)))+
  geom_col(position = position_stack(),
           aes(fill =  str_to_title(name)))+
  geom_text(aes(label = text), position = position_stack(vjust = .5),
            size=2.5)+
  facet_wrap(~label)+
  scale_fill_brewer(palette = "Set3")+
  theme(axis.text.x = element_text(size = 0),
        strip.text = element_text(size = 9))+
  labs(x=NULL, y = "Value", fill = "Audio Features",
       title = "Exhibit 3.12 Audio Feature Comparison Between the Most Representative Songs")

As predicted, the statistics display the same result as my previous opinion. The energetic state in which Read My Mind creates is also statistically confirmed, with the intensification of drums and electric guitar influencing the higher value of loudness. Meanwhile, you were good to me rates high in its acousticness while my “stripped” interpretation is confirmed by the low values in energy and loudness.

The least representative song within each playlist can also be shown. By adopting the previous distancing function, one can repeat the code by replacing head(1) with tail(1).

##    track_name          track_artist cluster track_popularity
## 72    Burning The Whitest Boy Alive       1       0.02511921
## 42       Jane             Roy Blair       2       0.25570916

For the least representative song in cluster 1, Burning by The Whitest Boy Alive lacks energy. The least representative song in cluster 2 also shows a similar pattern, in which Roy Blair’s Jane lacks in acousticness. If you’re familiar with the song, the song is mostly supported by synthesizer sounds and drums. However, it does have a comparatively higher value of livenes - a higher trait in the other cluster, since the song ends with a live recording of a check-out counter, so the further deviation from its cluster centroid is statistically justified.

Another insight that we can obtain is the comparison between the most popular track in each cluster. By observing the audio features, excluding track_popularity of course, one can use the following function to obtain the popularity result.

comb1 = list()

for (i in 1:2){
x <- data.frame(track_normal1 %>%
                  filter(cluster == i) %>%
                  arrange(desc(track_popularity)) %>%
                  head(1))
comb1[[i]] <- x
}

combine1 <- do.call(rbind, comb1)

combine1 %>%
  select(track_name, track_artist, cluster, track_popularity)

##         track_name  track_artist cluster track_popularity
## 1    thank u, next Ariana Grande       1        0.4361709
## 2 I Like Me Better          Lauv       2        0.4161196

The two songs presented above are not surprising considering how popular the two were at the time of their respective release. thank u, next by Ariana Grande broke Spotify’s global, single-day streaming record for a female artist 2 days after its release by earning nearly 8.2 million streams. The record was then broken the following day by accumulating 8.5 million global Spotify streams¹⁰. Meanwhile, I Like Me Better by Lauv became a sleeper hit 10 months after its release by making a historic 35-week climb into the Billboard Top 10 chart¹¹. Not to mention, the song is a bop.

If we turn our attention to Exhibit 3.14, we can see that both tracks almost have the same level of audio features. This provides an interesting statistical justification of how generic pop songs can be. What differentiates only lies within the noticeable level of speechiness, in which I Like Me Better’s is justifiable considering that most, if not all, of the electronic mix was layered from the artist’s vocals.

Since we were able to obtain an additional insight by reviewing the most popular songs, we might as well do the same for the least popular one. Using the same function and code, the result can be obtained in a visual plot like the following.

##    track_name          track_artist cluster track_popularity
## 72    Burning The Whitest Boy Alive       1       0.02511921
## 42    Yam Yam           No Vacation       2       0.03013204

From Exhibit 3.15, we can see that Spotify users are not really keen on instrumental songs. Through its definition, this logic makes sense since instrumentalness indicates the lack of vocals. If you listen to Burning by The Whitest Boy Alive, the vocals do not even start until the 30th second mark. Yam Yam by No Vacation also finds itself having a similar pattern as the lead singer begins to belt out the lyrics around the 28th second mark. Both songs even end their first run of vocals after less than 60 seconds of singing. However, music enthusiasts would appreciate the lack of vocals as they killed it in layering the instrumental sounds. Too bad that this plus side is being overlooked by the general audience.

3.3 Interlude

Having the clustering analysis successfully completed, we now have a set of songs which has been classified without looking at the assigned genre. The process of utilizing k-means and PCA involves only the audio features of each song. What the stastical tools cannot capture, however, is the possibility that similarities in audio features could form a genre. Thus, rendering a likelihood that each cluster has the same genre of songs - or a derivation of such.

To review the variability of genres, I decided to extend my research. I know each of the genre by heart - considering that I manually curated the playlists, but I wanted to bring in a more objective assessment. I decided to prioritize the artist’s claim of what the genre is, though I end up falling short in finding reliable information. Furthermore, I avoided using the artist’s genre that was assigned through Spotify’s internal data as each artist may release different types of songs as an experiment. It is best to assume, then, that the classification of genres is better assigned to a song, rather than to an artist.

To begin decoding the genre of each cluster, we could start rearranging the audio features, followed by merging our data frame with the manually-assigned genre information. The following syntax can be used.

final_track_normalized <- track_normal1[,c(11:12,1,13,21,2:10,14:20)]

genre_track_normalized <- read_csv("manually_modified/genre_track_features.csv")

final_track_genre <- cbind(final_track_normalized,genre_track_normalized[,21])
final_track_genre1 <- final_track_genre[,c(1:3,22,4:21)]

Now that the genre has been integrated into the data frame, we can re-visualize the biplot to observe the distribution of genres. The process is similar to subsection 3.2, in which the result of the k-means clustering needs to be integrated into the PCA data frame. What differentiates the following syntax is the addition of genre from final_track_genre1.

pca_genre_k2 <- data.frame(pca_k2$x,
                           cluster = factor(spotify_k2$cluster),
                           genre = factor(final_track_genre1$genre))

From Exhibit 3.6, the distribution of genres within a cluster is pretty variative based on the color alone. However, it is pretty hard to tell whether a particular one dominates in size or not considering the number of genres included. It should be noted that the genres are more likely to be classified as “sub-genres” as the classification of a bedroom pop song, for example, is a specific derivation of indie pop.

To get a sense of the actual distribution of genres, one can utilize the package and function of worldcloud. This package would allow a variable with ‘character’ as their data type to be mapped out into a cloud of words based on the respective frequency. In order to proceed, the following libraries are necessary to activate the function.

library(tm)
library(wordcloud)
library(RColorBrewer)
library(wordcloud2)

The frequency of a displayed word can only be generated if we subset the values in numbers. Therefore the amount of simiar genres need to be automatically calculated first by generating the following syntax.

genre_count_clust1 <- final_track_genre1 %>%
  group_by(track_name, genre, cluster) %>% 
  filter(cluster == 1) %>%
  summarize(genre_count = 1) %>% 
  ungroup() %>% 
  group_by(genre, genre_count, cluster) %>% 
  filter(cluster == 1) %>%
  summarize(total_tracks = sum(genre_count)) %>%
  arrange(desc(total_tracks)) %>% 
  ungroup() %>% 
  select(cluster, genre, total_tracks, -genre_count)

An additional function that one could create for the sake of aesthetic is CapStr. The term is basically short for ‘capitalized structure’ and is pretty self-explanatory in what it does to capitalize the first letters.

CapStr <- function(y) {
  c <- strsplit(y, " ")[[1]]
  paste(toupper(substring(c, 1,1)), substring(c, 2),
      sep = "", collapse = " ")
}

Once the function is complete, the folllowing syntax can be applied. Don’t forget to set the seed to your desired random number, otherwise the randomized placement of the words will differ between one R user and another R user.

set.seed(100)
wordcloud(words = sapply(genre_count_clust1$genre, CapStr),
          freq = genre_count_clust1$total_tracks, min.freq = 1,
          max.words = 200, random.order = FALSE, rot.per=0.35,
          colors = brewer.pal(8, "Dark2"))

Based on the wordcloud, we can clearly see that both Indie Rock and Pop dominates the genre distribution in cluster 1. The playlist is pretty variative in terms of genre, with an implication that indie rock songs might have similar audio features that fans of tropical house or electro-hop enjoy.

If we extend the analysis to popular artists, we can also map out the most dominant ones within cluster 1. This can be done by applying the similar subset logic as the genre_count_clust1 and wordcloud() syntax.

Based on the word cloud above, we can clearly see that songs from Gus Dapperton and Wallows dominate cluster 1. However, what’s fascinating about this insight is how diverse the artists are. Even without knowing the genre, it seems like every track on the clustered playlist comes from different artists, especially knowing the fact that there are only 72 songs.

For the second clustered playlist, one can also obtain the same form of word cloud by repeating the same codes.

From the word cloud above, we can see that the genres within cluster 2 are not as diverse. Bedroom Pop dominates, but what follows are sub-genres that are derivative of pop. There are some songs that fall under the category of R&B and rock, but the sheer amount of frequency are not huge enough. This might as well be the bedroom melancholia playlist, but reduced in size.

When it comes to artists, the distribution within cluster 2 are not as diversed as well, with Rex Orange County having 6 tracks in the cluster. His songs appear in cluster 1 as well, but the proportion is 70 to 30. There is a justification to this, however, as it appears that the artist is on my top discovery of 2019. Thus, it is possible that I might have included more songs from him in one of the playlists, namely bedroom melancholia.

Moving on to decision making, it is very hard in general and the same applies for this subjective art of choosing music. Unlike supervised learning, there are no numerical values to use as a threshold in objective decision making. However, objectivity is overrated in the practical world and thus, the conclusion of which cluster is the best has to be made by combining human knowledge.

Based on the algorithm and the human insight that I provided, the best genre-less playlist curated is cluster 1. There are specific arguments for this consideration as I will elaborate with the following pointers.

The cluster was able to capture one of the most popular songs of 2018 and combine it with one of the least known songs with a 0.01 score based on the audio features alone.
The distribution of genres are more widely spread, with indie rock and psychedelic pop songs being grouped into the same cluster.
Having industry giants like Ariana Grande might help boost the appeal of the clustered playlist so smaller artists like Still Woozy can get the exposure that they rightfully deserve.
There are more observations within the cluster, which gives the listeners more options to experiment with their musical taste.

In order to validate the preceding arguments, it is best to hear the each clustered songs as an actual playlist. Traditionally, it is done by utilizing the create_playlist() function¹². However, this could take a longer time in our scenario as the name of each variable needs to be reverted back to its original form. It is easier to extract the final_track_genre1 data frame and save it as a .CSV format. From there, we can start putting the songs manually on Spotify, considering that the proportion is only 72:42. This might not apply for observations that consist of 1000 songs or above, for example, as it will take a longer time.

The plan for the two playlists is to arrange each based on how close the songs are to the centroid. In order to do this, the variables in final_track_genre1 need to be readjusted to match the column placement of spotify_k2_centroid. The following syntax can be applied.

final_track_genre2 <- final_track_genre1[,c(3,7:15,1:2,4:6,16:22)]

track_dist3 <- cbind(final_track_genre2, index = 1:nrow(final_track_genre2))
dist3 <- list()

for (i in 1:2){
  d3 <- data.frame(track_dist3[track_dist3$cluster == i,])
  d3 <- data.frame(d3 %>% 
    mutate(dist3 = sqrt((d3[,1] - spotify_k2_centroid[i,1])^2 +
                       (d3[,2] - spotify_k2_centroid[i,2])^2 +
                       (d3[,3] - spotify_k2_centroid[i,3])^2 +
                       (d3[,4] - spotify_k2_centroid[i,4])^2 +
                       (d3[,5] - spotify_k2_centroid[i,5])^2 +
                       (d3[,6] - spotify_k2_centroid[i,6])^2 +
                       (d3[,7] - spotify_k2_centroid[i,7])^2 +
                       (d3[,8] - spotify_k2_centroid[i,8])^2 + 
                       (d3[,9] - spotify_k2_centroid[i,9])^2 +
                       (d3[,10] - spotify_k2_centroid[i,10])^2)) %>% 
    arrange(dist3)
    )
 dist3[[i]] <- d3
}

track_distance3 <- do.call(rbind, dist3)

cluster1_playlist <- track_distance3 %>% 
  filter(cluster == 1)

write_csv(cluster1_playlist,
          "saved_data_fromR/section_3/cluster1_playlist.csv")

Cluster 1 is now set to be the genre-less playlist of the next decade. It is currently named algorithm cluster 1 and you can listen to the playlist here to validate its superior quality in capturing a genre-fluid world.

cluster2_playlist <- track_distance3 %>% 
  filter(cluster == 2)

write_csv(cluster2_playlist,
          "saved_data_fromR/section_3/cluster2_playlist.csv")

Cluster 2 is also set to be another fresh take on an algorithm-based playlist. Despite its lack in genre diversity, we should still give credit to this cluster and name the playlist as algorithm cluster 2. You can listen to it by clicking this link.

4 Outro

4.1 Swan Song

Spotify has long been the leading music platform for listeners and they are consistently striving for innovation. One of which has resulted into the concept of a genre-less playlist for fans of Clairo who happen to listen to Brockhampton as well. The playlist seemed like a dull premise, but it works statistically considering all the results that we have obtained using unsupervised learning.

Through the methods of k-means and PCA, this publication is able to capture the optimum number of clusters from 114 songs and map out the similarities in audio features. The songs, obtained from 3 of my genre-specific playlists, were clustered into 2 groups with the first cluster showing a more energetic vibe. This particular trait was deduced from the cluster’s higher average value in energy, as well as tempo and valence.

When two of the clusters are broken down based on the variety of genres and artists, cluster 1 tops the chart. It is fascinating to see how similar songs like thank u, next by Ariana Grande and Radioactive by Imagine Dragons are based on the audio features, despite having completely different genres - one is pop while the latter is alternative rock. To identify in real time, you can cross-check this analysis by listening to the clustered playlist here.

At the end of the day, music is pretty subjective and even the most objective statistical analysis might have its flaws. The most noticeable one in this publication is the chosen number of clusters. There is a debate regarding k = 2 especially when k = 3 visually looks more accurate, though it was established that the proportion of observation in 1 cluster was concerningly small. In response to this, the total number of observations is, indeed, relatively small. One hundred and fourteen songs are practically child’s play, yet I consciously made the decision to proceed and not add up more songs from other playlists because 1 genre will certainly dominate. Speaking of genres, there are two issues that need to be addressed; both have to do with the genre selection. There is a lack of diversity as we barely see any rap and hip hop tracks from my pool of songs since I seldom listen to any. The genre of each track was also manually selected by yours truly so the final decision on which cluster represents genre-fluidity might not be the best.

As a suggestion, future publication could use spotifytracks_db as their data frame to obtain a wider context of songs instead. The dataset, which is available for download in here, contains 232,725 tracks from 26 different genres. It is far more superior than the dataset used in here as it seems to include non-English speaking songs as well. Moreover, behavioral science enthusiasts can extend the analysis to figure out moods based on music. I’m no psychologist, but I do know that this approach is possible using supervised learning through the method of random forest and the utilization of confusion matrix. For more reference on this, you can check out the medium article attached within this link.

In conclusion, there are a lot of analytical areas in which this publication needs to improve, but it is pretty satisfying to confirm how my musical taste was developed. Instead of holding on to a specific genre, my human ears would rather listen to a specific audio pattern regardless of the time period. My preferred genre might have fluctuated from time to time, but the characteristic in audio features remains the same.

4.2 Credits

Special thanks to Algoritma Data Science School as this publication is a part of their learning-by-building initiative to better understand unsupervised learning.

Additional shoutout goes to a github user under the account of AndriiG13 for providing the step-by-step basis of section 2. You can find his similar RPubs publication here.

If you are interested with my other endeavors with Spotify data, you can visit Spotify Trending Analytics (or Spotitrend) on this link. It is an interactive user interface platform which allows one to visualize their very own music streaming analytics. Enjoy!

Milan, A. (2019, December 5). What is Spotify Wrapped and how to get it? Retrieved from Metro.co.uk ↩
Perez, S. (2019, December 5). Spotify Wrapped expands to include your favorite music from the decade, plus podcaster metrics. Retrieved from Tech Crunch ↩
Wang, A. X. (2019, November 25). The 50 Most Important Music Moments of the Decade. Retrieved from Rolling Stone ↩
DEA Music & Art. (2015, December 6). Why Is Pop Music So … Popular? Retrieved from DEA Music & Art ↩
Skelton, E., et al. (2018, April 17). Don’t Call it Bedroom Pop: The New Wave of DIY. Retrieved from Complex ↩
Leight, E. (2019, September 30). Future 25: John Stein, Co-Creator of Spotify’s Genre-Less Pollen Playlist. Retrieved from Rolling Stone ↩
Thompson, C. (2019, July 13). R Wrapper for the ‘Spotify’ Web API. Retrieved from RDocumentation ↩
Budiarto, H. (2019, August 3). Clustering on Spotify Songs. Retrieved from RPubs ↩
Wickham, H. (2009). Wrap A 1d Ribbon Of Panels Into 2d. Retrieved from RDocumentation ↩
Rolli, B. (2018, November 7). Ariana Grande’s ‘Thank U, Next’ Breaks Single-Day Spotify Record For A Female Artist – Twice. Retrieved from Forbes ↩
Trust, G. (2018, June 18). Lauv’s ‘I Like Me Better’ & Dua Lipa’s ‘New Rules’ Set Longevity Records on Pop Songs Chart. Retrieved from Billboard ↩
Grygoryshyn, A. (2019, July 15). Spotify Playlist Clustering Demo. Retrieved from RPubs ↩

Decoding a Genre-Less Spotify

The Perfect Harmony to Clustering Songs

Curated by valdywiratama

Released on Dec 27, 2019