by Tamanna Saini and Xiaoyu “Timmy” Tang

Introduction

Everybody has music they enjoy, and a person’s taste is often reflected in the artists they like. One of the most common problems in music is finding and recommending songs that match a person’s taste. One way to go about it is to classify songs into groups using some sort of metric: whenever a new song is added to the pool, it is assigned to one of these groups and recommended to people who like music in that group.

Music is a rather subjective thing to classify, so we need to take many variables into consideration to divide songs into meaningful clusters. We look at playlists from a data science perspective to see whether algorithmic methods can categorically divide songs and place new songs into those categories.

The Plan

Spotify is a robust music application with a well-documented API for fetching songs. Through it, we can acquire a large number of songs along with descriptive information about each one.

Since we are trying to categorize songs, we need characteristics of each one. Primarily, we need traits concerning the music itself. From Spotify’s API, we have access to the following audio features for every track:

  • danceability
  • energy
  • key
  • loudness
  • mode
  • speechiness
  • acousticness
  • instrumentalness
  • liveness
  • valence
  • tempo

The next thing we need to do is categorize them. Since we don’t have any categories to sort them into initially, we look to unsupervised learning to cluster the songs into groups.

Some unsupervised learning options available to us are K-means clustering and principal component analysis (PCA). We can apply both to our dataset(s) and see what conclusions we can reach with the different algorithms.

Once we have trained our models on our data, we can compare them across data sets and test them by introducing new songs.

The Process

The first thing we need to do is get data. For our data set(s), we will pull all the songs from our favorite artists. We don’t need everything, so we filter out some of the information.

Of these characteristics, we want quantitative data, so we remove key and mode. We don’t consider the presence or absence of lyrics to be important (we’ll count a song and its instrumental as the same song), so we remove instrumentalness. We also don’t care about live performances, since there are studio and live recordings of the same song, so we exclude liveness as well.

library(spotifyr)  # for get_artist_audio_features()

# Get the data from Spotify, then cull the unnecessary features.
# This assumes the SPOTIFY_CLIENT_ID and SPOTIFY_CLIENT_SECRET environment
# variables are set so spotifyr can authenticate against the Web API.
beatles <- get_artist_audio_features('the beatles')
bts <- get_artist_audio_features('BTS')
# Keep only the columns we need: artist info, the audio features, and track name
beatles <- beatles[c(1, 9:19, 30, 35)]
bts <- bts[c(1, 9:19, 30, 35)]
# Remove qualitative data (key, mode, instrumentalness, liveness)
beatles <- beatles[-c(4, 6, 9, 10)]
bts <- bts[-c(4, 6, 9, 10)]

# We rescale some of the other variables so they lie on a 0-1 scale.
# Spotify reports loudness in (mostly negative) decibels, so we take the
# absolute value first and then divide by the maximum; note that this
# inverts the scale, so quieter songs end up closer to 1.
bts$loudness <- abs(bts$loudness)
bts$loudness <- scale(bts$loudness, center = FALSE, scale = max(bts$loudness, na.rm = TRUE))

beatles$loudness <- abs(beatles$loudness)
beatles$loudness <- scale(beatles$loudness, center = FALSE, scale = max(beatles$loudness, na.rm = TRUE))

# Tempo just gets divided by its maximum.
bts$tempo <- scale(bts$tempo, center = FALSE, scale = max(bts$tempo, na.rm = TRUE))
beatles$tempo <- scale(beatles$tempo, center = FALSE, scale = max(beatles$tempo, na.rm = TRUE))
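
Since the same max-rescaling is applied to several columns, it could be wrapped in a small helper. Here is a minimal sketch; the function name rescale01 is our own, and unlike scale() it returns a plain vector rather than a one-column matrix:

# Hypothetical helper: rescale a non-negative numeric vector to [0, 1]
# by dividing by its maximum (same result as the scale() calls above).
rescale01 <- function(x) x / max(x, na.rm = TRUE)

# Usage, e.g.: bts$tempo <- rescale01(bts$tempo)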

We can start by looking at some of the characteristics of the Beatles’ songs.

Beatles Song Characteristic Distributions

We use the characteristic distribution histograms to make sure that the data is not concentrated at any single value and that each variable covers a wide range. As the graphs show, except for speechiness, all of the variables are spread across a wide range of values.

BTS Song Characteristic Distributions

For the BTS dataset, we can see that loudness, acousticness, and speechiness are concentrated in certain areas, but there is still some data at other values, so no variable is completely focused on a single value. The valence and tempo distributions are more widespread than in the Beatles dataset.
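
The plotting code for these histograms isn’t shown here; a minimal base-R sketch that would produce one histogram per characteristic (swap in bts for the BTS plots):

# Draw a grid of histograms, one per audio feature, to check the spread.
features <- c("danceability", "energy", "loudness", "speechiness",
              "acousticness", "valence", "tempo")
par(mfrow = c(3, 3))
for (f in features) {
  hist(as.numeric(beatles[[f]]), main = f, xlab = f, col = "grey")
}
par(mfrow = c(1, 1))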

K-Means

The k-means clustering method is one of the oldest and most approachable unsupervised machine learning techniques for identifying clusters of data objects in a dataset. We use k-means to divide the data into clusters. We first combine both datasets, remove any duplicate songs, and then work with this combined dataset, producing small and large k-means clusterings and plotting them graphically. Some disadvantages of k-means are that the user has to specify the number of clusters, k, up front and that it can only handle numerical data. K-means also tends to assume roughly spherical clusters of similar spread, so it can struggle when the true groups differ greatly in shape or size.

Looking at the distributions, both BTS and the Beatles have three distinct peaks in tempo. The other distributions didn’t show any obvious partitions, so our choices of k for k-means are based on the tempo, as sanity-checked below.

library(dplyr)  # for distinct()

# Combine both data sets, drop duplicate tracks, and set the row names
# to track_name so the cluster plots can label individual songs.
bothdata <- rbind(bts, beatles)
bothdata <- distinct(bothdata, track_name, .keep_all = TRUE)
bothdata <- bothdata[c(9, 1:8, 10)]  # move track_name to the front
bothdata2 <- bothdata[, -1]
rownames(bothdata2) <- bothdata[, 1]
bothdata2 <- bothdata2[-c(1, 9)]     # drop the remaining non-feature columns

# Fix the random starts so the clustering is reproducible.
set.seed(123)
small_km_bothdata <- kmeans(bothdata2, 3, nstart = 25)  # small: k = 3
big_km_bothdata <- kmeans(bothdata2, 6, nstart = 50)    # large: k = 6
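
Since reading k off the tempo peaks is a judgment call, one way to sanity-check the choice is the elbow method, which plots the total within-cluster sum of squares against k and looks for a bend. A quick sketch using factoextra’s fviz_nbclust:

library(factoextra)

# Total within-cluster sum of squares for k = 1..10; an "elbow" in the
# curve suggests a reasonable number of clusters.
fviz_nbclust(bothdata2, kmeans, method = "wss", k.max = 10)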

PCA

Principal Component Analysis (PCA) reduces the dimensions of a dataset while still retaining most of its variance; in other words, it compresses the dataset while conserving most of the information. We use PCA to visualize the clusters in fewer dimensions and to check that the songs really do separate into the clusters we found.

# Center each feature and scale it to unit variance before extracting components.
pca <- prcomp(bothdata2, center = TRUE, scale. = TRUE)
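
prcomp also returns the loadings of each original variable on each component, which is worth a glance before reading the biplot later on:

# Each column of pca$rotation shows how strongly each audio feature
# contributes to that principal component.
round(pca$rotation, 2)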

Results and Analysis

Visualizations of the Cluster Plots

# Per-cluster means of each audio feature for the k = 3 clustering
aggregate(bothdata2, by = list(cluster = small_km_bothdata$cluster), mean)
##   cluster danceability    energy  loudness speechiness acousticness   valence
## 1       1    0.6741591 0.2291523 0.7417724  0.81375000    0.8506818 0.7471818
## 2       2    0.5719127 0.7199494 0.2729605  0.09709756    0.1201023 0.6331182
## 3       3    0.5298102 0.4124280 0.4346538  0.06298717    0.6509446 0.5372994
##       tempo
## 1 0.5147478
## 2 0.6191805
## 3 0.5717182

It turns out that there are visible differences in just the cluster averages. For example, while the tempo of each cluster is relatively similar, the average energy, loudness, and acousticness are drastically different.

library(factoextra)  # for fviz_cluster() and fviz_silhouette()

fviz_cluster(small_km_bothdata, bothdata2, axes = c(1, 2), labelsize = 1, main = "Small K-Means Cluster Plot")

# Per-cluster means of each audio feature for the k = 6 clustering
aggregate(bothdata2, by = list(cluster = big_km_bothdata$cluster), mean)
##   cluster danceability    energy  loudness speechiness acousticness   valence
## 1       1    0.5036331 0.4967410 0.3727740  0.06366835    0.2262716 0.3675058
## 2       2    0.6073789 0.6847266 0.2984702  0.07047305    0.1436165 0.7947344
## 3       3    0.4684876 0.2873473 0.5001397  0.07260233    0.7789380 0.3141527
## 4       4    0.5927459 0.5125028 0.3880264  0.06029890    0.6226906 0.7909779
## 5       5    0.5514813 0.8443972 0.2088837  0.14703131    0.0600002 0.5164879
## 6       6    0.6795366 0.2245780 0.7427982  0.83929268    0.8603659 0.7607317
##       tempo
## 1 0.5779462
## 2 0.5727576
## 3 0.5511673
## 4 0.5939729
## 5 0.6870706
## 6 0.5106199

With the larger k, we can see that once again some song characteristics are close in value while others differ substantially. It’s hard to tell whether having more clusters identifies the songs’ groups more accurately, but it is clear from observation that these clusters have different focal points based on the averages of their variables.

fviz_cluster(big_km_bothdata, bothdata2, axes = c(1, 2), labelsize = 1, main = "Large K-Means Cluster Plot")

library(plotly)

# Attach the cluster assignments to the feature data as factors so
# plotly colors the points by cluster.
bothdata_small <- bothdata2
bothdata_small['kmean'] <- small_km_bothdata$cluster
bothdata_big <- bothdata2
bothdata_big['kmean'] <- big_km_bothdata$cluster
bothdata_small$kmean <- as.factor(bothdata_small$kmean)
bothdata_big$kmean <- as.factor(bothdata_big$kmean)

# 3D scatter plots of three of the features, colored by cluster.
small_fig <- plot_ly(data = bothdata_small, x = ~danceability, y = ~energy, z = ~acousticness,
                     color = ~kmean, type = 'scatter3d', mode = 'markers')
big_fig <- plot_ly(data = bothdata_big, x = ~danceability, y = ~energy, z = ~acousticness,
                   color = ~kmean, type = 'scatter3d', mode = 'markers',
                   colors = c("red", "orange", "yellow", "green", "blue", "purple",
                              "white", "black", "goldenrod", "grey"))

Silhouette Analysis

The silhouette coefficient is a measure of cluster cohesion and separation. It quantifies how well a data point fits into its assigned cluster based on two factors: how close the data point is to the other points in its cluster, and how far it is from points in other clusters.
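
Concretely, for a point i, let a(i) be the mean distance from i to the other points in its own cluster and b(i) the mean distance from i to the points of the nearest other cluster. The silhouette coefficient is then s(i) = (b(i) - a(i)) / max(a(i), b(i)), which ranges from -1 (probably misassigned) to 1 (well matched to its cluster), with values near 0 indicating a point on the border between clusters.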

library(cluster)    # for silhouette()
library(gridExtra)  # for grid.arrange()

# Pairwise Euclidean distances between songs, used to compute silhouettes.
dd <- dist(bothdata2, method = "euclidean")
small_silhouette <- fviz_silhouette(silhouette(small_km_bothdata$cluster, dd), main = "Small K-Means Silhouette Analysis")
##   cluster size ave.sil.width
## 1       1   44          0.51
## 2       2  573          0.38
## 3       3  343          0.26
big_silhouette <- fviz_silhouette(silhouette(big_km_bothdata$cluster, dd), main = "Big K-Means Silhouette Analysis")
##   cluster size ave.sil.width
## 1       1  139          0.20
## 2       2  256          0.22
## 3       3  129          0.29
## 4       4  181          0.24
## 5       5  214          0.20
## 6       6   41          0.52
grid.arrange(small_silhouette, big_silhouette)

The silhouette scores for both k-means runs are positive, which means that, on average, songs sit closer to their own cluster than to neighboring clusters. However, the average score dips with the larger k, indicating that the big k-means introduced unfavorable overlaps between clusters. Based on this and the cluster visualizations, we can say with some confidence that the smaller k gives better clustering results than the larger k.
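
To back this up with a single number per run, we could also compute the overall mean silhouette width directly:

# Overall mean silhouette width for each clustering (higher is better).
mean(silhouette(small_km_bothdata$cluster, dd)[, "sil_width"])
mean(silhouette(big_km_bothdata$cluster, dd)[, "sil_width"])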

PCA Variance Contribution

With the summary of the PCA, we can see how much of the data’s variance each component captures.

summary(pca)
## Importance of components:
##                           PC1    PC2    PC3    PC4     PC5     PC6     PC7
## Standard deviation     1.6332 1.2125 0.9747 0.8779 0.79801 0.60208 0.37715
## Proportion of Variance 0.3811 0.2100 0.1357 0.1101 0.09098 0.05179 0.02032
## Cumulative Proportion  0.3811 0.5911 0.7268 0.8369 0.92789 0.97968 1.00000
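
The first two components together explain about 59% of the variance, which is what the two-dimensional cluster plots above are built on. For a visual version of this table, factoextra also provides a scree plot:

# Scree plot: percentage of variance explained by each component.
fviz_eig(pca)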

With a biplot, we can identify the directions in which each variable “branches out,” as well as label the area of influence of each cluster and how it is affected by the components. We’ll use the smaller k-means clustering to showcase this:

library(ggbiplot)  # biplot of a prcomp object; available from GitHub (vqv/ggbiplot)

fig_pca <- ggbiplot(pca, ellipse = TRUE, circle = TRUE, obs.scale = 1, var.scale = 1,
                    labels.size = 3, alpha = 0.2, groups = bothdata_small$kmean) +
  ggtitle("PCA of Beatles + BTS") +
  theme_minimal() +
  theme(legend.position = "bottom")
fig_pca

We can see that acousticness and loudness are closely correlated. Danceability and valence are somewhat close, as are energy and tempo, although energy has a greater effect than tempo. Speechiness stands on its own.

Implications of Analysis

Something that our analysis seems to hint at is that if we use song characteristics to perform K-means analysis, then the song clusters must to some extent represent genres, which by definition are categories of artistic composition. There is a definite correlation between our unsupervised learning results and genres of music. However, we would like to stress that, given the nature of our dataset and the lack of predefined categories going in, it would be difficult to map each cluster onto a specific, named genre.

Based on our earlier analysis, the smaller k-means seems much better suited to our data set, so any further analysis will use the smaller k-means.

Although hinted at in the PCA biplot, it is still worthwhile to plot the correlations between the song variables explicitly.

library(ellipse)       # for plotcorr()
library(RColorBrewer)  # for brewer.pal()

# Correlation matrix of the seven features, reordered by correlation with
# the first variable so related features end up adjacent.
cor_data <- cor(bothdata_small[1:7])
ord <- order(cor_data[1, ])
data_order <- cor_data[ord, ord]

# Build a 100-step diverging palette from a 5-color Spectral base.
my_colors <- brewer.pal(5, "Spectral")
my_colors <- colorRampPalette(my_colors)(100)

# Map each correlation in [-1, 1] to a palette index (roughly 1-100).
plotcorr(data_order, col = my_colors[data_order * 50 + 50], mar = c(1, 1, 1, 1))

This correlogram more or less matches what the PCA biplot showed. That matters because it tells us which of the variables truly have an independent impact, and which are more or less duplicates of others.

It’s harder to say precisely how each variable drives the clustering, but we can at least look at the averages of each cluster’s variables:

# Per-cluster feature means, computed column by column.
bothdata_small$kmean <- as.numeric(bothdata_small$kmean)
cluster1 <- sapply(bothdata_small[bothdata_small$kmean == 1, ], mean)
cluster2 <- sapply(bothdata_small[bothdata_small$kmean == 2, ], mean)
cluster3 <- sapply(bothdata_small[bothdata_small$kmean == 3, ], mean)
cluster1
cluster2
cluster3
## danceability       energy     loudness  speechiness acousticness      valence 
##    0.6741591    0.2291523    0.7417724    0.8137500    0.8506818    0.7471818 
##        tempo        kmean 
##    0.5147478    1.0000000
## danceability       energy     loudness  speechiness acousticness      valence 
##   0.57191274   0.71994939   0.27296052   0.09709756   0.12010231   0.63311815 
##        tempo        kmean 
##   0.61918049   2.00000000
## danceability       energy     loudness  speechiness acousticness      valence 
##   0.52981020   0.41242799   0.43465378   0.06298717   0.65094461   0.53729942 
##        tempo        kmean 
##   0.57171822   3.00000000

We can see that even though we chose k = 3 based on the tempo peaks, the clusters’ average tempos turn out to be very similar. Speechiness is similar between clusters 2 and 3, but cluster 1’s speechiness is drastically different. The averages of energy, loudness, acousticness, and valence also differ across the clusters.

Based on the averages of the clusters, when looking at new songs, we can try to assign them to a cluster using the following rules (a sketch of this nearest-centroid assignment follows the list):

  • If the speechiness of a song is closer to 0.8 than to 0.1, it almost certainly belongs in cluster 1.
  • If speechiness is inconclusive, we know that loudness and acousticness are highly correlated, so we try to match the new song’s loudness and acousticness as a pair.
  • If loudness and acousticness cannot identify the cluster, we look at the rest of the variables and match them as closely as possible.
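
In effect, these rules approximate assigning a new song to the nearest cluster centroid. Here is a minimal sketch of that, assuming the new song’s features have already been rescaled the same way as the training data (the feature values below are made up for illustration):

# Hypothetical new song; the feature order must match colnames(bothdata2).
new_song <- c(danceability = 0.60, energy = 0.70, loudness = 0.30,
              speechiness = 0.08, acousticness = 0.15, valence = 0.65,
              tempo = 0.60)

# Euclidean distance from the new song to each k-means centroid.
centers <- small_km_bothdata$centers
dists <- apply(centers, 1, function(center) sqrt(sum((center - new_song)^2)))
which.min(dists)  # the cluster this song would join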

For these songs at least, the clear defining characteristics appear to be speechiness and the loudness/acousticness pair. We think that since these are both pop bands, the other variables tend to be fairly similar, so these are the variables that stand out.

We think it’s safe to say that there is some tie between the variables, how they cluster the songs, and what we call genres of music. However, without a formalization of each variable’s impact, it’s hard to justify or claim exactly which genre each cluster belongs to.

Applications and Impact

While listening to music, people usually like to hear songs with the same vibe and mood, and this can greatly affect the listener’s mood. Creating the dataset is one of the main reservations: the available datasets are limited and contain different attributes, so they would need to be streamlined into a single dataset. We could further use the clusters to label the mood a song belongs to and feed this into a model that recommends songs with a similar mood. An ideal model would accurately recommend a song of the same category to the user; if the model is inaccurate, the recommended songs would belong to a different category, and the user would stop trusting further suggestions from the algorithm. Another application: when a song entered by the user is categorized as sad, the recommendation system could gradually steer its suggestions toward happier, more energetic songs and help improve the user’s mood.

The clusters could also be used to group songs of similar tempo, energy, danceability, and so on, which could help create intelligent playlists of songs with similar characteristics.