Discovering similarities in Spotify music

Ezra Soterion Nugroho

3/8/2020

Brief Introduction

Music taste is very unique, and very personal. It somehow tells the characteristic quality of a person. Of all the millions songs that exist, the fact that we can develop a liking for particular style, genre, or a “subset” of something, I believe is not by mere chance.

Our own preferences are actually exceeds / beyond just genre. There are many factors that can influence someone song preferences.

Here in this brief study, we will try to cluster songs by various factors. Are they somehow similar? What are the similarities between them? And more importantly, is there a way to gather them together based on how they sound?

Import Library

getwd()

## [1] "C:/Users/Asus/Documents/DataQuest/Algoritma Data Science/R/My Projects/Spotify Classification"

library(tidyverse)

## -- Attaching packages ------------------------------------- tidyverse 1.3.0 --

## v ggplot2 3.2.1     v purrr   0.3.3
## v tibble  2.1.3     v dplyr   0.8.4
## v tidyr   1.0.2     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.4.0

## Warning: package 'readr' was built under R version 3.6.3

## -- Conflicts ---------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(dplyr)
library(GGally)

## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

## 
## Attaching package: 'GGally'

## The following object is masked from 'package:dplyr':
## 
##     nasa

library(FactoMineR)

## Warning: package 'FactoMineR' was built under R version 3.6.3

library(tidyverse)
library(dplyr)
library(factoextra)

## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa

Read Data

spotify <- read.csv("C:/Users/Asus/Documents/DataQuest/Algoritma Data Science/R/My Projects/Spotify Classification/SpotifyFeatures.csv")

glimpse(spotify)

## Observations: 232,725
## Variables: 18
## $ ï..genre         <fct> Movie, Movie, Movie, Movie, Movie, Movie, Movie, M...
## $ artist_name      <fct> Henri Salvador, Martin & les fÃ©es, Joseph William...
## $ track_name       <fct> "C'est beau de faire un Show", "Perdu d'avance (pa...
## $ track_id         <fct> 0BRjO6ga9RKCKjfDqeFgWV, 0BjC1NfoEOOusryehmNudP, 0C...
## $ popularity       <int> 0, 1, 3, 0, 4, 0, 2, 15, 0, 10, 0, 2, 4, 3, 0, 0, ...
## $ acousticness     <dbl> 0.61100, 0.24600, 0.95200, 0.70300, 0.95000, 0.749...
## $ danceability     <dbl> 0.389, 0.590, 0.663, 0.240, 0.331, 0.578, 0.703, 0...
## $ duration_ms      <int> 99373, 137373, 170267, 152427, 82625, 160627, 2122...
## $ energy           <dbl> 0.9100, 0.7370, 0.1310, 0.3260, 0.2250, 0.0948, 0....
## $ instrumentalness <dbl> 0.00e+00, 0.00e+00, 0.00e+00, 0.00e+00, 1.23e-01, ...
## $ key              <fct> C#, F#, C, C#, F, C#, C#, F#, C, G, E, C, F#, D#, ...
## $ liveness         <dbl> 0.3460, 0.1510, 0.1030, 0.0985, 0.2020, 0.1070, 0....
## $ loudness         <dbl> -1.828, -5.559, -13.879, -12.178, -21.150, -14.970...
## $ mode             <fct> Major, Minor, Minor, Major, Major, Major, Major, M...
## $ speechiness      <dbl> 0.0525, 0.0868, 0.0362, 0.0395, 0.0456, 0.1430, 0....
## $ tempo            <dbl> 166.969, 174.003, 99.488, 171.758, 140.576, 87.479...
## $ time_signature   <fct> 4/4, 4/4, 5/4, 4/4, 4/4, 4/4, 4/4, 4/4, 4/4, 4/4, ...
## $ valence          <dbl> 0.8140, 0.8160, 0.3680, 0.2270, 0.3900, 0.3580, 0....

Here are the definition of the variables according to https://developer.spotify.com/documentation/web-api/reference/tracks/get-audio-features/

duration_ms :The duration of the track in milliseconds. key : The estimated overall key of the track. Integers map to pitches using standard Pitch Class notation . E.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1.

mode : Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.

time_signature : An estimated overall time signature of a track. The time signature (meter) is a notational convention to specify how many beats are in each bar (or measure).

acousticness : A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.

danceability : Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity.

Energy : Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity.

Instrumentalness : Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context.

liveness : Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live.

loudness : The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks.

speechiness : Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value.

valence : A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry). The distribution of values for this feature look like this:Valence distribution

tempo : The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.

id : The Spotify ID for the track.

uri : The Spotify URI for the track.

track_href : A link to the Web API endpoint providing full details of the track.

analysis_url : An HTTP URL to access the full audio

Exploratory Data Analysis

Correlation between Variables

ggcorr(spotify,label = T, label_size = 3)

## Warning in ggcorr(spotify, label = T, label_size = 3): data in column(s)
## 'ï..genre', 'artist_name', 'track_name', 'track_id', 'key', 'mode',
## 'time_signature' are not numeric and were ignored

Selecting Variables

Based on my business-wise, we will deselect several variables that probably may not suitable for this projects. Some of my considerations are how much does it related to classification.

ppt <- spotify %>% 
  select_if(is.numeric) %>% 
  select(-popularity)

glimpse(ppt)

## Observations: 232,725
## Variables: 10
## $ acousticness     <dbl> 0.61100, 0.24600, 0.95200, 0.70300, 0.95000, 0.749...
## $ danceability     <dbl> 0.389, 0.590, 0.663, 0.240, 0.331, 0.578, 0.703, 0...
## $ duration_ms      <int> 99373, 137373, 170267, 152427, 82625, 160627, 2122...
## $ energy           <dbl> 0.9100, 0.7370, 0.1310, 0.3260, 0.2250, 0.0948, 0....
## $ instrumentalness <dbl> 0.00e+00, 0.00e+00, 0.00e+00, 0.00e+00, 1.23e-01, ...
## $ liveness         <dbl> 0.3460, 0.1510, 0.1030, 0.0985, 0.2020, 0.1070, 0....
## $ loudness         <dbl> -1.828, -5.559, -13.879, -12.178, -21.150, -14.970...
## $ speechiness      <dbl> 0.0525, 0.0868, 0.0362, 0.0395, 0.0456, 0.1430, 0....
## $ tempo            <dbl> 166.969, 174.003, 99.488, 171.758, 140.576, 87.479...
## $ valence          <dbl> 0.8140, 0.8160, 0.3680, 0.2270, 0.3900, 0.3580, 0....

ggcorr(ppt,label = T, label_size = 3)

Density Analysis

It seems like energy and loudness are highly positively correlated. Also, valence is positively correlated with danceability and energy. Considering happy songs make people energetic and want to dance, the correlation make a lot sense. *Interestingly, speechiness and loudness are negatively correlated with each other.

correlated_density <- ggplot(spotify) +
    geom_density(aes(energy, fill ="energy", alpha = 0.1)) + 
    geom_density(aes(valence, fill ="valence", alpha = 0.1)) + 
    geom_density(aes(danceability, fill ="danceability", alpha = 0.1)) + 
    scale_x_continuous(name = "Energy, Valence and Danceability") +
    scale_y_continuous(name = "Density") +
    ggtitle("Density plot of Energy, Valence and Danceability") +
    theme_bw() +
    theme(plot.title = element_text(size = 14, face = "bold"),
          text = element_text(size = 12)) +
    theme(legend.title=element_blank()) +
    scale_fill_brewer(palette="Accent")

correlated_density

As it can be seen on the graph, since these variables are positively correlated and have limited between (0,1), the distribution of these variables look similar to each other.

Cleaning null values

colSums(is.na(ppt))

##     acousticness     danceability      duration_ms           energy 
##                0                0                0                0 
## instrumentalness         liveness         loudness      speechiness 
##                0                0                0                0 
##            tempo          valence 
##                0                0

Since there are no null values, we don’t need to worry about filling the missing information.

From the PCA graph of individuals we can see that there is no outlier that we need to worry about. But beneath that we can not see anymore information.

In PCA group of variables we can see the correlation between variables.

K-Means

Data clustering is a common data mining technique to create clusters of data that can be identified as “data with the same characteristics”. Before performing data clustering, you will need to remove the identified outlier based the previous individual PCA plot. The observation with coffeeId 1082 is a fairly extending outlier compared to the rest of the observation. Remove the observation from our initial dataset and once again scale the data.

# scaling 
ppt_z <- scale(ppt, center = T, scale = T)

The next step in building a K-means clustering is to find the optimum cluster number to model our data. We will use the function belowe to find the optimum K using Elbow method.

Determining Optimum K value

# elbow plot manual function

wss <- function(data, maxCluster = 9) {
    # Initialize within sum of squares
    set.seed(50)
    SSw <- (nrow(data) - 1) * sum(apply(data, 2, var))
    SSw <- vector()
    for (i in 1:maxCluster) {
        SSw[i] <- sum(kmeans(data, centers = i)$withinss)
    }
    plot(1:maxCluster, SSw, type = "o", xlab = "Number of Clusters", ylab = "Total Within Sum of Squares", pch=19)
}

wss(ppt_z)

## Warning: Quick-TRANSfer stage steps exceeded maximum (= 11636250)

## Warning: did not converge in 10 iterations

Based on the elbow plot generated from the function above, the optimal number of clusters use is 5.

K-means is a clustering algorithm that groups the data based on distance. The resulting clusters are stated to be optimum if the distance between data in the same cluster is low and the distance between data from different clusters is high.

Building Cluster

set.seed(100)
spotify_km <- kmeans(ppt_z, 5)

# Assigning cluster to the first data
ppt$cluster <- spotify_km$cluster

ppt$cluster <- as.factor(ppt$cluster)

Principal Component Analysis

Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables (entities each of which takes on various numerical values) into a set of values of linearly uncorrelated variables called principal components. This transformation is defined in such a way that the first principal component has the largest possible variance (that is, accounts for as much of the variability in the data as possible), and each succeeding component in turn has the highest variance possible under the constraint that it is orthogonal to the preceding components. The resulting vectors (each being a linear combination of the variables and containing n observations) are an uncorrelated orthogonal basis set. PCA is sensitive to the relative scaling of the original variables.

Create and Visualize PCA

spotify.pca <- PCA(ppt, quali.sup = 11, graph = F, scale.unit = T)

# plot

plot.PCA(spotify.pca, choix = "ind", label = "none", habillage = 11)

Combining PCA with K-Means

plot.PCA(spotify.pca, choix="var", col.ind=ppt$cluster)

fviz_cluster(spotify_km, 
             data = ppt[,-11]) + 
  theme_minimal()

Goodness of Fit

From this test we can measure how good our clustering model with 3 values :

Within Sum of Squares ($ withinss): distance of each observation to the centroid for each cluster -> squared -> summed Total Sum of Squares ($ totss): the distance of each observation to the global sample mean (overall data average). Between Sum of Squares ($ betweenss): centroid distance of each cluster to the global sample mean -> squared -> times the amount of data per cluster -> summed

spotify_km$withinss

## [1] 226112.3 103825.0 320126.0 225670.8 340558.7

spotify_km$tot.withinss

## [1] 1216293

spotify_km$betweenss

## [1] 1110947

spotify_km$totss

## [1] 2327240

# `between_SS / total_SS`
(spotify_km$betweenss/spotify_km$totss)*100

## [1] 47.73669

We can see that the ratio between sum of squares / total sum of squares is close to 1 (100%), which indicating that the our model has made a pretty good cluster.

Summarize Cluster

ppt %>% 
  group_by(cluster) %>% 
  summarise_all("mean")

## # A tibble: 5 x 11
##   cluster acousticness danceability duration_ms energy instrumentalness liveness
##   <fct>          <dbl>        <dbl>       <dbl>  <dbl>            <dbl>    <dbl>
## 1 1             0.843         0.291     264589.  0.166          0.740      0.147
## 2 2             0.789         0.563     244096.  0.664          0.00112    0.729
## 3 3             0.195         0.706     221699.  0.670          0.0572     0.172
## 4 4             0.704         0.506     225735.  0.330          0.0674     0.178
## 5 5             0.0996        0.502     244950.  0.764          0.0946     0.244
## # ... with 4 more variables: loudness <dbl>, speechiness <dbl>, tempo <dbl>,
## #   valence <dbl>

From seeing the graph or in the summarise table we can see the data has been clustered into 5 categories with their own distinct characteristics. Cluster 1 has high instrumentality, and acousticness. Since danceability, energy, valence, and loudness are correlating with each other they go with similar direction in variable graph. Cluster 3 & 5 for example has high value of those factors. When cluster 4 has a quite medicore value for each variable. While cluster 2 has high speechiness and liveness.

Conclusion

In conclusion we have classified songs in spotify into several cluster by considering several variables and seeing the correlation between each variables in the music. For the next development, This analysis may be applied to a personal choice of songs, so we can derived a more personal classification, and answer more question such as how varied are someones Spotify’s saved songs.