Spotify Track’s Popularity Clustering

Yohana

16 August 2020


Hi, welcome to my Learning by Building for Unsupervised Learning. In this project, I will try to cluster Spotify tracks based on their popularity and audio features. The dataset is downloaded from kaggle.com.

Load Library

Here are the libraries I will use in this LBB:

library(tidyverse)
library(glue)
library(factoextra)
library(FactoMineR)
library(ggplot2)
library(GGally)
library(scales)
library(plotly)
library(corrplot)
library(ggrepel)
library(gganimate)

options(scipen = 9999)

Load Dataset

Let’s just load the data for this project

tracks <- read.csv("SpotifyFeatures.csv")
colnames(tracks)[1] = "genre"
tracks

There are 18 columns in the dataset:

genre : Track genre

artist_name : Artist name

track_name : Track title

track_id : Track ID

popularity : Popularity rate (0-100) of the track

acousticness : A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.

danceability : Describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.

duration_ms : Track’s duration in milliseconds.

energy : A measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy.

instrumentalness : Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.

key : The estimated overall key of the track. Integers map to pitches using standard Pitch Class notation.

liveness : Detects the presence of an audience in the recording.

loudness : The loudness of a track in decibels (dB)

mode : Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived

speechiness : Speechiness detects the presence of spoken words in a track.

tempo : An estimated tempo of a track in beats per minute (BPM).

time_signature : Time signature of a track

valence : A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track.

I will use my own custom theme for the plots:

white_theme <- theme(
  panel.background = element_rect(fill="white"),
  plot.background = element_rect(fill="white"),
  panel.grid.minor.x = element_blank(),
  panel.grid.major.x = element_blank(),
  panel.grid.minor.y = element_blank(),
  panel.grid.major.y = element_blank(),
  text = element_text(color="black"),
  axis.text = element_text(color="black"),
  strip.background =element_rect(fill="snow3"),
  strip.text = element_text(colour = 'black')
  )

Exploratory Data Analysis

Data Cleaning

Before we continue, let’s check whether any genres are duplicated.

unique(tracks$genre)
##  [1] "Movie"              "R&B"                "A Capella"         
##  [4] "Alternative"        "Country"            "Dance"             
##  [7] "Electronic"         "Anime"              "Folk"              
## [10] "Blues"              "Opera"              "Hip-Hop"           
## [13] "Children's Music"   "Childrenâ\200\231s Music" "Rap"               
## [16] "Indie"              "Classical"          "Pop"               
## [19] "Reggae"             "Reggaeton"          "Jazz"              
## [22] "Rock"               "Ska"                "Comedy"            
## [25] "Soul"               "Soundtrack"         "World"

In genre, Children’s Music appears twice: once spelled normally and once with a mis-encoded apostrophe (â\200\231). Let’s take those mis-encoded symbols out.

tracks <- tracks %>% 
  mutate(genre = as.factor(str_replace_all(genre, "â\200\231", "")))

unique(tracks$genre)
##  [1] Movie            R&B              A Capella        Alternative     
##  [5] Country          Dance            Electronic       Anime           
##  [9] Folk             Blues            Opera            Hip-Hop         
## [13] Children's Music Childrens Music  Rap              Indie           
## [17] Classical        Pop              Reggae           Reggaeton       
## [21] Jazz             Rock             Ska              Comedy          
## [25] Soul             Soundtrack       World           
## 27 Levels: A Capella Alternative Anime Blues ... World

The mis-encoded symbols are gone, but we still have two levels, Children’s Music and Childrens Music; they will merge into one once we strip the remaining apostrophes in the next step.

Next, I will drop the artist_name, track_name and track_id columns since we won’t use them, convert genre, key and mode into factors with as.factor(), and remove the remaining apostrophes in genre with str_replace_all(). But before that, I will remove the duplicated tracks.

tracks <- tracks[!duplicated(tracks$track_id),]

tracks <- tracks %>% 
  select(-c(artist_name, track_name, track_id)) %>% 
  mutate_if(is.integer, as.numeric) %>% 
  mutate(genre = as.factor(str_replace_all(genre, "'", "")),
         key = as.factor(key),
         mode = as.factor(mode))

Let’s check the data types

str(tracks)
## 'data.frame':    176774 obs. of  15 variables:
##  $ genre           : Factor w/ 26 levels "A Capella","Alternative",..: 15 15 15 15 15 15 15 15 15 15 ...
##  $ popularity      : num  0 1 3 0 4 0 2 15 0 10 ...
##  $ acousticness    : num  0.611 0.246 0.952 0.703 0.95 0.749 0.344 0.939 0.00104 0.319 ...
##  $ danceability    : num  0.389 0.59 0.663 0.24 0.331 0.578 0.703 0.416 0.734 0.598 ...
##  $ duration_ms     : num  99373 137373 170267 152427 82625 ...
##  $ energy          : num  0.91 0.737 0.131 0.326 0.225 0.0948 0.27 0.269 0.481 0.705 ...
##  $ instrumentalness: num  0 0 0 0 0.123 0 0 0 0.00086 0.00125 ...
##  $ key             : Factor w/ 12 levels "A","A#","B","C",..: 5 10 4 5 9 5 5 10 4 11 ...
##  $ liveness        : num  0.346 0.151 0.103 0.0985 0.202 0.107 0.105 0.113 0.0765 0.349 ...
##  $ loudness        : num  -1.83 -5.56 -13.88 -12.18 -21.15 ...
##  $ mode            : Factor w/ 2 levels "Major","Minor": 1 2 2 1 1 1 1 1 1 1 ...
##  $ speechiness     : num  0.0525 0.0868 0.0362 0.0395 0.0456 0.143 0.953 0.0286 0.046 0.0281 ...
##  $ tempo           : num  167 174 99.5 171.8 140.6 ...
##  $ time_signature  : chr  "4/4" "4/4" "5/4" "4/4" ...
##  $ valence         : num  0.814 0.816 0.368 0.227 0.39 0.358 0.533 0.274 0.765 0.718 ...

Now, we already have proper data types

Let’s check if there are any missing values in our data

colSums(is.na(tracks))
##            genre       popularity     acousticness     danceability 
##                0                0                0                0 
##      duration_ms           energy instrumentalness              key 
##                0                0                0                0 
##         liveness         loudness             mode      speechiness 
##                0                0                0                0 
##            tempo   time_signature          valence 
##                0                0                0

There are no missing values in our data

Correlation

Now, let’s check the correlation between our variables

tracks %>% 
  select(-c(genre, key, mode, time_signature)) %>% 
  cor() %>%
  corrplot(type = "upper", method = "ellipse", tl.cex = 0.8)

  # ggcorr(label = TRUE, hjust = 1, layout.exp = 2)

Based on the correlation plot, the features most correlated with popularity are danceability, energy, loudness, tempo, and valence.
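
If we want the exact numbers instead of reading the ellipses, we can pull the popularity column out of the correlation matrix and sort it. A minimal sketch, reusing the same columns as above:

# correlation of every numeric feature with popularity, sorted
# (popularity itself appears first with correlation 1)
tracks %>% 
  select(-c(genre, key, mode, time_signature)) %>% 
  cor() %>% 
  .[, "popularity"] %>% 
  sort(decreasing = TRUE) %>% 
  round(2)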

Now, let’s examine those correlations with density plots.

feature <- tracks %>% 
  select(-c(key, mode, time_signature))

feature <- names(feature)[3:12]

tracks %>%
  select(-c(key, mode, time_signature)) %>% 
  select(c(popularity, feature)) %>%
  pivot_longer(cols = feature) %>%
  ggplot(aes(x = value)) +
  geom_density(aes(color = popularity), alpha = 0.5) +
  facet_wrap(~name, ncol = 2, scales = "free") +
  labs(title = "Audio Feature Colleration with Popularity",
       x = NULL, y = "density") +
  theme(axis.text.y = element_blank(),
        plot.title = element_text(hjust = 0.5)) +
  white_theme

Based on the plots above, it looks like danceability, energy, loudness, tempo, and valence have the most influence on popularity, which matches what we saw in the correlation plot.

tracks %>%
  select(-c(key, mode, time_signature, popularity)) %>% 
  select(c(genre, feature)) %>%
  pivot_longer(cols = feature) %>%
  ggplot(aes(x = value)) +
  geom_density(aes(color = genre), alpha = 0.5) +
  facet_wrap(~name, ncol = 2, scales = "free") +
  labs(title = "Spotify Audio Feature Density - by Genre",
       x = NULL, y = "density") +
  theme(axis.text.y = element_blank(),
        plot.title = element_text(hjust = 0.5)) +
   white_theme

Based on the density plots, the features that best differentiate genres are danceability, energy, loudness and tempo.
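
To back this up with numbers, we could also compare the averages of those four features per genre. A quick sketch:

# mean of the most genre-distinguishing features, per genre
tracks %>% 
  group_by(genre) %>% 
  summarise(across(c(danceability, energy, loudness, tempo), mean)) %>% 
  arrange(desc(energy))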

Next, let’s look at the distribution of popularity.

hist(tracks$popularity)

Data Scaling

Before we scale our data, let’s see which columns need to be scaled.

summary(tracks)
##          genre          popularity      acousticness     danceability   
##  Comedy     :  9674   Min.   :  0.00   Min.   :0.0000   Min.   :0.0569  
##  Electronic :  9149   1st Qu.: 25.00   1st Qu.:0.0456   1st Qu.:0.4150  
##  Alternative:  9095   Median : 37.00   Median :0.2880   Median :0.5580  
##  Anime      :  8935   Mean   : 36.27   Mean   :0.4041   Mean   :0.5411  
##  Classical  :  8711   3rd Qu.: 49.00   3rd Qu.:0.7910   3rd Qu.:0.6830  
##  Reggae     :  8687   Max.   :100.00   Max.   :0.9960   Max.   :0.9890  
##  (Other)    :122523                                                     
##   duration_ms          energy          instrumentalness         key       
##  Min.   :  15387   Min.   :0.0000203   Min.   :0.0000000   C      :20970  
##  1st Qu.: 178253   1st Qu.:0.3440000   1st Qu.:0.0000000   G      :20476  
##  Median : 219453   Median :0.5920000   Median :0.0000704   D      :18643  
##  Mean   : 236127   Mean   :0.5570245   Mean   :0.1720729   A      :17499  
##  3rd Qu.: 268547   3rd Qu.:0.7890000   3rd Qu.:0.0908000   C#     :16856  
##  Max.   :5552917   Max.   :0.9990000   Max.   :0.9990000   F      :15605  
##                                                            (Other):66725  
##     liveness          loudness          mode         speechiness    
##  Min.   :0.00967   Min.   :-52.457   Major:116619   Min.   :0.0222  
##  1st Qu.:0.09750   1st Qu.:-12.851   Minor: 60155   1st Qu.:0.0368  
##  Median :0.13000   Median : -8.191                  Median :0.0494  
##  Mean   :0.22453   Mean   :-10.138                  Mean   :0.1274  
##  3rd Qu.:0.27700   3rd Qu.: -5.631                  3rd Qu.:0.1020  
##  Max.   :1.00000   Max.   :  3.744                  Max.   :0.9670  
##                                                                     
##      tempo        time_signature        valence      
##  Min.   : 30.38   Length:176774      Min.   :0.0000  
##  1st Qu.: 92.01   Class :character   1st Qu.:0.2220  
##  Median :115.01   Mode  :character   Median :0.4400  
##  Mean   :117.20                      Mean   :0.4516  
##  3rd Qu.:138.80                      3rd Qu.:0.6670  
##  Max.   :242.90                      Max.   :1.0000  
## 

Our numeric columns have very different ranges, so we will scale all numeric variables.

Time to scale our data.

tracks_cleaned <- scale(tracks %>% select_if(is.numeric))
plot(prcomp(tracks_cleaned))

Our data is now scaled; let’s continue.
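
As an optional sanity check, every scaled column should now have a mean of roughly 0 and a standard deviation of roughly 1:

# means should be ~0 and standard deviations ~1 after scale()
round(colMeans(tracks_cleaned), 3)
round(apply(tracks_cleaned, 2, sd), 3)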

K - Means Clustering

First, let’s find the optimum k for our clustering

RNGkind(sample.kind = "Rounding")
kmeansTunning <- function(data, maxK) {
  withinall <- NULL
  total_k <- NULL
  for (i in 2:maxK) {
    set.seed(101)
    temp <- kmeans(data,i)$tot.withinss
    withinall <- append(withinall, temp)
    total_k <- append(total_k,i)
  }
  plot(x = total_k, y = withinall, type = "o", xlab = "Number of Cluster", ylab = "Total within")
}

kmeansTunning(tracks_cleaned, maxK = 5)

Based on the elbow plot, the optimum k we get is 3.
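
As a cross-check, factoextra (already loaded) provides fviz_nbclust() for the same elbow method. Note that it refits k-means for every k, which can be slow on ~177k rows, so treat this as an optional sanity check:

# elbow plot with factoextra; "wss" = total within sum of squares
fviz_nbclust(tracks_cleaned, kmeans, method = "wss", k.max = 5)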

Now, time to build the k-means model with k = 3 and nstart = 20.

set.seed(123)
km <- kmeans(tracks_cleaned, 3, nstart = 20)

summary(km)
##              Length Class  Mode   
## cluster      176774 -none- numeric
## centers          33 -none- numeric
## totss             1 -none- numeric
## withinss          3 -none- numeric
## tot.withinss      1 -none- numeric
## betweenss         1 -none- numeric
## size              3 -none- numeric
## iter              1 -none- numeric
## ifault            1 -none- numeric
clust <- tracks %>%
  select_if(is.numeric) %>% 
  mutate(cluster = as.factor(km$cluster))

# rescale every numeric feature to a 0-100 range so the clusters are comparable
clust <- clust %>% 
  mutate(across(where(is.numeric), ~ rescale(.x, to = c(0, 100))))

clust %>% 
  group_by(cluster) %>% 
  summarise_all(mean) %>%
  pivot_longer(cols = -1) %>% 
  ggplot(aes(x = name, y = value)) +
  geom_col(aes(fill = name)) +
  facet_wrap(~cluster)+
  labs(x = NULL, y = NULL, title = "Cluster's Characteristic")+
  theme(axis.text.x = element_text(angle = 50, hjust = 1),
        plot.title = element_text(hjust = 0.5)) +
  white_theme

From the plot above, we can infer that:

  1. Cluster 1 : stands out in danceability, energy, loudness and valence
  2. Cluster 2 : stands out in acousticness, danceability, energy, liveness, loudness, and speechiness
  3. Cluster 3 : stands out in acousticness and loudness
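
If we want the exact averages behind the plot, the same summary pipe can be printed as a table:

# mean of every (rescaled) feature per cluster
clust %>% 
  group_by(cluster) %>% 
  summarise_all(mean)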

Time to plot our clusters.

fviz_cluster(object = km, 
             data = tracks_cleaned, labelsize = 0) + white_theme

From the cluster plot, and consistent with the characteristics above, the most popular songs are in Cluster 1.
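
We can double-check this by comparing the average popularity per cluster directly (a small sketch):

# average (rescaled) popularity per cluster, highest first
clust %>% 
  group_by(cluster) %>% 
  summarise(avg_popularity = mean(popularity)) %>% 
  arrange(desc(avg_popularity))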

PCA

pca_spotify <- prcomp(tracks_cleaned,center = T)
summary(pca_spotify)
## Importance of components:
##                           PC1    PC2    PC3     PC4     PC5     PC6     PC7
## Standard deviation     1.9120 1.3358 1.0808 0.99152 0.92525 0.85280 0.80714
## Proportion of Variance 0.3323 0.1622 0.1062 0.08937 0.07783 0.06612 0.05923
## Cumulative Proportion  0.3323 0.4946 0.6008 0.69014 0.76797 0.83408 0.89331
##                            PC8     PC9    PC10    PC11
## Standard deviation     0.67140 0.58850 0.51758 0.32960
## Proportion of Variance 0.04098 0.03148 0.02435 0.00988
## Cumulative Proportion  0.93429 0.96577 0.99012 1.00000
fviz_eig(pca_spotify, ncp = 10, addlabels = T, main = "Variance explained by each dimensions")

In this project, I want to retain at least 80% of the information (variance), so I will use PC1 to PC6, which together explain 83.4%.
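
The same cut-off can be found programmatically from the cumulative proportion of variance. A minimal sketch:

# cumulative proportion of variance explained by the PCs
cum_var <- cumsum(pca_spotify$sdev^2) / sum(pca_spotify$sdev^2)

# smallest number of PCs that retains at least 80% of the variance
which(cum_var >= 0.8)[1]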

pca_tracks <- data.frame(pca_spotify$x, cluster = factor(km$cluster), genre = tracks$genre)

pca_data <- data.frame(varnames = rownames(pca_spotify$rotation),
                       pca_spotify$rotation)
x <- "PC1"
y <- "PC2"

data <- data.frame(obsnames=seq(nrow(pca_spotify$x)), pca_spotify$x)

# scaling factor so the loading arrows fit inside the score plot
mult <- min(
  (max(data[, y]) - min(data[, y])) / (max(pca_data[, y]) - min(pca_data[, y])),
  (max(data[, x]) - min(data[, x])) / (max(pca_data[, x]) - min(pca_data[, x]))
)

pca_data <- transform(pca_data,
                    v1 = .9 * mult * (get(x)),
                    v2 = .9 * mult * (get(y)))

ggplot(pca_tracks, aes(x = PC1, y = PC2)) +
  geom_hline(aes(yintercept = 0), size = 0.2) +
  geom_vline(aes(xintercept = 0), size = 0.2) +
  coord_equal() +
  geom_point(aes(color = cluster),size = 0.2) +
  geom_segment(data = pca_data, aes( x =0, y = 0, xend = v1, yend = v2), arrow = arrow(length = unit(0.2, "cm"))) +
  geom_text_repel(data = pca_data, aes(label = str_to_title(varnames)), point.padding = -10, segment.size = 0.5) +
  scale_color_brewer(palette = "Pastel2") +
  guides(colour = guide_legend(override.aes = list(size = 3))) +
  labs(title = "3 Clusters with PCA and Factor Loading", color = "Cluster", x = NULL, y = NULL) +
  theme(plot.title = element_text(hjust = 0.5)) +
  white_theme

From the bi-plot above, we can see four main directions, forming two opposing pairs:

  1. acousticness, duration_ms and instrumentalness are negatively correlated with energy, valence, danceability and loudness
  2. liveness and speechiness are negatively correlated with tempo and popularity

From these directions we can derive each cluster’s main characteristics, which explain the Cluster’s Characteristic plot above:

  1. Cluster 1 : energy, danceability, valence, loudness, tempo and popularity
  2. Cluster 2 : acousticness, duration_ms, instrumentalness
  3. Cluster 3 : liveness and speechiness
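
If we prefer to read these directions numerically rather than from the arrows, we can print the loadings of each feature on the first two PCs:

# feature loadings (rotation) on PC1 and PC2
round(pca_spotify$rotation[, 1:2], 2)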

Here is our 3-dimensional cluster plot:

pca_tracks <- PCA(X = tracks_cleaned, scale.unit = F, graph = F)
df_pca <- data.frame(pca_tracks$ind$coord[, 1:3]) %>% bind_cols(cluster = as.factor(km$cluster)) %>% 
    select(cluster, 1:3)

plot_ly(df_pca, x = ~Dim.1, y = ~Dim.2, z = ~Dim.3, color = ~cluster, colors = c("red", 
    "blue", "green")) %>% add_markers() %>% layout(scene = list(xaxis = list(title = "Dim.1"), 
    yaxis = list(title = "Dim.2"), zaxis = list(title = "Dim.3")))

Conclusion

  1. We can classify the tracks into 3 clusters, based on the optimum k obtained with the elbow method.
  2. To retain at least 80% of the information, we need 6 PCs, so the information lost is at most 20%.

Based on the Cluster’s Characteristic plot and the Cluster plot, the most popular songs fall into Cluster 1, since Cluster 1 has strong characteristics in danceability, energy, loudness and valence, and its average popularity is the highest among the three clusters.