Introduction

In this post, we’ll try unsupervised learning by clustering the Spotify data with k-means and PCA. The dataset was provided by Kaggle, and the details of the data can be accessed here.

Import Library and Setup

library(dplyr)
library(ggplot2)
library(cowplot)
library(GGally)
library(FactoMineR)
library(scales)
library(tidyverse)
library(factoextra)
library(plotly)

Data Import

spotify <- read.csv("SpotifyFeatures.csv", stringsAsFactors = T)
rmarkdown::paged_table(spotify)

The data contains 232,725 observations and 18 variables. The variables are described below:

  • track_id : The Spotify ID for the track
  • acousticness : A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.
  • danceability : Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.
  • duration_ms : The duration of the track in milliseconds.
  • energy : Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.
  • instrumentalness: Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.
  • key : The estimated overall key of the track. Integers map to pitches using standard Pitch Class notation. E.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1.
  • liveness : Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.
  • loudness : The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing the relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB.
  • mode : Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.
  • speechiness : Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.
  • tempo : The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.
  • time_signature : An estimated overall time signature of a track. The time signature (meter) is a notational convention to specify how many beats are in each bar (or measure).
  • valence : A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).

Data Preparation

For clustering, we need to drop columns that carry little or no useful information. We’ll also tidy up the first column name.

#change the column name from i..genre to genre
colnames(spotify)[1] = "genre"

spotify1 <- spotify %>% 
  select(-c(track_id, track_name))

Check whether there are any missing values (NA):

colSums(is.na(spotify1))
##            genre      artist_name       popularity     acousticness 
##                0                0                0                0 
##     danceability      duration_ms           energy instrumentalness 
##                0                0                0                0 
##              key         liveness         loudness             mode 
##                0                0                0                0 
##      speechiness            tempo   time_signature          valence 
##                0                0                0                0

Exploratory Data Analysis

str(spotify1)
## 'data.frame':    232725 obs. of  16 variables:
##  $ genre           : Factor w/ 27 levels "A Capella","Alternative",..: 16 16 16 16 16 16 16 16 16 16 ...
##  $ artist_name     : Factor w/ 14564 levels "'Til Tuesday",..: 5283 8366 6575 5283 4140 5283 8366 7434 2465 7469 ...
##  $ popularity      : int  0 1 3 0 4 0 2 15 0 10 ...
##  $ acousticness    : num  0.611 0.246 0.952 0.703 0.95 0.749 0.344 0.939 0.00104 0.319 ...
##  $ danceability    : num  0.389 0.59 0.663 0.24 0.331 0.578 0.703 0.416 0.734 0.598 ...
##  $ duration_ms     : int  99373 137373 170267 152427 82625 160627 212293 240067 226200 152694 ...
##  $ energy          : num  0.91 0.737 0.131 0.326 0.225 0.0948 0.27 0.269 0.481 0.705 ...
##  $ instrumentalness: num  0 0 0 0 0.123 0 0 0 0.00086 0.00125 ...
##  $ key             : Factor w/ 12 levels "A","A#","B","C",..: 5 10 4 5 9 5 5 10 4 11 ...
##  $ liveness        : num  0.346 0.151 0.103 0.0985 0.202 0.107 0.105 0.113 0.0765 0.349 ...
##  $ loudness        : num  -1.83 -5.56 -13.88 -12.18 -21.15 ...
##  $ mode            : Factor w/ 2 levels "Major","Minor": 1 2 2 1 1 1 1 1 1 1 ...
##  $ speechiness     : num  0.0525 0.0868 0.0362 0.0395 0.0456 0.143 0.953 0.0286 0.046 0.0281 ...
##  $ tempo           : num  167 174 99.5 171.8 140.6 ...
##  $ time_signature  : Factor w/ 5 levels "0/4","1/4","3/4",..: 4 4 5 4 4 4 4 4 4 4 ...
##  $ valence         : num  0.814 0.816 0.368 0.227 0.39 0.358 0.533 0.274 0.765 0.718 ...

Clustering Opportunity

# Inspect the differences in the numerical variables between major and minor mode

a <- ggplot(spotify1, aes(mode, popularity, fill = mode)) + geom_boxplot(show.legend = F) + 
    theme_minimal() + labs(title = "Popularity")

b <- ggplot(spotify1, aes(mode, acousticness, fill = mode)) + geom_boxplot(show.legend = F) + 
    theme_minimal() + labs(title = "acousticness")

c <- ggplot(spotify1, aes(mode, danceability, fill = mode)) + geom_boxplot(show.legend = F) + 
    theme_minimal() + labs(title = "danceability")

d <- ggplot(spotify1, aes(mode, duration_ms, fill = mode)) + geom_boxplot(show.legend = F) + 
    theme_minimal() + labs(title = "duration_ms")

e <- ggplot(spotify1, aes(mode, energy, fill = mode)) + geom_boxplot(show.legend = F) + 
    theme_minimal() + labs(title = "energy")

f <- ggplot(spotify1, aes(mode, instrumentalness, fill = mode)) + geom_boxplot(show.legend = F) + 
    theme_minimal() + labs(title = "instrumentalness")

g <- ggplot(spotify1, aes(mode, liveness, fill = mode)) + geom_boxplot(show.legend = F) + 
    theme_minimal() + labs(title = "liveness")

h <- ggplot(spotify1, aes(mode, loudness, fill = mode)) + geom_boxplot(show.legend = F) + 
    theme_minimal() + labs(title = "loudness")

i <- ggplot(spotify1, aes(mode, speechiness, fill = mode)) + geom_boxplot(show.legend = F) + 
    theme_minimal() + labs(title = "speechiness")

j <- ggplot(spotify1, aes(mode, tempo, fill = mode)) + geom_boxplot(show.legend = F) + 
    theme_minimal() + labs(title = "tempo")

k <- ggplot(spotify1, aes(mode, valence, fill = mode)) + geom_boxplot(show.legend = F) + 
    theme_minimal() + labs(title = "valence")

plot_grid(a,b,c,d,e,f,g,h,i,j,k, ncol = 4)

From the plots above, we can see that tracks in major and minor mode look almost the same in most aspects, so mode on its own does not reveal an obvious grouping.
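
As a side note, the eleven boxplots above can be built more concisely by reshaping the data into long format and faceting. A minimal sketch, assuming spotify1 as prepared above:

spotify1 %>% 
  select(mode, popularity, acousticness, danceability, duration_ms, energy,
         instrumentalness, liveness, loudness, speechiness, tempo, valence) %>% 
  pivot_longer(-mode, names_to = "feature", values_to = "value") %>% 
  ggplot(aes(x = mode, y = value, fill = mode)) +
  geom_boxplot(show.legend = FALSE) +
  facet_wrap(~feature, scales = "free_y", ncol = 4) +
  theme_minimal()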

Possibility for Principal Component Analysis (PCA)

Let’s check the correlation between the numeric variables of the Spotify data.

ggcorr(spotify1, label = TRUE, hjust = 1, layout.exp = 2)

Some variables are highly correlated, for example energy with loudness. Based on this result, we will try to reduce the dimensionality using PCA.
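
To put a number on the strongest pair visible in the correlation plot, a quick check (a sketch, using the same spotify1 data) is:

# pairwise correlation between energy and loudness
cor(spotify1$energy, spotify1$loudness)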

Before we run PCA and k-means clustering, we’ll scale the numeric data.

spotify_scale <- scale(spotify1 %>% select_if(is.numeric))

Clustering

Finding Optimum number of K using Elbow Method

When choosing the number of clusters k for k-means, we can use business knowledge to set k. We can also use the elbow method to find the optimal k.

library(factoextra)
fviz_nbclust(spotify_scale , kmeans, method = "wss")
## Error: cannot allocate vector of size 201.8 Gb

The fviz_nbclust() function from the factoextra package exceeds the available memory on a dataset of this size, so we’ll use a small manual function to find the optimal k instead.

RNGkind(sample.kind = "Rounding")

# manual elbow method: run k-means for k = 2..maxK and plot the total
# within-cluster sum of squares against the number of clusters
kmeansTunning <- function(data, maxK) {
  withinall <- NULL
  total_k <- NULL
  for (i in 2:maxK) {
    set.seed(101)
    temp <- kmeans(data, i)$tot.withinss
    withinall <- append(withinall, temp)
    total_k <- append(total_k, i)
  }
  plot(x = total_k, y = withinall, type = "o",
       xlab = "Number of Cluster", ylab = "Total within")
}

kmeansTunning(spotify_scale, maxK = 6)

From the elbow plot, the optimal value of k is 3. We’ll try clustering with k = 3.

K-means Clustering

set.seed(101)
km <- kmeans(spotify_scale, centers = 3)
summary(km)
##              Length Class  Mode   
## cluster      232725 -none- numeric
## centers          33 -none- numeric
## totss             1 -none- numeric
## withinss          3 -none- numeric
## tot.withinss      1 -none- numeric
## betweenss         1 -none- numeric
## size              3 -none- numeric
## iter              1 -none- numeric
## ifault            1 -none- numeric
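
As a rough quality check of this clustering (a sketch based on the km object above), we can look at the cluster sizes and at how much of the total variation is explained by the separation between clusters:

# number of tracks assigned to each cluster
km$size

# between_SS / total_SS: values closer to 1 indicate better-separated clusters
km$betweenss / km$totss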

We’ll summarize the characteristics of each cluster and visualize the data.

#get only the numeric data
df_clust_num <- spotify1 %>% 
              select_if(is.numeric) %>% 
              mutate(cluster = as.factor(km$cluster))

# rescale the numeric columns to a 0-100 range, since the z-scaled values include negatives
df_clust_num <- df_clust_num %>% 
  mutate_if(is.numeric, rescale, to = c(0, 100))

#group_by and summarise mean of each cluster
spotify_profile <- df_clust_num %>% 
                    group_by(cluster) %>% 
                    summarise_all(mean)

spotify_profile
## # A tibble: 3 x 12
##   cluster popularity acousticness danceability duration_ms energy
##   <fct>        <dbl>        <dbl>        <dbl>       <dbl>  <dbl>
## 1 1             20.7         79.2         54.2        4.13   66.0
## 2 2             29.3         83.0         33.1        4.15   21.0
## 3 3             46.0         20.4         59.5        3.90   67.6
## # ... with 6 more variables: instrumentalness <dbl>, liveness <dbl>,
## #   loudness <dbl>, speechiness <dbl>, tempo <dbl>, valence <dbl>

# Visualize the cluster profiles
spotify_profile %>%
  pivot_longer(cols = -1) %>% 
  ggplot(aes(x = cluster, y = value)) +
  geom_col(aes(fill = cluster)) +
  facet_wrap(~name)

From the plot, we can see that:
  • Cluster 1 scores highest on liveness and speechiness.
  • Cluster 2 scores highest on acousticness and instrumentalness.
  • Cluster 3 scores highest on danceability, loudness, popularity, tempo, and valence.
  • Cluster 1 and cluster 3 have almost the same energy.
  • All clusters have almost the same duration_ms.
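
As an extra sanity check (a sketch assuming spotify1 and km as above; output not shown), we could also look at which genres are most common in each cluster:

# hypothetical check: the three most frequent genres within each cluster
spotify1 %>% 
  mutate(cluster = as.factor(km$cluster)) %>% 
  count(cluster, genre) %>% 
  group_by(cluster) %>% 
  top_n(3, n) %>% 
  arrange(cluster, desc(n))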

PCA

In this PCA step, we’ll perform dimensionality reduction so that the computation is not too heavy.

pca_spotify <- PCA(X = spotify_scale, scale.unit = F, graph = F)
pca_spotify$eig
##         eigenvalue percentage of variance cumulative percentage of variance
## comp 1   3.6104430              32.822350                          32.82235
## comp 2   1.7100248              15.545747                          48.36810
## comp 3   1.1712427              10.647707                          59.01580
## comp 4   0.9998305               9.089407                          68.10521
## comp 5   0.8617186               7.833839                          75.93905
## comp 6   0.7567533               6.879605                          82.81866
## comp 7   0.6378529               5.798688                          88.61734
## comp 8   0.4853764               4.412532                          93.02988
## comp 9   0.3751899               3.410832                          96.44071
## comp 10  0.2767438               2.515864                          98.95657
## comp 11  0.1147767               1.043429                         100.00000

Let’s visualize the percentage of variance captured by each dimension.

fviz_eig(pca_spotify, ncp = 11, addlabels = T, main = "Variance explained by each dimensions")

If we want to retain at least 75% of the information, we’ll use PC1 to PC5 (5 principal components), which together capture 75.93% of the variance. In other words, we reduce the dimensionality from 11 PCs to 5 PCs while keeping 75.93% of the information.
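
The same cut-off can also be read programmatically from the eigenvalue table, whose third column holds the cumulative percentage of variance. A small sketch:

# smallest number of components whose cumulative variance reaches 75%
which(pca_spotify$eig[, "cumulative percentage of variance"] >= 75)[1]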

df_pca <- data.frame(pca_spotify$ind$coord[, 1:5]) %>% bind_cols(cluster = as.factor(km$cluster)) %>% 
    select(cluster, 1:5)
rmarkdown::paged_table(df_pca)

Now that the dimensionality of our data has been reduced, we’ll check whether there are any outliers.

plot.PCA(x = pca_spotify, choix = "ind", habillage = 5, select = "contrib3")

From the plot, we can see that the top 3 outliers are the tracks with row numbers 218541, 210710, and 210094.

plot.PCA(x = pca_spotify, choix = "var")

Some information we can get from the plot:
1. The variables that contribute most to PC1 are energy, valence, danceability, loudness, acousticness, and instrumentalness.
2. The variables that contribute most to PC2 are speechiness, liveness, and popularity.
3. Acousticness possibly has a negative correlation with popularity.
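
The numbers behind the variable plot can be checked directly from the PCA object. A quick sketch:

# contribution (%) of each variable to the first two principal components
round(pca_spotify$var$contrib[, 1:2], 2)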

fviz_cluster(object = km, 
             data = spotify_scale, labelsize = 0) + theme_minimal()

We’ll add one more dimension to the cluster visualization using plotly:

df_pca1 <- data.frame(pca_spotify$ind$coord[, 1:3]) %>% bind_cols(cluster = as.factor(km$cluster)) %>% 
    select(cluster, 1:3)

plot_ly(df_pca1, x = ~Dim.1, y = ~Dim.2, z = ~Dim.3, color = ~cluster, colors = c("black", 
    "red", "green", "blue")) %>% add_markers() %>% layout(scene = list(xaxis = list(title = "Dim.1"), 
    yaxis = list(title = "Dim.2"), zaxis = list(title = "Dim.3")))

Conclusion

Based on clustering using the k-means method and PCA, we can draw some conclusions:
1. We can separate our data into 3 clusters based on the numerical features, with the optimal value of k found using the elbow method.
2. Using PCA, we can reduce the 11 numeric features to just 5 dimensions while still retaining about 75% of the variance. This dimensionality reduction can be useful if we feed the new PCA dimensions into further machine learning applications.