The analysis below applies clustering methods to Spotify songs based on their audio attributes. To determine the best clustering method, the K-means and CLARA algorithms are compared. In addition, Principal Component Analysis (PCA) is used to check how many components hold essential information.

The dataset consists of about 19 thousand songs from the Spotify database.

It was downloaded from the Kaggle dataset: https://www.kaggle.com/edalrami/19000-spotify-songs?select=song_data.csv

Every song is described by its name, a popularity score, and a set of sound attributes.

track_name : Title of the track.

popularity : Popularity score (1-100).

acousticness : A confidence measure in the range (0.0; 1.0) of whether the track is acoustic. A value of 1.0 represents high confidence that the track is acoustic.

danceability : It describes how suitable a track is for dancing, based on musical elements. A value of 0.0 is least danceable and a value of 1.0 is most danceable.

duration_ms : The duration of the track in milliseconds.

energy : It is a measure of energy which varies from 0.0 to 1.0. Energy shows a measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale.

instrumentalness : It predicts whether a track contains no vocals. In this context, "ooh" and "aah" sounds are treated as instrumental, while rap or spoken-word tracks are clearly vocal. The closer the value is to 1.0, the more likely the track has no vocal content.

key : The estimated key of the track. Integer values map to pitches using standard pitch class notation. If no key was detected, the value is -1.

liveness : It measures the presence of an audience in the recording. A higher value means a higher probability that the track was performed live.

loudness : The mean loudness of a track in decibels (dB).

mode : It represents the modality (major or minor) of the track. The value is either 0 or 1: major is represented by 1 and minor by 0.

speechiness : It detects the presence of spoken words in a track. Speech-only recordings score close to 1.0. Values between 0.33 and 0.66 describe tracks that may contain both speech and music. Values below 0.33 most likely represent music and other tracks without speech.

tempo : The average tempo of a track expressed in beats per minute (BPM).

time_signature : The estimated time signature of a track. It specifies the number of beats in each bar.

valence : It represents the musical positiveness conveyed by a track.
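The speechiness thresholds described above can be turned into labels with `cut()`. This is only an illustrative sketch: the helper name `speechiness_band` and the three label strings are assumptions made here, not part of the dataset or of the original analysis.

```r
# Hypothetical helper: bin speechiness scores using the 0.33 / 0.66 thresholds.
# Intervals are [0, 0.33], (0.33, 0.66], (0.66, 1].
speechiness_band <- function(x) {
  cut(x,
      breaks = c(0, 0.33, 0.66, 1),
      labels = c("music", "speech and music", "speech"),
      include.lowest = TRUE)
}

# Made-up example scores
bands <- speechiness_band(c(0.05, 0.40, 0.90))
print(bands)
```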

The following libraries were used for the calculations in this analysis:

library(readr)
library(corrplot)
library(RColorBrewer)
library(factoextra)
library(stats)
library(cluster)
library(dplyr)
library(tibble)
library(tidyverse)
library(FactoMineR)
library(cclust)
library(flexclust)

song_data <- read_csv("song_data.csv")

#name of variables
names(song_data)
##  [1] "song_name"        "song_popularity"  "song_duration_ms" "acousticness"    
##  [5] "danceability"     "energy"           "instrumentalness" "key"             
##  [9] "liveness"         "loudness"         "audio_mode"       "speechiness"     
## [13] "tempo"            "time_signature"   "audio_valence"

The dataset is checked for missing values:

any(is.na(song_data))
## [1] FALSE
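When a check like this returns TRUE, it usually helps to see which columns are affected. The sketch below uses a small made-up data frame (`toy` is hypothetical, not the song data) to count missing values per column.

```r
# Made-up data frame with a few missing values
toy <- data.frame(
  tempo  = c(120, NA, 98),
  energy = c(0.8, 0.5, NA),
  key    = c(1, 5, 7)
)

# Number of NA values in each column
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Single TRUE/FALSE answer for the whole data frame
has_na <- anyNA(toy)
print(has_na)
```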

After checking the column names, we drop the columns that are not sound attributes (song name and popularity) and keep them separately:

my_data <- subset(song_data, select = -c(song_popularity,song_name) )

song_types <- subset(song_data, select = c(song_popularity,song_name) )

Correlation plot

M <-cor(my_data)
corrplot(M, type="upper", order="hclust",
         col=brewer.pal(n=5, name="RdYlBu"))

cor(my_data$acousticness,my_data$energy)
## [1] -0.6626391
cor(my_data$loudness,my_data$acousticness)
## [1] -0.5577442

The correlation plot shows one pair of strongly negatively correlated variables: energy and acousticness. Loudness might also be considered negatively correlated with instrumentalness and acousticness. Two scenarios will be considered to check whether highly correlated variables distort the clustering. In the first scenario, clustering is performed with all variables. In the second, the acousticness variable is removed to avoid giving potential double weight to the same information when computing the distance between two points.

PCA will be performed to check how much information each component holds.

Acousticness is the variable removed because it also correlates with loudness.
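Strongly correlated pairs can also be listed programmatically rather than read off the plot. The following is a minimal sketch on simulated data: the data frame `toy`, the 0.5 cutoff, and the seed are assumptions chosen for illustration only.

```r
set.seed(1)  # arbitrary seed, only to make the sketch repeatable
n <- 200
energy <- rnorm(n)
# Simulated data: acousticness is built to be negatively related to energy
toy <- data.frame(
  energy       = energy,
  acousticness = -0.8 * energy + rnorm(n, sd = 0.3),
  tempo        = rnorm(n)
)

M <- cor(toy)
M[lower.tri(M, diag = TRUE)] <- NA            # keep each pair only once
idx <- which(abs(M) > 0.5, arr.ind = TRUE)    # assumed cutoff of 0.5

strong_pairs <- data.frame(
  var1 = rownames(M)[idx[, "row"]],
  var2 = colnames(M)[idx[, "col"]],
  r    = M[idx]
)
print(strong_pairs)
```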

spot_no_accousticness <- subset(song_data, select = -c(song_popularity,song_name,acousticness))
song_z_no_ac <- as.data.frame(lapply(spot_no_accousticness, scale))

The data are standardized so that all variables carry equal weight in the distance calculation.

song_z <- as.data.frame(lapply(my_data, scale))
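For reference, `scale()` with its defaults subtracts the column mean and divides by the sample standard deviation. The short check below uses made-up duration values (not taken from the dataset) to show the equivalence.

```r
# Made-up durations in milliseconds
x <- c(200000, 180000, 240000, 210000)

z_manual <- (x - mean(x)) / sd(x)   # z-score computed by hand
z_scale  <- as.numeric(scale(x))    # same thing via scale()

print(all.equal(z_manual, z_scale))
```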

PCA is calculated to check which components explain most of the variance in the variables.

pca = princomp(song_z, cor=TRUE)

summary(pca)
## Importance of components:
##                           Comp.1    Comp.2     Comp.3     Comp.4     Comp.5
## Standard deviation     1.6638255 1.1960524 1.09193440 1.04160411 0.99974913
## Proportion of Variance 0.2129473 0.1100417 0.09171698 0.08345686 0.07688449
## Cumulative Proportion  0.2129473 0.3229890 0.41470595 0.49816281 0.57504729
##                            Comp.6     Comp.7     Comp.8    Comp.9    Comp.10
## Standard deviation     0.98082359 0.95910303 0.93762132 0.9270671 0.87736569
## Proportion of Variance 0.07400115 0.07075989 0.06762567 0.0661118 0.05921312
## Cumulative Proportion  0.64904844 0.71980833 0.78743401 0.8535458 0.91275893
##                           Comp.11    Comp.12    Comp.13
## Standard deviation     0.76194185 0.62863562 0.39798972
## Proportion of Variance 0.04465811 0.03039867 0.01218429
## Cumulative Proportion  0.95741703 0.98781571 1.00000000

The summary above shows how much of the total variance each principal component explains. Removing the 4 components with the lowest proportion of variance would lose about 15% of the information (the first 9 components retain 85.4%), while removing only the last 3 would lose about 9%.
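The proportions in the summary can be recomputed from the standard deviations, which makes it easy to test different retention thresholds. The sketch below copies the standard deviations printed above; the 85% threshold is an assumption chosen for illustration.

```r
# Standard deviations copied from the PCA summary above
sdev <- c(1.6638255, 1.1960524, 1.09193440, 1.04160411, 0.99974913,
          0.98082359, 0.95910303, 0.93762132, 0.9270671, 0.87736569,
          0.76194185, 0.62863562, 0.39798972)

prop_var <- sdev^2 / sum(sdev^2)   # proportion of variance per component
cum_var  <- cumsum(prop_var)       # cumulative proportion

# Smallest number of components reaching an assumed 85% threshold
n_keep <- which(cum_var >= 0.85)[1]

# Share of variance carried by the last four components
lost_last4 <- sum(tail(prop_var, 4))

print(n_keep)
print(round(lost_last4, 3))
```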

The first scenario runs K-means with 5, 3, and 2 clusters, with and without the acousticness variable. For each cluster, the average Silhouette width is calculated to describe how compact and well separated the clusters are.

5 clusters without the acousticness variable:

song_attr_clusters_5_no_ac <- kmeans(song_z_no_ac, 5)

sil_1<-silhouette(song_attr_clusters_5_no_ac$cluster, dist(song_z_no_ac))
fviz_silhouette(sil_1)
##   cluster size ave.sil.width
## 1       1 1347          0.04
## 2       2 4810          0.15
## 3       3 3410          0.06
## 4       4 2353          0.04
## 5       5 6915          0.15

5 clusters with all variables:

song_attr_clusters_5 <- kmeans(song_z, 5)

sil_2<-silhouette(song_attr_clusters_5$cluster, dist(song_z))
fviz_silhouette(sil_2)
##   cluster size ave.sil.width
## 1       1 3368          0.10
## 2       2 2543          0.05
## 3       3 7038          0.17
## 4       4 4843          0.04
## 5       5 1043          0.06

The next step is to run K-means with 3 clusters to see whether the Silhouette score improves.

The first attempt uses 3 clusters with acousticness removed:

spot_km_no_ac_3 <- kmeans(x = song_z_no_ac,centers = 3)

fviz_cluster(spot_km_no_ac_3, data = song_z_no_ac)

song_attr_clusters_no_ac_3 <- kmeans(song_z_no_ac, 3)


sil<-silhouette(song_attr_clusters_no_ac_3$cluster, dist(song_z_no_ac))
fviz_silhouette(sil)
##   cluster size ave.sil.width
## 1       1 8646          0.13
## 2       2 6755          0.07
## 3       3 3434          0.03

The second attempt uses K-means with 3 clusters and all variables:

spot_km_3 <- kmeans(x = song_z,centers = 3)

fviz_cluster(spot_km_3, data = song_z)

song_attr_clusters_3 <- kmeans(song_z, 3)


sil<-silhouette(song_attr_clusters_3$cluster, dist(song_z))
fviz_silhouette(sil)
##   cluster size ave.sil.width
## 1       1 3440          0.04
## 2       2 6820          0.07
## 3       3 8575          0.12

K-means with all variables and 3 clusters has the highest Silhouette score.
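Because `kmeans()` starts from random centers, cluster labels and sizes can change between runs. One way to make such a comparison repeatable is to fix a seed, use several random starts, and compute the average Silhouette width for each k in a loop. The sketch below does this on simulated data; the seed, `nstart = 25`, and the toy matrix are assumptions, not settings from the original analysis.

```r
library(cluster)  # silhouette()

set.seed(123)  # arbitrary seed, only to make the sketch repeatable

# Simulated data: two well-separated groups in two dimensions
toy <- rbind(matrix(rnorm(100, mean = -2), ncol = 2),
             matrix(rnorm(100, mean =  2), ncol = 2))
d <- dist(toy)

# Average silhouette width for k = 2..5
avg_sil <- sapply(2:5, function(k) {
  km <- kmeans(toy, centers = k, nstart = 25)
  mean(silhouette(km$cluster, d)[, "sil_width"])
})
names(avg_sil) <- 2:5

best_k <- as.integer(names(which.max(avg_sil)))
print(round(avg_sil, 3))
print(best_k)
```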

The CLARA (Clustering Large Applications) algorithm is used because it scales to large samples. As a medoid-based method, it is also less sensitive to outliers than K-means. We consider 3 and 2 clusters in the following analysis.

The first case: CLARA with 3 clusters and all variables:

clara_1 <- clara(song_z, 3, metric = "euclidean", stand = FALSE, 
                samples = 5, pamLike = TRUE)
plot(clara_1, ask = FALSE, main = "CLARA and Silhouette plot")

The second case: CLARA with 3 clusters and without acousticness variable:

clarax_2 <- clara(song_z_no_ac, 3, metric = "euclidean", stand = FALSE, 
                samples = 5, pamLike = TRUE)

plot(clarax_2, ask = FALSE, main = "CLARA and Silhouette plot")

The third case: CLARA with 2 clusters and all variables:

clara_3 <- clara(song_z, 2, metric = "euclidean", stand = FALSE, 
                samples = 5, pamLike = TRUE)
plot(clara_3, ask = FALSE, main = "CLARA and Silhouette plot")

The fourth case: CLARA with 2 clusters and without acousticness variable:

clara_4 <- clara(song_z_no_ac, 2, metric = "euclidean", stand = FALSE, 
                samples = 5, pamLike = TRUE)

plot(clara_4, ask = FALSE, main = "CLARA and Silhouette plot")
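The average Silhouette width shown in these plots can also be read directly from the fitted object through `silinfo$avg.width`. Note that `clara()` computes it on its best sample rather than on the full data. The sketch below compares k = 2 and k = 3 on simulated data; the toy matrix and the seed are assumptions for illustration.

```r
library(cluster)  # clara()

set.seed(42)  # arbitrary seed, only to make the sketch repeatable

# Simulated data: two groups of 200 points in two dimensions
toy <- rbind(matrix(rnorm(400, mean = -2), ncol = 2),
             matrix(rnorm(400, mean =  2), ncol = 2))

# Average silhouette width (on CLARA's best sample) for k = 2 and k = 3
widths <- sapply(2:3, function(k) {
  fit <- clara(toy, k, metric = "euclidean", stand = FALSE,
               samples = 5, pamLike = TRUE)
  fit$silinfo$avg.width
})
names(widths) <- c("k=2", "k=3")
print(round(widths, 3))
```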

With CLARA and 2 clusters, the average Silhouette width is the largest (0.29). Silhouette widths close to 1 indicate dense, well-separated clusters. A score of -1 indicates the worst possible clustering. Values near 0 indicate clusters that overlap. The average Silhouette width is positive, so on average observations lie closer to their own cluster than to the neighbouring one.
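To make the interpretation concrete: for observation i, the Silhouette width is (b - a) / max(a, b), where a is the mean distance to the other members of its own cluster and b is the smallest mean distance to the members of any other cluster. The sketch below computes it by hand in base R on six made-up points; the data and labels are illustrative assumptions.

```r
# Six made-up one-dimensional points in two obvious groups
x  <- c(1, 2, 3, 10, 11, 12)
cl <- c(1, 1, 1, 2, 2, 2)
d  <- as.matrix(dist(x))

sil_width <- sapply(seq_along(x), function(i) {
  own <- cl == cl[i]
  own[i] <- FALSE                      # exclude the point itself
  a <- mean(d[i, own])                 # mean distance within own cluster
  others <- setdiff(unique(cl), cl[i])
  b <- min(sapply(others, function(k) mean(d[i, cl == k])))  # nearest other cluster
  (b - a) / max(a, b)
})

print(round(sil_width, 3))
print(mean(sil_width))
```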

CLARA 2 Clusters Medoids summary

clara_3$medoids
##      song_duration_ms acousticness danceability    energy instrumentalness
## [1,]      0.005450423    -0.434814    0.5465190  0.425058       -0.3342562
## [2,]     -0.792795749     1.300438    0.5784224 -1.667415       -0.3519960
##              key   liveness   loudness audio_mode speechiness       tempo
## [1,] -0.08000777  0.1482781  0.2294863  0.7693976   0.5163984 -0.07205967
## [2,]  0.47330463 -0.2475988 -1.6786439  0.7693976  -0.6945792 -0.40544576
##      time_signature audio_valence
## [1,]      0.1369408     0.3435087
## [2,]      0.1369408    -0.5108368

Both clusters have distinctive attributes. The medoid of the first cluster has below-average acousticness and above-average energy, liveness, and loudness. Its speechiness is also above average.

The medoid of the second cluster has a short duration and low energy, liveness, and loudness. Compared with the medoid of the first cluster, it has high acousticness and a higher key value.
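The comparison can be made more systematic by ranking the variables by the gap between the two medoids. The sketch below copies the standardized medoid values printed above; the ranking by absolute difference is an assumption about how to summarize them, not a step from the original analysis.

```r
# Medoid values copied from the clara_3$medoids output above
medoids <- rbind(
  c(0.005450423, -0.434814, 0.5465190, 0.425058, -0.3342562, -0.08000777,
    0.1482781, 0.2294863, 0.7693976, 0.5163984, -0.07205967, 0.1369408,
    0.3435087),
  c(-0.792795749, 1.300438, 0.5784224, -1.667415, -0.3519960, 0.47330463,
    -0.2475988, -1.6786439, 0.7693976, -0.6945792, -0.40544576, 0.1369408,
    -0.5108368)
)
colnames(medoids) <- c("song_duration_ms", "acousticness", "danceability",
                       "energy", "instrumentalness", "key", "liveness",
                       "loudness", "audio_mode", "speechiness", "tempo",
                       "time_signature", "audio_valence")

# Difference (cluster 1 minus cluster 2), largest absolute gap first
gap <- medoids[1, ] - medoids[2, ]
gap_sorted <- gap[order(abs(gap), decreasing = TRUE)]
print(round(head(gap_sorted, 5), 2))
```

In standardized units, the two medoids differ most on energy, loudness, and acousticness, while audio_mode and time_signature are identical.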