Introduction

Spotify is an online music entertainment platform that has transformed the way we listen to and enjoy music. Launched in 2008, Spotify has become one of the most popular music streaming services worldwide. The dataset used for clustering analysis using the K-Means method on the Spotify platform consists of a large amount of music data that includes various attributes from different songs. Clustering analysis aims to group songs based on specific attribute similarities, thus enabling a better understanding of listener preferences, music trends, and the potential for collaboration between artists or genres. By employing the K-Means method, the Spotify dataset can be divided into interconnected clusters based on relevant attributes, providing deeper insights into the dynamics of music and listener preferences.

Import Library

The following libraries support data manipulation, visualization, and data preparation for the K-Means clustering analysis on the Spotify dataset.

library(dplyr)
library(GGally)
library(inspectdf)
library(ggiraphExtra)
library(factoextra)
library(tidyr)

dplyr: This package provides a set of functions for data manipulation and transformation. It’s used for tasks like filtering, selecting, grouping, and summarizing data.

GGally: This package extends the plotting capabilities of the popular ggplot2 package. It provides additional functions to create various types of visualizations like scatter plot matrices, correlation plots, and more.

inspectdf: This package is designed to help you explore and understand your data. It provides functions to quickly summarize, visualize, and inspect the data’s basic statistics and properties.

ggiraphExtra: This package builds on ggplot2 and ggiraph to provide additional plot types, such as the radar chart created with ggRadar() later in this analysis, with optional interactive output.

factoextra: This package contains various functions to extract and visualize the results of multivariate data analysis, such as principal component analysis (PCA), clustering, and more.

tidyr: This package is used for data tidying tasks. It helps to reorganize and reshape data into a tidy format, where each column represents a variable and each row an observation.

Exploratory Data Analysis

In the exploratory data analysis phase, we take a first detailed look at the Spotify dataset to uncover valuable initial insights.

Reading Dataset

In the ‘Reading Dataset’ stage, the first step is to read the dataset from the ‘SpotifyFeatures.csv’ file and store it in the ‘spotify’ variable.

spotify <- read.csv("data_input/SpotifyFeatures.csv")
head(spotify)

Next, we conduct an initial exploration of the Spotify dataset using the glimpse() function from the ‘dplyr’ package. This function gives a brief overview of the columns, their types, and their first few values.

glimpse(spotify)

#> Rows: 232,725
#> Columns: 18
#> $ genre            <chr> "Movie", "Movie", "Movie", "Movie", "Movie", "Movie",…
#> $ artist_name      <chr> "Henri Salvador", "Martin & les fées", "Joseph Willia…
#> $ track_name       <chr> "C'est beau de faire un Show", "Perdu d'avance (par G…
#> $ track_id         <chr> "0BRjO6ga9RKCKjfDqeFgWV", "0BjC1NfoEOOusryehmNudP", "…
#> $ popularity       <int> 0, 1, 3, 0, 4, 0, 2, 15, 0, 10, 0, 2, 4, 3, 0, 0, 0, …
#> $ acousticness     <dbl> 0.61100, 0.24600, 0.95200, 0.70300, 0.95000, 0.74900,…
#> $ danceability     <dbl> 0.389, 0.590, 0.663, 0.240, 0.331, 0.578, 0.703, 0.41…
#> $ duration_ms      <int> 99373, 137373, 170267, 152427, 82625, 160627, 212293,…
#> $ energy           <dbl> 0.9100, 0.7370, 0.1310, 0.3260, 0.2250, 0.0948, 0.270…
#> $ instrumentalness <dbl> 0.00000000, 0.00000000, 0.00000000, 0.00000000, 0.123…
#> $ key              <chr> "C#", "F#", "C", "C#", "F", "C#", "C#", "F#", "C", "G…
#> $ liveness         <dbl> 0.3460, 0.1510, 0.1030, 0.0985, 0.2020, 0.1070, 0.105…
#> $ loudness         <dbl> -1.828, -5.559, -13.879, -12.178, -21.150, -14.970, -…
#> $ mode             <chr> "Major", "Minor", "Minor", "Major", "Major", "Major",…
#> $ speechiness      <dbl> 0.0525, 0.0868, 0.0362, 0.0395, 0.0456, 0.1430, 0.953…
#> $ tempo            <dbl> 166.969, 174.003, 99.488, 171.758, 140.576, 87.479, 8…
#> $ time_signature   <chr> "4/4", "4/4", "5/4", "4/4", "4/4", "4/4", "4/4", "4/4…
#> $ valence          <dbl> 0.8140, 0.8160, 0.3680, 0.2270, 0.3900, 0.3580, 0.533…

Here is a brief explanation of each column in the dataset:

genre: This column contains the music genre category for each song, such as “Movie”, “Pop”, “Rock”, and so on.

artist_name: This column contains the name of the artist who created a specific song.

track_name: This column contains the names of the songs.

track_id: This column contains a unique ID for each song.

popularity: This column indicates the popularity level of a song, measured in numbers.

acousticness: This attribute indicates the extent to which a song has acoustic elements in its production.

danceability: This attribute measures how suitable a song is for dancing, with higher values indicating songs that are more rhythmically suitable.

duration_ms: This is the duration of the song in milliseconds (ms).

energy: This attribute describes the energy level in a song, with higher values indicating more energetic songs.

instrumentalness: This attribute indicates the likelihood that a song contains no vocals, with higher values indicating more instrumental songs.

key: This attribute indicates the basic musical key of the song (e.g., “C#” or “F”).

liveness: This attribute indicates the likelihood that the song was recorded in front of a live audience.

loudness: This is the relative loudness level of the song in decibels (dB).

mode: This attribute indicates whether the song is in major or minor mode.

speechiness: This attribute measures the presence of spoken words in the song, with higher values indicating more speech-like recordings.

tempo: This attribute represents the tempo or speed of the music in beats per minute (BPM).

time_signature: This attribute indicates the music meter, such as “4/4” for a 4/4 time signature.

valence: This attribute describes the level of positivity in the song, with higher values indicating more cheerful or positive songs.

With an understanding of each of these columns, we can begin to explore and analyze the Spotify dataset further for the purpose of cluster analysis.

Checking Unique Values

We will take a closer look at the unique values in several key columns of the Spotify dataset.

# Use sapply() to count the number of unique values in each column
sapply(spotify, function(col) length(unique(col)))

#>            genre      artist_name       track_name         track_id 
#>               27            14564           148615           176774 
#>       popularity     acousticness     danceability      duration_ms 
#>              101             4734             1295            70749 
#>           energy instrumentalness              key         liveness 
#>             2517             5400               12             1732 
#>         loudness             mode      speechiness            tempo 
#>            27923                2             1641            78512 
#>   time_signature          valence 
#>                5             1692

Check for Missing Values

Let’s perform an examination of the missing values within the dataset.

colSums(is.na(spotify))/nrow(spotify)*100

#>            genre      artist_name       track_name         track_id 
#>                0                0                0                0 
#>       popularity     acousticness     danceability      duration_ms 
#>                0                0                0                0 
#>           energy instrumentalness              key         liveness 
#>                0                0                0                0 
#>         loudness             mode      speechiness            tempo 
#>                0                0                0                0 
#>   time_signature          valence 
#>                0                0

In this dataset, there are no missing values in any column, indicating that the data is sufficiently complete to proceed to the next stage of analysis.

Data Wrangling

Data Wrangling is a crucial stage in this analysis, where we will clean, transform the format, and organize the data from the Spotify dataset.

Firstly, we will remove the ‘track_id’ column as this attribute does not provide significant contribution in cluster analysis. In the K-Means method, the use of unique identifiers like ‘track_id’ may not be relevant, and focusing on more descriptive music attributes would be more beneficial in forming meaningful clusters.

Next, we will perform data type conversion for the ‘genre’, ‘key’, ‘mode’, and ‘time_signature’ columns into factors.

spotify_clean <- spotify %>%
  select(-c(track_id)) %>%
  mutate_at(vars(genre, key, mode, time_signature), as.factor)
glimpse(spotify_clean)

#> Rows: 232,725
#> Columns: 17
#> $ genre            <fct> Movie, Movie, Movie, Movie, Movie, Movie, Movie, Movi…
#> $ artist_name      <chr> "Henri Salvador", "Martin & les fées", "Joseph Willia…
#> $ track_name       <chr> "C'est beau de faire un Show", "Perdu d'avance (par G…
#> $ popularity       <int> 0, 1, 3, 0, 4, 0, 2, 15, 0, 10, 0, 2, 4, 3, 0, 0, 0, …
#> $ acousticness     <dbl> 0.61100, 0.24600, 0.95200, 0.70300, 0.95000, 0.74900,…
#> $ danceability     <dbl> 0.389, 0.590, 0.663, 0.240, 0.331, 0.578, 0.703, 0.41…
#> $ duration_ms      <int> 99373, 137373, 170267, 152427, 82625, 160627, 212293,…
#> $ energy           <dbl> 0.9100, 0.7370, 0.1310, 0.3260, 0.2250, 0.0948, 0.270…
#> $ instrumentalness <dbl> 0.00000000, 0.00000000, 0.00000000, 0.00000000, 0.123…
#> $ key              <fct> C#, F#, C, C#, F, C#, C#, F#, C, G, E, C, F#, D#, G, …
#> $ liveness         <dbl> 0.3460, 0.1510, 0.1030, 0.0985, 0.2020, 0.1070, 0.105…
#> $ loudness         <dbl> -1.828, -5.559, -13.879, -12.178, -21.150, -14.970, -…
#> $ mode             <fct> Major, Minor, Minor, Major, Major, Major, Major, Majo…
#> $ speechiness      <dbl> 0.0525, 0.0868, 0.0362, 0.0395, 0.0456, 0.1430, 0.953…
#> $ tempo            <dbl> 166.969, 174.003, 99.488, 171.758, 140.576, 87.479, 8…
#> $ time_signature   <fct> 4/4, 4/4, 5/4, 4/4, 4/4, 4/4, 4/4, 4/4, 4/4, 4/4, 4/4…
#> $ valence          <dbl> 0.8140, 0.8160, 0.3680, 0.2270, 0.3900, 0.3580, 0.533…

This code snippet is cleaning and preparing the dataset by removing the track_id column and converting specific columns to the factor data type, which is an essential step in data preprocessing for further analysis.

Select Numeric Columns

In K-Means analysis, the selection of numeric columns is performed because this method requires numerical data to calculate distances between data points. Columns such as ‘popularity’, ‘acousticness’, ‘danceability’, ‘duration_ms’, ‘energy’, ‘instrumentalness’, ‘liveness’, ‘loudness’, ‘speechiness’, ‘tempo’, and ‘valence’ provide information about music attributes that can be quantitatively analyzed. By using these attributes, K-Means can group songs based on similarity in specific musical aspects, allowing us to gain deeper insights into patterns within the data.

spotify_num <- spotify_clean %>%
  select_if(is.numeric)
glimpse(spotify_num)

#> Rows: 232,725
#> Columns: 11
#> $ popularity       <int> 0, 1, 3, 0, 4, 0, 2, 15, 0, 10, 0, 2, 4, 3, 0, 0, 0, …
#> $ acousticness     <dbl> 0.61100, 0.24600, 0.95200, 0.70300, 0.95000, 0.74900,…
#> $ danceability     <dbl> 0.389, 0.590, 0.663, 0.240, 0.331, 0.578, 0.703, 0.41…
#> $ duration_ms      <int> 99373, 137373, 170267, 152427, 82625, 160627, 212293,…
#> $ energy           <dbl> 0.9100, 0.7370, 0.1310, 0.3260, 0.2250, 0.0948, 0.270…
#> $ instrumentalness <dbl> 0.00000000, 0.00000000, 0.00000000, 0.00000000, 0.123…
#> $ liveness         <dbl> 0.3460, 0.1510, 0.1030, 0.0985, 0.2020, 0.1070, 0.105…
#> $ loudness         <dbl> -1.828, -5.559, -13.879, -12.178, -21.150, -14.970, -…
#> $ speechiness      <dbl> 0.0525, 0.0868, 0.0362, 0.0395, 0.0456, 0.1430, 0.953…
#> $ tempo            <dbl> 166.969, 174.003, 99.488, 171.758, 140.576, 87.479, 8…
#> $ valence          <dbl> 0.8140, 0.8160, 0.3680, 0.2270, 0.3900, 0.3580, 0.533…

Check for Data Distribution

Before proceeding with further analysis, a crucial step in data exploration is to examine the data distribution. Understanding how the data is spread is a vital initial step, as it can provide insights into the characteristics and patterns that may exist in the dataset.

summary(spotify_num)

#>    popularity      acousticness     danceability     duration_ms     
#>  Min.   :  0.00   Min.   :0.0000   Min.   :0.0569   Min.   :  15387  
#>  1st Qu.: 29.00   1st Qu.:0.0376   1st Qu.:0.4350   1st Qu.: 182857  
#>  Median : 43.00   Median :0.2320   Median :0.5710   Median : 220427  
#>  Mean   : 41.13   Mean   :0.3686   Mean   :0.5544   Mean   : 235122  
#>  3rd Qu.: 55.00   3rd Qu.:0.7220   3rd Qu.:0.6920   3rd Qu.: 265768  
#>  Max.   :100.00   Max.   :0.9960   Max.   :0.9890   Max.   :5552917  
#>      energy          instrumentalness       liveness          loudness      
#>  Min.   :0.0000203   Min.   :0.0000000   Min.   :0.00967   Min.   :-52.457  
#>  1st Qu.:0.3850000   1st Qu.:0.0000000   1st Qu.:0.09740   1st Qu.:-11.771  
#>  Median :0.6050000   Median :0.0000443   Median :0.12800   Median : -7.762  
#>  Mean   :0.5709577   Mean   :0.1483012   Mean   :0.21501   Mean   : -9.570  
#>  3rd Qu.:0.7870000   3rd Qu.:0.0358000   3rd Qu.:0.26400   3rd Qu.: -5.501  
#>  Max.   :0.9990000   Max.   :0.9990000   Max.   :1.00000   Max.   :  3.744  
#>   speechiness         tempo           valence      
#>  Min.   :0.0222   Min.   : 30.38   Min.   :0.0000  
#>  1st Qu.:0.0367   1st Qu.: 92.96   1st Qu.:0.2370  
#>  Median :0.0501   Median :115.78   Median :0.4440  
#>  Mean   :0.1208   Mean   :117.67   Mean   :0.4549  
#>  3rd Qu.:0.1050   3rd Qu.:139.05   3rd Qu.:0.6600  
#>  Max.   :0.9670   Max.   :242.90   Max.   :1.0000

We can see that the variables have very different ranges of values. For example, ‘duration_ms’ runs from 15,387 to 5,552,917, while ‘acousticness’ and ‘instrumentalness’ stay between 0 and 1. Because K-Means relies on distances between data points, a variable with a large range would dominate the result. Scaling addresses these differences in magnitude, ensuring that each variable has a balanced impact on the analysis we conduct.

Check Covariance

Covariance measures the degree to which two variables change together. A positive covariance suggests that the two variables tend to increase or decrease together, while a negative covariance suggests an inverse relationship.

cov(spotify_num)

#>                    popularity  acousticness    danceability      duration_ms
#> popularity        330.8741927  -2.460579486     0.866213977        5079.7963
#> acousticness       -2.4605795   0.125860362    -0.024004549         472.7050
#> danceability        0.8662140  -0.024004549     0.034450413       -2776.6700
#> duration_ms      5079.7962942 472.705010446 -2776.670015161 14145750520.8082
#> energy              1.1928936  -0.067816440     0.015931805        -957.2578
#> instrumentalness   -1.1619558   0.033958915    -0.020508345        2737.5056
#> liveness           -0.6058861   0.004853762    -0.001534008         560.8353
#> loudness           39.6070158  -1.468729115     0.488376611      -33970.6429
#> speechiness        -0.5098157   0.009933929     0.004633400        -356.8187
#> tempo              45.5478771  -2.611654392     0.125822789     -104575.8610
#> valence             0.2841956  -0.030059094     0.026411285       -4386.3797
#>                          energy instrumentalness      liveness         loudness
#> popularity          1.192893559     -1.161955848  -0.605886064     39.607015849
#> acousticness       -0.067816440      0.033958915   0.004853762     -1.468729115
#> danceability        0.015931805     -0.020508345  -0.001534008      0.488376611
#> duration_ms      -957.257831674   2737.505647036 560.835264134 -33970.642881913
#> energy              0.069408833     -0.030227884   0.010071149      1.289631260
#> instrumentalness   -0.030227884      0.091668682  -0.008055978     -0.919510999
#> liveness            0.010071149     -0.008055978   0.039312018      0.054333071
#> loudness            1.289631260     -0.919510999   0.054333071     35.978446810
#> speechiness         0.007092851     -0.009950208   0.018764818     -0.002529084
#> tempo               1.862333254     -0.974186536  -0.314619135     42.324447453
#> valence             0.029925682     -0.024214148   0.000608679      0.623816414
#>                     speechiness           tempo         valence
#> popularity         -0.509815661      45.5478771     0.284195566
#> acousticness        0.009933929      -2.6116544    -0.030059094
#> danceability        0.004633400       0.1258228     0.026411285
#> duration_ms      -356.818722591 -104575.8610139 -4386.379669824
#> energy              0.007092851       1.8623333     0.029925682
#> instrumentalness   -0.009950208      -0.9741865    -0.024214148
#> liveness            0.018764818      -0.3146191     0.000608679
#> loudness           -0.002529084      42.3244475     0.623816414
#> speechiness         0.034417042      -0.4674159     0.001150285
#> tempo              -0.467415886     954.7424276     1.083677477
#> valence             0.001150285       1.0836775     0.067634056

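The positive covariance between ‘popularity’ and ‘energy’ (about 1.19) suggests that more popular songs tend to have higher energy levels.
The negative covariance between ‘acousticness’ and ‘loudness’ (about -1.47) indicates that songs with higher acousticness tend to be quieter.
The negative covariance between ‘duration_ms’ and ‘tempo’ suggests that longer songs tend to have a slightly lower tempo. Its large magnitude mostly reflects the millisecond unit of ‘duration_ms’, since covariance depends on the scale of both variables and does not, on its own, measure the strength of a relationship.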
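Because covariance carries the units of both variables, it can help to see how it relates to the unit-free correlation. The following is a minimal sketch on a small made-up data frame (the values are hypothetical and not taken from the Spotify dataset), showing that rescaling a variable changes its covariance but not its correlation.

```r
# Toy data: hypothetical values, not taken from SpotifyFeatures.csv
toy <- data.frame(
  duration_ms = c(180000, 210000, 240000, 200000, 300000),
  tempo       = c(128, 120, 96, 110, 90),
  energy      = c(0.81, 0.74, 0.40, 0.66, 0.35)
)

# Covariance depends on the units of each variable
cov_ms <- cov(toy$duration_ms, toy$tempo)

# Expressing duration in seconds shrinks the covariance by a factor of 1000
cov_s <- cov(toy$duration_ms / 1000, toy$tempo)

# Correlation divides the covariance by both standard deviations,
# so it is unit-free and always lies between -1 and 1
cor_manual  <- cov_ms / (sd(toy$duration_ms) * sd(toy$tempo))
cor_builtin <- cor(toy$duration_ms, toy$tempo)

# cov2cor() converts a whole covariance matrix in one step
cor_matrix <- cov2cor(cov(toy))
cor_matrix
```

On the real data, cov2cor(cov(spotify_num)) or cor(spotify_num) would give a matrix that is easier to compare across attribute pairs than the raw covariances.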

Scaling

By applying the scaling method, we can generate an equivalent data representation, making the cluster analysis more objective, and allowing information from all attributes to contribute in a balanced manner to the cluster formation.

spotify_scaled <- scale(spotify_num)
plot(prcomp(spotify_scaled))

Plotting the result of prcomp() displays the variance captured by each principal component of the scaled data. Because every variable now has a mean of 0 and a standard deviation of 1, no single attribute dominates simply because of its unit or range. Next, we can combine the scaling results with the previous dataset using the following code:

spotify_final <- spotify_clean %>%
  select_if(~!is.numeric(.)) %>%
  cbind(spotify_scaled)
spotify_final

The dataset has been successfully updated by merging the scaling results of numeric variables with the original dataset.
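To make the effect of scale() concrete, here is a small sketch on made-up values (hypothetical, not from the dataset). It shows that scale() computes z-scores, that is, it subtracts each column's mean and divides by its standard deviation.

```r
# Toy data: hypothetical values on very different scales
toy <- data.frame(
  duration_ms  = c(180000, 210000, 240000, 200000, 300000),
  acousticness = c(0.05, 0.60, 0.95, 0.20, 0.70)
)

# scale() centers each column on its mean and divides by its standard deviation
toy_scaled <- scale(toy)

# The same z-scores computed by hand for one column
z_manual <- (toy$duration_ms - mean(toy$duration_ms)) / sd(toy$duration_ms)

# After scaling, every column has mean 0 and standard deviation 1
round(colMeans(toy_scaled), 10)
apply(toy_scaled, 2, sd)

# The original means and standard deviations are kept as attributes,
# which is useful for mapping centroids back to the original units
attr(toy_scaled, "scaled:center")
attr(toy_scaled, "scaled:scale")
```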

Modeling

In the case of cluster analysis on the Spotify dataset, especially using the K-Means method, the modeling process involves grouping data into clusters determined by the K-Means algorithm. This algorithm seeks cluster centers (centroids) based on the distances between data points, where each data point is assigned to the cluster with the closest centroid. The steps of modeling with K-Means include initializing cluster centroids, calculating distances, cluster assignment, centroid updating, and iterating until convergence.

# k-means with 3 clusters
RNGkind(sample.kind = "Rounding")
set.seed(100)

spotify_km <- kmeans(x = spotify_scaled,
                    centers = 3)
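The steps listed above (initialization, distance calculation, assignment, centroid update, iteration) can be sketched in a few lines of base R. This is an illustrative Lloyd-style implementation on simulated data, not the code that kmeans() runs internally (kmeans() uses the Hartigan-Wong algorithm by default). The helper name simple_kmeans() is hypothetical.

```r
# Minimal Lloyd-style K-Means for illustration only.
# simple_kmeans() is a hypothetical helper, not part of any package.
simple_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  # Initialise the centroids with k randomly chosen rows
  centers <- x[sample(nrow(x), k), , drop = FALSE]
  cluster <- integer(nrow(x))

  for (iter in seq_len(max_iter)) {
    # Squared Euclidean distance from every point to every centroid
    d <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centers[j, ])^2))
    # Assign each point to its nearest centroid
    new_cluster <- max.col(-d, ties.method = "first")
    # Converged: the assignments did not change since the last iteration
    if (all(new_cluster == cluster)) break
    cluster <- new_cluster
    # Move each non-empty cluster's centroid to the mean of its points
    for (j in seq_len(k)) {
      if (any(cluster == j)) {
        centers[j, ] <- colMeans(x[cluster == j, , drop = FALSE])
      }
    }
  }
  list(cluster = cluster, centers = centers, iter = iter)
}

set.seed(100)
# Two toy groups of 20 simulated points each
toy <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
             matrix(rnorm(40, mean = 10), ncol = 2))
fit <- simple_kmeans(toy, k = 2)
table(fit$cluster)
```

As with kmeans(), the result depends on the random starting centroids, which is why the analysis fixes a seed and why the nstart argument of kmeans() is often used to try several starts.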

One commonly used method of evaluation is the elbow method. This method involves plotting the inertia values against the number of clusters used in the analysis. Inertia is the total within-cluster sum of squares (WSS, reported by kmeans() as tot.withinss) and measures how far the data points are spread around their cluster centroids. Inertia generally keeps shrinking as clusters are added, so rather than looking for the smallest value, we look for the point in the plot where the decrease becomes markedly slower, resembling an elbow. Beyond this point, adding more clusters no longer reduces inertia significantly, and that is the recommended number of clusters to choose.

# Define the range of clusters you want to consider
num_clusters <- 2:10

# Calculate WSS for each number of clusters
wss <- numeric(length(num_clusters))
for (i in seq_along(num_clusters)) {
  k <- num_clusters[i]
  kmeans_model <- kmeans(spotify_scaled, centers = k, nstart = 10)
  wss[i] <- kmeans_model$tot.withinss
}

# Plot the WSS values against the number of clusters
plot(num_clusters, wss, type = "b", pch = 19, frame = FALSE,
     xlab = "Number of Clusters", ylab = "Within-Cluster Sum of Squares")

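# Mark a heuristic elbow point: the k where the WSS curve bends most sharply,
# i.e. where the second difference of WSS is largest (a rough guide, not a rule)
elbow_point <- which.max(diff(wss, differences = 2)) + 1
abline(v = num_clusters[elbow_point], col = "red")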

Based on the visualization of the elbow method, we can observe that the inertia values start to decrease more slowly after the number of clusters reaches 3. This indicates that adding more clusters beyond 3 does not significantly contribute to reducing inertia. Therefore, we can determine that K = 3 is an appropriate choice for the number of clusters in our analysis.

Number of Iterations

The number of iterations in the K-Means clustering process is a significant aspect of the modeling procedure.

spotify_km$iter

#> [1] 4

This value represents how many times the algorithm iterated to optimize the cluster assignments and centroids. A higher number of iterations may indicate that the algorithm required more steps to find the optimal clusters, while a lower number could suggest faster convergence.

Number of Observations

The size component reports the number of observations that were assigned to each cluster during the clustering process.

spotify_km$size

#> [1]  10347 170413  51965

The output shows the distribution of observations across the clusters produced by the K-Means method. The breakdown is as follows:

Cluster 1 has 10,347 observations
Cluster 2 has 170,413 observations
Cluster 3 has 51,965 observations

Centroid Center Positions

Centroid centers are the mean values of the attributes within each cluster and represent a central point around which the data points in the cluster are grouped.

spotify_km$centers

#>   popularity acousticness danceability duration_ms     energy instrumentalness
#> 1 -1.1225212    1.1835743   0.04146072  0.07412107  0.3346875       -0.4851339
#> 2  0.2668908   -0.4652109   0.30807513 -0.03047521  0.3974330       -0.2755425
#> 3 -0.6517258    1.2899362  -1.01855096  0.08518122 -1.3699753        1.0002059
#>      liveness   loudness speechiness      tempo    valence
#> 1  2.58008878 -0.4063925   4.0199624 -0.6222912 -0.1471545
#> 2 -0.08061124  0.4354810  -0.1294428  0.1507078  0.2825528
#> 3 -0.24937893 -1.3471891  -0.3759417 -0.3703210 -0.8972973

Cluster Labels

The information about the assigned clusters for the data points can be obtained using the following code:

head(spotify_km$cluster)

#> [1] 2 2 3 3 3 3

These labels indicate the groupings to which the data points belong based on the K-Means clustering algorithm.

Goodness of Fit

In cluster analysis, it is important to evaluate how well the created cluster model fits the data. We can perform evaluations using various methods, one of which is by examining the within-cluster sum of squares (WSS) and between-cluster sum of squares (BSS).

Checking WSS and BSS/TSS

First, we can examine the WSS, which measures the total variability of the data within the clusters. The total WSS can be accessed using the following code:

spotify_km$tot.withinss

#> [1] 1661576

The lower the value of WSS, the denser and more compact the formed clusters, indicating a better cluster model. However, it’s important to note that, based solely on the WSS value, we cannot definitively determine whether the clustering is already optimal or not. Further analysis and comparison with other evaluation methods are needed to make a more informed assessment of the clustering model’s performance.

Furthermore, we can also evaluate by examining the BSS/TSS ratio, which measures how far the clusters are spread overall in the dataset. This ratio can be calculated using the following code:

spotify_km$betweenss/spotify_km$totss

#> [1] 0.3509376

A higher ratio indicates that the clusters are well spread, which also indicates a better clustering model. This ratio ranges between 0 and 1, where a value closer to 1 suggests that the clusters are more distinct and well-separated within the dataset.

Here the ratio is approximately 0.35, meaning that about 35% of the total variation in the scaled data is explained by the differences between the three clusters, while the remaining 65% is variation inside the clusters.
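The relationship between these quantities is TSS = WSS + BSS. The sketch below recomputes each term by hand on simulated toy data (not the Spotify dataset) and can be compared with the totss, tot.withinss, and betweenss components returned by kmeans().

```r
set.seed(100)
# Simulated toy data: three groups of 30 points in two dimensions
toy <- rbind(matrix(rnorm(60, mean = 0), ncol = 2),
             matrix(rnorm(60, mean = 4), ncol = 2),
             matrix(rnorm(60, mean = 8), ncol = 2))
km <- kmeans(toy, centers = 3, nstart = 10)

# Total sum of squares: squared distances of all points to the grand mean
tss <- sum(sweep(toy, 2, colMeans(toy))^2)

# Within-cluster sum of squares: squared distances to each point's own centroid
wss <- sum((toy - km$centers[km$cluster, ])^2)

# Between-cluster sum of squares is what remains
bss <- tss - wss

c(tss = tss, wss = wss, bss = bss, ratio = bss / tss)
```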

However, it is important to note that choosing a very large number of clusters (k) always produces a very low WSS and a favorable BSS/TSS ratio. In the extreme case where k equals the number of observations, every cluster contains only one observation, WSS is 0, and the ratio is 1, yet nothing has been grouped and the clustering objective is not achieved. Therefore, when evaluating a cluster model, it is necessary to consider the trade-off between a low WSS value and the usefulness of the resulting clusters. In practice, determining the optimal number of clusters often involves the use of visualization methods and other tests, such as the elbow method or silhouette analysis, to help select the most suitable cluster model for the data.
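Silhouette analysis, mentioned above, can be sketched as follows. This example is an assumption-laden illustration rather than part of the original analysis: it uses simulated toy data and the silhouette() function from the ‘cluster’ package, which is not loaded elsewhere in this document and is assumed to be installed.

```r
library(cluster)  # assumed to be installed; not among the libraries loaded earlier

set.seed(100)
# Simulated toy data: three groups of 30 points in two dimensions
toy <- rbind(matrix(rnorm(60, mean = 0), ncol = 2),
             matrix(rnorm(60, mean = 4), ncol = 2),
             matrix(rnorm(60, mean = 8), ncol = 2))
toy_dist <- dist(toy)

ks <- 2:6

# Average silhouette width for each candidate k (values closer to 1 are better)
avg_sil <- sapply(ks, function(k) {
  km  <- kmeans(toy, centers = k, nstart = 10)
  sil <- silhouette(km$cluster, toy_dist)
  mean(sil[, "sil_width"])
})

# Total WSS for the same values of k, for comparison
wss <- sapply(ks, function(k) kmeans(toy, centers = k, nstart = 10)$tot.withinss)

data.frame(k = ks, wss = wss, avg_silhouette = avg_sil)
```

Unlike WSS, the average silhouette width does not improve automatically as k grows, so it can be used to compare different values of k directly. On the full Spotify data the pairwise distance matrix would be far too large to compute, so a random subsample of rows would be needed.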

Profiling

In the context of cluster analysis on the Spotify dataset, profiling can help us describe the detailed characteristics of the music within each cluster. For example, we can identify clusters with high values in attributes like danceability, energy, and loudness, which could be interpreted as clusters of high-energy songs suitable for dancing and entertainment. On the other hand, clusters with high values in attributes like acousticness and instrumentalness might indicate songs with a strong acoustic and instrumental emphasis.

To integrate the cluster information back into our dataset, we employ the following code within our analysis:

# Assign cluster column into the dataset
spotify_num$cluster <- spotify_km$cluster
head(spotify_num)

This step enables us to associate each data point with its respective cluster assignment. By appending the cluster column to our dataset, we can now gain insights into how individual songs are categorized within the identified clusters.

Next, we proceed with the process of profiling by summarizing the data within each cluster. This is achieved through the following code:

# Profile the clusters by summarising the data
spotify_centroid <- spotify_num %>% 
  group_by(cluster) %>% 
  summarise_all(mean)

spotify_centroid

Furthermore, to delve deeper into the profiling process, we utilize the following code:

spotify_centroid %>% 
  pivot_longer(-cluster) %>% 
  group_by(name) %>% 
  summarize(
    group_min = which.min(value),
    group_max = which.max(value))

In this step, we transform the centroid profiles into a longer format to facilitate comparison and analysis. For each attribute, which.min() and which.max() return the cluster with the lowest and the highest mean value (the row position matches the cluster number because the centroid table is ordered by cluster). This allows us to pinpoint the specific musical traits that distinguish the clusters from one another and gives a finer-grained understanding of the characteristics that define each group.
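The same idea can be checked by hand in base R. The sketch below copies five columns of the scaled centroids printed earlier by spotify_km$centers and finds, for each attribute, the cluster with the lowest and highest value. Scaling does not change the ordering of cluster means within an attribute, so the ranking matches the one based on unscaled means.

```r
# Scaled centroids for five attributes, copied from the spotify_km$centers output
centers <- rbind(
  "1" = c(popularity = -1.1225212, acousticness =  1.1835743, energy =  0.3346875,
          liveness =  2.58008878, speechiness =  4.0199624),
  "2" = c(popularity =  0.2668908, acousticness = -0.4652109, energy =  0.3974330,
          liveness = -0.08061124, speechiness = -0.1294428),
  "3" = c(popularity = -0.6517258, acousticness =  1.2899362, energy = -1.3699753,
          liveness = -0.24937893, speechiness = -0.3759417)
)

# For every attribute (column), find the cluster (row) with the lowest
# and the highest centroid value
profile <- data.frame(
  attribute   = colnames(centers),
  cluster_min = rownames(centers)[apply(centers, 2, which.min)],
  cluster_max = rownames(centers)[apply(centers, 2, which.max)]
)
profile
```

For these five attributes, Cluster 1 has the highest liveness and speechiness, Cluster 2 the highest popularity and energy, and Cluster 3 the highest acousticness.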

Clustering Visualization (PCA Biplot + Kmeans)

In the Clustering Visualization stage, we use the PCA (Principal Component Analysis) Biplot method to visualize the results of cluster analysis using the K-Means algorithm. PCA Biplot helps reduce the dimensions of the data and depict the relationships between songs in a two-dimensional graph. Each point on the biplot represents a song, while the direction of vectors indicates the contribution of music attributes to the variability in the data. This way, we can understand the cluster distribution, observe formed group patterns, and identify the distinguishing musical characteristics of each cluster. This visualization provides a better visual insight into how songs are grouped based on relevant music attributes.

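# Define cluster labels
cluster_labels <- c("Dynamic Expressions", "Energetic Grooves", "Melodic Explorations")
# Use the scaled matrix the model was fitted on; spotify_num now also holds
# the cluster column, which should not enter the PCA
fviz_cluster(object = spotify_km,
             data = spotify_scaled, labelsize = 1, labels = cluster_labels)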

ggRadar(
  data=spotify_num,
  mapping = aes(colours = cluster),
  interactive = T
)
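To illustrate what the PCA step behind the biplot computes, here is a small sketch on simulated data. The column names only mimic audio features and the values are hypothetical; the point is to show the variance share of each component, the loadings (the directions of the attribute vectors), and the two-dimensional scores that a cluster plot displays.

```r
set.seed(100)
# Simulated stand-in for scaled audio features (hypothetical values)
n <- 200
energy   <- rnorm(n)
loudness <- 0.8 * energy + rnorm(n, sd = 0.6)  # built to correlate with energy
valence  <- rnorm(n)
toy <- scale(cbind(energy, loudness, valence))

pca <- prcomp(toy)

# Share of total variance captured by each principal component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
round(var_explained, 3)

# Loadings: how strongly each attribute contributes to each component
pca$rotation

# Coordinates of every observation on the first two components,
# which is what a two-dimensional cluster plot displays
scores <- pca$x[, 1:2]
head(scores)
```

If the first two components capture only a modest share of the variance, clusters that look overlapping in the plot may still be separated along other dimensions.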

Based on the visualization of the profiling results, here are the insights derived from the analysis of the Spotify dataset using K-Means clustering:

Cluster 1

Highest Musical Traits: Liveness, Speechiness
Lowest Musical Traits: Instrumentalness, Tempo
Label: Dynamic Expressions

Cluster 2

Highest Musical Traits: Danceability, Energy, Loudness, Popularity, Tempo
Lowest Musical Traits: Acousticness, Duration_ms
Label: Energetic Grooves

Cluster 3

Highest Musical Traits: Acousticness, Duration_ms, Instrumentalness
Lowest Musical Traits: Danceability, Energy, Liveness, Loudness, Speechiness
Label: Melodic Explorations

In this cluster profiling:

Cluster 1 is characterized by high liveness and speechiness, along with low instrumentalness and tempo. It is labeled as “Dynamic Expressions.”
Cluster 2 showcases high danceability, energy, loudness, popularity, and tempo, with low acousticness and duration_ms. It is labeled as “Energetic Grooves.”
Cluster 3 exhibits high acousticness, duration_ms, and instrumentalness, while having low danceability, energy, liveness, loudness, and speechiness. It is labeled as “Melodic Explorations.”

Music Recommendation

In the context of the Spotify platform, the concept of music recommendation becomes highly relevant and beneficial. Spotify has successfully integrated data analysis techniques to provide a more personalized and tailored listening experience for each user through a process known as profiling. To continue the analysis, we add cluster information to the Spotify dataset.

spotify_clean$cluster <- spotify_km$cluster
head(spotify_clean)

For example, if someone is currently listening to the song ‘Ouverture’ by Fabien Nataf, we can display recommendations based on the same music cluster. This leverages the cluster information that has been added to the dataset earlier. As a result, song recommendations will be more aligned with the user’s music taste, providing a more cohesive and satisfying listening experience.

spotify_clean %>% 
  filter(cluster == 2)

Based on the profiling results, Cluster 2 represents ‘Energetic Grooves,’ a category characterized by lively and energetic musical selections. This cluster likely includes songs with high energy, danceability, and possibly upbeat tempos, making it a suitable choice for listeners in the mood for dynamic and lively music. We can randomly select songs from this cluster to provide recommendations to the user, ensuring that the suggestions align with the filtered Energetic Grooves cluster.

energetic_grooves_songs <- spotify_clean %>% 
  filter(cluster == 2) %>%
  sample_n(10)  # Randomly pick 10 songs from this cluster, without repeats
energetic_grooves_songs
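The lookup described above, from the song currently playing to suggestions from the same cluster, can be wrapped in a small function. This is a sketch under stated assumptions: recommend_similar() is a hypothetical helper, and the toy data frame uses made-up tracks in place of spotify_clean.

```r
# recommend_similar() is a hypothetical helper written for this sketch
recommend_similar <- function(data, track, artist, n = 3) {
  # Find the cluster of the song that is currently playing
  is_current <- data$track_name == track & data$artist_name == artist
  if (!any(is_current)) stop("Song not found in the dataset")
  target_cluster <- data$cluster[is_current][1]

  # Candidates: all other songs from the same cluster
  pool <- data[data$cluster == target_cluster & !is_current, ]

  # Sample without replacement so that no song is suggested twice
  pool[sample(nrow(pool), min(n, nrow(pool))), ]
}

# Toy data with made-up tracks; in the analysis this would be spotify_clean
toy <- data.frame(
  track_name  = paste("Track", LETTERS[1:8]),
  artist_name = paste("Artist", rep(1:4, each = 2)),
  cluster     = c(1, 2, 2, 3, 2, 2, 3, 1),
  stringsAsFactors = FALSE
)

set.seed(100)
recs <- recommend_similar(toy, track = "Track B", artist = "Artist 1", n = 3)
recs
```

With the real data, the call would take the form recommend_similar(spotify_clean, "Ouverture", "Fabien Nataf", n = 10), provided that the track and artist names match the dataset exactly.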

Conclusion

In conclusion, the K-Means clustering analysis performed on the Spotify dataset successfully grouped songs into distinct clusters based on their musical attributes. The elbow method was employed to determine the optimal number of clusters, which resulted in the selection of three clusters. Each cluster exhibited unique characteristics that could be interpreted as different musical styles or genres.

Cluster 1, termed “Dynamic Expressions,” comprises songs with high liveness and speechiness but lower instrumentalness and tempo. These songs might be suitable for energetic and dynamic occasions, such as live performances or motivational playlists.
Cluster 2, labeled as “Energetic Grooves,” encompasses songs with high danceability, energy, loudness, and popularity, making them ideal choices for upbeat and energetic settings, like parties or workouts.
Cluster 3, known as “Melodic Explorations,” includes songs with higher acousticness, longer duration, and instrumental elements. These songs could create a more relaxed and contemplative atmosphere, suitable for moments of introspection or background music.

Considering the profiles of these clusters, we can provide song recommendations based on the identified musical attributes:

Dynamic Expressions Cluster: For lively and expressive moments, consider songs with high liveness and speechiness, such as live recordings, speeches, or engaging performances.
Energetic Grooves Cluster: If you’re looking to uplift the mood and add energy, go for songs with high danceability, energy, and loudness. These tracks are perfect for parties, workouts, or any lively activity.
Melodic Explorations Cluster: For a more soothing and introspective ambiance, explore songs with acoustic elements, longer durations, and instrumental nuances. These tracks can enhance relaxation or create a calming environment.

By leveraging the insights gained from the K-Means clustering and profiling analysis, music enthusiasts, playlist curators, and even artists can tailor their musical selections to suit various occasions and preferences. This methodology provides a data-driven approach to recommending songs that align with specific musical characteristics and styles, enhancing the overall music listening experience.

Discovering Musical Diversity: K-Means Clustering Analysis and Song Recommendations from Spotify Dataset

Rusdi Permana

2023-08-17