Spotify is an online music entertainment platform that has transformed the way we listen to and enjoy music. Launched in 2008, Spotify has become one of the most popular music streaming services worldwide. The dataset used for this K-Means clustering analysis consists of a large collection of songs, each described by a set of attributes. Clustering analysis aims to group songs based on the similarity of those attributes, enabling a better understanding of listener preferences, music trends, and the potential for collaboration between artists or genres. By employing the K-Means method, the Spotify dataset can be partitioned into distinct clusters based on relevant attributes, providing deeper insights into the dynamics of music and listener preferences.
The following libraries support data manipulation, exploration, and visualization for the K-Means clustering analysis on the Spotify dataset.
library(dplyr)
library(GGally)
library(inspectdf)
library(ggiraphExtra)
library(factoextra)
library(tidyr)
dplyr: This package provides a set of functions for data
manipulation and transformation. It’s used for tasks like filtering,
selecting, grouping, and summarizing data.
GGally: This package extends the plotting capabilities
of the popular ggplot2 package. It provides additional functions to
create various types of visualizations like scatter plot matrices,
correlation plots, and more.
inspectdf: This package is designed to help you explore
and understand your data. It provides functions to quickly summarize,
visualize, and inspect the data’s basic statistics and properties.
ggiraphExtra: This package extends the capabilities of
ggplot2 by allowing you to create interactive and animated ggplot
graphics using SVG (Scalable Vector Graphics) as output.
factoextra: This package contains various functions to
extract and visualize the results of multivariate data analysis, such as
principal component analysis (PCA), clustering, and more.
tidyr: This package is used for data tidying tasks. It
helps to reorganize and reshape data into a tidy format, where each
column represents a variable and each row an observation.
In the Exploratory Data Analysis phase for this case, we will dig into the Spotify dataset to uncover initial insights.
In the ‘Reading Dataset’ stage, the first step is to read the dataset from the ‘SpotifyFeatures.csv’ file and store it in the ‘spotify’ variable.
We then conduct an initial exploration of the dataset using the glimpse() function from the ‘dplyr’ package, which gives a compact overview of every column, its type, and its first few values.
#> Rows: 232,725
#> Columns: 18
#> $ genre <chr> "Movie", "Movie", "Movie", "Movie", "Movie", "Movie",…
#> $ artist_name <chr> "Henri Salvador", "Martin & les fées", "Joseph Willia…
#> $ track_name <chr> "C'est beau de faire un Show", "Perdu d'avance (par G…
#> $ track_id <chr> "0BRjO6ga9RKCKjfDqeFgWV", "0BjC1NfoEOOusryehmNudP", "…
#> $ popularity <int> 0, 1, 3, 0, 4, 0, 2, 15, 0, 10, 0, 2, 4, 3, 0, 0, 0, …
#> $ acousticness <dbl> 0.61100, 0.24600, 0.95200, 0.70300, 0.95000, 0.74900,…
#> $ danceability <dbl> 0.389, 0.590, 0.663, 0.240, 0.331, 0.578, 0.703, 0.41…
#> $ duration_ms <int> 99373, 137373, 170267, 152427, 82625, 160627, 212293,…
#> $ energy <dbl> 0.9100, 0.7370, 0.1310, 0.3260, 0.2250, 0.0948, 0.270…
#> $ instrumentalness <dbl> 0.00000000, 0.00000000, 0.00000000, 0.00000000, 0.123…
#> $ key <chr> "C#", "F#", "C", "C#", "F", "C#", "C#", "F#", "C", "G…
#> $ liveness <dbl> 0.3460, 0.1510, 0.1030, 0.0985, 0.2020, 0.1070, 0.105…
#> $ loudness <dbl> -1.828, -5.559, -13.879, -12.178, -21.150, -14.970, -…
#> $ mode <chr> "Major", "Minor", "Minor", "Major", "Major", "Major",…
#> $ speechiness <dbl> 0.0525, 0.0868, 0.0362, 0.0395, 0.0456, 0.1430, 0.953…
#> $ tempo <dbl> 166.969, 174.003, 99.488, 171.758, 140.576, 87.479, 8…
#> $ time_signature <chr> "4/4", "4/4", "5/4", "4/4", "4/4", "4/4", "4/4", "4/4…
#> $ valence <dbl> 0.8140, 0.8160, 0.3680, 0.2270, 0.3900, 0.3580, 0.533…
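The call that produced this overview is not shown above. As a minimal sketch, assuming a comma-separated file with a header row, reading and inspecting such a file in base R could look like the following. The temporary file, its two made-up rows, and the name `spotify_demo` are illustrative stand-ins, not the article's data.

```r
# Write a tiny stand-in CSV so the example runs without the real file.
# The two rows are made up; only the column names mirror the article.
tmp <- tempfile(fileext = ".csv")
writeLines(c("genre,track_name,popularity,energy",
             "Movie,Song A,0,0.91",
             "Pop,Song B,55,0.74"), tmp)

# For the article's data the path would be "SpotifyFeatures.csv"
spotify_demo <- read.csv(tmp, stringsAsFactors = FALSE)

# str() is base R's rough counterpart to dplyr::glimpse()
str(spotify_demo)
```

Keeping text columns as character (`stringsAsFactors = FALSE`) matches the `<chr>` types in the overview above, where conversion to factors happens later as an explicit step.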
Here is a brief explanation of each column in the dataset:
genre: This column contains the music genre category for
each song, such as “Movie”, “Pop”, “Rock”, and so on.
artist_name: This column contains the name of the artist
who created a specific song.
track_name: This column contains the names of the
songs.
track_id: This column contains a unique ID for each
song.
popularity: This column indicates the popularity level
of a song, measured on a scale from 0 to 100.
acousticness: This attribute indicates the extent to
which a song has acoustic elements in its production.
danceability: This attribute measures how suitable a
song is for dancing, with higher values indicating songs that are more
rhythmically suitable.
duration_ms: This is the duration of the song in
milliseconds (ms).
energy: This attribute describes the energy level in a
song, with higher values indicating more energetic songs.
instrumentalness: This attribute indicates the extent to
which a song lacks vocals, with higher values indicating tracks that are more likely to be purely instrumental.
key: This attribute indicates the basic musical key of
the song (e.g., “C#” or “F”).
liveness: This attribute indicates the extent of live
performance in the song.
loudness: This is the relative loudness level of the
song in decibels (dB).
mode: This attribute indicates whether the song is in
major or minor mode.
speechiness: This attribute measures the presence of
spoken words in the song, with higher values indicating more speech-like recordings.
tempo: This attribute represents the tempo or speed of
the music in beats per minute (BPM).
time_signature: This attribute indicates the music
meter, such as “4/4” for a 4/4 time signature.
valence: This attribute describes the level of
positivity in the song, with higher values indicating more cheerful or
positive songs.
With an understanding of each of these columns, we can begin to explore and analyze the Spotify dataset further for the purpose of cluster analysis.
We will take a closer look at the unique values in several key columns of the Spotify dataset.
# Use sapply() to count the number of unique values in each column
sapply(spotify, function(col) length(unique(col)))
#> genre artist_name track_name track_id
#> 27 14564 148615 176774
#> popularity acousticness danceability duration_ms
#> 101 4734 1295 70749
#> energy instrumentalness key liveness
#> 2517 5400 12 1732
#> loudness mode speechiness tempo
#> 27923 2 1641 78512
#> time_signature valence
#> 5 1692
Let’s perform an examination of the missing values within the dataset.
#> genre artist_name track_name track_id
#> 0 0 0 0
#> popularity acousticness danceability duration_ms
#> 0 0 0 0
#> energy instrumentalness key liveness
#> 0 0 0 0
#> loudness mode speechiness tempo
#> 0 0 0 0
#> time_signature valence
#> 0 0
In this dataset, there are no missing values in any column, indicating that the data is sufficiently complete to proceed to the next stage of analysis.
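The code behind the missing-value table is not shown. One common way to obtain per-column counts like these in base R is `colSums(is.na(...))`. The sketch below uses a small made-up data frame (`demo`, with deliberately inserted `NA` values) rather than the article's data.

```r
# Made-up data frame with two missing values, for illustration only
demo <- data.frame(
  popularity = c(10, NA, 30),
  energy     = c(0.5, 0.7, NA),
  genre      = c("Pop", "Rock", "Jazz")
)

# is.na() gives a logical matrix; colSums() counts the TRUEs per column
na_counts <- colSums(is.na(demo))
na_counts

# anyNA() is a quick yes/no check for the whole data frame
anyNA(demo)
```

A result of all zeros, as in the table above, means no column needs imputation or row removal before clustering.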
Data Wrangling is a crucial stage in this analysis, where we will clean, transform the format, and organize the data from the Spotify dataset.
First, we will remove the ‘track_id’ column, as this attribute does not contribute meaningfully to cluster analysis. In the K-Means method, identifiers like ‘track_id’ carry no information about how a song sounds, so focusing on the descriptive music attributes is more useful for forming meaningful clusters.
Next, we will perform data type conversion for the ‘genre’, ‘key’, ‘mode’, and ‘time_signature’ columns into factors.
spotify_clean <- spotify %>%
  select(-c(track_id)) %>%
  mutate_at(vars(genre, key, mode, time_signature), as.factor)
glimpse(spotify_clean)
#> Rows: 232,725
#> Columns: 17
#> $ genre <fct> Movie, Movie, Movie, Movie, Movie, Movie, Movie, Movi…
#> $ artist_name <chr> "Henri Salvador", "Martin & les fées", "Joseph Willia…
#> $ track_name <chr> "C'est beau de faire un Show", "Perdu d'avance (par G…
#> $ popularity <int> 0, 1, 3, 0, 4, 0, 2, 15, 0, 10, 0, 2, 4, 3, 0, 0, 0, …
#> $ acousticness <dbl> 0.61100, 0.24600, 0.95200, 0.70300, 0.95000, 0.74900,…
#> $ danceability <dbl> 0.389, 0.590, 0.663, 0.240, 0.331, 0.578, 0.703, 0.41…
#> $ duration_ms <int> 99373, 137373, 170267, 152427, 82625, 160627, 212293,…
#> $ energy <dbl> 0.9100, 0.7370, 0.1310, 0.3260, 0.2250, 0.0948, 0.270…
#> $ instrumentalness <dbl> 0.00000000, 0.00000000, 0.00000000, 0.00000000, 0.123…
#> $ key <fct> C#, F#, C, C#, F, C#, C#, F#, C, G, E, C, F#, D#, G, …
#> $ liveness <dbl> 0.3460, 0.1510, 0.1030, 0.0985, 0.2020, 0.1070, 0.105…
#> $ loudness <dbl> -1.828, -5.559, -13.879, -12.178, -21.150, -14.970, -…
#> $ mode <fct> Major, Minor, Minor, Major, Major, Major, Major, Majo…
#> $ speechiness <dbl> 0.0525, 0.0868, 0.0362, 0.0395, 0.0456, 0.1430, 0.953…
#> $ tempo <dbl> 166.969, 174.003, 99.488, 171.758, 140.576, 87.479, 8…
#> $ time_signature <fct> 4/4, 4/4, 5/4, 4/4, 4/4, 4/4, 4/4, 4/4, 4/4, 4/4, 4/4…
#> $ valence <dbl> 0.8140, 0.8160, 0.3680, 0.2270, 0.3900, 0.3580, 0.533…
This code snippet is cleaning and preparing the dataset by removing the track_id column and converting specific columns to the factor data type, which is an essential step in data preprocessing for further analysis.
In K-Means analysis, the selection of numeric columns is performed because this method requires numerical data to calculate distances between data points. Columns such as ‘popularity’, ‘acousticness’, ‘danceability’, ‘duration_ms’, ‘energy’, ‘instrumentalness’, ‘liveness’, ‘loudness’, ‘speechiness’, ‘tempo’, and ‘valence’ provide information about music attributes that can be quantitatively analyzed. By using these attributes, K-Means can group songs based on similarity in specific musical aspects, allowing us to gain deeper insights into patterns within the data.
#> Rows: 232,725
#> Columns: 11
#> $ popularity <int> 0, 1, 3, 0, 4, 0, 2, 15, 0, 10, 0, 2, 4, 3, 0, 0, 0, …
#> $ acousticness <dbl> 0.61100, 0.24600, 0.95200, 0.70300, 0.95000, 0.74900,…
#> $ danceability <dbl> 0.389, 0.590, 0.663, 0.240, 0.331, 0.578, 0.703, 0.41…
#> $ duration_ms <int> 99373, 137373, 170267, 152427, 82625, 160627, 212293,…
#> $ energy <dbl> 0.9100, 0.7370, 0.1310, 0.3260, 0.2250, 0.0948, 0.270…
#> $ instrumentalness <dbl> 0.00000000, 0.00000000, 0.00000000, 0.00000000, 0.123…
#> $ liveness <dbl> 0.3460, 0.1510, 0.1030, 0.0985, 0.2020, 0.1070, 0.105…
#> $ loudness <dbl> -1.828, -5.559, -13.879, -12.178, -21.150, -14.970, -…
#> $ speechiness <dbl> 0.0525, 0.0868, 0.0362, 0.0395, 0.0456, 0.1430, 0.953…
#> $ tempo <dbl> 166.969, 174.003, 99.488, 171.758, 140.576, 87.479, 8…
#> $ valence <dbl> 0.8140, 0.8160, 0.3680, 0.2270, 0.3900, 0.3580, 0.533…
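The selection step that produced this 11-column table is not shown. A minimal base R sketch of keeping only the numeric columns of a data frame follows. The data frame `demo` and its values are made up, and the article itself may have used a dplyr helper instead.

```r
# Made-up mixed-type data frame, for illustration only
demo <- data.frame(
  genre      = factor(c("Pop", "Rock")),
  track_name = c("Song A", "Song B"),
  popularity = c(55L, 40L),
  energy     = c(0.74, 0.33),
  stringsAsFactors = FALSE
)

# Flag numeric columns (factors and character columns return FALSE)
is_num <- vapply(demo, is.numeric, logical(1))

# Keep only those columns; drop = FALSE preserves the data frame structure
demo_num <- demo[, is_num, drop = FALSE]
names(demo_num)
```

Factors are excluded on purpose: although they are stored as integers internally, their codes have no distance meaning for K-Means.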
Before proceeding with further analysis, a crucial step in data exploration is to examine the data distribution. Understanding how the data is spread is a vital initial step, as it can provide insights into the characteristics and patterns that may exist in the dataset.
#> popularity acousticness danceability duration_ms
#> Min. : 0.00 Min. :0.0000 Min. :0.0569 Min. : 15387
#> 1st Qu.: 29.00 1st Qu.:0.0376 1st Qu.:0.4350 1st Qu.: 182857
#> Median : 43.00 Median :0.2320 Median :0.5710 Median : 220427
#> Mean : 41.13 Mean :0.3686 Mean :0.5544 Mean : 235122
#> 3rd Qu.: 55.00 3rd Qu.:0.7220 3rd Qu.:0.6920 3rd Qu.: 265768
#> Max. :100.00 Max. :0.9960 Max. :0.9890 Max. :5552917
#> energy instrumentalness liveness loudness
#> Min. :0.0000203 Min. :0.0000000 Min. :0.00967 Min. :-52.457
#> 1st Qu.:0.3850000 1st Qu.:0.0000000 1st Qu.:0.09740 1st Qu.:-11.771
#> Median :0.6050000 Median :0.0000443 Median :0.12800 Median : -7.762
#> Mean :0.5709577 Mean :0.1483012 Mean :0.21501 Mean : -9.570
#> 3rd Qu.:0.7870000 3rd Qu.:0.0358000 3rd Qu.:0.26400 3rd Qu.: -5.501
#> Max. :0.9990000 Max. :0.9990000 Max. :1.00000 Max. : 3.744
#> speechiness tempo valence
#> Min. :0.0222 Min. : 30.38 Min. :0.0000
#> 1st Qu.:0.0367 1st Qu.: 92.96 1st Qu.:0.2370
#> Median :0.0501 Median :115.78 Median :0.4440
#> Mean :0.1208 Mean :117.67 Mean :0.4549
#> 3rd Qu.:0.1050 3rd Qu.:139.05 3rd Qu.:0.6600
#> Max. :0.9670 Max. :242.90 Max. :1.0000
We can see that the variables have very different ranges of values. For example, ‘duration_ms’ runs from 15,387 to 5,552,917 and ‘popularity’ from 0 to 100, while ‘acousticness’ and ‘instrumentalness’ only range from 0 to about 1. Without adjustment, the variables with the largest magnitudes would dominate any distance-based analysis. Scaling addresses these magnitude differences, ensuring that each variable has a balanced impact on the analysis we conduct.
## Check Covariance
Covariance measures the degree to which two variables change together. A positive covariance suggests that the two variables tend to increase or decrease together, while a negative covariance suggests an inverse relationship.
#> popularity acousticness danceability duration_ms
#> popularity 330.8741927 -2.460579486 0.866213977 5079.7963
#> acousticness -2.4605795 0.125860362 -0.024004549 472.7050
#> danceability 0.8662140 -0.024004549 0.034450413 -2776.6700
#> duration_ms 5079.7962942 472.705010446 -2776.670015161 14145750520.8082
#> energy 1.1928936 -0.067816440 0.015931805 -957.2578
#> instrumentalness -1.1619558 0.033958915 -0.020508345 2737.5056
#> liveness -0.6058861 0.004853762 -0.001534008 560.8353
#> loudness 39.6070158 -1.468729115 0.488376611 -33970.6429
#> speechiness -0.5098157 0.009933929 0.004633400 -356.8187
#> tempo 45.5478771 -2.611654392 0.125822789 -104575.8610
#> valence 0.2841956 -0.030059094 0.026411285 -4386.3797
#> energy instrumentalness liveness loudness
#> popularity 1.192893559 -1.161955848 -0.605886064 39.607015849
#> acousticness -0.067816440 0.033958915 0.004853762 -1.468729115
#> danceability 0.015931805 -0.020508345 -0.001534008 0.488376611
#> duration_ms -957.257831674 2737.505647036 560.835264134 -33970.642881913
#> energy 0.069408833 -0.030227884 0.010071149 1.289631260
#> instrumentalness -0.030227884 0.091668682 -0.008055978 -0.919510999
#> liveness 0.010071149 -0.008055978 0.039312018 0.054333071
#> loudness 1.289631260 -0.919510999 0.054333071 35.978446810
#> speechiness 0.007092851 -0.009950208 0.018764818 -0.002529084
#> tempo 1.862333254 -0.974186536 -0.314619135 42.324447453
#> valence 0.029925682 -0.024214148 0.000608679 0.623816414
#> speechiness tempo valence
#> popularity -0.509815661 45.5478771 0.284195566
#> acousticness 0.009933929 -2.6116544 -0.030059094
#> danceability 0.004633400 0.1258228 0.026411285
#> duration_ms -356.818722591 -104575.8610139 -4386.379669824
#> energy 0.007092851 1.8623333 0.029925682
#> instrumentalness -0.009950208 -0.9741865 -0.024214148
#> liveness 0.018764818 -0.3146191 0.000608679
#> loudness -0.002529084 42.3244475 0.623816414
#> speechiness 0.034417042 -0.4674159 0.001150285
#> tempo -0.467415886 954.7424276 1.083677477
#> valence 0.001150285 1.0836775 0.067634056
By applying the scaling method, we can generate an equivalent data representation, making the cluster analysis more objective, and allowing information from all attributes to contribute in a balanced manner to the cluster formation.
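As a small illustration of why scaling matters, the sketch below builds two made-up variables on very different scales, standardizes them with base R's `scale()`, and checks that the covariance of the scaled data equals the correlation of the raw data. The names `demo` and `demo_scaled` and the simulated values are assumptions for this example only.

```r
set.seed(1)

# Two made-up variables: one in the hundreds of thousands, one in [0, 1]
demo <- data.frame(
  duration_ms = rnorm(100, mean = 235000, sd = 119000),
  energy      = runif(100)
)

# Raw variances differ by many orders of magnitude
diag(cov(demo))

# scale() subtracts each column mean and divides by the column sd
demo_scaled <- scale(demo)

# After scaling, every column has mean 0 and standard deviation 1 ...
round(colMeans(demo_scaled), 10)
apply(demo_scaled, 2, sd)

# ... and the covariance matrix equals the correlation matrix of the raw data
scaled_cov_matches_cor <- isTRUE(all.equal(unname(cov(demo_scaled)),
                                           unname(cor(demo))))
scaled_cov_matches_cor
```

Because K-Means works on Euclidean distances, the unscaled `duration_ms` column would otherwise decide the clusters almost on its own.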
Once the numeric variables have been scaled, methods that depend on variance, such as PCA with the prcomp() function, are no longer dominated by whichever variable happens to have the largest units. As a result, we obtain a more balanced visual representation of the variable distribution in the dataset. Next, we can combine the scaling results with the non-numeric columns of the previous dataset using the following code:
spotify_final <- spotify_clean %>%
  select_if(~!is.numeric(.)) %>%
  cbind(spotify_scaled)
spotify_final
The dataset has been successfully updated by merging the scaled numeric variables with the non-numeric columns of the original dataset.
In the case of cluster analysis on the Spotify dataset, especially using the K-Means method, the modeling process involves grouping data into clusters determined by the K-Means algorithm. This algorithm seeks cluster centers (centroids) based on the distances between data points, where each data point is assigned to the cluster with the closest centroid. The steps of modeling with K-Means include initializing cluster centroids, calculating distances, cluster assignment, centroid updating, and iterating until convergence.
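The steps listed above can be sketched in a few lines of base R. This is a simplified illustration of the basic (Lloyd-style) iteration, not the routine that R's `kmeans()` runs by default. The function name `simple_kmeans` and the toy data are made up for this example.

```r
simple_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  # 1. Initialise: pick k distinct rows as starting centroids
  centroids <- x[sample(nrow(x), k), , drop = FALSE]
  cluster <- integer(nrow(x))
  for (iter in seq_len(max_iter)) {
    # 2. Squared Euclidean distance from every point to every centroid
    d <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centroids[j, ])^2))
    # 3. Assign each point to its nearest centroid
    new_cluster <- max.col(-d, ties.method = "first")
    # 5. Converged once the assignments stop changing
    if (all(new_cluster == cluster)) break
    cluster <- new_cluster
    # 4. Move each centroid to the mean of the points assigned to it
    for (j in seq_len(k)) {
      if (any(cluster == j)) {
        centroids[j, ] <- colMeans(x[cluster == j, , drop = FALSE])
      }
    }
  }
  list(cluster = cluster, centers = centroids, iter = iter)
}

# Toy data: two well-separated groups of 20 points each
set.seed(42)
toy <- rbind(matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
             matrix(rnorm(40, mean = 5, sd = 0.3), ncol = 2))
fit <- simple_kmeans(toy, k = 2)
table(fit$cluster)
```

The result depends on the random starting centroids, which is why the analysis sets a seed and why `kmeans()` offers the `nstart` argument to try several starts.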
# k-means with 3 clusters
RNGkind(sample.kind = "Rounding")
set.seed(100)
spotify_km <- kmeans(x = spotify_scaled,
                     centers = 3)
One commonly used method for choosing the number of clusters is the elbow method. This method involves plotting the inertia (the total within-cluster sum of squares, WSS) against the number of clusters used in the analysis. Inertia measures how far the data points are spread within their clusters, and the smaller the inertia value, the more compact the clusters. In the elbow plot, we look for the point where the decrease in inertia becomes slower, resembling an elbow. At this point, adding more clusters no longer significantly reduces inertia, and that is the recommended number of clusters to choose.
# Define the range of clusters you want to consider
num_clusters <- 2:10
# Calculate WSS for each number of clusters
wss <- numeric(length(num_clusters))
for (i in seq_along(num_clusters)) {
  k <- num_clusters[i]
  kmeans_model <- kmeans(spotify_scaled, centers = k, nstart = 10)
  wss[i] <- kmeans_model$tot.withinss
}
# Plot the WSS values against the number of clusters
plot(num_clusters, wss, type = "b", pch = 19, frame = FALSE,
     xlab = "Number of Clusters", ylab = "Within-Cluster Sum of Squares")
# Mark a candidate "elbow": the k where the WSS curve bends most sharply
# (largest second difference). This is only a heuristic; the plot should
# still be inspected by eye.
elbow_point <- which.max(diff(wss, differences = 2)) + 1
abline(v = num_clusters[elbow_point], col = "red")
Based on the visualization of the elbow method, we can observe that the inertia values start to decrease more slowly after the number of clusters reaches 3. This indicates that adding more clusters beyond 3 does not significantly contribute to reducing inertia. Therefore, we can determine that K = 3 is an appropriate choice for the number of clusters in our analysis.
The number of iterations the K-Means algorithm needed to converge is recorded in the fitted model (its iter component):
#> [1] 4
This value represents how many times the algorithm iterated to optimize the cluster assignments and centroids. A higher number of iterations may indicate that the algorithm required more steps to find a stable solution, while a lower number suggests faster convergence.
The size component gives the number of observations that were assigned to each cluster during the clustering process.
#> [1] 10347 170413 51965
The output shows how the observations are distributed across the clusters: 10,347 songs in cluster 1, 170,413 in cluster 2, and 51,965 in cluster 3. The clusters are therefore far from equal in size, with cluster 2 holding most of the songs.
Cluster centroids are the mean values of the (scaled) attributes within each cluster and represent the central point around which the data points in the cluster are grouped.
#> popularity acousticness danceability duration_ms energy instrumentalness
#> 1 -1.1225212 1.1835743 0.04146072 0.07412107 0.3346875 -0.4851339
#> 2 0.2668908 -0.4652109 0.30807513 -0.03047521 0.3974330 -0.2755425
#> 3 -0.6517258 1.2899362 -1.01855096 0.08518122 -1.3699753 1.0002059
#> liveness loudness speechiness tempo valence
#> 1 2.58008878 -0.4063925 4.0199624 -0.6222912 -0.1471545
#> 2 -0.08061124 0.4354810 -0.1294428 0.1507078 0.2825528
#> 3 -0.24937893 -1.3471891 -0.3759417 -0.3703210 -0.8972973
The cluster label assigned to each data point is stored in the cluster component of the model. The labels of the first six songs are:
#> [1] 2 2 3 3 3 3
These labels indicate the groupings to which the data points belong based on the K-Means clustering algorithm.
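The quantities discussed above (iterations, cluster sizes, centroids, and labels) are all components of the object returned by `kmeans()`. The sketch below shows how they can be read from a model fitted on made-up toy data. `toy` and `toy_km` are illustrative names, not objects from the analysis.

```r
set.seed(100)

# Toy data: two groups of 30 points in two dimensions
toy <- rbind(matrix(rnorm(60, mean = 0), ncol = 2),
             matrix(rnorm(60, mean = 6), ncol = 2))
toy_km <- kmeans(toy, centers = 2, nstart = 5)

toy_km$iter           # number of iterations until convergence
toy_km$size           # number of observations in each cluster
toy_km$centers        # centroid coordinates, one row per cluster
head(toy_km$cluster)  # cluster labels of the first six rows
```

For the Spotify model, the same accessors would be applied to `spotify_km`.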
In cluster analysis, it is important to evaluate how well the created cluster model fits the data. We can perform evaluations using various methods, one of which is by examining the within-cluster sum of squares (WSS) and between-cluster sum of squares (BSS).
First, we can examine the WSS, which captures the total variability of the data within the clusters. The total WSS is stored in the tot.withinss component of the fitted model:
#> [1] 1661576
The lower the value of WSS, the denser and more compact the formed clusters, indicating a better cluster model. However, it’s important to note that, based solely on the WSS value, we cannot definitively determine whether the clustering is already optimal or not. Further analysis and comparison with other evaluation methods are needed to make a more informed assessment of the clustering model’s performance.
Furthermore, we can evaluate the BSS/TSS ratio, which measures how much of the total variation in the dataset lies between the clusters. It is obtained by dividing the model's betweenss component by its totss component:
#> [1] 0.3509376
This ratio ranges between 0 and 1. A value closer to 1 means that more of the total variation lies between the clusters rather than within them, so the clusters are more distinct and well separated. Here the ratio is approximately 0.35, meaning that the three clusters account for roughly 35% of the total variation in the scaled data.
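The relationship between these quantities can be checked on a small example: the total sum of squares splits exactly into the within-cluster and between-cluster parts. The sketch below uses R's built-in `iris` measurements as stand-in data. `toy` and `toy_km` are illustrative names.

```r
set.seed(100)

# Stand-in data: the four numeric iris measurements, scaled
toy <- scale(iris[, 1:4])
toy_km <- kmeans(toy, centers = 3, nstart = 10)

# Total within-cluster sum of squares (WSS)
toy_km$tot.withinss

# Share of the total variation that lies between clusters (BSS / TSS)
bss_tss <- toy_km$betweenss / toy_km$totss
bss_tss

# TSS = WSS + BSS
decomposition_holds <- isTRUE(all.equal(
  toy_km$totss, toy_km$tot.withinss + toy_km$betweenss
))
decomposition_holds
```

Because of this decomposition, lowering WSS and raising BSS/TSS are two views of the same improvement, so the two measures do not provide independent evidence.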
However, it is important to note that choosing a very large number of clusters (k) will always produce a very low WSS and a favorable BSS/TSS ratio. Such results are no longer meaningful: in the extreme, some clusters contain only one observation, and the objective of clustering, which is to summarize the data in a few groups, is not achieved. Therefore, when evaluating a cluster model, it is necessary to consider the trade-off between a low WSS value and the usefulness of the resulting clusters. In practice, determining the number of clusters often involves visualization methods and other tests, such as the elbow method or silhouette analysis, to help select the most suitable cluster model for the data.
In the context of cluster analysis on the Spotify dataset, profiling can help us describe the detailed characteristics of the music within each cluster. For example, we can identify clusters with high values in attributes like danceability, energy, and loudness, which could be interpreted as clusters of high-energy songs suitable for dancing and entertainment. On the other hand, clusters with high values in attributes like acousticness and instrumentalness might indicate songs with a strong acoustic and instrumental emphasis.
To integrate the cluster information back into our dataset, we employ the following code within our analysis:
# Assign cluster column into the dataset
spotify_num$cluster <- spotify_km$cluster
head(spotify_num)
This step enables us to associate each data point with its respective cluster assignment. By appending the cluster column to our dataset, we can now gain insights into how individual songs are categorized within the identified clusters.
We then proceed with profiling by summarizing the data within each cluster. This is achieved through the following code:
# profile the clusters by summarising the data
spotify_centroid <- spotify_num %>%
  group_by(cluster) %>%
  summarise_all(mean)
spotify_centroid
Furthermore, to delve deeper into the profiling process, we utilize the following code:
spotify_centroid %>%
  pivot_longer(-cluster) %>%
  group_by(name) %>%
  summarize(
    # row positions match cluster numbers because rows are ordered by cluster
    group_min = which.min(value),
    group_max = which.max(value))
In this step, we transform the centroid profiles into a longer format to facilitate comparison. For each attribute, we then identify which cluster has the lowest mean value (group_min) and which has the highest (group_max). This allows us to pinpoint the musical traits that most clearly distinguish the clusters from one another and gives a finer-grained understanding of the characteristics that define each group.
In the Clustering Visualization stage, we use the PCA (Principal Component Analysis) Biplot method to visualize the results of cluster analysis using the K-Means algorithm. PCA Biplot helps reduce the dimensions of the data and depict the relationships between songs in a two-dimensional graph. Each point on the biplot represents a song, while the direction of vectors indicates the contribution of music attributes to the variability in the data. This way, we can understand the cluster distribution, observe formed group patterns, and identify the distinguishing musical characteristics of each cluster. This visualization provides a better visual insight into how songs are grouped based on relevant music attributes.
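The projection behind such a plot can be sketched with base R's `prcomp()`: the data are projected onto the first two principal components and the points are coloured by cluster. This is a simplified stand-in for the biplot described above, using the built-in `iris` measurements. The names `toy`, `toy_km`, `pca`, and `scores` are illustrative.

```r
set.seed(100)

# Stand-in data: scaled iris measurements clustered into three groups
toy <- scale(iris[, 1:4])
toy_km <- kmeans(toy, centers = 3, nstart = 10)

# PCA on the already-scaled data
pca <- prcomp(toy)

# Coordinates of every observation on the first two components
scores <- pca$x[, 1:2]

# Share of the total variance captured by each component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
round(var_explained[1:2], 3)

# Scatter plot of the projection, coloured by cluster label
plot(scores, col = toy_km$cluster, pch = 19, xlab = "PC1", ylab = "PC2")
```

How faithfully the two-dimensional picture reflects the clusters depends on how much variance the first two components capture, so `var_explained` is worth checking before reading too much into overlaps in the plot.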
# Define cluster labels
cluster_labels <- c("Dynamic Expressions", "Energetic Grooves", "Melodic Explorations")
# Plot on the scaled features the model was fitted on; spotify_num now also
# carries the cluster column, which should not enter the PCA projection
fviz_cluster(object = spotify_km,
             data = spotify_scaled, labelsize = 1, labels = cluster_labels)
Based on the visualization and the profiling results, the clusters can be characterized as follows:
Cluster 1 is characterized by high liveness and speechiness, along with low instrumentalness and tempo. It is labeled as “Dynamic Expressions.”
Cluster 2 showcases high danceability, energy, loudness, popularity, and tempo, with low acousticness and duration_ms. It is labeled as “Energetic Grooves.”
Cluster 3 exhibits high acousticness, duration_ms, and instrumentalness, while having low danceability, energy, liveness, loudness, and speechiness. It is labeled as “Melodic Explorations.”
In the context of the Spotify platform, the concept of music recommendation becomes highly relevant and beneficial. Spotify has successfully integrated data analysis techniques to provide a more personalized and tailored listening experience for each user through a process known as profiling. To continue the analysis, we add cluster information to the Spotify dataset.
For example, if someone is currently listening to the song ‘Ouverture’ by Fabien Nataf, we can display recommendations based on the same music cluster. This leverages the cluster information that has been added to the dataset earlier. As a result, song recommendations will be more aligned with the user’s music taste, providing a more cohesive and satisfying listening experience.
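The lookup described here can be sketched as a small helper: find the cluster of the track currently playing, then sample other tracks from the same cluster. The function `recommend_from_same_cluster`, the data frame `songs`, and all of its values are made up for illustration.

```r
# Made-up catalogue with cluster labels already attached
songs <- data.frame(
  track_name = c("Song A", "Song B", "Song C", "Song D", "Song E", "Song F"),
  cluster    = c(1, 2, 2, 3, 2, 3),
  stringsAsFactors = FALSE
)

recommend_from_same_cluster <- function(data, current_track, n = 2) {
  # Cluster of the track currently playing (first match if names repeat)
  current_cluster <- data$cluster[data$track_name == current_track][1]
  # Other tracks from that cluster, excluding the current one
  candidates <- data[data$cluster == current_cluster &
                       data$track_name != current_track, , drop = FALSE]
  # Draw up to n of them at random, without repeats
  picked <- sample(nrow(candidates), min(n, nrow(candidates)))
  candidates[picked, , drop = FALSE]
}

set.seed(100)
recs <- recommend_from_same_cluster(songs, "Song B", n = 2)
recs
```

Matching on the track name alone is a simplification: in the real dataset several songs can share a name, so matching on both track and artist would be safer.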
Based on the profiling above, Cluster 2 represents ‘Energetic Grooves,’ a category characterized by lively and energetic musical selections. This cluster includes songs with high energy, danceability, and comparatively fast tempos, making it a suitable choice for listeners in the mood for dynamic and lively music. We can randomly select songs from this cluster to provide recommendations to the user, ensuring that the suggestions come from the Energetic Grooves cluster.
energetic_grooves_songs <- spotify_clean %>%
  mutate(cluster = spotify_km$cluster) %>%  # attach the K-Means labels if not already present
  filter(cluster == 2) %>%
  sample_n(10)  # randomly draw 10 distinct songs from that cluster
energetic_grooves_songs
In conclusion, the K-Means clustering analysis performed on the Spotify dataset grouped songs into three clusters based on their musical attributes. The elbow method was employed to choose the number of clusters, which resulted in the selection of three clusters. Each cluster exhibited characteristics that could be interpreted as a different musical style.
Cluster 1, termed “Dynamic Expressions,” comprises songs with high liveness and speechiness but lower instrumentalness and tempo. These songs might be suitable for energetic and dynamic occasions, such as live performances or motivational playlists.
Cluster 2, labeled as “Energetic Grooves,” encompasses songs with high danceability, energy, loudness, and popularity, making them ideal choices for upbeat and energetic settings, like parties or workouts.
Cluster 3, known as “Melodic Explorations,” includes songs with higher acousticness, longer duration, and instrumental elements. These songs could create a more relaxed and contemplative atmosphere, suitable for moments of introspection or background music.
Considering the profiles of these clusters, we can provide song recommendations based on the identified musical attributes:
Dynamic Expressions Cluster: For lively and expressive moments, consider songs with high liveness and speechiness, such as live recordings, speeches, or engaging performances.
Energetic Grooves Cluster: If you’re looking to uplift the mood and add energy, go for songs with high danceability, energy, and loudness. These tracks are perfect for parties, workouts, or any lively activity.
Melodic Explorations Cluster: For a more soothing and introspective ambiance, explore songs with acoustic elements, longer durations, and instrumental nuances. These tracks can enhance relaxation or create a calming environment.
By leveraging the insights gained from the K-Means clustering and profiling analysis, music enthusiasts, playlist curators, and even artists can tailor their musical selections to suit various occasions and preferences. This methodology provides a data-driven approach to recommending songs that align with specific musical characteristics and styles, enhancing the overall music listening experience.