Spotify is a leading digital music, podcast, and video streaming service that offers a vast collection of songs and content from artists worldwide. As one of the largest tech companies in the music industry, Spotify utilizes advanced technology to enhance user experience. For instance, each song track uploaded to the platform is meticulously identified, and users can easily access audio feature information for each track through the Spotify API link.
In this analysis, we will use the Spotify data set from Kaggle to perform K-means clustering and reduce dimensions with PCA. The results will be used to develop a song recommendation system based on users’ recent listening preferences on Spotify.
library(tidyverse)
library(glue)
library(factoextra)
library(FactoMineR)
library(ggplot2)
library(viridis)
library(GGally)
library(scales)
library(tidyr)
library(ggiraphExtra)Let’s just load the data for this project.
spotify <- read.csv("SpotifyFeatures.csv")
spotifydim(spotify)#> [1] 232725 18
There are 18 columns and 232,725 rows in the dataset, The following is an explanation of each column in the dataset:
acousticness: A confidence measure from 0.0 to 1.0 of
whether the track is acoustic. 1.0 represents high confidence the track
is acoustic.danceability: Danceability describes how suitable a
track is for dancing based on a combination of musical elements
including tempo, rhythm stability, beat strength, and overall
regularity. A value of 0.0 is least danceable and 1.0 is most
danceable.energy: Energy is a measure from 0.0 to 1.0 and
represents a perceptual measure of intensity and activity. Typically,
energetic tracks feel fast, loud, and noisy. For example, death metal
has high energy, while a Bach prelude scores low on the scale.
Perceptual features contributing to this attribute include dynamic
range, perceived loudness, timbre, onset rate, and general entropy.instrumentalness: Predicts whether a track contains no
vocals. “Ooh” and “aah” sounds are treated as instrumental in this
context. Rap or spoken word tracks are clearly “vocal”. The closer the
instrumentalness value is to 1.0, the greater likelihood the track
contains no vocal content. Values above 0.5 are intended to represent
instrumental tracks, but confidence is higher as the value approaches
1.0.key: The key the track is in. Integers map to pitches
using standard Pitch Class notation. E.g. 0 = C, 1 = C♯/D♭, 2 = D, and
so on. If no key was detected, the value is -1.liveness: Detects the presence of an audience in the
recording. Higher liveness values represent an increased probability
that the track was performed live. A value above 0.8 provides strong
likelihood that the track is live.loudness: The overall loudness of a track in decibels
(dB). Loudness values are averaged across the entire track and are
useful for comparing relative loudness of tracks. Loudness is the
quality of a sound that is the primary psychological correlate of
physical strength (amplitude). Values typically range between -60 and 0
db.mode: Mode indicates the modality (major or minor) of a
track, the type of scale from which its melodic content is derived.
Major is represented by 1 and minor is 0.speechiness: Speechiness detects the presence of spoken
words in a track. The more exclusively speech-like the recording
(e.g. talk show, audio book, poetry), the closer to 1.0 the attribute
value. Values above 0.66 describe tracks that are probably made entirely
of spoken words. Values between 0.33 and 0.66 describe tracks that may
contain both music and speech, either in sections or layered, including
such cases as rap music. Values below 0.33 most likely represent music
and other non-speech-like tracks.tempo: The overall estimated tempo of a track in beats
per minute (BPM). In musical terminology, tempo is the speed or pace of
a given piece and derives directly from the average beat duration.time_signature: An estimated time signature. The time
signature (meter) is a notational convention to specify how many beats
are in each bar (or measure). The time signature ranges from 3 to 7
indicating time signatures of “3/4”, to “7/4”.valence: A measure from 0.0 to 1.0 describing the
musical positiveness conveyed by a track. Tracks with high valence sound
more positive (e.g. happy, cheerful, euphoric), while tracks with low
valence sound more negative (e.g. sad, depressed, angry).First of all, We will check if there is any missing values in the data. .
colSums(is.na(spotify))#> genre artist_name track_name track_id
#> 0 0 0 0
#> popularity acousticness danceability duration_ms
#> 0 0 0 0
#> energy instrumentalness key liveness
#> 0 0 0 0
#> loudness mode speechiness tempo
#> 0 0 0 0
#> time_signature valence
#> 0 0
We didn’t find any missing values inside the data.
glimpse(spotify)#> Rows: 232,725
#> Columns: 18
#> $ genre <chr> "Movie", "Movie", "Movie", "Movie", "Movie", "Movie",…
#> $ artist_name <chr> "Henri Salvador", "Martin & les fées", "Joseph Willia…
#> $ track_name <chr> "C'est beau de faire un Show", "Perdu d'avance (par G…
#> $ track_id <chr> "0BRjO6ga9RKCKjfDqeFgWV", "0BjC1NfoEOOusryehmNudP", "…
#> $ popularity <int> 0, 1, 3, 0, 4, 0, 2, 15, 0, 10, 0, 2, 4, 3, 0, 0, 0, …
#> $ acousticness <dbl> 0.61100, 0.24600, 0.95200, 0.70300, 0.95000, 0.74900,…
#> $ danceability <dbl> 0.389, 0.590, 0.663, 0.240, 0.331, 0.578, 0.703, 0.41…
#> $ duration_ms <int> 99373, 137373, 170267, 152427, 82625, 160627, 212293,…
#> $ energy <dbl> 0.9100, 0.7370, 0.1310, 0.3260, 0.2250, 0.0948, 0.270…
#> $ instrumentalness <dbl> 0.00000000, 0.00000000, 0.00000000, 0.00000000, 0.123…
#> $ key <chr> "C#", "F#", "C", "C#", "F", "C#", "C#", "F#", "C", "G…
#> $ liveness <dbl> 0.3460, 0.1510, 0.1030, 0.0985, 0.2020, 0.1070, 0.105…
#> $ loudness <dbl> -1.828, -5.559, -13.879, -12.178, -21.150, -14.970, -…
#> $ mode <chr> "Major", "Minor", "Minor", "Major", "Major", "Major",…
#> $ speechiness <dbl> 0.0525, 0.0868, 0.0362, 0.0395, 0.0456, 0.1430, 0.953…
#> $ tempo <dbl> 166.969, 174.003, 99.488, 171.758, 140.576, 87.479, 8…
#> $ time_signature <chr> "4/4", "4/4", "5/4", "4/4", "4/4", "4/4", "4/4", "4/4…
#> $ valence <dbl> 0.8140, 0.8160, 0.3680, 0.2270, 0.3900, 0.3580, 0.533…
Several columns need their data types changed to the appropriate format:
genre should be changed to factor.key should be changed to factor.mode should be changed to factor.Beside that, we will remove unnecessary columns, such as
track_id, time_signature,
track_name, and artist_name. Also, we will
create a new column called song by combining the columns
artist_name and track_name.
spotify_clean <- spotify %>%
mutate(genre = as.factor(genre),
key = as.factor(key),
mode = as.factor(mode),
song = paste(artist_name, track_name, sep = " - ")) %>%
select(-track_id, -time_signature, -artist_name, -track_name)
spotify_cleanFirst, we will delete any rows that contain duplicate data in the
data frame. The song column is supposed to have unique
data.
sum(duplicated(spotify_clean$song))#> [1] 56211
There are 56211 duplicate rows. Let’s remove duplicate rows using
distinct() on song column.
spotify_clean <- distinct(spotify_clean, song, .keep_all = TRUE)
sum(duplicated(spotify_clean))#> [1] 0
Then, we will set the song column as the index.
rownames(spotify_clean) <- spotify_clean$song
spotify_clean <- spotify_clean %>% select(-song)
spotify_cleanTo reduce computational load and ease exploratory analysis on cluster formation, we will utilize a random 20,000 sample from the entire dataset. This approach will provide relevant insights without compromising the analysis quality, as the sample captures the variation in the original data. Moreover, it enables us to execute the clustering algorithm more efficiently and quickly.
set.seed(125)
spotify_sample <- spotify_clean %>% slice_sample(n = 20000)
dim(spotify_sample)#> [1] 20000 14
After sampling, we checked the distribution of one column and found that the frequency patterns were consistent with the original data.
#Before Sampling
hist(spotify_clean$popularity)#After Sampling
hist(spotify_sample$popularity)To apply the K-means and PCA method, we will only use data with a numeric type.
spotify_num <- spotify_sample %>% select_if(is.numeric)We need to examine the scale of variables in the dataset to ensure they are in similar ranges or directly comparable. This step will help identify any significant differences in values among audio features like tempo, energy, or loudness. By doing so, we can improve the accuracy of the clustering analysis and make the results more meaningful.
summary(spotify_num)#> popularity acousticness danceability duration_ms
#> Min. : 0.00 Min. :0.0000015 Min. :0.0599 Min. : 15509
#> 1st Qu.:25.00 1st Qu.:0.0438000 1st Qu.:0.4160 1st Qu.: 178000
#> Median :37.00 Median :0.2820000 Median :0.5610 Median : 220034
#> Mean :36.29 Mean :0.4020584 Mean :0.5429 Mean : 235519
#> 3rd Qu.:49.00 3rd Qu.:0.7880000 3rd Qu.:0.6840 3rd Qu.: 268860
#> Max. :97.00 Max. :0.9960000 Max. :0.9850 Max. :3909799
#> energy instrumentalness liveness loudness
#> Min. :0.000216 Min. :0.0000000 Min. :0.0119 Min. :-45.369
#> 1st Qu.:0.343000 1st Qu.:0.0000000 1st Qu.:0.0982 1st Qu.:-12.790
#> Median :0.592000 Median :0.0000656 Median :0.1300 Median : -8.223
#> Mean :0.556874 Mean :0.1695646 Mean :0.2248 Mean :-10.120
#> 3rd Qu.:0.787000 3rd Qu.:0.0847000 3rd Qu.:0.2760 3rd Qu.: -5.613
#> Max. :0.999000 Max. :0.9980000 Max. :1.0000 Max. : 1.893
#> speechiness tempo valence
#> Min. :0.0225 Min. : 34.21 Min. :0.0000
#> 1st Qu.:0.0368 1st Qu.: 92.00 1st Qu.:0.2220
#> Median :0.0498 Median :114.97 Median :0.4410
#> Mean :0.1296 Mean :117.01 Mean :0.4514
#> 3rd Qu.:0.1030 3rd Qu.:138.28 3rd Qu.:0.6660
#> Max. :0.9670 Max. :242.90 Max. :0.9920
We need to scale our data because the values in each column have different scales.
# Normalize numeric data using scale()
spotify_scale <- spotify_num %>% scale()
summary(spotify_scale)#> popularity acousticness danceability duration_ms
#> Min. :-2.08244 Min. :-1.0978 Min. :-2.53095 Min. :-1.7888
#> 1st Qu.:-0.64795 1st Qu.:-0.9782 1st Qu.:-0.66503 1st Qu.:-0.4677
#> Median : 0.04061 Median :-0.3278 Median : 0.09475 Median :-0.1259
#> Mean : 0.00000 Mean : 0.0000 Mean : 0.00000 Mean : 0.0000
#> 3rd Qu.: 0.72917 3rd Qu.: 1.0538 3rd Qu.: 0.73926 3rd Qu.: 0.2711
#> Max. : 3.48340 Max. : 1.6217 Max. : 2.31646 Max. :29.8732
#> energy instrumentalness liveness loudness
#> Min. :-2.0199 Min. :-0.5286 Min. :-1.0105 Min. :-5.5271
#> 1st Qu.:-0.7761 1st Qu.:-0.5286 1st Qu.:-0.6010 1st Qu.:-0.4187
#> Median : 0.1275 Median :-0.5284 Median :-0.4500 Median : 0.2974
#> Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
#> 3rd Qu.: 0.8350 3rd Qu.:-0.2645 3rd Qu.: 0.2429 3rd Qu.: 0.7067
#> Max. : 1.6043 Max. : 2.5825 Max. : 3.6790 Max. : 1.8836
#> speechiness tempo valence
#> Min. :-0.5160 Min. :-2.65879 Min. :-1.68830
#> 1st Qu.:-0.4471 1st Qu.:-0.80291 1st Qu.:-0.85800
#> Median :-0.3845 Median :-0.06542 Median :-0.03892
#> Mean : 0.0000 Mean : 0.00000 Mean : 0.00000
#> 3rd Qu.:-0.1281 3rd Qu.: 0.68299 3rd Qu.: 0.80260
#> Max. : 4.0349 Max. : 4.04258 Max. : 2.02187
ggcorr(spotify_scale, low = "#2980B9", mid = "#ECF0F1", high = "#2C3E50", label = TRUE, label_size = 3, label_alpha = TRUE,
hjust = 0.9, layout.exp = 2)The results indicate a strong positive correlation between
energy and loudness, and a strong negative
correlation between acousticness with energy
and loudness
Clustering is the process of grouping data based on similar characteristics. The main goal is to form clusters of data with similar attributes. Observations within the same cluster should be similar to each other, while those in different clusters should have distinct characteristics. Before using k-means, we will determine the optimal value of k.
We will determine the optimal number of clusters using 2 methods: the Elbow Method, and Silhouette Method.
In the Elbow Method, we choose the “bend of an elbow” on the graph to determine the optimal number of clusters.
fviz_nbclust(spotify_scale, kmeans, method = "wss", k.max = 10) +
labs(subtitle = "Elbow method")We found that using 3 clusters is sufficient because there is no significant decline in the total within-cluster sum of squares with a higher number of clusters.
The silhouette method calculates the silhouette coefficient, which involves finding the mean intra-cluster distance and the mean nearest-cluster distance for each observation. The optimal number of clusters is determined by selecting the number of clusters that corresponds to the highest silhouette score (the peak).
fviz_nbclust(spotify_scale, kmeans, method = "silhouette", k.max = 10) +
labs(subtitle = "Sillhouette method")According to the Silhouette method, the optimum number of clusters with the maximum score is 3.
We implemented the optimal K from our previous process, and we decided to use K = 3.
RNGkind(sample.kind = "Rounding")
set.seed(100)
spotify_cluster <- kmeans(spotify_scale, centers = 3)The goodness of clustering results can be evaluated using three values:
$withinss): the sum of squared
distances from each observation to the centroid of its cluster.$betweenss): the sum of
weighted squared distances from each cluster centroid to the global
mean, weighted by the number of observations in the cluster.$totss): the sum of squared
distances from each observation to the global mean.# check wss value
spotify_cluster$withinss#> [1] 10673.08 42606.15 85077.61
spotify_cluster$tot.withinss#> [1] 138356.8
# check bss value
spotify_cluster$betweenss#> [1] 81632.17
# check tss value
spotify_cluster$totss#> [1] 219989
# check bss/tss ratio
spotify_cluster$betweenss/spotify_cluster$totss#> [1] 0.3710739
The proportion value of 0.37 (or 37%) from the results indicates that the K-means model has successfully explained about 37% of the data’s variation based on the formed clusters.
Create a new column that contains label information for the clusters formed using the optimal K.
spotify_sample$cluster <- as.factor(spotify_cluster$cluster)
spotify_sampleGroup the data based on the formed clusters to examine the characteristics of each cluster.
spotify_num$cluster <- as.factor(spotify_cluster$cluster)
spotify_centroid <- spotify_num %>%
group_by(cluster) %>%
summarise_all(mean)
spotify_centroidTo simplify profiling, we need to create a table that displays the clusters with the lowest and highest values for each audio feature characteristic. Additionally, we can create a radar plot and PCA-Biplot integrated plot for better visualization.
spotify_centroid %>%
pivot_longer(-cluster) %>%
group_by(name) %>%
summarize(
min_cluster = which.min(value),
max_cluster = which.max(value))A radar plot is a graph used to visualize the variation in values of multiple variables. It displays attributes as axes, and the distance from the center to a point on each axis represents the value of that attribute.
ggRadar(data=spotify_num,
aes(colour=cluster),
interactive=TRUE)spotify_pca <- PCA(X = spotify_num,
scale.unit = T,
quali.sup = 'cluster',
graph = F)# biplot + cluster visualization
fviz_pca_biplot(X = spotify_pca,
habillage = "cluster",
geom.ind = "point",
addEllipses = T,
col.var = "navy")💡 Profiling of each cluster:
Cluster 1
liveness,
speechinessinstrumentalness,
popularity, tempoCluster 2
acousticness,
instrumentalness, duration_msdanceability,
energy, liveness, loudness,
speechiness, valenceCluster 3
danceability,
energy, loudness, popularity,
tempo, valenceacousticness,
duration_msNow, we aim to gain new insights or information that was not previously obtained through dataset exploration, using the analysis and modeling conducted earlier.
After obtaining three clusters, we want to determine if the clusters
are grouped by genre. Below are the frequencies of each
genre within each cluster.
spotify_genre <- table(spotify_sample %>% select(genre, cluster)) %>%
as.data.frame() %>%
arrange(genre, cluster)
spotify_genreFor a clearer overview, visualization is performed as follows:
spotify_genre %>%
ggplot(aes(x = genre, y = Freq, fill = cluster)) +
geom_col() +
labs(title = "Clusters for Each Genre",
y = "Frequency") +
coord_flip()The graph above provides a clearer view of the cluster distribution
for each genre. From the graph, we can observe the
following:
Here is the information in dataframe form:
spotify_genre %>%
group_by(genre) %>%
mutate(prop = Freq / sum(Freq)) %>%
slice_max(Freq == max(Freq)) %>%
ungroup() %>%
arrange(cluster)Based on the information above, there is a hypothesis that individuals who like certain genres are likely to enjoy other genres within the same cluster. However, this statement is questionable because cluster 3 covers a wide range of music genres, making it uncertain whether people who like genres in cluster 3 would necessarily enjoy other genres within the same cluster.
Now, do something similar but in the key column.
spotify_key <- table(spotify_sample %>% select(key, cluster)) %>%
as.data.frame() %>%
arrange(key, cluster)
spotify_keyIn visualization form:
spotify_key %>%
ggplot(aes(x = key, y = Freq, fill = cluster)) +
geom_col() +
labs(title = "Clusters for Each Key",
y = "Frequency")The key variable does not provide interesting
information as the proportion of clusters in each key is
almost the same.
We want to make a song recommender based on users’ recent listening
on Spotify. For example, a user is listening to the
Coldplay - Adventure of a Lifetime song. What songs would
we recommend after listening to the song?
Identify the Coldplay - Adventure of a Lifetime song in
which cluster.
spotify_sample["Coldplay - Adventure of a Lifetime", ]The song Coldplay - Adventure of a Lifetime is in
cluster 3. We will refer to a song from the same cluster as
Coldplay - Adventure of a Lifetime because songs within the
same cluster have similar characteristics. Afterward, we will sort the
songs by popularity.
The following are the recommended songs to play after
Coldplay - Adventure of a Lifetime:
song_recommendation <- spotify_sample[spotify_sample$cluster==3,] %>%
arrange(desc(popularity))
song_recommendationThe conclusions from the above project results are as follows: