Introduction

Spotify is a leading digital music, podcast, and video streaming service that offers a vast collection of songs and content from artists worldwide. As one of the largest tech companies in the music industry, Spotify utilizes advanced technology to enhance user experience. For instance, each song track uploaded to the platform is meticulously identified, and users can easily access audio feature information for each track through the Spotify API link.

In this analysis, we will use the Spotify data set from Kaggle to perform K-means clustering and reduce dimensions with PCA. The results will be used to develop a song recommendation system based on users’ recent listening preferences on Spotify.

Load Library

library(tidyverse)
library(glue)
library(factoextra)
library(FactoMineR)
library(ggplot2)
library(viridis)
library(GGally)
library(scales)
library(tidyr)
library(ggiraphExtra)

Load Dataset

Let’s just load the data for this project.

spotify <- read.csv("SpotifyFeatures.csv")
spotify
dim(spotify)
#> [1] 232725     18

There are 18 columns and 232,725 rows in the dataset, The following is an explanation of each column in the dataset:

  • acousticness: A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.
  • danceability: Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.
  • energy: Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.
  • instrumentalness: Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.
  • key: The key the track is in. Integers map to pitches using standard Pitch Class notation. E.g. 0 = C, 1 = C♯/D♭, 2 = D, and so on. If no key was detected, the value is -1.
  • liveness: Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.
  • loudness: The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 db.
  • mode: Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. Major is represented by 1 and minor is 0.
  • speechiness: Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.
  • tempo: The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.
  • time_signature: An estimated time signature. The time signature (meter) is a notational convention to specify how many beats are in each bar (or measure). The time signature ranges from 3 to 7 indicating time signatures of “3/4”, to “7/4”.
  • valence: A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).

Data Wrangling and Exploratory Data Analysis

Check Missing Values

First of all, We will check if there is any missing values in the data. .

colSums(is.na(spotify))
#>            genre      artist_name       track_name         track_id 
#>                0                0                0                0 
#>       popularity     acousticness     danceability      duration_ms 
#>                0                0                0                0 
#>           energy instrumentalness              key         liveness 
#>                0                0                0                0 
#>         loudness             mode      speechiness            tempo 
#>                0                0                0                0 
#>   time_signature          valence 
#>                0                0

We didn’t find any missing values inside the data.

Check Data Type

glimpse(spotify)
#> Rows: 232,725
#> Columns: 18
#> $ genre            <chr> "Movie", "Movie", "Movie", "Movie", "Movie", "Movie",…
#> $ artist_name      <chr> "Henri Salvador", "Martin & les fées", "Joseph Willia…
#> $ track_name       <chr> "C'est beau de faire un Show", "Perdu d'avance (par G…
#> $ track_id         <chr> "0BRjO6ga9RKCKjfDqeFgWV", "0BjC1NfoEOOusryehmNudP", "…
#> $ popularity       <int> 0, 1, 3, 0, 4, 0, 2, 15, 0, 10, 0, 2, 4, 3, 0, 0, 0, …
#> $ acousticness     <dbl> 0.61100, 0.24600, 0.95200, 0.70300, 0.95000, 0.74900,…
#> $ danceability     <dbl> 0.389, 0.590, 0.663, 0.240, 0.331, 0.578, 0.703, 0.41…
#> $ duration_ms      <int> 99373, 137373, 170267, 152427, 82625, 160627, 212293,…
#> $ energy           <dbl> 0.9100, 0.7370, 0.1310, 0.3260, 0.2250, 0.0948, 0.270…
#> $ instrumentalness <dbl> 0.00000000, 0.00000000, 0.00000000, 0.00000000, 0.123…
#> $ key              <chr> "C#", "F#", "C", "C#", "F", "C#", "C#", "F#", "C", "G…
#> $ liveness         <dbl> 0.3460, 0.1510, 0.1030, 0.0985, 0.2020, 0.1070, 0.105…
#> $ loudness         <dbl> -1.828, -5.559, -13.879, -12.178, -21.150, -14.970, -…
#> $ mode             <chr> "Major", "Minor", "Minor", "Major", "Major", "Major",…
#> $ speechiness      <dbl> 0.0525, 0.0868, 0.0362, 0.0395, 0.0456, 0.1430, 0.953…
#> $ tempo            <dbl> 166.969, 174.003, 99.488, 171.758, 140.576, 87.479, 8…
#> $ time_signature   <chr> "4/4", "4/4", "5/4", "4/4", "4/4", "4/4", "4/4", "4/4…
#> $ valence          <dbl> 0.8140, 0.8160, 0.3680, 0.2270, 0.3900, 0.3580, 0.533…

Several columns need their data types changed to the appropriate format:

  • genre should be changed to factor.
  • key should be changed to factor.
  • mode should be changed to factor.

Beside that, we will remove unnecessary columns, such as track_id, time_signature, track_name, and artist_name. Also, we will create a new column called song by combining the columns artist_name and track_name.

spotify_clean <- spotify %>% 
  mutate(genre = as.factor(genre),
         key = as.factor(key),
         mode = as.factor(mode),
         song = paste(artist_name, track_name, sep = " - ")) %>% 
  select(-track_id, -time_signature, -artist_name, -track_name)

spotify_clean

Check Duplicate Data

First, we will delete any rows that contain duplicate data in the data frame. The song column is supposed to have unique data.

sum(duplicated(spotify_clean$song))
#> [1] 56211

There are 56211 duplicate rows. Let’s remove duplicate rows using distinct() on song column.

spotify_clean <- distinct(spotify_clean, song, .keep_all = TRUE)
sum(duplicated(spotify_clean))
#> [1] 0

Then, we will set the song column as the index.

rownames(spotify_clean) <- spotify_clean$song
spotify_clean <- spotify_clean %>% select(-song)
spotify_clean

Data Sampling

To reduce computational load and ease exploratory analysis on cluster formation, we will utilize a random 20,000 sample from the entire dataset. This approach will provide relevant insights without compromising the analysis quality, as the sample captures the variation in the original data. Moreover, it enables us to execute the clustering algorithm more efficiently and quickly.

set.seed(125)
spotify_sample <- spotify_clean %>% slice_sample(n = 20000)
dim(spotify_sample)
#> [1] 20000    14

After sampling, we checked the distribution of one column and found that the frequency patterns were consistent with the original data.

#Before Sampling
hist(spotify_clean$popularity)

#After Sampling
hist(spotify_sample$popularity)

Data Scalling

To apply the K-means and PCA method, we will only use data with a numeric type.

spotify_num <- spotify_sample %>% select_if(is.numeric)

We need to examine the scale of variables in the dataset to ensure they are in similar ranges or directly comparable. This step will help identify any significant differences in values among audio features like tempo, energy, or loudness. By doing so, we can improve the accuracy of the clustering analysis and make the results more meaningful.

summary(spotify_num)
#>    popularity     acousticness        danceability     duration_ms     
#>  Min.   : 0.00   Min.   :0.0000015   Min.   :0.0599   Min.   :  15509  
#>  1st Qu.:25.00   1st Qu.:0.0438000   1st Qu.:0.4160   1st Qu.: 178000  
#>  Median :37.00   Median :0.2820000   Median :0.5610   Median : 220034  
#>  Mean   :36.29   Mean   :0.4020584   Mean   :0.5429   Mean   : 235519  
#>  3rd Qu.:49.00   3rd Qu.:0.7880000   3rd Qu.:0.6840   3rd Qu.: 268860  
#>  Max.   :97.00   Max.   :0.9960000   Max.   :0.9850   Max.   :3909799  
#>      energy         instrumentalness       liveness         loudness      
#>  Min.   :0.000216   Min.   :0.0000000   Min.   :0.0119   Min.   :-45.369  
#>  1st Qu.:0.343000   1st Qu.:0.0000000   1st Qu.:0.0982   1st Qu.:-12.790  
#>  Median :0.592000   Median :0.0000656   Median :0.1300   Median : -8.223  
#>  Mean   :0.556874   Mean   :0.1695646   Mean   :0.2248   Mean   :-10.120  
#>  3rd Qu.:0.787000   3rd Qu.:0.0847000   3rd Qu.:0.2760   3rd Qu.: -5.613  
#>  Max.   :0.999000   Max.   :0.9980000   Max.   :1.0000   Max.   :  1.893  
#>   speechiness         tempo           valence      
#>  Min.   :0.0225   Min.   : 34.21   Min.   :0.0000  
#>  1st Qu.:0.0368   1st Qu.: 92.00   1st Qu.:0.2220  
#>  Median :0.0498   Median :114.97   Median :0.4410  
#>  Mean   :0.1296   Mean   :117.01   Mean   :0.4514  
#>  3rd Qu.:0.1030   3rd Qu.:138.28   3rd Qu.:0.6660  
#>  Max.   :0.9670   Max.   :242.90   Max.   :0.9920

We need to scale our data because the values in each column have different scales.

# Normalize numeric data using scale()
spotify_scale <- spotify_num %>% scale()
summary(spotify_scale)
#>    popularity        acousticness      danceability       duration_ms     
#>  Min.   :-2.08244   Min.   :-1.0978   Min.   :-2.53095   Min.   :-1.7888  
#>  1st Qu.:-0.64795   1st Qu.:-0.9782   1st Qu.:-0.66503   1st Qu.:-0.4677  
#>  Median : 0.04061   Median :-0.3278   Median : 0.09475   Median :-0.1259  
#>  Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.0000  
#>  3rd Qu.: 0.72917   3rd Qu.: 1.0538   3rd Qu.: 0.73926   3rd Qu.: 0.2711  
#>  Max.   : 3.48340   Max.   : 1.6217   Max.   : 2.31646   Max.   :29.8732  
#>      energy        instrumentalness     liveness          loudness      
#>  Min.   :-2.0199   Min.   :-0.5286   Min.   :-1.0105   Min.   :-5.5271  
#>  1st Qu.:-0.7761   1st Qu.:-0.5286   1st Qu.:-0.6010   1st Qu.:-0.4187  
#>  Median : 0.1275   Median :-0.5284   Median :-0.4500   Median : 0.2974  
#>  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
#>  3rd Qu.: 0.8350   3rd Qu.:-0.2645   3rd Qu.: 0.2429   3rd Qu.: 0.7067  
#>  Max.   : 1.6043   Max.   : 2.5825   Max.   : 3.6790   Max.   : 1.8836  
#>   speechiness          tempo             valence        
#>  Min.   :-0.5160   Min.   :-2.65879   Min.   :-1.68830  
#>  1st Qu.:-0.4471   1st Qu.:-0.80291   1st Qu.:-0.85800  
#>  Median :-0.3845   Median :-0.06542   Median :-0.03892  
#>  Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.00000  
#>  3rd Qu.:-0.1281   3rd Qu.: 0.68299   3rd Qu.: 0.80260  
#>  Max.   : 4.0349   Max.   : 4.04258   Max.   : 2.02187

Correlation

ggcorr(spotify_scale, low = "#2980B9", mid = "#ECF0F1", high = "#2C3E50", label = TRUE, label_size = 3, label_alpha = TRUE,
       hjust = 0.9, layout.exp = 2)

The results indicate a strong positive correlation between energy and loudness, and a strong negative correlation between acousticness with energy and loudness

Clustering

Clustering is the process of grouping data based on similar characteristics. The main goal is to form clusters of data with similar attributes. Observations within the same cluster should be similar to each other, while those in different clusters should have distinct characteristics. Before using k-means, we will determine the optimal value of k.

Find Optimal Value of K

We will determine the optimal number of clusters using 2 methods: the Elbow Method, and Silhouette Method.

Elbow Method

In the Elbow Method, we choose the “bend of an elbow” on the graph to determine the optimal number of clusters.

fviz_nbclust(spotify_scale, kmeans, method = "wss", k.max = 10) +
  labs(subtitle = "Elbow method")

We found that using 3 clusters is sufficient because there is no significant decline in the total within-cluster sum of squares with a higher number of clusters.

Sillhouette Method

The silhouette method calculates the silhouette coefficient, which involves finding the mean intra-cluster distance and the mean nearest-cluster distance for each observation. The optimal number of clusters is determined by selecting the number of clusters that corresponds to the highest silhouette score (the peak).

fviz_nbclust(spotify_scale, kmeans, method = "silhouette", k.max = 10) +
  labs(subtitle = "Sillhouette method")

According to the Silhouette method, the optimum number of clusters with the maximum score is 3.

K-Means Clustering

We implemented the optimal K from our previous process, and we decided to use K = 3.

RNGkind(sample.kind = "Rounding")
set.seed(100)
spotify_cluster <- kmeans(spotify_scale, centers = 3)

Goodness of Fit

The goodness of clustering results can be evaluated using three values:

  1. Within Sum of Squares ($withinss): the sum of squared distances from each observation to the centroid of its cluster.
  2. Between Sum of Squares ($betweenss): the sum of weighted squared distances from each cluster centroid to the global mean, weighted by the number of observations in the cluster.
  3. Total Sum of Squares ($totss): the sum of squared distances from each observation to the global mean.
# check wss value
spotify_cluster$withinss
#> [1] 10673.08 42606.15 85077.61
spotify_cluster$tot.withinss
#> [1] 138356.8
# check bss value
spotify_cluster$betweenss
#> [1] 81632.17
# check tss value
spotify_cluster$totss
#> [1] 219989
# check bss/tss ratio
spotify_cluster$betweenss/spotify_cluster$totss
#> [1] 0.3710739

The proportion value of 0.37 (or 37%) from the results indicates that the K-means model has successfully explained about 37% of the data’s variation based on the formed clusters.

Interpretation: Cluster Profiling

Create a new column that contains label information for the clusters formed using the optimal K.

spotify_sample$cluster <- as.factor(spotify_cluster$cluster)
spotify_sample

Grouping Data Based on Cluster Label

Group the data based on the formed clusters to examine the characteristics of each cluster.

spotify_num$cluster <- as.factor(spotify_cluster$cluster)
spotify_centroid <- spotify_num %>% 
  group_by(cluster) %>% 
  summarise_all(mean)
spotify_centroid

To simplify profiling, we need to create a table that displays the clusters with the lowest and highest values for each audio feature characteristic. Additionally, we can create a radar plot and PCA-Biplot integrated plot for better visualization.

spotify_centroid %>% 
  pivot_longer(-cluster) %>% 
  group_by(name) %>% 
  summarize(
    min_cluster = which.min(value),
    max_cluster = which.max(value))

Radar Plot

A radar plot is a graph used to visualize the variation in values of multiple variables. It displays attributes as axes, and the distance from the center to a point on each axis represents the value of that attribute.

ggRadar(data=spotify_num, 
        aes(colour=cluster), 
        interactive=TRUE)

Clustering with PCA

spotify_pca <- PCA(X = spotify_num,
               scale.unit = T,
               quali.sup = 'cluster',
               graph = F)
# biplot + cluster visualization
fviz_pca_biplot(X = spotify_pca,
                habillage = "cluster",
                geom.ind = "point",
                addEllipses = T,
                col.var = "navy")

💡 Profiling of each cluster:

Cluster 1

  • Highest in audio features: liveness, speechiness
  • Lowest in audio features: instrumentalness, popularity, tempo
  • This cluster can be labeled as “Live & Speechy Tracks”

Cluster 2

  • Highest in audio features : acousticness, instrumentalness, duration_ms
  • Lowest in audio features: danceability, energy, liveness, loudness, speechiness, valence
  • This cluster can be labeled as “Acoustic & Instrumental Tracks”

Cluster 3

  • Highest in audio features: danceability, energy, loudness, popularity, tempo, valence
  • Lowest in audio features: acousticness, duration_ms
  • This cluster can be labeled as “Energetic & Popular Tracks”

Explore Clustering Data

Now, we aim to gain new insights or information that was not previously obtained through dataset exploration, using the analysis and modeling conducted earlier.

Clusters for Each Genre

After obtaining three clusters, we want to determine if the clusters are grouped by genre. Below are the frequencies of each genre within each cluster.

spotify_genre <- table(spotify_sample %>%  select(genre, cluster)) %>%  
  as.data.frame() %>% 
  arrange(genre, cluster)
spotify_genre

For a clearer overview, visualization is performed as follows:

spotify_genre %>%  
  ggplot(aes(x = genre, y = Freq, fill = cluster)) +
  geom_col() +
  labs(title = "Clusters for Each Genre",
       y = "Frequency") +
  coord_flip()

The graph above provides a clearer view of the cluster distribution for each genre. From the graph, we can observe the following:

  • Cluster 1 is predominantly associated with the Comedy genre.
  • Cluster 2 is mainly related to the Soundtrack, Opera, Classical, and Acapella genres.
  • The remaining genres are mostly assigned to cluster 3.

Here is the information in dataframe form:

spotify_genre %>%
  group_by(genre) %>% 
  mutate(prop = Freq / sum(Freq)) %>% 
  slice_max(Freq == max(Freq)) %>%
  ungroup() %>%
  arrange(cluster)

Based on the information above, there is a hypothesis that individuals who like certain genres are likely to enjoy other genres within the same cluster. However, this statement is questionable because cluster 3 covers a wide range of music genres, making it uncertain whether people who like genres in cluster 3 would necessarily enjoy other genres within the same cluster.

Clusters for Each Key

Now, do something similar but in the key column.

spotify_key <- table(spotify_sample %>% select(key, cluster)) %>%  
  as.data.frame() %>% 
  arrange(key, cluster)
spotify_key

In visualization form:

spotify_key %>%  
  ggplot(aes(x = key, y = Freq, fill = cluster)) +
  geom_col() +
  labs(title = "Clusters for Each Key",
       y = "Frequency")

The key variable does not provide interesting information as the proportion of clusters in each key is almost the same.

Case: Song Recommender

We want to make a song recommender based on users’ recent listening on Spotify. For example, a user is listening to the Coldplay - Adventure of a Lifetime song. What songs would we recommend after listening to the song?

Identify the Coldplay - Adventure of a Lifetime song in which cluster.

spotify_sample["Coldplay - Adventure of a Lifetime", ]

The song Coldplay - Adventure of a Lifetime is in cluster 3. We will refer to a song from the same cluster as Coldplay - Adventure of a Lifetime because songs within the same cluster have similar characteristics. Afterward, we will sort the songs by popularity.

The following are the recommended songs to play after Coldplay - Adventure of a Lifetime:

song_recommendation <- spotify_sample[spotify_sample$cluster==3,] %>% 
  arrange(desc(popularity))

song_recommendation

Conclusion

The conclusions from the above project results are as follows:

  • The clustering analysis using the k-means method successfully grouped the songs in the Spotify dataset into 3 clusters based on specific audio characteristics. Each cluster has its own distinct characteristics, namely “Energetic & Popular Tracks,” “Live & Speechy Tracks,” and “Acoustic & Instrumental Tracks”
  • We can create a song recommender based on recent listening on Spotify by suggesting songs from the same cluster, as songs within the same cluster share similar characteristics.
  • The analysis results offer valuable insights for musicians and the music industry, guiding them in creating more interesting and popular songs on Spotify. Understanding the main characteristics of each cluster and the impact of audio features on popularity enables musicians to adopt targeted music creation strategies to attract listeners effectively.