1 Introduction

In this analysis, we will explore how capable clustering and dimension reduction methods are, when dealing with real life music data. The ability of clustering algorithms to find hidden, underlying structure will be tested, as well as the potential of dimension reduction methods in simplifying the visualization and interpretation of that structure. Many different techniques such as PCA, MDS, K-means, PAM (K-medoids) and hierarchical clustering will be used. It is also worth mentioning that this paper does not aim to impose the existence of this structure, simply test if different unsupervised learning methods are able to capture them.

2 Data description and pre-processing

In this paper, we will explore the possibility of correctly grouping songs by artists using only their numerical musical attributes and various unsupervised learning methods of clustering and dimension reduction. Below dataset comes from the kaggle website: https://www.kaggle.com/datasets/abdulszz/spotify-most-streamed-songs and contains information about most streamed spotify songs as of one year ago (end of 2024).

data <- read.csv("Spotify Most Streamed Songs.csv")

This is how the dataset looks after preprocessing:

# in deezer playlists
data$in_deezer_playlists <- as.integer(as.numeric(gsub(",", "", data$in_deezer_playlists)))

#in shazam charts
tmp <- data$in_shazam_charts
tmp[tmp == ""] <- NA
num <- as.numeric(gsub(",", "", tmp))
data$in_shazam_charts <- as.integer(num)
data$in_shazam_charts[is.na(data$in_shazam_charts)] <- 0

# in streams
streams_clean <- suppressWarnings(as.numeric(gsub(",", "", data$streams)))
data$streams <- as.numeric(data$streams)
bad_streams <- which(is.na(streams_clean))
data <- data[-bad_streams, ]

summary(data)

##   track_name        artist.s._name      artist_count   released_year 
##  Length:952         Length:952         Min.   :1.000   Min.   :1930  
##  Class :character   Class :character   1st Qu.:1.000   1st Qu.:2020  
##  Mode  :character   Mode  :character   Median :1.000   Median :2022  
##                                        Mean   :1.557   Mean   :2018  
##                                        3rd Qu.:2.000   3rd Qu.:2022  
##                                        Max.   :8.000   Max.   :2023  
##  released_month    released_day   in_spotify_playlists in_spotify_charts
##  Min.   : 1.000   Min.   : 1.00   Min.   :   31.0      Min.   :  0.00   
##  1st Qu.: 3.000   1st Qu.: 6.00   1st Qu.:  874.5      1st Qu.:  0.00   
##  Median : 6.000   Median :13.00   Median : 2216.5      Median :  3.00   
##  Mean   : 6.039   Mean   :13.94   Mean   : 5202.6      Mean   : 12.02   
##  3rd Qu.: 9.000   3rd Qu.:22.00   3rd Qu.: 5573.8      3rd Qu.: 16.00   
##  Max.   :12.000   Max.   :31.00   Max.   :52898.0      Max.   :147.00   
##     streams          in_apple_playlists in_apple_charts  in_deezer_playlists
##  Min.   :2.762e+03   Min.   :  0.00     Min.   :  0.00   Min.   :    0.0    
##  1st Qu.:1.416e+08   1st Qu.: 13.00     1st Qu.:  7.00   1st Qu.:   13.0    
##  Median :2.905e+08   Median : 34.00     Median : 38.50   Median :   44.0    
##  Mean   :5.141e+08   Mean   : 67.87     Mean   : 51.96   Mean   :  385.5    
##  3rd Qu.:6.739e+08   3rd Qu.: 88.00     3rd Qu.: 87.00   3rd Qu.:  164.2    
##  Max.   :3.704e+09   Max.   :672.00     Max.   :275.00   Max.   :12367.0    
##  in_deezer_charts in_shazam_charts       bpm             key           
##  Min.   : 0.000   Min.   :   0.00   Min.   : 65.00   Length:952        
##  1st Qu.: 0.000   1st Qu.:   0.00   1st Qu.: 99.75   Class :character  
##  Median : 0.000   Median :   2.00   Median :121.00   Mode  :character  
##  Mean   : 2.669   Mean   :  56.91   Mean   :122.55                     
##  3rd Qu.: 2.000   3rd Qu.:  33.25   3rd Qu.:140.25                     
##  Max.   :58.000   Max.   :1451.00   Max.   :206.00                     
##      mode           danceability_.    valence_.        energy_.    
##  Length:952         Min.   :23.00   Min.   : 4.00   Min.   : 9.00  
##  Class :character   1st Qu.:57.00   1st Qu.:32.00   1st Qu.:53.00  
##  Mode  :character   Median :69.00   Median :51.00   Median :66.00  
##                     Mean   :66.98   Mean   :51.41   Mean   :64.27  
##                     3rd Qu.:78.00   3rd Qu.:70.00   3rd Qu.:77.00  
##                     Max.   :96.00   Max.   :97.00   Max.   :97.00  
##  acousticness_.  instrumentalness_.   liveness_.    speechiness_.  
##  Min.   : 0.00   Min.   : 0.000     Min.   : 3.00   Min.   : 2.00  
##  1st Qu.: 6.00   1st Qu.: 0.000     1st Qu.:10.00   1st Qu.: 4.00  
##  Median :18.00   Median : 0.000     Median :12.00   Median : 6.00  
##  Mean   :27.08   Mean   : 1.583     Mean   :18.21   Mean   :10.14  
##  3rd Qu.:43.00   3rd Qu.: 0.000     3rd Qu.:24.00   3rd Qu.:11.00  
##  Max.   :97.00   Max.   :91.000     Max.   :97.00   Max.   :64.00  
##   cover_url        
##  Length:952        
##  Class :character  
##  Mode  :character  
##                    
##                    
##

Now it is time to select only a few songs and variables for our study. We will start by choosing 4 different artists, each representing a different genre:

Bad Bunny \(\Longrightarrow\) reggeton
Kendrick Lamar \(\Longrightarrow\) rap
Morgan Wallen \(\Longrightarrow\) country
Harry Styles \(\Longrightarrow\) pop

It is important to note that, despite representing different genres, all four artists operate within the mainstream sphere. Consequently, their catalogs share underlying pop-inspired production values that result in significant sonic similarities relative to the broader spectrum of music. Even so, they are the best representatives of the current mainstream within their own genres - and as we aim to study trends in the most popular and not niche music this is the best data to use there is. To ensure comparability between clusters, we will select an equal number of songs per artist/genre (10). Different clustering algorithms will be applied to see if musical attributes of songs for each artist are enough for the clustering algorithms to correctly group each of catalogs together. For those algorithms to work properly first we have to isolate only the numerical columns. Additionally, we will split the dataset into streaming performance and musical characteristics as they carry completely different information about the songs. In this study, we will focus on using these eight variables concerning musical attributes of each song:

BPM \(\Rightarrow\) Beats per minute, representing the tempo of the song
danceability \(\Rightarrow\) Suitability of the song for dancing
valence \(\Rightarrow\) Positivity of the song’s musical content
energy \(\Rightarrow\) Perceived energy level of the song
acousticness \(\Rightarrow\) Acoustic sound presence in the song
instrumentalness \(\Rightarrow\) Proportion of instrumental content in the track
liveness \(\Rightarrow\) Presence of live performance elements
speechiness \(\Rightarrow\) Amount of spoken words in the song

# find and filter artists which fit the experiment criteria
freq <- table(data$artist.s._name)
rows <- data$artist.s._name %in% c("Bad Bunny", "Harry Styles", "Kendrick Lamar", "Morgan Wallen")
data <- data[rows,]

data_pc <- data %>% group_by(artist.s._name) %>% slice_head(n = 10) %>% ungroup()
data_test <- anti_join(data, data_pc)

# save labels to different frames and remove non-numeric columns from data
# for experiment data
song_name_pc <- data_pc[, 1]
artist_name_pc <- data_pc[, 2]
data_performance_pc <- data_pc[, c(7:14)]
data_pc <- data_pc[, c(15, 18:(length(data) - 1))]
# and for test data
song_name_test <- data_test[, 1]
artist_name_test <- data_test[, 2]
data_performance_test <- data_test[, c(7:14)]
data_test <- data_test[, c(15, 18:(length(data) - 1))]

3 Standardization

As the first step, before moving on to clustering or dimension reduction methods, we should normalize the data and remove variables with no standard deviation, as they do not hold any real information. Although both methods are necessary only for PCA and not other clustering or Dim-red methods, it is still good practice to apply them before analysis. Removing variables with no additional information prevents numerical issues in the PCA algorithm while calculating eigenvalues and eigenvectors. Additionally normalization (or standardization) results in all variables getting rescaled to smaller, similar ranges, which for PCA means similar importance in the new model.

Normalization: \[x_{normalized} = \frac{x - x_{min}}{x_{max} - x_{min}}\]

Standardization: \[x_{standardized} = \frac{x - mean}{stdev}\]

For this paper we are choosing standardization.

# set to matrix
data_pc <- as.matrix(data_pc)
data_test <- as.matrix(data_test)

# remove no sd
# dim(data_pc)
x1 <- apply(data_pc, 2, sd)
x2 <- which(x1 == 0)
if(length(x2) == 0) {
  print("In train data no variables to remove.")
} else {
  data_pc <- data_pc[, -x2, drop = FALSE]
  data_test <- data_test[, -x2, drop = FALSE]
}

## [1] "In train data no variables to remove."

# standardize
# data_pc.n <- scale(data_pc)
# train_center <- attr(data_pc.n, "scaled:center")
# train_scale <- attr(data_pc.n, "scaled:scale")
# data_test.n <- scale(data_test, center = train_center, scale = train_scale)

preproc <- preProcess(data_pc, method = c("center", "scale"))
data_pc.n <- predict(preproc, data_pc)
data_test.n  <- predict(preproc, data_test)

Experiment data:

bpm	danceability_.	valence_.	energy_.	acousticness_.	instrumentalness_.	liveness_.	speechiness_.
0.7245889	0.0678394	-1.2216955	0.8218201	-0.7060723	6.1379980	-0.5304852	-0.3178175
-0.4912746	0.0678394	-1.4055819	0.3856967	-0.8525983	-0.1753714	-0.3607299	1.4847293
-0.7870251	1.1679373	-1.2216955	-0.9226737	-0.7060723	-0.1753714	-0.9548733	-0.4126883
0.0016431	1.7546562	-0.3022633	-0.8136428	-0.9624929	-0.1753714	-0.2758523	-0.4126883
-0.7213028	1.1679373	-0.9458658	0.1131195	-0.1565997	-0.1753714	-0.4456075	-0.6024301
1.9075912	-0.0055005	-0.2562917	0.0040887	-0.1932312	-0.1753714	-0.2758523	0.2514078

Test data:

bpm	danceability_.	valence_.	energy_.	acousticness_.	instrumentalness_.	liveness_.	speechiness_.
0.1002266	-0.8122390	-0.8079510	0.0586041	0.1730839	-0.1753714	-0.7002404	-0.6024301
0.6588666	1.6813163	1.9963171	-0.3230039	-0.1932312	-0.1753714	-0.0212194	-0.4126883
-1.2142204	0.5078785	-0.3022633	0.0040887	-0.3763887	-0.1753714	-0.7002404	-0.4126883
0.2645325	1.3146170	0.0195379	0.1131195	-0.7793353	-0.1753714	-0.3607299	-0.4126883
0.9874783	1.4612968	1.1228565	0.2221504	-0.4496518	-0.1753714	0.0636582	-0.3178175
-0.0312181	1.6079765	0.8010553	0.0040887	0.3196099	-0.1753714	1.5065779	-0.2229466

4 Main methods

4.1 First clustering

Let us start by just trying to use NbClust function, which automatically finds the optimal number of clusters:

set.seed(123)
c3<-NbClust(data_pc.n, distance="euclidean", method="complete", index="ch")
c3$Best.nc

## Number_clusters     Value_Index 
##         11.0000         11.6441

c3$Best.partition

##  [1]  1  2  2  3  2  4  5  5  6  2  7  7  5  5  8  4  2  8  3  4  3  3  6  2  3
## [26]  9  9  8 10  2  4 11  7  7  4  5  5  7  4  7

Well, not exactly what we aimed to discover. Finding 11 clusters suggests that there is enough variation to split single artists into multiple clusters. While this is a good observation about the nature and complexity of music, it is not exactly what we were hoping to achieve. We can now try mapping these clusters with the artists’ names, looking if the excess clusters are just divisions within each artists catalog, maybe changes in styles or eras of their discography, or has the algorithm completely dismissed artist boundaries while grouping:

first_result<- data.frame(cluster = c3$Best.partition, artist = artist_name_pc)

unique_counts <- first_result %>% group_by(cluster) %>%
  summarise(
    unique = n_distinct(artist.s._name),
    total = n()
  )

kable(unique_counts)

cluster	unique	total
1	1	1
2	3	7
3	3	5
4	3	6
5	3	6
6	2	2
7	2	6
8	2	3
9	1	2
10	1	1
11	1	1

Again, not exactly the result we were hoping for, many of the clusters include more than one artist.

4.2 Clustering tests

We can now try calculating silhouette to check for the optimal number of clusters. The overall Silhouette Score for the clustering solution is the mean of the silhouette coefficients for all \(N\) data points:

\[S = \frac{1}{N} \sum_{i=1}^{N} s(i)\] where \(s(i)\) is Silhouette for one data point:

\[s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}\]

where \(a(i)\) represents the average distance from point \(i\) to other points within its own cluster, and \(b(i)\) is the smallest average distance from \(i\) to points in a different cluster.

opt<-Optimal_Clusters_KMeans(data_pc.n, max_clusters=11, plot_clusters = TRUE, criterion="silhouette")

The silhouettes reveal that, contrary to our expectations, that \(n=4\) produces by far the lowest score. Additionally, no amount of clusters surpasses a score of \(0.2\), indicating poor separability in the raw data. We can try to further verify this by checking the Hopkins Statistic and visualizing the distance matrix.

set.seed(123)
verification_statistics <- get_clust_tendency(data_pc.n, 2, graph=TRUE, gradient=list(low="red", mid="white", high="blue"), seed = 123)

The Hopkins Statistic for our data equals 0.793. This is great news, the closer it is to one, the better chances for the data to be clusterable, not just random noise. We can once again further verify this by plotting the visual dissimilarity matrix.

d <- get_dist(data_pc.n, method = "euclidean")
fviz_dist(d, show_labels = FALSE) + labs(title = "Distance Matrix of Data Before PCA")

Here we can more clearly see what the structure of data really is. Blue values represent point positioned far apart from each other (solid blue stripes represent outliers). The red diagonal represents distances between points and themselves (distance always zero - solid red). Even though there is a presence of outliers, we can still see some reddish squares along the diagonal, they suggest there is similarity inside the data.

4.3 PCA

Let us try reducing the dimensions of our data with PCA (Principal component analysis), this may help with the clustering by simplifying the numerical operations. PCA is used for exploratory data analysis, it reduces the number of variables by calculating eigenvectors and eigenvalues of the data matrix. Eigenvectors serve as the new axes and eigenvalues represent the significance of each new variable (principal component).

data_pc.pca <- prcomp(data_pc.n, scale = FALSE)
kable(head(round(data_pc.pca$x, 2)))

PC1	PC2	PC3	PC4	PC5	PC6	PC7	PC8
-0.66	3.39	-0.78	-4.96	1.79	0.95	0.17	-0.09
0.24	-0.39	-1.39	-0.90	-1.33	-0.83	0.25	-0.11
0.62	1.02	-1.74	0.66	-0.05	-0.74	-0.47	-0.53
-0.32	1.00	-1.25	1.02	0.45	-0.76	0.45	-0.60
0.03	0.72	-1.29	0.52	0.09	-0.61	-0.57	0.50
0.02	0.92	0.83	0.02	-0.86	-0.20	1.26	-0.04

summary(data_pc.pca)

## Importance of components:
##                           PC1    PC2    PC3    PC4     PC5    PC6     PC7
## Standard deviation     1.4883 1.2381 1.1731 1.0414 0.77952 0.7736 0.66040
## Proportion of Variance 0.2769 0.1916 0.1720 0.1356 0.07596 0.0748 0.05452
## Cumulative Proportion  0.2769 0.4685 0.6405 0.7761 0.85205 0.9268 0.98137
##                            PC8
## Standard deviation     0.38609
## Proportion of Variance 0.01863
## Cumulative Proportion  1.00000

# we predict PCA values for test data based on train data linear transformation
data_test.pca <- predict(data_pc.pca, newdata = data_test.n)

Rule of thumb concerning PCA is to leave only principal components with variation larger than one. Following this logic, in further analysis we will limit the dataset to the first 4 principal components carrying 78% of the original datasets variance.

data_pc.pca_reduced <- data_pc.pca$x[, 1:4]
data_test.pca_reduced <- data_test.pca[, 1:4]
fviz_pca_var(data_pc.pca, col.var="wheat4")

On the above graph we can see the share of each variable in the first and second principal component. For example, we can see that high values of PC1 will mean high values of acousticness but low values of energy. Additionally we can plot the feature contribution for each of the principal components. Let us see a couple examples:

for (i in 3:4) {
  p <- fviz_contrib(data_pc.pca, 
                    choice = "var", 
                    axes = i, 
                    title = paste("Contribution of variables to Dim", i))
  print(p)
}

We can also try plotting all our observations based on their values of the first two principal components:

Once again we can see some possibility of proper clustering. It is important to keep in mind, that there are two additional components not included in the graph, which will be included in further analysis. With that established, let us use the K-means algorithm and impose 4 clusters to avoid getting lost in different nuances of each artists discography.

4.4 K-means

The K-means algorithm aims to partition \(n\) observations into \(k\) clusters in which each observation belongs to the cluster with the nearest mean (centroid). The objective is to minimize the Within-Cluster Sum of Squares (WCSS):

\[argminJ = \sum_{j=1}^{k} \sum_{x_i \in C_j} ||x_i - \mu_j||^2\]

Where:

\(J\) is the objective function (sum of squared errors).
\(k\) is the number of clusters.
\(C_j\) is the set of points belonging to cluster \(j\).
\(x_i\) is a data point in cluster \(C_j\).
\(\mu_j\) is the centroid (mean) of cluster \(C_j\).

set.seed(123)
kmeans_result <- eclust(data_pc.pca_reduced, "kmeans", hc_metric="euclidean", k=4)

second_result<- table(cluster = kmeans_result$cluster, artist = artist_name_pc[[1]])
kable(second_result)

Bad Bunny	Harry Styles	Kendrick Lamar	Morgan Wallen
2	4	0	4
0	2	5	0
3	3	0	6
5	1	5	0

Again, no luck, the divide is not as we would like it. Let us try a different approach. An algorithm similar to K-means - K-medoids (PAM).

4.5 K-medoids (PAM)

The two algorithms, K-means and PAM, are similar in nature, but K-medoids instead of choosing an arbitrary point in space as the center of each cluster chooses one of the data points. This approach might help us, as such solution, for example, makes the results a bit more resistant to outliers.

pam_result <- pam(data_pc.pca_reduced, k = 4)
fviz_cluster(pam_result)

pam_table <- table(cluster = pam_result$clustering, artist = artist_name_pc[[1]])
kable(pam_table)

Bad Bunny	Harry Styles	Kendrick Lamar	Morgan Wallen
7	2	4	0
1	5	3	3
1	3	0	7
1	0	3	0

A bit better. This time we could actually try naming the clusters, the first one could be the Bad Bunny - “reggaeton” cluster, number three the Morgan Wallen - “country” cluster and number four the Kendrick Lamar - “rap” cluster. Although they are not clearly separated, there are still songs breaking the barriers of this clustering. This might just be the truth about popular music in the 2020s, that most hits are musically very alike, especially cluster two consists of all four artists. We can try verifying this by seeing where will the already trained algorithm assign a few new songs by the same artists.

pam_train_kcca <- as.kcca(pam_result, data_pc.pca_reduced)
pam_pred <- predict(pam_train_kcca, data_test.pca_reduced)
pam_table_test <- table(cluster = pam_pred, artist = artist_name_test)
kable(pam_table_test)

Bad Bunny	Harry Styles	Kendrick Lamar	Morgan Wallen
3	1	0	0
1	3	1	1
4	3	0	0
1	0	1	0

This is probably time use a more scientific test than the eye test to properly determine if the clustering was successful. Let us combine the train and test data to get the full picture of the analysis and calculate the rand index. It is a measure used to assess the similarity between two data clusterings. It operates by comparing all pairs of observations between two periods (\(t_0\) and \(t_1\)) to determine consistency in their grouping.

For any given pair of observations, there are four possible outcomes based on whether they belong to the same cluster or different clusters in the two periods:

\(a\): The number of pairs that are in the same cluster in \(t_0\) and remain in the same cluster in \(t_1\).
\(b\): The number of pairs that are in different clusters in \(t_0\) and remain in different clusters in \(t_1\).
\(c\): The number of pairs that are in the same cluster in \(t_0\) but move to different clusters in \(t_1\).
\(d\): The number of pairs that are in different clusters in \(t_0\) but move to the same cluster in \(t_1\).

The Rand Index (\(RI\)) is calculated as the ratio of consistent pairs (agreements) to total pairs:

\[RI = \frac{a + b}{a + b + c + d}\]

\(RI = 1\): Indicates that the partitions are identical (perfect agreement). In this scenario, \(c\) and \(d\) are null (0).
\(RI = 0\): Indicates that the partitions do not agree on any pair of points.

The Rand Index specifically evaluates pairs of points and is insensitive to relabelling (i.e., changing the names of the clusters does not affect the score).

result_together <- c(pam_result$clustering, pam_pred)
artist_together <- c(artist_name_pc[[1]], artist_name_test)
table_together <- table(cluster = result_together, artist = artist_together)
kable(table_together)

Bad Bunny	Harry Styles	Kendrick Lamar	Morgan Wallen
10	3	4	0
2	8	4	4
5	6	0	7
2	0	4	0

Well, unfortunately the rand index of 0.103 means the clustering is almost as bad as random guessing. This time we have got no other choice, than to accept the fact that to a computer looking strictly at statistics like “danceability” or “instrumentalness” there is almost no difference between the most popular acts. Whether the cause of this is the inability of computers to distinguish between artists (or genres) in popular music, or the fact that the Spotify created indicators of “audio features” might be lackluster in truly reflecting the songs nature, we were unable to recreate the proper divide between the songs. Nevertheless, what we have managed to achieve is to split the songs into categories, cluster one seems to represent more up-tempo music, cluster two and three focus more on melody and cluster four seems to represent the outliers, songs that might be different to the English speaking mainstream.

5 Different approach

5.1 MDS

As a last ditch effort we can try running two completely different methods on the pre-reduced data. MDS and hierarchical clustering, although they are completely different from each other, they both give as an opportunity to approach this task again, but from a different perspective. Additionally, in doing so we will have the chance to check, if the disappointing results did not come as a result of losing 22% of information during PCA. Let us begin by trying MDS. Main goal of MDS is to represent relations of any data in two dimensions (x,y). It reveals the dataset structure by plotting the relative distance of units. We assume n observations with k characteristics, one matrix calculates distances between all possible pairs – to get nxn matrix, then another one assumes artificial coordinates (x,y) in two dimensions and calculates the distances. Finally follows optimization (relocation) of artificial coordinates (x,y) - in two dimensions - in order to replicate at best real matrix of distances in n dimensions:

dist_mat <- dist(data_pc.n)
fit_data_pc <- mds(dist_mat, ndim=2, type="ratio")

mds_data <- as.data.frame(fit_data_pc$conf)
colnames(mds_data) <- c("Dim1", "Dim2")
mds_data$Artist <- artist_name_pc[[1]]

ggplot(mds_data, aes(x = Dim1, y = Dim2, color = Artist)) +
  geom_point() +
  geom_text(aes(label=Artist), size=3, vjust=1.5, check_overlap = TRUE) +
  ggtitle(paste("Metric MDS (Stress =", round(fit_data_pc$stress, 3), ")"))

Here again we can see that the separation is not really clear. Kendrick Lamar and Morgan Wallen, country and rap, do not seem to have much in common, but the presence of Harry Styles and Bad Bunny completely overshadows this divide, as they seem to dabble in both those styles. Additionally, the stress is >0.2, meaning the goodness of fit our model to the real data is quite poor. Our final option remains - hierarchical clustering.

5.2 Hierarchical clustering

Hierarchical clusteringis based on a matrix of dissimilarities, which is equivalent to a matrix of distances between points.The algorithm works iteratively, starting from the state in which each observation is its own cluster. In the next steps, the two most similar clusters are combined into one, until the state when one cluster is created.As a result we get the classification tree, which can be cut at the selected height, leaving us with a certain number of clusters:

hc <- hclust(dist_mat, method="complete")
plot(hc, labels=artist_name_pc[[1]])
rect.hclust(hc, k=4, border='red')

Unfortunately for the final time we do not manage to achieve the hoped result. One of the clusters turns out to be just an outlier song by itself and the rest consist of multiple different artists.

6 Conclusion

We did not manage to group the artists using UL methods. This may be a result of the spotify indicators not being good enough for scientific research, the computer not being able to differentiate between popular music genres or maybe it is simply the fact that popular music nowadays really is just the same formula used many times. In any case this is not a complete failure, as we managed to find out that the structure of modern pop records might be too complicated for clustering and dimension reduction algorithms to describe.

7 Bibliography

Class notes
https://en.wikipedia.org/wiki/K-means_clustering

Music analysis with clustering and dimension reduction methods

Antoni Rosiecki

2025-11-16