The aim of the project is to identify similar songs based on their lyrics by applying clustering techniques to textual data. Different methods, including k-means, PAM, spherical k-means and hierarchical clustering (with both euclidean and cosine distance metric) are used and compared. Clustering is performed in two different ways – first, using the Document Term Matrix to compare song lyrics data, and later, the Term Document Matrix to cluster individual words within different songs.
Following R packages were used to perform the analysis:
library(tm) # transforming text data
library(SnowballC) # stemming
library(NbClust) # assesing clustering tendency
library(factoextra) # clustering using k-means and PAM
library(skmeans) # spherical k-means clustering
library(dendextend) # comparing dendrograms
library(ClustGeo) # calculating intertia
library(cluster) # pam
library(gridExtra) # grid graphics
The dataset was downloaded from Kaggle, and includes lyrics data of songs nominated for a Grammy award in the Song of the Year category. Grammys, awarded anually to different artists, are considered as one of the most prestigious awards within the music industry, The highlighted category focuses on the song’s composition, and aims to recognize the songwriter who wrote the track – which, of course, also involves its lyrics. The data used in the project includes lyrics of songs nominated between 2021 and 2024, as well as the song name and artist.
grammys <- read.csv("data.csv")
str(grammys)
## 'data.frame': 28 obs. of 4 variables:
## $ song_name : chr "Flowers" "Anti-Hero" "Kill Bill" "vampire" ...
## $ artist_name: chr "Miley Cyrus" "Taylor Swift" "SZA" "Olivia Rodrigo" ...
## $ year : int 2024 2024 2024 2024 2024 2024 2024 2024 2023 2023 ...
## $ lyrics : chr "We were good, we were gold, Kinda dream that can't be sold, We were right 'til we weren't, Built a home and wat"| __truncated__ "I have this thing where I get older, but just never wiser, Midnights become my afternoons, When my depression w"| __truncated__ "I'm still a fan even though I was salty, Hate to see you with some other broad, know you happy, Hate to see you"| __truncated__ "Hate to give the satisfaction asking how you're doing now, How's the castle built off people you pretend to car"| __truncated__ ...
The first step is to transform the song lyrics data, using methods that allow to structurize the content of lyrics and transform texts into a document term matrix (or term document matrix).
## [1] "Black hole opened in the kitchen, Every clock's a different time, 13"
This includes:
• removing punctuation
punctuation <- removePunctuation(example)
punctuation
## [1] "Black hole opened in the kitchen Every clocks a different time 13"
• removing numbers
numbers <- removeNumbers(example)
numbers
## [1] "Black hole opened in the kitchen, Every clock's a different time, "
• transforming characters to lowercase
lower <- tolower(example)
lower
## [1] "black hole opened in the kitchen, every clock's a different time, 13"
• removing stopwords
stopwords <- removeWords(example, stopwords("en"))
stopwords
## [1] "Black hole opened kitchen, Every clock's different time, 13"
• stemming
stemming <- stemDocument(example)
stemming
## [1] "Black hole open in the kitchen, Everi clock a differ time, 13"
Next, the processed data can be further transformed into a document term matrix (or term document matrix). A document term matrix is used to describe frequency of terms that occur in each document. The columns of the matrix represent the words occurring across all documents, while the rows correspond to the individual documents (songs). Term document matrix can be obtained by transposing the document term matrix.
Creating a document term matrix using tm::DocumentTermMatrix involves both of the above steps, and allows for faster data preprocessing.
dtm <- DocumentTermMatrix(grammys$lyrics,
control = list(removePunctuation = TRUE,
removeNumbers = TRUE,
stopwords = TRUE,
tolower = TRUE,
stemming = TRUE,
language = "eng"))
An additional important factor is the weighing of terms inside the document term matrix. TF-IDF (term frequency - inverse document frequency) method was applied in this project, as it takes into account the number of occurences of a term in the document (TF), as well as the number of documents in which that term appears (IDF). The final value is obtained by multiplying these two terms.
dtm <- weightTfIdf(dtm, normalize = TRUE)
Lastly, to allow for clustering of data, the document term matrix is transformed into a matrix object. Now, the data is ready for further analysis.
dtm_matrix <- as.matrix(dtm) # normalized document term tf-idf matrix
rownames(dtm_matrix) <- grammys$song_name
dtm_matrix[1:5, 125:128]
## Terms
## Docs buildin built bulletproof burn
## Flowers 0 0.01627471 0 0.01922907
## Anti-Hero 0 0.00000000 0 0.00000000
## Kill Bill 0 0.00000000 0 0.00000000
## vampire 0 0.01603180 0 0.00000000
## On My Mama 0 0.00000000 0 0.00000000
In the next step, traditional clustering methods are applied. First, clustering tendency is assessed using Hopkins Statistic.
hs <- get_clust_tendency(dtm_matrix, 2, graph=TRUE, gradient=list(low="red", mid="white", high="blue"), seed = 123)
hs$hopkins_stat
## [1] 0.753398
The value of the Hopkins statistic is 0.75, signifying that the data is clusterable.
Next, the optimal number of clusters is analyzed using silhouette and gap statistic for both k-means and PAM algorithm.
According to the gap statistic, the optimal number of clusters is 1, which is not ideal. However, the silhouette score identified two clusters to be the optimal choice. Because the average silhouette width for k = 3 is close to the score of k = 2, both solutions were applied. Euclidean distance metric is used to cluster the data.
km2 <- eclust(dtm_matrix, "kmeans", hc_metric = "euclidean", k = 2, graph = FALSE)
pam2 <- eclust(dtm_matrix, "pam", hc_metric = "euclidean", k = 2, graph = FALSE)
km3 <- eclust(dtm_matrix, "kmeans", hc_metric = "euclidean", k = 3, graph = FALSE)
pam3 <- eclust(dtm_matrix, "pam", hc_metric = "euclidean", k = 3, graph = FALSE)
Let’s analyze the formed clusters. The clustering of k-means and PAM clustering is identical for both groupings. For k = 2, clusters of size 27 and 1 are created. For k = 3, clusters of size 26, 1, and 1 are created. This suggests that majority of song lyrics are similar in terms of structure, with only one or two data outliers.
## cluster size ave.sil.width
## 1 1 1 0.00
## 2 2 27 0.38
## cluster size ave.sil.width
## 1 1 27 0.38
## 2 2 1 0.00
## cluster size ave.sil.width
## 1 1 1 0.00
## 2 2 1 0.00
## 3 3 26 0.38
## cluster size ave.sil.width
## 1 1 26 0.38
## 2 2 1 0.00
## 3 3 1 0.00
The silhouette scores are non-negative, but equal to 0 for the smallest clusters. The score of 0 indicates that these observations lie between two clusters (for k = 2, that would mean all observations could belong in a single cluster). These results are unsatisfactory, as the clusters do not provide any definite grouping of the data.
Poor results signify that the clustering approach may need to be adjusted. Hornik et al. (2012) suggest spherical k-means as a potential solution for clustering text data. Spherical k-means uses cosine dissimilarity metric to calculate distances between observations, recommended when dealing with textual data. The minimized distance metric can be defined as follows: \[d(x,p)=1−cos(x,p)\] where x is the term weight vector (document), and p is the centroid of the cluster. Following the authors’ approach, clustering is performed using skmeans::skmeans function in R. To allow for comparing previous results, k = 2 and k = 3 is selected.
set.seed(123)
skm2 <- skmeans(dtm_matrix, k = 2, control = list(nruns = 20, verbose = FALSE))
set.seed(123)
skm3 <- skmeans(dtm_matrix, k = 3, control = list(nruns = 20, verbose = FALSE))
## cluster size ave.sil.width
## 1 1 14 0.01
## 2 2 14 0.01
## cluster size ave.sil.width
## 1 1 10 0.00
## 2 2 7 0.01
## 3 3 11 0.03
This time, the clusters are much closer in size. However, both the overall and individual average silhouette width is much lower. The silhouette scores remain close to 0, and for some observations they are even negative (meaning that the observation might have been incorrectly assigned to the cluster). Once again, results are unsatisfactory.
In the next step, hierarchical clustering is performed using two approaches. First, the traditional method with euclidean distance metric is used.
dist1 <- dist(dtm_matrix, method = "euclidean")
hc1 <- hclust(dist1, method = "complete")
plot(hc1)
However, as mentioned in the previous section, cosine dissimilarity might be a better solution for analyzing textual data. Function skmeans::skmeans_xdist allows for calculating cosine cross-distances between the rows of matrices.
dist2 <- as.dist(skmeans_xdist(dtm_matrix))
hc2 <- hclust(dist2, method = "complete")
plot(hc2)
dend1 <- as.dendrogram(hc1)
dend2 <- as.dendrogram(hc2)
tanglegram(dend1, dend2)
dend_list<-dendlist(dend1, dend2)
entanglement(dend_list)
## [1] 0.4892793
The dendograms differ from each other depending on the distance metric used for clustering. The degree of difference (entanglement) between the two dendrograms is close to 0.5. Once again, the data appears to be better-separated when cosine dissimilarity is applied.
Next, a manual analysis of cluster boundaries for k between 2 and 4 is performed.
The size of the additional clusters formed using euclidean distance contains only one observation. In contrast, hierarchical clustering with cosine dissimilarity results in different cluster sizes.
Once again, let’s visualize the clustering and silhouette, this time for hierarchical clustering results.
Results obtained from hierarchical clustering with euclidean distance metric are comparable with k-means and PAM clustering outcome.
## cluster size ave.sil.width
## 1 1 27 0.38
## 2 2 1 0.00
## cluster size ave.sil.width
## 1 1 26 0.02
## 2 2 2 0.03
## cluster size ave.sil.width
## 1 1 26 0.38
## 2 2 1 0.00
## 3 3 1 0.00
## cluster size ave.sil.width
## 1 1 9 0.00
## 2 2 17 0.01
## 3 3 2 0.03
## cluster size ave.sil.width
## 1 1 25 0.23
## 2 2 1 0.00
## 3 3 1 0.00
## 4 4 1 0.00
## cluster size ave.sil.width
## 1 1 9 0.00
## 2 2 4 0.03
## 3 3 13 0.02
## 4 4 2 0.03
When euclidean distance metric is used, silhouette of every additional clusters is equal to 0 - the clusters overlap. When using cosine dissimilarity, silhouette remains very low, and in some cases - even negative. Overall, the results are not satisfying.
The last section of the project focuses on clustering data within the term document matrix. This time, the goal is to clasify relationships between different words used in song lyrics of the dataset. Since term document matrix is a transposed document term matrix, the rows of the data represent different words within song lyrics, while the columns correspond to individual songs in the dataset.
tdm <- TermDocumentMatrix(grammys$lyrics,
control = list(removePunctuation = TRUE,
removeNumbers = TRUE,
stopwords = TRUE,
tolower = TRUE,
stemming = TRUE,
language = "eng"))
tdm <- weightTfIdf(tdm, normalize = TRUE)
tdm_matrix <- as.matrix(tdm)
colnames(tdm_matrix) <- grammys$song_name
tdm_matrix[125:128,1:5]
## Docs
## Terms Flowers Anti-Hero Kill Bill vampire On My Mama
## buildin 0.00000000 0 0 0.0000000 0
## built 0.01627471 0 0 0.0160318 0
## bulletproof 0.00000000 0 0 0.0000000 0
## burn 0.01922907 0 0 0.0000000 0
Again, the clustering tendency is analyzed.
hs2 <- get_clust_tendency(tdm_matrix, 2, graph=TRUE, gradient=list(low="red", mid="white", high="blue"), seed = 123)
hs2$hopkins_stat
## [1] 0.9703382
d2<-get_dist(tdm_matrix, method="euclidean")
fviz_dist(d2, show_labels = FALSE)+ labs(title="Song Lyrics Data")
The value of the Hopkins statistic is very high – 0.97, signifying that the data is clusterable. Blocks of color can be also seen on the clustering tendency plot.
Next, the optimal number of clusters is analyzed, this time for the transposed data.
Based on the gap statistic, the optimal number of clusters is once again 1, however the silhouette score suggests 2 clusters for both k-means and PAM. Thus, k = 2 is selected for clustering with euclidean distance metric.
km22 <-eclust(tdm_matrix, "kmeans", hc_metric="euclidean", k=2, graph = FALSE)
pam22 <-eclust(tdm_matrix, "pam", hc_metric="euclidean", k=2, graph = FALSE)
Let’s analyze the formed clusters. The clustering of k-means and PAM clustering is slightly different. This time, the cluster sizes differ for both PAM and k-means. Once again, however, in both cases, the sizes of the two clusters significantly differ.
## cluster size ave.sil.width
## 1 1 11 0.3
## 2 2 1241 0.8
## cluster size ave.sil.width
## 1 1 1242 0.76
## 2 2 10 0.65
Average silhouette widths, however, are positive for both clusters, regardless of the clustering method. This suggests that the data is likely clusterable, and that the clustering results are meaningful.
Lastly, spherical k-means is applied for k = 2.
## cluster size ave.sil.width
## 1 1 954 0.04
## 2 2 298 0.48
The clusters are more diverse compared to k-means and PAM algorithm results. However, though average silhouette width is higher for both clusters when compared with Document Term Matrix clustering results, it remains low. Thus, clustering with PAM or k-means appears to be a better option.
The goal of the project was to group song lyrics data using different clustering techniques and distance metrics. Clustering of different words within the song lyrics was also performed. Though obtained results are quite unsatisfactory, it would be worth to replicate them on a different textual dataset. Additionally, it would be interesting to use topic modelling algorithms and see if any semantic relationship between the words can be uncovered.