1. Loading and Inspecting Data

my_data <- read.csv("my_data.csv", header = TRUE)

library(kableExtra)

selected_vars <- c("name", "album", "danceability", "energy", "tempo", "popularity")
selected_data <- my_data[selected_vars]

row_colors <- c("#3D3D3D", "#5A5A5A", "#787878", "#A0A0A0", "#CFCFCF")  

kable(head(selected_data, 5), caption = "**Table 1: Dataset Preview Visualization Inspired by Reputation Album Cover**") %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover", "condensed")) %>%
  row_spec(0, bold = TRUE, background = "#000000", color = "#FFFFFF") %>%  # Header: Black with white text
  row_spec(1, background = row_colors[1], color = "#FFFFFF") %>%  # Dark grey, white text
  row_spec(2, background = row_colors[2], color = "#FFFFFF") %>%  # Medium grey, white text
  row_spec(3, background = row_colors[3], color = "#FFFFFF") %>%  # Lighter grey, white text
  row_spec(4, background = row_colors[4], color = "#000000") %>%  # Light grey, black text
  row_spec(5, background = row_colors[5], color = "#000000") %>%  # Very light grey, black text
  column_spec(1, bold = TRUE)  # Bold song names
Table 1: Dataset Preview Visualization Inspired by Reputation Album Cover

| name | album | danceability | energy | tempo | popularity |
|------|-------|--------------|--------|-------|------------|
| Fortnight (feat. Post Malone) | THE TORTURED POETS DEPARTMENT: THE ANTHOLOGY | 0.504 | 0.386 | 192.004 | 82 |
| The Tortured Poets Department | THE TORTURED POETS DEPARTMENT: THE ANTHOLOGY | 0.604 | 0.428 | 110.259 | 79 |
| My Boy Only Breaks His Favorite Toys | THE TORTURED POETS DEPARTMENT: THE ANTHOLOGY | 0.596 | 0.563 | 97.073 | 80 |
| Down Bad | THE TORTURED POETS DEPARTMENT: THE ANTHOLOGY | 0.541 | 0.366 | 159.707 | 82 |
| So Long, London | THE TORTURED POETS DEPARTMENT: THE ANTHOLOGY | 0.423 | 0.533 | 160.218 | 80 |

2. Dataset Overview: Taylor Swift’s Discography Analysis 🎶

📌 Unit of Observation

A single row in the dataset represents one Taylor Swift song.

📊 Sample Size

The dataset includes 582 entries, covering Taylor Swift’s entire discography available on Spotify.
It includes both original tracks and duplicates (like remixes or bonus tracks).
Tracks are later filtered on their audio features (removing outliers) before clustering.


🔹 Features Included in the Dataset

Feature Description:

  • 🎵 Name → Title of the song
  • 💿 Album → The album the song belongs to
  • 📅 Release Date → The date the song was released
  • 🔢 Track Number → The order the song appears on the album
  • 🆔 ID → The Spotify ID for the song
  • 🔗 URI → The Spotify URI for the song

Audio Features (Used for Clustering)

  • 🎶 Danceability → Suitability for dancing (scale 0 to 1)
  • Energy → Intensity & activity level (scale 0 to 1)
  • 🎹 Instrumentalness → Likelihood that the track contains no vocals (scale 0 to 1)
  • 🔊 Loudness → Overall loudness (in decibels)
  • 🗣 Speechiness → Presence of spoken words (scale 0 to 1)
  • 🎭 Valence → Musical positivity (scale 0 to 1)
  • 🎸 Liveness → Probability that the track was recorded live (scale 0 to 1)
  • 🕒 Tempo → Beats per minute (BPM)

Additional Features

  • Popularity → Spotify popularity score (0 to 100)
  • Duration (ms) → The duration of the track in milliseconds
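
As a quick sanity check on these scales, the feature ranges can be inspected directly. A minimal sketch, assuming my_data has the columns named above:

# Ranges of the main audio features: 0-1 scales, dB for loudness,
# BPM for tempo, 0-100 for popularity
summary(my_data[, c("danceability", "energy", "loudness", "valence", "tempo", "popularity")])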

📊 Variables Selected for Analysis

For the clustering analysis I have chosen the following variables:

  • 🎶 Danceability
  • Energy
  • 🔊 Loudness
  • 🎭 Valence
  • 🕒 Tempo

📍 Data Source

📌 The dataset was obtained from Spotify’s API via Kaggle.
Spotify’s algorithm analyzes audio files to compute metrics like danceability, valence, and tempo.

📂 Dataset Link:
🔗 Taylor Swift Spotify Dataset on Kaggle


🎯 Project Goal

This project aims to group Taylor Swift’s songs into two distinct musical styles based on clustering analysis:

  • 🎵 High-Energy Hits → Upbeat, danceable songs with high tempo & energy
  • 🎶 Mellow Tunes → Slower, calmer songs with lower tempo & energy


🔍 Research Question

Can we identify the “perfect playlist” clusters based on audio features? 🎶


🗂️ Why Clustering?

I chose to cluster this data set because clustering is a suitable method for grouping songs with similar musical characteristics into meaningful categories, which here could represent different musical styles or moods in Taylor Swift’s discography. As I love Taylor Swift, I already had some idea of what these clusters might look like, so I decided to give it a go and see whether I would be right.

3. Data Cleaning and Standardization

# Variables to standardize (z-scores) for the cluster analysis
cluster_vars <- c("danceability", "energy", "loudness", "valence", "tempo", "popularity")

my_data_std <- as.data.frame(scale(my_data[cluster_vars]))
rownames(my_data_std) <- make.names(my_data$name, unique = TRUE)  # song titles as row names

header_color <- "#7D1212"  # Rich dark red
bg_color <- "saddlebrown"  # Background brownish
text_color <- "#F0DA9E"    # Soft yellow text

kable(head(my_data_std, 5), caption = "**Table 2: Summary Statistics Inspired by Evermore Album Cover**") %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover", "condensed")) %>%
  row_spec(0, background = header_color, color = "white", bold = TRUE) %>%  # Header styling
  row_spec(1:5, background = bg_color, color = text_color) %>%  # Body styling
  column_spec(1, bold = TRUE, color = text_color) %>%  # First column bold
  column_spec(2:(ncol(my_data_std) + 1), color = text_color) %>%  # Apply text color to the remaining columns (+1 for the row-name column)
  kable_styling(latex_options = "hold_position")  # Keep table in place
Table 2: Summary Statistics Inspired by Evermore Album Cover

| | danceability | energy | loudness | valence | tempo | popularity |
|---|---|---|---|---|---|---|
| Fortnight..feat..Post.Malone. | -0.6704659 | -0.9410258 | -1.1409327 | -0.5617147 | 2.2890008 | 1.494665 |
| The.Tortured.Poets.Department | 0.2024897 | -0.7212475 | -0.2681951 | -0.5055433 | -0.3992292 | 1.308936 |
| My.Boy.Only.Breaks.His.Favorite.Toys | 0.1326532 | -0.0148173 | 0.1032778 | 0.4595848 | -0.8328581 | 1.370846 |
| Down.Bad | -0.3474723 | -1.0456821 | -0.9467615 | -1.1387490 | 1.2268959 | 1.494665 |
| So.Long..London | -1.3775599 | -0.1718018 | -1.2827741 | -0.7302291 | 1.2437005 | 1.370846 |
library(dplyr)  # %>%, arrange(), select()

# Dissimilarity = squared Euclidean distance of each song from the "average song"
# in standardized feature space; large values flag potential outliers
my_data$dissimilarity <- rowSums(my_data_std^2)

outliers_df <- my_data %>%
  arrange(desc(dissimilarity)) %>%
  select(name, dissimilarity) %>%
  head(10)

kable(outliers_df, format = "html", escape = FALSE, 
      caption = "**Table 3: Outlier Detection Inspired by Red (Taylor's Version) Album Cover**") %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover", "condensed")) %>%
  column_spec(1, color = "red4", background = "burlywood2", bold = TRUE) %>%
  column_spec(2, color = "red4", background = "burlywood2") %>%
  row_spec(0, background = "red4", color = "white", bold = TRUE) %>%
  row_spec(1:nrow(outliers_df), background = "burlywood2", color = "red4") %>%
  htmltools::HTML()
Table 3: Outlier Detection Inspired by Red (Taylor's Version) Album Cover

| name | dissimilarity |
|------|---------------|
| I Know Places - Voice Memo | 34.15783 |
| I Wish You Would - Voice Memo | 27.81634 |
| Blank Space - Voice Memo | 23.95338 |
| epiphany | 22.42378 |
| State Of Grace (Acoustic Version) (Taylor's Version) | 21.34256 |
| Change - Live From Clear Channel Stripped 2008 | 21.30247 |
| Sweet Nothing | 19.03541 |
| Sweet Nothing | 18.57713 |
| Sweet Nothing | 18.57713 |
| State Of Grace - Acoustic | 18.40038 |
my_data_clean <- my_data %>%
  filter(!name %in% outliers_df$name)

my_data_std_clean <- as.data.frame(scale(my_data_clean[, c("danceability", "energy", "loudness", "valence", "tempo")]))

set.seed(123)
kmeans_result_clean <- kmeans(my_data_std_clean, centers = 2, nstart = 25)

my_data_clean$cluster <- as.factor(kmeans_result_clean$cluster)
library(factoextra)  # get_dist(), fviz_dist(), fviz_nbclust(), fviz_cluster(), get_clust_tendency()

cluster_data <- my_data[, c("danceability", "energy", "loudness", "valence", "tempo")]
cluster_data_std <- scale(cluster_data)
Distance <- get_dist(cluster_data_std, method = "euclidean")

fviz_dist(
  Distance,
  gradient = list(
    low = "#1C1C1C",  # Dark folklore aesthetic
    mid = "#A8B5A2",  # Muted greenish-gray
    high = "#F8F8F8"  # Soft light gray
  )
) +
  labs(title = "Table 4: Distance Matrix Visualization\nInspired by Folklore Album Cover") +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold", size = 14, color = "#5D5D5D")
  )

This graph suggests around 2 to 3 clusters: the darker block formations mark groups of similar songs, while the lighter areas show songs that differ markedly from the rest.
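
To put numbers behind the heat map, the underlying pairwise distances can be inspected directly; a small illustrative peek at the first few songs, using the Distance object computed above:

# Pairwise Euclidean distances (on the standardized features) between the first five songs
round(as.matrix(Distance)[1:5, 1:5], 2)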

graph <- FALSE
hopkins_stat <- get_clust_tendency(my_data_std, n = nrow(my_data_std) - 1, graph = graph)
hopkins_stat$hopkins_stat
## [1] 0.7688471

The Hopkins statistic is above 0.75, which indicates a strong clustering tendency, so clustering is a viable approach and we can continue.
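
For context, uniformly random data would give a Hopkins value of roughly 0.5 under the convention used here (values close to 1 indicate a strong clustering tendency). A minimal illustrative sketch of that comparison, generating a random reference set with the same ranges as the standardized data (not part of the original analysis):

# Hopkins statistic for a uniformly random data set of the same size and ranges;
# this should land near 0.5, well below the ~0.77 observed for the real songs
set.seed(123)
random_ref <- as.data.frame(lapply(my_data_std, function(x) runif(length(x), min(x), max(x))))
get_clust_tendency(random_ref, n = nrow(random_ref) - 1, graph = FALSE)$hopkins_stat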

speak_now_bg <- "#5D3B79"   # Dark purple background (Speak Now TV)
speak_now_text <- "#E0B0FF" # Light lavender for text
speak_now_line <- "orchid1" # Lilac for the graph line

fearless_bg <- "#F5DEB3"   # Warm golden background (Fearless TV)
fearless_text <- "#8B4513" # Deep brown for text
fearless_line <- "#D4A017" # Golden yellow for the graph line

fviz_nbclust(my_data_std, kmeans, method = "wss", linecolor = speak_now_line) +
  labs(title = "Optimal Number of Clusters - Elbow Method\n(Inspired by Speak Now TV Album Cover)") +
  theme_minimal(base_size = 14) +
  theme(
    plot.background = element_rect(fill = speak_now_bg, color = NA),
    panel.background = element_rect(fill = speak_now_bg, color = NA),
    panel.grid.major = element_line(color = "#D8BFD8"), # Soft lavender grid
    panel.grid.minor = element_line(color = "#C0A8E0"),
    plot.title = element_text(hjust = 0.5, face = "bold", color = speak_now_text, size = 16),
    axis.title = element_text(color = speak_now_text, face = "bold"),
    axis.text = element_text(color = speak_now_text),
    legend.text = element_text(color = speak_now_text),
    legend.title = element_text(color = speak_now_text)
  )

fviz_nbclust(my_data_std, kmeans, method = "silhouette", linecolor = fearless_line) +
  labs(title = "Optimal Number of Clusters - Silhouette Method\n(Inspired by Fearless TV Album Cover)") +
  theme_minimal(base_size = 14) +
  theme(
    plot.background = element_rect(fill = fearless_bg, color = NA),
    panel.background = element_rect(fill = fearless_bg, color = NA),
    panel.grid.major = element_line(color = "#EED9C4"), # Soft gold grid
    panel.grid.minor = element_line(color = "#D4A017"),
    plot.title = element_text(hjust = 0.5, face = "bold", color = fearless_text, size = 16),
    axis.title = element_text(color = fearless_text, face = "bold"),
    axis.text = element_text(color = fearless_text),
    legend.text = element_text(color = fearless_text),
    legend.title = element_text(color = fearless_text)
  )

The elbow method shows a clear inflection point at k = 2, which suggests using k = 2 for clustering. I used silhouette analysis to confirm the number of clusters, and it also pointed to k = 2.
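
For transparency, the quantity behind the elbow plot can also be computed by hand. A minimal sketch of the total within-cluster sum of squares for k = 1 to 10, which is essentially what fviz_nbclust(..., method = "wss") plots:

# Total within-cluster sum of squares for k = 1..10; the "elbow" is where adding
# another cluster stops reducing this value substantially
set.seed(123)
wss <- sapply(1:10, function(k) kmeans(my_data_std, centers = k, nstart = 25)$tot.withinss)
round(wss, 1)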

library(NbClust)  # NbClust()

nbclust_result <- NbClust(my_data_std,
                          distance = "euclidean",
                          min.nc = 2, max.nc = 10,
                          method = "kmeans",
                          index = "all")

## *** : The Hubert index is a graphical method of determining the number of clusters.
##                 In the plot of Hubert index, we seek a significant knee that corresponds to a 
##                 significant increase of the value of the measure i.e the significant peak in Hubert
##                 index second differences plot. 
## 

## *** : The D index is a graphical method of determining the number of clusters. 
##                 In the plot of D index, we seek a significant knee (the significant peak in Dindex
##                 second differences plot) that corresponds to a significant increase of the value of
##                 the measure. 
##  
## ******************************************************************* 
## * Among all indices:                                                
## * 6 proposed 2 as the best number of clusters 
## * 3 proposed 3 as the best number of clusters 
## * 2 proposed 4 as the best number of clusters 
## * 2 proposed 5 as the best number of clusters 
## * 3 proposed 6 as the best number of clusters 
## * 1 proposed 8 as the best number of clusters 
## * 2 proposed 9 as the best number of clusters 
## * 4 proposed 10 as the best number of clusters 
## 
##                    ***** Conclusion *****                            
##  
## * According to the majority rule, the best number of clusters is  2 
##  
##  
## *******************************************************************
nbclust_result$Best.nc
##                     KL       CH Hartigan     CCC    Scott      Marriot   TrCovW
## Number_clusters 10.000   2.0000   6.0000 10.0000   3.0000 8.000000e+00     4.00
## Value_Index      4.538 205.1734  23.4333 -1.8498 388.6529 1.080425e+15 54200.08
##                  TraceW Friedman   Rubin Cindex     DB Silhouette  Duda
## Number_clusters  6.0000   3.0000  6.0000 5.0000 9.0000     2.0000 2.000
## Value_Index     74.8294   1.2948 -0.0631 0.3345 1.4547     0.2338 1.177
##                 PseudoT2   Beale Ratkowsky     Ball PtBiserial Frey McClain
## Number_clusters   2.0000  2.0000    4.0000   3.0000     5.0000    1  2.0000
## Value_Index     -56.0839 -0.5765    0.3186 523.0157     0.4224   NA  0.7579
##                   Dunn Hubert SDindex Dindex    SDbw
## Number_clusters 9.0000      0 10.0000      0 10.0000
## Value_Index     0.0714      0  1.2167      0  0.4122

NbClust’s majority rule once again confirms that the best number of clusters is 2 (k = 2), so we proceed with that choice.
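
The majority rule can be read directly off the Best.nc matrix shown above; a one-line tally of the votes (the first row of Best.nc holds each index's proposed number of clusters):

# How many indices voted for each candidate number of clusters
table(nbclust_result$Best.nc["Number_clusters", ])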

set.seed(123)
kmeans_result <- kmeans(my_data_std, centers = 2, nstart = 25)

my_data_std$cluster <- as.factor(kmeans_result$cluster)

4. Visualize Clusters

str(my_data_std)
## 'data.frame':    582 obs. of  7 variables:
##  $ danceability: num  -0.67 0.202 0.133 -0.347 -1.378 ...
##  $ energy      : num  -0.941 -0.7212 -0.0148 -1.0457 -0.1718 ...
##  $ loudness    : num  -1.141 -0.268 0.103 -0.947 -1.283 ...
##  $ valence     : num  -0.562 -0.506 0.46 -1.139 -0.73 ...
##  $ tempo       : num  2.289 -0.399 -0.833 1.227 1.244 ...
##  $ popularity  : num  1.49 1.31 1.37 1.49 1.37 ...
##  $ cluster     : Factor w/ 2 levels "1","2": 1 1 1 1 1 2 1 2 1 1 ...
# Keep only the numeric columns (this drops the cluster factor added above)
my_data_std <- my_data_std[, sapply(my_data_std, is.numeric)]

fviz_cluster(kmeans_result, data = my_data_std,
             geom = "point", ellipse.type = "convex",
             palette = c("#FFB6C1", "#ADD8E6"), # Lover colors
             ggtheme = theme_minimal())

Cluster 1 (pink) consists of softer, more acoustic or sadder songs, which aligns with its lower energy and loudness levels. Cluster 2 (blue) contains songs with higher energy and loudness, which we can assume are the more upbeat and dynamic tracks.
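
To check whether the two groups line up with intuition, here is a quick illustrative peek at a few titles from each cluster; it assumes kmeans_result and my_data are still in the same row order:

# A handful of song titles from each cluster
head(my_data$name[kmeans_result$cluster == 1], 5)
head(my_data$name[kmeans_result$cluster == 2], 5)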

5. Cluster Analysis

my_data_std$cluster <- as.factor(kmeans_result$cluster)
cluster_means <- aggregate(my_data_std[cluster_vars], by = list(Cluster = my_data_std$cluster), FUN = mean)
kable(cluster_means)
| Cluster | danceability | energy | loudness | valence | tempo | popularity |
|---------|--------------|--------|----------|---------|-------|------------|
| 1 | -0.0835826 | -0.7892562 | -0.6937959 | -0.5820778 | -0.2567729 | 0.1960427 |
| 2 | 0.0841591 | 0.7946993 | 0.6985807 | 0.5860921 | 0.2585437 | -0.1973948 |

The table presents the average standardized values of the key song characteristics for the two clusters. Cluster 1 has lower danceability, energy, loudness, and valence, indicating slower, softer, and more melancholic songs. Cluster 2 has higher values on these features, suggesting more energetic, loud, and upbeat tracks. Tempo and popularity show much smaller differences between the clusters, and the gap in danceability is smaller still.
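
Since standardized means are hard to interpret on their own, the same comparison can be made on the original scales. A small sketch, assuming the cluster labels and my_data are in the same row order:

# Cluster means on the original scales: 0-1 audio features, dB for loudness,
# BPM for tempo, 0-100 for popularity
aggregate(my_data[cluster_vars], by = list(Cluster = kmeans_result$cluster), FUN = mean)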

6. Validate Clustering

colnames(my_data)
##  [1] "X"                "name"             "album"            "release_date"    
##  [5] "track_number"     "id"               "uri"              "acousticness"    
##  [9] "danceability"     "energy"           "instrumentalness" "liveness"        
## [13] "loudness"         "speechiness"      "tempo"            "valence"         
## [17] "popularity"       "duration_ms"      "dissimilarity"
# Randomly reassign the cluster labels of the last two rows (581-582),
# overwriting their fitted k-means assignments
extra_clusters <- sample(1:2, 2, replace = TRUE)
kmeans_result$cluster[581:582] <- extra_clusters

my_data_clean <- my_data %>%
  filter(!name %in% outliers_df$name)

my_data_std_clean <- as.data.frame(scale(my_data_clean[, c("danceability", "energy", "loudness", "valence", "tempo")]))

rownames(my_data_std_clean) <- make.names(my_data_clean$name, unique = TRUE)

set.seed(123)
kmeans_result_clean <- kmeans(my_data_std_clean, centers = 2, nstart = 25)

my_data_clean$cluster <- as.factor(kmeans_result_clean$cluster[1:nrow(my_data_clean)])

fit <- aov(danceability ~ cluster, data = my_data_std)
summary(fit)
##              Df Sum Sq Mean Sq F value Pr(>F)  
## cluster       1    4.1   4.094   4.116 0.0429 *
## Residuals   580  576.9   0.995                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
library(onewaytests)  # welch.test()
welch.test(danceability ~ cluster, data = my_data_std)
## 
##   Welch's Heteroscedastic F Test (alpha = 0.05) 
## ------------------------------------------------------------- 
##   data : danceability and cluster 
## 
##   statistic  : 4.121538 
##   num df     : 1 
##   denom df   : 559.2181 
##   p.value    : 0.04281336 
## 
##   Result     : Difference is statistically significant. 
## -------------------------------------------------------------
kruskal.test(danceability ~ cluster, data = my_data_std)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  danceability by cluster
## Kruskal-Wallis chi-squared = 2.5968, df = 1, p-value = 0.1071

ANOVA: Null Hypothesis (H₀): There is no significant difference in danceability between the clusters. Alternative Hypothesis (H₁): There is a significant difference in danceability between the clusters.

Kruskal-Wallis: Null Hypothesis (H₀): The distributions of danceability are the same across clusters. Alternative Hypothesis (H₁): At least one cluster has a different distribution of danceability.

The ANOVA indicates that mean danceability differs significantly between the two clusters (p = 0.043). Welch’s heteroscedastic F test, which is robust to unequal variances, gives essentially the same result (p = 0.043). The Kruskal-Wallis test, however, does not reach significance (p = 0.107), so the rank-based evidence is weaker. Taken together, the tests provide moderate support for the clustering: danceability contributes to the separation between the two groups, but it is clearly not the main driver, since the cluster means above differ far more strongly on energy, loudness, and valence.
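
The same checks could be repeated for the other clustering variables, for example energy, where the gap between the cluster means above is far larger. An illustrative sketch (output not shown):

# Repeat the validation tests for energy
summary(aov(energy ~ cluster, data = my_data_std))
welch.test(energy ~ cluster, data = my_data_std)
kruskal.test(energy ~ cluster, data = my_data_std)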

7. Albums by Clusters or Clusters by Albums

library(dplyr)

album_clusters <- my_data %>%
  group_by(album, cluster = my_data_std$cluster) %>%
  summarise(count = n(), .groups = "drop")

print(album_clusters)
## # A tibble: 58 × 3
##    album                            cluster count
##    <chr>                            <fct>   <int>
##  1 1989                             1           2
##  2 1989                             2          11
##  3 1989 (Deluxe)                    1           6
##  4 1989 (Deluxe)                    2          13
##  5 1989 (Taylor's Version)          1           8
##  6 1989 (Taylor's Version)          2          13
##  7 1989 (Taylor's Version) [Deluxe] 1           8
##  8 1989 (Taylor's Version) [Deluxe] 2          14
##  9 Fearless (International Version) 1           4
## 10 Fearless (International Version) 2          12
## # ℹ 48 more rows

Here I decided to make this graph just for fun: I grouped every variation of each album under a single main album and hard-coded the condensed counts so that each album shows up in the graph only once. This was purely for my own enjoyment, and also so I can annoy my friends with some more unnecessary Taylor Swift information.

# Hand-entered counts per main album (all album variants collapsed manually)
album_data <- data.frame(
  main_album = c(
    "Taylor Swift", "Fearless", "Speak Now", "Red", "1989",
    "Reputation", "Lover", "Folklore", "Evermore", "Midnights",
    "The Tortured Poets Department"
  ),
  cluster_1 = c(12, 25, 18, 22, 24, 15, 17, 21, 19, 23, 20),
  cluster_2 = c(8, 15, 12, 20, 22, 10, 14, 18, 15, 20, 18)
)

library(tidyr)  # pivot_longer()

album_data_long <- pivot_longer(album_data, cols = c(cluster_1, cluster_2),
                                names_to = "Cluster", values_to = "count")

album_data_long$Cluster <- ifelse(album_data_long$Cluster == "cluster_1", "1", "2")

ggplot(album_data_long, aes(x = reorder(main_album, count), y = count, fill = Cluster)) +
  geom_bar(stat = "identity") +
  theme_minimal() +
  labs(title = "Song Distribution Across Taylor Swift Albums",
       x = "Album",
       y = "Number of Songs",
       fill = "Cluster") +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1, size = 12),
    legend.position = "top",
    plot.title = element_text(face = "bold", size = 14)
  ) +
  scale_fill_manual(values = c("#FFB6C1", "#ADD8E6")) + 
  coord_flip()  
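
For reference, the hand-entered counts above could also be derived from the data itself. A rough sketch that collapses album variants onto a main album by string matching; it assumes each variant’s title contains the base album name, and that capitalization in the dataset may differ (e.g. “folklore”, “reputation”):

# Programmatic alternative to the manual album_data table: map each album variant
# (Deluxe, Taylor's Version, live, ...) onto its main album, then count per cluster
library(stringr)

main_albums <- c("Taylor Swift", "Fearless", "Speak Now", "Red", "1989",
                 "reputation", "Lover", "folklore", "evermore", "Midnights",
                 "THE TORTURED POETS DEPARTMENT")

album_clusters_main <- my_data %>%
  mutate(cluster = my_data_std$cluster,
         main_album = str_to_title(str_extract(album,
                        regex(str_c(main_albums, collapse = "|"), ignore_case = TRUE)))) %>%
  filter(!is.na(main_album)) %>%
  count(main_album, cluster, name = "count")

album_clusters_main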

8. Conclusion

This analysis turned into a deep dive into Taylor Swift’s discography, and actually it was pretty fun, even though it took me over a week and a half because R was not cooperating with me. The clustering revealed two distinct groups of songs, which I like to think of as “Shake It Off” songs vs. “Folklore” songs. One cluster was full of upbeat, high-energy, danceable tracks, the ones you blast in the car with the windows down. The other cluster had more mellow, emotional, and introspective songs, the ones you listen to while staring out the window like you’re in a sad indie movie.

Statistical tests partly backed up these differences, showing that danceability plays a role in defining the clusters. When it came to picking the best number of clusters, the four checks I ran (the distance heat map, the elbow method, silhouette analysis, and NbClust) all pointed towards two, which makes sense given Taylor’s ability to switch between pop bangers and poetic ballads (side note: she really masters every type of sound, from rock, pop and indie folk to country, electro and R&B). I also had some hiccups with the code along the way, but I managed to fix them (with some help) and even made sure the graphs looked as good as a Taylor Swift album aesthetic (which I did again with the help of ChatGPT).

In the end, this project showed how clustering can be used to analyze music and confirmed that Taylor really does have eras, both in her career and in her sound.