Introduction

In 2024, the iconic American singer-songwriter Taylor Swift concluded “The Eras Tour”, which ran for nearly two years and roughly 150 shows across 5 continents. She is one of the most influential pop artists of the 21st century, with over 100,000,000 monthly listeners on Spotify and high praise for her album releases, and the tour drew wide attention from her fanbase: she performed for over 10 million fans!

One of those millions of fans is W. Jake Thompson, who created an R package with data on the entirety of Taylor Swift’s discography, including song characteristics such as audio features and release dates. The website for his taylor package is https://taylor.wjakethompson.com, where he cites Genius and the Spotify API as his data sources. The package organizes every song into three main datasets: taylor_all_songs (All Taylor Swift songs), taylor_albums (All Taylor Swift albums), and eras_tour_surprise (Eras Tour Surprise Songs).

I consider myself a Taylor Swift fan who has followed her musical career for years. While I didn’t attend the Eras Tour, I watched livestreams and participated in guessing games for her “surprise songs”. At every Eras Tour show, separate from the main set list, Swift dedicated an acoustic section of the concert to 2-5 surprise songs, split between the guitar and the piano. At the start of the tour in the United States, she performed at most 2 songs in their entirety, one on guitar and one on piano, promising never to repeat them so that every concert’s surprise would be unique. As the tour progressed, she began mashing up at least two songs on the guitar and at least two songs on the piano. For fans at her concerts and fans watching livestreams, “Surprise Song Hour” was the most popular part of any show, as viewers waited to see which 2-5 songs out of 200+ Swift would perform. It became a guessing game of which songs she would play and which albums she would combine.

This phenomenon took up a large part of 2024 for me, and I’m excited to combine one of my favorite (if slightly embarrassing) interests with data visualization as a study of one artist’s popular music! My goal is to showcase how far one can go in visualizing data, and that data can be found anywhere. I chose to explore two aspects of Swift’s song data in this project: her songs’ musical characteristics per album, and how many different album mashups Swift performed on the Eras Tour. Some questions this project may answer are:

- Which Taylor Swift songs are the highest in valence (mood), tempo, and loudness per album? Is there a distinction?
- Can one predict valence from a song’s musical characteristics?
- Which albums did Swift mash up the most/least on the Eras Tour? Is there context beyond what the data can explain?

Loading in Libraries

library(tidyverse)
## Warning: package 'ggplot2' was built under R version 4.3.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(taylor)
## Warning: package 'taylor' was built under R version 4.3.3
library(dplyr)
library(highcharter)
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo 
## Highcharts (www.highcharts.com) is a Highsoft software product which is
## not free for commercial and Governmental use
library(ggplot2)
library(stringr)
library(readr)
setwd("/Applications/DATA110")

Built-in dataset:

taylor_all_songs <- taylor_all_songs
eras_tour_surprise <- eras_tour_surprise
#sequential album order for both charts 
album_order <- c("Taylor Swift","Fearless","Speak Now","Red",
                 "1989","reputation","Lover","folklore",
                 "evermore","Midnights", "TTPD", "Other")

#highcharter
album_orderh <- c("Taylor Swift","Fearless","Speak Now","Red",
                 "1989","reputation","Lover","folklore",
                 "evermore","Midnights", "THE TORTURED POETS DEPARTMENT", "Other")

#for ggplot. both of these are custom color palettes that associate a color with an album's color.
album_order_hex <- c("#A5C9A5","#EFC180", "#C7A8CB", "#894651", "#B5E5F8", "#434343", "#F7B0CC", "#CDC9C1", "#C5AC90", "#485D90", "#E7EDEE", "#F1A66A")

#for highcharter
album_colors <- c(
  "Taylor Swift" = "rgba(165,201,165,0.6)",  
  "Fearless"   = "rgba(239,193,128,0.6)",
  "Speak Now" = "rgba(199,168,203,0.6)",
  "Red"  = "rgba(137,70,81,0.6)",
  "1989"  = "rgba(181,229,248,0.6)",
  "reputation" = "rgba(67,67,67,0.6)",
  "Lover" = "rgba(247,176,204,0.6)",
  "folklore"  = "rgba(205,201,193,0.6)",
  "evermore" = "rgba(197,172,144,0.6)",
  "Midnights" = "rgba(72,93,144,0.6)",
  "THE TORTURED POETS DEPARTMENT" = "rgba(231,237,238,0.6)"
)
# Ridding the dataset of unofficial albums (EPs)
t1 <- taylor_all_songs %>%
  filter(!is.na(album_name), album_name != "The Taylor Swift Holiday Collection", album_name != "Beautiful Eyes")

Outside Cleaning

I only had to adjust certain album cells. Some songs were associated with an album as a bonus track but were listed as singles and not as part of an album. An example is “I Don’t Wanna Live Forever”, which is not officially on the “reputation” album but has long been associated with it by release date and aesthetic.

t2 <- read.csv("taylor_songs.csv")

2nd Cleaning Edition

Removing “Taylor’s Version” from song titles. For context, Swift re-released many songs as re-recordings labeled “Taylor’s Version”. To avoid duplicates, only one version of each album is used. The main goal of cleaning up song titles is to eventually join them with my own Eras Tour dataset by song title.

 # mutating out any extra titles in songs and avoiding NAs by trimming
t2 <- t2 %>%
  mutate(album_name = str_replace(album_name, "\\(Taylor's Version\\)", "")
  ) %>%
  mutate(album_name = str_trim(album_name)) %>%
  mutate(album_name = factor(album_name, levels = album_order))


t2 <- t2 %>%
  mutate(track_name = str_replace(track_name, "\\(Taylor's Version\\)", "")) %>%
  mutate(track_name = str_replace(track_name, "\\[Taylor's Version\\]", "")) %>%
  mutate(track_name = str_replace(track_name, "\\[From The Vault\\]", "")) %>%
  mutate(track_name = str_replace(track_name, "\\(From The Vault\\)", "")) %>%
  mutate(track_name = str_trim(track_name)) %>% 
  select(-X)
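
The title cleanup above can also be sketched as a single pattern. This is only an illustration on made-up example titles, not the project's actual data: `titles`, `suffix_pattern`, and `clean_title()` are hypothetical names, and base R's `gsub()` and `trimws()` stand in for the stringr calls so the snippet runs without any packages.

```r
# Toy titles standing in for the real track_name column (assumed examples).
titles <- c(
  "Fearless (Taylor's Version)",
  "Mr. Perfectly Fine (Taylor's Version) [From The Vault]",
  "Cruel Summer"
)

# One pattern covering both suffixes in either bracket style:
# optional leading whitespace, an opening ( or [, the label, a closing ) or ].
suffix_pattern <- "\\s*[\\(\\[](Taylor's Version|From The Vault)[\\)\\]]"

# Remove every matching suffix, then trim leftover whitespace.
clean_title <- function(x) {
  trimws(gsub(suffix_pattern, "", x, perl = TRUE))
}

clean_titles <- clean_title(titles)
print(clean_titles)
```

Unlike `str_replace()`, which replaces only the first match, `gsub()` removes every occurrence, so a title carrying both suffixes is cleaned in one pass.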

My own data-set, collected from https://en.wikipedia.org/wiki/The_Eras_Tour#Surprise_songs.

surprise_songs <- read_csv("surprise_songs.csv")
## Rows: 447 Columns: 11
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (9): city, country, leg, song_title, album, instrument, mode, guest, re...
## dbl  (1): night
## date (1): date
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Staying on top of trimming. This dataset suffered from various empty spaces and trial & error.
t2 <- t2 %>%
  mutate(track_name = str_trim(track_name)) 

surprise_songs <- surprise_songs %>%
  mutate(song_title = str_trim(song_title)) 

#Combining into one full dataset by song titles, more cleaning

full_taylor <- left_join(surprise_songs, t2, by = c("song_title" = "track_name"))

# Removing categories that won't be useful in this data exploration

full_taylor <- full_taylor %>%
  select(-bonus_track, -promotional_release, -single_release, -album_name, -ep, -album_release, -track_release)

# re-naming two categories that had the same name

full_taylor <- full_taylor %>%
  rename(mode = mode.x) %>%
  rename(modality = mode.y)
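
Because a left join silently fills in NA wherever a title has no match, it can help to list unmatched titles before joining. The sketch below uses two tiny made-up tables (`surprise_toy` and `songs_toy` are hypothetical stand-ins for surprise_songs and t2, and the valence numbers are invented) to show how a stray trailing space breaks a match and how trimming repairs it.

```r
# Hypothetical miniature versions of surprise_songs and t2.
surprise_toy <- data.frame(
  song_title = c("Cruel Summer", "Maroon ", "Style"),  # note the trailing space
  stringsAsFactors = FALSE
)
songs_toy <- data.frame(
  track_name = c("Cruel Summer", "Maroon", "Style"),
  valence = c(0.5, 0.1, 0.4),  # made-up values for illustration only
  stringsAsFactors = FALSE
)

# Titles in the surprise-song table with no exact match in the song table.
unmatched_before <- setdiff(surprise_toy$song_title, songs_toy$track_name)

# Trim whitespace, then check again.
surprise_toy$song_title <- trimws(surprise_toy$song_title)
unmatched_after <- setdiff(surprise_toy$song_title, songs_toy$track_name)

print(unmatched_before)
print(unmatched_after)
```

With dplyr loaded, `anti_join(surprise_songs, t2, by = c("song_title" = "track_name"))` would give the same kind of check on the full tables.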

Grouping

My first visualization is on the Eras Tour mashups. In order to create a heatmap, my goal was to give every unique mashup an ID as a reference. A unique mashup is identified by the date it was played, the city, the night of the tour, and which of the two instruments it was played on.

#group 1: add an ID to every mashup. Surprise songs performed in their entirety without a mashup are removed. Grouped by a mashup's unique characteristics.

mashups <- full_taylor %>% 
  filter(mode != "Full")  %>% 
  group_by(date, city, night, instrument) %>%
  mutate(mashup_id = cur_group_id()) %>%
  ungroup() 

#group 2 and 3, counting individual songs and filtering out any duplicates that didn't actually mash up 
album_counts <- mashups %>%
  count(mashup_id, album, name = "song_num") %>%
  distinct(mashup_id, song_num, album) 

# Some mashups used two songs from the same album! (Ex. a 1989 song mashed up with a 1989 song.)
# Written to accommodate mashups that were actually triple mashups!
# Same-album mashups are preserved, while self-pairs created by the join that didn't actually happen are removed.
album_more_counts <- album_counts %>%
  inner_join(album_counts, by = "mashup_id") %>%
  filter(
    (album.x == album.y & song_num.x >= 2) |
    (album.x != album.y)
  ) 
## Warning in inner_join(., album_counts, by = "mashup_id"): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 2 of `x` matches multiple rows in `y`.
## ℹ Row 2 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
##   "many-to-many"` to silence this warning.
#switching characters and factors so levels actually applies to ggplot, to ensure that the visualization is presented in album order.
#The datasets had an album title as "THE TORTURED POETS DEPARTMENT", which was later shortened to the abbreviation "TTPD" for an easier visualization. 
# counting instances of album x album mashups here.
mashup_final_count <- album_more_counts %>%
  group_by(album.x, album.y) %>%
  count(album.x, album.y) %>%
  mutate(
        album.x = as.character(album.x),
        album.y = as.character(album.y),
        album.x = ifelse(album.x == "THE TORTURED POETS DEPARTMENT", "TTPD", album.x),
        album.y = ifelse(album.y == "THE TORTURED POETS DEPARTMENT", "TTPD", album.y),
        album.x = factor(album.x, levels = album_order),
        album.y = factor(album.y, levels = album_order)
  )
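
The pairing logic above can be traced on a toy table. This is a minimal sketch with invented mashups (`toy_counts` is hypothetical), using base R's `merge()` in place of `inner_join()` so it runs without packages; `merge()` adds the same `.x` and `.y` suffixes by default.

```r
# Toy per-mashup album counts: mashup 1 = 1989 + Red, mashup 2 = two 1989 songs,
# mashup 3 = Lover + 1989. These rows are invented for illustration.
toy_counts <- data.frame(
  mashup_id = c(1, 1, 2, 3, 3),
  album     = c("1989", "Red", "1989", "Lover", "1989"),
  song_num  = c(1, 1, 2, 1, 1),
  stringsAsFactors = FALSE
)

# Self-join: every album in a mashup is paired with every album in the same mashup.
pairs <- merge(toy_counts, toy_counts, by = "mashup_id")

# Keep cross-album pairs, and same-album pairs only when the album
# really contributed two or more songs to that mashup.
kept <- pairs[(pairs$album.x == pairs$album.y & pairs$song_num.x >= 2) |
                (pairs$album.x != pairs$album.y), ]

# Count how often each ordered album pair occurs.
pair_counts <- aggregate(n ~ album.x + album.y,
                         data = transform(kept, n = 1), FUN = sum)
print(pair_counts)
```

Each cross-album mashup shows up twice, once as (x, y) and once as (y, x), which is what makes the heatmap symmetric around its diagonal.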

Plot 1: Mashup Frequencies from the Eras Tour

# filling per the count of album x album mashup
ggplot(mashup_final_count, aes(x = album.x, y = album.y, fill = n)) +
  geom_tile(color = "gray") +
  geom_text(aes(label = n), size = ifelse(mashup_final_count$n >= 5, 4, 3),
  color = ifelse(mashup_final_count$n >= 5, "#371422", "#f0f6f6")) + #different text color for visual appeal
  scale_fill_taylor_c(album = "lover", na.value = "#eeeeee") + #custom gradient per album included in the taylor package!
  labs(
    title = "Album Mashup Heatmap on Taylor Swift's \"The Eras Tour\" (2023-2024)",
    x = "Album",
    y = "Album",
    fill = "Mashup Count",
    caption = "Source: Genius"
  ) +
  theme_classic(base_family = "Times", base_size = 12) +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1), #avoid crowded x-axis labels
  )

My heatmap is a measure of how unique a surprise song album mashup can be! It’s ordered by the release dates of Swift’s albums, from “Taylor Swift” (2006) to “TTPD” (2024), and includes an “Other” category, as there were instances where Swift mashed up one of her own songs with a song that was not hers, usually with a special guest to add to the “surprise” element. Some of these are visualized in the “Other” category.

The heatmap backs up some context from the tour’s timeline. Swift began performing mashups more frequently in February 2024. After releasing “TTPD” in April 2024 and beginning the Europe leg of the Eras Tour in May 2024, she had the opportunity to debut new “TTPD” songs live and acoustically in the surprise song section. This timeline explains why many songs from her previous albums ended up in a mashup with what was then her newest album.

“reputation” is the album most commonly mashed up with “TTPD”, with a total of 8 mashups! Visually, both albums have similar color palettes, and they touch on similar topics, though the link could be explored further quantitatively. In contrast, “TTPD” has only 1 mashup each with “Taylor Swift” and “Lover”.

Another fun realization is that Swift rarely mashes up two songs from the same album. For most albums this happened once or never, but “1989” is a large outlier, with 8 same-album “1989” mashups performed! Oddly enough, “TTPD”, with 31 songs, was never mashed up with itself. Some album pairings never occurred at all on the tour and some only once, living up to Swift’s goal of making each night’s surprise song mashups as unique as she can.

Ultimately, I struggled the most with grouping and cleanup: converting columns between factors and characters, ensuring my heatmap’s x and y axes could be categorical, and building up the visualization so it was accurate. Mashups are very unique, and I had to keep grouping and editing to accommodate the outliers: 3-song mashups, same-album mashups, and the “Other” category, without which there would be an NA. I also cross-checked my own data set against Wikipedia’s page on surprise songs to ensure accurate results. Had I had more knowledge and time, I would have liked to include a custom gradient per album color and blend it with another album’s staple color where they intersect on the heatmap. An idea I’m definitely building on in the future is a highcharter version that lists every song in a mashup when hovering over its cell, putting song names to the album pairs. Otherwise, I’m satisfied with this heatmap, which achieved its goal of finally quantifying each album’s mashup frequency.

Statistical Analysis: Musical Composition of Songs Predicting Valence (Mood)

Now, my focus is on the musical characteristics of Swift’s songs across all of her albums. song_stat, the filtered data set, contains 242 songs for the statistical analysis.

# Predict mood, by keeping important song characteristics 
song_stat <- t2 %>%
  select(album_name, track_name, valence, energy, loudness, tempo, acousticness, instrumentalness) %>%
  filter(!is.na(valence))  # 1 song had all NAs. This filters only it out.

v_model <- lm(valence ~ energy + loudness + tempo + acousticness + instrumentalness, data = song_stat)  
summary(v_model)
## 
## Call:
## lm(formula = valence ~ energy + loudness + tempo + acousticness + 
##     instrumentalness, data = song_stat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.35488 -0.11765 -0.00039  0.10304  0.44801 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -4.787e-02  1.024e-01  -0.468    0.640    
## energy            6.794e-01  9.848e-02   6.899 4.78e-11 ***
## loudness          3.844e-03  6.631e-03   0.580    0.563    
## tempo             9.847e-05  3.250e-04   0.303    0.762    
## acousticness      1.988e-01  4.693e-02   4.235 3.27e-05 ***
## instrumentalness -4.084e-01  3.758e-01  -1.087    0.278    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1568 on 236 degrees of freedom
## Multiple R-squared:  0.2842, Adjusted R-squared:  0.269 
## F-statistic: 18.74 on 5 and 236 DF,  p-value: 1.124e-15
plot(v_model)

My model uses 5 characteristics (energy, loudness, tempo, acousticness, and instrumentalness) as predictors to test whether each is significantly associated with the valence/mood of a song. Only two characteristics are significantly associated with a song’s valence: energy and acousticness, with respective p-values of 4.78e-11 and 3.27e-05. The adjusted R-squared is 0.269, so about 26.9% of the variation in valence across the songs is explained by my model. In the Residuals vs Fitted plot, the line dips and forms a bit of a curve, but it is close enough to flat to treat the linear fit as reasonable. In the Q-Q plot, there are outlier residuals forming a left tail and a right tail, but otherwise the residuals line up along the normal distribution line. Checking homoscedasticity in the Scale-Location plot, the red line is visually almost straight, so we can assume roughly equal variance of residuals. Lastly, the line in the Residuals vs Leverage plot spikes up a bit before flattening out again, which hints that a few songs may have more influence on the fit than the rest. The model also has its own limitation in that not every characteristic is a useful predictor: loudness, tempo, and instrumentalness were not statistically significant.
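
To make the fitted equation concrete, the sketch below plugs a made-up song into the coefficients printed in the summary output and re-derives the adjusted R-squared from the reported R-squared. The `song` values are hypothetical and chosen only for illustration; the coefficients, R-squared, and sample size are copied from the output above.

```r
# Coefficients copied from the summary(v_model) output.
coefs <- c(
  intercept        = -4.787e-02,
  energy           =  6.794e-01,
  loudness         =  3.844e-03,
  tempo            =  9.847e-05,
  acousticness     =  1.988e-01,
  instrumentalness = -4.084e-01
)

# A hypothetical song (not taken from the dataset).
song <- c(energy = 0.7, loudness = -6, tempo = 120,
          acousticness = 0.1, instrumentalness = 0)

# Predicted valence = intercept + sum of coefficient * value for each predictor.
predicted_valence <- unname(coefs["intercept"] + sum(coefs[names(song)] * song))
print(round(predicted_valence, 3))  # 0.436

# Adjusted R-squared from R-squared, n = 242 songs, p = 5 predictors.
r2 <- 0.2842
n <- 242
p <- 5
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(round(adj_r2, 3))  # 0.269
```

With the fitted model object available, `predict(v_model, newdata = ...)` on a one-row data frame would give the same kind of prediction without copying coefficients by hand.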

Plot 2: Songs by a Valence, Energy, & Loudness Index in Album Order

hc_songs <- highchart() |>
  hc_chart(type = "scatter") |>
  hc_title(text = "<b>Taylor Swift Songs by Valence, Energy, & Loudness in Track # Order</b>") |>
  hc_caption(text = "Source: SpotifyAPI") |>
  hc_xAxis(title = list(text = "Track Number")) |>
  hc_yAxis(title = list(text = "Valence/Energy/Loudness"))

# a for loop cycles through the album order, creating a series per album

for (i in album_orderh){
  
  data <- t2[t2$album_name == i, ] # AI help to index songs
  # i will equal an album's name for the unique order and color
    
  hc_songs <- hc_songs |> 
    hc_add_series(
      data = data,
      hcaes(x = track_number, y = (35 + (valence * 5) + (energy * 2) + (loudness * 2))),
      type = "scatter",
      color = album_colors[i], 
      name = i,
      marker = list(
        enabled = TRUE,
        symbol = "circle",
        radius = 6,
        lineColor = "#555555",
        lineWidth = 1
        )
    )
}

# Tooltip has the song name, album, track number, and song characteristics
hc_songs <- hc_songs |>
    hc_tooltip(
    shared = FALSE,
    pointFormat = "<b>{point.track_name}</b><br>
    Track {point.track_number}<br>
    Key: {point.key_mode}<br>
    Valence: {point.valence}<br>
    Energy: {point.energy}<br>
    Loudness: {point.loudness}
    "
  )  |>
  hc_legend(enabled = TRUE)

hc_songs

My highcharter scatter plot visualizes Taylor Swift songs per album through each song’s characteristics from the Spotify API, in track order within each album. Clicking through the legend, you can compare as many albums as you’d like, as if you were comparing their qualities while listening through them in order. I chose track order for the x-axis because, when listening through an album, moods change and certain songs stick out for their shifts in valence, energy, and loudness. One can also compare two or more albums and how their characteristics differ from each other, while viewing each individual song via a tooltip!

Click so that only “Taylor Swift” and “THE TORTURED POETS DEPARTMENT”, Swift’s first and most recent albums, are shown. Sonically, her debut album sits higher on the y-axis in valence, energy, and loudness. Almost every song on her debut album is higher in this “song characteristic measure” than those on her recent album, with both albums’ Track 3 as the exception. On the scatter plot, the two albums almost look like clusters, which can be explained by their different genres: Swift’s debut album was country, and her recent album is more akin to synth-pop and dream-pop. Other albums that stand out by genre include “Speak Now”, which is more pop-rock and boasts some of the highest energy and loudness, and “1989”, her first synth-pop album, with her biggest hits that anyone might know, such as “Shake It Off”, “Blank Space”, and “Style”. These all score high on the y-axis measure, rightfully so given their catchy melodies that influence valence.

In my last project, my highcharter graphing code was massive, so this time I used a for loop to go through the albums and create a unique series per album, each with its own color. Unfortunately, the data sets did not include a genre per album, so genre could not be added to the visualization without further explanation or extra notes. When deciding the y-axis measure, I added a constant offset and weighted each value to avoid negative y values, as loudness is a negative number. Valence has the largest multiplier in this measure because I would argue the overall mood of a song is its biggest standout characteristic. I also did not have enough time to analyze the numbers by track number, which would’ve been neat on its own, or to add a linear regression line per album in highcharter without overcrowding the chart.
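
The y-axis measure from the chart code can be written as a small function to see how each term contributes. `song_index()` is a hypothetical helper name and the inputs are made-up values; the formula itself is the one used in the `hcaes()` call above.

```r
# The chart's y-axis measure: a constant offset plus weighted characteristics.
song_index <- function(valence, energy, loudness) {
  35 + (valence * 5) + (energy * 2) + (loudness * 2)
}

# A made-up mid-valence, fairly energetic song at -6 dB.
loud_song <- song_index(valence = 0.5, energy = 0.7, loudness = -6)
print(loud_song)   # 35 + 2.5 + 1.4 - 12 = 26.9

# The same song at -12 dB: only loudness changes.
quiet_song <- song_index(valence = 0.5, energy = 0.7, loudness = -12)
print(quiet_song)  # 35 + 2.5 + 1.4 - 24 = 14.9
```

One thing this arithmetic suggests: because valence and energy range from 0 to 1 while loudness is in decibels, a few decibels of difference can move the index more than the full valence range does, even though valence has the largest multiplier.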

Works Cited

Wikipedia contributors (2025), Surprise Songs, “The Eras Tour” https://en.wikipedia.org/wiki/The_Eras_Tour#Surprise_songs

Thompson W (2025). taylor: Lyrics and Song Data for Taylor Swift’s Discography. R package version 3.2.0, https://github.com/wjakethompson/taylor, https://taylor.wjakethompson.com.