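This report dissects 151 Taylor Swift songs, combining lyrical metadata with YouTube metrics, to explore four questions: which albums carry the largest share of explicit tracks, how often the keywords "midnight" and "love" appear in the lyrics, how releases are distributed across seasons, and how long songs take to appear on YouTube after release.
The analysis demonstrates data wrangling, text mining, factor engineering, and visualization in R.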
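library(tidyverse)  # readr, dplyr, stringr, forcats, ggplot2, lubridate
library(janitor)    # clean_names()
music <- read_csv("music.csv", show_col_types = FALSE) %>% clean_names()
youtube <- read_csv("youtube.csv", show_col_types = FALSE) %>% clean_names()
# Keep only videos with a matching song in the metadata
pop_df <- youtube %>% inner_join(music, by = c("song_id" = "id"))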
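An inner join keeps only rows with a match in both tables, so videos without a matching song drop out silently. Below is a minimal sketch of how one might check for that before trusting the joined table. The tables `music_toy` and `youtube_toy` and all their values are invented for illustration; only the key names `song_id` and `id` come from the code above.

```r
library(dplyr)
library(tibble)

# Toy stand-ins for the real tables; all values are invented.
music_toy   <- tibble(id = c(1, 2, 3), album_name = c("A", "A", "B"))
youtube_toy <- tibble(song_id = c(1, 2, 4), youtube_title = c("v1", "v2", "v4"))

# Videos whose song_id has no counterpart in the song metadata
unmatched <- youtube_toy %>% anti_join(music_toy, by = c("song_id" = "id"))

# The same join as in the report, on the toy data
joined <- youtube_toy %>% inner_join(music_toy, by = c("song_id" = "id"))

# Each video is either joined or reported as unmatched
# (this holds because id is unique in music_toy)
stopifnot(nrow(joined) + nrow(unmatched) == nrow(youtube_toy))
```

If `unmatched` is non-empty on the real data, the per-album figures below are computed on fewer songs than the source files contain.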
explicit_tbl <- pop_df %>%
  group_by(album_name) %>%
  summarise(total_songs    = n(),
            explicit_count = sum(explicit, na.rm = TRUE),
            explicit_pct   = round(mean(explicit, na.rm = TRUE) * 100, 1),
            .groups = "drop") %>%
  arrange(desc(explicit_pct))
explicit_tbl
ggplot(explicit_tbl, aes(x = fct_reorder(album_name, explicit_pct), y = explicit_pct)) +
  geom_col(fill = "#d62728") +
  coord_flip() +
  labs(title = "Share of Explicit Tracks by Album", x = NULL, y = "% Explicit") +
  theme_minimal()
Midnights and evermore have the highest share of explicit tracks, which is consistent with a more mature thematic era.
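# Both patterns are plain substring matches on lower-cased lyrics,
# so "love" is also counted inside longer words such as "lover"
pop_df <- pop_df %>%
  mutate(contains_midnight = str_detect(str_to_lower(full_lyrics), "midnight"),
         love_count        = str_count(str_to_lower(full_lyrics), "love"))
# Number of songs that mention "midnight" at least once
midnight_total <- sum(pop_df$contains_midnight, na.rm = TRUE)
# Song(s) with the highest "love" count
love_top <- pop_df %>%
  filter(love_count == max(love_count, na.rm = TRUE)) %>%
  select(album_name, youtube_title, love_count)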
• “Midnight” appears in 8 songs.
• The love champion is Taylor Swift - This Love (1989), with 52 matches of “love” (substring matches, so words such as “lover” also count).
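The keyword counts above come from substring matching, which also counts "love" inside longer words. As a hedged illustration of how much that choice can matter, the sketch below compares substring and whole-word counts on two invented lyric lines (`lyrics_toy` is hypothetical and not taken from the dataset).

```r
library(stringr)

# Invented text, not real lyrics
lyrics_toy <- c("Love, lover, beloved gloves", "this love is LOVE")

# Substring match: "love" is counted inside longer words too
substring_count <- str_count(str_to_lower(lyrics_toy), "love")

# Whole-word match: \\b requires a word boundary on each side
word_count <- str_count(str_to_lower(lyrics_toy), "\\blove\\b")

substring_count  # 4 2
word_count       # 1 2
```

Which variant is appropriate depends on the question: the substring count measures how often the string occurs, while the whole-word count measures how often the word itself is sung.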
pop_df <- pop_df %>%
  mutate(song_release_date_month = as.character(song_release_date_month),
         season = fct_collapse(song_release_date_month,
                               spring = c("3", "4", "5"),
                               summer = c("6", "7", "8"),
                               fall   = c("9", "10", "11"),
                               winter = c("12", "1", "2")))
ggplot(pop_df, aes(x = season)) +
  geom_bar(fill = "#9467bd") +
  labs(title = "Number of Songs by Season", x = "Season", y = "Count") +
  theme_minimal()
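Spring (March to May) slightly edges out the other seasons, which may hint at a preference for spring releases. This window only partly overlaps Q2, and the counts are per song, so a single album release contributes many songs to one season.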
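To read the seasonal comparison as numbers rather than bar heights, the seasons can be tabulated in calendar order. The sketch below uses an invented `songs_toy` table with a hypothetical `month` column. Reordering the levels with `fct_relevel()` is an addition for readability, not something the original pipeline does.

```r
library(dplyr)
library(forcats)
library(tibble)

# Invented release months, one row per song (every month appears at least once)
songs_toy <- tibble(month = c(1:12, 4, 5))

season_counts <- songs_toy %>%
  mutate(
    # Collapse month numbers into four seasons
    season = fct_collapse(as.character(month),
                          spring = c("3", "4", "5"),
                          summer = c("6", "7", "8"),
                          fall   = c("9", "10", "11"),
                          winter = c("12", "1", "2")),
    # Put the levels in calendar order for tables and plots
    season = fct_relevel(season, "spring", "summer", "fall", "winter")
  ) %>%
  count(season)

season_counts
```

Without the `fct_relevel()` step, the level order follows the alphabetical order of the month strings ("1", "10", "11", ...), so the bars would not appear in calendar order.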
# ---- Robust lag calculation --------------------------------------------
lag_df <- pop_df %>%
  mutate(
    # Build release date (assume missing day = 1); the month may be stored
    # as character after the season step, so cast it to integer explicitly
    release_date = lubridate::make_date(
      song_release_date_year,
      as.integer(song_release_date_month),
      dplyr::coalesce(song_release_date_day, 1)
    ),
    # Parse YouTube publish dates: primary format MM/DD/YYYY, fallback to ISO.
    # coalesce() applies the fallback element by element wherever mdy() gives NA
    youtube_publish_date_parsed = dplyr::coalesce(
      lubridate::mdy(youtube_publish_date, quiet = TRUE),
      lubridate::ymd(youtube_publish_date, quiet = TRUE)
    ),
    # Compute lag in days
    post_lag = as.numeric(difftime(youtube_publish_date_parsed, release_date, units = "days"))
  )
lag_tbl <- lag_df %>%
  filter(is.finite(post_lag)) %>%
  group_by(album_name) %>%
  summarise(avg_lag_days = mean(post_lag), .groups = "drop") %>%
  arrange(desc(avg_lag_days))
⏳ On average, Taylor Swift songs appear on YouTube within a few days of release; longer lags may point to either surprise drops or metadata gaps.
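The lag figures depend on publish dates that may arrive in more than one format. Below is a small sketch, using invented dates, of a per-element fallback from MM/DD/YYYY to ISO dates followed by the lag computation. The table `videos_toy` and its values are hypothetical, and the column names only mirror those used in the report.

```r
library(dplyr)
library(lubridate)
library(tibble)

# Invented dates: one publish date per format
videos_toy <- tibble(
  release_date         = as.Date(c("2020-07-24", "2020-12-11")),
  youtube_publish_date = c("07/24/2020", "2020-12-14")
)

lag_toy <- videos_toy %>%
  mutate(
    # Try MM/DD/YYYY first; where that yields NA, use the ISO parse instead
    published = coalesce(mdy(youtube_publish_date, quiet = TRUE),
                         ymd(youtube_publish_date, quiet = TRUE)),
    # Subtracting two Dates gives a difference in days
    post_lag  = as.numeric(published - release_date)
  )

lag_toy
```

Rows where both parsers fail keep `NA` in `published` and `post_lag`, so they can be filtered out before averaging, as the `is.finite()` filter does.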
Skills demonstrated: data joins & cleaning, text mining, date engineering, factor manipulation, and insightful visualization. These techniques provide a multifaceted lens on Taylor Swift’s discography and promotion strategy.
sessionInfo()
## R version 4.3.3 (2024-02-29)
## Platform: aarch64-apple-darwin20 (64-bit)
## Running under: macOS Sonoma 14.2.1
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: America/Los_Angeles
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] patchwork_1.3.0 janitor_2.2.1 lubridate_1.9.3 forcats_1.0.0
## [5] stringr_1.5.1 dplyr_1.1.4 purrr_1.0.2 readr_2.1.5
## [9] tidyr_1.3.1 tibble_3.2.1 ggplot2_3.5.1 tidyverse_2.0.0
##
## loaded via a namespace (and not attached):
## [1] sass_0.4.9 utf8_1.2.4 generics_0.1.3 stringi_1.8.3
## [5] hms_1.1.3 digest_0.6.35 magrittr_2.0.3 evaluate_1.0.3
## [9] grid_4.3.3 timechange_0.3.0 fastmap_1.2.0 jsonlite_1.8.8
## [13] fansi_1.0.6 scales_1.3.0 jquerylib_0.1.4 cli_3.6.2
## [17] rlang_1.1.3 crayon_1.5.2 bit64_4.0.5 munsell_0.5.0
## [21] withr_3.0.2 cachem_1.1.0 yaml_2.3.8 parallel_4.3.3
## [25] tools_4.3.3 tzdb_0.4.0 colorspace_2.1-0 vctrs_0.6.5
## [29] R6_2.5.1 lifecycle_1.0.4 snakecase_0.11.1 bit_4.0.5
## [33] vroom_1.6.5 pkgconfig_2.0.3 pillar_1.9.0 bslib_0.6.2
## [37] gtable_0.3.4 glue_1.7.0 highr_0.10 xfun_0.43
## [41] tidyselect_1.2.1 rstudioapi_0.16.0 knitr_1.45 farver_2.1.1
## [45] htmltools_0.5.8 labeling_0.4.3 rmarkdown_2.26 compiler_4.3.3