Introduction

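This report examines 151 Taylor Swift songs, combining lyric metadata with YouTube metrics to explore explicit content by album, keyword usage (“midnight” and “love”), the seasonality of releases, and the lag between a song's release and its YouTube publish date.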

The analysis demonstrates data‑wrangling, text mining, factor engineering, and visualization in R.


1 Data Import & Cleaning

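library(tidyverse)  # readr, dplyr, stringr, forcats, ggplot2, lubridate
library(janitor)    # clean_names()

music   <- read_csv("music.csv", show_col_types = FALSE) %>% clean_names()
youtube <- read_csv("youtube.csv", show_col_types = FALSE) %>% clean_names()

# Keep only songs that appear in both tables
pop_df <- youtube %>% inner_join(music, by = c("song_id" = "id"))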
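
Because inner_join() silently drops rows that have no match, it can be worth checking how many songs fall out of the join. The sketch below uses two small made-up tables (music_toy and youtube_toy are hypothetical stand-ins, not the real CSVs) and assumes the same song_id/id key pairing as the join above.

```r
library(dplyr)
library(tibble)

# Hypothetical stand-ins for music.csv and youtube.csv
music_toy   <- tibble(id = c(1, 2, 3), album_name = c("1989", "1989", "Lover"))
youtube_toy <- tibble(song_id = c(1, 2, 4), youtube_title = c("a", "b", "c"))

# Rows of youtube_toy with no partner in music_toy (dropped by inner_join)
unmatched <- youtube_toy %>% anti_join(music_toy, by = c("song_id" = "id"))

# Rows that survive the join
joined <- youtube_toy %>% inner_join(music_toy, by = c("song_id" = "id"))

nrow(unmatched)
nrow(joined)
```

If the unmatched count is non-zero on the real data, the joined table covers fewer songs than youtube.csv contains.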

2 Explicit Content by Album

explicit_tbl <- pop_df %>%
  group_by(album_name) %>%
  summarise(total_songs = n(),
            explicit_count = sum(explicit, na.rm = TRUE),
            explicit_pct = round(mean(explicit, na.rm = TRUE) * 100, 1),
            .groups = "drop") %>%
  arrange(desc(explicit_pct))
explicit_tbl
ggplot(explicit_tbl, aes(x = fct_reorder(album_name, explicit_pct), y = explicit_pct)) +
  geom_col(fill = "#d62728") +
  coord_flip() +
  labs(title = "Share of Explicit Tracks by Album", x = NULL, y = "% Explicit") +
  theme_minimal()

Midnights and evermore have the highest share of explicit tracks, consistent with a more mature thematic era.


3 Keyword Mining: “Midnight” & “Love”

pop_df <- pop_df %>%
  mutate(contains_midnight = str_detect(str_to_lower(full_lyrics), "midnight"),
         love_count        = str_count(str_to_lower(full_lyrics), "love"))

midnight_total <- sum(pop_df$contains_midnight, na.rm = TRUE)

love_top <- pop_df %>%
  filter(love_count == max(love_count, na.rm = TRUE)) %>%
  select(album_name, youtube_title, love_count)

• “Midnight” appears in 8 songs.
• The love champion is Taylor Swift - This Love (1989), with 52 mentions (substring matches, so words such as “lover” or “loved” also count).
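
Since str_count() with a plain pattern counts substrings, a whole-word count can come out lower. A minimal sketch on made-up lyrics (lyrics_toy is hypothetical), comparing the plain pattern with one wrapped in the regex word boundary \b:

```r
library(stringr)

# Hypothetical lyrics, not taken from the dataset
lyrics_toy <- c("Love, lover, beloved love", "No match here")

# Substring count: also hits "lover" and "beloved"
substring_count <- str_count(str_to_lower(lyrics_toy), "love")

# Whole-word count: only the standalone word "love"
word_count <- str_count(str_to_lower(lyrics_toy), "\\blove\\b")

substring_count  # 4 0
word_count       # 2 0
```

Which of the two counts is appropriate depends on whether inflected forms should count as mentions.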


4 Seasonality of Releases

pop_df <- pop_df %>%
  mutate(song_release_date_month = as.character(song_release_date_month),
         season = fct_collapse(song_release_date_month,
                               spring = c("3", "4", "5"),
                               summer = c("6", "7", "8"),
                               fall   = c("9", "10", "11"),
                               winter = c("12", "1", "2")))
ggplot(pop_df, aes(x = season)) +
  geom_bar(fill = "#9467bd") +
  labs(title = "Number of Songs by Season", x = "Season", y = "Count") +
  theme_minimal()

Spring (March to May) slightly edges out the other seasons, which may point to deliberate spring release timing.
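
One caveat with collapsing a character vector: its factor levels are sorted as strings ("1", "10", "11", ...), so the seasons may not appear in calendar order on the axis. A small sketch, assuming month numbers stored as strings (months_toy is a hypothetical input), that fixes the order explicitly with fct_relevel():

```r
library(forcats)

# Hypothetical input: one song per calendar month, stored as strings
months_toy <- as.character(1:12)

season <- fct_collapse(months_toy,
                       spring = c("3", "4", "5"),
                       summer = c("6", "7", "8"),
                       fall   = c("9", "10", "11"),
                       winter = c("12", "1", "2"))

# Put the levels in an explicit calendar-style order
season_ordered <- fct_relevel(season, "winter", "spring", "summer", "fall")

levels(season_ordered)
table(season_ordered)
```

The chosen order (winter first) is only one option; any order passed to fct_relevel() is respected by geom_bar().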


5 Lag Between Release & YouTube Publish

5.1 Robust Date Parsing & Lag Calc

# ---- Robust lag calculation --------------------------------------------
lag_df <- pop_df %>%
  mutate(
    # Build release date (assume missing day = 1); month was cast to
    # character in section 4, so convert it back to integer here
    release_date = lubridate::make_date(
      song_release_date_year,
      as.integer(song_release_date_month),
      dplyr::coalesce(song_release_date_day, 1)
    ),

    # Parse YouTube publish dates: primary format MM/DD/YYYY, fallback to ISO.
    # coalesce() falls back element by element; %||% only checks for NULL.
    youtube_publish_date_parsed = dplyr::coalesce(
      lubridate::mdy(youtube_publish_date, quiet = TRUE),
      lubridate::ymd(youtube_publish_date, quiet = TRUE)
    ),

    # Compute lag in days
    post_lag = as.numeric(difftime(youtube_publish_date_parsed, release_date, units = "days"))
  )
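
A fallback between two date parsers has to work element by element. The sketch below, on a made-up vector (dates_toy is hypothetical), contrasts dplyr::coalesce(), which fills each NA from the second parser, with the null-coalescing operator %||% (here taken from purrr), which only swaps in the right-hand side when the left-hand side is NULL as a whole.

```r
library(dplyr)
library(lubridate)
library(purrr)  # provides %||%

# Hypothetical publish dates in mixed formats
dates_toy <- c("03/15/2019", "2020-07-24", "not a date")

# Element-wise fallback: NA from mdy() is filled from ymd()
parsed <- coalesce(mdy(dates_toy, quiet = TRUE),
                   ymd(dates_toy, quiet = TRUE))

# NULL-only fallback: the mdy() result is not NULL, so ymd() is never used
null_fallback <- mdy(dates_toy, quiet = TRUE) %||% ymd(dates_toy, quiet = TRUE)

parsed
null_fallback
```

Entries that neither parser understands stay NA in both versions, which is why a later is.finite() or is.na() filter is still needed.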

5.2 Average Lag Table

lag_tbl <- lag_df %>%
  filter(is.finite(post_lag)) %>%
  group_by(album_name) %>%
  summarise(avg_lag_days = mean(post_lag), .groups = "drop") %>%
  arrange(desc(avg_lag_days))

5.3 Visualization

⏳ On average, Taylor Swift songs appear on YouTube within a few days of release; longer lags highlight either surprise drops or metadata gaps.
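
The plotting code for this section is not shown. As a rough sketch of how the average lag per album could be drawn in the same style as the bar charts in sections 2 and 4, the example below uses a made-up table (lag_tbl_toy and its values are hypothetical, as is the fill colour); on the real data, lag_tbl would take its place.

```r
library(ggplot2)
library(forcats)
library(tibble)

# Hypothetical stand-in for lag_tbl (album_name, avg_lag_days)
lag_tbl_toy <- tibble(
  album_name   = c("Album A", "Album B", "Album C"),
  avg_lag_days = c(2.5, 40, 0)
)

# Horizontal bars, albums ordered by average lag
lag_plot <- ggplot(lag_tbl_toy,
                   aes(x = fct_reorder(album_name, avg_lag_days),
                       y = avg_lag_days)) +
  geom_col(fill = "#1f77b4") +
  coord_flip() +
  labs(title = "Average Lag Between Release and YouTube Publish",
       x = NULL, y = "Average lag (days)") +
  theme_minimal()

lag_plot
```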


6 Conclusion

Skills demonstrated: data joins & cleaning, text mining, date engineering, factor manipulation, and insightful visualization. These techniques provide a multifaceted lens on Taylor Swift’s discography and promotion strategy.


7 Session Info

sessionInfo()
## R version 4.3.3 (2024-02-29)
## Platform: aarch64-apple-darwin20 (64-bit)
## Running under: macOS Sonoma 14.2.1
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: America/Los_Angeles
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] patchwork_1.3.0 janitor_2.2.1   lubridate_1.9.3 forcats_1.0.0  
##  [5] stringr_1.5.1   dplyr_1.1.4     purrr_1.0.2     readr_2.1.5    
##  [9] tidyr_1.3.1     tibble_3.2.1    ggplot2_3.5.1   tidyverse_2.0.0
## 
## loaded via a namespace (and not attached):
##  [1] sass_0.4.9        utf8_1.2.4        generics_0.1.3    stringi_1.8.3    
##  [5] hms_1.1.3         digest_0.6.35     magrittr_2.0.3    evaluate_1.0.3   
##  [9] grid_4.3.3        timechange_0.3.0  fastmap_1.2.0     jsonlite_1.8.8   
## [13] fansi_1.0.6       scales_1.3.0      jquerylib_0.1.4   cli_3.6.2        
## [17] rlang_1.1.3       crayon_1.5.2      bit64_4.0.5       munsell_0.5.0    
## [21] withr_3.0.2       cachem_1.1.0      yaml_2.3.8        parallel_4.3.3   
## [25] tools_4.3.3       tzdb_0.4.0        colorspace_2.1-0  vctrs_0.6.5      
## [29] R6_2.5.1          lifecycle_1.0.4   snakecase_0.11.1  bit_4.0.5        
## [33] vroom_1.6.5       pkgconfig_2.0.3   pillar_1.9.0      bslib_0.6.2      
## [37] gtable_0.3.4      glue_1.7.0        highr_0.10        xfun_0.43        
## [41] tidyselect_1.2.1  rstudioapi_0.16.0 knitr_1.45        farver_2.1.1     
## [45] htmltools_0.5.8   labeling_0.4.3    rmarkdown_2.26    compiler_4.3.3