library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.2
## ✔ ggplot2 4.0.0 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
tuesdata <- tidytuesdayR::tt_load(2025, week = 34)
## ---- Compiling #TidyTuesday Information for 2025-08-26 ----
## --- There are 2 files available ---
##
##
## ── Downloading files ───────────────────────────────────────────────────────────
##
## 1 of 2: "billboard.csv"
## 2 of 2: "topics.csv"
billboard <- tuesdata$billboard
topics <- tuesdata$topics
head(billboard)
## # A tibble: 6 × 105
## song artist date weeks_at_number_one non_consecutive rating_1
## <chr> <chr> <dttm> <dbl> <dbl> <dbl>
## 1 Poor … Ricky… 1958-08-04 00:00:00 2 0 4
## 2 Nel B… Domen… 1958-08-18 00:00:00 5 1 7
## 3 Littl… The E… 1958-08-25 00:00:00 1 0 5
## 4 It's … Tommy… 1958-09-29 00:00:00 6 0 3
## 5 It's … Conwa… 1958-11-10 00:00:00 2 1 7
## 6 Tom D… The K… 1958-11-17 00:00:00 1 0 5
## # ℹ 99 more variables: rating_2 <dbl>, rating_3 <dbl>, overall_rating <dbl>,
## # divisiveness <dbl>, label <chr>, parent_label <chr>, cdr_genre <chr>,
## # cdr_style <chr>, discogs_genre <chr>, discogs_style <chr>,
## # artist_structure <dbl>, featured_artists <chr>,
## # multiple_lead_vocalists <dbl>, group_named_after_non_lead_singer <dbl>,
## # talent_contestant <chr>, posthumous <dbl>, artist_place_of_origin <chr>,
## # front_person_age <dbl>, artist_male <dbl>, artist_white <dbl>, …
head(topics)
## # A tibble: 6 × 1
## lyrical_topics
## <chr>
## 1 Addiction
## 2 Anger
## 3 Appreciation
## 4 Badassery
## 5 Bad Behavior
## 6 Bad Relationships
billboard <- billboard |>
mutate(
primary_genre = str_split_i(cdr_genre, ";", 1)
)
billboard |>
select(cdr_genre, primary_genre) |>
distinct()
## # A tibble: 33 × 2
## cdr_genre primary_genre
## <chr> <chr>
## 1 Pop;Rock Pop
## 2 Pop Pop
## 3 Rock Rock
## 4 Folk/Country Folk/Country
## 5 Folk/Country;March Folk/Country
## 6 Pop;Folk/Country Pop
## 7 Jazz Jazz
## 8 Funk/Soul;Rock Funk/Soul
## 9 Polka Polka
## 10 Funk/Soul Funk/Soul
## # ℹ 23 more rows
Columns Unclear Until I Read the Documentation:
song_structure
artist_male
For song_structure, I think the values are stored as short codes (e.g., A1, C2, E7) so that structural patterns such as verses, choruses, and bridges can be recorded simply and compactly across the wide range of songs in the dataset.
For artist_male, the values are 0 if the artist or group is all female, 1 if the artist or group is all male, 2 if the group includes both men and women, and 3 if the artist or group includes at least one non-binary individual. I think they chose this numeric encoding because it goes beyond a binary flag (0 or 1, TRUE or FALSE) and captures cases where the artist is actually a group with a mixed composition.
If I hadn’t read the documentation for song_structure, I might have treated the values in this column as arbitrary labels and compared them directly, even though they represent specific musical structures involving verses, choruses, and bridges. Working with song_structure that way, I could have reached completely different conclusions about how structurally similar different songs are.
If I hadn’t read the documentation for artist_male, I would have assumed the column was binary, containing only 0s and 1s. I would then have looked only for 0 and 1 values, lost valuable information about mixed-gender groups, and introduced bias into any analysis of artists or groups and their representation.
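To make the four-level coding of artist_male harder to misread, the numeric values can be turned into a labeled factor. The sketch below uses a small made-up tibble (`toy`, with hypothetical artist names) rather than the real data, and the label wording is my own paraphrase of the documentation.

```r
library(dplyr)

# Toy rows for illustration only (hypothetical artists, not from the dataset)
toy <- tibble(
  artist = c("Artist A", "Artist B", "Artist C", "Artist D"),
  artist_male = c(0, 1, 2, 3)
)

# Labels paraphrase the documented coding:
# 0 = all female, 1 = all male, 2 = mixed, 3 = at least one non-binary member
toy_labeled <- toy |>
  mutate(
    artist_gender = factor(
      artist_male,
      levels = c(0, 1, 2, 3),
      labels = c("All female", "All male", "Mixed", "Includes non-binary member")
    )
  )

# Counting by the labeled column keeps all four categories visible
toy_labeled |>
  count(artist_gender)
```

With the labeled column, a filter such as `artist_gender == "All male"` reads unambiguously, and the mixed and non-binary categories cannot be silently dropped by a 0/1 check.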
An element of the data that is unclear even after reading the documentation:
Even after reading the documentation for cdr_genre, several aspects remain unclear. First, how are multiple genres assigned and ordered? Second, we know Chris Dalla Riva and Vinnie Christopher assign the genres, but what are their criteria? Lastly, are genre labels stable over time and applied consistently across artists? Because of this, any analysis of genres, whether over time or across genres, may reflect differences in genre assignment practices rather than actual changes in musical style or audience preferences.
**Assumption for primary_genre (created at the top of this notebook and used here): the first listed genre is the “main” genre**
billboard |>
count(primary_genre, sort = TRUE) |>
slice_max(n, n = 10) |>
ggplot(aes(x = reorder(primary_genre, n), y = n, fill = primary_genre)) +
geom_col(show.legend = FALSE) +
coord_flip() +
labs(
title = "Top 10 Genres Using Primary Genre Only",
x = "Primary Genre",
y = "Number of Songs"
)
This visualization assumes that, when a song has several genres, the first one listed in cdr_genre is its primary genre. The problem is that the documentation does not say whether the order is meaningful or arbitrary, so this plot may be very misleading if the assumption is incorrect.
**Assumption: all listed genres for a song are equally important**
primary_counts <- billboard |>
count(primary_genre, sort = TRUE)
all_genre_counts <- billboard |>
separate_rows(cdr_genre, sep = ";") |>
mutate(cdr_genre = str_trim(cdr_genre)) |>
count(cdr_genre, sort = TRUE)
primary_counts |>
left_join(all_genre_counts, by = c("primary_genre" = "cdr_genre")) |>
rename(
primary_only = n.x,
all_genres = n.y
) |>
mutate(diff = all_genres - primary_only) |>
arrange(desc(diff))
## # A tibble: 13 × 4
## primary_genre primary_only all_genres diff
## <chr> <int> <int> <int>
## 1 Funk/Soul 228 239 11
## 2 Rock 281 291 10
## 3 Folk/Country 25 30 5
## 4 Pop 355 358 3
## 5 Electronic/Dance 91 94 3
## 6 Reggae 10 13 3
## 7 Hip Hop 88 89 1
## 8 Jazz 5 6 1
## 9 Latin 3 4 1
## 10 March 1 2 1
## 11 <NA> 88 88 0
## 12 Blues 1 1 0
## 13 Polka 1 1 0
primary_counts <- billboard |>
count(primary_genre) |>
rename(genre = primary_genre) |>
mutate(method = "Primary genre only")
all_genre_counts <- billboard |>
separate_rows(cdr_genre, sep = ";") |>
mutate(cdr_genre = str_trim(cdr_genre)) |>
count(cdr_genre) |>
rename(genre = cdr_genre) |>
mutate(method = "All listed genres")
bind_rows(primary_counts, all_genre_counts) |>
group_by(method) |>
slice_max(n, n = 10) |>
ungroup() |>
ggplot(aes(x = reorder(genre, n), y = n, fill = method)) +
geom_col(position = "dodge") +
coord_flip() +
labs(
title = "Genre Counts Depend on How Multi-Genre Songs Are Handled",
x = "Genre",
y = "Count"
)
The cdr_genre column allows each song to be labeled with multiple genres separated by semicolons (e.g., “Pop;Rock;Funk”). This is troublesome because we don’t know whether the order of the genres means anything, whether all genres should be treated as equally important, or what the genre assignment criteria are. It matters because genre analyses are often used to study musical trends or cultural shifts, and if the genre assignments are subjective, inconsistent, or arbitrarily ordered, the trends may reflect genre assignment practices rather than real changes in genre.
Risks of analyzing genre with this column:
Overstating the popularity of genres that frequently co-occur with other genres when counting all listed genres
Misrepresenting multi-genre songs when forcing each song into a single primary genre
Drawing conclusions about genres and their trends over time without knowing the genre definitions or genre assignment practices
Ways to reduce these risks:
Report results under multiple reasonable genre encodings (primary genre only and all listed genres) and compare them, as in the second visualization above
Clearly state modeling assumptions when presenting genre analyses
Avoid causal claims about genre trends and frame results as descriptive and sensitive to how genre is encoded
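A third encoding that could be reported alongside the two above is a fractional count, where each song contributes a total weight of 1 split evenly across its listed genres. This is not something the documentation or the original analysis prescribes. It is only a sketch of one more reasonable option, shown on a made-up tibble (`toy`) with a hypothetical `song_id` helper column.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Toy rows for illustration only (not from the dataset)
toy <- tibble(
  song = c("Song A", "Song B", "Song C"),
  cdr_genre = c("Pop;Rock", "Pop", "Rock; Funk/Soul")
)

fractional_counts <- toy |>
  mutate(song_id = row_number()) |>          # hypothetical id, one per song
  separate_rows(cdr_genre, sep = ";") |>
  mutate(cdr_genre = str_trim(cdr_genre)) |>
  group_by(song_id) |>
  mutate(weight = 1 / n()) |>                # split each song evenly across its genres
  ungroup() |>
  count(cdr_genre, wt = weight, name = "weighted_n")

fractional_counts
```

Because every song carries a total weight of 1, the weighted counts sum to the number of songs, which avoids inflating genres that often appear alongside others. The even split is itself an assumption about how much each listed genre matters.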
billboard |>
mutate(year = year(date)) |>
count(year, primary_genre) |>
complete(year, primary_genre)
## # A tibble: 884 × 3
## year primary_genre n
## <dbl> <chr> <int>
## 1 1958 Blues NA
## 2 1958 Electronic/Dance NA
## 3 1958 Folk/Country 1
## 4 1958 Funk/Soul NA
## 5 1958 Hip Hop NA
## 6 1958 Jazz NA
## 7 1958 Latin NA
## 8 1958 March NA
## 9 1958 Polka NA
## 10 1958 Pop 6
## # ℹ 874 more rows
Some genre-year combinations never occur in the data. If I summarized without complete(), those combinations would simply be dropped rather than shown as missing, which could falsely suggest smooth trends.
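If an absent genre-year combination should count as zero songs rather than as unknown, the `fill` argument of complete() can make that explicit. The sketch below uses a small made-up tibble (`toy`). Treating the gaps as true zeros is an assumption that only holds if the dataset covers every #1 song in those years.

```r
library(dplyr)
library(tidyr)

# Toy rows for illustration only (not from the dataset)
toy <- tibble(
  year = c(1958, 1958, 1959),
  primary_genre = c("Pop", "Pop", "Rock")
)

toy_complete <- toy |>
  count(year, primary_genre) |>
  # Add every year x genre combination; assume absent ones mean zero songs
  complete(year, primary_genre, fill = list(n = 0L))

toy_complete
```

Zero-filled rows keep the gaps visible in line charts and prevent per-year averages from being computed only over the years in which a genre happened to appear.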
billboard |>
filter(is.na(cdr_genre) | is.na(lyrical_topic))
## # A tibble: 114 × 106
## song artist date weeks_at_number_one non_consecutive rating_1
## <chr> <chr> <dttm> <dbl> <dbl> <dbl>
## 1 The … "Dave… 1959-05-11 00:00:00 1 0 4
## 2 Slee… "Sant… 1959-09-21 00:00:00 2 0 8
## 3 Them… "Perc… 1960-02-22 00:00:00 9 0 6
## 4 Wond… "Bert… 1961-01-09 00:00:00 3 0 7
## 5 Calc… "Lawr… 1961-02-13 00:00:00 2 0 3
## 6 Stra… "Mr. … 1962-05-26 00:00:00 1 0 3
## 7 The … "Davi… 1962-07-07 00:00:00 1 0 6
## 8 Tels… "The … 1962-12-22 00:00:00 3 0 8
## 9 Fing… "Stev… 1963-08-10 00:00:00 3 0 8
## 10 Love… "Paul… 1968-02-10 00:00:00 5 0 6
## # ℹ 104 more rows
## # ℹ 100 more variables: rating_2 <dbl>, rating_3 <dbl>, overall_rating <dbl>,
## # divisiveness <dbl>, label <chr>, parent_label <chr>, cdr_genre <chr>,
## # cdr_style <chr>, discogs_genre <chr>, discogs_style <chr>,
## # artist_structure <dbl>, featured_artists <chr>,
## # multiple_lead_vocalists <dbl>, group_named_after_non_lead_singer <dbl>,
## # talent_contestant <chr>, posthumous <dbl>, artist_place_of_origin <chr>, …
This filter returns the 114 songs that are missing a genre, a lyrical topic, or both, which may be a strong indicator of incomplete data.
billboard |>
group_by(lyrical_topic) |>
summarise(n = n()) |>
filter(n == 0)
## # A tibble: 0 × 2
## # ℹ 2 variables: lyrical_topic <chr>, n <int>
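The result is empty, so every lyrical topic that appears in billboard has at least one song connected to it. That is true by construction, though: group_by() only creates groups for values that actually occur in the data, so this check cannot detect a topic from the topics table that no song uses.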
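One way to look for topics that never occur is to compare the topics table against the songs with an anti_join(). The sketch below runs on made-up tibbles (`topics_toy`, `billboard_toy`) and assumes each song stores a single topic in lyrical_topic, spelled exactly as in topics. If the real column holds several topics per song, it would need to be split first.

```r
library(dplyr)

# Toy tables for illustration only (not the real data)
topics_toy <- tibble(
  lyrical_topics = c("Addiction", "Anger", "Appreciation")
)
billboard_toy <- tibble(
  song = c("Song A", "Song B", "Song C"),
  lyrical_topic = c("Anger", "Anger", NA)
)

# Keep the topics that have no matching song
unused_topics <- topics_toy |>
  anti_join(billboard_toy, by = c("lyrical_topics" = "lyrical_topic"))

unused_topics
```

Here the join keeps "Addiction" and "Appreciation", the two toy topics with no song attached, which is the kind of implicit gap the grouped count cannot reveal.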
For weeks_at_number_one, I would define an outlier as a song above the 99th percentile of that column. The distribution has a long right tail: a small number of songs spent much longer at #1 than the majority. Extreme values like these can heavily influence summary statistics such as means, as well as regression models, so flagging them as outliers helps prevent misleading summaries. As the output below shows, the 99th percentile is 14 weeks, so only songs that spent more than 14 weeks at #1 count as “outliers” in this instance.
quantile(billboard$weeks_at_number_one, 0.99, na.rm = TRUE)
## 99%
## 14
billboard |>
ggplot(aes(x = weeks_at_number_one)) +
geom_histogram(bins = 40) +
labs(
title = "Distribution of Weeks at #1",
x = "Weeks at Number One",
y = "Number of Songs"
)
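The outlier rule described above (strictly above the 99th percentile) can be applied with a filter. The sketch below uses made-up values in a `toy` tibble, so its cutoff differs from the 14 weeks found in the real data. The names `cutoff` and `outliers` are only illustrative.

```r
library(dplyr)

# Toy rows for illustration only (not from the dataset)
toy <- tibble(
  song = paste("Song", 1:10),
  weeks_at_number_one = c(1, 1, 2, 2, 3, 3, 4, 5, 6, 19)
)

# 99th percentile of the toy column
cutoff <- quantile(toy$weeks_at_number_one, 0.99, na.rm = TRUE)

# Songs strictly above the cutoff are treated as outliers
outliers <- toy |>
  filter(weeks_at_number_one > cutoff)

outliers

# Compare a mean with and without the flagged songs
mean(toy$weeks_at_number_one)
mean(toy$weeks_at_number_one[toy$weeks_at_number_one <= cutoff])
```

Reporting both means side by side shows how much the extreme values pull the summary, without discarding those songs from the dataset.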