Week 5 | Data Dive — Documentation


Loading the Dataset


library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.2
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)

tuesdata <- tidytuesdayR::tt_load(2025, week = 34)
## ---- Compiling #TidyTuesday Information for 2025-08-26 ----
## --- There are 2 files available ---
## 
## 
## ── Downloading files ───────────────────────────────────────────────────────────
## 
##   1 of 2: "billboard.csv"
##   2 of 2: "topics.csv"
billboard <- tuesdata$billboard
topics <- tuesdata$topics

Overview of the Dataset


head(billboard)
## # A tibble: 6 × 105
##   song   artist date                weeks_at_number_one non_consecutive rating_1
##   <chr>  <chr>  <dttm>                            <dbl>           <dbl>    <dbl>
## 1 Poor … Ricky… 1958-08-04 00:00:00                   2               0        4
## 2 Nel B… Domen… 1958-08-18 00:00:00                   5               1        7
## 3 Littl… The E… 1958-08-25 00:00:00                   1               0        5
## 4 It's … Tommy… 1958-09-29 00:00:00                   6               0        3
## 5 It's … Conwa… 1958-11-10 00:00:00                   2               1        7
## 6 Tom D… The K… 1958-11-17 00:00:00                   1               0        5
## # ℹ 99 more variables: rating_2 <dbl>, rating_3 <dbl>, overall_rating <dbl>,
## #   divisiveness <dbl>, label <chr>, parent_label <chr>, cdr_genre <chr>,
## #   cdr_style <chr>, discogs_genre <chr>, discogs_style <chr>,
## #   artist_structure <dbl>, featured_artists <chr>,
## #   multiple_lead_vocalists <dbl>, group_named_after_non_lead_singer <dbl>,
## #   talent_contestant <chr>, posthumous <dbl>, artist_place_of_origin <chr>,
## #   front_person_age <dbl>, artist_male <dbl>, artist_white <dbl>, …
head(topics)
## # A tibble: 6 × 1
##   lyrical_topics   
##   <chr>            
## 1 Addiction        
## 2 Anger            
## 3 Appreciation     
## 4 Badassery        
## 5 Bad Behavior     
## 6 Bad Relationships

Cleaning the cdr_genre column and creating primary_genre


billboard <- billboard |>
  mutate(
    primary_genre = str_split_i(cdr_genre, ";", 1)
  )
billboard |>
  select(cdr_genre, primary_genre) |>
  distinct()
## # A tibble: 33 × 2
##    cdr_genre          primary_genre
##    <chr>              <chr>        
##  1 Pop;Rock           Pop          
##  2 Pop                Pop          
##  3 Rock               Rock         
##  4 Folk/Country       Folk/Country 
##  5 Folk/Country;March Folk/Country 
##  6 Pop;Folk/Country   Pop          
##  7 Jazz               Jazz         
##  8 Funk/Soul;Rock     Funk/Soul    
##  9 Polka              Polka        
## 10 Funk/Soul          Funk/Soul    
## # ℹ 23 more rows
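
A small sketch of how the split behaves on edge cases may help here. The toy values below are made up for illustration (they are not rows from the Billboard file), and wrapping the result in str_trim() is an extra precaution I am assuming could be useful if any entry has spaces around the semicolon.

```r
library(dplyr)
library(stringr)

# Toy data (hypothetical values, not taken from the Billboard file)
toy <- tibble(cdr_genre = c("Pop;Rock", "Funk/Soul", " Rock ; Pop", NA))

toy_primary <- toy |>
  mutate(
    # Take the first ";"-separated piece, then drop surrounding whitespace
    primary_genre = str_trim(str_split_i(cdr_genre, ";", 1))
  )

# Single-genre entries pass through unchanged and NA stays NA
toy_primary
```

Missing genres stay missing under this approach, so they show up later as their own NA group when counting.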

List of Columns in the Billboard Hot 100 #1 Hits Dataset that were Unclear Until I Read the Documentation


Columns Unclear Until I Read the Documentation:

  • song_structure

  • artist_male

Why I think they chose to encode the data the way they did


For song_structure, I think the values are encoded as short codes (e.g., A1, C2, E7) so that fairly complex information about musical structure, such as patterns of verses, choruses, and bridges, can be stored simply and compactly across the wide range of songs in the dataset.

For artist_male, the data is encoded as follows: 0 if the artist or group is all female, 1 if the artist or group is all male, 2 if the group includes both males and females, and 3 if the artist or group includes at least one non-binary individual. I think they chose this numeric encoding because it lets the dataset go beyond a binary flag (0 or 1, TRUE or FALSE) and capture cases where the artist is actually a group and that group has a mixed composition.
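
One way to make this encoding harder to misread is to turn the codes into a labeled factor before analysis. This is only a sketch on toy values: the new column name artist_gender and the label wording are my own choices, not something the documentation defines.

```r
library(dplyr)

# Toy data (hypothetical values using the documented 0-3 codes)
toy <- tibble(artist_male = c(0, 1, 1, 2, 3))

toy_labeled <- toy |>
  mutate(
    # artist_gender is a hypothetical helper column; the labels paraphrase the documentation
    artist_gender = factor(
      artist_male,
      levels = 0:3,
      labels = c("All female", "All male", "Mixed", "Includes non-binary member")
    )
  )

# Counts are now reported per labeled category instead of per raw code
count(toy_labeled, artist_gender)
```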

What could have happened if I didn’t read the documentation


If I hadn’t read the documentation for song_structure, I might have treated the values in this column as arbitrary category labels and compared them directly, even though they actually represent specific musical structures involving verses, choruses, and bridges. Working with song_structure that way, I could have reached very different conclusions about how structurally similar different songs are.

If I hadn’t read the documentation for artist_male, I would have assumed the column was binary, containing only 0s and 1s. I would then have looked only at the 0 and 1 values, lost valuable information about mixed-gender groups and artists with non-binary members, and introduced considerable bias into any analysis of artist and group representation.
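
To make the risk concrete, here is a minimal sketch on made-up values showing how a binary reading of artist_male goes wrong. The vector is hypothetical and the variable names are mine.

```r
# Toy codes (hypothetical): one all-female act, two all-male acts,
# one mixed group, one act with a non-binary member
artist_male <- c(0, 1, 1, 2, 3)

# Naive "share of male artists" that treats the column as a 0/1 flag:
# the codes 2 and 3 inflate the mean, so the result is not a proportion at all
naive_share <- mean(artist_male)

# Share of all-male acts computed from the documented meaning of code 1
all_male_share <- mean(artist_male == 1)

# Number of acts that a filter on 0/1 values would silently drop
dropped <- sum(!artist_male %in% c(0, 1))

c(naive_share = naive_share, all_male_share = all_male_share, dropped = dropped)
```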

Element(s) of the Billboard Hot 100 #1 Hits Dataset That Are Still Unclear After Reading the Documentation


An element of the data that is unclear even after reading the documentation:

  • cdr_genre

Is there anything about the data that the documentation does not explain?


Even after reading the documentation for cdr_genre, several aspects remain unclear. First, how are multiple genres assigned and ordered? We know Chris Dalla Riva and Vinnie Christopher assign the genres, but what criteria do they use? Lastly, are genre labels stable over time and applied consistently across artists? Because of these open questions, any analysis of genres, whether over time or across genres, may reflect differences in genre assignment practices rather than actual changes in musical style or audience preferences.
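
One rough way to probe the ordering question is to check whether the same set of genres ever appears in more than one order. The sketch below uses toy strings rather than the real column, and genre_set and n_orderings are hypothetical helper names. Finding both "Pop;Rock" and "Rock;Pop" would not prove that order is meaningful, but it would show that order is at least not fixed alphabetically.

```r
library(dplyr)
library(stringr)
library(purrr)

# Toy data (hypothetical values, not taken from the Billboard file)
toy <- tibble(
  cdr_genre = c("Pop;Rock", "Rock;Pop", "Pop;Rock", "Funk/Soul", "Funk/Soul;Rock")
)

order_check <- toy |>
  mutate(
    # Order-independent key: split, trim, sort, and re-join the genres
    genre_set = map_chr(
      str_split(cdr_genre, ";"),
      \(g) str_c(sort(str_trim(g)), collapse = ";")
    )
  ) |>
  group_by(genre_set) |>
  # Count how many distinct orderings exist for each set of genres
  summarise(n_orderings = n_distinct(cdr_genre), .groups = "drop") |>
  filter(n_orderings > 1)

order_check
```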

Building Visualizations Using the Column of Data Affected Mentioned Previously (cdr_genre)


Visualization #1: Top Genres Using Primary Genre Only

** Assumption for primary_genre that was created at the top of this notebook and is being used here: the first listed genre is the “main” genre **

billboard |>
  count(primary_genre, sort = TRUE) |>
  slice_max(n, n = 10) |>
  ggplot(aes(x = reorder(primary_genre, n), y = n, fill = primary_genre)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  labs(
    title = "Top 10 Genres Using Primary Genre Only",
    x = "Primary Genre",
    y = "Number of Songs"
  )

This visualization assumes that, when several genres are listed in cdr_genre, the first one is the primary genre for that song. The problem is that the documentation does not explain whether the genres are listed in a meaningful order or an arbitrary one, so this plot may be very misleading if the assumption is incorrect.

Visualization #2: Top Genres for All Listed Genres

** Assumption: All listed genres for a song are equal **

primary_counts <- billboard |>
  count(primary_genre, sort = TRUE)

all_genre_counts <- billboard |>
  separate_rows(cdr_genre, sep = ";") |>
  mutate(cdr_genre = str_trim(cdr_genre)) |>
  count(cdr_genre, sort = TRUE)

primary_counts |>
  left_join(all_genre_counts, by = c("primary_genre" = "cdr_genre")) |>
  rename(
    primary_only = n.x,
    all_genres = n.y
  ) |>
  mutate(diff = all_genres - primary_only) |>
  arrange(desc(diff))
## # A tibble: 13 × 4
##    primary_genre    primary_only all_genres  diff
##    <chr>                   <int>      <int> <int>
##  1 Funk/Soul                 228        239    11
##  2 Rock                      281        291    10
##  3 Folk/Country               25         30     5
##  4 Pop                       355        358     3
##  5 Electronic/Dance           91         94     3
##  6 Reggae                     10         13     3
##  7 Hip Hop                    88         89     1
##  8 Jazz                        5          6     1
##  9 Latin                       3          4     1
## 10 March                       1          2     1
## 11 <NA>                       88         88     0
## 12 Blues                       1          1     0
## 13 Polka                       1          1     0
primary_counts <- billboard |>
  count(primary_genre) |>
  rename(genre = primary_genre) |>
  mutate(method = "Primary genre only")

all_genre_counts <- billboard |>
  separate_rows(cdr_genre, sep = ";") |>
  mutate(cdr_genre = str_trim(cdr_genre)) |>
  count(cdr_genre) |>
  rename(genre = cdr_genre) |>
  mutate(method = "All listed genres")

bind_rows(primary_counts, all_genre_counts) |>
  group_by(method) |>
  slice_max(n, n = 10) |>
  ungroup() |>
  ggplot(aes(x = reorder(genre, n), y = n, fill = method)) +
  geom_col(position = "dodge") +
  coord_flip() +
  labs(
    title = "Genre Counts Depend on How Multi-Genre Songs Are Handled",
    x = "Genre",
    y = "Count"
  )

Explanation of what is unclear and why it might be concerning


The cdr_genre column allows each song to be labeled with multiple genres separated by semicolons (e.g., “Pop;Rock;Funk”). This is troublesome because we don’t know whether the order of the genres means anything, whether all genres should be treated as equally important, or what the genre assignment criteria are. It is concerning because genre analyses are often used to study musical trends or cultural shifts, and if the genre assignments are subjective, inconsistent, or arbitrarily ordered, the apparent trends may reflect assignment practices rather than real changes in genre.

Possible Significant Risks and What I Could Do to Reduce Negative Consequences


Significant Risks:

  • Overstating the popularity of genres that frequently co-occur with other genres when counting all listed genres

  • Misrepresentation of multi-genre songs when forcing each song into only a primary genre

  • Drawing conclusions about genres and their trends over time without knowing the genre definitions or genre assignment practices

Ways to Reduce Negative Consequences:

  • Report results using multiple reasonable genre encodings (primary genre only and all listed genres) and compare them, as in the second visualization above

  • Clearly state modeling assumptions when presenting genre analyses

  • Avoid causal claims about genre trends and frame results as descriptive and sensitive to how genre is encoded
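
As one more "reasonable encoding" to compare against, a fractional weighting could split each song evenly across its listed genres, so a song tagged with two genres adds 0.5 to each. This is a sketch of an alternative I am suggesting, not something the documentation prescribes. The toy data and the weight and weighted_n names are hypothetical, and it assumes song identifies a row uniquely (in the real data, grouping by song and artist or by a row id may be needed).

```r
library(dplyr)
library(tidyr)
library(stringr)

# Toy data (hypothetical values, not taken from the Billboard file)
toy <- tibble(
  song = c("Song A", "Song B", "Song C"),
  cdr_genre = c("Pop;Rock", "Pop", "Rock;Funk/Soul;Pop")
)

weighted_counts <- toy |>
  separate_rows(cdr_genre, sep = ";") |>
  mutate(cdr_genre = str_trim(cdr_genre)) |>
  group_by(song) |>
  # Each song contributes a total weight of 1, shared equally by its genres
  mutate(weight = 1 / n()) |>
  ungroup() |>
  count(cdr_genre, wt = weight, name = "weighted_n", sort = TRUE)

weighted_counts
```

Because every song carries a total weight of 1, the weighted counts sum to the number of songs, which avoids the double counting that the all-genres method introduces.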

Checking for Explicitly/Implicitly Missing Rows and Empty Groups for Categorical Columns


Implicitly Missing Rows

billboard |>
  mutate(year = year(date)) |>
  count(year, primary_genre) |>
  complete(year, primary_genre)
## # A tibble: 884 × 3
##     year primary_genre        n
##    <dbl> <chr>            <int>
##  1  1958 Blues               NA
##  2  1958 Electronic/Dance    NA
##  3  1958 Folk/Country         1
##  4  1958 Funk/Soul           NA
##  5  1958 Hip Hop             NA
##  6  1958 Jazz                NA
##  7  1958 Latin               NA
##  8  1958 March               NA
##  9  1958 Polka               NA
## 10  1958 Pop                  6
## # ℹ 874 more rows

Some genre-year combinations do not appear in the data at all. If I summarized without complete(), those absences would silently drop out of the result, which could falsely suggest smooth trends.
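
If the goal is to plot or average these counts, it can help to record absent combinations as explicit zeros rather than NA. The sketch below uses the fill argument of complete() on toy data. The values are made up, and treating an absent combination as a true zero is itself an assumption.

```r
library(dplyr)
library(tidyr)

# Toy data (hypothetical): no Rock song reaches #1 in 1959
toy <- tibble(
  year = c(1958, 1958, 1959),
  primary_genre = c("Pop", "Rock", "Pop")
)

toy_complete <- toy |>
  count(year, primary_genre) |>
  # Add the missing year-genre pairs and record them as zero songs
  complete(year, primary_genre, fill = list(n = 0L))

toy_complete
```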

Explicitly Missing Rows

billboard |> 
  filter(is.na(cdr_genre) | is.na(lyrical_topic))
## # A tibble: 114 × 106
##    song  artist date                weeks_at_number_one non_consecutive rating_1
##    <chr> <chr>  <dttm>                            <dbl>           <dbl>    <dbl>
##  1 The … "Dave… 1959-05-11 00:00:00                   1               0        4
##  2 Slee… "Sant… 1959-09-21 00:00:00                   2               0        8
##  3 Them… "Perc… 1960-02-22 00:00:00                   9               0        6
##  4 Wond… "Bert… 1961-01-09 00:00:00                   3               0        7
##  5 Calc… "Lawr… 1961-02-13 00:00:00                   2               0        3
##  6 Stra… "Mr. … 1962-05-26 00:00:00                   1               0        3
##  7 The … "Davi… 1962-07-07 00:00:00                   1               0        6
##  8 Tels… "The … 1962-12-22 00:00:00                   3               0        8
##  9 Fing… "Stev… 1963-08-10 00:00:00                   3               0        8
## 10 Love… "Paul… 1968-02-10 00:00:00                   5               0        6
## # ℹ 104 more rows
## # ℹ 100 more variables: rating_2 <dbl>, rating_3 <dbl>, overall_rating <dbl>,
## #   divisiveness <dbl>, label <chr>, parent_label <chr>, cdr_genre <chr>,
## #   cdr_style <chr>, discogs_genre <chr>, discogs_style <chr>,
## #   artist_structure <dbl>, featured_artists <chr>,
## #   multiple_lead_vocalists <dbl>, group_named_after_non_lead_singer <dbl>,
## #   talent_contestant <chr>, posthumous <dbl>, artist_place_of_origin <chr>, …

This filter returns the 114 songs that are missing a genre, a lyrical topic, or both, which may be a strong indicator of incomplete data.
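
Since the filter combines two conditions with OR, it may be worth separating how much of the missingness comes from each column. A minimal sketch on toy data follows; the values and the summary column names are hypothetical.

```r
library(dplyr)

# Toy data (hypothetical values, not taken from the Billboard file)
toy <- tibble(
  cdr_genre = c("Pop", NA, NA, "Rock"),
  lyrical_topic = c("Love", "Love", NA, NA)
)

na_summary <- toy |>
  summarise(
    missing_genre = sum(is.na(cdr_genre)),
    missing_topic = sum(is.na(lyrical_topic)),
    # Rows where both columns are missing at once
    missing_both = sum(is.na(cdr_genre) & is.na(lyrical_topic))
  )

na_summary
```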

Empty Groups

billboard |>
  group_by(lyrical_topic) |>
  summarise(n = n()) |>
  filter(n == 0)
## # A tibble: 0 × 2
## # ℹ 2 variables: lyrical_topic <chr>, n <int>

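This check returns no rows. Note, however, that grouping a character column only creates groups for values that actually occur, so every group has at least one song by construction. The result confirms only that each lyrical topic present in the data is attached to at least one song; it cannot reveal topics that are defined elsewhere but never used.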
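
To surface truly empty groups, the observed values need to be compared against a known list of categories. The sketch below does this by converting the column to a factor with fixed levels and counting with .drop = FALSE. The data is a toy example. In the real data the levels would presumably come from topics$lyrical_topics, and I am assuming each lyrical_topic entry holds a single topic (if it holds several, it would need to be split first).

```r
library(dplyr)

# Hypothetical reference list of topics (stand-in for the topics table)
known_topics <- c("Addiction", "Anger", "Appreciation")

# Toy data (hypothetical): "Appreciation" is never used
toy <- tibble(lyrical_topic = c("Anger", "Anger", "Addiction"))

unused_topics <- toy |>
  # Fixed levels let unused categories survive as groups
  mutate(lyrical_topic = factor(lyrical_topic, levels = known_topics)) |>
  # .drop = FALSE keeps factor levels with zero rows in the output
  count(lyrical_topic, .drop = FALSE) |>
  filter(n == 0)

unused_topics
```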

What I Would Define as an Outlier and Why for weeks_at_number_one


For weeks_at_number_one, I would define an outlier as a song above the 99th percentile of that column. The distribution is right-skewed: a small number of songs spent much longer at #1 than the majority. Extreme values like these can heavily influence summary statistics such as means, as well as regression models, so flagging them as outliers helps prevent misleading summaries. As the output below shows, the 99th percentile is 14 weeks, so only songs that spent more than 14 weeks at #1 count as “outliers” under this definition.

quantile(billboard$weeks_at_number_one, 0.99, na.rm = TRUE)
## 99% 
##  14
billboard |>
  ggplot(aes(x = weeks_at_number_one)) +
  geom_histogram(bins = 40) +
  labs(
    title = "Distribution of Weeks at #1",
    x = "Weeks at Number One",
    y = "Number of Songs"
  )
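
To see how many songs the 99th-percentile rule actually flags, the threshold can be applied as a filter. This is a sketch on made-up values (100 toy songs), so the count it produces says nothing about the real dataset.

```r
library(dplyr)

# Toy data (hypothetical): 100 songs, most with short runs at #1
toy <- tibble(
  weeks_at_number_one = c(rep(1, 50), rep(2, 30), rep(5, 15), 10, 12, 14, 16, 19)
)

# 99th percentile of the toy column
threshold <- quantile(toy$weeks_at_number_one, 0.99, na.rm = TRUE)

# Songs strictly above the threshold are the outliers under this definition
outliers <- toy |>
  filter(weeks_at_number_one > threshold)

nrow(outliers)
```

Using a strict greater-than comparison matters here, because songs sitting exactly at the threshold would otherwise be counted as outliers too.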