Week 2 | Data Dive — Summaries

This Week 2 data dive focuses on getting to know the data, computing a few summary statistics, and building a few basic plots. The dataset we'll be diving into is the Billboard Hot 100 Number Ones, covering August 4th, 1958 through January 11th, 2025.

Loading the Data

tuesdata <- tidytuesdayR::tt_load(2025, week = 34)

## ---- Compiling #TidyTuesday Information for 2025-08-26 ----
## --- There are 2 files available ---
## 
## 
## ── Downloading files ───────────────────────────────────────────────────────────
## 
##   1 of 2: "billboard.csv"
##   2 of 2: "topics.csv"

billboard <- tuesdata$billboard
topics <- tuesdata$topics

Previewing the Data

head(billboard)

## # A tibble: 6 × 105
##   song   artist date                weeks_at_number_one non_consecutive rating_1
##   <chr>  <chr>  <dttm>                            <dbl>           <dbl>    <dbl>
## 1 Poor … Ricky… 1958-08-04 00:00:00                   2               0        4
## 2 Nel B… Domen… 1958-08-18 00:00:00                   5               1        7
## 3 Littl… The E… 1958-08-25 00:00:00                   1               0        5
## 4 It's … Tommy… 1958-09-29 00:00:00                   6               0        3
## 5 It's … Conwa… 1958-11-10 00:00:00                   2               1        7
## 6 Tom D… The K… 1958-11-17 00:00:00                   1               0        5
## # ℹ 99 more variables: rating_2 <dbl>, rating_3 <dbl>, overall_rating <dbl>,
## #   divisiveness <dbl>, label <chr>, parent_label <chr>, cdr_genre <chr>,
## #   cdr_style <chr>, discogs_genre <chr>, discogs_style <chr>,
## #   artist_structure <dbl>, featured_artists <chr>,
## #   multiple_lead_vocalists <dbl>, group_named_after_non_lead_singer <dbl>,
## #   talent_contestant <chr>, posthumous <dbl>, artist_place_of_origin <chr>,
## #   front_person_age <dbl>, artist_male <dbl>, artist_white <dbl>, …

head(topics)

## # A tibble: 6 × 1
##   lyrical_topics   
##   <chr>            
## 1 Addiction        
## 2 Anger            
## 3 Appreciation     
## 4 Badassery        
## 5 Bad Behavior     
## 6 Bad Relationships

Numeric Summary #1 (Categorical) — Unique Artists with #1 Hits and Number of #1 Hits

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ readr     2.1.5
## ✔ ggplot2   4.0.0     ✔ stringr   1.5.2
## ✔ lubridate 1.9.4     ✔ tibble    3.3.0
## ✔ purrr     1.1.0     ✔ tidyr     1.3.1

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

billboard |>
  count(artist, sort = TRUE, name = "num_of_num1_hits")

## # A tibble: 763 × 2
##    artist          num_of_num1_hits
##    <chr>                      <int>
##  1 The Beatles                   20
##  2 Mariah Carey                  16
##  3 Madonna                       12
##  4 Michael Jackson               11
##  5 Whitney Houston               11
##  6 Janet Jackson                 10
##  7 Taylor Swift                  10
##  8 The Supremes                  10
##  9 Bee Gees                       9
## 10 Stevie Wonder                  8
## # ℹ 753 more rows

Based on this categorical summary, the insight gathered is that a significant share of the number one hits in the dataset come from artists who appear more than once. We know this because the full dataset has 1,177 rows, one for each number one song, while the count by artist has only 763 rows. There are fewer unique artists than songs, so some artists must account for several number one hits each. Because the result is sorted from most to fewest number one hits, we can also see who leads the list: The Beatles with 20, followed by Mariah Carey with 16. This is significant because we now know for sure that artists repeat within the dataset, and we can use this going forward for further analyses!
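The row-count comparison above can be turned into a small summary of how many artists repeat and what share of songs they account for. This is a minimal sketch, not part of the original analysis: `summarize_repeat_artists()` is a hypothetical helper, and the toy tibble is made-up data used only to show the shape of the result.

```r
library(dplyr)

# Hypothetical helper: how many artists have more than one number one hit,
# and what share of all number one songs do those artists account for?
# Assumes `data` has one row per number one song and an `artist` column.
summarize_repeat_artists <- function(data) {
  data |>
    count(artist, name = "num_of_num1_hits") |>
    summarize(
      total_songs    = sum(num_of_num1_hits),
      unique_artists = n(),
      repeat_artists = sum(num_of_num1_hits > 1),
      share_of_songs_by_repeat_artists =
        sum(num_of_num1_hits[num_of_num1_hits > 1]) / sum(num_of_num1_hits)
    )
}

# Made-up toy data (not from the Billboard dataset)
toy <- tibble(artist = c("A", "A", "A", "B", "C", "C"))
toy_summary <- summarize_repeat_artists(toy)
toy_summary

# With the real data: summarize_repeat_artists(billboard)
```

On the real data, `total_songs` and `unique_artists` should match the row counts discussed above, which makes this a quick consistency check as well.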

Numeric Summary #2 (Numeric) — Min/Max, Central Tendencies, and Distribution of Weeks at Number One

billboard |>
  summarize(
    min_weeks = min(weeks_at_number_one, na.rm = TRUE),
    max_weeks = max(weeks_at_number_one, na.rm = TRUE),
    mean_weeks = mean(weeks_at_number_one, na.rm = TRUE),
    median_weeks = median(weeks_at_number_one, na.rm = TRUE),
    q1_weeks = quantile(weeks_at_number_one, 0.25, na.rm = TRUE),
    q3_weeks = quantile(weeks_at_number_one, 0.75, na.rm = TRUE)
  )

## # A tibble: 1 × 6
##   min_weeks max_weeks mean_weeks median_weeks q1_weeks q3_weeks
##       <dbl>     <dbl>      <dbl>        <dbl>    <dbl>    <dbl>
## 1         1        19       2.94            2        1        4

Based on this numeric summary, the insight gathered is that the minimum time a song spent at number one was 1 week, whereas the maximum was 19 weeks (extremely impressive). The average (mean) time at number one was roughly 2.94 weeks, and the median is right at 2. The mean sitting above the median, together with a maximum of 19, suggests the distribution is skewed to the right by a small number of long-running hits. Lastly, the first quartile is 1, so at least 25% of songs spent only a single week at number one, and the third quartile is 4, so at least 75% of songs spent 4 weeks or fewer there. This is all significant because we now know the range of weeks spent at number one, the mean and median, and that the middle 50% of songs stay at number one for somewhere between 1 and 4 weeks.
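Because weeks at number one are whole numbers with many ties, a quartile only says that at least 25% or 75% of songs fall at or below it. The exact share can be computed directly. The sketch below uses a hypothetical helper, `share_at_most()`, and a made-up toy vector. It is an illustration, not a result from the dataset.

```r
# Hypothetical helper: share of songs that spent at most `k` weeks at number one.
# `weeks` is assumed to be a numeric vector such as billboard$weeks_at_number_one.
share_at_most <- function(weeks, k) {
  mean(weeks <= k, na.rm = TRUE)
}

# Made-up toy data (not from the Billboard dataset)
toy_weeks <- c(1, 1, 2, 2, 3, 4, 7, 12)
toy_share_1 <- share_at_most(toy_weeks, 1)  # 2 of 8 values
toy_share_4 <- share_at_most(toy_weeks, 4)  # 6 of 8 values

# With the real data: share_at_most(billboard$weeks_at_number_one, 4)
```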

Innovative Questions to Investigate

Which artists are responsible for the most number one hits, and how concentrated are these hits among a smaller group of artists?
How long do songs usually remain at number one, and how common are long-running number ones compared with short-lived ones?
Do artists with more number one hits tend to have songs that stay at number one longer, or is longevity independent of how often an artist reaches number one?

Let’s Explore Whether Artists with More #1 Hits Stay at #1 Longer

billboard |>
  group_by(artist) |>
  summarize(
    avg_weeks_num_one = mean(weeks_at_number_one, na.rm = TRUE),
    num_of_num_one_hits = n()
  ) |>
  arrange(desc(num_of_num_one_hits))

## # A tibble: 763 × 3
##    artist          avg_weeks_num_one num_of_num_one_hits
##    <chr>                       <dbl>               <int>
##  1 The Beatles                  2.9                   20
##  2 Mariah Carey                 4.25                  16
##  3 Madonna                      2.67                  12
##  4 Michael Jackson              2.73                  11
##  5 Whitney Houston              2.82                  11
##  6 Janet Jackson                3.3                   10
##  7 Taylor Swift                 3.3                   10
##  8 The Supremes                 1.8                   10
##  9 Bee Gees                     3.11                   9
## 10 Stevie Wonder                1.75                   8
## # ℹ 753 more rows

Based on this exploration of whether artists with more number one hits also stay at number one longer, the insight provided is that this doesn’t exactly hold true, at least among the top artists shown. The top artist/band, The Beatles, has an astounding 20 number one hits, but their songs average 2.9 weeks at number one. Further down the list, Taylor Swift has half as many number one hits, 10, but her average is 3.3 weeks, 0.4 weeks longer than The Beatles. Mariah Carey, with 16 hits, has the highest average in the top ten at 4.25 weeks, while Stevie Wonder and The Supremes average under 2 weeks. This is significant because it gives us a first overview of how average weeks at number one relates to the number of number one hits. We have not yet measured that relationship across all 763 artists, so we could conduct more analyses on this later on!
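One way to follow up on this comparison is to measure the relationship across all artists rather than reading it off the top ten rows. The sketch below computes a Spearman rank correlation between hit count and average weeks at number one. Spearman is an assumption chosen here because both variables are skewed, and `hits_vs_longevity()` and the toy tibble are hypothetical, made-up illustrations.

```r
library(dplyr)

# Hypothetical helper: rank correlation between an artist's number of
# number one hits and their average weeks at number one.
# Assumes `data` has `artist` and `weeks_at_number_one` columns.
hits_vs_longevity <- function(data) {
  per_artist <- data |>
    group_by(artist) |>
    summarize(
      avg_weeks_num_one   = mean(weeks_at_number_one, na.rm = TRUE),
      num_of_num_one_hits = n()
    )

  cor(
    per_artist$num_of_num_one_hits,
    per_artist$avg_weeks_num_one,
    method = "spearman",
    use = "complete.obs"
  )
}

# Made-up toy data in which more hits go with longer stays
toy <- tibble(
  artist              = c("A", "A", "A", "B", "B", "C"),
  weeks_at_number_one = c(4, 5, 6, 2, 4, 1)
)
toy_rho <- hits_vs_longevity(toy)

# With the real data: hits_vs_longevity(billboard)
```

A value near 0 on the real data would support the reading that longevity is largely independent of how often an artist reaches number one. Keep in mind that most artists have only one hit, so their averages rest on a single song.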

Visualization #1 — Distribution of Weeks at #1

library(ggplot2)

ggplot(billboard, aes(x = weeks_at_number_one)) +
  geom_histogram(
    binwidth = 1,
    boundary = 0.5,
    fill = "steelblue",
    color = "white"
  ) +
  geom_text(
    stat = "bin",
    binwidth = 1,
    aes(label = after_stat(count)),
    vjust = -0.3,
    size = 3
  ) +
  scale_x_continuous(breaks = 1:max(billboard$weeks_at_number_one, na.rm = TRUE)) +
  labs(
    title = "Distribution of Weeks at Number One",
    x = "Weeks at Number One",
    y = "Number of Songs"
  )

Based on this visualization of the distribution of weeks at number one, the insight provided is rather clear: it gets increasingly more difficult/impressive for a song to hold the number one spot as the weeks go on. A one-week stay is the most common outcome (though not a majority of songs, since the median is 2 weeks), and the counts steadily decrease from there. The decline is not perfectly smooth, though. From week 11 to week 12 the number of songs actually increases by 2, and after dipping to 2 songs at week 13 it jumps back up to 7 songs at week 14. There is an overarching pattern, but the long tail has some bumps worth looking into.
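The bar heights and the week-to-week changes described above can also be read from a table instead of the plot labels. This is a minimal sketch with a hypothetical helper, `weeks_table()`, and made-up toy data. Note that weeks with zero songs do not appear as rows, so each change is relative to the previous observed value.

```r
library(dplyr)

# Hypothetical helper: number of songs at each weeks-at-number-one value,
# plus the change from the previous observed value.
# Assumes `data` has a `weeks_at_number_one` column.
weeks_table <- function(data) {
  data |>
    count(weeks_at_number_one, name = "num_songs") |>
    arrange(weeks_at_number_one) |>
    mutate(change_from_previous = num_songs - dplyr::lag(num_songs))
}

# Made-up toy data (not from the Billboard dataset)
toy <- tibble(weeks_at_number_one = c(1, 1, 1, 2, 2, 3, 5, 5))
toy_table <- weeks_table(toy)
toy_table

# With the real data: weeks_table(billboard)
```

Positive values in `change_from_previous` mark the places where the count rises instead of falling, such as the bumps noted in the long tail.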

Visualization #2 — Distribution of Danceability by Genre

ggplot(billboard, aes(x = cdr_genre, y = danceability, fill = cdr_genre)) +
  geom_boxplot(show.legend = FALSE) +
  labs(
    title = "Distribution of Danceability by Genre",
    x = "Genre",
    y = "Danceability"
  ) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

Based on this visualization of the distribution of danceability by genre, the box plots show how danceability scores (score of 1-100 provided by Spotify) are distributed across the genres in cdr_genre. The warning above indicates that 2 songs had no usable danceability value and were dropped from the plot. The differences in median danceability and in variability show that some genres tend to produce more danceable number one songs than others. This is significant because it shows how danceability varies across genres among songs that reached number one. Since every song in the dataset is a number one hit, the plot alone cannot tell us whether danceability improves a song’s chances of getting there, but genres with a higher median danceability may point to more mainstream listening habits, and getting people dancing could be a factor in overall success!
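The medians and spreads that the box plots show can be tabulated to rank genres and to see how many songs sit behind each box. The sketch below is an illustration under stated assumptions: `danceability_by_genre()` is a hypothetical helper and the toy tibble is made-up data.

```r
library(dplyr)

# Hypothetical helper: per-genre song count, median, and IQR of danceability.
# Assumes `data` has `cdr_genre` and `danceability` columns.
danceability_by_genre <- function(data) {
  data |>
    filter(!is.na(danceability)) |>   # mirror the rows ggplot2 drops
    group_by(cdr_genre) |>
    summarize(
      num_songs           = n(),
      median_danceability = median(danceability),
      iqr_danceability    = IQR(danceability)
    ) |>
    arrange(desc(median_danceability))
}

# Made-up toy data (not from the Billboard dataset)
toy <- tibble(
  cdr_genre    = c("Pop", "Pop", "Pop", "Rock", "Rock", "Funk"),
  danceability = c(60, 70, 80, 40, NA, 90)
)
toy_genres <- danceability_by_genre(toy)
toy_genres

# With the real data: danceability_by_genre(billboard)
```

Checking `num_songs` matters here because a genre with only a handful of number one songs can have a high or low median by chance.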

Week2DataDive

Grant Starnes

2026-01-26