library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.2
## ✔ ggplot2 4.0.0 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
tuesdata <- tidytuesdayR::tt_load(2025, week = 34)
## ---- Compiling #TidyTuesday Information for 2025-08-26 ----
## --- There are 2 files available ---
##
##
## ── Downloading files ───────────────────────────────────────────────────────────
##
## 1 of 2: "billboard.csv"
## 2 of 2: "topics.csv"
billboard <- tuesdata$billboard
topics <- tuesdata$topics
head(billboard)
## # A tibble: 6 × 105
## song artist date weeks_at_number_one non_consecutive rating_1
## <chr> <chr> <dttm> <dbl> <dbl> <dbl>
## 1 Poor … Ricky… 1958-08-04 00:00:00 2 0 4
## 2 Nel B… Domen… 1958-08-18 00:00:00 5 1 7
## 3 Littl… The E… 1958-08-25 00:00:00 1 0 5
## 4 It's … Tommy… 1958-09-29 00:00:00 6 0 3
## 5 It's … Conwa… 1958-11-10 00:00:00 2 1 7
## 6 Tom D… The K… 1958-11-17 00:00:00 1 0 5
## # ℹ 99 more variables: rating_2 <dbl>, rating_3 <dbl>, overall_rating <dbl>,
## # divisiveness <dbl>, label <chr>, parent_label <chr>, cdr_genre <chr>,
## # cdr_style <chr>, discogs_genre <chr>, discogs_style <chr>,
## # artist_structure <dbl>, featured_artists <chr>,
## # multiple_lead_vocalists <dbl>, group_named_after_non_lead_singer <dbl>,
## # talent_contestant <chr>, posthumous <dbl>, artist_place_of_origin <chr>,
## # front_person_age <dbl>, artist_male <dbl>, artist_white <dbl>, …
head(topics)
## # A tibble: 6 × 1
## lyrical_topics
## <chr>
## 1 Addiction
## 2 Anger
## 3 Appreciation
## 4 Badassery
## 5 Bad Behavior
## 6 Bad Relationships
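# Treat the first genre listed in cdr_genre as the song's primary genre
billboard <- billboard |>
  mutate(
    primary_genre = str_split_i(cdr_genre, ";", 1)
  )
# Check how each cdr_genre value maps to a primary_genre
billboard |>
  select(cdr_genre, primary_genre) |>
  distinct()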
## # A tibble: 33 × 2
## cdr_genre primary_genre
## <chr> <chr>
## 1 Pop;Rock Pop
## 2 Pop Pop
## 3 Rock Rock
## 4 Folk/Country Folk/Country
## 5 Folk/Country;March Folk/Country
## 6 Pop;Folk/Country Pop
## 7 Jazz Jazz
## 8 Funk/Soul;Rock Funk/Soul
## 9 Polka Polka
## 10 Funk/Soul Funk/Soul
## # ℹ 23 more rows
Response Variable
weeks_at_number_one (number of weeks a song was at the #1 spot on the Billboard Hot 100)
Explanatory Variable
primary_genre (derived from the original cdr_genre column, which separates multiple genres with “;”; when a song lists more than one genre, the first one is assumed to be the primary genre)
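As a small illustration of the splitting rule described above, the sketch below applies `str_split_i()` to three strings. The first two appear in the `cdr_genre` output above; the `NA` is added only for illustration. It suggests that a missing genre stays missing after the split, which is why missing values need to be handled before counting genres.

```r
library(stringr)

# Two genre strings seen in the output above, plus an NA added for illustration
example_genres <- c("Pop;Rock", "Folk/Country", NA)

# Keep the first ";"-separated piece of each string;
# "Folk/Country" is untouched because "/" is not the separator
first_genre <- str_split_i(example_genres, ";", 1)
first_genre
```

Note that this rule depends entirely on the order in which genres were entered in `cdr_genre`, so "Pop;Rock" and "Rock;Pop" would be assigned to different primary genres.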
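# Find the ten most common primary genres, ignoring missing values
top_genres <- billboard |>
  filter(!is.na(primary_genre)) |>
  count(primary_genre, sort = TRUE) |>
  slice_head(n = 10) |>
  pull(primary_genre)
# Keep only songs whose primary genre is one of those ten
billboard_filtered <- billboard |>
  filter(primary_genre %in% top_genres)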
top_genres
## [1] "Pop" "Rock" "Funk/Soul" "Electronic/Dance"
## [5] "Hip Hop" "Folk/Country" "Reggae" "Jazz"
## [9] "Latin" "Blues"
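Null Hypothesis –> H0: the mean number of weeks at number one is the same across all ten primary genres.
Alternative Hypothesis –> HA: at least one primary genre has a mean number of weeks at number one that differs from the others.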
library(viridis)
## Loading required package: viridisLite
# Box plot of weeks at number one for each of the ten most common genres
ggplot(billboard_filtered, aes(x = primary_genre, y = weeks_at_number_one, fill = primary_genre)) +
  geom_boxplot() +
  scale_fill_viridis_d(option = "plasma") +
  labs(
    title = "Distribution of Weeks at Number One by Primary Genre",
    x = "Primary Genre",
    y = "Weeks at Number One",
    fill = "Genre"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        legend.position = "none")
anova_model <- aov(weeks_at_number_one ~ primary_genre, data = billboard_filtered)
summary(anova_model)
## Df Sum Sq Mean Sq F value Pr(>F)
## primary_genre 9 381 42.29 6.792 1.57e-09 ***
## Residuals 1077 6706 6.23
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Based on the ANOVA run above, the p-value is roughly 1.57 × 10⁻⁹, far below the conventional significance level of α = 0.05. We therefore reject the null hypothesis that the mean number of weeks a song stays at number one on the Billboard Hot 100 chart is the same across the ten most common primary genres. In other words, there is strong evidence that at least one genre’s mean differs from the others, although the ANOVA alone does not identify which genres differ. Overall, primary genre has a statistically significant relationship with how long a song remains at number one on the Billboard Hot 100 chart.
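An ANOVA only indicates that some group means differ. One common follow-up, sketched here under stated assumptions, is Tukey's HSD, which compares every pair of groups while adjusting for multiple comparisons. The helper name `tukey_table()` is hypothetical, and the demonstration uses R's built-in `PlantGrowth` data rather than the Billboard data so that it runs on its own.

```r
# Hypothetical helper: pairwise Tukey HSD comparisons as a data frame,
# sorted from smallest to largest adjusted p-value
tukey_table <- function(model, factor_name) {
  res <- TukeyHSD(model, which = factor_name)[[factor_name]]
  out <- data.frame(
    comparison = rownames(res),
    diff       = res[, "diff"],   # difference in group means
    lwr        = res[, "lwr"],    # lower bound of the 95% interval
    upr        = res[, "upr"],    # upper bound of the 95% interval
    p_adj      = res[, "p adj"],  # p-value adjusted for multiple comparisons
    row.names  = NULL
  )
  out[order(out$p_adj), ]
}

# Demonstration on the built-in PlantGrowth data (not the Billboard data)
demo_model <- aov(weight ~ group, data = PlantGrowth)
tukey_table(demo_model, "group")
```

For this analysis, the analogous call would be `tukey_table(anova_model, "primary_genre")`, which would produce 45 pairwise comparisons for ten genres. Since `TukeyHSD()` expects the grouping variable to be a factor, it may be safest to convert `primary_genre` with `factor()` and refit the model first.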
These results suggest that primary genre is associated with chart longevity: some genres tend to produce number-one songs that hold the top spot longer than others. In practice, this could reflect differences in audience size, streaming popularity (Spotify, Apple Music, etc.), radio play, or cultural trends tied to particular genres, though none of these were tested here. For music-industry stakeholders such as artists, producers, and record labels, the ANOVA indicates that primary genre is worth considering when thinking about how long a song might stay at number one on the Billboard Hot 100 chart. That said, a statistically significant result does not show that genre causes longer runs, nor how large the differences are, and primary genre alone does not determine a song’s success.
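Statistical significance does not say how much of the variation in weeks at number one is associated with genre. One rough gauge is eta squared, the between-group sum of squares divided by the total sum of squares. The sketch below computes it from the rounded values printed in the ANOVA table above, so the result is approximate.

```r
# Sums of squares as printed in the ANOVA table above (rounded)
ss_genre    <- 381    # primary_genre
ss_residual <- 6706   # Residuals

# Eta squared: share of total variation associated with the grouping factor
eta_squared <- ss_genre / (ss_genre + ss_residual)
round(eta_squared, 3)
```

The value comes out at roughly 0.05, meaning primary genre accounts for about 5% of the variation in weeks at number one among these songs. The relationship is statistically clear but modest in size.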
Continuous Column that May Influence the Response Variable (weeks_at_number_one)
bpm (beats per minute as provided by Spotify)
lm_bpm <- lm(weeks_at_number_one ~ bpm, data = billboard)
summary(lm_bpm)
##
## Call:
## lm(formula = weeks_at_number_one ~ bpm, data = billboard)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.9634 -1.9363 -0.9395 1.0523 16.0718
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.8979091 0.3723397 7.783 1.55e-14 ***
## bpm 0.0003743 0.0031440 0.119 0.905
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.646 on 1173 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 1.208e-05, Adjusted R-squared: -0.0008404
## F-statistic: 0.01418 on 1 and 1173 DF, p-value: 0.9052
# Scatter plot of tempo against weeks at number one, with a fitted linear trend
ggplot(billboard, aes(x = bpm, y = weeks_at_number_one)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    title = "Tempo (bpm) vs. Weeks at Number One",
    x = "Beats per Minute",
    y = "Weeks at Number One"
  ) +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).
First, the intercept is about 2.898 and represents the predicted number of weeks a song would stay at number one when bpm is zero. This has no practical meaning, since a tempo of zero doesn’t occur in music. Second, the bpm coefficient is roughly 0.000374, which is the expected change in weeks at number one for each one-unit increase in bpm. This coefficient is very small and suggests essentially no linear relationship between tempo and how long a song stays at number one. The p-value for bpm is 0.905, far above the usual significance threshold of 0.05, so we fail to reject the null hypothesis that the slope is zero: there is no statistical evidence that bpm is linearly related to the number of weeks a song spends at number one. Lastly, the multiple R-squared is 0.00001208, meaning bpm explains almost none of the variation in weeks at number one.
Taken together, these results suggest that tempo doesn’t meaningfully influence how long a song stays at number one on the Billboard Hot 100 chart; slower and faster songs show similar chart longevity. For stakeholders like artists, producers, and record labels, this implies that tempo by itself isn’t a determining factor in chart longevity. Genre, which showed a significant relationship in the ANOVA above, and possibly factors not examined here such as marketing and popularity, are more likely to matter for number-one songs.
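To put the size of the bpm coefficient in context, the following sketch builds an approximate 95% confidence interval for the slope from the estimate, standard error, and residual degrees of freedom printed in the regression summary above. The variable names are illustrative, and the inputs are the rounded printed values.

```r
# Values copied from the lm() summary above
slope    <- 0.0003743   # bpm estimate
slope_se <- 0.0031440   # bpm standard error
df_resid <- 1173        # residual degrees of freedom

# 95% confidence interval for the slope, based on the t distribution
t_crit   <- qt(0.975, df = df_resid)
slope_ci <- slope + c(-1, 1) * t_crit * slope_se
slope_ci

# Implied range for the change in predicted weeks per 10 bpm increase
slope_ci * 10
```

The interval spans zero, and even at its extremes a 10 bpm difference corresponds to well under a tenth of a week at number one. With the fitted model in memory, `confint(lm_bpm, "bpm")` should give the same interval without copying values by hand.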