Week 8 | Data Dive — Regression Modeling


Loading my Billboard Hot 100 Number Ones Dataset


library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.2
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)

tuesdata <- tidytuesdayR::tt_load(2025, week = 34)
## ---- Compiling #TidyTuesday Information for 2025-08-26 ----
## --- There are 2 files available ---
## 
## 
## ── Downloading files ───────────────────────────────────────────────────────────
## 
##   1 of 2: "billboard.csv"
##   2 of 2: "topics.csv"
billboard <- tuesdata$billboard
topics <- tuesdata$topics

Head of the Dataset


head(billboard)
## # A tibble: 6 × 105
##   song   artist date                weeks_at_number_one non_consecutive rating_1
##   <chr>  <chr>  <dttm>                            <dbl>           <dbl>    <dbl>
## 1 Poor … Ricky… 1958-08-04 00:00:00                   2               0        4
## 2 Nel B… Domen… 1958-08-18 00:00:00                   5               1        7
## 3 Littl… The E… 1958-08-25 00:00:00                   1               0        5
## 4 It's … Tommy… 1958-09-29 00:00:00                   6               0        3
## 5 It's … Conwa… 1958-11-10 00:00:00                   2               1        7
## 6 Tom D… The K… 1958-11-17 00:00:00                   1               0        5
## # ℹ 99 more variables: rating_2 <dbl>, rating_3 <dbl>, overall_rating <dbl>,
## #   divisiveness <dbl>, label <chr>, parent_label <chr>, cdr_genre <chr>,
## #   cdr_style <chr>, discogs_genre <chr>, discogs_style <chr>,
## #   artist_structure <dbl>, featured_artists <chr>,
## #   multiple_lead_vocalists <dbl>, group_named_after_non_lead_singer <dbl>,
## #   talent_contestant <chr>, posthumous <dbl>, artist_place_of_origin <chr>,
## #   front_person_age <dbl>, artist_male <dbl>, artist_white <dbl>, …
head(topics)
## # A tibble: 6 × 1
##   lyrical_topics   
##   <chr>            
## 1 Addiction        
## 2 Anger            
## 3 Appreciation     
## 4 Badassery        
## 5 Bad Behavior     
## 6 Bad Relationships

Cleaning the cdr_genre column and creating primary_genre


billboard <- billboard |>
  mutate(
    primary_genre = str_split_i(cdr_genre, ";", 1)
  )
billboard |>
  select(cdr_genre, primary_genre) |>
  distinct()
## # A tibble: 33 × 2
##    cdr_genre          primary_genre
##    <chr>              <chr>        
##  1 Pop;Rock           Pop          
##  2 Pop                Pop          
##  3 Rock               Rock         
##  4 Folk/Country       Folk/Country 
##  5 Folk/Country;March Folk/Country 
##  6 Pop;Folk/Country   Pop          
##  7 Jazz               Jazz         
##  8 Funk/Soul;Rock     Funk/Soul    
##  9 Polka              Polka        
## 10 Funk/Soul          Funk/Soul    
## # ℹ 23 more rows
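
To make the "first genre listed" rule concrete, here is a minimal base R sketch of the same idea on a few made-up strings. The `cdr_genre` vector below is hypothetical illustration data in the same `A;B` layout, not rows from the Billboard file, and the base R `strsplit()` approach is a stand-in for the `str_split_i()` call used above.

```r
# Hypothetical genre strings in the same "A;B" layout as cdr_genre
cdr_genre <- c("Pop;Rock", "Folk/Country;March", "Jazz", NA)

# Split each string on ";" and keep only the first piece.
# Strings with no ";" come back unchanged, and NA stays NA.
first_genre <- vapply(
  strsplit(cdr_genre, ";", fixed = TRUE),
  function(parts) parts[1],
  character(1)
)

first_genre
```

Note that this rule treats "Pop;Rock" and "Rock;Pop" as different primary genres, so the result depends on the order in which the genres were recorded.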

Selecting a Continuous (or Ordered Integer) Column of Data that Seems Most “Valuable” Given the Context of the Data (Response Variable)


Response Variable

  • weeks_at_number_one (number of weeks a song was at the #1 spot on the Billboard Hot 100)

    • A reasonable proxy for how successful a song actually was

Selecting a Categorical Column of Data (Explanatory Variable) that I Expect Might Influence the Response Variable


Explanatory Variable

  • primary_genre (derived from the original cdr_genre column by splitting on “;” and taking the first genre listed, on the assumption that the first genre is the primary one)

    • Musical genre may influence how long a song remains at number one, since different genres have different audience sizes and levels of popularity

Consolidating Categories Before Running the Test Since There are More than 10 Categories


top_genres <- billboard |>
  filter(!is.na(primary_genre)) |>
  count(primary_genre, sort = TRUE) |>
  slice_head(n = 10) |>
  pull(primary_genre)

billboard_filtered <- billboard |>
  filter(primary_genre %in% top_genres)

top_genres
##  [1] "Pop"              "Rock"             "Funk/Soul"        "Electronic/Dance"
##  [5] "Hip Hop"          "Folk/Country"     "Reggae"           "Jazz"            
##  [9] "Latin"            "Blues"
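
The filter above drops every song whose primary genre is outside the top 10. An alternative way to consolidate is to keep those songs and relabel the rarer genres as a single "Other" group. The sketch below compares the two options on a small made-up vector; the data, the `n_keep` value, and the variable names are all hypothetical. In the tidyverse, `forcats::fct_lump_n()` offers the same kind of lumping.

```r
# Hypothetical toy data: genre labels for 12 songs (not from the Billboard dataset)
genre <- c("Pop", "Pop", "Pop", "Pop", "Rock", "Rock", "Rock",
           "Jazz", "Jazz", "Polka", "Blues", "Latin")

# Count songs per genre and keep the names of the n_keep most common ones
n_keep <- 3
counts <- sort(table(genre), decreasing = TRUE)
top <- names(counts)[seq_len(n_keep)]

# Option A (what the report does): drop songs outside the top genres
dropped <- genre[genre %in% top]

# Option B: keep every song, relabelling the rarer genres as "Other"
lumped <- ifelse(genre %in% top, genre, "Other")

length(dropped)  # songs remaining after filtering
table(lumped)    # all songs retained, rare genres pooled together
```

Which option is preferable depends on the question: filtering keeps the groups clean, while lumping keeps the full sample at the cost of a mixed "Other" category.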

Devising a Null Hypothesis for an ANOVA Test Given the Situation


Null Hypothesis –> H0:

  • The average number of weeks a song stays at number one on the Billboard Hot 100 chart is the same across all genres.

Alternative Hypothesis –> HA:

  • At least one genre's mean number of weeks at number one on the Billboard Hot 100 chart differs from the others.

Testing the Hypothesis Using ANOVA and Summarizing the Results


library(viridis)
## Loading required package: viridisLite
ggplot(billboard_filtered, aes(x = primary_genre, y = weeks_at_number_one, fill = primary_genre)) +
  geom_boxplot() +
  scale_fill_viridis_d(option = "plasma") +
  labs(
    title = "Distribution of Weeks at Number One by Primary Genre",
    x = "Primary Genre",
    y = "Weeks at Number One",
    fill = "Genre"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        legend.position = "none")

anova_model <- aov(weeks_at_number_one ~ primary_genre, data = billboard_filtered)

summary(anova_model)
##                 Df Sum Sq Mean Sq F value   Pr(>F)    
## primary_genre    9    381   42.29   6.792 1.57e-09 ***
## Residuals     1077   6706    6.23                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

How the Output Relates to the Conclusions


The ANOVA above gives F(9, 1077) = 6.792 with a p-value of roughly 1.57 × 10⁻⁹, far below the conventional significance level of α = 0.05. We therefore reject the null hypothesis that the average number of weeks a song stays at number one on the Billboard Hot 100 chart is the same across all genres. There is strong evidence that at least one genre's mean differs from the others, although the test by itself does not say which genres differ. Statistical significance is also not the same as a large effect: from the sums of squares, primary genre accounts for about 381 / (381 + 6706) ≈ 5% of the variation in weeks at number one. All in all, primary genre has a statistically significant, but modest, relationship with how long a song remains at number one on the Billboard Hot 100 chart.
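
As a rough check on the ANOVA table, the F statistic, p-value, and share of variation explained can be recomputed from the printed sums of squares and degrees of freedom. This is only a sketch: the inputs are the rounded values from the `summary(anova_model)` output, so the results will match the table only approximately, and the variable names are made up for illustration.

```r
# Rounded values copied from the summary(anova_model) output
ss_between <- 381    # Sum Sq for primary_genre
ss_within  <- 6706   # Sum Sq for Residuals
df_between <- 9
df_within  <- 1077

# F statistic: ratio of the between-group to the within-group mean square
f_stat <- (ss_between / df_between) / (ss_within / df_within)

# Upper-tail probability of the F distribution gives the p-value
p_value <- pf(f_stat, df_between, df_within, lower.tail = FALSE)

# Eta squared: share of total variation associated with genre
eta_sq <- ss_between / (ss_between + ss_within)

c(F = f_stat, p = p_value, eta_squared = eta_sq)
```

The eta squared value puts the tiny p-value in context: with more than a thousand songs, even a small difference in group means can be detected reliably.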

What this Might Mean for Those Interested in the Data


The ANOVA results suggest that primary genre is associated with chart longevity: some genres may be more likely than others to produce songs that stay at number one for longer. In the real world, this could reflect differences in audience size, radio play, streaming popularity (Spotify, Apple Music, etc.) in more recent years, or cultural trends tied to particular genres. For stakeholders in the music industry such as artists, producers, and record labels, the takeaway is that genre plays a measurable part in how long a song holds the number one spot on the Billboard Hot 100 chart. That said, this is an observational comparison, so it shows association rather than cause, and genre alone does not determine a song's success: the residual sum of squares in the ANOVA table is far larger than the genre sum of squares, so most of the variation in weeks at number one is left unexplained.
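
The ANOVA only says that some genre means differ, not which ones. One common follow-up is Tukey's HSD, which compares every pair of groups while adjusting for multiple comparisons. The sketch below is not something this report ran on the Billboard data; it uses a small hypothetical data frame (`toy`, with groups A, B, and C) purely to show the shape of the call and its output. The grouping column is stored as a factor here, which is the safe choice for `TukeyHSD()`.

```r
# Hypothetical toy data: three groups, where group C clearly has a higher mean
toy <- data.frame(
  genre = factor(rep(c("A", "B", "C"), each = 6)),
  weeks = c(1, 2, 1, 2, 1, 2,
            1, 2, 2, 1, 2, 1,
            6, 7, 8, 7, 6, 8)
)

# Fit the one-way ANOVA, then run the pairwise comparisons
toy_model <- aov(weeks ~ genre, data = toy)
tukey <- TukeyHSD(toy_model)

# One row per pair of groups: difference in means, confidence bounds, adjusted p-value
tukey$genre
```

Applied to the real data, the same pattern would be `TukeyHSD(anova_model)`, with `primary_genre` converted to a factor before fitting. With 10 genres this produces 45 pairwise comparisons, so the output is best read by sorting on the adjusted p-value.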

A Single Continuous (or Ordered Integer, Non-Binary) Column of Data that Might Influence the Response Variable


Continuous Column that May Influence the Response Variable (weeks_at_number_one)

  • bpm (beats per minute as provided by Spotify)

    • bpm measures the tempo of a song; different tempos may appeal to audiences in different ways, which could affect chart performance and longevity

Building a Linear Regression Model of the Response Using this Column and Evaluating its Fit


lm_bpm <- lm(weeks_at_number_one ~ bpm, data = billboard)

summary(lm_bpm)
## 
## Call:
## lm(formula = weeks_at_number_one ~ bpm, data = billboard)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.9634 -1.9363 -0.9395  1.0523 16.0718 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2.8979091  0.3723397   7.783 1.55e-14 ***
## bpm         0.0003743  0.0031440   0.119    0.905    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.646 on 1173 degrees of freedom
##   (2 observations deleted due to missingness)
## Multiple R-squared:  1.208e-05,  Adjusted R-squared:  -0.0008404 
## F-statistic: 0.01418 on 1 and 1173 DF,  p-value: 0.9052
ggplot(billboard, aes(x = bpm, y = weeks_at_number_one)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    title = "Tempo (bpm) vs. Weeks at Number One",
    x = "Beats per Minute",
    y = "Weeks at Number One"
  ) +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 2 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 2 rows containing missing values or values outside the scale range
## (`geom_point()`).

Interpreting the Coefficients of the Model and Explaining How they Relate to the Context of the Data


First, the intercept is about 2.898 and represents the predicted number of weeks a song would stay at number one when bpm is zero. That has no practical meaning, since a tempo of zero isn’t realistic in music; the intercept simply anchors the regression line. Second, the bpm coefficient is roughly 0.000374, which is the expected change in weeks at number one for each one-unit increase in bpm. This is tiny: even a 100 bpm difference in tempo corresponds to a predicted change of only about 0.04 weeks. The p-value for bpm is 0.905, far above the usual significance threshold of 0.05, so we fail to reject the null hypothesis that the slope is zero. In other words, there is no statistical evidence of a linear relationship between bpm and the number of weeks a song spends at number one.

Lastly, the multiple R-squared is 0.00001208, meaning bpm explains essentially none of the variation in weeks at number one. Taken together, these results suggest that tempo does not meaningfully influence song longevity on the Billboard Hot 100 chart: slower and faster songs show similar levels of chart success. For stakeholders like artists, producers, and record labels, this implies that tempo by itself isn’t a determining factor in chart longevity. Genre, which showed a significant relationship in the ANOVA above, and factors this analysis did not test, such as marketing or overall popularity, are more likely to matter for number one songs.
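
To see how small the fitted slope is in practice, the coefficients can be plugged in by hand. The sketch below copies the estimates from the `summary(lm_bpm)` output and compares a slow and a fast song. The tempos of 60 and 180 bpm are arbitrary illustrative values, and the variable names are made up for this example.

```r
# Values copied from the summary(lm_bpm) output in this report
intercept <- 2.8979091
slope     <- 0.0003743
slope_se  <- 0.0031440
resid_df  <- 1173

# Predicted weeks at number one for a slow and a fast song (illustrative tempos)
bpm_values <- c(slow = 60, fast = 180)
predicted  <- intercept + slope * bpm_values

# Difference in predicted weeks implied by the 120 bpm gap
gap <- unname(predicted["fast"] - predicted["slow"])

# t statistic and two-sided p-value for the null hypothesis that the slope is zero
t_stat  <- slope / slope_se
p_value <- 2 * pt(abs(t_stat), df = resid_df, lower.tail = FALSE)

predicted
c(gap = gap, t = t_stat, p = p_value)
```

Tripling the tempo moves the prediction by less than a twentieth of a week, which is why the fitted line in the scatterplot is essentially flat.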