Week 10 | Data Dive — GLMs

Loading the Billboard Hot 100 Number Ones Dataset

library(tidyverse, quietly = TRUE)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.2
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(dplyr, quietly = TRUE)

tuesdata <- tidytuesdayR::tt_load(2025, week = 34)

## ---- Compiling #TidyTuesday Information for 2025-08-26 ----
## --- There are 2 files available ---
## 
## 
## ── Downloading files ───────────────────────────────────────────────────────────
## 
##   1 of 2: "billboard.csv"
##   2 of 2: "topics.csv"

billboard <- tuesdata$billboard
topics <- tuesdata$topics

Preview of the Dataset

head(billboard)

## # A tibble: 6 × 105
##   song   artist date                weeks_at_number_one non_consecutive rating_1
##   <chr>  <chr>  <dttm>                            <dbl>           <dbl>    <dbl>
## 1 Poor … Ricky… 1958-08-04 00:00:00                   2               0        4
## 2 Nel B… Domen… 1958-08-18 00:00:00                   5               1        7
## 3 Littl… The E… 1958-08-25 00:00:00                   1               0        5
## 4 It's … Tommy… 1958-09-29 00:00:00                   6               0        3
## 5 It's … Conwa… 1958-11-10 00:00:00                   2               1        7
## 6 Tom D… The K… 1958-11-17 00:00:00                   1               0        5
## # ℹ 99 more variables: rating_2 <dbl>, rating_3 <dbl>, overall_rating <dbl>,
## #   divisiveness <dbl>, label <chr>, parent_label <chr>, cdr_genre <chr>,
## #   cdr_style <chr>, discogs_genre <chr>, discogs_style <chr>,
## #   artist_structure <dbl>, featured_artists <chr>,
## #   multiple_lead_vocalists <dbl>, group_named_after_non_lead_singer <dbl>,
## #   talent_contestant <chr>, posthumous <dbl>, artist_place_of_origin <chr>,
## #   front_person_age <dbl>, artist_male <dbl>, artist_white <dbl>, …

head(topics)

## # A tibble: 6 × 1
##   lyrical_topics   
##   <chr>            
## 1 Addiction        
## 2 Anger            
## 3 Appreciation     
## 4 Badassery        
## 5 Bad Behavior     
## 6 Bad Relationships

Selecting an Interesting Binary Column of Data

The binary column I will model is:

happy_song (converted variable)
- The dataset already contains a happiness column, but it is not binary: according to the dataset documentation, it is a 0-100 score provided by Spotify. To turn it into a binary “happy” indicator, I first compute the median of happiness and use it as the cutoff. A median split divides the songs roughly in half and is better grounded in the data than an arbitrary threshold such as > 60.

Determining the Median happiness Value

summary(billboard$happiness)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    4.00   42.50   66.00   61.78   83.00   99.00       2

Creating the Binary Variable happy_song Using the Median Value (66.0)

billboard$happy_song <- ifelse(billboard$happiness > 66, 1, 0)
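
Rather than typing the cutoff by hand, the same median split could be computed directly from the data. The sketch below is only an illustration on a toy vector, not the billboard data, and the helper name median_split is hypothetical.

```r
# Hypothetical helper (not part of the original analysis):
# median-split a numeric score into a 0/1 indicator.
median_split <- function(x) {
  cutoff <- median(x, na.rm = TRUE)  # ignore missing scores when finding the cutoff
  as.integer(x > cutoff)             # strictly above the median -> 1; NA stays NA
}

# Toy scores standing in for billboard$happiness
toy_happiness <- c(4, 42.5, 66, 83, 99, NA)
toy_happy <- median_split(toy_happiness)

toy_happy
table(toy_happy, useNA = "ifany")    # check how balanced the two classes are
```

Applied to the real data, this would read billboard$happy_song <- median_split(billboard$happiness). Songs sitting exactly at the median are coded 0, which matches the > 66 rule used above.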

Building a Logistic Regression Model for happy_song

The explanatory variables are:

danceability
energy
bpm
loudness_d_b

happy_model <- glm(happy_song ~ danceability + energy + bpm + loudness_d_b, data = billboard, family = "binomial")

summary(happy_model)

## 
## Call:
## glm(formula = happy_song ~ danceability + energy + bpm + loudness_d_b, 
##     family = "binomial", data = billboard)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -9.783124   0.767207 -12.752   <2e-16 ***
## danceability  0.049597   0.005218   9.506   <2e-16 ***
## energy        0.071308   0.005673  12.569   <2e-16 ***
## bpm          -0.002289   0.002823  -0.811    0.417    
## loudness_d_b -0.295241   0.030122  -9.801   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1628.8  on 1174  degrees of freedom
## Residual deviance: 1279.7  on 1170  degrees of freedom
##   (2 observations deleted due to missingness)
## AIC: 1289.7
## 
## Number of Fisher Scoring iterations: 4
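
To make the log-odds scale of this output more concrete, the sketch below plugs the printed coefficients into the logistic formula for one song. The song's predictor values are invented purely for illustration and are not taken from the dataset.

```r
# Coefficients copied from the summary(happy_model) output above
b <- c(intercept = -9.783124, danceability = 0.049597, energy = 0.071308,
       bpm = -0.002289, loudness_d_b = -0.295241)

# Hypothetical song: these predictor values are assumptions for illustration
song <- c(intercept = 1, danceability = 70, energy = 70,
          bpm = 120, loudness_d_b = -8)

log_odds   <- sum(b * song[names(b)])  # linear predictor on the log-odds scale
prob_happy <- plogis(log_odds)         # inverse logit: 1 / (1 + exp(-log_odds))

round(c(log_odds = log_odds, prob_happy = prob_happy), 3)
```

For this made-up song the predicted probability of being “happy” comes out at about 0.68. With the fitted model in memory, predict(happy_model, newdata = ..., type = "response") returns the same kind of probability without copying coefficients by hand.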

Interpreting the Coefficients and What They Mean

Intercept: In the happy_model output above, the intercept is roughly -9.78. It represents the log odds of a song being classified as “happy” when all four predictors equal zero. That scenario is not realistic (a song with a bpm of 0, for example), so the intercept is not meaningful on its own; it serves as the baseline from which the other terms shift the log odds.
Danceability: The coefficient is about 0.0496 and is statistically significant (p < 2e-16). It is positive, so each one-unit increase in danceability raises the log odds of a song being “happy” by about 0.05, holding the other predictors constant. More simply, the more danceable a song is, the more likely it is to be classified as “happy”. This makes sense in a real-world context, since songs people dance to tend to have a more upbeat or positive vibe.
Energy: The coefficient is about 0.0713 and, like danceability, is statistically significant (p < 2e-16) and positive. As a song's energy increases, so does the likelihood that it is categorized as “happy”, holding the other predictors constant. This reinforces the idea that how intense and energetic a song is contributes to how “happy” it sounds.
BPM: The coefficient is about -0.0023 and is not statistically significant (p = 0.417). After accounting for danceability, energy, and loudness, the model finds no evidence that tempo is associated with the log odds of a song being “happy”. In this model, whether a song is faster or slower does not meaningfully change its chance of being classified as “happy”.
Loudness: The coefficient is about -0.2952 and is statistically significant (p < 2e-16), but negative. Holding the other predictors constant, the louder a song is, the less likely it is to be categorized as “happy”. This is interesting next to the positive danceability and energy coefficients and seems a bit counterintuitive. One possible reading is that, among songs with similar energy and danceability, the louder ones come across as more aggressive or over the top rather than positive. Because the estimate is conditional on energy, it should not be read as the effect of loudness in isolation.

Overall, happy_model suggests that danceability and energy are strong positive predictors of whether a song is classified as “happy”, loudness has a negative relationship with happiness once the other predictors are accounted for, and bpm/tempo plays little role in the model. These findings could indicate that the emotional vibe of a song is driven more by rhythm and energy than by how fast or how loud it is.
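
Log odds are hard to read directly, so one common step is to exponentiate each coefficient into an odds ratio. The sketch below does this with the estimates copied from the summary output, so it runs without the model object.

```r
# Slope estimates copied from the summary(happy_model) output above
b <- c(danceability = 0.049597, energy = 0.071308,
       bpm = -0.002289, loudness_d_b = -0.295241)

odds_ratio <- exp(b)                   # multiplicative change in odds per one-unit increase
pct_change <- 100 * (odds_ratio - 1)   # the same change expressed as a percentage

round(cbind(odds_ratio, pct_change), 3)
```

Read this way, each extra unit of danceability multiplies the odds of a “happy” song by roughly 1.05 and each extra unit of energy by roughly 1.07, while each extra unit of loudness multiplies them by roughly 0.74, all holding the other predictors constant. With the model in memory, exp(coef(happy_model)) gives the same ratios.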

Using the Standard Error, Building a Confidence Interval, and What it Means for the Danceability and Energy Coefficients

confint(happy_model, 'danceability')

## Waiting for profiling to be done...

##      2.5 %     97.5 % 
## 0.03953179 0.06000067

confint(happy_model, 'energy')

## Waiting for profiling to be done...

##      2.5 %     97.5 % 
## 0.06041047 0.08266486

What this means for danceability

As shown above, the 95% confidence interval for the danceability coefficient is roughly 0.0395 to 0.0600. The interval does not include 0, so at the 5% level danceability has a statistically significant, positive association with the likelihood of a song in the dataset being categorized as “happy”. In other words, we can be 95% confident that each one-unit increase in danceability raises the log odds of a song being “happy” by an amount within this range, holding the other predictors constant. This supports the conclusion that more danceable songs are more likely to be classified as “happy”.

What this means for energy

The 95% confidence interval for the energy coefficient is roughly 0.0604 to 0.0827. Again, the interval does not include 0, so energy has a statistically significant, positive association with the likelihood of a song in the dataset being categorized as “happy”. We can be 95% confident that each one-unit increase in energy raises the log odds of a song being “happy” by an amount within this range, holding the other predictors constant, which is strong evidence that higher-energy songs are more likely to be classified as “happy”.
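
The intervals printed above come from confint(), which profiles the likelihood (hence the "Waiting for profiling to be done..." message) rather than using the standard error directly. As a rough cross-check, the sketch below builds the simpler Wald interval, estimate ± z × standard error, from the numbers in the summary output.

```r
# Estimates and standard errors copied from the summary(happy_model) output above
est <- c(danceability = 0.049597, energy = 0.071308)
se  <- c(danceability = 0.005218, energy = 0.005673)

z <- qnorm(0.975)  # about 1.96 for a 95% interval

wald_ci <- cbind(lower = est - z * se,
                 upper = est + z * se)

round(wald_ci, 4)
```

The Wald intervals land very close to the profile-likelihood ones (about 0.0394 to 0.0598 for danceability and 0.0602 to 0.0824 for energy), so the conclusions do not change. With the model object available, confint.default(happy_model) produces Wald intervals for every coefficient.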

Week10DataDive

Grant Starnes

2026-03-30