Week 11 | Data Dive — GLMs (Part 2)


Loading the Billboard Hot 100 Number Ones Dataset


library(tidyverse)
library(dplyr)

tuesdata <- tidytuesdayR::tt_load(2025, week = 34)

billboard <- tuesdata$billboard
topics <- tuesdata$topics

Preview of the Dataset


head(billboard)
## # A tibble: 6 × 105
##   song   artist date                weeks_at_number_one non_consecutive rating_1
##   <chr>  <chr>  <dttm>                            <dbl>           <dbl>    <dbl>
## 1 Poor … Ricky… 1958-08-04 00:00:00                   2               0        4
## 2 Nel B… Domen… 1958-08-18 00:00:00                   5               1        7
## 3 Littl… The E… 1958-08-25 00:00:00                   1               0        5
## 4 It's … Tommy… 1958-09-29 00:00:00                   6               0        3
## 5 It's … Conwa… 1958-11-10 00:00:00                   2               1        7
## 6 Tom D… The K… 1958-11-17 00:00:00                   1               0        5
## # ℹ 99 more variables: rating_2 <dbl>, rating_3 <dbl>, overall_rating <dbl>,
## #   divisiveness <dbl>, label <chr>, parent_label <chr>, cdr_genre <chr>,
## #   cdr_style <chr>, discogs_genre <chr>, discogs_style <chr>,
## #   artist_structure <dbl>, featured_artists <chr>,
## #   multiple_lead_vocalists <dbl>, group_named_after_non_lead_singer <dbl>,
## #   talent_contestant <chr>, posthumous <dbl>, artist_place_of_origin <chr>,
## #   front_person_age <dbl>, artist_male <dbl>, artist_white <dbl>, …
head(topics)
## # A tibble: 6 × 1
##   lyrical_topics   
##   <chr>            
## 1 Addiction        
## 2 Anger            
## 3 Appreciation     
## 4 Badassery        
## 5 Bad Behavior     
## 6 Bad Relationships

Building a Linear (or Generalized Linear) Model


Response & Explanatory Variables to be Included


Response Variable

  • hit_song — derived binary variable from weeks_at_number_one

    • This binary response variable was created and is defined as songs that stayed at number one for more than one week are categorized as 1, and the others are categorized as 0. With this created hit_song variable, we can model whether songs were major hits or not among all the number one songs up to this point.

Explanatory Variables

  • energy — Energy measure on a 0 - 100 scale provided by Spotify

  • danceability — Danceability measure on a 0 - 100 scale provided by Spotify

  • bpm — Beats per minute (a.k.a. tempo) provided by Spotify

  • explicit — Dummy variable for if Spotify labels a song as explicit (containing certain words, sexual, violent, or drug related at the time of release)

Creating the Binary Response Variable — hit_song


billboard <- billboard |>
  mutate(hit_song = ifelse(weeks_at_number_one > 1, 1, 0))

Constructing the GLM


hitsong_glm <- glm(hit_song ~ energy + danceability + bpm + explicit,
                   data = billboard,
                   family = "binomial")

summary(hitsong_glm)
## 
## Call:
## glm(formula = hit_song ~ energy + danceability + bpm + explicit, 
##     family = "binomial", data = billboard)
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)
## (Intercept)   0.5899125  0.4055324   1.455    0.146
## energy       -0.0024810  0.0033839  -0.733    0.463
## danceability  0.0010192  0.0043942   0.232    0.817
## bpm           0.0003926  0.0025196   0.156    0.876
## explicit      0.1124695  0.1612963   0.697    0.486
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1536.7  on 1174  degrees of freedom
## Residual deviance: 1535.7  on 1170  degrees of freedom
##   (2 observations deleted due to missingness)
## AIC: 1545.7
## 
## Number of Fisher Scoring iterations: 4

In the above GLM, it was utilized to determine whether a song is a “hit” (longer than one week at number one) or not, using explanatory variables like energy, danceability, bpm/tempo, and explicit. Based on the output of the GLM, it appears that none of these explanatory variables were found statistically significant with regards to “hit” status.

Interpreting the Coefficient(s)


coef(hitsong_glm)
##   (Intercept)        energy  danceability           bpm      explicit 
##  0.5899125457 -0.0024810087  0.0010191899  0.0003926415  0.1124695357

Energy — As one can see in the overall summary as well as the coefficient cell output just above, the estimate/coefficient for the energy explanatory variable is roughly -0.00248, and the corresponding p-value is 0.463. With this information, we can determine that energy doesn’t have a statistically significant effect on whether or not a song becomes a “hit”. The negative coefficient suggests that there is a slight decrease in the odds of a song being a hit as energy increases, but the effect overall isn’t meaningful.

Danceability — For danceability, the coefficient is around 0.00102, accompanied by a p-value of 0.817. Due to this, danceability too shows no statistically significant relationship with a song being a “hit” or not. This could suggest that the more danceable a song is doesn’t necessarily mean it’s more likely to remain at number one for multiple weeks.

BPM — For bpm, the coefficient is about 0.00039, and the paired p-value is 0.876. Because of this, similar to the prior two explanatory variables, bpm/tempo doesn’t have a statistically significant effect on a song being a “hit” or not as well. Specifically for bpm, whether a song is slower or faster doesn’t matter much as these different tempos are nearly equally likely to be a “hit”.

Explicit — Lastly, for the explicit explanatory variable, the coefficient is roughly 0.112, and the p-value is 0.486. Again, this explanatory variable isn’t statistically significant and has no effect on a song being a “hit”. So, if a song is deemed explicit, it doesn’t significantly impact the likelihood of it becoming a “hit”.

Additionally, the null deviance is 1536.7 and the residual deviance is 1535.7, and since the two are so closely related, this essentially means that this model doesn’t improve prediction hardly at all over a model with no predictors present. Also, the AIC presented is 1545.7, which is rather high, indicating that this model doesn’t fit the data well, if one couldn’t tell by this point.

Diagnosing the Model


Residuals vs Fitted Values

plot(hitsong_glm$fitted.values, resid(hitsong_glm, type = "deviance"),
     xlab = "Fitted Probabilities",
     ylab = "Deviance Residuals",
     main = "Residuals vs Fitted Values")
abline(h = 0, col = "red3")

Residuals vs X Values

model_data <- model.frame(hitsong_glm)

plot(model_data$energy, resid(hitsong_glm, type = "deviance"),
     xlab = "Energy",
     ylab = "Deviance Residuals",
     main = "Residuals vs Energy")
abline(h = 0, col = "red3")

plot(model_data$danceability, resid(hitsong_glm, type = "deviance"),
     xlab = "Danceability",
     ylab = "Deviance Residuals",
     main = "Residuals vs Danceability")
abline(h = 0, col = "red3")

plot(model_data$bpm, resid(hitsong_glm, type = "deviance"),
     xlab = "BPM",
     ylab = "Deviance Residuals",
     main = "Residuals vs BPM")
abline(h = 0, col = "red3")

plot(model_data$explicit, resid(hitsong_glm, type = "deviance"),
     xlab = "Explicit",
     ylab = "Deviance Residuals",
     main = "Residuals vs Explicit")
abline(h = 0, col = "red3")

Cook’s Distance

plot(cooks.distance(hitsong_glm), type="h",
     main="Cook's Distance", ylab="Influence")
abline(h = 4/(nrow(billboard)-length(coef(hitsong_glm))), col="red3")

Correlation Heatmap of Predictors

library(GGally)

ggcorr(select(billboard, energy, danceability, bpm, explicit), label = TRUE)

Issues with the Model


For the residuals vs x values plots, they all seem to show a very similar pattern in the sense that they all or most of them have two horizontal lines of points that cluster around 1 and -1.5. This is what was expected for a regression model with a binary outcome, hit_song, since the values can essentially only be either 0 or 1. Based on those specific explanatory variable plots, the patterns behave as expected, and is a good sign overall. For the residuals vs fitted values plot, most values are within the specific range of 0.60 and 0.69 and also cluster around 0 or 1 horizontally as expected. This suggests that the model predicts roughly the same probability for every observation, and makes sense considering the none of the predictors in the model are statistically significant. For Cook’s Distance, the majority of observations are below the line situated at about 0.003, and means observations aren’t influencing the model estimates. That said, there are a couple observations that spike above the line and may be partially influential, but not by much. Lastly, for the correlation heatmap, the correlations for all predictors are within the range of 0.0 to 0.3, so multicollinearity isn’t much of a concern. Overall, the correlations are rather weak, but the predictors are definitely independent of one another.

Based on the diagnostic plots above, everything looks rather clean for the most part, so this isn’t the issue with this model. The main issue here is simply that the predictors or explanatory variables don’t have meaningful predictive power for “hit” songs in the dataset. We can see this pretty clearly based on the null and residual deviance and the statistically insignificant p-values for each and every explanatory variable, and the diagnostic plots support this as we can see the expected outcomes, so that’s not where the issue lies, and it’s mainly with the explanatory variables/predictors themselves.