Understanding Hit Songs Using Audio Features

Author

Grant Starnes

Notebook Converted to Slide Deck

Audience Description

This analysis is intended mainly for music industry professionals such as record label executives, music producers, and analysts at streaming platforms (e.g., Spotify, Apple Music). These stakeholders want to understand which factors are associated with song success, how to allocate production and marketing resources effectively, and whether data-driven insights can inform their strategic decisions.

Background Information

The music industry invests significant time and resources in producing songs, yet it remains unclear why some songs become more successful than others. As data analysis becomes more central to the industry, there is growing interest in whether measurable audio features are associated with success. The dataset analyzed here is the Billboard Hot 100 Number Ones dataset hosted on GitHub. It contains every song that topped the chart between August 4, 1958 and January 11, 2025. Each row corresponds to exactly one song that reached number one at some point.

Main Objective

The primary goal of this analysis is to evaluate whether audio features such as energy, danceability, bpm (tempo), and explicit content are associated with whether a song becomes a hit or with how long it remains at number one. Because every song in the dataset reached number one, we define a “hit” through a binary response variable: songs that stayed at number one for more than one week are coded as 1, and all others are coded as 0. For how long a song remains at number one, we use weeks at number one as the response variable.

Initial Exploratory Data Analysis

Overview of Variables We’ll Be Using:

Energy — energy measure from 0 to 100, as provided by Spotify
Danceability — danceability measure from 0 to 100, as provided by Spotify
BPM — beats per minute (tempo), as provided by Spotify
Explicit — Dummy for if Spotify labels the song as explicit, containing expletives, or overly sexual, violent, or drug related at the time of release
Weeks at Number One — Collective (consecutive and non-consecutive) weeks at number one
Hit Song — derived binary outcome variable from weeks at number one, categorized as 1 if at number one for multiple weeks, and 0 for only one week

Loading the Billboard Hot 100 Number Ones Dataset

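# tidyverse attaches dplyr, so a separate library(dplyr) call is not needed
library(tidyverse)

# Download the TidyTuesday release for 2025, week 34 (requires the tidytuesdayR package)
tuesdata <- tidytuesdayR::tt_load(2025, week = 34)

billboard <- tuesdata$billboard
topics <- tuesdata$topics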

Creating the Binary Response Variable — hit_song

billboard <- billboard |>
  mutate(hit_song = ifelse(weeks_at_number_one > 1, 1, 0))

Constructing the GLM

hitsong_glm <- glm(hit_song ~ energy + danceability + bpm + explicit,
                   data = billboard,
                   family = "binomial")
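
Written out, the model fitted by the call above has the following form. The symbols p_i and β_0 through β_4 are notation introduced here for illustration and do not appear in the original notebook.

```latex
\log\frac{p_i}{1 - p_i}
  = \beta_0
  + \beta_1\,\text{energy}_i
  + \beta_2\,\text{danceability}_i
  + \beta_3\,\text{bpm}_i
  + \beta_4\,\text{explicit}_i,
\qquad p_i = P(\text{hit\_song}_i = 1)
```

Each coefficient is therefore a change in the log-odds of being a hit for a one-unit increase in that predictor, with the other predictors held fixed.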

Correlation Heatmap for the GLM

library(GGally)

ggcorr(select(billboard, energy, danceability, bpm, explicit), label = TRUE)

Longevity Regression Model

new_reg <- lm(weeks_at_number_one ~ bpm + danceability + explicit + bpm:danceability,
              data = billboard)
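
As an equation, the longevity model above can be written as follows. The β and ε symbols are notation assumed here for illustration.

```latex
\text{weeks}_i
  = \beta_0
  + \beta_1\,\text{bpm}_i
  + \beta_2\,\text{danceability}_i
  + \beta_3\,\text{explicit}_i
  + \beta_4\,(\text{bpm}_i \times \text{danceability}_i)
  + \varepsilon_i
```

Because of the interaction term, the slope for bpm is not a single number: it equals β_1 + β_4 · danceability, so the association between tempo and weeks at number one is allowed to differ across danceability levels.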

Correlation Heatmap for Longevity Model

ggcorr(select(billboard,
              explicit,
              danceability,
              bpm,
              weeks_at_number_one),
       label = TRUE) +
  labs(title = "Correlation Heatmap")

Interpretations of the Two Correlation Heatmaps for the hit_song GLM and weeks_at_number_one Linear Regression Model

Both correlation heatmaps show only weak pairwise correlations. Among the predictors, this suggests that multicollinearity isn’t a major concern. In the second heatmap, the weak correlations between each predictor and weeks_at_number_one also indicate that individual variables alone may not have a strong association with what we’ve defined as “success”.
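
A heatmap only shows pairwise correlations, so a variance inflation factor (VIF) check is one way to back up the multicollinearity reading numerically. The sketch below is not part of the original notebook: it uses simulated stand-in data rather than the Billboard data, and `vif_manual` is a hypothetical helper written for illustration.

```r
# Simulated stand-in data (NOT the Billboard data): four roughly independent predictors
set.seed(1)
n <- 500
sim <- data.frame(
  energy       = runif(n, 0, 100),
  danceability = runif(n, 0, 100),
  bpm          = rnorm(n, mean = 120, sd = 25),
  explicit     = rbinom(n, size = 1, prob = 0.2)
)

# VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing predictor j on the others
vif_manual <- function(data) {
  sapply(names(data), function(v) {
    others <- setdiff(names(data), v)
    fit <- lm(reformulate(others, response = v), data = data)
    1 / (1 - summary(fit)$r.squared)
  })
}

vifs <- vif_manual(sim)
print(round(vifs, 3))
```

Values close to 1 indicate that a predictor is nearly uncorrelated with the others. To apply the same helper to the real data, one would pass the selected predictor columns of billboard with incomplete rows removed.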

Assumptions Made

Logistic Regression Assumptions

For the logistic regression model, we make four assumptions. First, observations are independent of one another. Second, the log-odds of being a hit are a linear function of the predictors. Third, there is no extreme multicollinearity among the predictors. Lastly, the sample size is sufficient.

Linear Regression Assumptions

For the linear regression model, we also make four assumptions. First, the relationship between the predictors and the response is linear. Second, as with the logistic regression model, observations are independent. Third, the errors have constant variance (homoscedasticity). Finally, the residuals are normally distributed.

Why the Assumptions are Acceptable

These assumptions are treated as acceptable for three reasons. First, each song in the dataset is treated as an independent observation. Second, the correlation heatmaps presented above show low multicollinearity. Lastly, the diagnostic plots displayed later in this notebook don’t indicate severe violations for the logistic regression model; for the linear regression model, they flag some departures from linearity, constant variance, and normality, which are discussed alongside each plot.

Interpretation Risks and Mitigating Them

Omitted variable bias (important contextual factors such as marketing and artist popularity are missing)
- This can be mitigated by avoiding causal claims and interpreting results strictly as associations
Over-interpreting variables that are statistically insignificant
- This can be mitigated by focusing solely on the statistically meaningful relationships
An oversimplified definition of “hit”
- This can be mitigated by supplementing it with an alternative outcome (weeks at number one)

Analyses and Support

Logistic Regression Model

summary(hitsong_glm)


Call:
glm(formula = hit_song ~ energy + danceability + bpm + explicit, 
    family = "binomial", data = billboard)

Coefficients:
               Estimate Std. Error z value Pr(>|z|)
(Intercept)   0.5899125  0.4055324   1.455    0.146
energy       -0.0024810  0.0033839  -0.733    0.463
danceability  0.0010192  0.0043942   0.232    0.817
bpm           0.0003926  0.0025196   0.156    0.876
explicit      0.1124695  0.1612963   0.697    0.486

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1536.7  on 1174  degrees of freedom
Residual deviance: 1535.7  on 1170  degrees of freedom
  (2 observations deleted due to missingness)
AIC: 1545.7

Number of Fisher Scoring iterations: 4

Logistic Regression Model Results Interpretation

None of the predictors (energy, danceability, bpm, and explicit) are statistically significant: their p-values range from 0.463 to 0.876
The model shows only a minimal improvement over the null model with no predictors
- Null deviance: 1536.7
- Residual deviance: 1535.7
- The deviance drops by only 1.0 while using 4 degrees of freedom
Overall, the summary suggests that audio features aren’t strongly associated with whether or not a song becomes a “hit”
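
The comparison with the null model can be made explicit with a likelihood-ratio test computed directly from the two deviances, and the coefficients can be read as odds ratios. This is a minimal sketch that copies its inputs from the printed summary above instead of refitting the model; the variable names are chosen here for illustration.

```r
# Values copied from the summary(hitsong_glm) output above
null_dev  <- 1536.7; null_df  <- 1174
resid_dev <- 1535.7; resid_df <- 1170

lr_stat <- null_dev - resid_dev   # drop in deviance from adding the predictors
lr_df   <- null_df - resid_df     # number of predictors added (4)
lr_p    <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

# Estimates copied from the coefficient table; exp() turns log-odds into odds ratios
est <- c(energy = -0.0024810, danceability = 0.0010192,
         bpm = 0.0003926, explicit = 0.1124695)
odds_ratios <- exp(est)

print(c(statistic = lr_stat, df = lr_df, p_value = lr_p))  # p-value is about 0.91
print(round(odds_ratios, 3))
```

A p-value near 0.91 means the four predictors together do not improve on the intercept-only model, which matches the reading of the individual coefficients. All four odds ratios sit close to 1, the value that corresponds to no association.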

Logistic Regression Model Diagnostics

Residuals vs Fitted Values Plot

plot(hitsong_glm$fitted.values, resid(hitsong_glm, type = "deviance"),
     xlab = "Fitted Probabilities",
     ylab = "Deviance Residuals",
     main = "Residuals vs Fitted Values")
abline(h = 0, col = "red3")

Cook’s Distance Plot

plot(cooks.distance(hitsong_glm), type = "h",
     main = "Cook's Distance", ylab = "Influence")
# nobs() counts only the rows used in the fit (rows with missing values are dropped)
abline(h = 4 / (nobs(hitsong_glm) - length(coef(hitsong_glm))), col = "red3")

For the residuals vs fitted values plot, the fitted probabilities fall within a narrow range of about 0.60 to 0.69, and the residuals form two horizontal bands, one for each observed outcome (0 or 1), as expected for a binary response. This suggests that the model predicts roughly the same probability for every observation, which makes sense considering that none of the predictors in the model are statistically significant.

For Cook’s distance, the majority of observations fall below the reference line at about 0.003, meaning most observations have little influence on the model estimates. That said, a couple of observations spike above the line and may be somewhat influential, but not by much.
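
Because the response is binary, each deviance residual is fully determined by the fitted probability and the observed outcome, which is why the plot splits into two groups. The sketch below evaluates the standard deviance-residual formula for a binary outcome at a few probabilities in the reported 0.60 to 0.69 range; `dev_resid` is a hypothetical helper written for illustration, not a function from the notebook.

```r
# Fitted probabilities spanning the range reported for the plot (0.60 to 0.69)
p <- c(0.60, 0.65, 0.69)

# Deviance residual for a binary outcome y with fitted probability p:
#   sign(y - p) * sqrt(-2 * (y * log(p) + (1 - y) * log(1 - p)))
dev_resid <- function(y, p) {
  sign(y - p) * sqrt(-2 * (y * log(p) + (1 - y) * log(1 - p)))
}

bands <- data.frame(
  fitted        = p,
  resid_if_hit  = dev_resid(1, p),  # positive group (hit_song = 1)
  resid_if_miss = dev_resid(0, p)   # negative group (hit_song = 0)
)
print(round(bands, 3))
```

With fitted probabilities confined to such a narrow range, the residuals for hits stay positive and those for non-hits stay negative, so the two groups cannot overlap.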

Linear Regression Model

summary(new_reg)


Call:
lm(formula = weeks_at_number_one ~ bpm + danceability + explicit + 
    bpm:danceability, data = billboard)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.0077 -1.7405 -0.7675  0.6367 15.8937 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)       2.659e+00  1.476e+00   1.802   0.0718 .  
bpm              -5.022e-03  1.260e-02  -0.398   0.6904    
danceability      1.147e-03  2.503e-02   0.046   0.9635    
explicit          8.209e-01  2.005e-01   4.094 4.53e-05 ***
bpm:danceability  8.599e-05  2.150e-04   0.400   0.6893    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.617 on 1170 degrees of freedom
  (2 observations deleted due to missingness)
Multiple R-squared:  0.02432,   Adjusted R-squared:  0.02099 
F-statistic: 7.292 on 4 and 1170 DF,  p-value: 8.445e-06

Linear Regression Model Results Interpretation

Explicit content is statistically significant, with a p-value of 4.53e-05
- The positive coefficient indicates that songs labeled explicit tend to spend about 0.82 more weeks at number one on average, holding bpm and danceability fixed
All other explanatory variables (bpm, danceability, bpm:danceability) remain statistically insignificant
The model explains only about 2% of the variation in weeks at number one (R² = 0.024, adjusted R² = 0.021)
Given all of this, even though explicit is statistically significant, the vast majority of the variation in song success isn’t explained by any of the features above
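
To give the explicit coefficient a sense of scale, a 95% confidence interval can be computed from the printed estimate and standard error, and the adjusted R² can be reproduced from the reported R². This sketch copies its inputs from the summary output above; the variable names are chosen here for illustration.

```r
# Values copied from the summary(new_reg) output above
est      <- 0.8209    # explicit estimate
se       <- 0.2005    # explicit standard error
resid_df <- 1170      # residual degrees of freedom
r2       <- 0.02432   # multiple R-squared
k        <- 4         # number of model terms excluding the intercept

# 95% confidence interval for the explicit coefficient
t_crit <- qt(0.975, df = resid_df)
ci <- est + c(-1, 1) * t_crit * se
print(round(ci, 3))   # roughly 0.43 to 1.21 weeks

# Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1), with n = resid_df + k + 1
n <- resid_df + k + 1
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - k - 1)
print(round(adj_r2, 5))
```

The interval excludes zero, consistent with the small p-value, but even its upper end is little more than one extra week, and the model as a whole still explains only about 2% of the variation.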

Linear Regression Model Diagnostics

Residuals vs Fitted Values Plot

plot(new_reg$fitted.values, resid(new_reg),
     xlab = "Fitted Values",
     ylab = "Residuals",
     main = "Residuals vs Fitted Values")
abline(h = 0, col = "red3")

The residuals vs. fitted plot above shows a structured pattern with clearly visible clustering rather than a random scatter, which suggests that the linearity assumption may be violated. The spread of the residuals also looks uneven across the fitted values, so there may be some heteroscedasticity. Overall, the plot gives moderate evidence that the regression model doesn’t fully capture the underlying relationship.
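
A visual impression of uneven spread can be supplemented with a formal check. The sketch below computes the studentized (Koenker) form of the Breusch-Pagan statistic by hand in base R. It runs on simulated stand-in data, not the Billboard data, and is an illustration of the technique rather than something the original notebook ran.

```r
# Simulated stand-in data (NOT the Billboard data): error spread grows with x
set.seed(123)
n <- 500
x <- runif(n, 0, 5)
y <- 1 + x + rnorm(n, sd = 0.2 + x)

fit <- lm(y ~ x)

# Auxiliary regression of the squared residuals on the predictors;
# the test statistic is n * R^2 from this regression
aux     <- lm(resid(fit)^2 ~ x)
bp_stat <- n * summary(aux)$r.squared
bp_df   <- length(coef(aux)) - 1   # number of predictors in the auxiliary model
bp_p    <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)

print(c(statistic = bp_stat, df = bp_df, p_value = bp_p))
```

A small p-value indicates that the residual variance changes with the predictors. For the real model, the same steps would use resid(new_reg) and the predictors stored in model.frame(new_reg).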

Residuals vs X Values Plot

model_data <- model.frame(new_reg)

plot(model_data$bpm, resid(new_reg),
     xlab = "BPM",
     ylab = "Residuals",
     main = "Residuals vs BPM")
abline(h = 0, col = "red3")

plot(model_data$danceability, resid(new_reg),
     xlab = "Danceability",
     ylab = "Residuals",
     main = "Residuals vs Danceability")
abline(h = 0, col = "red3")

For the Residuals vs BPM plot, the residuals are scattered around the zero line with no noticeable trend, so the linearity assumption looks reasonable for bpm. On the other hand, some residuals reach roughly 15 or a bit higher, while the lowest only reach about -2 to -3, which suggests the errors are right-skewed rather than symmetric around zero. This is somewhat concerning for homoscedasticity and normality, and a transformation of weeks_at_number_one could be beneficial in future work.

For the Residuals vs Danceability plot, the residuals similarly scatter around zero, which supports linearity reasonably well. The same range of roughly -2 to 15 appears here, raising similar concerns about normality and homoscedasticity because of the long upper tail. Unlike in the bpm plot, the overall spread doesn’t widen across danceability values, so there’s a bit less of a variance problem here.
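
One candidate for the transformation of weeks_at_number_one mentioned above is a log transform, which is defined here because the response is always at least 1. The sketch below compares residual skewness before and after the transform. It uses a simulated right-skewed response, not the Billboard data, and `skewness` is a hypothetical helper defined for illustration.

```r
# Simulated stand-in data (NOT the Billboard data)
set.seed(42)
n <- 400
sim <- data.frame(
  bpm          = rnorm(n, mean = 120, sd = 25),
  danceability = runif(n, 20, 95),
  explicit     = rbinom(n, size = 1, prob = 0.2)
)
# Right-skewed positive integer response: 1 plus a geometric count
sim$weeks <- 1 + rgeom(n, prob = 0.4)

# Sample skewness: 0 for a symmetric distribution, positive for a long right tail
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

fit_raw <- lm(weeks ~ bpm + danceability + explicit, data = sim)
fit_log <- lm(log(weeks) ~ bpm + danceability + explicit, data = sim)

skew <- c(raw = skewness(resid(fit_raw)), log = skewness(resid(fit_log)))
print(round(skew, 3))
```

If the log scale pulls the residual skewness closer to zero on the real data as well, the normality and constant-variance assumptions become more plausible, at the cost of coefficients that are interpreted on the log scale.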

QQ Plot

qqnorm(resid(new_reg),
       main = "QQ Plot of Residuals")
qqline(resid(new_reg), col = "red3")

The Q-Q plot above shows substantial deviation from the reference line in the upper right tail, indicating that the residuals aren’t normally distributed. This right-tail departure suggests the presence of outliers and/or a skewed response variable. Based on this, the model may have a moderate to severe violation of the normality assumption.

Cook’s Distance Plot

plot(cooks.distance(new_reg),
     type = "h",
     main = "Cook's Distance",
     ylab = "Cook's Distance")

abline(h = 4/length(new_reg$fitted.values), col = "red3", lty = 2)

Lastly, the Cook’s distance plot above shows multiple observations that exceed the usual 4/n threshold, indicating that some data points may have a disproportionate influence on the linear regression model. Most observations have low influence, but the larger values suggest that the model might be sensitive to specific number one songs in the Billboard Hot 100 Number Ones dataset.
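
A simple way to gauge that sensitivity is to refit the model without the flagged observations and compare coefficients. The sketch below illustrates the idea on simulated stand-in data with a few injected outliers; it is not a result from the Billboard data, and the 4/n cutoff is the same rule of thumb used for the reference line above.

```r
# Simulated stand-in data (NOT the Billboard data) with three injected outliers
set.seed(7)
n <- 300
sim <- data.frame(x = rnorm(n))
sim$y <- 2 + 0.5 * sim$x + rnorm(n)
sim$y[1:3] <- sim$y[1:3] + 12

fit_all <- lm(y ~ x, data = sim)

# Flag observations whose Cook's distance exceeds 4 / n
cd   <- cooks.distance(fit_all)
keep <- cd <= 4 / nobs(fit_all)

# Refit on the retained rows and compare coefficients side by side
fit_trim <- lm(y ~ x, data = sim[keep, , drop = FALSE])
comparison <- rbind(all_rows = coef(fit_all), trimmed = coef(fit_trim))

print(sum(!keep))            # number of flagged observations
print(round(comparison, 3))
```

If the coefficients barely move after trimming, the conclusions do not hinge on a handful of songs. Large shifts would be a reason to report both fits rather than to drop the flagged songs silently.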

Conclusions

The results of the analyses indicate that the audio features (energy, danceability, and bpm (tempo)) are not strongly associated with whether a song becomes a hit. For the logistic regression model, none of the explanatory variables were statistically significant, and the model showed minimal improvement over the null model, which suggests limited explanatory power. The linear regression model for weeks at number one likewise found most explanatory variables to be statistically insignificant. The exception was the explicit variable, which had a positive and statistically significant association with time spent at number one. However, the explanatory power of the linear model was very low, with most of the variation in song success left unexplained by the explanatory variables. All in all, the findings from both models suggest that the audio features alone don’t provide a strong explanation for song success, whether measured as being a “hit” or as time spent at number one.

Recommendations

Based on the findings, one recommendation is that music industry professionals avoid relying solely on audio features when evaluating or trying to understand song success. Certain variables, such as explicit, show some association with performance outcomes, but the overall lack of significant relationships suggests that other factors most likely play a far larger role in success. To gain a better understanding, future analyses should incorporate additional variables such as artist popularity, marketing and promotional efforts, streaming metrics, and social or cultural trends. This dataset doesn’t offer these kinds of variables, but external research may be very beneficial here if possible. In conclusion, by combining audio features with broader contextual factors, stakeholders can develop more informed and effective strategies for evaluating and supporting successful music releases.