Week 9 | Data Dive — Regression Diagnostics


Loading the Billboard Hot 100 Number Ones Dataset


library(tidyverse)
library(dplyr)

tuesdata <- tidytuesdayR::tt_load(2025, week = 34)

billboard <- tuesdata$billboard
topics <- tuesdata$topics

Preview of the Dataset


head(billboard)
## # A tibble: 6 × 105
##   song   artist date                weeks_at_number_one non_consecutive rating_1
##   <chr>  <chr>  <dttm>                            <dbl>           <dbl>    <dbl>
## 1 Poor … Ricky… 1958-08-04 00:00:00                   2               0        4
## 2 Nel B… Domen… 1958-08-18 00:00:00                   5               1        7
## 3 Littl… The E… 1958-08-25 00:00:00                   1               0        5
## 4 It's … Tommy… 1958-09-29 00:00:00                   6               0        3
## 5 It's … Conwa… 1958-11-10 00:00:00                   2               1        7
## 6 Tom D… The K… 1958-11-17 00:00:00                   1               0        5
## # ℹ 99 more variables: rating_2 <dbl>, rating_3 <dbl>, overall_rating <dbl>,
## #   divisiveness <dbl>, label <chr>, parent_label <chr>, cdr_genre <chr>,
## #   cdr_style <chr>, discogs_genre <chr>, discogs_style <chr>,
## #   artist_structure <dbl>, featured_artists <chr>,
## #   multiple_lead_vocalists <dbl>, group_named_after_non_lead_singer <dbl>,
## #   talent_contestant <chr>, posthumous <dbl>, artist_place_of_origin <chr>,
## #   front_person_age <dbl>, artist_male <dbl>, artist_white <dbl>, …
head(topics)
## # A tibble: 6 × 1
##   lyrical_topics   
##   <chr>            
## 1 Addiction        
## 2 Anger            
## 3 Appreciation     
## 4 Badassery        
## 5 Bad Behavior     
## 6 Bad Relationships

The Simple Linear Regression Model from Week 8


lm_bpm <- lm(weeks_at_number_one ~ bpm, data = billboard)

summary(lm_bpm)
## 
## Call:
## lm(formula = weeks_at_number_one ~ bpm, data = billboard)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.9634 -1.9363 -0.9395  1.0523 16.0718 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2.8979091  0.3723397   7.783 1.55e-14 ***
## bpm         0.0003743  0.0031440   0.119    0.905    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.646 on 1173 degrees of freedom
##   (2 observations deleted due to missingness)
## Multiple R-squared:  1.208e-05,  Adjusted R-squared:  -0.0008404 
## F-statistic: 0.01418 on 1 and 1173 DF,  p-value: 0.9052

Tying in More Variables to this Regression Model


Continuous Terms

  • bpm (beats per minute, provided by Spotify)

  • danceability (how “danceable” a song is, provided by Spotify)

Binary Terms

  • explicit (if Spotify labels a song as explicit, containing specific words or is overly sexual, violent, or drug related)

Interaction Terms

  • bpm:danceability (the effect bpm/tempo may have on the danceability score of a song)

Why These New Terms/Variables Should be Included or Not


  1. For bpm, also known as tempo, it may have an influence on how appealing a song may be. Faster tempo or slower tempo songs may correlate with how long a song spends at number one.
  2. For danceability, the 0 - 100 score here reflects how “danceable” a song is specifically for dancing, which could impact the popularity and longevity of a song at number one. In our world today, a song could simply become popular and remain on the charts longer for cultural or social reasons, for example, a dancing trend on TikTok. (we don’t know exactly how danceability is calculated, other than it’s provided by Spotify)
  3. For the explicit term, more sensitive topics or content that is deemed explicit may affect radio play and the target audience and in turn may affect chart longevity for number one songs. For example, when I was younger, my parents would only let me listen to the clean versions of songs and nothing explicit.
  4. Lastly, for the interaction variable bpm + danceability, the effect tempo or bpm has might depend on how danceable a song is. For instance, higher tempo songs may only perform well if they’re also considered highly danceable, and the inverse is true as well.

The New Regression Model


new_reg <- lm(weeks_at_number_one ~ bpm + danceability + explicit + bpm:danceability, data = billboard)

summary(new_reg)
## 
## Call:
## lm(formula = weeks_at_number_one ~ bpm + danceability + explicit + 
##     bpm:danceability, data = billboard)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.0077 -1.7405 -0.7675  0.6367 15.8937 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       2.659e+00  1.476e+00   1.802   0.0718 .  
## bpm              -5.022e-03  1.260e-02  -0.398   0.6904    
## danceability      1.147e-03  2.503e-02   0.046   0.9635    
## explicit          8.209e-01  2.005e-01   4.094 4.53e-05 ***
## bpm:danceability  8.599e-05  2.150e-04   0.400   0.6893    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.617 on 1170 degrees of freedom
##   (2 observations deleted due to missingness)
## Multiple R-squared:  0.02432,    Adjusted R-squared:  0.02099 
## F-statistic: 7.292 on 4 and 1170 DF,  p-value: 8.445e-06

Regression Model Summary Statistic Evaluation


The above regression model explains just about 2.4% of the variation in weeks at number one based on the R2 value, which indicates a sort of weak explanatory power overall. Of all the terms included in the model, the only term that is statistically significant is the explicit binary variable (p-value of 4.53 x 10-5) that was included, which suggests that the number one songs with some aspect of explicit content tend to spend more weeks at number one. The terms bpm (p-value of 0.6904), danceability (p-value of 0.9635), and their interaction (p-value of 0.6893) are not as statistically significant, showing that these features of songs may not influence chart longevity like explicit does. The overall model is statistically significant with a p value of 8.445 x 10-6 but with a low R2 value, there is low true significance.

Evaluating the Regression Model Using Diagnostic Plots


Residuals vs Fitted Values


plot(new_reg$fitted.values, resid(new_reg),
     xlab = "Fitted Values",
     ylab = "Residuals",
     main = "Residuals vs Fitted Values")
abline(h = 0, col = "red3")

Residuals vs X Values


model_data <- model.frame(new_reg)

plot(model_data$bpm, resid(new_reg),
     xlab = "BPM",
     ylab = "Residuals",
     main = "Residuals vs BPM")
abline(h = 0, col = "red3")

plot(model_data$danceability, resid(new_reg),
     xlab = "Danceability",
     ylab = "Residuals",
     main = "Residuals vs Danceability")
abline(h = 0, col = "red3")

Correlation Heatmap


library(GGally)

ggcorr(select(billboard,
              explicit,
              danceability,
              bpm,
              weeks_at_number_one),
       label = TRUE) +
  labs(title = "Correlation Heatmap")

QQ Plot


qqnorm(resid(new_reg),
       main = "QQ Plot of Residuals")
qqline(resid(new_reg), col = "red3")

Cook’s Distance


plot(cooks.distance(new_reg),
     type = "h",
     main = "Cook's Distance",
     ylab = "Cook's Distance")

abline(h = 4/length(new_reg$fitted.values), col = "red3", lty = 2)

Residuals vs. Fitted Plot


The residuals vs. fitted plot above shows more of a structured fit rather than a random scatter with clearly visible clustering. Based on this, the residuals vs. fitted plot suggests that the linear assumption may be violated. Also, the spread of the residuals seems a bit uneven across the fitted values, so there might be a bit of heteroscedasticity. Overall, based on the residuals vs. fitted plot, there’s moderate evidence that the regression model doesn’t truly capture the underlying relationship.

Q-Q Residuals Plot


The above Q-Q plot shows some substantial deviation overall near the far right upper tail of the reference line, showing the residuals aren’t normally distributed for the most part. The right tail of the line suggests the inclusion of “outliers” and/or that the response variable may be skewed. Based on this, the model may contain a moderate to more severe violation of the normality assumption.

Residuals vs X Values Plot


For the Residuals vs BPM plot, the residuals are scattered around the zero line with no noticeable trend, so we can make the assumption that this is rather linear. On the other hand, some residuals reach up to roughly 15 or a bit higher, and go to about -2 to -3 on the lower end, which could suggest the errors are more right-skewed rather than more distributed around zero. This could be a bit concerning regarding homoscedasticity and normality, and future transformation of weeks_at_number_one could be beneficial. For the Residuals vs Danceability plot, similar to that of the Residuals vs BPM plot, the residuals scatter around zero, and suggests linearity pretty well. Also, the range of values from ~ -2 to ~15 is present here as well, and raises similar concerns about normality and homoscedasticity with a more right-skewed upper tail. For danceability, unlike the bpm plot, the overall spread doesn’t widen, so there’s a bit less of a variance problem here.

Correlation Heatmap Plot


For the correlation heatmap, all predictor pair correlations are within the range 0.0 to 0.3, showing that the predictors are pretty much independent of each other, multicollinearity isn’t a concern in this instance, and that the overall correlation between them is weak or nearly non-existent.

Cook’s Distance Plot


Lastly, the Cook’s distance plot above shows multiple observations that exceed the usual threshold, indicating that some of the data points may have a disproportionate influence on the new regression model. Most of the observations have a lower influence, but these larger values suggest that the regression model might be sensitive to specific number one songs present in this Billboard Hot 100 Number Ones dataset.

Overall Conclusion


All in all, the new regression model for this week suggests multiple violations of key assumptions, for example, non-linearity, heteroscedasticity, and non-normal residuals, as well as a few potentially influential observations. Also, the model has a rather low explanatory power, again with an R2 of roughly 2.4%, which may suggest that the terms included in this specific model don’t meaningfully explain variation in the weeks at number one variable, besides the explicit term which was the only one that was statistically significant. That said, reiterating what was said previously in the regression model summary statistic evaluation section, the overall model is simply weaker. Based on the results from this model, there are more than likely absent terms that are better predictors than we have currently, and we can test for this in the future potentially. For example, future analyses could explore different terms or apply transformations, like we’ll be learning about in weeks 10 and 11, in order to improve model fit and improve regression assumptions.