Week 9 | Data Dive — Regression Diagnostics


Loading the Billboard Hot 100 Number Ones Dataset


library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.2
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)

tuesdata <- tidytuesdayR::tt_load(2025, week = 34)
## ---- Compiling #TidyTuesday Information for 2025-08-26 ----
## --- There are 2 files available ---
## 
## 
## ── Downloading files ───────────────────────────────────────────────────────────
## 
##   1 of 2: "billboard.csv"
##   2 of 2: "topics.csv"
billboard <- tuesdata$billboard
topics <- tuesdata$topics

Preview of the Dataset


head(billboard)
## # A tibble: 6 × 105
##   song   artist date                weeks_at_number_one non_consecutive rating_1
##   <chr>  <chr>  <dttm>                            <dbl>           <dbl>    <dbl>
## 1 Poor … Ricky… 1958-08-04 00:00:00                   2               0        4
## 2 Nel B… Domen… 1958-08-18 00:00:00                   5               1        7
## 3 Littl… The E… 1958-08-25 00:00:00                   1               0        5
## 4 It's … Tommy… 1958-09-29 00:00:00                   6               0        3
## 5 It's … Conwa… 1958-11-10 00:00:00                   2               1        7
## 6 Tom D… The K… 1958-11-17 00:00:00                   1               0        5
## # ℹ 99 more variables: rating_2 <dbl>, rating_3 <dbl>, overall_rating <dbl>,
## #   divisiveness <dbl>, label <chr>, parent_label <chr>, cdr_genre <chr>,
## #   cdr_style <chr>, discogs_genre <chr>, discogs_style <chr>,
## #   artist_structure <dbl>, featured_artists <chr>,
## #   multiple_lead_vocalists <dbl>, group_named_after_non_lead_singer <dbl>,
## #   talent_contestant <chr>, posthumous <dbl>, artist_place_of_origin <chr>,
## #   front_person_age <dbl>, artist_male <dbl>, artist_white <dbl>, …
head(topics)
## # A tibble: 6 × 1
##   lyrical_topics   
##   <chr>            
## 1 Addiction        
## 2 Anger            
## 3 Appreciation     
## 4 Badassery        
## 5 Bad Behavior     
## 6 Bad Relationships

The Simple Linear Regression Model from Week 8


lm_bpm <- lm(weeks_at_number_one ~ bpm, data = billboard)

summary(lm_bpm)
## 
## Call:
## lm(formula = weeks_at_number_one ~ bpm, data = billboard)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.9634 -1.9363 -0.9395  1.0523 16.0718 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2.8979091  0.3723397   7.783 1.55e-14 ***
## bpm         0.0003743  0.0031440   0.119    0.905    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.646 on 1173 degrees of freedom
##   (2 observations deleted due to missingness)
## Multiple R-squared:  1.208e-05,  Adjusted R-squared:  -0.0008404 
## F-statistic: 0.01418 on 1 and 1173 DF,  p-value: 0.9052

Tying in More Variables to this Regression Model


Continuous Terms

  • bpm (beats per minute, provided by Spotify)

  • danceability (how “danceable” a song is, provided by Spotify)

Binary Terms

  • explicit (if Spotify labels a song as explicit, containing specific words or is overly sexual, violent, or drug related)

Interaction Terms

  • bpm:danceability (the effect bpm/tempo may have on the danceability score of a song)

Why These New Terms/Variables Should be Included or Not


  1. For bpm, also known as tempo, it may have an influence on how appealing a song may be. Faster tempo or slower tempo songs may correlate with how long a song spends at number one.
  2. For danceability, the 0 - 100 score here reflects how “danceable” a song is specifically for dancing, which could impact the popularity and longevity of a song at number one. In our world today, a song could simply become popular and remain on the charts longer for cultural or social reasons, for example, a dancing trend on TikTok. (we don’t know exactly how danceability is calculated, other than it’s provided by Spotify)
  3. For the explicit term, more sensitive topics or content that is deemed explicit may affect radio play and the target audience and in turn may affect chart longevity for number one songs. For example, when I was younger, my parents would only let me listen to the clean versions of songs and nothing explicit.
  4. Lastly, for the interaction variable bpm + danceability, the effect tempo or bpm has might depend on how danceable a song is. For instance, higher tempo songs may only perform well if they’re also considered highly danceable, and the inverse is true as well.

The New Regression Model


new_reg <- lm(weeks_at_number_one ~ bpm + danceability + explicit + bpm:danceability, data = billboard)

summary(new_reg)
## 
## Call:
## lm(formula = weeks_at_number_one ~ bpm + danceability + explicit + 
##     bpm:danceability, data = billboard)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.0077 -1.7405 -0.7675  0.6367 15.8937 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       2.659e+00  1.476e+00   1.802   0.0718 .  
## bpm              -5.022e-03  1.260e-02  -0.398   0.6904    
## danceability      1.147e-03  2.503e-02   0.046   0.9635    
## explicit          8.209e-01  2.005e-01   4.094 4.53e-05 ***
## bpm:danceability  8.599e-05  2.150e-04   0.400   0.6893    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.617 on 1170 degrees of freedom
##   (2 observations deleted due to missingness)
## Multiple R-squared:  0.02432,    Adjusted R-squared:  0.02099 
## F-statistic: 7.292 on 4 and 1170 DF,  p-value: 8.445e-06

Regression Model Summary Statistic Evaluation


The above regression model explains just about 2.4% of the variation in weeks at number one based on the R2 value, which indicates a sort of weak explanatory power overall. Of all the terms included in the model, the only term that is statistically significant is the explicit binary variable (p-value of 4.53 x 10-5) that was included, which suggests that the number one songs with some aspect of explicit content tend to spend more weeks at number one. The terms bpm (p-value of 0.6904), danceability (p-value of 0.9635), and their interaction (p-value of 0.6893) are not as statistically significant, showing that these features of songs may not influence chart longevity like explicit does. The overall model is statistically significant with a p value of 8.445 x 10-6 but with a low R2 value, there is low true significance.

Evaluating the Regression Model Using Diagnostic Plots


par(mfrow = c(2,2))
plot(new_reg)

plot(cooks.distance(new_reg),
     type = "h",
     main = "Cook's Distance",
     ylab = "Cook's Distance")

abline(h = 4/length(new_reg$fitted.values), col = "dodgerblue3", lty = 2)

Residuals vs. Fitted Plot


The residuals vs. fitted plot above shows more of a structured fit rather than a random scatter with clearly visible clustering. Based on this, the residuals vs. fitted plot suggests that the linear assumption may be violated. Also, the spread of the residuals seems a bit uneven across the fitted values, so there might be a bit of heteroscedasticity. Overall, based on the residuals vs. fitted plot, there’s moderate evidence that the regression model doesn’t truly capture the underlying relationship.

Q-Q Residuals Plot


The above Q-Q plot shows some substantial deviation overall near the far right upper tail of the reference line, showing the residuals aren’t normally distributed for the most part. The right tail of the line suggests the inclusion of “outliers” and/or that the response variable may be skewed. Based on this, the model may contain a moderate to more severe violation of the normality assumption.

Scale-Location Plot


The above scale-location plot shows a slight upward trend, indicating that the variance of the residuals increases together with the fitted values. This suggests a bit of heteroscedasticity, which means the constant variance assumption is also violated. The overall pattern is decently noticeable, and indicates a moderate concern.

Residuals vs. Leverage Plot


The above residuals vs. leverage plot shows a small number of observations with a higher leverage and larger residuals. None of the observations seem to exceed Cook’s distance bounds, but the observations may still have some sort of influence on the overall model, yet it is most likely limited.

Cook’s Distance Plot


Lastly, the Cook’s distance plot above shows multiple observations that exceed the usual threshold, indicating that some of the data points may have a disproportionate influence on the new regression model. Most of the observations have a lower influence, but these larger values suggest that the regression model might be sensitive to specific number one songs present in this Billboard Hot 100 Number Ones dataset.

Overall Conclusion


All in all, the new regression model for this week suggests multiple violations of key assumptions, for example, non-linearity, heteroscedasticity, and non-normal residuals, as well as a few potentially influential observations. Also, the model has a rather low explanatory power, again with an R2 of roughly 2.4%, which may suggest that the terms included in this specific model don’t meaningfully explain variation in the weeks at number one variable, besides the explicit term which was the only one that was statistically significant. That said, reiterating what was said previously in the regression model summary statistic evaluation section, the overall model is simply weaker. Based on the results from this model, there are more than likely absent terms that are better predictors than we have currently, and we can test for this in the future potentially. For example, future analyses could explore different terms or apply transformations, like we’ll be learning about in weeks 10 and 11, in order to improve model fit and improve regression assumptions.