library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.2
## ✔ ggplot2 4.0.0 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
tuesdata <- tidytuesdayR::tt_load(2025, week = 34)
## ---- Compiling #TidyTuesday Information for 2025-08-26 ----
## --- There are 2 files available ---
##
##
## ── Downloading files ───────────────────────────────────────────────────────────
##
## 1 of 2: "billboard.csv"
## 2 of 2: "topics.csv"
billboard <- tuesdata$billboard
topics <- tuesdata$topics
head(billboard)
## # A tibble: 6 × 105
## song artist date weeks_at_number_one non_consecutive rating_1
## <chr> <chr> <dttm> <dbl> <dbl> <dbl>
## 1 Poor … Ricky… 1958-08-04 00:00:00 2 0 4
## 2 Nel B… Domen… 1958-08-18 00:00:00 5 1 7
## 3 Littl… The E… 1958-08-25 00:00:00 1 0 5
## 4 It's … Tommy… 1958-09-29 00:00:00 6 0 3
## 5 It's … Conwa… 1958-11-10 00:00:00 2 1 7
## 6 Tom D… The K… 1958-11-17 00:00:00 1 0 5
## # ℹ 99 more variables: rating_2 <dbl>, rating_3 <dbl>, overall_rating <dbl>,
## # divisiveness <dbl>, label <chr>, parent_label <chr>, cdr_genre <chr>,
## # cdr_style <chr>, discogs_genre <chr>, discogs_style <chr>,
## # artist_structure <dbl>, featured_artists <chr>,
## # multiple_lead_vocalists <dbl>, group_named_after_non_lead_singer <dbl>,
## # talent_contestant <chr>, posthumous <dbl>, artist_place_of_origin <chr>,
## # front_person_age <dbl>, artist_male <dbl>, artist_white <dbl>, …
head(topics)
## # A tibble: 6 × 1
## lyrical_topics
## <chr>
## 1 Addiction
## 2 Anger
## 3 Appreciation
## 4 Badassery
## 5 Bad Behavior
## 6 Bad Relationships
lm_bpm <- lm(weeks_at_number_one ~ bpm, data = billboard)
summary(lm_bpm)
##
## Call:
## lm(formula = weeks_at_number_one ~ bpm, data = billboard)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.9634 -1.9363 -0.9395 1.0523 16.0718
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.8979091 0.3723397 7.783 1.55e-14 ***
## bpm 0.0003743 0.0031440 0.119 0.905
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.646 on 1173 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 1.208e-05, Adjusted R-squared: -0.0008404
## F-statistic: 0.01418 on 1 and 1173 DF, p-value: 0.9052
Continuous Terms
bpm (beats per minute, provided by Spotify)
danceability (how “danceable” a song is, provided by Spotify)
Binary Terms
Interaction Terms
new_reg <- lm(weeks_at_number_one ~ bpm + danceability + explicit + bpm:danceability, data = billboard)
summary(new_reg)
##
## Call:
## lm(formula = weeks_at_number_one ~ bpm + danceability + explicit +
## bpm:danceability, data = billboard)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.0077 -1.7405 -0.7675 0.6367 15.8937
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.659e+00 1.476e+00 1.802 0.0718 .
## bpm -5.022e-03 1.260e-02 -0.398 0.6904
## danceability 1.147e-03 2.503e-02 0.046 0.9635
## explicit 8.209e-01 2.005e-01 4.094 4.53e-05 ***
## bpm:danceability 8.599e-05 2.150e-04 0.400 0.6893
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.617 on 1170 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.02432, Adjusted R-squared: 0.02099
## F-statistic: 7.292 on 4 and 1170 DF, p-value: 8.445e-06
The above regression model explains just about 2.4% of the variation in weeks at number one based on the R2 value, which indicates a sort of weak explanatory power overall. Of all the terms included in the model, the only term that is statistically significant is the explicit binary variable (p-value of 4.53 x 10-5) that was included, which suggests that the number one songs with some aspect of explicit content tend to spend more weeks at number one. The terms bpm (p-value of 0.6904), danceability (p-value of 0.9635), and their interaction (p-value of 0.6893) are not as statistically significant, showing that these features of songs may not influence chart longevity like explicit does. The overall model is statistically significant with a p value of 8.445 x 10-6 but with a low R2 value, there is low true significance.
par(mfrow = c(2,2))
plot(new_reg)
plot(cooks.distance(new_reg),
type = "h",
main = "Cook's Distance",
ylab = "Cook's Distance")
abline(h = 4/length(new_reg$fitted.values), col = "dodgerblue3", lty = 2)
The residuals vs. fitted plot above shows more of a structured fit rather than a random scatter with clearly visible clustering. Based on this, the residuals vs. fitted plot suggests that the linear assumption may be violated. Also, the spread of the residuals seems a bit uneven across the fitted values, so there might be a bit of heteroscedasticity. Overall, based on the residuals vs. fitted plot, there’s moderate evidence that the regression model doesn’t truly capture the underlying relationship.
The above Q-Q plot shows some substantial deviation overall near the far right upper tail of the reference line, showing the residuals aren’t normally distributed for the most part. The right tail of the line suggests the inclusion of “outliers” and/or that the response variable may be skewed. Based on this, the model may contain a moderate to more severe violation of the normality assumption.
The above scale-location plot shows a slight upward trend, indicating that the variance of the residuals increases together with the fitted values. This suggests a bit of heteroscedasticity, which means the constant variance assumption is also violated. The overall pattern is decently noticeable, and indicates a moderate concern.
The above residuals vs. leverage plot shows a small number of observations with a higher leverage and larger residuals. None of the observations seem to exceed Cook’s distance bounds, but the observations may still have some sort of influence on the overall model, yet it is most likely limited.
Lastly, the Cook’s distance plot above shows multiple observations that exceed the usual threshold, indicating that some of the data points may have a disproportionate influence on the new regression model. Most of the observations have a lower influence, but these larger values suggest that the regression model might be sensitive to specific number one songs present in this Billboard Hot 100 Number Ones dataset.
All in all, the new regression model for this week suggests multiple violations of key assumptions, for example, non-linearity, heteroscedasticity, and non-normal residuals, as well as a few potentially influential observations. Also, the model has a rather low explanatory power, again with an R2 of roughly 2.4%, which may suggest that the terms included in this specific model don’t meaningfully explain variation in the weeks at number one variable, besides the explicit term which was the only one that was statistically significant. That said, reiterating what was said previously in the regression model summary statistic evaluation section, the overall model is simply weaker. Based on the results from this model, there are more than likely absent terms that are better predictors than we have currently, and we can test for this in the future potentially. For example, future analyses could explore different terms or apply transformations, like we’ll be learning about in weeks 10 and 11, in order to improve model fit and improve regression assumptions.