library(tidyverse)
library(dplyr)
tuesdata <- tidytuesdayR::tt_load(2025, week = 34)
billboard <- tuesdata$billboard
topics <- tuesdata$topics
head(billboard)
## # A tibble: 6 × 105
## song artist date weeks_at_number_one non_consecutive rating_1
## <chr> <chr> <dttm> <dbl> <dbl> <dbl>
## 1 Poor … Ricky… 1958-08-04 00:00:00 2 0 4
## 2 Nel B… Domen… 1958-08-18 00:00:00 5 1 7
## 3 Littl… The E… 1958-08-25 00:00:00 1 0 5
## 4 It's … Tommy… 1958-09-29 00:00:00 6 0 3
## 5 It's … Conwa… 1958-11-10 00:00:00 2 1 7
## 6 Tom D… The K… 1958-11-17 00:00:00 1 0 5
## # ℹ 99 more variables: rating_2 <dbl>, rating_3 <dbl>, overall_rating <dbl>,
## # divisiveness <dbl>, label <chr>, parent_label <chr>, cdr_genre <chr>,
## # cdr_style <chr>, discogs_genre <chr>, discogs_style <chr>,
## # artist_structure <dbl>, featured_artists <chr>,
## # multiple_lead_vocalists <dbl>, group_named_after_non_lead_singer <dbl>,
## # talent_contestant <chr>, posthumous <dbl>, artist_place_of_origin <chr>,
## # front_person_age <dbl>, artist_male <dbl>, artist_white <dbl>, …
head(topics)
## # A tibble: 6 × 1
## lyrical_topics
## <chr>
## 1 Addiction
## 2 Anger
## 3 Appreciation
## 4 Badassery
## 5 Bad Behavior
## 6 Bad Relationships
lm_bpm <- lm(weeks_at_number_one ~ bpm, data = billboard)
summary(lm_bpm)
##
## Call:
## lm(formula = weeks_at_number_one ~ bpm, data = billboard)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.9634 -1.9363 -0.9395 1.0523 16.0718
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.8979091 0.3723397 7.783 1.55e-14 ***
## bpm 0.0003743 0.0031440 0.119 0.905
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.646 on 1173 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 1.208e-05, Adjusted R-squared: -0.0008404
## F-statistic: 0.01418 on 1 and 1173 DF, p-value: 0.9052
Continuous Terms
bpm (beats per minute, provided by Spotify)
danceability (how “danceable” a song is, provided by Spotify)
Binary Terms
Interaction Terms
new_reg <- lm(weeks_at_number_one ~ bpm + danceability + explicit + bpm:danceability, data = billboard)
summary(new_reg)
##
## Call:
## lm(formula = weeks_at_number_one ~ bpm + danceability + explicit +
## bpm:danceability, data = billboard)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.0077 -1.7405 -0.7675 0.6367 15.8937
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.659e+00 1.476e+00 1.802 0.0718 .
## bpm -5.022e-03 1.260e-02 -0.398 0.6904
## danceability 1.147e-03 2.503e-02 0.046 0.9635
## explicit 8.209e-01 2.005e-01 4.094 4.53e-05 ***
## bpm:danceability 8.599e-05 2.150e-04 0.400 0.6893
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.617 on 1170 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.02432, Adjusted R-squared: 0.02099
## F-statistic: 7.292 on 4 and 1170 DF, p-value: 8.445e-06
The above regression model explains just about 2.4% of the variation in weeks at number one based on the R2 value, which indicates a sort of weak explanatory power overall. Of all the terms included in the model, the only term that is statistically significant is the explicit binary variable (p-value of 4.53 x 10-5) that was included, which suggests that the number one songs with some aspect of explicit content tend to spend more weeks at number one. The terms bpm (p-value of 0.6904), danceability (p-value of 0.9635), and their interaction (p-value of 0.6893) are not as statistically significant, showing that these features of songs may not influence chart longevity like explicit does. The overall model is statistically significant with a p value of 8.445 x 10-6 but with a low R2 value, there is low true significance.
plot(new_reg$fitted.values, resid(new_reg),
xlab = "Fitted Values",
ylab = "Residuals",
main = "Residuals vs Fitted Values")
abline(h = 0, col = "red3")
model_data <- model.frame(new_reg)
plot(model_data$bpm, resid(new_reg),
xlab = "BPM",
ylab = "Residuals",
main = "Residuals vs BPM")
abline(h = 0, col = "red3")
plot(model_data$danceability, resid(new_reg),
xlab = "Danceability",
ylab = "Residuals",
main = "Residuals vs Danceability")
abline(h = 0, col = "red3")
library(GGally)
ggcorr(select(billboard,
explicit,
danceability,
bpm,
weeks_at_number_one),
label = TRUE) +
labs(title = "Correlation Heatmap")
qqnorm(resid(new_reg),
main = "QQ Plot of Residuals")
qqline(resid(new_reg), col = "red3")
plot(cooks.distance(new_reg),
type = "h",
main = "Cook's Distance",
ylab = "Cook's Distance")
abline(h = 4/length(new_reg$fitted.values), col = "red3", lty = 2)
The residuals vs. fitted plot above shows more of a structured fit rather than a random scatter with clearly visible clustering. Based on this, the residuals vs. fitted plot suggests that the linear assumption may be violated. Also, the spread of the residuals seems a bit uneven across the fitted values, so there might be a bit of heteroscedasticity. Overall, based on the residuals vs. fitted plot, there’s moderate evidence that the regression model doesn’t truly capture the underlying relationship.
The above Q-Q plot shows some substantial deviation overall near the far right upper tail of the reference line, showing the residuals aren’t normally distributed for the most part. The right tail of the line suggests the inclusion of “outliers” and/or that the response variable may be skewed. Based on this, the model may contain a moderate to more severe violation of the normality assumption.
For the Residuals vs BPM plot, the residuals are scattered around the zero line with no noticeable trend, so we can make the assumption that this is rather linear. On the other hand, some residuals reach up to roughly 15 or a bit higher, and go to about -2 to -3 on the lower end, which could suggest the errors are more right-skewed rather than more distributed around zero. This could be a bit concerning regarding homoscedasticity and normality, and future transformation of weeks_at_number_one could be beneficial. For the Residuals vs Danceability plot, similar to that of the Residuals vs BPM plot, the residuals scatter around zero, and suggests linearity pretty well. Also, the range of values from ~ -2 to ~15 is present here as well, and raises similar concerns about normality and homoscedasticity with a more right-skewed upper tail. For danceability, unlike the bpm plot, the overall spread doesn’t widen, so there’s a bit less of a variance problem here.
For the correlation heatmap, all predictor pair correlations are within the range 0.0 to 0.3, showing that the predictors are pretty much independent of each other, multicollinearity isn’t a concern in this instance, and that the overall correlation between them is weak or nearly non-existent.
Lastly, the Cook’s distance plot above shows multiple observations that exceed the usual threshold, indicating that some of the data points may have a disproportionate influence on the new regression model. Most of the observations have a lower influence, but these larger values suggest that the regression model might be sensitive to specific number one songs present in this Billboard Hot 100 Number Ones dataset.
All in all, the new regression model for this week suggests multiple violations of key assumptions, for example, non-linearity, heteroscedasticity, and non-normal residuals, as well as a few potentially influential observations. Also, the model has a rather low explanatory power, again with an R2 of roughly 2.4%, which may suggest that the terms included in this specific model don’t meaningfully explain variation in the weeks at number one variable, besides the explicit term which was the only one that was statistically significant. That said, reiterating what was said previously in the regression model summary statistic evaluation section, the overall model is simply weaker. Based on the results from this model, there are more than likely absent terms that are better predictors than we have currently, and we can test for this in the future potentially. For example, future analyses could explore different terms or apply transformations, like we’ll be learning about in weeks 10 and 11, in order to improve model fit and improve regression assumptions.