library(tidyverse, quietly = TRUE)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.2
## ✔ ggplot2 4.0.0 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr, quietly = TRUE)  # already attached by tidyverse; kept for explicitness
tuesdata <- tidytuesdayR::tt_load(2025, week = 34)
## ---- Compiling #TidyTuesday Information for 2025-08-26 ----
## --- There are 2 files available ---
##
##
## ── Downloading files ───────────────────────────────────────────────────────────
##
## 1 of 2: "billboard.csv"
## 2 of 2: "topics.csv"
billboard <- tuesdata$billboard
topics <- tuesdata$topics
head(billboard)
## # A tibble: 6 × 105
## song artist date weeks_at_number_one non_consecutive rating_1
## <chr> <chr> <dttm> <dbl> <dbl> <dbl>
## 1 Poor … Ricky… 1958-08-04 00:00:00 2 0 4
## 2 Nel B… Domen… 1958-08-18 00:00:00 5 1 7
## 3 Littl… The E… 1958-08-25 00:00:00 1 0 5
## 4 It's … Tommy… 1958-09-29 00:00:00 6 0 3
## 5 It's … Conwa… 1958-11-10 00:00:00 2 1 7
## 6 Tom D… The K… 1958-11-17 00:00:00 1 0 5
## # ℹ 99 more variables: rating_2 <dbl>, rating_3 <dbl>, overall_rating <dbl>,
## # divisiveness <dbl>, label <chr>, parent_label <chr>, cdr_genre <chr>,
## # cdr_style <chr>, discogs_genre <chr>, discogs_style <chr>,
## # artist_structure <dbl>, featured_artists <chr>,
## # multiple_lead_vocalists <dbl>, group_named_after_non_lead_singer <dbl>,
## # talent_contestant <chr>, posthumous <dbl>, artist_place_of_origin <chr>,
## # front_person_age <dbl>, artist_male <dbl>, artist_white <dbl>, …
head(topics)
## # A tibble: 6 × 1
## lyrical_topics
## <chr>
## 1 Addiction
## 2 Anger
## 3 Appreciation
## 4 Badassery
## 5 Bad Behavior
## 6 Bad Relationships
The binary response variable I will model is happy_song, a converted variable: it is 1 when a song’s happiness score is above the median of 66 and 0 otherwise.
summary(billboard$happiness)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 4.00 42.50 66.00 61.78 83.00 99.00 2
billboard$happy_song <- ifelse(billboard$happiness > 66, 1, 0)
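The cutoff of 66 is the median reported by summary(). As a minimal sketch, the same kind of binary variable could be derived without hard-coding the number. The happiness values below are made-up toy data, not rows from the Billboard dataset.

```r
# Toy happiness scores (hypothetical values, not from the billboard data)
happiness <- c(4, 42, 66, 83, 99, NA)

# Use the median as the cutoff instead of typing 66 by hand
threshold <- median(happiness, na.rm = TRUE)

# Strict ">" means songs exactly at the median are coded 0;
# a missing happiness score stays NA rather than becoming 0
happy_song <- as.integer(happiness > threshold)

table(happy_song, useNA = "ifany")
```

Because the comparison is strict, songs that sit exactly on the median fall into the 0 group, so the two classes are not guaranteed to be the same size.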
The explanatory variables are:
danceability
energy
bpm
loudness_d_b
happy_model <- glm(happy_song ~ danceability + energy + bpm + loudness_d_b, data = billboard, family = "binomial")
summary(happy_model)
##
## Call:
## glm(formula = happy_song ~ danceability + energy + bpm + loudness_d_b,
## family = "binomial", data = billboard)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.783124 0.767207 -12.752 <2e-16 ***
## danceability 0.049597 0.005218 9.506 <2e-16 ***
## energy 0.071308 0.005673 12.569 <2e-16 ***
## bpm -0.002289 0.002823 -0.811 0.417
## loudness_d_b -0.295241 0.030122 -9.801 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1628.8 on 1174 degrees of freedom
## Residual deviance: 1279.7 on 1170 degrees of freedom
## (2 observations deleted due to missingness)
## AIC: 1289.7
##
## Number of Fisher Scoring iterations: 4
Intercept: In the logistic regression happy_model above, the intercept is roughly -9.78. It represents the log odds of a song being classified as “happy” when all four predictors are equal to zero. On its own the intercept is not very meaningful, since a song with a bpm of zero is not realistic, but it serves as the baseline from which the other coefficients shift the log odds.
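To make the intercept more concrete, the log odds can be converted to a probability with the inverse logit. This is a small sketch that uses the intercept value copied from the summary output above.

```r
# Intercept copied from the summary(happy_model) output
intercept <- -9.783124

# Inverse logit by hand: p = 1 / (1 + exp(-log_odds))
p_manual <- 1 / (1 + exp(-intercept))

# plogis() is base R's built-in inverse logit (the logistic CDF)
p_builtin <- plogis(intercept)

c(manual = p_manual, builtin = p_builtin)
```

Both values come out below 0.01%, which fits the point that the intercept is a baseline rather than a meaningful prediction.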
Danceability: The coefficient for danceability is about 0.0496 and is statistically significant (p < 2e-16). Because it is positive, the log odds of a song being classified as “happy” rise as danceability increases, holding the other predictors constant. More simply, the more danceable a song is, the more likely it is to be “happy”. This makes sense in a real-world context, since a song people tend to dance to most likely has a more upbeat or positive vibe.
Energy: The model yields a coefficient of 0.0713 for energy. Like danceability, it is statistically significant (p < 2e-16) and positive, so as a song’s energy increases, so does the likelihood that it is categorized as “happy”, holding the other predictors constant. This reinforces the idea that how intense and energetic a song is contributes to how “happy” it may be.
BPM: The coefficient for bpm is -0.0023, but it is not statistically significant (p = 0.417). After accounting for danceability, energy, and loudness, the model finds no evidence of a linear relationship between a song’s tempo and the log odds of it being categorized as “happy”. In other words, whether a song is faster or slower does not appear to influence its happiness classification much in this model.
Loudness: The coefficient for loudness is about -0.2952 and is statistically significant (p < 2e-16), but negative. So, among songs with the same danceability, energy, and bpm, the louder a song is, the less likely it is to be categorized as “happy”. This is interesting next to the positive danceability and energy coefficients and seems a bit counterintuitive. One possible reading is that louder songs come across as more aggressive or convey over-the-top negative feelings, rather than the positive feelings, like happiness, that we might expect.
Overall, happy_model suggests that danceability and energy are strong positive predictors of whether a song is classified as “happy”, that loudness has a negative relationship with happiness, and that bpm/tempo plays little role in the model. These findings could indicate that the emotional vibe of a song is driven upward by rhythm and energy rather than by its speed or how loud it is.
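Log odds are hard to read directly, so one common step is to exponentiate the coefficients into odds ratios. The sketch below hard-codes the estimates from the summary output so that it runs on its own. With the fitted model in memory, exp(coef(happy_model)) should give the same numbers.

```r
# Estimates copied from the summary(happy_model) coefficient table
coefs <- c(
  danceability = 0.049597,
  energy       = 0.071308,
  bpm          = -0.002289,
  loudness_d_b = -0.295241
)

# exp(coefficient) = multiplicative change in the odds of "happy"
# for a one-unit increase in that predictor, others held constant
odds_ratios <- exp(coefs)

# Same information expressed as a percent change in the odds
pct_change <- (odds_ratios - 1) * 100

round(cbind(odds_ratio = odds_ratios, pct_change = pct_change), 3)
```

An odds ratio above 1 (danceability, energy) means higher odds of “happy” per unit increase, and one below 1 (loudness_d_b) means lower odds. The bpm ratio sits very close to 1, in line with its non-significant coefficient.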
confint(happy_model, 'danceability')
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## 0.03953179 0.06000067
confint(happy_model, 'energy')
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## 0.06041047 0.08266486
As you can see, the 95% confidence interval for the danceability coefficient is roughly 0.0395 to 0.0600. The interval does not include 0, so we can conclude that danceability has a statistically significant, positive effect (at the 5% level) on the log odds of a song in the dataset being categorized as “happy”. Put differently, we are 95% confident that each one-unit increase in danceability raises the log odds of a song being “happy” by an amount within this range, holding the other predictors constant, which supports the idea that more danceable songs are more likely to be “happy”.
The 95% confidence interval for the energy coefficient is roughly 0.0604 to 0.0827. Again, since the interval does not include 0, energy is statistically significant at the 5% level and has a positive effect on the log odds of a song in the dataset being categorized as “happy”. We are 95% confident that each one-unit increase in energy raises the log odds of a song being “happy” by an amount within this range, holding the other predictors constant, which is strong evidence that songs with higher energy are more likely to be classified as “happy”.
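The same intervals can be moved to the odds-ratio scale by exponentiating the endpoints. This is a minimal sketch, with the profile-likelihood bounds copied from the confint() output above. The Wald interval (estimate ± about 1.96 standard errors) is added only as an approximate cross-check and is not what confint() computed here.

```r
# Profile-likelihood 95% bounds copied from the confint() output
ci_log_odds <- rbind(
  danceability = c(lower = 0.03953179, upper = 0.06000067),
  energy       = c(lower = 0.06041047, upper = 0.08266486)
)

# Exponentiating the endpoints gives a 95% interval for each odds ratio
ci_odds_ratio <- exp(ci_log_odds)

# Wald approximation from the summary table: estimate +/- z * std. error
est <- c(danceability = 0.049597, energy = 0.071308)
se  <- c(danceability = 0.005218, energy = 0.005673)
z   <- qnorm(0.975)
ci_wald <- cbind(lower = est - z * se, upper = est + z * se)

round(ci_odds_ratio, 3)
round(ci_wald, 4)
```

Both odds-ratio intervals lie entirely above 1, which is the odds-scale counterpart of the log-odds intervals excluding 0.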