Week 6 | Data Dive — Confidence Intervals


Loading my Billboard Hot 100 Number Ones Dataset


library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.2
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)

tuesdata <- tidytuesdayR::tt_load(2025, week = 34)
## ---- Compiling #TidyTuesday Information for 2025-08-26 ----
## --- There are 2 files available ---
## 
## 
## ── Downloading files ───────────────────────────────────────────────────────────
## 
##   1 of 2: "billboard.csv"
##   2 of 2: "topics.csv"
billboard <- tuesdata$billboard
topics <- tuesdata$topics

Head of the Dataset


head(billboard)
## # A tibble: 6 × 105
##   song   artist date                weeks_at_number_one non_consecutive rating_1
##   <chr>  <chr>  <dttm>                            <dbl>           <dbl>    <dbl>
## 1 Poor … Ricky… 1958-08-04 00:00:00                   2               0        4
## 2 Nel B… Domen… 1958-08-18 00:00:00                   5               1        7
## 3 Littl… The E… 1958-08-25 00:00:00                   1               0        5
## 4 It's … Tommy… 1958-09-29 00:00:00                   6               0        3
## 5 It's … Conwa… 1958-11-10 00:00:00                   2               1        7
## 6 Tom D… The K… 1958-11-17 00:00:00                   1               0        5
## # ℹ 99 more variables: rating_2 <dbl>, rating_3 <dbl>, overall_rating <dbl>,
## #   divisiveness <dbl>, label <chr>, parent_label <chr>, cdr_genre <chr>,
## #   cdr_style <chr>, discogs_genre <chr>, discogs_style <chr>,
## #   artist_structure <dbl>, featured_artists <chr>,
## #   multiple_lead_vocalists <dbl>, group_named_after_non_lead_singer <dbl>,
## #   talent_contestant <chr>, posthumous <dbl>, artist_place_of_origin <chr>,
## #   front_person_age <dbl>, artist_male <dbl>, artist_white <dbl>, …
head(topics)
## # A tibble: 6 × 1
##   lyrical_topics   
##   <chr>            
## 1 Addiction        
## 2 Anger            
## 3 Appreciation     
## 4 Badassery        
## 5 Bad Behavior     
## 6 Bad Relationships

Choosing Two Numeric Variables and Pairing Each with a Column I’ve Built


Pair #1: Response Variable (Original) & Explanatory Variable (Created)

  • Response Variable (Original)

    • weeks_at_number_one
  • Explanatory Variable (Created)

    • mean_rating = (rating_1 + rating_2 + rating_3) / 3
billboard2 <- billboard |>
  mutate(
    mean_rating = (rating_1 + rating_2 + rating_3) / 3
  )
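
One thing to watch with the sum-and-divide formula: if any of the three ratings is missing for a song, mean_rating comes out as NA for that song. The sketch below uses a made-up three-row data frame (the values are hypothetical, not taken from the Billboard data) and base R's rowMeans() to compare the strict formula with a mean over whichever ratings are available. Whether averaging over fewer ratings is appropriate here is a judgment call, not something the dataset settles.

```r
# Hypothetical toy ratings; NA marks a missing rating
toy <- data.frame(
  rating_1 = c(4, 7, 5),
  rating_2 = c(6, NA, 5),
  rating_3 = c(5, 8, NA)
)

# Strict version: any missing rating makes the whole mean NA
toy$mean_strict <- (toy$rating_1 + toy$rating_2 + toy$rating_3) / 3

# Lenient version: average over the ratings that are present
toy$mean_available <- rowMeans(
  toy[, c("rating_1", "rating_2", "rating_3")],
  na.rm = TRUE
)

print(toy)
```

The cor() calls later in this report use use = "complete.obs", so rows where mean_rating is NA are dropped there rather than causing an error.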

Pair #2: Continuous Variable (Original) & Calculated Variable (Created)

  • Continuous Variable (Original)

    • length_sec
  • Calculated Variable (Created)

    • instrumental_share = instrumental_length_sec / length_sec
billboard3 <- billboard |>
  mutate(
    instrumental_share = instrumental_length_sec / length_sec
  )

Visualizations for Each Relationship


Pair #1 Visualization

ggplot(billboard2, aes(x = mean_rating, y = weeks_at_number_one)) +
  geom_jitter(width = 0.05, height = 0.1, alpha = 0.6) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    x = "Mean Judge Rating",
    y = "Weeks at Number One",
    title = "Do Higher-Rated Songs Stay #1 Longer?"
  ) +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

Pair #2 Visualization

ggplot(billboard3, aes(x = length_sec, y = instrumental_share)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "loess", se = FALSE) +
  labs(
    x = "Song Length (seconds)",
    y = "Share of Song That Is Instrumental",
    title = "Are Longer #1 Songs More Instrumental?"
  ) +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

Drawing Conclusions Based on the Visualizations (Scrutinizing, e.g. Are there any outliers?)


Pair #1:

The plot of mean judge rating against weeks at number one shows a weak positive association. Higher-rated songs tend to stay at number one slightly longer on average, but the relationship is noisy and varies quite a bit. Several outliers with long runs at number one may pull on the fitted line, and many highly rated songs still stay at number one only briefly. The visualization suggests that judge ratings alone do not strongly predict chart longevity for number one songs.

Pair #2:

The plot of total song length against the share of the song that is instrumental shows no strong relationship between the two variables. Most number one songs cluster at a low instrumental share regardless of length, indicating that number one songs are mainly vocal. A small number of almost entirely instrumental songs appear as outliers, and a few very long songs influence the curve at the far right of the length axis, but overall, song length says little about how instrumental a song is.

Calculating the Appropriate Correlation Coefficient for Each Pair


Pair #1:

cor(billboard2$mean_rating,
    billboard2$weeks_at_number_one,
    use = "complete.obs",
    method = "pearson")
## [1] 0.1401134

Pair #2:

cor(billboard3$length_sec,
    billboard3$instrumental_share,
    use = "complete.obs",
    method = "spearman")
## [1] 0.07036449

Why the Correlation Coefficients Make Sense (or not) Based on the Previous Visualizations


Pair #1:

The correlation coefficient of roughly 0.14 is small but positive, which matches the visualization above. There’s a slight upward trend, but with considerable variability and clustering at low values of weeks at number one. Higher ratings are therefore only weakly associated with a longer stay at number one.

Pair #2:

The Spearman correlation coefficient of about 0.07 is close to 0, which also matches the visualization: there’s no real relationship between song length and instrumental share. Most songs have a low instrumental share regardless of length, and the slight curve in the smoothed line doesn’t suggest a meaningful overall association.
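
Pair #1 uses Pearson and Pair #2 uses Spearman, so it may help to see how the two coefficients differ. The sketch below uses made-up vectors (not the Billboard data) where y always increases with x but not in a straight line. Pearson measures linear association, while Spearman works on ranks and so measures monotonic association, which also makes it less sensitive to extreme values.

```r
# Hypothetical data: y rises with x, but exponentially rather than linearly
x <- 1:20
y <- exp(x / 2)

# Pearson: strength of the *linear* relationship
pearson_r <- cor(x, y, method = "pearson")

# Spearman: Pearson correlation computed on the ranks of x and y
spearman_r <- cor(x, y, method = "spearman")

print(c(pearson = pearson_r, spearman = spearman_r))
```

Because the ranks of x and y line up perfectly, Spearman is 1 here, while Pearson is lower because the points do not fall on a straight line.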

Building a Confidence Interval for Each of the Response Variable(s)


Pair #1:

weeks_ci <- t.test(billboard2$weeks_at_number_one, conf.level = 0.95)
weeks_ci
## 
##  One Sample t-test
## 
## data:  billboard2$weeks_at_number_one
## t = 38.128, df = 1176, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  2.786796 3.089160
## sample estimates:
## mean of x 
##  2.937978

Pair #2:

instr_ci <- t.test(billboard3$instrumental_share, conf.level = 0.95)
instr_ci
## 
##  One Sample t-test
## 
## data:  billboard3$instrumental_share
## t = 39.582, df = 1176, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  0.1887487 0.2084363
## sample estimates:
## mean of x 
## 0.1985925
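
For reference, the interval reported by t.test() is the usual one-sample t interval: sample mean plus or minus a t critical value times the standard error. The sketch below computes it by hand on simulated counts (the vector x is hypothetical, generated only for illustration) and compares it with the built-in result.

```r
set.seed(42)

# Hypothetical sample standing in for a numeric column
x <- rpois(200, lambda = 3)

n <- length(x)
se <- sd(x) / sqrt(n)             # standard error of the mean
t_crit <- qt(0.975, df = n - 1)   # two-sided 95% critical value

# Manual interval: mean +/- t_crit * se
manual_ci <- mean(x) + c(-1, 1) * t_crit * se

# Built-in interval from t.test()
builtin_ci <- as.numeric(t.test(x, conf.level = 0.95)$conf.int)

print(rbind(manual = manual_ci, builtin = builtin_ci))
```

The t statistic and p-value in the output above come from testing against a true mean of 0, which is t.test()'s default null hypothesis; only the confidence interval and sample mean are of interest here.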

Conclusion of the Response Variable(s) Based on the Confidence Interval


Pair #1:

We can be 95% confident that, in the population of all songs that have reached #1 on the Billboard Hot 100, the mean time at the top of the chart is between about 2.79 and 3.09 weeks (sample mean 2.94), or just about 3 weeks, even though most individual songs have a shorter run than that.

Pair #2:

We can be 95% confident that, in the population of all songs that have reached #1 on the Billboard Hot 100, the mean instrumental share is between about 18.9% and 20.8% (sample mean 19.9%). On average only about 20% of a song’s length is instrumental, which reinforces that number one hits are centered far more on vocals than on instrumental sections.
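
Since weeks at number one is skewed by a few long runs, one optional cross-check on the t interval is a percentile bootstrap. This is a different technique from the t.test() call used above, shown only as a rough sketch on simulated data; the weeks vector below is hypothetical and is not the Billboard column.

```r
set.seed(2025)

# Hypothetical right-skewed counts (minimum of 1 week)
weeks <- rgeom(300, prob = 0.35) + 1

# Resample with replacement and record the mean of each resample
boot_means <- replicate(2000, mean(sample(weeks, replace = TRUE)))

# Percentile bootstrap 95% interval for the mean
boot_ci <- quantile(boot_means, probs = c(0.025, 0.975))

# t-based interval on the same data, for comparison
t_ci <- t.test(weeks, conf.level = 0.95)$conf.int

print(boot_ci)
print(t_ci)
```

If the two intervals are close, as they often are when the sample is large, the skew is probably not distorting the t interval much.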