Grammy Record Collaborations Exploration

Research question:

  • Are collaborations increasing over time? (using a proxy: credited artist count)

  • Do collaborations differ between the older and the modern era?

Data we will use

This deck uses an illustrative dataset with Grammy-style structure:

  • year: 1990 to 2025

  • nominee_rank: 1 is the winner, 2–5 are nominees

  • is_winner: TRUE for the winning record (nominee_rank 1)

  • n_artists: proxy for number of credited artists

  • popularity: proxy score (for a 3D plot)

  • era: 1990–1999 vs 2000–2025

year  nominee_rank  is_winner  era        n_artists  popularity
1990  1             TRUE       1990–1999  3          77
1991  1             TRUE       1990–1999  1          79
1992  1             TRUE       1990–1999  1          67
1993  1             TRUE       1990–1999  1          55
1994  1             TRUE       1990–1999  1          62
1995  1             TRUE       1990–1999  1          52
1996  1             TRUE       1990–1999  2          69
1997  1             TRUE       1990–1999  1          63
1998  1             TRUE       1990–1999  2          41
1999  1             TRUE       1990–1999  3          53
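
To follow along, the ten winner rows printed above can be typed in by hand. This is a minimal sketch: the name `winners_90s` is hypothetical, and the data frame holds only the rows shown on this slide, not the full illustrative dataset.

```r
# Hypothetical object name; contains only the ten rows printed on the slide.
winners_90s <- data.frame(
  year         = 1990:1999,
  nominee_rank = 1L,                # every row shown is a winner (rank 1)
  is_winner    = TRUE,
  era          = "1990\u20131999",  # "\u2013" is the en dash used in the era label
  n_artists    = c(3, 1, 1, 1, 1, 1, 2, 1, 2, 3),
  popularity   = c(77, 79, 67, 55, 62, 52, 69, 63, 41, 53)
)

# Inspect column types and the first few values
str(winners_90s)
```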

Example: point estimation (winners only)

Point estimate for each era mean:

  • \(\bar{Y}_{old}\) = average credited artists for winners (1990–1999)

  • \(\bar{Y}_{new}\) = average credited artists for winners (2000–2025)

era        n_years  mean_artists  sd_artists
1990–1999  10       1.600         0.843
2000–2025  26       2.115         0.588

Interpretation: These are single-number summaries, but they do not show uncertainty.
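
As a check on the 1990–1999 row, the point estimates can be recomputed from the ten winner values listed in the data preview. The sketch below uses base R only; the variable names are illustrative, and the standard error is added here as one simple way to attach uncertainty to the mean.

```r
# n_artists for the ten 1990-1999 winners shown in the data preview
n_artists_old <- c(3, 1, 1, 1, 1, 1, 2, 1, 2, 3)

mean_old <- mean(n_artists_old)                     # point estimate of the era mean
sd_old   <- sd(n_artists_old)                       # sample standard deviation
se_old   <- sd_old / sqrt(length(n_artists_old))    # standard error of the mean

round(c(mean = mean_old, sd = sd_old, se = se_old), 3)
```

The mean and SD should agree with the 1990–1999 row of the summary table.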

ggplot 1: Winners over time

ggplot 2: Winners by era distribution

Linear regression model

Let \(Y_i\) be the proxy credited-artist count for the winning record in year \(x_i\).

\[ Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad \varepsilon_i \sim N(0,\sigma^2) \]

Interpretation:

  • \(\beta_1\) is the average change in credited artist count per year.

  • If \(\beta_1 > 0\), collaborations increase over time on average.
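
To make the slope concrete, the sketch below computes the least-squares estimates by hand and compares them with `lm()`. It uses only the ten 1990–1999 winner rows from the data preview, so the numbers will differ from a fit on the full dataset; the variable names are illustrative.

```r
# Ten winner rows from the data preview (1990-1999 only)
year      <- 1990:1999
n_artists <- c(3, 1, 1, 1, 1, 1, 2, 1, 2, 3)

# Least-squares slope: sample covariance of (x, y) divided by the variance of x
b1_hand <- cov(year, n_artists) / var(year)
# Least-squares intercept: the fitted line passes through (mean(x), mean(y))
b0_hand <- mean(n_artists) - b1_hand * mean(year)

# The same estimates from lm()
fit <- lm(n_artists ~ year)

# Compare the hand-computed and lm() coefficients
all.equal(unname(coef(fit)), c(b0_hand, b1_hand))
```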

Regression result with confidence interval

We test:

\[ H_0:\ \beta_1 = 0 \quad \text{vs} \quad H_A:\ \beta_1 \ne 0 \]

term         estimate  std.error  statistic  p.value  conf.low  conf.high
(Intercept)  -24.6394  22.2936    -1.1052    0.2768   -69.9455  20.6667
year         0.0133    0.0111     1.1937     0.2409   -0.0093   0.0358

  • The row for year contains the slope estimate \(\widehat{\beta}_1\)

  • conf.low / conf.high is a 95% CI for \(\beta_1\)

  • p.value is the regression p-value for \(H_0:\beta_1=0\)

  • Here the 95% CI (-0.0093, 0.0358) contains 0 and p = 0.2409, so this fit does not show a statistically significant linear trend at the 5% level.
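
The interval and p-value in the year row can be reproduced from the reported estimate, standard error, and t statistic. This sketch assumes 34 residual degrees of freedom (the 10 + 26 winner-years in the era summary, minus two fitted coefficients). Because the inputs are rounded to four decimals, the results match the table only to about three decimals.

```r
# Values copied from the "year" row of the regression table
estimate  <- 0.0133
std_error <- 0.0111
t_stat    <- 1.1937

# Assumption: 36 winner-years (10 + 26) minus 2 coefficients
df_resid <- 36 - 2

# 95% CI: estimate +/- t critical value * standard error
t_crit <- qt(0.975, df_resid)
ci     <- estimate + c(-1, 1) * t_crit * std_error

# Two-sided p-value for H0: beta_1 = 0
p_value <- 2 * pt(-abs(t_stat), df_resid)

round(c(conf.low = ci[1], conf.high = ci[2], p.value = p_value), 4)
```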

Difference in means (era comparison)

Compare the mean collaboration proxy for winners:

\[ H_0:\ \mu_{\text{new}} - \mu_{\text{old}} = 0 \quad \text{vs} \quad H_A:\ \mu_{\text{new}} - \mu_{\text{old}} \ne 0 \]

A 95% confidence interval for the difference has the form:

\[ \widehat{\Delta} \pm t_{0.975,\nu}\, SE(\widehat{\Delta}) \]
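
For the Welch two-sample t-test used in this deck's era comparison, the pieces of that interval take the unequal-variance form. The symbols \(s\) and \(n\) (each era's sample SD and number of winner-years) are notation introduced here for illustration:

\[ \widehat{\Delta} = \bar{Y}_{new} - \bar{Y}_{old}, \qquad SE(\widehat{\Delta}) = \sqrt{\frac{s_{new}^2}{n_{new}} + \frac{s_{old}^2}{n_{old}}} \]

\[ \nu \approx \frac{\left(\dfrac{s_{new}^2}{n_{new}} + \dfrac{s_{old}^2}{n_{old}}\right)^2}{\dfrac{\left(s_{new}^2/n_{new}\right)^2}{n_{new}-1} + \dfrac{\left(s_{old}^2/n_{old}\right)^2}{n_{old}-1}} \]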

Example of era comparison output (t-test)

estimate  estimate1  estimate2  statistic  p.value  parameter  conf.low  conf.high  method                   alternative
-0.5154   1.6        2.1154     -1.7738    0.1004   12.5274    -1.1455   0.1147     Welch Two Sample t-test  two.sided

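Interpretation:

  • A small p-value is evidence that the era means differ; here p = 0.1004, so the difference is not significant at the 5% level.

  • A 95% CI that excludes 0 indicates a difference at the 5% level; here the interval (-1.1455, 0.1147) includes 0.

  • Sign convention: the reported estimate is estimate1 - estimate2 = 1.6 - 2.1154, i.e. old minus new, so it estimates \(-(\mu_{\text{new}} - \mu_{\text{old}})\).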
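
The t-test output can be approximately reproduced from the era summary table alone (means, SDs, and numbers of years). The sketch below applies the Welch formulas in base R; the variable names are illustrative. Because the inputs are rounded to three decimals, the results agree with the output only to about two decimals.

```r
# Summary statistics copied from the era table (rounded values)
mean_old <- 1.600; sd_old <- 0.843; n_old <- 10
mean_new <- 2.115; sd_new <- 0.588; n_new <- 26

# Per-group variance of the sample mean
v_old <- sd_old^2 / n_old
v_new <- sd_new^2 / n_new

# Difference in the same order as the output (estimate1 - estimate2 = old - new)
diff_hat <- mean_old - mean_new
se_diff  <- sqrt(v_old + v_new)
t_stat   <- diff_hat / se_diff

# Welch-Satterthwaite degrees of freedom
df_welch <- (v_old + v_new)^2 /
  (v_old^2 / (n_old - 1) + v_new^2 / (n_new - 1))

# 95% CI and two-sided p-value
ci      <- diff_hat + c(-1, 1) * qt(0.975, df_welch) * se_diff
p_value <- 2 * pt(-abs(t_stat), df_welch)

round(c(statistic = t_stat, parameter = df_welch,
        conf.low = ci[1], conf.high = ci[2], p.value = p_value), 4)
```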

R code

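library(dplyr)
library(ggplot2)

# `grammy` is the illustrative data frame described under "Data we will use"
winners <- grammy %>% filter(is_winner)

# Fit regression model: credited-artist proxy as a linear function of year
lm_fit <- lm(n_artists ~ year, data = winners)
summary(lm_fit)$coefficients   # estimates, standard errors, t statistics, p-values
confint(lm_fit, level = 0.95)  # 95% CIs for the intercept and slope

# Plot the trend with a fitted line and its 95% confidence band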
ggplot(winners, aes(x = year, y = n_artists)) +
  geom_point(alpha = 0.7) +
  geom_smooth(method = "lm", se = TRUE) +
  labs(
    title = "Winners: credited artist count (proxy) over time",
    x = "Year",
    y = "Credited artists (proxy)"
  )

Plotly: year vs collaboration proxy vs popularity proxy

Summary

Takeaways:

  • Point estimates summarize typical values; CIs show uncertainty.

  • Regression slope estimates a trend and includes a p-value + CI.

  • A two-sample t-test compares eras and provides inference on the difference in means.