05 EDA

Author

Fu Wei Hsu

Executive Summary

Purpose:

To examine the evolution of popular music trends over the past 60 years by utilizing exploratory data analysis (EDA) and systematic data wrangling to address three core research questions.

Research Questions:

  • RQ1: How has the genre distribution of Billboard Hot 100 songs changed across decades?

  • RQ2: How have the emotional dimensions of popular music evolved across six decades, and what cultural and technological shifts might explain these changes?

  • RQ3: Does the rise of speechiness reflect the mainstreaming of Hip-Hop?

Data Source: D1_D2_D3_joined.rds, the output generated by the 04_Data_Joining.qmd data processing pipeline.


Setup

# Setup
library(tidyverse)
library(janitor)
library(gt)

# Load the joined dataset
D1_D2_D3_joined <- readRDS("../Data/D1_D2_D3_joined.rds")

cat("Total observations:", nrow(D1_D2_D3_joined), "\n\n")
Total observations: 330087 
# Create the main analysis dataset (1960-2019)
analysis_base <- D1_D2_D3_joined %>%
  mutate(year = as.integer(format(date, "%Y"))) %>%
  filter(year >= 1960, year <= 2019) %>%
  filter(!is.na(target)) %>%
  mutate(decade = paste0(floor(year / 10) * 10, "s"))

cat("Analysis dataset (1960-2019, D2 matched):\n")
Analysis dataset (1960-2019, D2 matched):
cat("Total observations:", nrow(analysis_base), "\n")
Total observations: 262737 
cat("Year range:", min(analysis_base$year), "–", max(analysis_base$year), "\n")
Year range: 1960 – 2019 
cat("Decade distribution:\n")
Decade distribution:
analysis_base %>%
  count(decade) %>%
  gt() %>%
  tab_header(title = "Observations by Decade")
Observations by Decade
decade n
1960s 36826
1970s 40933
1980s 44945
1990s 42576
2000s 48301
2010s 49156
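
The decade label above comes from flooring the year to the nearest multiple of ten. As a quick sanity check, the same expression can be run on a few boundary years. This is a minimal sketch: the `toy` tibble and its years are hypothetical values chosen for illustration, not rows from the joined dataset.

```r
library(dplyr)
library(tibble)

# Hypothetical boundary years (not taken from the dataset)
toy <- tibble(year = c(1960L, 1969L, 1970L, 1999L, 2000L, 2019L)) %>%
  # Same binning expression as in analysis_base
  mutate(decade = paste0(floor(year / 10) * 10, "s"))

print(toy)
```

The first and last year of each decade should land in the same bin, and the range filter (1960 to 2019) should yield exactly six decade labels.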

RQ1: How has the genre distribution of Billboard Hot 100 songs changed across decades?

# Subset for RQ1: Records with genre data
genre_base <- analysis_base %>%
  filter(!is.na(genre))

cat("Genre analysis subset count:", nrow(genre_base), "\n")
Genre analysis subset count: 44364 
cat("Decade coverage:\n")
Decade coverage:
genre_base %>% count(decade)
# A tibble: 6 × 2
  decade     n
  <chr>  <int>
1 1960s   5202
2 1970s   6147
3 1980s   8570
4 1990s   6356
5 2000s   9074
6 2010s   9015
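
Because only part of the analysis dataset carries a genre label, it may be worth checking whether that share is similar in every decade before comparing decades. The sketch below recomputes per-decade coverage from the two count tables printed above. The `coverage` tibble and its column names are introduced here for illustration only; in the pipeline itself the same figures could presumably be derived by joining `count(decade)` on `genre_base` and `analysis_base`.

```r
library(dplyr)
library(tibble)

# Counts copied from the decade tables printed above
coverage <- tibble(
  decade     = c("1960s", "1970s", "1980s", "1990s", "2000s", "2010s"),
  n_analysis = c(36826, 40933, 44945, 42576, 48301, 49156),
  n_genre    = c(5202, 6147, 8570, 6356, 9074, 9015)
) %>%
  # Share of each decade's observations that have a genre label
  mutate(coverage_pct = round(n_genre / n_analysis * 100, 2))

print(coverage)

# Overall coverage across all decades
overall_pct <- round(sum(coverage$n_genre) / sum(coverage$n_analysis) * 100, 2)
print(overall_pct)
```

If coverage differs noticeably between decades, the percentage comparisons that follow describe the genre-labeled subset rather than the full chart.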

Genre Distribution

cat("Research Question: How has the genre distribution of\n")
Research Question: How has the genre distribution of
cat("Billboard Hot 100 songs changed across decades?\n\n")
Billboard Hot 100 songs changed across decades?
cat("Coverage:", round(nrow(genre_base) / nrow(analysis_base) * 100, 2), "% of analysis_base\n\n")
Coverage: 16.89 % of analysis_base
cat("Genre categories:\n")
Genre categories:
genre_base %>%
  count(genre, sort = TRUE) %>%
  gt() %>%
  tab_header(title = "Genre Distribution Overview") %>%
  fmt_number(columns = n, decimals = 0, use_seps = TRUE)
Genre Distribution Overview
genre n
pop 27,700
rock 7,528
country 5,323
blues 1,644
jazz 1,115
hip hop 895
reggae 159

Genre Distribution by Decade (%)

# Calculate genre percentage by decade
genre_decade <- genre_base %>%
  count(decade, genre) %>%
  group_by(decade) %>%
  mutate(pct = round(n / sum(n) * 100, 2)) %>%
  ungroup()

# Table
genre_decade %>%
  select(decade, genre, pct) %>%
  pivot_wider(names_from = decade, values_from = pct, values_fill = 0) %>%
  gt() %>%
  tab_header(title = "Genre Distribution (%) by Decade") %>%
  fmt_number(columns = -genre, decimals = 1)
Genre Distribution (%) by Decade
genre 1960s 1970s 1980s 1990s 2000s 2010s
blues 15.1 7.9 2.7 1.1 0.6 0.2
country 8.1 15.1 8.5 6.8 15.7 15.4
jazz 6.4 8.2 2.7 0.7 0.0 0.0
pop 61.4 42.2 51.0 64.6 70.1 78.4
rock 9.1 26.6 34.3 18.0 10.5 4.3
hip hop 0.0 0.0 0.3 7.6 3.1 1.1
reggae 0.0 0.0 0.5 1.2 0.0 0.4

Stacked Bar Chart: Genre Distribution by Decade

# Stacked Bar Chart
genre_decade %>%
  ggplot(aes(x = decade, y = pct, fill = genre)) +
  geom_col(position = "stack") +
  scale_fill_brewer(palette = "Set2") +
  labs(
    title    = "Genre Distribution of Billboard Hot 100 by Decade",
    subtitle = "Based on D1–D3 matched subset (1960s–2010s)",
    x        = "Decade",
    y        = "Percentage (%)",
    fill     = "Genre",
    caption  = "Source: Billboard Hot 100 × Music Dataset 1950–2019"
  ) +
  theme_minimal() +
  theme(
    plot.title    = element_text(face = "bold", size = 14),
    plot.subtitle = element_text(size = 10, color = "gray40"),
    legend.position = "right"
  )

Summary of Observations

Analysis Summary

  • Pop dominance: Pop has been the largest genre in every decade. Its share dipped from 61.4% in the 1960s to 42.2% in the 1970s, then rose in each subsequent decade to reach 78.4% in the 2010s, more than three quarters of the genre-labeled chart entries.

  • Rock trajectory: Rock rose from 9.1% in the 1960s to 26.6% in the 1970s and peaked at 34.3% in the 1980s, then declined in every subsequent decade to 4.3% in the 2010s, consistent with its retreat from the mainstream market.

  • Blues and Jazz: Both genres have virtually disappeared from the chart. Blues fell from 15.1% in the 1960s to 0.2% in the 2010s. Jazz held 6.4% in the 1960s and 8.2% in the 1970s, then dropped to 0.0% from the 2000s onward.

  • Hip-Hop classification: Hip-Hop's share is small throughout, peaking at only 7.6% in the 1990s and falling to 1.1% in the 2010s. This likely reflects the genre categorization in the D3 dataset, where many Hip-Hop tracks may have been classified as Pop, rather than the genre's actual chart presence. RQ3 addresses this limitation by using “speechiness” as a supplementary metric.
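
The observations above can be cross-checked against the percentage table directly. The following sketch re-enters the printed values by hand and computes, for each genre, the decade with the highest share and the change in percentage points between the 1960s and the 2010s. The names `genre_pct` and `trend` are introduced here for illustration and are not part of the original pipeline.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Values copied from the "Genre Distribution (%) by Decade" table
genre_pct <- tribble(
  ~genre,    ~`1960s`, ~`1970s`, ~`1980s`, ~`1990s`, ~`2000s`, ~`2010s`,
  "blues",       15.1,      7.9,      2.7,      1.1,      0.6,      0.2,
  "country",      8.1,     15.1,      8.5,      6.8,     15.7,     15.4,
  "jazz",         6.4,      8.2,      2.7,      0.7,      0.0,      0.0,
  "pop",         61.4,     42.2,     51.0,     64.6,     70.1,     78.4,
  "rock",         9.1,     26.6,     34.3,     18.0,     10.5,      4.3,
  "hip hop",      0.0,      0.0,      0.3,      7.6,      3.1,      1.1,
  "reggae",       0.0,      0.0,      0.5,      1.2,      0.0,      0.4
)

trend <- genre_pct %>%
  pivot_longer(-genre, names_to = "decade", values_to = "pct") %>%
  group_by(genre) %>%
  summarise(
    peak_decade = decade[which.max(pct)],   # first decade with the highest share
    peak_pct    = max(pct),
    change_pp   = pct[decade == "2010s"] - pct[decade == "1960s"],
    .groups     = "drop"
  ) %>%
  arrange(desc(change_pp))

print(trend)
```

Because the inputs are already rounded to one decimal place, the differences are approximate to about 0.1 percentage points.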