---
title: "05 EDA"
author: "Fu Wei Hsu"
format:
  html:
    theme: cosmo
    toc: true
    toc-location: right
    toc-title: "On this page"
    code-tools: true
    embed-resources: true
execute:
  warning: false
  message: false
editor: visual
---
# Executive Summary

**Purpose:**
To examine how popular music evolved from 1960 to 2019, using exploratory data analysis (EDA) and systematic data wrangling to address three core research questions.

**Research Questions:**

- RQ1: How has the genre distribution of Billboard Hot 100 songs changed across decades?
- RQ2: How have the emotional dimensions of popular music evolved across six decades, and what cultural and technological shifts might explain these changes?
- RQ3: Does the rise of speechiness in popular music reflect the mainstreaming of Hip-Hop?

**Data Source:** `D1_D2_D3_joined.rds`, the output generated by the `04_Data_Joining.qmd` data processing pipeline.

------------------------------------------------------------------------

# Setup
```{r}
# Setup
library(tidyverse)
library(janitor)
library(gt)
# Load the joined dataset
D1_D2_D3_joined <- readRDS("../Data/D1_D2_D3_joined.rds")
cat("Total observations:", nrow(D1_D2_D3_joined), "\n\n")
# Create the main analysis dataset (1960-2019)
analysis_base <- D1_D2_D3_joined %>%
mutate(year = as.integer(format(date, "%Y"))) %>%
filter(year >= 1960, year <= 2019) %>%
filter(!is.na(target)) %>%
mutate(decade = paste0(floor(year / 10) * 10, "s"))
cat("Analysis dataset (1960-2019, D2 matched):\n")
cat("Total observations:", nrow(analysis_base), "\n")
cat("Year range:", min(analysis_base$year), "–", max(analysis_base$year), "\n")
cat("Decade distribution:\n")
analysis_base %>%
count(decade) %>%
gt() %>%
tab_header(title = "Observations by Decade")
```
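
The `decade` label in the setup chunk is built by flooring each year to the nearest multiple of ten. As a quick illustration of that rule, the sketch below applies the same expression to a handful of hypothetical years (these values are made up for the example and are not drawn from the dataset).

```{r}
# Illustrative only: hypothetical years, not taken from D1_D2_D3_joined
example_years <- c(1960L, 1969L, 1970L, 1999L, 2019L)

# Same bucketing expression as in the setup chunk
example_decades <- paste0(floor(example_years / 10) * 10, "s")
example_decades
```

Boundary years such as 1969 and 1970 land in different decades, which is the behavior the later decade-level tables rely on.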
# RQ1: How has the genre distribution of Billboard Hot 100 songs changed across decades?
```{r}
# Subset for RQ1: Records with genre data
genre_base <- analysis_base %>%
filter(!is.na(genre))
cat("Genre analysis subset count:", nrow(genre_base), "\n")
cat("Decade coverage:\n")
genre_base %>% count(decade)
```
## Genre Distribution
```{r}
# Coverage of the genre subset (genre_base is created in the previous chunk)
cat(sprintf(
  "Coverage: %.2f%% of analysis_base\n\n",
  nrow(genre_base) / nrow(analysis_base) * 100
))
cat("Genre categories:\n")
genre_base %>%
count(genre, sort = TRUE) %>%
gt() %>%
tab_header(title = "Genre Distribution Overview") %>%
fmt_number(columns = n, decimals = 0, use_seps = TRUE)
```
## Genre Distribution by Decade (%)
```{r}
# Calculate genre percentage by decade
genre_decade <- genre_base %>%
count(decade, genre) %>%
group_by(decade) %>%
mutate(pct = round(n / sum(n) * 100, 2)) %>%
ungroup()
# Table
genre_decade %>%
select(decade, genre, pct) %>%
pivot_wider(names_from = decade, values_from = pct, values_fill = 0) %>%
gt() %>%
tab_header(title = "Genre Distribution (%) by Decade") %>%
fmt_number(columns = -genre, decimals = 1)
```
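
Because each genre's share is computed within its decade, the shares for any one decade should add up to 100. The sketch below checks this on a small toy table; `toy_counts` and its numbers are hypothetical and only mirror the shape of the `count(decade, genre)` output. It uses unrounded shares. With shares rounded to two decimals, as in the chunk above, the totals can differ from 100 by a few hundredths, so a looser tolerance would be needed.

```{r}
library(dplyr)

# Toy counts (made up) shaped like the output of count(decade, genre)
toy_counts <- tibble::tibble(
  decade = c("1960s", "1960s", "1970s", "1970s", "1970s"),
  genre  = c("pop", "rock", "pop", "rock", "blues"),
  n      = c(30, 10, 20, 15, 5)
)

# Share of each genre within its decade (unrounded)
toy_shares <- toy_counts %>%
  group_by(decade) %>%
  mutate(pct = n / sum(n) * 100) %>%
  ungroup()

# Each decade's shares should total 100
toy_totals <- toy_shares %>%
  group_by(decade) %>%
  summarise(total_pct = sum(pct), .groups = "drop")

stopifnot(all(abs(toy_totals$total_pct - 100) < 1e-8))
toy_totals
```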
## Stacked Bar Chart: Genre Distribution by Decade
```{r rq1-genre-plot}
# Stacked Bar Chart
genre_decade %>%
ggplot(aes(x = decade, y = pct, fill = genre)) +
geom_col(position = "stack") +
scale_fill_brewer(palette = "Set2") +
labs(
title = "Genre Distribution of Billboard Hot 100 by Decade",
subtitle = "Based on D1–D3 matched subset (1960s–2010s)",
x = "Decade",
y = "Percentage (%)",
fill = "Genre",
caption = "Source: Billboard Hot 100 × Music Dataset 1950–2019"
) +
theme_minimal() +
theme(
plot.title = element_text(face = "bold", size = 14),
plot.subtitle = element_text(size = 10, color = "gray40"),
legend.position = "right"
)
```
## Line Chart: Genre Trends over Decades
```{r rq1-genre-line}
# Line Chart
# All genres are plotted; the rise and fall of Hip-Hop, Pop, and Rock are the trends of interest
genre_decade %>%
ggplot(aes(x = decade, y = pct, color = genre, group = genre)) +
geom_line(linewidth = 1) +
geom_point(size = 2.5) +
scale_color_brewer(palette = "Set2") +
labs(
title = "Genre Trend on Billboard Hot 100 (1960s–2010s)",
subtitle = "Percentage of chart appearances per decade",
x = "Decade",
y = "Percentage (%)",
color = "Genre",
caption = "Source: Billboard Hot 100 × Music Dataset 1950–2019"
) +
theme_minimal() +
theme(
plot.title = element_text(face = "bold", size = 14),
plot.subtitle = element_text(size = 10, color = "gray40"),
legend.position = "right"
)
```
## Summary of Observations

**Note:** Analysis summary

- **Pop dominance:** Pop remained the dominant genre throughout and climbed steadily from 61.4% in the 1960s to 78.4% in the 2010s, accounting for more than three quarters of chart appearances in the final decade.
- **Rock trajectory:** Rock rose and then fell, peaking in the 1970s–1980s (26–34%) before declining to just 4.3% in the 2010s, in line with its retreat from the mainstream market.
- **Blues and Jazz:** Both genres have virtually disappeared from the chart. They started with 15% and 6% shares in the 1960s, respectively, and approached zero by the 2010s.
- **Hip-Hop classification:** Hip-Hop appears minimal in the data, peaking at only 7.6% in the 1990s. This likely reflects how the D3 dataset assigns genres, since many Hip-Hop tracks may have been labeled as Pop. RQ3 revisits this limitation using "speechiness" as a supplementary metric.

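
When reading the shares above, it helps to keep percentage-point change separate from relative change. The sketch below works through both for Pop, using only the two shares quoted in the summary (61.4% and 78.4%); the variable names are introduced here for illustration.

```{r}
# Pop shares quoted in the summary above (percent of chart appearances)
pop_1960s <- 61.4
pop_2010s <- 78.4

# Absolute change, in percentage points
pp_change <- pop_2010s - pop_1960s

# Relative change, as a percent of the 1960s share
rel_change <- (pop_2010s - pop_1960s) / pop_1960s * 100

round(c(percentage_points = pp_change, relative_percent = rel_change), 1)
```

Pop gained 17.0 percentage points, which corresponds to a relative increase of about 27.7% over its 1960s share.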
# RQ2 - How have the emotional dimensions of popular music evolved across six decades, and what cultural and technological shifts might explain these changes?
## Data preparation
```{r}
# Calculate average valence per year
valence_yearly <- analysis_base %>%
group_by(year) %>%
summarise(
avg_valence = mean(valence, na.rm = TRUE),
n = n()
) %>%
ungroup()
```
### Calculate annual averages for five audio features
```{r}
# Calculate annual averages for five audio features
features_yearly <- analysis_base %>%
group_by(year) %>%
summarise(
valence = mean(valence, na.rm = TRUE),
energy = mean(energy, na.rm = TRUE),
danceability = mean(danceability, na.rm = TRUE),
acousticness = mean(acousticness, na.rm = TRUE),
speechiness = mean(speechiness, na.rm = TRUE)
) %>%
pivot_longer(
cols = c(valence, energy, danceability, acousticness, speechiness),
names_to = "feature",
values_to = "score"
) %>%
mutate(
feature = factor(feature, levels = c(
"valence", "energy", "danceability", "acousticness", "speechiness"
))
)
```
## All five audio features across three phases
```{r, fig.width = 14, fig.height = 7}
ggplot(features_yearly, aes(x = year, y = score, color = feature)) +
# Define three phases by background color
annotate("rect",
xmin = 1960, xmax = 1979,
ymin = -Inf, ymax = Inf,
fill = "#FFD700", alpha = 0.06) +
annotate("rect",
xmin = 1980, xmax = 1999,
ymin = -Inf, ymax = Inf,
fill = "#FF6B6B", alpha = 0.06) +
annotate("rect",
xmin = 2000, xmax = 2019,
ymin = -Inf, ymax = Inf,
fill = "#4A90D9", alpha = 0.06) +
# Define three phases labels
annotate("text",
x = 1969, y = 0.90,
label = "Phase 1\nDisco Era",
size = 5, color = "#B8860B", fontface = "bold") +
annotate("text",
x = 1989, y = 0.90,
label = "Phase 2\nPost-Disco & Rock",
size = 5, color = "#CC0000", fontface = "bold") +
annotate("text",
x = 2009, y = 0.90,
label = "Phase 3\nHip-Hop & Digital",
size = 5, color = "#1A5276", fontface = "bold") +
# Phase divider lines
geom_vline(xintercept = 1980,
linetype = "solid", linewidth = 0.6,
color = "gray50", alpha = 0.5) +
geom_vline(xintercept = 2000,
linetype = "solid", linewidth = 0.6,
color = "gray50", alpha = 0.5) +
# Five trend lines
geom_line(linewidth = 0.8, alpha = 0.85) +
geom_point(size = 1.5, alpha = 0.85) +
# Color settings
scale_color_manual(
values = c(
"valence" = "#E74C3C",
"energy" = "#F39C12",
"danceability" = "#27AE60",
"acousticness" = "#8E44AD",
"speechiness" = "#2980B9"
),
labels = c(
"valence" = "Valence (Positivity)",
"energy" = "Energy",
"danceability" = "Danceability",
"acousticness" = "Acousticness",
"speechiness" = "Speechiness"
)
) +
coord_cartesian(ylim = c(0, 1)) +
labs(
title = "Audio Feature Trends on Billboard Hot 100 (1960–2019)",
subtitle = "Five Spotify audio features across three eras of popular music",
x = "Year",
y = "Average Score (0–1)",
color = "Audio Feature",
caption = "Source: Billboard Hot 100 × Spotify Hit Predictor"
) +
theme_minimal() +
theme(
plot.title = element_text(face = "bold", size = 14),
plot.subtitle = element_text(size = 10, color = "gray40"),
legend.position = "right"
)
```
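
The three shaded phases and the two divider lines recur in most of the plots in this report. One possible way to avoid repeating them is a small helper that returns the layers as a list, since ggplot2 accepts a list of layers on the right-hand side of `+`. The helper name `phase_layers()` and the toy data below are assumptions made for this sketch and are not part of the original analysis.

```{r}
library(ggplot2)

# Hypothetical helper: returns the three phase rectangles and the two
# divider lines as a list of ggplot2 layers
phase_layers <- function(alpha = 0.06) {
  list(
    annotate("rect", xmin = 1960, xmax = 1979, ymin = -Inf, ymax = Inf,
             fill = "#FFD700", alpha = alpha),
    annotate("rect", xmin = 1980, xmax = 1999, ymin = -Inf, ymax = Inf,
             fill = "#FF6B6B", alpha = alpha),
    annotate("rect", xmin = 2000, xmax = 2019, ymin = -Inf, ymax = Inf,
             fill = "#4A90D9", alpha = alpha),
    geom_vline(xintercept = c(1980, 2000), linetype = "solid",
               linewidth = 0.6, color = "gray50", alpha = 0.5)
  )
}

# Toy data for illustration only (made-up scores)
toy_trend <- data.frame(year = c(1960, 1990, 2019), score = c(0.60, 0.55, 0.50))

toy_plot <- ggplot(toy_trend, aes(x = year, y = score)) +
  phase_layers() +
  geom_line() +
  coord_cartesian(ylim = c(0, 1)) +
  theme_minimal()
toy_plot
```

The `alpha` argument covers the two shading strengths used in this report (0.06 and 0.08), so the same helper could serve both the multi-feature and the valence-only plots.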
## Valence across three phases
```{r, fig.width = 14, fig.height = 6}
ggplot(valence_yearly, aes(x = year, y = avg_valence)) +
# Phase 1 background: Disco Era (1960–1979)
annotate("rect",
xmin = 1960, xmax = 1979,
ymin = -Inf, ymax = Inf,
fill = "#FFD700", alpha = 0.08) +
# Phase 2 background: Post-Disco & Rock Era (1980–1999)
annotate("rect",
xmin = 1980, xmax = 1999,
ymin = -Inf, ymax = Inf,
fill = "#FF6B6B", alpha = 0.08) +
# Phase 3 background: Hip-Hop & Digital Era (2000–2019)
annotate("rect",
xmin = 2000, xmax = 2019,
ymin = -Inf, ymax = Inf,
fill = "#4A90D9", alpha = 0.08) +
# Phase labels
annotate("text",
x = 1969, y = 0.2,
label = "Phase 1\nDisco Era",
size = 5, color = "#B8860B", fontface = "bold") +
annotate("text",
x = 1989, y = 0.2,
label = "Phase 2\nPost-Disco & Rock",
size = 5, color = "#CC0000", fontface = "bold") +
annotate("text",
x = 2009, y = 0.2,
label = "Phase 3\nHip-Hop & Digital",
size = 5, color = "#1A5276", fontface = "bold") +
# Trend line
geom_line(color = "steelblue", linewidth = 0.8) +
geom_point(color = "steelblue", size = 1.8) +
# Force Y-axis to 0-1
coord_cartesian(ylim = c(0, 1)) +
# Phase divider lines
geom_vline(xintercept = 1980,
linetype = "solid", linewidth = 0.6,
color = "gray50", alpha = 0.6) +
geom_vline(xintercept = 2000,
linetype = "solid", linewidth = 0.6,
color = "gray50", alpha = 0.6) +
labs(
title = "Average Valence of Billboard Hot 100 Songs (1960–2019)",
subtitle = "Three distinct eras of popular music reflect shifting emotional tones",
x = "Year",
y = "Average Valence (0 = Negative, 1 = Positive)",
caption = "Source: Billboard Hot 100 × Spotify Hit Predictor"
) +
theme_minimal() +
theme(
plot.title = element_text(face = "bold", size = 14),
plot.subtitle = element_text(size = 10, color = "gray40"),
legend.position = "none"
)
```

**Note**

The decline of Disco in 1979 marks the first major drop in valence. A steady fall follows, coinciding with the rise of Hip-Hop, and it accelerates in the 2010s alongside Emo Rap and social media culture. Historical events may have amplified the trend, but genre evolution appears to be the primary driver.

## Analysis of Valence vs. Energy Divergence
```{r, fig.width = 14, fig.height = 6}
# Subset: Valence and Energy only
valence_energy <- features_yearly %>%
filter(feature %in% c("valence", "energy"))
ggplot(valence_energy, aes(x = year, y = score, color = feature)) +
# Three-phase background shading
annotate("rect",
xmin = 1960, xmax = 1979,
ymin = -Inf, ymax = Inf,
fill = "#FFD700", alpha = 0.06) +
annotate("rect",
xmin = 1980, xmax = 1999,
ymin = -Inf, ymax = Inf,
fill = "#FF6B6B", alpha = 0.06) +
annotate("rect",
xmin = 2000, xmax = 2019,
ymin = -Inf, ymax = Inf,
fill = "#4A90D9", alpha = 0.06) +
# Phase separators
geom_vline(xintercept = 1980,
linetype = "solid", linewidth = 0.6,
color = "gray50", alpha = 0.5) +
geom_vline(xintercept = 2000,
linetype = "solid", linewidth = 0.6,
color = "gray50", alpha = 0.5) +
# Annotation for divergence
annotate("text",
x = 1993, y = 0.55,
label = "Valence & Energy\nbegin to diverge",
size = 3, color = "gray30", fontface = "italic") +
annotate("segment",
x = 1993, xend = 1993,
y = 0.58, yend = 0.62,
arrow = arrow(length = unit(0.2, "cm")),
color = "gray30") +
# Trend lines
geom_line(linewidth = 1, alpha = 0.9) +
geom_point(size = 1.8, alpha = 0.9) +
# Colors and Labels
scale_color_manual(
values = c(
"valence" = "#E74C3C",
"energy" = "#F39C12"
),
labels = c(
"valence" = "Valence (Positivity)",
"energy" = "Energy (Intensity)"
)
) +
# End-point annotations
annotate("text",
x = 2020, y = 0.49,
label = "Valence\n0.49",
size = 3, color = "#E74C3C", fontface = "bold") +
annotate("text",
x = 2020, y = 0.73,
label = "Energy\n0.73",
size = 3, color = "#F39C12", fontface = "bold") +
labs(
title = "Valence vs. Energy: The Divergence of Popular Music (1960–2019)",
subtitle = "Music has become darker in mood (↓ Valence) yet more intense in energy (↑ Energy)",
x = "Year",
y = "Average Score (0–1)",
color = "Audio Feature",
caption = "Source: Billboard Hot 100 × Spotify Hit Predictor"
) +
theme_minimal() +
theme(
plot.title = element_text(face = "bold", size = 14),
plot.subtitle = element_text(size = 10, color = "gray40"),
legend.position = "bottom"
)
```
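
The endpoint labels in the plot above are typed in by hand (0.49 and 0.73), so they would not update if the underlying data changed. A hedged alternative is to derive them from the data. The sketch below does this on a toy table shaped like `valence_energy`; the scores are made up and `toy_endpoints` is a name introduced for the example.

```{r}
library(dplyr)

# Toy long-format data standing in for `valence_energy` (values are made up)
toy_features <- tibble::tibble(
  year    = c(2018, 2019, 2018, 2019),
  feature = c("valence", "valence", "energy", "energy"),
  score   = c(0.51, 0.49, 0.72, 0.73)
)

# Keep the last year for each feature and build a two-line label
toy_endpoints <- toy_features %>%
  filter(year == max(year)) %>%
  mutate(label = sprintf("%s\n%.2f", feature, score))
toy_endpoints
```

A table like this could then replace the two hard-coded `annotate("text", ...)` calls with a single `geom_text(data = toy_endpoints, aes(x = year + 1, label = label), show.legend = FALSE)` layer.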
### Find low valence songs
```{r}
low_valence_songs <- analysis_base %>%
filter(year %in% c(2016, 2018)) %>%
distinct(song_clean, artist_clean, .keep_all = TRUE) %>%
arrange(valence) %>%
select(year, song_clean, artist_clean, valence, energy)
low_valence_songs %>%
head(10) %>%
gt() %>%
tab_header(title = "10 Lowest-Valence Songs (2016 and 2018)") %>%
cols_label(
song_clean = "Song",
artist_clean = "Artist"
)
```

**Note**

- 2016: Drake "9", Drake "Summer Sixteen", Beyoncé "Forward"
- 2018: Trippie Redd "Topanga", Travis Scott "Stargazing", Taylor Swift "Delicate"

The two most notable valence drops in Phase 3 occurred around 2016 and 2018. They coincide with the peak influence of Emo Rap and Trap artists such as Drake, Travis Scott, and Trippie Redd, whose dark, introspective sound likely contributed to the historic lows in musical positivity.

## Figure A: Sociopolitical Context
```{r, fig.width = 14, fig.height = 6}
library(tidyverse)
# Define Historical Events
events_ve <- tribble(
~year, ~label, ~color, ~y_pos,
1964, "Vietnam War\nEscalates", "red", 0.730,
1975, "Vietnam War\nEnds", "red", 0.695,
1981, "AIDS\nCrisis", "purple", 0.730,
1991, "Gulf War", "brown", 0.695,
2001, "9/11\nAttacks", "darkorange", 0.730,
2006, "Facebook\nOpens to Public", "royalblue", 0.695,
2008, "Financial\nCrisis", "darkblue", 0.730,
2008, "Facebook\n100M Users", "royalblue", 0.660,
2010, "Instagram\nLaunch", "deeppink", 0.695,
2013, "BLM\nMovement", "darkgreen", 0.730,
2016, "U.S.\nElection", "darkred", 0.660,
2017, "#MeToo\nMovement", "hotpink", 0.695
)
# Plotting
ggplot(valence_energy, aes(x = year, y = score, color = feature)) +
# Phase Background Shading
annotate("rect", xmin = 1960, xmax = 1979, ymin = -Inf, ymax = Inf, fill = "#FFD700", alpha = 0.06) +
annotate("rect", xmin = 1980, xmax = 1999, ymin = -Inf, ymax = Inf, fill = "#FF6B6B", alpha = 0.06) +
annotate("rect", xmin = 2000, xmax = 2019, ymin = -Inf, ymax = Inf, fill = "#4A90D9", alpha = 0.06) +
# Event Background Highlights
annotate("rect", xmin = 2006, xmax = 2008, ymin = -Inf, ymax = Inf, fill = "royalblue", alpha = 0.07) +
annotate("rect", xmin = 2001, xmax = 2003, ymin = -Inf, ymax = Inf, fill = "orange", alpha = 0.08) +
annotate("rect", xmin = 1964, xmax = 1975, ymin = -Inf, ymax = Inf, fill = "red", alpha = 0.06) +
annotate("rect", xmin = 2016, xmax = 2019, ymin = -Inf, ymax = Inf, fill = "darkred", alpha = 0.05) +
# Vertical Lines for Historical Events
geom_vline(data = events_ve, aes(xintercept = year),
color = "gray60", linetype = "dashed", linewidth = 0.4, alpha = 0.6) +
# Event Labels
geom_text(data = events_ve, aes(x = year, y = y_pos, label = label),
color = "gray35", size = 2.3, fontface = "bold", hjust = -0.1) +
# Phase Separators
geom_vline(xintercept = 1980, linetype = "solid", linewidth = 0.6, color = "gray50", alpha = 0.5) +
geom_vline(xintercept = 2000, linetype = "solid", linewidth = 0.6, color = "gray50", alpha = 0.5) +
# Trend Lines
geom_line(linewidth = 1.1, alpha = 0.95) +
geom_point(size = 1.8, alpha = 0.95) +
# Divergence Annotation
annotate("text", x = 1993, y = 0.63, label = "Valence & Energy\nbegin to diverge",
size = 2.8, color = "gray30", fontface = "italic") +
annotate("segment", x = 1993, xend = 1993, y = 0.615, yend = 0.585,
arrow = arrow(length = unit(0.2, "cm")), color = "gray30") +
# Custom Colors
scale_color_manual(
values = c("valence" = "#E74C3C", "energy" = "#F39C12"),
labels = c("valence" = "Valence (Positivity ↓)", "energy" = "Energy (Intensity ↑)")
) +
# Endpoint Value Labels
annotate("text", x = 2020, y = 0.49, label = "Valence\n0.49", size = 3, color = "#E74C3C", fontface = "bold") +
annotate("text", x = 2020, y = 0.73, label = "Energy\n0.73", size = 3, color = "#F39C12", fontface = "bold") +
# Labs and Theme
labs(
title = "Valence vs Energy: The Divergence of Popular Music (1960–2019)",
subtitle = "Music became darker in mood (↓ Valence) yet more intense in energy (↑ Energy) | Key U.S. events marked",
x = "Year",
y = "Average Score (0–1)",
color = "Audio Feature",
caption = "Source: Billboard Hot 100 × Spotify Hit Predictor"
) +
theme_minimal() +
theme(
plot.title = element_text(face = "bold", size = 14),
plot.subtitle = element_text(size = 9, color = "gray40"),
legend.position = "bottom"
)
```
## Figure B: Music Technology Evolution
```{r, fig.width = 14, fig.height = 6}
# ── Define Music Technology Events ──────────────────────
events_tech <- tribble(
~year, ~label, ~y_pos,
1982, "CD\nLaunches", 0.730,
1999, "Napster\nLaunches", 0.695,
2001, "iPod\nLaunches", 0.730,
2003, "iTunes\nStore Opens", 0.695,
2008, "Spotify\nLaunches", 0.730,
2011, "Spotify\nEnters US", 0.695,
2015, "Streaming\nDominates Market", 0.730
)
ggplot(valence_energy, aes(x = year, y = score, color = feature)) +
# Three-phase background color blocks
annotate("rect", xmin = 1960, xmax = 1979, ymin = -Inf, ymax = Inf, fill = "#FFD700", alpha = 0.06) +
annotate("rect", xmin = 1980, xmax = 1999, ymin = -Inf, ymax = Inf, fill = "#FF6B6B", alpha = 0.06) +
annotate("rect", xmin = 2000, xmax = 2019, ymin = -Inf, ymax = Inf, fill = "#4A90D9", alpha = 0.06) +
# Digital Era background (1999–2015)
annotate("rect", xmin = 1999, xmax = 2015, ymin = -Inf, ymax = Inf, fill = "#2ECC71", alpha = 0.05) +
# Streaming Era background (2015–2019)
annotate("rect", xmin = 2015, xmax = 2019, ymin = -Inf, ymax = Inf, fill = "#1DB954", alpha = 0.08) +
# Tech event vertical lines
geom_vline(data = events_tech, aes(xintercept = year), color = "gray50", linetype = "dashed", linewidth = 0.4, alpha = 0.7) +
# Tech event text labels
geom_text(data = events_tech, aes(x = year, y = y_pos, label = label), color = "gray30", size = 2.3, fontface = "bold", hjust = -0.1) +
# Phase separators
geom_vline(xintercept = 1980, linetype = "solid", linewidth = 0.6, color = "gray50", alpha = 0.5) +
geom_vline(xintercept = 2000, linetype = "solid", linewidth = 0.6, color = "gray50", alpha = 0.5) +
# Main trend lines
geom_line(linewidth = 1.1, alpha = 0.95) +
geom_point(size = 1.8, alpha = 0.95) +
# Divergence annotation
annotate("text", x = 1993, y = 0.63, label = "Valence & Energy\nbegin to diverge", size = 2.8, color = "gray30", fontface = "italic") +
annotate("segment", x = 1993, xend = 1993, y = 0.615, yend = 0.585, arrow = arrow(length = unit(0.2, "cm")), color = "gray30") +
# Colors and Legend Labels
scale_color_manual(
values = c("valence" = "#E74C3C", "energy" = "#F39C12"),
labels = c("valence" = "Valence (Positivity ↓)", "energy" = "Energy (Intensity ↑)")
) +
# Endpoint score labels
annotate("text", x = 2020, y = 0.49, label = "Valence\n0.49", size = 3, color = "#E74C3C", fontface = "bold") +
annotate("text", x = 2020, y = 0.73, label = "Energy\n0.73", size = 3, color = "#F39C12", fontface = "bold") +
# Era-specific labels
annotate("text", x = 2007, y = 0.455, label = "Digital Music Era", size = 3, color = "#27AE60", fontface = "bold") +
annotate("text", x = 2017, y = 0.455, label = "Streaming Era", size = 2.8, color = "#1DB954", fontface = "bold") +
# Labs and Formatting
labs(
title = "Valence vs Energy: The Impact of Music Consumption Technology (1960–2019)",
subtitle = "From vinyl to streaming — how technology reshaped the emotional landscape of popular music",
x = "Year",
y = "Average Score (0–1)",
color = "Audio Feature",
caption = "Source: Billboard Hot 100 × Spotify Hit Predictor"
) +
theme_minimal() +
theme(
plot.title = element_text(face = "bold", size = 14),
plot.subtitle = element_text(size = 9, color = "gray40"),
legend.position = "bottom"
)
```
## Comparative Table of Audio Features by Decade
```{r}
analysis_base %>%
group_by(decade) %>%
summarise(
Valence = round(mean(valence, na.rm = TRUE), 3),
Energy = round(mean(energy, na.rm = TRUE), 3),
Danceability = round(mean(danceability, na.rm = TRUE), 3),
Acousticness = round(mean(acousticness, na.rm = TRUE), 3),
Speechiness = round(mean(speechiness, na.rm = TRUE), 3),
n = n()
) %>%
gt() %>%
tab_header(
title = "Summary of Average Audio Features by Decade",
subtitle = "Aggregated trends from Billboard Hot 100 (1960s–2010s)"
) %>%
cols_label(
decade = "Decade",
Valence = "Valence",
Energy = "Energy",
Danceability = "Danceability",
Acousticness = "Acousticness",
Speechiness = "Speechiness",
n = "Sample Size (n)"
) %>%
fmt_number(columns = n, decimals = 0, use_seps = TRUE) %>%
# Data coloration to highlight trends
data_color(
columns = c(Valence, Energy, Danceability, Acousticness, Speechiness),
palette = "RdYlGn",
alpha = 0.7
) %>%
tab_source_note(
source_note = "Source: Billboard Hot 100 × Spotify Dataset"
)
```

**Key Findings: A Three-Tiered Evolution**

**1. The Darkening of Sound (Valence ↓)**

- **Trend:** Musical "positivity" has been in steady decline.
- **Data insight:** Average valence dropped from \~0.670 in the 1960s to \~0.502 in the 2010s (a 25% decrease).
- **Reflection:** Modern hits have moved away from the collective optimism of the past toward more somber and complex emotional landscapes.

**2. The Surge in Intensity (Energy ↑)**

- **Trend:** While music got "sadder," it also got "stronger."
- **Data insight:** Energy has risen consistently, diverging markedly from valence.
- **Reflection:** Modern music is not "sad and weak"; low valence combined with high energy reads more as "angry and powerful." High-intensity production styles such as Trap and EDM may provide a sonic outlet for frustration.

**3. Structural Transformation (Danceability ↑, Speechiness ↑, Acousticness ↓)**

- **Trend:** A shift from acoustic instruments to rhythmic, vocal-heavy digital production.
- **Data insight:**
    - **Danceability** increased, in step with the rise of club culture and electronic beats.
    - **Speechiness** surged, consistent with the mainstream rise of Hip-Hop (examined in RQ3).
    - **Acousticness** plummeted as synthesizers and 808 drums replaced traditional instrumentation.

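
One way to put a number on the divergence described in the second finding is to compute the yearly gap between energy and valence. The sketch below reshapes a toy long-format table, shaped like `features_yearly`, into one row per year and subtracts the two columns. All scores here are made up for illustration, and `toy_gap` is a name introduced for the example.

```{r}
library(dplyr)
library(tidyr)

# Toy long-format data (made-up scores) shaped like `features_yearly`
toy_long <- tibble::tibble(
  year    = c(1960, 1960, 2019, 2019),
  feature = c("valence", "energy", "valence", "energy"),
  score   = c(0.65, 0.45, 0.50, 0.70)
)

# One row per year, one column per feature, plus the energy-minus-valence gap
toy_gap <- toy_long %>%
  pivot_wider(names_from = feature, values_from = score) %>%
  mutate(gap = energy - valence) %>%
  arrange(year)
toy_gap
```

A gap that moves from negative to positive over time would indicate that energy has overtaken valence, which is the pattern the divergence plots suggest.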

**Final Conclusion**

Modern popular music is not simply "sadder": it is arguably angrier, more energetic, and more danceable, yet emotionally darker. This shift suggests a cultural transition from collective joy toward individual expression of frustration, anxiety, and resilience, with popular music serving as a high-intensity vehicle for emotional catharsis in an increasingly complex world.

# RQ3 - Does the rise of speechiness in popular music reflect the mainstreaming of Hip-Hop?
```{r}
# Yearly Trends
speechiness_yearly <- analysis_base %>%
group_by(year) %>%
summarise(
avg_speechiness = mean(speechiness, na.rm = TRUE),
n = n()
) %>%
ungroup()
# Decadal Averages
speechiness_decade <- analysis_base %>%
group_by(decade) %>%
summarise(
avg_speechiness = round(mean(speechiness, na.rm = TRUE), 4),
n = n()
)
cat("Year Range:", min(speechiness_yearly$year),
"–", max(speechiness_yearly$year), "\n")
```
## Long-term Trend of Speechiness
```{r, fig.width = 14, fig.height = 6}
ggplot(speechiness_yearly, aes(x = year, y = avg_speechiness)) +
# Three-phase background color blocks
annotate("rect",
xmin = 1960, xmax = 1979,
ymin = -Inf, ymax = Inf,
fill = "#FFD700", alpha = 0.06) +
annotate("rect",
xmin = 1980, xmax = 1999,
ymin = -Inf, ymax = Inf,
fill = "#FF6B6B", alpha = 0.06) +
annotate("rect",
xmin = 2000, xmax = 2019,
ymin = -Inf, ymax = Inf,
fill = "#4A90D9", alpha = 0.06) +
# Phase labels
annotate("text",
x = 1969, y = 0.028,
label = "Phase 1\nPre-Rap Era",
size = 3, color = "#B8860B", fontface = "bold") +
annotate("text",
x = 1989, y = 0.028,
label = "Phase 2\nHip-Hop Emerges",
size = 3, color = "#CC0000", fontface = "bold") +
annotate("text",
x = 2009, y = 0.028,
label = "Phase 3\nHip-Hop Dominates",
size = 3, color = "#1A5276", fontface = "bold") +
# Key event markers
annotate("rect",
xmin = 1988, xmax = 1992,
ymin = -Inf, ymax = Inf,
fill = "purple", alpha = 0.06) +
annotate("text",
x = 1990, y = 0.120,
label = "Golden Age\nof Hip-Hop",
size = 2.8, color = "purple", fontface = "bold") +
# Phase separators
geom_vline(xintercept = 1980,
linetype = "solid", linewidth = 0.6,
color = "gray50", alpha = 0.5) +
geom_vline(xintercept = 2000,
linetype = "solid", linewidth = 0.6,
color = "gray50", alpha = 0.5) +
# Trend lines
geom_line(color = "#8E44AD", linewidth = 0.9) +
geom_point(color = "#8E44AD", size = 1.8) +
# Endpoint labels
annotate("text",
x = 2020, y = speechiness_yearly %>%
filter(year == 2019) %>% pull(avg_speechiness),
label = paste0("2019\n",
round(speechiness_yearly %>%
filter(year == 2019) %>%
pull(avg_speechiness), 3)),
size = 3, color = "#8E44AD", fontface = "bold") +
labs(
title = "Average Speechiness of Billboard Hot 100 Songs (1960–2019)",
subtitle = "Rising speechiness reflects the growing influence of Hip-Hop on mainstream music",
x = "Year",
y = "Average Speechiness (0 = No Speech, 1 = All Speech)",
caption = "Source: Billboard Hot 100 × Spotify Hit Predictor"
) +
theme_minimal() +
theme(
plot.title = element_text(face = "bold", size = 14),
plot.subtitle = element_text(size = 10, color = "gray40")
)
```
## Decadal Comparison of Speechiness
```{r, fig.width = 10, fig.height = 6}
speechiness_decade %>%
ggplot(aes(x = decade, y = avg_speechiness, fill = decade)) +
geom_col(show.legend = FALSE) +
geom_text(aes(label = avg_speechiness),
vjust = -0.5, size = 3.5, fontface = "bold") +
scale_fill_manual(values = c(
"1960s" = "#F9E79F",
"1970s" = "#FAD7A0",
"1980s" = "#F1948A",
"1990s" = "#C39BD3",
"2000s" = "#85C1E9",
"2010s" = "#5DADE2"
)) +
labs(
title = "Average Speechiness by Decade (Billboard Hot 100)",
subtitle = "Speechiness increases sharply from the 1990s onward",
x = "Decade",
y = "Average Speechiness",
caption = "Source: Billboard Hot 100 × Spotify Hit Predictor"
) +
theme_minimal() +
theme(
plot.title = element_text(face = "bold", size = 14),
plot.subtitle = element_text(size = 10, color = "gray40")
)
```
## Speechiness Comparison by Genre
```{r, fig.width = 10, fig.height = 6}
# Analysis based on D3 subset
genre_speech <- analysis_base %>%
filter(!is.na(genre)) %>%
group_by(genre) %>%
summarise(
avg_speechiness = round(mean(speechiness, na.rm = TRUE), 4),
n = n()
) %>%
arrange(desc(avg_speechiness))
cat("Average Speechiness per Genre:\n")
genre_speech %>%
gt() %>%
tab_header(title = "Speechiness by Genre") %>%
fmt_number(columns = n, decimals = 0, use_seps = TRUE) %>%
data_color(
columns = avg_speechiness,
palette = "Purples"
)
```
## Genre Speechiness Bar Chart
```{r, fig.width = 10, fig.height = 6}
# Genre Speechiness Bar Chart
genre_speech %>%
mutate(
genre = fct_reorder(genre, avg_speechiness),
is_hiphop = genre == "hip hop"
) %>%
ggplot(aes(x = genre, y = avg_speechiness,
fill = is_hiphop)) +
geom_col(show.legend = FALSE) +
geom_text(aes(label = avg_speechiness),
hjust = -0.2, size = 3.5, fontface = "bold") +
scale_fill_manual(values = c(
"TRUE" = "#8E44AD",
"FALSE" = "#BDC3C7"
)) +
coord_flip() +
labs(
title = "Average Speechiness by Genre",
subtitle = "Hip-Hop shows significantly higher speechiness than all other genres",
x = "Genre",
y = "Average Speechiness",
caption = "Source: Billboard Hot 100 × Music Dataset 1950–2019"
) +
theme_minimal() +
theme(
plot.title = element_text(face = "bold", size = 14),
plot.subtitle = element_text(size = 10, color = "gray40")
)
```
## Hip-Hop Speechiness Trend by Decade
```{r}
# Speechiness: Hip-Hop vs Other Genres
analysis_base %>%
filter(!is.na(genre)) %>%
mutate(is_hiphop = ifelse(genre == "hip hop",
"Hip-Hop", "Other Genres")) %>%
group_by(decade, is_hiphop) %>%
summarise(
avg_speechiness = round(mean(speechiness, na.rm = TRUE), 4),
.groups = "drop"
) %>%
ggplot(aes(x = decade, y = avg_speechiness,
fill = is_hiphop)) +
geom_col(position = "dodge") +
geom_text(aes(label = avg_speechiness),
position = position_dodge(width = 0.9),
vjust = -0.5, size = 3) +
scale_fill_manual(values = c(
"Hip-Hop" = "#8E44AD",
"Other Genres" = "#BDC3C7"
)) +
labs(
title = "Speechiness: Hip-Hop vs Other Genres by Decade",
subtitle = "Hip-Hop consistently shows higher speechiness across all decades",
x = "Decade",
y = "Average Speechiness",
fill = "Genre Group",
caption = "Source: Billboard Hot 100 × Music Dataset 1950–2019"
) +
theme_minimal() +
theme(
plot.title = element_text(face = "bold", size = 14),
plot.subtitle = element_text(size = 10, color = "gray40"),
legend.position = "bottom"
)
```
## Summary of Observations

**RQ3 Summary: Rise of Speechiness and Hip-Hop Mainstreaming**

**Key Findings**

**1. Long-term Growth in Speechiness**

The average speechiness of Billboard Hot 100 songs has increased substantially over the past six decades:

- 1960s average: 0.0382
- 2010s average: 0.1037
- Total growth: 171.5%

This indicates a structural shift in popular music from melodic singing toward rhythmic speech and rap-like delivery.
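
The growth figure follows directly from the two decade averages quoted above. A minimal worked check, using only those two numbers (the variable names are introduced here for illustration):

```{r}
# Decade averages as quoted in the summary above
speech_1960s <- 0.0382
speech_2010s <- 0.1037

# Relative growth from the 1960s to the 2010s, in percent
growth_pct <- (speech_2010s - speech_1960s) / speech_1960s * 100
round(growth_pct, 1)
```

This evaluates to 171.5, matching the total growth reported in the list.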

**2. Hip-Hop vs. Other Genres**

Genre-specific analysis (D3 subset) reveals a stark contrast:

- Hip-Hop average: 0.2325
- Other genres average: \~0.0463
- Contrast: Hip-Hop's speechiness is over 4.8 times the average of all other genres combined, which is consistent with Hip-Hop being the primary driver of this trend.
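
The chunks above report speechiness per genre and per decade, but not the pooled "other genres" average directly. The sketch below shows one way such a comparison could be computed. It assumes the non-Hip-Hop tracks are pooled at the track level (rather than averaging the genre means), and it runs on made-up toy records, so its output is illustrative and does not reproduce the figures above.

```{r}
library(dplyr)

# Toy records (made-up values); column names mirror those used in the analysis
toy_tracks <- tibble::tibble(
  genre       = c("hip hop", "hip hop", "pop", "rock", "pop", "jazz"),
  speechiness = c(0.20, 0.30, 0.05, 0.04, 0.06, 0.05)
)

# Pool all non-Hip-Hop tracks into one group and average each group
toy_groups <- toy_tracks %>%
  mutate(group = if_else(genre == "hip hop", "Hip-Hop", "Other Genres")) %>%
  group_by(group) %>%
  summarise(avg_speechiness = mean(speechiness), n = n(), .groups = "drop")

# Ratio of the Hip-Hop average to the pooled average of all other genres
toy_ratio <- toy_groups$avg_speechiness[toy_groups$group == "Hip-Hop"] /
  toy_groups$avg_speechiness[toy_groups$group == "Other Genres"]

toy_groups
toy_ratio
```

Pooling at the track level weights each genre by its number of tracks. Averaging the per-genre means instead would give small genres the same weight as Pop, and could yield a noticeably different ratio.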

**Conclusion**

The rise of speechiness on the Billboard Hot 100 is strongly associated with Hip-Hop's growing sonic influence on mainstream music. Genre classification data suggests Hip-Hop accounts for at most 7.6% of chart appearances in any decade (its 1990s peak), which is likely an underestimate due to classification limitations. Speechiness offers a complementary, algorithm-based measure of Hip-Hop's influence that does not depend on genre labels.

As Hip-Hop's sonic characteristics permeated the mainstream from the 1980s onward, the spoken-word quality of charting songs rose with them, reflecting a broader cultural shift toward lyrical density and rhythmic delivery that transcends genre labels.

**Limitations**

- **Sample coverage:** The D3 genre subset covers only 13.94% of the data, so the genre-level results may be subject to sampling bias.
- **Classification overlap:** Some "Pop" tracks with heavy rap elements may not be labeled as Hip-Hop, potentially understating Hip-Hop's total impact.
- **Data constraints:** The actual rise in speechiness across the full dataset might be even more dramatic than what this subset shows.