For this project, we want to analyze the emotions of the top Rolling Stone’s artists via sentiment analysis of song lyrics.
Data sources:
Top artists data: * Rolling Stone Album Rankings CSV: https://github.com/rfordatascience/tidytuesday/tree/main/data/2024/2024-05-07
Songs data: * Wikipedia Singles Discography (List of singles, with selected chart positions and certifications chart). Example: https://en.wikipedia.org/wiki/The_Beatles_singles_discography
Song Lyrics data: * Genius API and genius.com: https://docs.genius.com/
We chose these data sets so we can:
Additionally, these data sets would give us data preparation experience such as:
library(httr)
library(jsonlite)
library(dplyr)
library(stringr)
library(rvest)
library(janitor)
library(stringr)
library(tidytext)
library(tibble)
library(textdata)
library(tidyr)
library(readr)
library(purrr)
library(forcats)
library(ggplot2)
library(doParallel)
library(foreach)
library(DBI)
library(RMariaDB)
# Load sentiment lexicons
nrc <- get_sentiments("nrc")
bing <- get_sentiments("bing")
# Detect number of cores and create cluster
num_cores <- parallel::detectCores() - 1 # leave one core free
cl <- makeCluster(num_cores)
registerDoParallel(cl)
rs <- read_csv("https://raw.githubusercontent.com/aaliyahmjh/DATA607Project/refs/heads/main/rolling_stone.csv", show_col_types = FALSE)
head(rs)
## # A tibble: 6 × 21
## sort_name clean_name album rank_2003 rank_2012 rank_2020 differential
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Sinatra, Frank Frank Sinatra In t… 100 101 282 -182
## 2 Diddley, Bo Bo Diddley Bo D… 214 216 455 -241
## 3 Presley, Elvis Elvis Presley Elvi… 55 56 332 -277
## 4 Sinatra, Frank Frank Sinatra Song… 306 308 NA -195
## 5 Little Richard Little Richard Here… 50 50 227 -177
## 6 Beyonce Beyonce Lemo… NA NA 32 469
## # ℹ 14 more variables: release_year <dbl>, genre <chr>, type <chr>,
## # weeks_on_billboard <dbl>, peak_billboard_position <dbl>,
## # spotify_popularity <dbl>, spotify_url <chr>, artist_member_count <dbl>,
## # artist_gender <chr>, artist_birth_year_sum <dbl>,
## # debut_album_release_year <dbl>, ave_age_at_top_500 <dbl>,
## # years_between <dbl>, album_id <chr>
This data set is untidy because it has a wide structure with duplicate columns such as name and 3 separate rank columns that can be combined and referenced with a “year” column:
# Tidy the data
tidy_rs <- rs %>%
pivot_longer(
cols = starts_with("rank_"),
names_to = "year",
values_to = "rank",
names_pattern = "rank_(\\d+)"
)
head(tidy_rs)
## # A tibble: 6 × 20
## sort_name clean_name album differential release_year genre type
## <chr> <chr> <chr> <dbl> <dbl> <chr> <chr>
## 1 Sinatra, Frank Frank Sinatra In the Wee… -182 1955 Big … Stud…
## 2 Sinatra, Frank Frank Sinatra In the Wee… -182 1955 Big … Stud…
## 3 Sinatra, Frank Frank Sinatra In the Wee… -182 1955 Big … Stud…
## 4 Diddley, Bo Bo Diddley Bo Diddley… -241 1955 Rock… Stud…
## 5 Diddley, Bo Bo Diddley Bo Diddley… -241 1955 Rock… Stud…
## 6 Diddley, Bo Bo Diddley Bo Diddley… -241 1955 Rock… Stud…
## # ℹ 13 more variables: weeks_on_billboard <dbl>, peak_billboard_position <dbl>,
## # spotify_popularity <dbl>, spotify_url <chr>, artist_member_count <dbl>,
## # artist_gender <chr>, artist_birth_year_sum <dbl>,
## # debut_album_release_year <dbl>, ave_age_at_top_500 <dbl>,
## # years_between <dbl>, album_id <chr>, year <chr>, rank <dbl>
Now each observation represents an album/year combination making the data set tidy.
Next we want to filter the data set to only focus on albums ranked in 2020:
filter_rs <- tidy_rs %>%
select(clean_name, album, genre, year, rank)
# Filter dataset to only focus on albums ranked in 2020
filter_rs <- filter_rs %>%
filter(year == 2020) %>%
filter(!is.na(rank))
head(filter_rs)
## # A tibble: 6 × 5
## clean_name album genre year rank
## <chr> <chr> <chr> <chr> <dbl>
## 1 Frank Sinatra In the Wee Small Hours Big Band/Jazz 2020 282
## 2 Bo Diddley Bo Diddley / Go Bo Diddley Rock n' Roll/Rhythm & B… 2020 455
## 3 Elvis Presley Elvis Presley Rock n' Roll/Rhythm & B… 2020 332
## 4 Little Richard Here's Little Richard <NA> 2020 227
## 5 Beyonce Lemonade <NA> 2020 32
## 6 Amy Winehouse Back to Black Soul/Gospel/R&B 2020 33
Now we want to grab only the top 10 artists:
# Select the top 10 artists & albums by rank
top10_by_rank <- filter_rs %>%
arrange(rank) %>% # sort ascending
slice_head(n = 10) # take the first 10 rows
top10_by_rank
## # A tibble: 10 × 5
## clean_name album genre year rank
## <chr> <chr> <chr> <chr> <dbl>
## 1 Marvin Gaye What's Going On Soul/Gospel/R&B 2020 1
## 2 The Beach Boys Pet Sounds <NA> 2020 2
## 3 Joni Mitchell Blue Country/Folk/Coun… 2020 3
## 4 Stevie Wonder Songs in the Key of Life Soul/Gospel/R&B 2020 4
## 5 The Beatles Abbey Road <NA> 2020 5
## 6 Nirvana Nevermind Indie/Alternative… 2020 6
## 7 Fleetwood Mac Rumours Singer-Songwriter… 2020 7
## 8 Prince Purple Rain <NA> 2020 8
## 9 Bob Dylan Blood on the Tracks Country/Folk/Coun… 2020 9
## 10 Lauryn Hill The Miseducation of Lauryn Hill Hip-Hop/Rap 2020 10
The top artists are:
Note: none of the top artists from Rolling Stone are recent artists, so when we look at the songs and when they were released, we’ll have more data from 1960s - 1990s.
Now that we have our artists to focus on, next is grabbing the songs for each artist. We’ll be grabbing this data from Wikipedia for us to get web scraping experience.
Each Wikipedia page is set up a bit differently, so we’re going to get the songs for each artist individually:
# Get the html
html_marvin_gaye <- read_html("https://en.wikipedia.org/wiki/Marvin_Gaye_discography#Singles")
html_beach_boys <- read_html("https://en.wikipedia.org/wiki/The_Beach_Boys_discography#Singles")
html_joni_mitchell <- read_html("https://en.wikipedia.org/wiki/Joni_Mitchell_discography#Singles")
html_stevie_wonder <- read_html("https://en.wikipedia.org/wiki/Stevie_Wonder_discography#Singles")
html_the_beatles <- read_html("https://en.wikipedia.org/wiki/The_Beatles_singles_discography#Singles")
html_nirvana <- read_html("https://en.wikipedia.org/wiki/Nirvana_discography#Singles")
html_fleetwood_mac <- read_html("https://en.wikipedia.org/wiki/Fleetwood_Mac_discography#Singles")
html_prince <- read_html("https://en.wikipedia.org/wiki/Prince_singles_discography#Singles")
html_bob_dylan <- read_html("https://en.wikipedia.org/wiki/Bob_Dylan_discography#Singles")
html_lauryn_hill <- read_html("https://en.wikipedia.org/wiki/Lauryn_Hill_discography#Singles")
Next we select the tables we want from the HTML. Note, some artist’s pages have multiple tables for songs where each table represents a decade. For these artists, we’ll need to gather multiple tables in order to gather all the songs:
# Scrape the songs table from wikipedia
get_songs_df <- function(html, css_selector) {
songs <- html |> html_elements(css_selector) |>
html_table()
return(as.data.frame(songs[[1]]))
}
# Note: these Wikipedia pages can be updated, so it's possible this code won't run in the future (css selector would need to be updated)
marvin_gaye_df <- get_songs_df(html_marvin_gaye, "#mw-content-text > div.mw-content-ltr.mw-parser-output > table:nth-child(30)") # 1960s
marvin_gaye_df2 <- get_songs_df(html_marvin_gaye, "#mw-content-text > div.mw-content-ltr.mw-parser-output > table:nth-child(32)") # 1970 - 1984
marvin_gaye_df3 <- get_songs_df(html_marvin_gaye, "#mw-content-text > div.mw-content-ltr.mw-parser-output > table:nth-child(34)") # Posthumous
beach_boys_df <- get_songs_df(html_beach_boys, "#mw-content-text > div.mw-content-ltr.mw-parser-output > table:nth-child(21)") # 1960s
beach_boys_df2 <- get_songs_df(html_beach_boys, "#mw-content-text > div.mw-content-ltr.mw-parser-output > table:nth-child(23)") # 1970s
beach_boys_df3 <- get_songs_df(html_beach_boys, "#mw-content-text > div.mw-content-ltr.mw-parser-output > table:nth-child(25)") # 1980s
beach_boys_df4 <- get_songs_df(html_beach_boys, "#mw-content-text > div.mw-content-ltr.mw-parser-output > table:nth-child(27)") # 1990s - present
joni_mitchell_df <- get_songs_df(html_joni_mitchell, "#mw-content-text > div.mw-content-ltr.mw-parser-output > table:nth-child(19)")
stevie_wonder_df <- get_songs_df(html_stevie_wonder, "#mw-content-text > div.mw-content-ltr.mw-parser-output > table:nth-child(18)") # 1960s
stevie_wonder_df2 <- get_songs_df(html_stevie_wonder, "#mw-content-text > div.mw-content-ltr.mw-parser-output > table:nth-child(20)") # 1970s
stevie_wonder_df3 <- get_songs_df(html_stevie_wonder, "#mw-content-text > div.mw-content-ltr.mw-parser-output > table:nth-child(22)") # 1980s
stevie_wonder_df4 <- get_songs_df(html_stevie_wonder, "#mw-content-text > div.mw-content-ltr.mw-parser-output > table:nth-child(24)") # 1990s - present
the_beatles_df <- get_songs_df(html_the_beatles, "#mw-content-text > div.mw-content-ltr.mw-parser-output > table:nth-child(10)")
nirvana_df <- get_songs_df(html_nirvana, "#mw-content-text > div.mw-content-ltr.mw-parser-output > table:nth-child(20)")
fleetwood_mac_df <- get_songs_df(html_fleetwood_mac, "#mw-content-text > div.mw-content-ltr.mw-parser-output > table:nth-child(23)")
prince_df <- get_songs_df(html_prince, "#mw-content-text > div.mw-content-ltr.mw-parser-output > table:nth-child(18)") # 1970s & 80s
prince_df2 <- get_songs_df(html_prince, "#mw-content-text > div.mw-content-ltr.mw-parser-output > table:nth-child(20)") # 1990s
prince_df3 <- get_songs_df(html_prince, "#mw-content-text > div.mw-content-ltr.mw-parser-output > table:nth-child(22)") # 2000s
prince_df4 <- get_songs_df(html_prince, "#mw-content-text > div.mw-content-ltr.mw-parser-output > table:nth-child(24)") # 2010s - 20s
bob_dylan_df <- get_songs_df(html_bob_dylan, "#mw-content-text > div.mw-content-ltr.mw-parser-output > table:nth-child(18)")
lauryn_hill_df <- get_songs_df(html_lauryn_hill, "#mw-content-text > div.mw-content-ltr.mw-parser-output > table:nth-child(18)")
Next we’ll go through each unique dataframe (note, each dataframe is
slightly different from each other), and tidy the data. These dataframes
are untidy because each chart category is a unique column. Instead there
should be a general albums_chart
category variable and a
rank
variable to represent this information.
# Marvin Gaye
## 1960s ##
# Keep the important variables
marvin_gaye_df <- marvin_gaye_df[, 1:11]
# First row is actually the column names
marvin_gaye_df <- marvin_gaye_df |>
row_to_names(row_number = 1)
# Clean up the names
marvin_gaye_df_clean <- marvin_gaye_df %>%
clean_names()
colnames(marvin_gaye_df_clean)[4:9] <- c("us", "us_r_and_b", "can", "ire", "ned", "uk")
# Make tidy - make the albums chart it's own variable
marvin_gaye_df_clean <- marvin_gaye_df_clean |>
pivot_longer(cols = us:uk, names_to = "albums_chart", values_to = "peak_chart_position")
# Clean up
marvin_gaye_df_clean <- marvin_gaye_df_clean |>
filter(title != '"—" denotes items that did not chart or were not released in that territory.')
marvin_gaye_df_clean$title <- str_extract(marvin_gaye_df_clean$title , "^(.).*\\1")
head(marvin_gaye_df_clean)
## # A tibble: 6 × 7
## year title b_side certifications album albums_chart peak_chart_position
## <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 1961 "\"Let You… "\"Ne… "" The … us —
## 2 1961 "\"Let You… "\"Ne… "" The … us_r_and_b —
## 3 1961 "\"Let You… "\"Ne… "" The … can —
## 4 1961 "\"Let You… "\"Ne… "" The … ire —
## 5 1961 "\"Let You… "\"Ne… "" The … ned —
## 6 1961 "\"Let You… "\"Ne… "" The … uk —
## 1970 - 1984 ##
# First row is actually the column names
marvin_gaye_df2 <- marvin_gaye_df2 |>
row_to_names(row_number = 1)
# Clean up the names
marvin_gaye_df2_clean <- marvin_gaye_df2 %>%
clean_names()
colnames(marvin_gaye_df2_clean)[4:12] <- c("us", "us_r_and_b", "aus", "can", "ger", "ire", "ned", "swe", "uk")
# Make tidy - make the albums chart it's own variable
marvin_gaye_df2_clean <- marvin_gaye_df2_clean |>
pivot_longer(cols = us:uk, names_to = "albums_chart", values_to = "peak_chart_position")
head(marvin_gaye_df2_clean)
## # A tibble: 6 × 8
## year title b_side certifications album na albums_chart peak_chart_position
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 1970 "\"H… "\"Go… "" That… <NA> us 41
## 2 1970 "\"H… "\"Go… "" That… <NA> us_r_and_b 18
## 3 1970 "\"H… "\"Go… "" That… <NA> aus 37
## 4 1970 "\"H… "\"Go… "" That… <NA> can 37
## 5 1970 "\"H… "\"Go… "" That… <NA> ger —
## 6 1970 "\"H… "\"Go… "" That… <NA> ire —
# Clean up
marvin_gaye_df2_clean <- marvin_gaye_df2_clean |>
filter(title != '"—" denotes items that did not chart or were not released in that territory.')
marvin_gaye_df2_clean$title <- str_extract(marvin_gaye_df2_clean$title , "^(.).*\\1")
## Posthumous ##
# First row is actually the column names
marvin_gaye_df3 <- marvin_gaye_df3 |>
row_to_names(row_number = 1)
# Clean up the names
marvin_gaye_df3_clean <- marvin_gaye_df3 %>%
clean_names()
colnames(marvin_gaye_df3_clean)[4:12] <- c("us", "us_r_and_b", "aus", "ger", "ire", "ned", "sco", "swe", "uk")
# Make tidy - make the albums chart it's own variable
marvin_gaye_df3_clean <- marvin_gaye_df3_clean |>
pivot_longer(cols = us:uk, names_to = "albums_chart", values_to = "peak_chart_position")
# Clean up
marvin_gaye_df3_clean <- marvin_gaye_df3_clean |>
filter(title != '"—" denotes items that did not chart or were not released in that territory. "N/A" indicates chart not yet published.')
marvin_gaye_df3_clean$title <- str_extract(marvin_gaye_df3_clean$title , "^(.).*\\1")
head(marvin_gaye_df3_clean)
## # A tibble: 6 × 8
## year title b_side certifications album na albums_chart peak_chart_position
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 1985 "\"S… "\"Sa… "" Drea… <NA> us [K]
## 2 1985 "\"S… "\"Sa… "" Drea… <NA> us_r_and_b 2
## 3 1985 "\"S… "\"Sa… "" Drea… <NA> aus —
## 4 1985 "\"S… "\"Sa… "" Drea… <NA> ger —
## 5 1985 "\"S… "\"Sa… "" Drea… <NA> ire —
## 6 1985 "\"S… "\"Sa… "" Drea… <NA> ned —
# The Beach Boys
## 1960s ##
beach_boys_df <- beach_boys_df |>
row_to_names(row_number = 1)
# Clean up the names
beach_boys_df_clean <- beach_boys_df %>%
clean_names()
colnames(beach_boys_df_clean)[2:12] <- c("title", "us", "us_cash_box", "us_record_world", "aus", "can", "nl", "nor", "swe", "uk", "phi")
# Make tidy - make the albums chart it's own variable
beach_boys_df_clean <- beach_boys_df_clean |>
pivot_longer(cols = us:phi, names_to = "albums_chart", values_to = "peak_chart_position")
# A side and B side single appear in the same cell in single variable, so we need to break them out into individual observations
beach_boys_df_clean <- separate_rows(beach_boys_df_clean, title, sep = '""')
beach_boys_df_clean
## # A tibble: 710 × 6
## year title certifications_sales…¹ album albums_chart peak_chart_position
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 1961 "\"Surfi… — Surf… us 75—
## 2 1961 "Luau\" … — Surf… us 75—
## 3 1961 "\"Surfi… — Surf… us_cash_box 85—
## 4 1961 "Luau\" … — Surf… us_cash_box 85—
## 5 1961 "\"Surfi… — Surf… us_record_w… ——
## 6 1961 "Luau\" … — Surf… us_record_w… ——
## 7 1961 "\"Surfi… — Surf… aus ——
## 8 1961 "Luau\" … — Surf… aus ——
## 9 1961 "\"Surfi… — Surf… can ——
## 10 1961 "Luau\" … — Surf… can ——
## # ℹ 700 more rows
## # ℹ abbreviated name: ¹certifications_sales_thresholds
# Clean up
beach_boys_df_clean <- beach_boys_df_clean |>
filter(title != '"—" denotes a release that did not chart, was not released in the country or the information is unknown (* - US Record World chart data incomplete for early 1964)')
beach_boys_df_clean$title <- str_extract(beach_boys_df_clean$title , "^(.).*\\1")
## 1970s ##
beach_boys_df2 <- beach_boys_df2 |>
row_to_names(row_number = 1)
# Clean up the names
beach_boys_df2_clean <- beach_boys_df2 %>%
clean_names()
colnames(beach_boys_df2_clean)[2:12] <- c("title", "us", "us_cash_box", "us_record_world", "us_ac", "aus", "can", "nl", "nor", "swe", "uk")
# Make tidy - make the albums chart it's own variable
beach_boys_df2_clean <- beach_boys_df2_clean |>
pivot_longer(cols = us:uk, names_to = "albums_chart", values_to = "peak_chart_position")
# A side and B side single appear in the same cell in single variable, so we need to break them out into individual observations
beach_boys_df2_clean <- separate_rows(beach_boys_df2_clean, title, sep = '""')
## 1980s ##
beach_boys_df3 <- beach_boys_df3 |>
row_to_names(row_number = 1)
# Clean up the names
beach_boys_df3_clean <- beach_boys_df3 %>%
clean_names()
colnames(beach_boys_df3_clean)[2:9] <- c("title", "us", "us_cash_box", "us_ac", "aus", "can", "ger", "uk")
# Make tidy - make the albums chart it's own variable
beach_boys_df3_clean <- beach_boys_df3_clean |>
pivot_longer(cols = us:uk, names_to = "albums_chart", values_to = "peak_chart_position")
# A side and B side single appear in the same cell in single variable, so we need to break them out into individual observations
beach_boys_df3_clean <- separate_rows(beach_boys_df3_clean, title, sep = '""')
beach_boys_df3_clean <- separate_rows(beach_boys_df3_clean, title, sep = '" "')
## 1990s - Present
beach_boys_df4 <- beach_boys_df4 |>
row_to_names(row_number = 1)
# Clean up the names
beach_boys_df4_clean <- beach_boys_df4 %>%
clean_names()
colnames(beach_boys_df4_clean)[2:10] <- c("title", "us", "us_cash_box", "us_ac", "aus", "can", "ger", "uk", "us_cou")
# Make tidy - make the albums chart it's own variable
beach_boys_df4_clean <- beach_boys_df4_clean |>
pivot_longer(cols = us:us_cou, names_to = "albums_chart", values_to = "peak_chart_position")
# A side and B side single appear in the same cell in single variable, so we need to break them out into individual observations
beach_boys_df4_clean <- separate_rows(beach_boys_df4_clean, title, sep = '""')
beach_boys_df4_clean <- separate_rows(beach_boys_df4_clean, title, sep = '" "')
the_beach_boys_df_clean <- beach_boys_df_clean
the_beach_boys_df2_clean <- beach_boys_df2_clean
the_beach_boys_df3_clean <- beach_boys_df3_clean
the_beach_boys_df4_clean <- beach_boys_df4_clean
head(the_beach_boys_df_clean)
## # A tibble: 6 × 6
## year title certifications_sales_thre…¹ album albums_chart peak_chart_position
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 1961 <NA> — Surf… us 75—
## 2 1961 <NA> — Surf… us 75—
## 3 1961 <NA> — Surf… us_cash_box 85—
## 4 1961 <NA> — Surf… us_cash_box 85—
## 5 1961 <NA> — Surf… us_record_w… ——
## 6 1961 <NA> — Surf… us_record_w… ——
## # ℹ abbreviated name: ¹certifications_sales_thresholds
head(the_beach_boys_df2_clean)
## # A tibble: 6 × 5
## year title album albums_chart peak_chart_position
## <chr> <chr> <chr> <chr> <chr>
## 1 1970 "\"Add Some Music to Your Day" Sunf… us 64—
## 2 1970 "Susie Cincinnati\" (non-album B… Sunf… us 64—
## 3 1970 "\"Add Some Music to Your Day" Sunf… us_cash_box 49—
## 4 1970 "Susie Cincinnati\" (non-album B… Sunf… us_cash_box 49—
## 5 1970 "\"Add Some Music to Your Day" Sunf… us_record_w… 39—
## 6 1970 "Susie Cincinnati\" (non-album B… Sunf… us_record_w… 39—
head(the_beach_boys_df3_clean)
## # A tibble: 6 × 6
## year title certifications album albums_chart peak_chart_position
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 1980 "\"Goin' On" — Keep… us 83—
## 2 1980 "Endless Harmony\… — Keep… us 83—
## 3 1980 "\"Goin' On" — Keep… us_cash_box 103—
## 4 1980 "Endless Harmony\… — Keep… us_cash_box 103—
## 5 1980 "\"Goin' On" — Keep… us_ac ——
## 6 1980 "Endless Harmony\… — Keep… us_ac ——
head(the_beach_boys_df4_clean)
## # A tibble: 6 × 6
## year title certifications album albums_chart peak_chart_position
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 1990 "\"Somewhere Near… — Stil… us ——
## 2 1990 "Kokomo\"" — Stil… us ——
## 3 1990 "\"Somewhere Near… — Stil… us_cash_box ——
## 4 1990 "Kokomo\"" — Stil… us_cash_box ——
## 5 1990 "\"Somewhere Near… — Stil… us_ac ——
## 6 1990 "Kokomo\"" — Stil… us_ac ——
# Joni Mitchell
joni_mitchell_df <- joni_mitchell_df |>
row_to_names(row_number = 1)
# Clean up the variable names
joni_mitchell_df_clean <- joni_mitchell_df %>%
clean_names()
colnames(joni_mitchell_df_clean)[3:9] <- c("can", "can_uc", "aus", "uk", "us", "us_ac", "us_main")
# Make tidy - make the albums chart it's own variable
joni_mitchell_df_clean <- joni_mitchell_df_clean |>
pivot_longer(cols = can:us_main, names_to = "albums_chart", values_to = "peak_chart_position")
joni_mitchell_df_clean <- joni_mitchell_df_clean |>
filter(title != '"—" denotes releases that did not chart')
# Clean up song names
joni_mitchell_df_clean$title <- str_extract(joni_mitchell_df_clean$title , "^(.).*\\1")
head(joni_mitchell_df_clean)
## # A tibble: 6 × 6
## title year certifications album albums_chart peak_chart_position
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 "\"Night in the C… 1968 "" Song… can —
## 2 "\"Night in the C… 1968 "" Song… can_uc —
## 3 "\"Night in the C… 1968 "" Song… aus —
## 4 "\"Night in the C… 1968 "" Song… uk —
## 5 "\"Night in the C… 1968 "" Song… us —
## 6 "\"Night in the C… 1968 "" Song… us_ac —
# Stevie Wonder
stevie_wonder_df <- stevie_wonder_df |>
row_to_names(row_number = 1)
## Warning: Row 1 does not provide unique names. Consider running clean_names()
## after row_to_names().
# Clean up the names
stevie_wonder_df_clean <- stevie_wonder_df %>%
clean_names()
colnames(stevie_wonder_df_clean)[1:11] <- c("title", "year", "us", "us_r_and_b", "us_ac", "bel", "can", "ger", "ire", "swi", "uk")
# Make tidy - make the albums chart it's own variable
stevie_wonder_df_clean <- stevie_wonder_df_clean |>
pivot_longer(cols = us:uk, names_to = "albums_chart", values_to = "peak_chart_position")
# A side and B side single appear in the same cell in single variable, so we need to break them out into individual observations
stevie_wonder_df_clean <- separate_rows(stevie_wonder_df_clean, title, sep = 'b/w')
head(stevie_wonder_df_clean)
## # A tibble: 6 × 10
## title year swi_39 uk_40 certifications album na na_2 albums_chart
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 "\"I Call It… 1962 — — "" Non-… <NA> <NA> us
## 2 " Part 2" 1962 — — "" Non-… <NA> <NA> us
## 3 "\"I Call It… 1962 — — "" Non-… <NA> <NA> us_r_and_b
## 4 " Part 2" 1962 — — "" Non-… <NA> <NA> us_r_and_b
## 5 "\"I Call It… 1962 — — "" Non-… <NA> <NA> us_ac
## 6 " Part 2" 1962 — — "" Non-… <NA> <NA> us_ac
## # ℹ 1 more variable: peak_chart_position <chr>
## 1970s ##
stevie_wonder_df2 <- stevie_wonder_df2 |>
row_to_names(row_number = 1)
# Clean up the names
stevie_wonder_df2_clean <- stevie_wonder_df2 %>%
clean_names()
colnames(stevie_wonder_df2_clean)[1:12] <- c("title", "year", "us", "us_r_and_b", "us_ac", "aus", "bel", "can", "ger", "ire", "swi", "uk")
# Make tidy - make the albums chart it's own variable
stevie_wonder_df2_clean <- stevie_wonder_df2_clean |>
pivot_longer(cols = us:uk, names_to = "albums_chart", values_to = "peak_chart_position")
# A side and B side single appear in the same cell in single variable, so we need to break them out into individual observations
stevie_wonder_df2_clean <- separate_rows(stevie_wonder_df2_clean, title, sep = 'b/w')
head(stevie_wonder_df2_clean)
## # A tibble: 6 × 8
## title year uk_40 certifications album na albums_chart peak_chart_position
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 "\"Ne… 1970 6 "" Sign… <NA> us 26
## 2 " \"S… 1970 6 "" Sign… <NA> us 26
## 3 "\"Ne… 1970 6 "" Sign… <NA> us_r_and_b 11
## 4 " \"S… 1970 6 "" Sign… <NA> us_r_and_b 11
## 5 "\"Ne… 1970 6 "" Sign… <NA> us_ac 31
## 6 " \"S… 1970 6 "" Sign… <NA> us_ac 31
## 1980s ##
stevie_wonder_df3 <- stevie_wonder_df3 |>
row_to_names(row_number = 1)
# Clean up the names
stevie_wonder_df3_clean <- stevie_wonder_df3 %>%
clean_names()
colnames(stevie_wonder_df3_clean)[1:12] <- c("title", "year", "us", "us_r_and_b", "us_ac", "aus", "bel", "can", "ger", "ire", "swi", "uk")
# Make tidy - make the albums chart it's own variable
stevie_wonder_df3_clean <- stevie_wonder_df3_clean |>
pivot_longer(cols = us:uk, names_to = "albums_chart", values_to = "peak_chart_position")
# A side and B side single appear in the same cell in single variable, so we need to break them out into individual observations
stevie_wonder_df3_clean <- separate_rows(stevie_wonder_df3_clean, title, sep = 'b/w')
head(stevie_wonder_df3_clean)
## # A tibble: 6 × 9
## title year swi_39 uk_40 certifications album na albums_chart
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 "\"Black Orchid\"" 1980 — 63 "" "Ste… <NA> us
## 2 " \"Blame It on th… 1980 — 63 "" "Ste… <NA> us
## 3 "\"Black Orchid\"" 1980 — 63 "" "Ste… <NA> us_r_and_b
## 4 " \"Blame It on th… 1980 — 63 "" "Ste… <NA> us_r_and_b
## 5 "\"Black Orchid\"" 1980 — 63 "" "Ste… <NA> us_ac
## 6 " \"Blame It on th… 1980 — 63 "" "Ste… <NA> us_ac
## # ℹ 1 more variable: peak_chart_position <chr>
## 1990s - present ##
stevie_wonder_df4 <- stevie_wonder_df4 |>
row_to_names(row_number = 1)
## Warning: Row 1 does not provide unique names. Consider running clean_names()
## after row_to_names().
# Clean up the names
stevie_wonder_df4_clean <- stevie_wonder_df4 %>%
clean_names()
colnames(stevie_wonder_df4_clean)[1:11] <- c("title", "year", "us", "us_r_and_b", "us_ac", "bel", "can", "ger", "ire", "swi", "uk")
# Make tidy - make the albums chart it's own variable
stevie_wonder_df4_clean <- stevie_wonder_df4_clean |>
pivot_longer(cols = us:uk, names_to = "albums_chart", values_to = "peak_chart_position")
# A side and B side single appear in the same cell in single variable, so we need to break them out into individual observations
stevie_wonder_df4_clean <- separate_rows(stevie_wonder_df4_clean, title, sep = 'b/w')
head(stevie_wonder_df4_clean)
## # A tibble: 6 × 9
## title year uk_40 album na na_2 na_3 albums_chart peak_chart_position
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 "\"Keep … 1990 77 Non-… <NA> <NA> <NA> us —
## 2 " Instru… 1990 77 Non-… <NA> <NA> <NA> us —
## 3 "\"Keep … 1990 77 Non-… <NA> <NA> <NA> us_r_and_b 24
## 4 " Instru… 1990 77 Non-… <NA> <NA> <NA> us_r_and_b 24
## 5 "\"Keep … 1990 77 Non-… <NA> <NA> <NA> us_ac —
## 6 " Instru… 1990 77 Non-… <NA> <NA> <NA> us_ac —
# The Beatles
the_beatles_df <- the_beatles_df |>
row_to_names(row_number = 1)
## Warning: Row 1 does not provide unique names. Consider running clean_names()
## after row_to_names().
# Clean up the names
the_beatles_df_clean <- the_beatles_df %>%
clean_names()
colnames(the_beatles_df_clean)[1:15] <- c("title", "year", "uk", "aus", "aut", "bel", "can", "ger", "nld", "nor", "nz", "swi", "us", "us_cash_box", "us_record_world")
# Make tidy - make the albums chart it's own variable
the_beatles_df_clean <- the_beatles_df_clean |>
pivot_longer(cols = uk:us_record_world, names_to = "albums_chart", values_to = "peak_chart_position")
# A side and B side single appear in the same cell in single variable, so we need to break them out into individual observations
the_beatles_df_clean <- separate_rows(the_beatles_df_clean, title, sep = '""')
head(the_beatles_df_clean)
## # A tibble: 6 × 9
## title year certifications uk_album us_album na na_2 albums_chart
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 "\"My Bonnie" 1962 "" My Bonn… "" <NA> <NA> uk
## 2 "The Saints\"… 1962 "" My Bonn… "" <NA> <NA> uk
## 3 "\"My Bonnie" 1962 "" My Bonn… "" <NA> <NA> aus
## 4 "The Saints\"… 1962 "" My Bonn… "" <NA> <NA> aus
## 5 "\"My Bonnie" 1962 "" My Bonn… "" <NA> <NA> aut
## 6 "The Saints\"… 1962 "" My Bonn… "" <NA> <NA> aut
## # ℹ 1 more variable: peak_chart_position <chr>
# Nirvana
nirvana_df <- nirvana_df |>
row_to_names(row_number = 1)
# Clean up the names
nirvana_df_clean <- nirvana_df %>%
clean_names()
colnames(nirvana_df_clean)[3:12] <- c("us", "aus", "bel", "fin", "fra", "ire", "nz", "prt", "swe", "uk")
# Make tidy - make the albums chart it's own variable
nirvana_df_clean <- nirvana_df_clean |>
pivot_longer(cols = us:uk, names_to = "albums_chart", values_to = "peak_chart_position")
head(nirvana_df_clean)
## # A tibble: 6 × 6
## title year certifications album albums_chart peak_chart_position
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 "\"Love Buzz\"[A]" 1988 "" Blea… us —
## 2 "\"Love Buzz\"[A]" 1988 "" Blea… aus —
## 3 "\"Love Buzz\"[A]" 1988 "" Blea… bel —
## 4 "\"Love Buzz\"[A]" 1988 "" Blea… fin —
## 5 "\"Love Buzz\"[A]" 1988 "" Blea… fra —
## 6 "\"Love Buzz\"[A]" 1988 "" Blea… ire —
# Fleetwood Mac
fleetwood_mac_df <- fleetwood_mac_df |>
row_to_names(row_number = 1)
# Clean up the names
fleetwood_mac_df_clean <- fleetwood_mac_df %>%
clean_names()
colnames(fleetwood_mac_df_clean)[3:13] <- c("uk", "us", "us_rock", "us_ac", "aus", "can", "ger", "irl", "nl", "nz")
## Warning in colnames(fleetwood_mac_df_clean)[3:13] <- c("uk", "us", "us_rock", :
## number of items to replace is not a multiple of replacement length
# Make tidy - make the albums chart it's own variable
fleetwood_mac_df_clean <- fleetwood_mac_df_clean |>
pivot_longer(cols = uk:nz, names_to = "albums_chart", values_to = "peak_chart_position")
head(fleetwood_mac_df_clean)
## # A tibble: 6 × 5
## title year album albums_chart peak_chart_position
## <chr> <chr> <chr> <chr> <chr>
## 1 "\"I Believe My Time Ain't Long\… 1967 Non-… uk "—"
## 2 "\"I Believe My Time Ain't Long\… 1967 Non-… us ""
## 3 "\"I Believe My Time Ain't Long\… 1967 Non-… us_rock ""
## 4 "\"I Believe My Time Ain't Long\… 1967 Non-… us_ac "—"
## 5 "\"I Believe My Time Ain't Long\… 1967 Non-… aus ""
## 6 "\"I Believe My Time Ain't Long\… 1967 Non-… can ""
# Prince
## 70s & 80s ##
prince_df <- prince_df |>
row_to_names(row_number = 1)
# Clean up the names
prince_df_clean <- prince_df %>%
clean_names()
colnames(prince_df_clean)[3:7] <- c("us", "us_r_and_b", "us_dance", "nld_tip", "uk")
# Make tidy - make the albums chart it's own variable
prince_df_clean <- prince_df_clean |>
pivot_longer(cols = us:uk, names_to = "albums_chart", values_to = "peak_chart_position")
head(prince_df_clean)
## # A tibble: 6 × 11
## title year ger_17 nz_18 swe_19 swi_20 uk_21 certifications album albums_chart
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 "\"S… 1978 — — — — — "" For … us
## 2 "\"S… 1978 — — — — — "" For … us_r_and_b
## 3 "\"S… 1978 — — — — — "" For … us_dance
## 4 "\"S… 1978 — — — — — "" For … nld_tip
## 5 "\"S… 1978 — — — — — "" For … uk
## 6 "\"J… 1978 — — — — — "" For … us
## # ℹ 1 more variable: peak_chart_position <chr>
## 1990s ##
prince_df2 <- prince_df2 |>
row_to_names(row_number = 1)
# Clean up the names
prince_df2_clean <- prince_df2 %>%
clean_names()
colnames(prince_df2_clean)[3:12] <- c("us", "us_r_and_b", "aus", "can", "fra", "ger", "nz", "swe", "swi", "uk")
# Make tidy - make the albums chart it's own variable
prince_df2_clean <- prince_df2_clean |>
pivot_longer(cols = us:uk, names_to = "albums_chart", values_to = "peak_chart_position")
head(prince_df2_clean)
## # A tibble: 6 × 6
## title year certifications album albums_chart peak_chart_position
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 "\"The Future\" (… 1990 "" Batm… us —
## 2 "\"The Future\" (… 1990 "" Batm… us_r_and_b —
## 3 "\"The Future\" (… 1990 "" Batm… aus —
## 4 "\"The Future\" (… 1990 "" Batm… can —
## 5 "\"The Future\" (… 1990 "" Batm… fra —
## 6 "\"The Future\" (… 1990 "" Batm… ger 39
## 2000s ##
prince_df3 <- prince_df3 |>
row_to_names(row_number = 1)
## Warning: Row 1 does not provide unique names. Consider running clean_names()
## after row_to_names().
# Clean up the names
prince_df3_clean <- prince_df3 %>%
clean_names()
colnames(prince_df3_clean)[3:11] <- c("us", "us_r_and_b", "aus", "ger", "ire", "nld", "nor", "swi", "uk")
# Make tidy - make the albums chart it's own variable
prince_df3_clean <- prince_df3_clean |>
pivot_longer(cols = us:uk, names_to = "albums_chart", values_to = "peak_chart_position")
head(prince_df3_clean)
## # A tibble: 6 × 7
## title year album na na_2 albums_chart peak_chart_position
## <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 "\"U Make My Sun Shi… 2000 The … <NA> <NA> us —[G]
## 2 "\"U Make My Sun Shi… 2000 The … <NA> <NA> us_r_and_b 108
## 3 "\"U Make My Sun Shi… 2000 The … <NA> <NA> aus —
## 4 "\"U Make My Sun Shi… 2000 The … <NA> <NA> ger —
## 5 "\"U Make My Sun Shi… 2000 The … <NA> <NA> ire —
## 6 "\"U Make My Sun Shi… 2000 The … <NA> <NA> nld —
## 2010s - present ##
prince_df4 <- prince_df4 |>
row_to_names(row_number = 1)
# Clean up the names
prince_df4_clean <- prince_df4 %>%
clean_names()
colnames(prince_df4_clean)[3:7] <- c("us_adult_r_and_b", "us_r_and_b", "us_r_and_b_digital", "us_r_and_b_sales", "uk")
# Make tidy - make the albums chart it's own variable
prince_df4_clean <- prince_df4_clean |>
pivot_longer(cols = us_adult_r_and_b:uk, names_to = "albums_chart", values_to = "peak_chart_position")
# Wikipedia added a "Posthumous" section within the same table, so need to remove that row to make the data frame tidy
prince_df4_clean <- prince_df4_clean |>
filter(title != "Posthumous", title != '"—" denotes a single that was not released or did not chart in the region.')
head(prince_df4_clean)
## # A tibble: 6 × 5
## title year album albums_chart peak_chart_position
## <chr> <chr> <chr> <chr> <chr>
## 1 "\"Extraloveable\" (featuring An… 2011 Hit … us_adult_r_… —
## 2 "\"Extraloveable\" (featuring An… 2011 Hit … us_r_and_b —
## 3 "\"Extraloveable\" (featuring An… 2011 Hit … us_r_and_b_… —
## 4 "\"Extraloveable\" (featuring An… 2011 Hit … us_r_and_b_… —
## 5 "\"Extraloveable\" (featuring An… 2011 Hit … uk —
## 6 "\"Rock and Roll Love Affair\" [… 2012 Hit … us_adult_r_… —[J]
# Bob Dylan
bob_dylan_df <- bob_dylan_df |>
row_to_names(row_number = 1)
# Clean up the names
bob_dylan_df_clean <- bob_dylan_df %>%
clean_names()
colnames(bob_dylan_df_clean)[2:9] <- c("title", "us", "us_main", "us_cash", "aus", "ire", "nl", "uk")
# Make tidy - make the albums chart it's own variable
bob_dylan_df_clean <- bob_dylan_df_clean |>
pivot_longer(cols = us:uk, names_to = "albums_chart", values_to = "peak_chart_position")
# A side and B side single appear in the same cell in single variable, so we need to break them out into individual observations
bob_dylan_df_clean <- separate_rows(bob_dylan_df_clean, title, sep = 'b/w')
head(bob_dylan_df_clean)
## # A tibble: 6 × 6
## year title certifications album albums_chart peak_chart_position
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 1962 "\"Mixed-Up Confu… "" Non-… us —
## 2 1962 " \"Corrina Corri… "" Non-… us —
## 3 1962 "\"Mixed-Up Confu… "" Non-… us_main —
## 4 1962 " \"Corrina Corri… "" Non-… us_main —
## 5 1962 "\"Mixed-Up Confu… "" Non-… us_cash —
## 6 1962 " \"Corrina Corri… "" Non-… us_cash —
# Lauryn Hill
lauryn_hill_df <- lauryn_hill_df |>
row_to_names(row_number = 1)
# Clean up the names
lauryn_hill_df_clean <- lauryn_hill_df %>%
clean_names()
colnames(lauryn_hill_df_clean)[3:12] <- c("us", "us_r_and_b", "us_rhy", "aus", "fra", "nl", "nz", "swe", "swi", "uk")
# Make tidy - make the albums chart it's own variable
lauryn_hill_df_clean <- lauryn_hill_df_clean |>
pivot_longer(cols = us:uk, names_to = "albums_chart", values_to = "peak_chart_position")
head(lauryn_hill_df_clean)
## # A tibble: 6 × 7
## title year certifications album na albums_chart peak_chart_position
## <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 "\"Doo Wop … 1998 "RIAA: Gold[3… The … <NA> us 1
## 2 "\"Doo Wop … 1998 "RIAA: Gold[3… The … <NA> us_r_and_b 2
## 3 "\"Doo Wop … 1998 "RIAA: Gold[3… The … <NA> us_rhy 1
## 4 "\"Doo Wop … 1998 "RIAA: Gold[3… The … <NA> aus 35
## 5 "\"Doo Wop … 1998 "RIAA: Gold[3… The … <NA> fra 23
## 6 "\"Doo Wop … 1998 "RIAA: Gold[3… The … <NA> nl 4
We now have a tidy song dataframe for each artist.
Now let’s combine all the dataframes into one called
artist_dfs
:
artist_dfs <- c(
# Marvin Gaye
"marvin_gaye_df_clean",
"marvin_gaye_df2_clean",
"marvin_gaye_df3_clean",
# The Beach Boys
"the_beach_boys_df_clean",
"the_beach_boys_df2_clean",
"the_beach_boys_df3_clean",
"the_beach_boys_df4_clean",
# Joni Mitchell
"joni_mitchell_df_clean",
# Stevie Wonder
"stevie_wonder_df_clean",
"stevie_wonder_df2_clean",
"stevie_wonder_df3_clean",
"stevie_wonder_df4_clean",
# The Beatles
"the_beatles_df_clean",
# Nirvana
"nirvana_df_clean",
# Fleetwood Mac
"fleetwood_mac_df_clean",
# Prince
"prince_df_clean",
"prince_df2_clean",
"prince_df3_clean",
"prince_df4_clean",
# Bob Dylan
"bob_dylan_df_clean",
# Lauryn Hill
"lauryn_hill_df_clean"
)
Since some of the values from the song title variable are messy, let’s clean the dataframe and write the dataframe to a CSV:
# This function:
# - extracts artist name from object name (everything before "_df")
# - cleans song titles
# - adds artist column
# - saves cleaned df to global env with name like "artist_cleaned_songs"
clean_and_save_songs_df <- function(df_name_string) {
# get df by name
df <- get(df_name_string, envir = .GlobalEnv)
# get artist name from df name (before "_df")
artist_id <- str_extract(df_name_string, "^(.*?)(?=_df)")
artist_title <- str_to_title(str_replace_all(artist_id, "_", " "))
cleaned_df <- df |>
mutate(
# clean title — remove quotes, () [] {}, cut after cd, us, uk etc.
title = title |>
str_trim() |>
str_replace_all('"', "") |>
str_remove_all("\\(.*?\\)") |>
str_remove_all("\\[.*?\\]") |>
str_remove_all("\\{.*?\\}") |>
str_remove_all("(?i)[ ]?(cd|us|uk|canadian)[\\s\\-:]+.*$") |>
str_squish(),
# add artist column
artist = artist_title
) |>
# drop junk rows like "denotes", "not released", etc.
filter(
!str_detect(title, "(?i)denotes|indicates|refers to|represents|not released")
) |>
# keep only one row per title
distinct(title, .keep_all = TRUE)
# save cleaned df to global env
cleaned_df_name <- paste0(artist_id, "_cleaned_songs")
assign(cleaned_df_name, cleaned_df, envir = .GlobalEnv)
# if album/year column missing, add NA
if (!"album" %in% names(cleaned_df)) cleaned_df$album <- NA
if (!"year" %in% names(cleaned_df)) cleaned_df$year <- NA
# keep only needed columns
minimal_df <- cleaned_df |>
select(artist, title, album, year)
# if all_artists_songs doesn’t exist, create it
if (!exists("all_artists_songs", envir = .GlobalEnv)) {
assign("all_artists_songs", minimal_df, envir = .GlobalEnv)
} else {
# if exists, add to it
existing <- get("all_artists_songs", envir = .GlobalEnv)
updated <- bind_rows(existing, minimal_df)
assign("all_artists_songs", updated, envir = .GlobalEnv)
}
return(cleaned_df)
}
artists_songs_album <- invisible(lapply(artist_dfs, clean_and_save_songs_df))
write.csv(all_artists_songs, "all_artists_songs.csv")
all_artists_songs |>
group_by(artist, title) |>
filter(n() > 1) |>
arrange(artist, title)
## # A tibble: 32 × 4
## # Groups: artist, title [15]
## artist title album year
## <chr> <chr> <chr> <chr>
## 1 Marvin Gaye Ain't No Mountain High Enough United 1967
## 2 Marvin Gaye Ain't No Mountain High Enough Non-album singles 2013
## 3 Marvin Gaye Let's Get It On Let's Get It On 1973
## 4 Marvin Gaye Let's Get It On Non-album singles 2005
## 5 Marvin Gaye Sexual Healing Midnight Love 1982
## 6 Marvin Gaye Sexual Healing Non-album singles 2015
## 7 Prince 1999 1999 1982
## 8 Prince 1999 1999 1998
## 9 Prince 7 Love Symbol 1992
## 10 Prince 7 Non-album singles 2023
## # ℹ 22 more rows
all_artists_songs <- all_artists_songs |>
distinct(artist, title, .keep_all = TRUE)
all_artists_songs |>
group_by(artist, title) |>
filter(n() > 1) |>
arrange(artist, title)
## # A tibble: 0 × 4
## # Groups: artist, title [0]
## # ℹ 4 variables: artist <chr>, title <chr>, album <chr>, year <chr>
all_artists_songs <- all_artists_songs |>
distinct(artist, title, .keep_all = TRUE)
head(all_artists_songs)
## # A tibble: 6 × 4
## artist title album year
## <chr> <chr> <chr> <chr>
## 1 Marvin Gaye Let Your Conscience Be Your Guide The Soulful Moodsof Marvi… 1961
## 2 Marvin Gaye Sandman Greatest Hits 1962
## 3 Marvin Gaye Soldier's Plea That StubbornKinda Fellow 1962
## 4 Marvin Gaye Stubborn Kind of Fellow That StubbornKinda Fellow 1962
## 5 Marvin Gaye Hitch Hike That StubbornKinda Fellow 1962
## 6 Marvin Gaye Pride and Joy That StubbornKinda Fellow 1963
We now have a clean, tidy dataframe all_artists_songs
which contains artist, song title, album and year data. Next we’ll add
on the emotion for each song.
This section is for MySQL (we wanted to include MySQL for this project). We decided to upload a messy dataframe, grab it from MySQL, then tidy it back up again:
Connecting to MySQL:
# get values securely from env
user <- Sys.getenv("MYSQL_USER")
password <- Sys.getenv("MYSQL_PASSWORD")
dbname <- Sys.getenv("MYSQL_DB")
host <- Sys.getenv("MYSQL_HOST")
port <- as.integer(Sys.getenv("MYSQL_PORT"))
print(user)
# connect to MySQL
con <- dbConnect(
RMariaDB::MariaDB(),
user = user,
user = user,
password = password,
dbname = dbname,
host = host,
port = 3306,
)
Uploading messy table to MySQL:
messy_all_artists <- all_artists_songs |>
mutate(
song_1 = title,
song_2 = ifelse(row_number() %% 2 == 0, paste(title, "Remix"), NA), # some second titles
notes = case_when(
grepl("Remix", title) ~ "CD version",
grepl("Love", title, ignore.case = TRUE) ~ "— denotes b-side",
TRUE ~ ""
),
album_name = album,
release_year = year,
id = row_number()
) |>
select(id, artist_name = artist, song_1, song_2, album_name, release_year, notes)
# upload messy table to MySQL
dbWriteTable(con, "messy_songs", messy_all_artists, overwrite = TRUE, row.names = FALSE)
Grabbing the dataframe and tidying it again:
messy_from_mysql <- dbReadTable(con, "messy_songs")
tidy_from_mysql <- messy_from_mysql |>
filter(!is.na(artist_name)) |>
filter(!str_detect(notes, "(?i)denotes|cd|remix")) |>
pivot_longer(
cols = starts_with("song_"),
names_to = "song_slot",
values_to = "title"
) |>
# Remove missing or empty titles
filter(!is.na(title), str_trim(title) != "") |>
# Remove titles with remix/variation keywords
filter(!str_detect(title, "(?i)remix|version|edit|live|instrumental|mix|b[- ]?side")) |>
distinct(artist_name, title, .keep_all = TRUE) |>
select(artist = artist_name, title, album = album_name, year = release_year)
write.csv(tidy_from_mysql, "tidy_from_mysql.csv")
dbDisconnect(con)
Next to gather song lyrics, we’ll use the Genuis API. Unfortunately, the Genius API can’t provide the lyrics directly, but it can provide the URL that contains the lyrics to a song. We’ll then have to web scrape again to create a lyrics dataframe.
# Get the API Key
genius_info <- read_csv("https://raw.githubusercontent.com/farhodibr/DATA607_FINAL_PROJECT/refs/heads/main/genius_API.csv")
genius_token <- genius_info$api_token
#For example we can use this song
song_name <- "Yesterday"
artist_band <- "The Beatles"
# This function searches Genius for a song URL
search_song_final <- function(song_title, artist_name) {
# Setting up to call the Genius API's search endpoint
base_url <- "https://api.genius.com/search"
query <- list(q = paste(song_title, artist_name))
# Making the request to the API
res <- GET(
url = base_url,
add_headers(Authorization = paste("Bearer", genius_token)), # Authorization header includes the token
query = query
)
# Check if the request worked (Status 200 means OK)
if (status_code(res) != 200) {
warning("API request failed. Status code: ", status_code(res))
return(NA) # Return NA on API call failure
}
# Now parse the JSON results. Using flatten=TRUE based on previous findings.
content_data <- tryCatch(
fromJSON(content(res, "text", encoding = "UTF-8"), flatten = TRUE),
error = function(e) {
warning("Failed to parse JSON response: ", e$message)
return(NULL) # Return NULL on JSON parsing error
}
)
# Ensure a valid response structure was received
if (is.null(content_data) || is.null(content_data$response)) return(NA)
# Search results are typically inside response$hits
hits <- content_data$response$hits
# Process the 'hits' data frame (assuming it's a data frame)
# Handle cases with no results or if 'hits' isn't a data frame
if (is.null(hits) || !inherits(hits, "data.frame") || nrow(hits) == 0) {
return(NA)
}
# Define expected column names based on flatten=TRUE producing specific columns
artist_col <- "result.primary_artist.name"
url_col <- "result.url"
# Check if the required columns exist in the data frame
if (!all(c(artist_col, url_col) %in% names(hits))) {
warning("Required columns ('", artist_col, "', '", url_col, "') not found in flattened API response.")
return(NA) # Cannot proceed without these columns
}
# Loop through each row (each potential song hit)
for (i in 1:nrow(hits)) {
# Get the artist name for the current row from the flattened column
artist_raw <- hits[[artist_col]][[i]]
artist <- tolower(as.character(artist_raw))
# Check if the artist name is valid and matches the search query
if (!is.na(artist) && nzchar(artist) && str_detect(artist, fixed(tolower(artist_name)))) {
# If the artist matched, get the URL from its column
song_url_found <- hits[[url_col]][[i]]
if (!is.na(song_url_found) && nzchar(song_url_found)) {
# Match found! Return the URL and exit the function.
return(song_url_found)
}
}
}
# If the loop finishes without returning, no suitable match was found
return(NA)
}
# --- Example: Find the URL ---
song_url <- search_song_final(song_name, artist_band)
# This should print the found URL...
print(song_url)
## [1] "https://genius.com/The-beatles-yesterday-lyrics"
Now we need to create functions that will get all the words for a
song, get_words_df
, and get the song emotions,
get_song_sentiments
:
# Returns a dataframe where each observation is a word and the corresponding in a song
get_words_df <- function(artist_name, song_title) {
song_url <- search_song_final(song_title, artist_name)
if (is.na(song_url)) {
return(NULL)
}
# Using the song_url found previously to scrape the lyrics
# Only run this if a valid song_url was found
if (!is.na(song_url) && nzchar(song_url)) {
# Read the HTML content from the URL
page_html <- tryCatch(
read_html(song_url),
error = function(e) {
warning("Failed to read URL: ", song_url, "\nError: ", e$message)
return(NULL) # Return NULL on page read failure
}
)
# Only proceed if the webpage HTML was read successfully
if (!is.null(page_html)) {
# Find the part of the HTML containing the lyrics
lyrics_selector <- "div[data-lyrics-container='true']" # Verify this selector!
# Use rvest to find the HTML element(s) matching the selector
lyrics_nodes <- page_html |>
html_elements(css = lyrics_selector)
# Check if any nodes were found using the selector
if (length(lyrics_nodes) == 0) {
# If not found, the selector was likely wrong or the page structure changed.
warning("Could not find lyrics container using selector: '", lyrics_selector,
"'. The website structure may have changed or the selector is wrong. ",
"Please inspect the page HTML.")
return(NULL) # Cannot proceed.
} else {
# Extract the text content from the found node(s)
# html_text2 often handles line breaks better
lyrics_text_raw <- lyrics_nodes |>
html_text2()
# Clean up the extracted raw text
temp_cleaned_vector <- lyrics_text_raw %>%
str_split("\n") %>% # Split into lines based on the newline character (\n)
unlist() %>% # The result of str_split is a list, so unlist it
str_remove_all("\\[[^\\]]+\\]") %>% # Remove structural markers like [Verse 1], [Chorus] etc.
str_trim() # Remove leading/trailing whitespace from lines
# Now, apply the filtering using standard base R subsetting (outside the pipe)
lyrics_text_cleaned <- temp_cleaned_vector[nzchar(temp_cleaned_vector)]
# Structure the cleaned lines into a data frame
scraped_lyrics_df <- tibble(
line = 1:length(lyrics_text_cleaned),
text = lyrics_text_cleaned
)
}
} else {
# Scraping skipped because reading the URL failed earlier
warning("Skipping scraping due to error reading URL.")
return(NULL)
}
} else {
warning("Cannot scrape lyrics because song_url was not found successfully.")
return(NULL) # Ensure variable exists even if scraping is skipped
}
lyrics_df <- scraped_lyrics_df
words <- lyrics_df %>%
unnest_tokens(word, text)
return(words)
}
# Returns top n song emotions for a particular song
get_song_sentiments <- function(words, artist_name, song_title, n, year = NULL) {
null_top_emotion_df <- tibble(artist = artist_name, song = song_title, top_emotion = NA)
if (is.null(words)) {
return(null_top_emotion_df)
}
top_emotions_count <- words %>%
inner_join(nrc, by = "word", relationship = "many-to-many") %>%
filter(!sentiment %in% c("positive", "negative")) %>%
count(sentiment, sort = TRUE) %>%
slice_head(n = n)
top_emotions <- top_emotions_count |>
pull(sentiment)
top_emotions_count$artist <- artist_name
top_emotions_count$song <- song_title
colnames(top_emotions_count)[1] <- "top_emotion"
if (length(top_emotions) == 0) return(null_top_emotion_df)
return(top_emotions_count)
}
# Testing
words_test <- get_words_df(artist_band, song_name)
head(words_test, 15)
## # A tibble: 15 × 2
## line word
## <int> <chr>
## 1 1 113
## 2 1 contributors
## 3 2 translations
## 4 3 türkçe
## 5 4 español
## 6 5 português
## 7 6 deutsch
## 8 7 česky
## 9 8 yesterday
## 10 8 lyrics
## 11 9 yesterday
## 12 9 is
## 13 9 the
## 14 9 most
## 15 9 covered
scraped_lyrics_test <- get_song_sentiments(words_test, artist_band, song_name, 1)
head(scraped_lyrics_test, 15)
## # A tibble: 1 × 4
## top_emotion n artist song
## <chr> <int> <chr> <chr>
## 1 fear 3 The Beatles Yesterday
Now let’s apply those functions to our tidy
artist_song_list
dataframe to create a top emotion
variable:
# Grab 5 songs from each of the top 10 artists
get_top5_valid_sentiments <- function(artist_name, song_titles, max_songs = 5) {
results <- list()
for (song in song_titles) {
lyrics_words <- get_words_df(artist_name, song)
res <- get_song_sentiments(lyrics_words, artist_name, song, 1)
# keep only song + top_emotion
if (!is.na(res$top_emotion)) {
results <- append(results, list(
tibble(song = res$song,
top_emotion = res$top_emotion)
))
}
if (length(results) >= max_songs) break
}
if (length(results) > 0) {
return(bind_rows(results))
} else {
return(tibble(song = NA, top_emotion = NA))
}
}
get_top5_valid_sentiments_year <- function(artist_name, title) {
words <- get_words_df(artist_name, title)
sent <- get_song_sentiments(words, artist_name, title, 1)
return(sent)
}
artist_song_list <- all_artists_songs %>%
filter(!is.na(artist), artist != "artist") %>%
group_by(artist) %>%
summarise(songs = list(title), .groups = "drop")
song_list_by_year <- all_artists_songs %>%
filter(!is.na(artist), artist != "artist", year != "Source:[42]") %>%
group_by(artist, year)
# Now apply get_top5_valid_sentiments to each row
# Use foreach to speed up the process
valid_sentiments <- foreach(i = 1:nrow(artist_song_list),
.combine = bind_rows,
.packages = c(
"dplyr", "purrr", "tibble", "httr", "jsonlite",
"stringr", "rvest", "tidytext", "textdata", "readr", "xml2"
)) %dopar% {
artist <- artist_song_list$artist[i]
songs <- artist_song_list$songs[[i]]
result <- get_top5_valid_sentiments(artist, songs)
result$artist <- artist
result
}
# Final tidy dataframe
head(valid_sentiments)
## # A tibble: 6 × 3
## song top_emotion artist
## <chr> <chr> <chr>
## 1 Mixed-Up Confusion anger Bob Dylan
## 2 Corrina Corrina anticipation Bob Dylan
## 3 Blowin' in the Wind joy Bob Dylan
## 4 The Times They Are a-Changin' anger Bob Dylan
## 5 Honey, Just Allow Me One More Chance surprise Bob Dylan
## 6 I Believe My Time Ain't Long anticipation Fleetwood Mac
We now have a tidy data frame, valid_sentiments
where
each observation represents a song, artist, and the top emotion for that
song.
This allows for the data to be easily summarized and grouped without reshaping. We are also able to make very straightforward joins with lexicons like ‘nrc’ and ‘bing’ or other lookup tables. Tidying up this table future-proofed it with the scalability factor since we can append new columns or stack new tables if needed.
Now we’re ready to perform our analysis. Let’s look at word frequency for the top songs and analyze the emotions. Note, when looking at decades, we only look at a subset of data, and pick only one year per decade since the code runs so slowly.
## # A tibble: 5 × 3
## year top_emotion n
## <chr> <chr> <int>
## 1 1970 joy 12
## 2 1980 joy 15
## 3 1990 joy 6
## 4 2000 anticipation 2
## 5 2010 joy 2
## # A tibble: 10 × 2
## word n
## <chr> <int>
## 1 love 215
## 2 baby 113
## 3 time 69
## 4 home 65
## 5 summer 54
## 6 music 48
## 7 mind 46
## 8 world 46
## 9 grandma 43
## 10 girl 38
Let’s graph these charts and analyze:
Word frequency:
# Word Frequency
word_freq_decade_years <- word_freq_decade_years |>
left_join(bing, by = "word", relationship = "many-to-many")
word_freq_decade_years |>
ggplot(aes(x = reorder(word, n), y = n, fill = sentiment)) +
geom_col() + theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1)) +
labs(title = "Top 10 Words 1960 - 2010",
x = "Word",
y = "Count")
We can see here that love is the most frequent word, and is the only word that has a positive or negative emotion attached to it. For this we tried the “bing” lexicon which categorizes into either positive or negative sentiment (binary), and it could not determine a sentiment for most of these words.
Possibly the most interesting takeaway from this graph is that grandma is one of the top words.
Sentiments for each decade:
# Top Emotion in Songs Per Year
valid_sentiments_by_decade |>
ggplot(aes(x = year, y = n, fill = top_emotion)) +
geom_col() + theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1)) +
labs(title = "Top Emotion in Songs Per Year (One Year Per Decade)",
x = "Year", y = "Count")
Over time, the emotion represented for each decade is joy excluding 2000. The year 2000 had more anticipation songs than joyful.
Top Emotions in Songs by Top 10 Artists:
# Top Emotions in Songs by Top 10 Artists
ggplot(valid_sentiments, aes(x = fct_infreq(top_emotion), fill = top_emotion)) +
geom_bar() +
coord_flip() +
labs(title = "Top Emotions in Songs by Top 10 Artists",
x = "Emotion", y = "Count") +
theme_minimal()
This bar chart summarizes the most frequent dominant emotion detected in songs by the top 10 Rolling Stone artists.
Joy is by far the most common top emotion, appearing in nearly 25 songs, suggesting that uplifting and positive themes are prominent across these artists’ lyrics.
Anticipation is the second most frequent, often associated with themes of hope, excitement, or longing.
Emotions like trust, anger, and fear are present but much less frequent.
Sadness and surprise appear rarely as the primary emotion, indicating they are less likely to be the dominant tone in top songs.
Overall, this distribution highlights the emotional leaning of popular music from legendary artists toward positive or forward-looking sentiments.
Top Emotions by Artist:
valid_sentiments %>%
group_by(artist, top_emotion) %>%
summarise(n = n(), .groups = "drop") %>%
arrange(artist, desc(n))
## # A tibble: 27 × 3
## artist top_emotion n
## <chr> <chr> <int>
## 1 Bob Dylan anger 2
## 2 Bob Dylan anticipation 1
## 3 Bob Dylan joy 1
## 4 Bob Dylan surprise 1
## 5 Fleetwood Mac joy 3
## 6 Fleetwood Mac anticipation 2
## 7 Joni Mitchell anticipation 2
## 8 Joni Mitchell joy 2
## 9 Joni Mitchell anger 1
## 10 Lauryn Hill joy 2
## # ℹ 17 more rows
ggplot(valid_sentiments, aes(x = fct_infreq(top_emotion), fill = artist)) +
geom_bar(position = "dodge") +
labs(title = "Top Emotions by Artist", x = "Emotion", y = "Number of Songs") +
theme_minimal()
This bar chart shows the most common emotional tone for songs by each of the top 10 Rolling Stone artists based on NRC sentiment analysis.
Joy and anticipation are the most dominant emotions across artists, particularly for Marvin Gaye, The Beatles, and Fleetwood Mac.
Fear and sadness appear most often in Lauryn Hill’s songs, indicating a more emotionally intense or somber tone.
Bob Dylan stands out with a more diverse emotional profile that includes anger and surprise.
The chart highlights how different artists tend to gravitate toward different emotional themes in their songwriting.
Emotional Variety by Artist:
#Artists with the most emotionally varied songs
valid_sentiments %>%
group_by(artist) %>%
summarise(num_unique_emotions = n_distinct(top_emotion)) %>%
arrange(desc(num_unique_emotions))
## # A tibble: 10 × 2
## artist num_unique_emotions
## <chr> <int>
## 1 Bob Dylan 4
## 2 Lauryn Hill 4
## 3 The Beach Boys 4
## 4 Joni Mitchell 3
## 5 Fleetwood Mac 2
## 6 Marvin Gaye 2
## 7 Nirvana 2
## 8 Prince 2
## 9 Stevie Wonder 2
## 10 The Beatles 2
This chart shows us that Bob Dylan and The Beach Boys show the greatest diversity of emotions. Most artists span 2–3 emotions in their top 5 songs.
Our analysis showed us that:
Some challenges we faced were:
Potential Future Analysis:
Focusing on just the Rolling Stone’s top artists did not provide us
artists and songs from recent decades, so to expand our data, we used
the billboard
library which provides us the Billboard Top
100 songs for each year from 1960 - 2016.
Let’s load billboard
’s wiki_hot_100s
dataframe which contains the Billboard Top 100 songs:
library(billboard)
library(geniusr)
#Genius API
Sys.setenv(GENIUS_API_TOKEN = genius_token)
#Creating Billboard dataframe (Includes top100 songs for each year in range of 1960 - 2016)
if (require("dplyr")) {
data(wiki_hot_100s)
wiki_hot_100s |>
glimpse()
}
## Rows: 5,701
## Columns: 4
## $ no <chr> "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", …
## $ title <chr> "Theme from A Summer Place", "He'll Have to Go", "Cathy's Clown…
## $ artist <chr> "Percy Faith", "Jim Reeves", "The Everly Brothers", "Johnny Pre…
## $ year <chr> "1960", "1960", "1960", "1960", "1960", "1960", "1960", "1960",…
# This is a random sample for random 20 songs from each year
# Only look at 20 due to slow process time to obtain song lyrics
set.seed(146)
sample_billboard_1960_2016 <- wiki_hot_100s |>
mutate(year = as.numeric(year))|>
group_by(year) |>
sample_n(size = 20, replace = FALSE) |> #size can be decreased to speed up the process
ungroup()
# Grab the top 20 songs from each year
# Only look at 20 due to slow process time to obtain song lyrics
top_20_billboard_1960_2016 <- wiki_hot_100s |>
mutate(year = as.numeric(year))|>
group_by(year) |>
slice_head(n = 20) |> #size can be decreased to speed up the process
ungroup()
#this removes duplicate rows
song_jobs <- top_20_billboard_1960_2016 |>
distinct(artist, title, year, .keep_all = TRUE)
sample_song_jobs <- sample_billboard_1960_2016 |>
distinct(artist, title, year, .keep_all = TRUE)
head(song_jobs)
## # A tibble: 6 × 4
## no title artist year
## <chr> <chr> <chr> <dbl>
## 1 1 Theme from A Summer Place Percy Faith 1960
## 2 2 He'll Have to Go Jim Reeves 1960
## 3 3 Cathy's Clown The Everly Brothers 1960
## 4 4 Running Bear Johnny Preston 1960
## 5 5 Teen Angel Mark Dinning 1960
## 6 6 I'm Sorry Brenda Lee 1960
head(sample_song_jobs)
## # A tibble: 6 × 4
## no title artist year
## <chr> <chr> <chr> <dbl>
## 1 24 Why Frankie Avalon 1960
## 2 94 My Little Corner of the World Anita Bryant 1960
## 3 100 Sandy Larry Hall 1960
## 4 95 Doggin' Around Jackie Wilson 1960
## 5 84 Happy-Go-Lucky Me Paul Evans 1960
## 6 88 O Dio Mio Annette Funicello 1960
These dataframes, one for the top 20, and the other for a group of random 20 songs, are tidy since each observation represents a single song on the Billboard Top 100. Each cell contains a single value as well.
We tapped into the Genius API using the geniusr
package, so that OAuth tokens, request construction, and JSON parsing
all happen behind the scenes.
geniusr
returns perfectly tidy data frames from
nested JSON and even handles token refreshes and rate‑limit retries
automatically.
That let us pull song metadata and lyric URLs from Genius with a single function call—no low‑level HTTP or manual JSON wrangling required.
Now let’s add on the top emotion for each song:
#doParallel
num_cores <- parallel::detectCores() - 1 # leave 1 core for safety
cl <- makeCluster(num_cores)
registerDoParallel(cl)
#helper function to get sentiments
get_song_emotions <- function(artist, title, n = 1) {
words <- get_words_df(artist, title)
sentiments <- get_song_sentiments(words, artist, title, n)
return(sentiments)
}
#here we are getting sentiments for each song, and storing it in new df.
#<<<______this part can take from 8 to 15 mins to complete.______>>>>>>>
songs_with_emotions <- foreach(i = 1:nrow(song_jobs),
.combine = bind_rows,
.packages = c("dplyr", "tibble", "purrr", "stringr",
"tidytext", "rvest", "httr", "jsonlite")) %dopar% {
tryCatch({
artist <- song_jobs$artist[i]
title <- song_jobs$title[i]
year <- song_jobs$year[i]
sent_df <- get_song_emotions(artist, title, n = 1)
sent_df$year <- year
sent_df
}, error = function(e) {
message("Failed for ", song_jobs$title[i], " by ", song_jobs$artist[i])
tibble(artist = song_jobs$artist[i], song = song_jobs$title[i], top_emotion = NA, year = song_jobs$year[i])
})
}
sample_songs_with_emotions <- foreach(i = 1:nrow(sample_song_jobs),
.combine = bind_rows,
.packages = c("dplyr", "tibble", "purrr", "stringr",
"tidytext", "rvest", "httr", "jsonlite")) %dopar% {
tryCatch({
artist <- sample_song_jobs$artist[i]
title <- sample_song_jobs$title[i]
year <- sample_song_jobs$year[i]
sent_df <- get_song_emotions(artist, title, n = 1)
sent_df$year <- year
sent_df
}, error = function(e) {
message("Failed for ", sample_song_jobs$title[i], " by ", sample_song_jobs$artist[i])
tibble(artist = sample_song_jobs$artist[i], song = sample_song_jobs$title[i], top_emotion = NA, year = sample_song_jobs$year[i])
})
}
stopCluster(cl)
registerDoSEQ()
# Final tidy dataframe
songs_with_emotions |>
filter(is.na(top_emotion))
## # A tibble: 248 × 5
## top_emotion n artist song year
## <chr> <int> <chr> <chr> <dbl>
## 1 <NA> NA The String-A-Longs Wheels 1961
## 2 <NA> NA Lawrence Welk Calcutta 1961
## 3 <NA> NA The Mar-Keys Last Night 1961
## 4 <NA> NA Ferrante & Teicher Exodus 1961
## 5 <NA> NA Acker Bilk Stranger on the Shore 1962
## 6 <NA> NA David Rose The Stripper 1962
## 7 <NA> NA Chubby Checker & Dee Dee Sharp Slow Twistin' 1962
## 8 <NA> NA Little Stevie Wonder Fingertips 1963
## 9 <NA> NA The Village Stompers Washington Square 1963
## 10 <NA> NA Kyu Sakamoto Sukiyaki 1963
## # ℹ 238 more rows
sample_songs_with_emotions |>
filter(is.na(top_emotion))
## # A tibble: 230 × 5
## top_emotion n artist song year
## <chr> <int> <chr> <chr> <dbl>
## 1 <NA> NA The Ventures Walk, Don't Run 1960
## 2 <NA> NA Skip and Flip Cherry Pie 1960
## 3 <NA> NA The Dave Brubeck Quartet Take Five 1961
## 4 <NA> NA The Mar-Keys Last Night 1961
## 5 <NA> NA The String-A-Longs Wheels 1961
## 6 <NA> NA Jimmy Smith Walk on the Wild Side 1962
## 7 <NA> NA Bent Fabric Alley Cat 1962
## 8 <NA> NA Nat King Cole Ramblin' Rose 1962
## 9 <NA> NA Booker T. & the M.G.'s Green Onions 1962
## 10 <NA> NA Acker Bilk Stranger on the Shore 1962
## # ℹ 220 more rows
head(songs_with_emotions)
## # A tibble: 6 × 5
## top_emotion n artist song year
## <chr> <int> <chr> <chr> <dbl>
## 1 joy 9 Percy Faith Theme from A Summer Place 1960
## 2 joy 8 Jim Reeves He'll Have to Go 1960
## 3 joy 10 The Everly Brothers Cathy's Clown 1960
## 4 joy 21 Johnny Preston Running Bear 1960
## 5 joy 20 Mark Dinning Teen Angel 1960
## 6 joy 3 Brenda Lee I'm Sorry 1960
head(sample_songs_with_emotions)
## # A tibble: 6 × 5
## top_emotion n artist song year
## <chr> <int> <chr> <chr> <dbl>
## 1 joy 31 Frankie Avalon Why 1960
## 2 joy 4 Anita Bryant My Little Corner of the World 1960
## 3 joy 7 Larry Hall Sandy 1960
## 4 joy 3 Jackie Wilson Doggin' Around 1960
## 5 joy 33 Paul Evans Happy-Go-Lucky Me 1960
## 6 joy 10 Annette Funicello O Dio Mio 1960
We now have tidy data frames where each observation represent a song, and it’s top emotion.
Now we can start our analysis. Let’s see how the emotions of popular songs have changed through the years by decade:
songs_with_emotions_grouped_by_decade <- songs_with_emotions |>
filter(!is.na(top_emotion)) |>
mutate(year = as.numeric(year)) |>
group_by(decade = floor(year / 10) * 10, top_emotion)|>
summarise(n = n(), .groups = "drop")
#charts for each decade
songs_with_emotions_grouped_by_decade |>
ggplot(aes(x = factor(decade), y = n, fill = top_emotion)) +
geom_col(position = "dodge") +
labs(title = "Emotion Trends in Top 20 Billboard Song Lyrics by Decade", x = "Decade", y = "Count")
# Sampled Data
sample_songs_with_emotions_grouped_by_decade <- sample_songs_with_emotions |>
filter(!is.na(top_emotion)) |>
mutate(year = as.numeric(year)) |>
group_by(decade = floor(year / 10) * 10, top_emotion)|>
summarise(n = n(), .groups = "drop")
#charts for each decade
sample_songs_with_emotions_grouped_by_decade |>
ggplot(aes(x = factor(decade), y = n, fill = top_emotion)) +
geom_col(position = "dodge") +
labs(title = "Emotion Trends in Sampled Billboard Song Lyrics by Decade", x = "Decade", y = "Count")
Top 20 Songs:
Looking at the count bar graph, Joy is still by far the most common top emotion, appearing in nearly 25 songs. Yet, the count gets much small after the 90s. This makes sense because the data we have access to only goes up to 2016, so the 2010 would have less data. For the 2000s decade, it’s possible for some reason we could not find the lyrics to the songs, or a top emotion could not be specified.
Sampled:
Joy is still the top emotion for random sampled data. It looks like sampled data has the same issue.
Let’s check where the missing emotions are:
songs_with_emotions |>
filter(is.na(top_emotion)) |>
group_by(decade = floor(year / 10) * 10) |>
summarise(total = n())
## # A tibble: 6 × 2
## decade total
## <dbl> <int>
## 1 1960 31
## 2 1970 19
## 3 1980 28
## 4 1990 34
## 5 2000 80
## 6 2010 56
sample_songs_with_emotions |>
filter(is.na(top_emotion)) |>
group_by(decade = floor(year / 10) * 10) |>
summarise(total = n())
## # A tibble: 6 × 2
## decade total
## <dbl> <int>
## 1 1960 33
## 2 1970 31
## 3 1980 17
## 4 1990 31
## 5 2000 66
## 6 2010 52
As you can see, the decade with the most missing emotions is the 2000s decade. If we were doing further analysis with more time, we could pinpoint exactly why this data is missing. The sampled data has a bit less missing values in the 2000s.
Let’s make another percentage graph to look at the proportions:
# Emotion Trends in Billboard Lyrics by Decade
# Focus on percentage
songs_with_emotions_grouped_by_decade_pct <- songs_with_emotions_grouped_by_decade |>
group_by(decade) |>
mutate(percent = n / sum(n)) |>
ungroup()
songs_with_emotions_grouped_by_decade_pct |>
ggplot(aes(x = factor(decade), y = percent, fill = top_emotion)) +
geom_col(position = "dodge") +
labs(title = "Emotion Trends in Top 20 Billboard Song Lyrics by Decade", x = "Decade", y = "Percentage")
# Sampled Data
sample_songs_with_emotions_grouped_by_decade_pct <- sample_songs_with_emotions_grouped_by_decade |>
group_by(decade) |>
mutate(percent = n / sum(n)) |>
ungroup()
sample_songs_with_emotions_grouped_by_decade_pct |>
ggplot(aes(x = factor(decade), y = percent, fill = top_emotion)) +
geom_col(position = "dodge") +
labs(title = "Proportions Of Emotion Trends in Sampled Billboard Song Lyrics by Decade", x = "Decade", y = "Percentage")
Top 20 Songs:
This graph looks a lot more stabilized. This graph still shows Joy as the top emotion for each decade, yet this percentage decreases in the 2000s. We can also see that Anger percentage increases in the 2000s, then drops in the 2010s. This makes one wonder if world events occurring the 2000s changed overall population sentiment, which is shown in the art/music that was released during that time. Joy showed the highest percentage in 90s. Additionally, Trust decreases in the 90s and 2000s, then increases again in the 2010 decade.
Sampled:
The sampled data shows pretty similar results. Anticpation decreases over the decades, and fear has a lower percentage in the 2010s. Trust remains a bit more constant over the decades compared to the top 20 songs.
Now let’s look at overall Positive vs Negative emotions, and how they’ve changed over time:
# Positive vs. Negative Emotional Trends Over Time
songs_with_emotions_polarity <- songs_with_emotions |>
filter(!is.na(top_emotion)) |>
mutate(polarity = case_when(
top_emotion %in% c("joy", "trust", "anticipation") ~ "Positive",
top_emotion %in% c("sadness", "anger", "fear", "disgust") ~ "Negative",
TRUE ~ "Neutral"
))
songs_with_emotions_polarity |>
count(year, polarity) |>
group_by(year) |>
mutate(prop = n / sum(n)) |>
ggplot(aes(x = as.numeric(year), y = prop, color = polarity)) +
geom_line(size = 1.2) +
labs(title = "Positive vs. Negative Emotional Trends Over Time (Top 20 Songs per Year)",
x = "Year", y = "Proportion of Songs") +
theme_minimal(base_size = 14)
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
# Sampled Data
sample_songs_with_emotions_polarity <- sample_songs_with_emotions |>
filter(!is.na(top_emotion)) |>
mutate(polarity = case_when(
top_emotion %in% c("joy", "trust", "anticipation") ~ "Positive",
top_emotion %in% c("sadness", "anger", "fear", "disgust") ~ "Negative",
TRUE ~ "Neutral"
))
sample_songs_with_emotions_polarity |>
count(year, polarity) |>
group_by(year) |>
mutate(prop = n / sum(n)) |>
ggplot(aes(x = as.numeric(year), y = prop, color = polarity)) +
geom_line(size = 1.2) +
labs(title = "Positive vs. Negative Emotional Trends Over Time (Sampled Data)",
x = "Year", y = "Proportion of Songs") +
theme_minimal(base_size = 14)
Top 20 songs:
Overall there’s no major trend shown in this graph, but we can see in the 2000s positive emotions decreased, while negative emotions increased surpassing positive. This corresponds with our bar graph data. Also, neutral songs do not appear in the 2010s.
Sampled:
Overall the sampled data looks very similar to the top 20 songs.
This analysis showed us that: