DATA607 Final Project

Overview

For this project, we want to analyze the emotions of the top Rolling Stone’s artists via sentiment analysis of song lyrics.

Data sources:

Top artists data: * Rolling Stone Album Rankings CSV: https://github.com/rfordatascience/tidytuesday/tree/main/data/2024/2024-05-07

Songs data: * Wikipedia Singles Discography (List of singles, with selected chart positions and certifications chart). Example: https://en.wikipedia.org/wiki/The_Beatles_singles_discography

Song Lyrics data: * Genius API and genius.com: https://docs.genius.com/

Motivation

We chose these data sets so we can:

Identify the emotions of songs by top artists via the lyrics
Explore how emotions of popular songs have changed over time
Explore how emotions of popular songs vary by genre

Additionally, these data sets would give us data preparation experience such as:

Tidying and transformation
API
Web scraping
MySQL

Libraries

library(httr)
library(jsonlite)
library(dplyr)
library(stringr)
library(rvest)
library(janitor)
library(stringr)
library(tidytext)
library(tibble)
library(textdata)
library(tidyr)
library(readr)
library(purrr)
library(forcats)
library(ggplot2)
library(doParallel)
library(foreach)
library(DBI)  
library(RMariaDB)


# Load sentiment lexicons
nrc  <- get_sentiments("nrc")
bing  <- get_sentiments("bing")

# Detect number of cores and create cluster
num_cores <- parallel::detectCores() - 1  # leave one core free
cl <- makeCluster(num_cores)
registerDoParallel(cl)

Find the Top Artists

Read the Data

rs <- read_csv("https://raw.githubusercontent.com/aaliyahmjh/DATA607Project/refs/heads/main/rolling_stone.csv", show_col_types = FALSE)
head(rs)

## # A tibble: 6 × 21
##   sort_name      clean_name     album rank_2003 rank_2012 rank_2020 differential
##   <chr>          <chr>          <chr>     <dbl>     <dbl>     <dbl>        <dbl>
## 1 Sinatra, Frank Frank Sinatra  In t…       100       101       282         -182
## 2 Diddley, Bo    Bo Diddley     Bo D…       214       216       455         -241
## 3 Presley, Elvis Elvis Presley  Elvi…        55        56       332         -277
## 4 Sinatra, Frank Frank Sinatra  Song…       306       308        NA         -195
## 5 Little Richard Little Richard Here…        50        50       227         -177
## 6 Beyonce        Beyonce        Lemo…        NA        NA        32          469
## # ℹ 14 more variables: release_year <dbl>, genre <chr>, type <chr>,
## #   weeks_on_billboard <dbl>, peak_billboard_position <dbl>,
## #   spotify_popularity <dbl>, spotify_url <chr>, artist_member_count <dbl>,
## #   artist_gender <chr>, artist_birth_year_sum <dbl>,
## #   debut_album_release_year <dbl>, ave_age_at_top_500 <dbl>,
## #   years_between <dbl>, album_id <chr>

Tidy the Data

This data set is untidy because it has a wide structure with duplicate columns such as name and 3 separate rank columns that can be combined and referenced with a “year” column:

# Tidy the data
tidy_rs <- rs %>%
  pivot_longer(
    cols = starts_with("rank_"),    
    names_to = "year",               
    values_to = "rank",             
    names_pattern = "rank_(\\d+)" 
  )

head(tidy_rs)

## # A tibble: 6 × 20
##   sort_name      clean_name    album       differential release_year genre type 
##   <chr>          <chr>         <chr>              <dbl>        <dbl> <chr> <chr>
## 1 Sinatra, Frank Frank Sinatra In the Wee…         -182         1955 Big … Stud…
## 2 Sinatra, Frank Frank Sinatra In the Wee…         -182         1955 Big … Stud…
## 3 Sinatra, Frank Frank Sinatra In the Wee…         -182         1955 Big … Stud…
## 4 Diddley, Bo    Bo Diddley    Bo Diddley…         -241         1955 Rock… Stud…
## 5 Diddley, Bo    Bo Diddley    Bo Diddley…         -241         1955 Rock… Stud…
## 6 Diddley, Bo    Bo Diddley    Bo Diddley…         -241         1955 Rock… Stud…
## # ℹ 13 more variables: weeks_on_billboard <dbl>, peak_billboard_position <dbl>,
## #   spotify_popularity <dbl>, spotify_url <chr>, artist_member_count <dbl>,
## #   artist_gender <chr>, artist_birth_year_sum <dbl>,
## #   debut_album_release_year <dbl>, ave_age_at_top_500 <dbl>,
## #   years_between <dbl>, album_id <chr>, year <chr>, rank <dbl>

Now each observation represents an album/year combination making the data set tidy.

Filter the Data

Next we want to filter the data set to only focus on albums ranked in 2020:

filter_rs <- tidy_rs %>%
  select(clean_name, album, genre, year, rank)

# Filter dataset to only focus on albums ranked in 2020
filter_rs <- filter_rs %>%
  filter(year == 2020) %>%
  filter(!is.na(rank))

head(filter_rs)

## # A tibble: 6 × 5
##   clean_name     album                      genre                    year   rank
##   <chr>          <chr>                      <chr>                    <chr> <dbl>
## 1 Frank Sinatra  In the Wee Small Hours     Big Band/Jazz            2020    282
## 2 Bo Diddley     Bo Diddley / Go Bo Diddley Rock n' Roll/Rhythm & B… 2020    455
## 3 Elvis Presley  Elvis Presley              Rock n' Roll/Rhythm & B… 2020    332
## 4 Little Richard Here's Little Richard      <NA>                     2020    227
## 5 Beyonce        Lemonade                   <NA>                     2020     32
## 6 Amy Winehouse  Back to Black              Soul/Gospel/R&B          2020     33

Now we want to grab only the top 10 artists:

# Select the top 10 artists & albums by rank 
top10_by_rank <- filter_rs %>%
  arrange(rank) %>%      # sort ascending
  slice_head(n = 10)      # take the first 10 rows

top10_by_rank

## # A tibble: 10 × 5
##    clean_name     album                           genre              year   rank
##    <chr>          <chr>                           <chr>              <chr> <dbl>
##  1 Marvin Gaye    What's Going On                 Soul/Gospel/R&B    2020      1
##  2 The Beach Boys Pet Sounds                      <NA>               2020      2
##  3 Joni Mitchell  Blue                            Country/Folk/Coun… 2020      3
##  4 Stevie Wonder  Songs in the Key of Life        Soul/Gospel/R&B    2020      4
##  5 The Beatles    Abbey Road                      <NA>               2020      5
##  6 Nirvana        Nevermind                       Indie/Alternative… 2020      6
##  7 Fleetwood Mac  Rumours                         Singer-Songwriter… 2020      7
##  8 Prince         Purple Rain                     <NA>               2020      8
##  9 Bob Dylan      Blood on the Tracks             Country/Folk/Coun… 2020      9
## 10 Lauryn Hill    The Miseducation of Lauryn Hill Hip-Hop/Rap        2020     10

The top artists are:

Marvin Gaye
The Beach Boys
Joni Mitchell
Stevie Wonder
The Beatles
Nirvana
Fleetwood Mac
Prince
Bob Dylan
Lauryn Hill

Note: none of the top artists from Rolling Stone are recent artists, so when we look at the songs and when they were released, we’ll have more data from 1960s - 1990s.

Find the Songs for Each Artist

Now that we have our artists to focus on, next is grabbing the songs for each artist. We’ll be grabbing this data from Wikipedia for us to get web scraping experience.

Read the Data

Each Wikipedia page is set up a bit differently, so we’re going to get the songs for each artist individually:

# Get the html
html_marvin_gaye <- read_html("https://en.wikipedia.org/wiki/Marvin_Gaye_discography#Singles")
html_beach_boys <- read_html("https://en.wikipedia.org/wiki/The_Beach_Boys_discography#Singles")
html_joni_mitchell <- read_html("https://en.wikipedia.org/wiki/Joni_Mitchell_discography#Singles")
html_stevie_wonder <- read_html("https://en.wikipedia.org/wiki/Stevie_Wonder_discography#Singles")
html_the_beatles <- read_html("https://en.wikipedia.org/wiki/The_Beatles_singles_discography#Singles")
html_nirvana <- read_html("https://en.wikipedia.org/wiki/Nirvana_discography#Singles")
html_fleetwood_mac <- read_html("https://en.wikipedia.org/wiki/Fleetwood_Mac_discography#Singles")
html_prince <- read_html("https://en.wikipedia.org/wiki/Prince_singles_discography#Singles")
html_bob_dylan <- read_html("https://en.wikipedia.org/wiki/Bob_Dylan_discography#Singles")
html_lauryn_hill <- read_html("https://en.wikipedia.org/wiki/Lauryn_Hill_discography#Singles")

Next we select the tables we want from the HTML. Note, some artist’s pages have multiple tables for songs where each table represents a decade. For these artists, we’ll need to gather multiple tables in order to gather all the songs:

# Scrape the songs table from wikipedia
get_songs_df <- function(html, css_selector) {
  songs <- html |> html_elements(css_selector) |>
    html_table()
  return(as.data.frame(songs[[1]]))
}

# Note: these Wikipedia pages can be updated, so it's possible this code won't run in the future (css selector would need to be updated)
marvin_gaye_df <- get_songs_df(html_marvin_gaye, "#mw-content-text > div.mw-content-ltr.mw-parser-output > table:nth-child(30)") # 1960s
marvin_gaye_df2 <- get_songs_df(html_marvin_gaye, "#mw-content-text > div.mw-content-ltr.mw-parser-output > table:nth-child(32)") # 1970 - 1984
marvin_gaye_df3 <- get_songs_df(html_marvin_gaye, "#mw-content-text > div.mw-content-ltr.mw-parser-output > table:nth-child(34)") # Posthumous
beach_boys_df <- get_songs_df(html_beach_boys, "#mw-content-text > div.mw-content-ltr.mw-parser-output > table:nth-child(21)") # 1960s
beach_boys_df2 <- get_songs_df(html_beach_boys, "#mw-content-text > div.mw-content-ltr.mw-parser-output > table:nth-child(23)") # 1970s
beach_boys_df3 <- get_songs_df(html_beach_boys, "#mw-content-text > div.mw-content-ltr.mw-parser-output > table:nth-child(25)") # 1980s
beach_boys_df4 <- get_songs_df(html_beach_boys, "#mw-content-text > div.mw-content-ltr.mw-parser-output > table:nth-child(27)") # 1990s - present
joni_mitchell_df <- get_songs_df(html_joni_mitchell, "#mw-content-text > div.mw-content-ltr.mw-parser-output > table:nth-child(19)")
stevie_wonder_df <- get_songs_df(html_stevie_wonder, "#mw-content-text > div.mw-content-ltr.mw-parser-output > table:nth-child(18)") # 1960s
stevie_wonder_df2 <- get_songs_df(html_stevie_wonder, "#mw-content-text > div.mw-content-ltr.mw-parser-output > table:nth-child(20)") # 1970s
stevie_wonder_df3 <- get_songs_df(html_stevie_wonder, "#mw-content-text > div.mw-content-ltr.mw-parser-output > table:nth-child(22)") # 1980s
stevie_wonder_df4 <- get_songs_df(html_stevie_wonder, "#mw-content-text > div.mw-content-ltr.mw-parser-output > table:nth-child(24)") # 1990s - present
the_beatles_df <- get_songs_df(html_the_beatles, "#mw-content-text > div.mw-content-ltr.mw-parser-output > table:nth-child(10)")
nirvana_df <- get_songs_df(html_nirvana, "#mw-content-text > div.mw-content-ltr.mw-parser-output > table:nth-child(20)")
fleetwood_mac_df <- get_songs_df(html_fleetwood_mac, "#mw-content-text > div.mw-content-ltr.mw-parser-output > table:nth-child(23)")
prince_df <- get_songs_df(html_prince, "#mw-content-text > div.mw-content-ltr.mw-parser-output > table:nth-child(18)") # 1970s & 80s
prince_df2 <- get_songs_df(html_prince, "#mw-content-text > div.mw-content-ltr.mw-parser-output > table:nth-child(20)") # 1990s
prince_df3 <- get_songs_df(html_prince, "#mw-content-text > div.mw-content-ltr.mw-parser-output > table:nth-child(22)") # 2000s
prince_df4 <- get_songs_df(html_prince, "#mw-content-text > div.mw-content-ltr.mw-parser-output > table:nth-child(24)") # 2010s - 20s
bob_dylan_df <- get_songs_df(html_bob_dylan, "#mw-content-text > div.mw-content-ltr.mw-parser-output > table:nth-child(18)")
lauryn_hill_df <- get_songs_df(html_lauryn_hill, "#mw-content-text > div.mw-content-ltr.mw-parser-output > table:nth-child(18)")

Tidy and Prepare the Data

Next we’ll go through each unique dataframe (note, each dataframe is slightly different from each other), and tidy the data. These dataframes are untidy because each chart category is a unique column. Instead there should be a general albums_chart category variable and a rank variable to represent this information.

Marvin Gaye

# Marvin Gaye
## 1960s ##
# Keep the important variables
marvin_gaye_df <- marvin_gaye_df[, 1:11]

# First row is actually the column names
marvin_gaye_df <- marvin_gaye_df |>
  row_to_names(row_number = 1)

# Clean up the names
marvin_gaye_df_clean <- marvin_gaye_df %>%
  clean_names()
colnames(marvin_gaye_df_clean)[4:9] <- c("us", "us_r_and_b", "can", "ire", "ned", "uk")

# Make tidy - make the albums chart it's own variable
marvin_gaye_df_clean <- marvin_gaye_df_clean |>
  pivot_longer(cols = us:uk, names_to = "albums_chart", values_to = "peak_chart_position")

# Clean up
marvin_gaye_df_clean <- marvin_gaye_df_clean |>
  filter(title != '"—" denotes items that did not chart or were not released in that territory.')
marvin_gaye_df_clean$title <- str_extract(marvin_gaye_df_clean$title , "^(.).*\\1")
head(marvin_gaye_df_clean)

## # A tibble: 6 × 7
##   year  title       b_side certifications album albums_chart peak_chart_position
##   <chr> <chr>       <chr>  <chr>          <chr> <chr>        <chr>              
## 1 1961  "\"Let You… "\"Ne… ""             The … us           —                  
## 2 1961  "\"Let You… "\"Ne… ""             The … us_r_and_b   —                  
## 3 1961  "\"Let You… "\"Ne… ""             The … can          —                  
## 4 1961  "\"Let You… "\"Ne… ""             The … ire          —                  
## 5 1961  "\"Let You… "\"Ne… ""             The … ned          —                  
## 6 1961  "\"Let You… "\"Ne… ""             The … uk           —

## 1970 - 1984 ##
# First row is actually the column names
marvin_gaye_df2 <- marvin_gaye_df2 |>
  row_to_names(row_number = 1)

# Clean up the names
marvin_gaye_df2_clean <- marvin_gaye_df2 %>%
  clean_names()
colnames(marvin_gaye_df2_clean)[4:12] <- c("us", "us_r_and_b", "aus", "can", "ger", "ire", "ned", "swe", "uk")

# Make tidy - make the albums chart it's own variable
marvin_gaye_df2_clean <- marvin_gaye_df2_clean |>
  pivot_longer(cols = us:uk, names_to = "albums_chart", values_to = "peak_chart_position")
head(marvin_gaye_df2_clean)

## # A tibble: 6 × 8
##   year  title b_side certifications album na    albums_chart peak_chart_position
##   <chr> <chr> <chr>  <chr>          <chr> <chr> <chr>        <chr>              
## 1 1970  "\"H… "\"Go… ""             That… <NA>  us           41                 
## 2 1970  "\"H… "\"Go… ""             That… <NA>  us_r_and_b   18                 
## 3 1970  "\"H… "\"Go… ""             That… <NA>  aus          37                 
## 4 1970  "\"H… "\"Go… ""             That… <NA>  can          37                 
## 5 1970  "\"H… "\"Go… ""             That… <NA>  ger          —                  
## 6 1970  "\"H… "\"Go… ""             That… <NA>  ire          —

# Clean up
marvin_gaye_df2_clean <- marvin_gaye_df2_clean |>
  filter(title != '"—" denotes items that did not chart or were not released in that territory.')
marvin_gaye_df2_clean$title <- str_extract(marvin_gaye_df2_clean$title , "^(.).*\\1")

## Posthumous ##
# First row is actually the column names
marvin_gaye_df3 <- marvin_gaye_df3 |>
  row_to_names(row_number = 1)

# Clean up the names
marvin_gaye_df3_clean <- marvin_gaye_df3 %>%
  clean_names()
colnames(marvin_gaye_df3_clean)[4:12] <- c("us", "us_r_and_b", "aus", "ger", "ire", "ned", "sco", "swe", "uk")

# Make tidy - make the albums chart it's own variable
marvin_gaye_df3_clean <- marvin_gaye_df3_clean |>
  pivot_longer(cols = us:uk, names_to = "albums_chart", values_to = "peak_chart_position")

# Clean up
marvin_gaye_df3_clean <- marvin_gaye_df3_clean |>
  filter(title != '"—" denotes items that did not chart or were not released in that territory. "N/A" indicates chart not yet published.')
marvin_gaye_df3_clean$title <- str_extract(marvin_gaye_df3_clean$title , "^(.).*\\1")

head(marvin_gaye_df3_clean)

## # A tibble: 6 × 8
##   year  title b_side certifications album na    albums_chart peak_chart_position
##   <chr> <chr> <chr>  <chr>          <chr> <chr> <chr>        <chr>              
## 1 1985  "\"S… "\"Sa… ""             Drea… <NA>  us           [K]                
## 2 1985  "\"S… "\"Sa… ""             Drea… <NA>  us_r_and_b   2                  
## 3 1985  "\"S… "\"Sa… ""             Drea… <NA>  aus          —                  
## 4 1985  "\"S… "\"Sa… ""             Drea… <NA>  ger          —                  
## 5 1985  "\"S… "\"Sa… ""             Drea… <NA>  ire          —                  
## 6 1985  "\"S… "\"Sa… ""             Drea… <NA>  ned          —

The Beach Boys

# The Beach Boys

## 1960s ##
beach_boys_df <- beach_boys_df |>
  row_to_names(row_number = 1)

# Clean up the names
beach_boys_df_clean <- beach_boys_df %>%
  clean_names()
colnames(beach_boys_df_clean)[2:12] <- c("title", "us", "us_cash_box", "us_record_world", "aus", "can", "nl", "nor", "swe", "uk", "phi")

# Make tidy - make the albums chart it's own variable
beach_boys_df_clean <- beach_boys_df_clean |>
  pivot_longer(cols = us:phi, names_to = "albums_chart", values_to = "peak_chart_position")

# A side and B side single appear in the same cell in single variable, so we need to break them out into individual observations
beach_boys_df_clean <- separate_rows(beach_boys_df_clean, title, sep = '""')
beach_boys_df_clean

## # A tibble: 710 × 6
##    year  title     certifications_sales…¹ album albums_chart peak_chart_position
##    <chr> <chr>     <chr>                  <chr> <chr>        <chr>              
##  1 1961  "\"Surfi… —                      Surf… us           75—                
##  2 1961  "Luau\" … —                      Surf… us           75—                
##  3 1961  "\"Surfi… —                      Surf… us_cash_box  85—                
##  4 1961  "Luau\" … —                      Surf… us_cash_box  85—                
##  5 1961  "\"Surfi… —                      Surf… us_record_w… ——                 
##  6 1961  "Luau\" … —                      Surf… us_record_w… ——                 
##  7 1961  "\"Surfi… —                      Surf… aus          ——                 
##  8 1961  "Luau\" … —                      Surf… aus          ——                 
##  9 1961  "\"Surfi… —                      Surf… can          ——                 
## 10 1961  "Luau\" … —                      Surf… can          ——                 
## # ℹ 700 more rows
## # ℹ abbreviated name: ¹certifications_sales_thresholds

# Clean up
beach_boys_df_clean <- beach_boys_df_clean |>
  filter(title != '"—" denotes a release that did not chart, was not released in the country or the information is unknown (* - US Record World chart data incomplete for early 1964)')
beach_boys_df_clean$title <- str_extract(beach_boys_df_clean$title , "^(.).*\\1")

## 1970s ##
beach_boys_df2 <- beach_boys_df2 |>
  row_to_names(row_number = 1)

# Clean up the names
beach_boys_df2_clean <- beach_boys_df2 %>%
  clean_names()
colnames(beach_boys_df2_clean)[2:12] <- c("title", "us", "us_cash_box", "us_record_world", "us_ac", "aus", "can", "nl", "nor", "swe", "uk")

# Make tidy - make the albums chart it's own variable
beach_boys_df2_clean <- beach_boys_df2_clean |>
  pivot_longer(cols = us:uk, names_to = "albums_chart", values_to = "peak_chart_position")

# A side and B side single appear in the same cell in single variable, so we need to break them out into individual observations
beach_boys_df2_clean <- separate_rows(beach_boys_df2_clean, title, sep = '""')

## 1980s ##
beach_boys_df3 <- beach_boys_df3 |>
  row_to_names(row_number = 1)

# Clean up the names
beach_boys_df3_clean <- beach_boys_df3 %>%
  clean_names()
colnames(beach_boys_df3_clean)[2:9] <- c("title", "us", "us_cash_box", "us_ac", "aus", "can", "ger", "uk")

# Make tidy - make the albums chart it's own variable
beach_boys_df3_clean <- beach_boys_df3_clean |>
  pivot_longer(cols = us:uk, names_to = "albums_chart", values_to = "peak_chart_position")

# A side and B side single appear in the same cell in single variable, so we need to break them out into individual observations
beach_boys_df3_clean <- separate_rows(beach_boys_df3_clean, title, sep = '""')
beach_boys_df3_clean <- separate_rows(beach_boys_df3_clean, title, sep = '" "')

## 1990s - Present
beach_boys_df4 <- beach_boys_df4 |>
  row_to_names(row_number = 1)

# Clean up the names
beach_boys_df4_clean <- beach_boys_df4 %>%
  clean_names()
colnames(beach_boys_df4_clean)[2:10] <- c("title", "us", "us_cash_box", "us_ac", "aus", "can", "ger", "uk", "us_cou")

# Make tidy - make the albums chart it's own variable
beach_boys_df4_clean <- beach_boys_df4_clean |>
  pivot_longer(cols = us:us_cou, names_to = "albums_chart", values_to = "peak_chart_position")

# A side and B side single appear in the same cell in single variable, so we need to break them out into individual observations
beach_boys_df4_clean <- separate_rows(beach_boys_df4_clean, title, sep = '""')
beach_boys_df4_clean <- separate_rows(beach_boys_df4_clean, title, sep = '" "')

the_beach_boys_df_clean <- beach_boys_df_clean
the_beach_boys_df2_clean <- beach_boys_df2_clean
the_beach_boys_df3_clean <- beach_boys_df3_clean
the_beach_boys_df4_clean <- beach_boys_df4_clean
head(the_beach_boys_df_clean)

## # A tibble: 6 × 6
##   year  title certifications_sales_thre…¹ album albums_chart peak_chart_position
##   <chr> <chr> <chr>                       <chr> <chr>        <chr>              
## 1 1961  <NA>  —                           Surf… us           75—                
## 2 1961  <NA>  —                           Surf… us           75—                
## 3 1961  <NA>  —                           Surf… us_cash_box  85—                
## 4 1961  <NA>  —                           Surf… us_cash_box  85—                
## 5 1961  <NA>  —                           Surf… us_record_w… ——                 
## 6 1961  <NA>  —                           Surf… us_record_w… ——                 
## # ℹ abbreviated name: ¹certifications_sales_thresholds

head(the_beach_boys_df2_clean)

## # A tibble: 6 × 5
##   year  title                             album albums_chart peak_chart_position
##   <chr> <chr>                             <chr> <chr>        <chr>              
## 1 1970  "\"Add Some Music to Your Day"    Sunf… us           64—                
## 2 1970  "Susie Cincinnati\" (non-album B… Sunf… us           64—                
## 3 1970  "\"Add Some Music to Your Day"    Sunf… us_cash_box  49—                
## 4 1970  "Susie Cincinnati\" (non-album B… Sunf… us_cash_box  49—                
## 5 1970  "\"Add Some Music to Your Day"    Sunf… us_record_w… 39—                
## 6 1970  "Susie Cincinnati\" (non-album B… Sunf… us_record_w… 39—

head(the_beach_boys_df3_clean)

## # A tibble: 6 × 6
##   year  title              certifications album albums_chart peak_chart_position
##   <chr> <chr>              <chr>          <chr> <chr>        <chr>              
## 1 1980  "\"Goin' On"       —              Keep… us           83—                
## 2 1980  "Endless Harmony\… —              Keep… us           83—                
## 3 1980  "\"Goin' On"       —              Keep… us_cash_box  103—               
## 4 1980  "Endless Harmony\… —              Keep… us_cash_box  103—               
## 5 1980  "\"Goin' On"       —              Keep… us_ac        ——                 
## 6 1980  "Endless Harmony\… —              Keep… us_ac        ——

head(the_beach_boys_df4_clean)

## # A tibble: 6 × 6
##   year  title              certifications album albums_chart peak_chart_position
##   <chr> <chr>              <chr>          <chr> <chr>        <chr>              
## 1 1990  "\"Somewhere Near… —              Stil… us           ——                 
## 2 1990  "Kokomo\""         —              Stil… us           ——                 
## 3 1990  "\"Somewhere Near… —              Stil… us_cash_box  ——                 
## 4 1990  "Kokomo\""         —              Stil… us_cash_box  ——                 
## 5 1990  "\"Somewhere Near… —              Stil… us_ac        ——                 
## 6 1990  "Kokomo\""         —              Stil… us_ac        ——

Joni Mitchell

# Joni Mitchell
joni_mitchell_df <- joni_mitchell_df |>
  row_to_names(row_number = 1)

# Clean up the variable names
joni_mitchell_df_clean <- joni_mitchell_df %>%
  clean_names()
colnames(joni_mitchell_df_clean)[3:9] <- c("can", "can_uc", "aus", "uk", "us", "us_ac", "us_main")

# Make tidy - make the albums chart it's own variable
joni_mitchell_df_clean <- joni_mitchell_df_clean |>
  pivot_longer(cols = can:us_main, names_to = "albums_chart", values_to = "peak_chart_position")

joni_mitchell_df_clean <- joni_mitchell_df_clean |>
  filter(title != '"—" denotes releases that did not chart')

# Clean up song names
joni_mitchell_df_clean$title <- str_extract(joni_mitchell_df_clean$title , "^(.).*\\1")
head(joni_mitchell_df_clean)

## # A tibble: 6 × 6
##   title              year  certifications album albums_chart peak_chart_position
##   <chr>              <chr> <chr>          <chr> <chr>        <chr>              
## 1 "\"Night in the C… 1968  ""             Song… can          —                  
## 2 "\"Night in the C… 1968  ""             Song… can_uc       —                  
## 3 "\"Night in the C… 1968  ""             Song… aus          —                  
## 4 "\"Night in the C… 1968  ""             Song… uk           —                  
## 5 "\"Night in the C… 1968  ""             Song… us           —                  
## 6 "\"Night in the C… 1968  ""             Song… us_ac        —

Stevie Wonder

# Stevie Wonder
stevie_wonder_df <- stevie_wonder_df |>
  row_to_names(row_number = 1)

## Warning: Row 1 does not provide unique names. Consider running clean_names()
## after row_to_names().

# Clean up the names
stevie_wonder_df_clean <- stevie_wonder_df %>%
  clean_names()
colnames(stevie_wonder_df_clean)[1:11] <- c("title", "year", "us", "us_r_and_b", "us_ac", "bel", "can", "ger", "ire", "swi", "uk")

# Make tidy - make the albums chart it's own variable
stevie_wonder_df_clean <- stevie_wonder_df_clean |>
  pivot_longer(cols = us:uk, names_to = "albums_chart", values_to = "peak_chart_position")

# A side and B side single appear in the same cell in single variable, so we need to break them out into individual observations
stevie_wonder_df_clean <- separate_rows(stevie_wonder_df_clean, title, sep = 'b/w')
head(stevie_wonder_df_clean)

## # A tibble: 6 × 10
##   title         year  swi_39 uk_40 certifications album na    na_2  albums_chart
##   <chr>         <chr> <chr>  <chr> <chr>          <chr> <chr> <chr> <chr>       
## 1 "\"I Call It… 1962  —      —     ""             Non-… <NA>  <NA>  us          
## 2 " Part 2"     1962  —      —     ""             Non-… <NA>  <NA>  us          
## 3 "\"I Call It… 1962  —      —     ""             Non-… <NA>  <NA>  us_r_and_b  
## 4 " Part 2"     1962  —      —     ""             Non-… <NA>  <NA>  us_r_and_b  
## 5 "\"I Call It… 1962  —      —     ""             Non-… <NA>  <NA>  us_ac       
## 6 " Part 2"     1962  —      —     ""             Non-… <NA>  <NA>  us_ac       
## # ℹ 1 more variable: peak_chart_position <chr>

## 1970s ##
stevie_wonder_df2 <- stevie_wonder_df2 |>
  row_to_names(row_number = 1)

# Clean up the names
stevie_wonder_df2_clean <- stevie_wonder_df2 %>%
  clean_names()
colnames(stevie_wonder_df2_clean)[1:12] <- c("title", "year", "us", "us_r_and_b", "us_ac", "aus", "bel", "can", "ger", "ire", "swi", "uk")

# Make tidy - make the albums chart it's own variable
stevie_wonder_df2_clean <- stevie_wonder_df2_clean |>
  pivot_longer(cols = us:uk, names_to = "albums_chart", values_to = "peak_chart_position")

# A side and B side single appear in the same cell in single variable, so we need to break them out into individual observations
stevie_wonder_df2_clean <- separate_rows(stevie_wonder_df2_clean, title, sep = 'b/w')
head(stevie_wonder_df2_clean)

## # A tibble: 6 × 8
##   title  year  uk_40 certifications album na    albums_chart peak_chart_position
##   <chr>  <chr> <chr> <chr>          <chr> <chr> <chr>        <chr>              
## 1 "\"Ne… 1970  6     ""             Sign… <NA>  us           26                 
## 2 " \"S… 1970  6     ""             Sign… <NA>  us           26                 
## 3 "\"Ne… 1970  6     ""             Sign… <NA>  us_r_and_b   11                 
## 4 " \"S… 1970  6     ""             Sign… <NA>  us_r_and_b   11                 
## 5 "\"Ne… 1970  6     ""             Sign… <NA>  us_ac        31                 
## 6 " \"S… 1970  6     ""             Sign… <NA>  us_ac        31

## 1980s ##
stevie_wonder_df3 <- stevie_wonder_df3 |>
  row_to_names(row_number = 1)

# Clean up the names
stevie_wonder_df3_clean <- stevie_wonder_df3 %>%
  clean_names()
colnames(stevie_wonder_df3_clean)[1:12] <- c("title", "year", "us", "us_r_and_b", "us_ac", "aus", "bel", "can", "ger", "ire", "swi", "uk")

# Make tidy - make the albums chart it's own variable
stevie_wonder_df3_clean <- stevie_wonder_df3_clean |>
  pivot_longer(cols = us:uk, names_to = "albums_chart", values_to = "peak_chart_position")

# A side and B side single appear in the same cell in single variable, so we need to break them out into individual observations
stevie_wonder_df3_clean <- separate_rows(stevie_wonder_df3_clean, title, sep = 'b/w')
head(stevie_wonder_df3_clean)

## # A tibble: 6 × 9
##   title               year  swi_39 uk_40 certifications album na    albums_chart
##   <chr>               <chr> <chr>  <chr> <chr>          <chr> <chr> <chr>       
## 1 "\"Black Orchid\""  1980  —      63    ""             "Ste… <NA>  us          
## 2 " \"Blame It on th… 1980  —      63    ""             "Ste… <NA>  us          
## 3 "\"Black Orchid\""  1980  —      63    ""             "Ste… <NA>  us_r_and_b  
## 4 " \"Blame It on th… 1980  —      63    ""             "Ste… <NA>  us_r_and_b  
## 5 "\"Black Orchid\""  1980  —      63    ""             "Ste… <NA>  us_ac       
## 6 " \"Blame It on th… 1980  —      63    ""             "Ste… <NA>  us_ac       
## # ℹ 1 more variable: peak_chart_position <chr>

## 1990s - present ##
stevie_wonder_df4 <- stevie_wonder_df4 |>
  row_to_names(row_number = 1)

## Warning: Row 1 does not provide unique names. Consider running clean_names()
## after row_to_names().

# Clean up the names
stevie_wonder_df4_clean <- stevie_wonder_df4 %>%
  clean_names()
colnames(stevie_wonder_df4_clean)[1:11] <- c("title", "year", "us", "us_r_and_b", "us_ac", "bel", "can", "ger", "ire", "swi", "uk")

# Make tidy - make the albums chart it's own variable
stevie_wonder_df4_clean <- stevie_wonder_df4_clean |>
  pivot_longer(cols = us:uk, names_to = "albums_chart", values_to = "peak_chart_position")

# A side and B side single appear in the same cell in single variable, so we need to break them out into individual observations
stevie_wonder_df4_clean <- separate_rows(stevie_wonder_df4_clean, title, sep = 'b/w')
head(stevie_wonder_df4_clean)

## # A tibble: 6 × 9
##   title     year  uk_40 album na    na_2  na_3  albums_chart peak_chart_position
##   <chr>     <chr> <chr> <chr> <chr> <chr> <chr> <chr>        <chr>              
## 1 "\"Keep … 1990  77    Non-… <NA>  <NA>  <NA>  us           —                  
## 2 " Instru… 1990  77    Non-… <NA>  <NA>  <NA>  us           —                  
## 3 "\"Keep … 1990  77    Non-… <NA>  <NA>  <NA>  us_r_and_b   24                 
## 4 " Instru… 1990  77    Non-… <NA>  <NA>  <NA>  us_r_and_b   24                 
## 5 "\"Keep … 1990  77    Non-… <NA>  <NA>  <NA>  us_ac        —                  
## 6 " Instru… 1990  77    Non-… <NA>  <NA>  <NA>  us_ac        —

The Beatles

# The Beatles
the_beatles_df <- the_beatles_df |>
  row_to_names(row_number = 1)

## Warning: Row 1 does not provide unique names. Consider running clean_names()
## after row_to_names().

# Clean up the names
the_beatles_df_clean <- the_beatles_df %>%
  clean_names()
colnames(the_beatles_df_clean)[1:15] <- c("title", "year", "uk", "aus", "aut", "bel", "can", "ger", "nld", "nor", "nz", "swi", "us", "us_cash_box", "us_record_world")

# Make tidy - make the albums chart it's own variable
the_beatles_df_clean <- the_beatles_df_clean |>
  pivot_longer(cols = uk:us_record_world, names_to = "albums_chart", values_to = "peak_chart_position")

# A side and B side single appear in the same cell in single variable, so we need to break them out into individual observations
the_beatles_df_clean <- separate_rows(the_beatles_df_clean, title, sep = '""')
head(the_beatles_df_clean)

## # A tibble: 6 × 9
##   title          year  certifications uk_album us_album na    na_2  albums_chart
##   <chr>          <chr> <chr>          <chr>    <chr>    <chr> <chr> <chr>       
## 1 "\"My Bonnie"  1962  ""             My Bonn… ""       <NA>  <NA>  uk          
## 2 "The Saints\"… 1962  ""             My Bonn… ""       <NA>  <NA>  uk          
## 3 "\"My Bonnie"  1962  ""             My Bonn… ""       <NA>  <NA>  aus         
## 4 "The Saints\"… 1962  ""             My Bonn… ""       <NA>  <NA>  aus         
## 5 "\"My Bonnie"  1962  ""             My Bonn… ""       <NA>  <NA>  aut         
## 6 "The Saints\"… 1962  ""             My Bonn… ""       <NA>  <NA>  aut         
## # ℹ 1 more variable: peak_chart_position <chr>

Nirvana

# Nirvana
nirvana_df <- nirvana_df |>
  row_to_names(row_number = 1)

# Clean up the names
nirvana_df_clean <- nirvana_df %>%
  clean_names()
colnames(nirvana_df_clean)[3:12] <- c("us", "aus", "bel", "fin", "fra", "ire", "nz", "prt", "swe", "uk")

# Make tidy - make the albums chart it's own variable
nirvana_df_clean <- nirvana_df_clean |>
  pivot_longer(cols = us:uk, names_to = "albums_chart", values_to = "peak_chart_position")
head(nirvana_df_clean)

## # A tibble: 6 × 6
##   title              year  certifications album albums_chart peak_chart_position
##   <chr>              <chr> <chr>          <chr> <chr>        <chr>              
## 1 "\"Love Buzz\"[A]" 1988  ""             Blea… us           —                  
## 2 "\"Love Buzz\"[A]" 1988  ""             Blea… aus          —                  
## 3 "\"Love Buzz\"[A]" 1988  ""             Blea… bel          —                  
## 4 "\"Love Buzz\"[A]" 1988  ""             Blea… fin          —                  
## 5 "\"Love Buzz\"[A]" 1988  ""             Blea… fra          —                  
## 6 "\"Love Buzz\"[A]" 1988  ""             Blea… ire          —

Fleetwood Mac

# Fleetwood Mac
fleetwood_mac_df <- fleetwood_mac_df |>
  row_to_names(row_number = 1)

# Clean up the names
fleetwood_mac_df_clean <- fleetwood_mac_df %>%
  clean_names()
colnames(fleetwood_mac_df_clean)[3:13] <- c("uk", "us", "us_rock", "us_ac", "aus", "can", "ger", "irl", "nl", "nz")

## Warning in colnames(fleetwood_mac_df_clean)[3:13] <- c("uk", "us", "us_rock", :
## number of items to replace is not a multiple of replacement length

# Make tidy - make the albums chart it's own variable
fleetwood_mac_df_clean <- fleetwood_mac_df_clean |>
  pivot_longer(cols = uk:nz, names_to = "albums_chart", values_to = "peak_chart_position")
head(fleetwood_mac_df_clean)

## # A tibble: 6 × 5
##   title                             year  album albums_chart peak_chart_position
##   <chr>                             <chr> <chr> <chr>        <chr>              
## 1 "\"I Believe My Time Ain't Long\… 1967  Non-… uk           "—"                
## 2 "\"I Believe My Time Ain't Long\… 1967  Non-… us           ""                 
## 3 "\"I Believe My Time Ain't Long\… 1967  Non-… us_rock      ""                 
## 4 "\"I Believe My Time Ain't Long\… 1967  Non-… us_ac        "—"                
## 5 "\"I Believe My Time Ain't Long\… 1967  Non-… aus          ""                 
## 6 "\"I Believe My Time Ain't Long\… 1967  Non-… can          ""

Prince

# Prince

## 70s & 80s ##
prince_df <- prince_df |>
  row_to_names(row_number = 1)

# Clean up the names
prince_df_clean <- prince_df %>%
  clean_names()
colnames(prince_df_clean)[3:7] <- c("us", "us_r_and_b", "us_dance", "nld_tip", "uk")

# Make tidy - make the albums chart it's own variable
prince_df_clean <- prince_df_clean |>
  pivot_longer(cols = us:uk, names_to = "albums_chart", values_to = "peak_chart_position")
head(prince_df_clean)

## # A tibble: 6 × 11
##   title year  ger_17 nz_18 swe_19 swi_20 uk_21 certifications album albums_chart
##   <chr> <chr> <chr>  <chr> <chr>  <chr>  <chr> <chr>          <chr> <chr>       
## 1 "\"S… 1978  —      —     —      —      —     ""             For … us          
## 2 "\"S… 1978  —      —     —      —      —     ""             For … us_r_and_b  
## 3 "\"S… 1978  —      —     —      —      —     ""             For … us_dance    
## 4 "\"S… 1978  —      —     —      —      —     ""             For … nld_tip     
## 5 "\"S… 1978  —      —     —      —      —     ""             For … uk          
## 6 "\"J… 1978  —      —     —      —      —     ""             For … us          
## # ℹ 1 more variable: peak_chart_position <chr>

## 1990s ##
prince_df2 <- prince_df2 |>
  row_to_names(row_number = 1)

# Clean up the names
prince_df2_clean <- prince_df2 %>%
  clean_names()
colnames(prince_df2_clean)[3:12] <- c("us", "us_r_and_b", "aus", "can", "fra", "ger", "nz", "swe", "swi", "uk")

# Make tidy - make the albums chart it's own variable
prince_df2_clean <- prince_df2_clean |>
  pivot_longer(cols = us:uk, names_to = "albums_chart", values_to = "peak_chart_position")
head(prince_df2_clean)

## # A tibble: 6 × 6
##   title              year  certifications album albums_chart peak_chart_position
##   <chr>              <chr> <chr>          <chr> <chr>        <chr>              
## 1 "\"The Future\" (… 1990  ""             Batm… us           —                  
## 2 "\"The Future\" (… 1990  ""             Batm… us_r_and_b   —                  
## 3 "\"The Future\" (… 1990  ""             Batm… aus          —                  
## 4 "\"The Future\" (… 1990  ""             Batm… can          —                  
## 5 "\"The Future\" (… 1990  ""             Batm… fra          —                  
## 6 "\"The Future\" (… 1990  ""             Batm… ger          39

## 2000s ##
prince_df3 <- prince_df3 |>
  row_to_names(row_number = 1)

## Warning: Row 1 does not provide unique names. Consider running clean_names()
## after row_to_names().

# Clean up the names
prince_df3_clean <- prince_df3 %>%
  clean_names()
colnames(prince_df3_clean)[3:11] <- c("us", "us_r_and_b", "aus", "ger", "ire", "nld", "nor", "swi", "uk")

# Make tidy - make the albums chart it's own variable
prince_df3_clean <- prince_df3_clean |>
  pivot_longer(cols = us:uk, names_to = "albums_chart", values_to = "peak_chart_position")
head(prince_df3_clean)

## # A tibble: 6 × 7
##   title                 year  album na    na_2  albums_chart peak_chart_position
##   <chr>                 <chr> <chr> <chr> <chr> <chr>        <chr>              
## 1 "\"U Make My Sun Shi… 2000  The … <NA>  <NA>  us           —[G]               
## 2 "\"U Make My Sun Shi… 2000  The … <NA>  <NA>  us_r_and_b   108                
## 3 "\"U Make My Sun Shi… 2000  The … <NA>  <NA>  aus          —                  
## 4 "\"U Make My Sun Shi… 2000  The … <NA>  <NA>  ger          —                  
## 5 "\"U Make My Sun Shi… 2000  The … <NA>  <NA>  ire          —                  
## 6 "\"U Make My Sun Shi… 2000  The … <NA>  <NA>  nld          —

## 2010s - present ##
prince_df4 <- prince_df4 |>
  row_to_names(row_number = 1)

# Clean up the names
prince_df4_clean <- prince_df4 %>%
  clean_names()
colnames(prince_df4_clean)[3:7] <- c("us_adult_r_and_b", "us_r_and_b", "us_r_and_b_digital", "us_r_and_b_sales", "uk")

# Make tidy - make the albums chart it's own variable
prince_df4_clean <- prince_df4_clean |>
  pivot_longer(cols = us_adult_r_and_b:uk, names_to = "albums_chart", values_to = "peak_chart_position")

# Wikipedia added a "Posthumous" section within the same table, so need to remove that row to make the data frame tidy
prince_df4_clean <- prince_df4_clean |>
  filter(title != "Posthumous", title != '"—" denotes a single that was not released or did not chart in the region.')
head(prince_df4_clean)

## # A tibble: 6 × 5
##   title                             year  album albums_chart peak_chart_position
##   <chr>                             <chr> <chr> <chr>        <chr>              
## 1 "\"Extraloveable\" (featuring An… 2011  Hit … us_adult_r_… —                  
## 2 "\"Extraloveable\" (featuring An… 2011  Hit … us_r_and_b   —                  
## 3 "\"Extraloveable\" (featuring An… 2011  Hit … us_r_and_b_… —                  
## 4 "\"Extraloveable\" (featuring An… 2011  Hit … us_r_and_b_… —                  
## 5 "\"Extraloveable\" (featuring An… 2011  Hit … uk           —                  
## 6 "\"Rock and Roll Love Affair\" [… 2012  Hit … us_adult_r_… —[J]

Bob Dylan

# Bob Dylan
bob_dylan_df <- bob_dylan_df |>
  row_to_names(row_number = 1)

# Clean up the names
bob_dylan_df_clean <- bob_dylan_df %>%
  clean_names()
colnames(bob_dylan_df_clean)[2:9] <- c("title", "us", "us_main", "us_cash", "aus", "ire", "nl", "uk")

# Make tidy - make the albums chart it's own variable
bob_dylan_df_clean <- bob_dylan_df_clean |>
  pivot_longer(cols = us:uk, names_to = "albums_chart", values_to = "peak_chart_position")

# A side and B side single appear in the same cell in single variable, so we need to break them out into individual observations
bob_dylan_df_clean <- separate_rows(bob_dylan_df_clean, title, sep = 'b/w')
head(bob_dylan_df_clean)

## # A tibble: 6 × 6
##   year  title              certifications album albums_chart peak_chart_position
##   <chr> <chr>              <chr>          <chr> <chr>        <chr>              
## 1 1962  "\"Mixed-Up Confu… ""             Non-… us           —                  
## 2 1962  " \"Corrina Corri… ""             Non-… us           —                  
## 3 1962  "\"Mixed-Up Confu… ""             Non-… us_main      —                  
## 4 1962  " \"Corrina Corri… ""             Non-… us_main      —                  
## 5 1962  "\"Mixed-Up Confu… ""             Non-… us_cash      —                  
## 6 1962  " \"Corrina Corri… ""             Non-… us_cash      —

Lauryn Hill

# Lauryn Hill
lauryn_hill_df <- lauryn_hill_df |>
  row_to_names(row_number = 1)

# Clean up the names
lauryn_hill_df_clean <- lauryn_hill_df %>%
  clean_names()
colnames(lauryn_hill_df_clean)[3:12] <- c("us", "us_r_and_b", "us_rhy", "aus", "fra", "nl", "nz", "swe", "swi", "uk")

# Make tidy - make the albums chart it's own variable
lauryn_hill_df_clean <- lauryn_hill_df_clean |>
  pivot_longer(cols = us:uk, names_to = "albums_chart", values_to = "peak_chart_position")
head(lauryn_hill_df_clean)

## # A tibble: 6 × 7
##   title        year  certifications album na    albums_chart peak_chart_position
##   <chr>        <chr> <chr>          <chr> <chr> <chr>        <chr>              
## 1 "\"Doo Wop … 1998  "RIAA: Gold[3… The … <NA>  us           1                  
## 2 "\"Doo Wop … 1998  "RIAA: Gold[3… The … <NA>  us_r_and_b   2                  
## 3 "\"Doo Wop … 1998  "RIAA: Gold[3… The … <NA>  us_rhy       1                  
## 4 "\"Doo Wop … 1998  "RIAA: Gold[3… The … <NA>  aus          35                 
## 5 "\"Doo Wop … 1998  "RIAA: Gold[3… The … <NA>  fra          23                 
## 6 "\"Doo Wop … 1998  "RIAA: Gold[3… The … <NA>  nl           4

We now have a tidy song dataframe for each artist.

Creating a Single Dataframe for Artist/Song/Album/Year

Now let’s combine all the dataframes into one called artist_dfs:

artist_dfs <- c(
  # Marvin Gaye
  "marvin_gaye_df_clean",
  "marvin_gaye_df2_clean",
  "marvin_gaye_df3_clean",
  
  # The Beach Boys
  "the_beach_boys_df_clean",
  "the_beach_boys_df2_clean",
  "the_beach_boys_df3_clean",
  "the_beach_boys_df4_clean",  
  
  
  # Joni Mitchell
  "joni_mitchell_df_clean",
  
  # Stevie Wonder
  "stevie_wonder_df_clean",
  "stevie_wonder_df2_clean",
  "stevie_wonder_df3_clean",
  "stevie_wonder_df4_clean",
  
  # The Beatles
  "the_beatles_df_clean",
  
  # Nirvana
  "nirvana_df_clean",
  
  # Fleetwood Mac
  "fleetwood_mac_df_clean",
  
  # Prince
  "prince_df_clean",
  "prince_df2_clean",
  "prince_df3_clean",
  "prince_df4_clean",
  
  # Bob Dylan
  "bob_dylan_df_clean",
  
  # Lauryn Hill
  "lauryn_hill_df_clean"
)

Clean the Data

Since some of the values from the song title variable are messy, let’s clean the dataframe and write the dataframe to a CSV:

# This function:
# - extracts artist name from object name (everything before "_df")
# - cleans song titles
# - adds artist column
# - saves cleaned df to global env with name like "artist_cleaned_songs"

clean_and_save_songs_df <- function(df_name_string) {
  # get df by name
  df <- get(df_name_string, envir = .GlobalEnv)

  # get artist name from df name (before "_df")
  artist_id <- str_extract(df_name_string, "^(.*?)(?=_df)")
  artist_title <- str_to_title(str_replace_all(artist_id, "_", " "))

  cleaned_df <- df |>
    mutate(
      # clean title — remove quotes, () [] {}, cut after cd, us, uk etc.
      title = title |>
        str_trim() |>
        str_replace_all('"', "") |>
        str_remove_all("\\(.*?\\)") |>
        str_remove_all("\\[.*?\\]") |>
        str_remove_all("\\{.*?\\}") |>
        str_remove_all("(?i)[ ]?(cd|us|uk|canadian)[\\s\\-:]+.*$") |>
        str_squish(),
      # add artist column
      artist = artist_title
    ) |>
    # drop junk rows like "denotes", "not released", etc.
    filter(
      !str_detect(title, "(?i)denotes|indicates|refers to|represents|not released")
    ) |>
    # keep only one row per title
    distinct(title, .keep_all = TRUE)

  # save cleaned df to global env
  cleaned_df_name <- paste0(artist_id, "_cleaned_songs")
  assign(cleaned_df_name, cleaned_df, envir = .GlobalEnv)

  # if album/year column missing, add NA
  if (!"album" %in% names(cleaned_df)) cleaned_df$album <- NA
  if (!"year" %in% names(cleaned_df)) cleaned_df$year <- NA

  # keep only needed columns
  minimal_df <- cleaned_df |>
    select(artist, title, album, year)

  # if all_artists_songs doesn’t exist, create it
  if (!exists("all_artists_songs", envir = .GlobalEnv)) {
    assign("all_artists_songs", minimal_df, envir = .GlobalEnv)
  } else {
    # if exists, add to it
    existing <- get("all_artists_songs", envir = .GlobalEnv)
    updated <- bind_rows(existing, minimal_df)
    assign("all_artists_songs", updated, envir = .GlobalEnv)
  }

  return(cleaned_df)
}


artists_songs_album <- invisible(lapply(artist_dfs, clean_and_save_songs_df))

write.csv(all_artists_songs, "all_artists_songs.csv")
all_artists_songs |>
  group_by(artist, title) |>
  filter(n() > 1) |>
  arrange(artist, title)

## # A tibble: 32 × 4
## # Groups:   artist, title [15]
##    artist      title                         album             year 
##    <chr>       <chr>                         <chr>             <chr>
##  1 Marvin Gaye Ain't No Mountain High Enough United            1967 
##  2 Marvin Gaye Ain't No Mountain High Enough Non-album singles 2013 
##  3 Marvin Gaye Let's Get It On               Let's Get It On   1973 
##  4 Marvin Gaye Let's Get It On               Non-album singles 2005 
##  5 Marvin Gaye Sexual Healing                Midnight Love     1982 
##  6 Marvin Gaye Sexual Healing                Non-album singles 2015 
##  7 Prince      1999                          1999              1982 
##  8 Prince      1999                          1999              1998 
##  9 Prince      7                             Love Symbol       1992 
## 10 Prince      7                             Non-album singles 2023 
## # ℹ 22 more rows

all_artists_songs <- all_artists_songs |>
  distinct(artist, title, .keep_all = TRUE)
all_artists_songs |>
  group_by(artist, title) |>
  filter(n() > 1) |>
  arrange(artist, title)

## # A tibble: 0 × 4
## # Groups:   artist, title [0]
## # ℹ 4 variables: artist <chr>, title <chr>, album <chr>, year <chr>

all_artists_songs <- all_artists_songs |>
  distinct(artist, title, .keep_all = TRUE)

head(all_artists_songs)

## # A tibble: 6 × 4
##   artist      title                             album                      year 
##   <chr>       <chr>                             <chr>                      <chr>
## 1 Marvin Gaye Let Your Conscience Be Your Guide The Soulful Moodsof Marvi… 1961 
## 2 Marvin Gaye Sandman                           Greatest Hits              1962 
## 3 Marvin Gaye Soldier's Plea                    That StubbornKinda Fellow  1962 
## 4 Marvin Gaye Stubborn Kind of Fellow           That StubbornKinda Fellow  1962 
## 5 Marvin Gaye Hitch Hike                        That StubbornKinda Fellow  1962 
## 6 Marvin Gaye Pride and Joy                     That StubbornKinda Fellow  1963

We now have a clean, tidy dataframe all_artists_songs which contains artist, song title, album and year data. Next we’ll add on the emotion for each song.

Creating messy SQL database from all_artists_songs and returning it as tidy df

This section is for MySQL (we wanted to include MySQL for this project). We decided to upload a messy dataframe, grab it from MySQL, then tidy it back up again:

Connecting to MySQL:

# get values securely from env
user <- Sys.getenv("MYSQL_USER")
password <- Sys.getenv("MYSQL_PASSWORD")
dbname <- Sys.getenv("MYSQL_DB")
host <- Sys.getenv("MYSQL_HOST")
port <- as.integer(Sys.getenv("MYSQL_PORT"))
print(user)

# connect to MySQL
con <- dbConnect(
  RMariaDB::MariaDB(),
   user = user,
  user = user,
  password = password,
  dbname = dbname,
  host = host,
  port = 3306,
  

)

Uploading messy table to MySQL:

messy_all_artists <- all_artists_songs |>
  mutate(
    song_1 = title,
    song_2 = ifelse(row_number() %% 2 == 0, paste(title, "Remix"), NA),  # some second titles
    notes = case_when(
      grepl("Remix", title) ~ "CD version",
      grepl("Love", title, ignore.case = TRUE) ~ "— denotes b-side",
      TRUE ~ ""
    ),
    album_name = album,
    release_year = year,
    id = row_number()  
  ) |>
  select(id, artist_name = artist, song_1, song_2, album_name, release_year, notes)


# upload messy table to MySQL
dbWriteTable(con, "messy_songs", messy_all_artists, overwrite = TRUE, row.names = FALSE)

Grabbing the dataframe and tidying it again:

messy_from_mysql <- dbReadTable(con, "messy_songs")

tidy_from_mysql <- messy_from_mysql |>
  filter(!is.na(artist_name)) |>
  filter(!str_detect(notes, "(?i)denotes|cd|remix")) |>
  pivot_longer(
    cols = starts_with("song_"),
    names_to = "song_slot",
    values_to = "title"
  ) |>
  # Remove missing or empty titles
  filter(!is.na(title), str_trim(title) != "") |>
  # Remove titles with remix/variation keywords
  filter(!str_detect(title, "(?i)remix|version|edit|live|instrumental|mix|b[- ]?side")) |>
  distinct(artist_name, title, .keep_all = TRUE) |>
  select(artist = artist_name, title, album = album_name, year = release_year)
write.csv(tidy_from_mysql, "tidy_from_mysql.csv")

dbDisconnect(con)

Get the Song Lyrics Using Genius.com API

Next to gather song lyrics, we’ll use the Genuis API. Unfortunately, the Genius API can’t provide the lyrics directly, but it can provide the URL that contains the lyrics to a song. We’ll then have to web scrape again to create a lyrics dataframe.

# Get the API Key
genius_info <- read_csv("https://raw.githubusercontent.com/farhodibr/DATA607_FINAL_PROJECT/refs/heads/main/genius_API.csv")
genius_token <- genius_info$api_token

#For example we can use this song
song_name <- "Yesterday"
artist_band <- "The Beatles"

# This function searches Genius for a song URL
search_song_final <- function(song_title, artist_name) {
  # Setting up to call the Genius API's search endpoint
 base_url <- "https://api.genius.com/search"
  query <- list(q = paste(song_title, artist_name))

  # Making the request to the API
  res <- GET(
    url = base_url,
    add_headers(Authorization = paste("Bearer", genius_token)), # Authorization header includes the token
    query = query
  )

  # Check if the request worked (Status 200 means OK)
  if (status_code(res) != 200) {
    warning("API request failed. Status code: ", status_code(res))
    return(NA) # Return NA on API call failure
  }

  # Now parse the JSON results. Using flatten=TRUE based on previous findings.
  content_data <- tryCatch(
      fromJSON(content(res, "text", encoding = "UTF-8"), flatten = TRUE),
      error = function(e) {
          warning("Failed to parse JSON response: ", e$message)
          return(NULL) # Return NULL on JSON parsing error
      }
  )
  # Ensure a valid response structure was received
  if (is.null(content_data) || is.null(content_data$response)) return(NA)

  # Search results are typically inside response$hits
  hits <- content_data$response$hits

  # Process the 'hits' data frame (assuming it's a data frame)
  # Handle cases with no results or if 'hits' isn't a data frame
  if (is.null(hits) || !inherits(hits, "data.frame") || nrow(hits) == 0) {
    return(NA)
  }

  # Define expected column names based on flatten=TRUE producing specific columns
  artist_col <- "result.primary_artist.name"
  url_col <- "result.url"

  # Check if the required columns exist in the data frame
  if (!all(c(artist_col, url_col) %in% names(hits))) {
     warning("Required columns ('", artist_col, "', '", url_col, "') not found in flattened API response.")
     return(NA) # Cannot proceed without these columns
  }

  # Loop through each row (each potential song hit)
  for (i in 1:nrow(hits)) {
      # Get the artist name for the current row from the flattened column
      artist_raw <- hits[[artist_col]][[i]]
      artist <- tolower(as.character(artist_raw))

      # Check if the artist name is valid and matches the search query
      if (!is.na(artist) && nzchar(artist) && str_detect(artist, fixed(tolower(artist_name)))) {
          # If the artist matched, get the URL from its column
          song_url_found <- hits[[url_col]][[i]]
          if (!is.na(song_url_found) && nzchar(song_url_found)) {
              # Match found! Return the URL and exit the function.
              return(song_url_found)
          }
      }
  }

  # If the loop finishes without returning, no suitable match was found
  return(NA)
}

# --- Example: Find the URL ---
song_url <- search_song_final(song_name, artist_band)
# This should print the found URL...
print(song_url)

## [1] "https://genius.com/The-beatles-yesterday-lyrics"

Now we need to create functions that will get all the words for a song, get_words_df, and get the song emotions, get_song_sentiments:

# Returns a dataframe where each observation is a word and the corresponding in a song
get_words_df <- function(artist_name, song_title) {
  song_url <- search_song_final(song_title, artist_name)
  if (is.na(song_url)) {
    return(NULL)
  }

  # Using the song_url found previously to scrape the lyrics

  # Only run this if a valid song_url was found
  if (!is.na(song_url) && nzchar(song_url)) {

    # Read the HTML content from the URL
    page_html <- tryCatch(
      read_html(song_url),
      error = function(e) {
        warning("Failed to read URL: ", song_url, "\nError: ", e$message)
        return(NULL) # Return NULL on page read failure
      }
    )

    # Only proceed if the webpage HTML was read successfully
    if (!is.null(page_html)) {

      # Find the part of the HTML containing the lyrics
      lyrics_selector <- "div[data-lyrics-container='true']" # Verify this selector!

      # Use rvest to find the HTML element(s) matching the selector
      lyrics_nodes <- page_html |>
        html_elements(css = lyrics_selector)

      # Check if any nodes were found using the selector
      if (length(lyrics_nodes) == 0) {
        # If not found, the selector was likely wrong or the page structure changed.
        warning("Could not find lyrics container using selector: '", lyrics_selector,
              "'. The website structure may have changed or the selector is wrong. ",
              "Please inspect the page HTML.")
        return(NULL) # Cannot proceed.
      } else {

        # Extract the text content from the found node(s)
        # html_text2 often handles line breaks better
        lyrics_text_raw <- lyrics_nodes |>
          html_text2()

        # Clean up the extracted raw text
        temp_cleaned_vector <- lyrics_text_raw %>%
          str_split("\n") %>% # Split into lines based on the newline character (\n)
          unlist() %>% # The result of str_split is a list, so unlist it
          str_remove_all("\\[[^\\]]+\\]") %>% # Remove structural markers like [Verse 1], [Chorus] etc.
          str_trim() # Remove leading/trailing whitespace from lines

        # Now, apply the filtering using standard base R subsetting (outside the pipe)
        lyrics_text_cleaned <- temp_cleaned_vector[nzchar(temp_cleaned_vector)]

        # Structure the cleaned lines into a data frame
        scraped_lyrics_df <- tibble(
          line = 1:length(lyrics_text_cleaned),
          text = lyrics_text_cleaned
        )

      }
    } else {
      # Scraping skipped because reading the URL failed earlier
      warning("Skipping scraping due to error reading URL.")
      return(NULL)
    }

  } else {
    warning("Cannot scrape lyrics because song_url was not found successfully.")
    return(NULL) # Ensure variable exists even if scraping is skipped
  }

  lyrics_df <- scraped_lyrics_df

  words <- lyrics_df %>%
    unnest_tokens(word, text)
  return(words)
}

# Returns top n song emotions for a particular song
get_song_sentiments <- function(words, artist_name, song_title, n, year = NULL) {
  null_top_emotion_df <- tibble(artist = artist_name, song = song_title, top_emotion = NA)
  
  if (is.null(words)) {
    return(null_top_emotion_df)
  }

  top_emotions_count <- words %>%
    inner_join(nrc, by = "word", relationship = "many-to-many") %>%
    filter(!sentiment %in% c("positive", "negative")) %>%
    count(sentiment, sort = TRUE) %>%
    slice_head(n = n)
  top_emotions <- top_emotions_count |>
    pull(sentiment)
  
  top_emotions_count$artist <- artist_name
  top_emotions_count$song <- song_title
  colnames(top_emotions_count)[1] <- "top_emotion"
  if (length(top_emotions) == 0) return(null_top_emotion_df)
  return(top_emotions_count)
}

# Testing
words_test <- get_words_df(artist_band, song_name)
head(words_test, 15)

## # A tibble: 15 × 2
##     line word        
##    <int> <chr>       
##  1     1 113         
##  2     1 contributors
##  3     2 translations
##  4     3 türkçe      
##  5     4 español     
##  6     5 português   
##  7     6 deutsch     
##  8     7 česky       
##  9     8 yesterday   
## 10     8 lyrics      
## 11     9 yesterday   
## 12     9 is          
## 13     9 the         
## 14     9 most        
## 15     9 covered

scraped_lyrics_test <- get_song_sentiments(words_test, artist_band, song_name, 1)
head(scraped_lyrics_test, 15)

## # A tibble: 1 × 4
##   top_emotion     n artist      song     
##   <chr>       <int> <chr>       <chr>    
## 1 fear            3 The Beatles Yesterday

Now let’s apply those functions to our tidy artist_song_list dataframe to create a top emotion variable:

# Grab 5 songs from each of the top 10 artists
get_top5_valid_sentiments <- function(artist_name, song_titles, max_songs = 5) {
  results <- list()

  for (song in song_titles) {
    lyrics_words <- get_words_df(artist_name, song)
    res <- get_song_sentiments(lyrics_words, artist_name, song, 1)
    # keep only song + top_emotion
    if (!is.na(res$top_emotion)) {
      results <- append(results, list(
        tibble(song       = res$song,
               top_emotion = res$top_emotion)
      ))
    }

    if (length(results) >= max_songs) break
  }

  if (length(results) > 0) {
    return(bind_rows(results))
  } else {
    return(tibble(song = NA, top_emotion = NA))
  }
}

get_top5_valid_sentiments_year <- function(artist_name, title) {
  words <- get_words_df(artist_name, title)
  sent <- get_song_sentiments(words, artist_name, title, 1)
  return(sent)
}

artist_song_list <- all_artists_songs %>%
  filter(!is.na(artist), artist != "artist") %>%
  group_by(artist) %>%
  summarise(songs = list(title), .groups = "drop")

song_list_by_year <- all_artists_songs %>%
  filter(!is.na(artist), artist != "artist", year != "Source:[42]") %>%
  group_by(artist, year)

# Now apply get_top5_valid_sentiments to each row
# Use foreach to speed up the process
valid_sentiments <- foreach(i = 1:nrow(artist_song_list),
                            .combine = bind_rows,
                            .packages = c(
                              "dplyr", "purrr", "tibble", "httr", "jsonlite", 
                              "stringr", "rvest", "tidytext", "textdata", "readr", "xml2"
                            )) %dopar% {
  artist <- artist_song_list$artist[i]
  songs  <- artist_song_list$songs[[i]]
  result <- get_top5_valid_sentiments(artist, songs)
  result$artist <- artist
  result
                            }

# Final tidy dataframe
head(valid_sentiments)

## # A tibble: 6 × 3
##   song                                 top_emotion  artist       
##   <chr>                                <chr>        <chr>        
## 1 Mixed-Up Confusion                   anger        Bob Dylan    
## 2 Corrina Corrina                      anticipation Bob Dylan    
## 3 Blowin' in the Wind                  joy          Bob Dylan    
## 4 The Times They Are a-Changin'        anger        Bob Dylan    
## 5 Honey, Just Allow Me One More Chance surprise     Bob Dylan    
## 6 I Believe My Time Ain't Long         anticipation Fleetwood Mac

We now have a tidy data frame, valid_sentiments where each observation represents a song, artist, and the top emotion for that song.

This allows for the data to be easily summarized and grouped without reshaping. We are also able to make very straightforward joins with lexicons like ‘nrc’ and ‘bing’ or other lookup tables. Tidying up this table future-proofed it with the scalability factor since we can append new columns or stack new tables if needed.

Analysis

Now we’re ready to perform our analysis. Let’s look at word frequency for the top songs and analyze the emotions. Note, when looking at decades, we only look at a subset of data, and pick only one year per decade since the code runs so slowly.

## # A tibble: 5 × 3
##   year  top_emotion      n
##   <chr> <chr>        <int>
## 1 1970  joy             12
## 2 1980  joy             15
## 3 1990  joy              6
## 4 2000  anticipation     2
## 5 2010  joy              2

## # A tibble: 10 × 2
##    word        n
##    <chr>   <int>
##  1 love      215
##  2 baby      113
##  3 time       69
##  4 home       65
##  5 summer     54
##  6 music      48
##  7 mind       46
##  8 world      46
##  9 grandma    43
## 10 girl       38

Let’s graph these charts and analyze:

Word frequency:

# Word Frequency
word_freq_decade_years <- word_freq_decade_years |>
  left_join(bing, by = "word", relationship = "many-to-many")

word_freq_decade_years |>
  ggplot(aes(x = reorder(word, n), y = n, fill = sentiment)) +
  geom_col() + theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1)) +
  labs(title = "Top 10 Words 1960 - 2010",
         x = "Word",
         y = "Count")

We can see here that love is the most frequent word, and is the only word that has a positive or negative emotion attached to it. For this we tried the “bing” lexicon which categorizes into either positive or negative sentiment (binary), and it could not determine a sentiment for most of these words.

Possibly the most interesting takeaway from this graph is that grandma is one of the top words.

Sentiments for each decade:

# Top Emotion in Songs Per Year
valid_sentiments_by_decade |>
  ggplot(aes(x = year, y = n, fill = top_emotion)) +
  geom_col() + theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1)) +
  labs(title = "Top Emotion in Songs Per Year (One Year Per Decade)",
       x = "Year", y = "Count")

Over time, the emotion represented for each decade is joy excluding 2000. The year 2000 had more anticipation songs than joyful.

Top Emotions in Songs by Top 10 Artists:

# Top Emotions in Songs by Top 10 Artists
ggplot(valid_sentiments, aes(x = fct_infreq(top_emotion), fill = top_emotion)) +
  geom_bar() +
  coord_flip() +
  labs(title = "Top Emotions in Songs by Top 10 Artists",
       x = "Emotion", y = "Count") +
  theme_minimal()

This bar chart summarizes the most frequent dominant emotion detected in songs by the top 10 Rolling Stone artists.

Joy is by far the most common top emotion, appearing in nearly 25 songs, suggesting that uplifting and positive themes are prominent across these artists’ lyrics.
Anticipation is the second most frequent, often associated with themes of hope, excitement, or longing.
Emotions like trust, anger, and fear are present but much less frequent.
Sadness and surprise appear rarely as the primary emotion, indicating they are less likely to be the dominant tone in top songs.

Overall, this distribution highlights the emotional leaning of popular music from legendary artists toward positive or forward-looking sentiments.

Top Emotions by Artist:

valid_sentiments %>%
  group_by(artist, top_emotion) %>%
  summarise(n = n(), .groups = "drop") %>%
  arrange(artist, desc(n))

## # A tibble: 27 × 3
##    artist        top_emotion      n
##    <chr>         <chr>        <int>
##  1 Bob Dylan     anger            2
##  2 Bob Dylan     anticipation     1
##  3 Bob Dylan     joy              1
##  4 Bob Dylan     surprise         1
##  5 Fleetwood Mac joy              3
##  6 Fleetwood Mac anticipation     2
##  7 Joni Mitchell anticipation     2
##  8 Joni Mitchell joy              2
##  9 Joni Mitchell anger            1
## 10 Lauryn Hill   joy              2
## # ℹ 17 more rows

ggplot(valid_sentiments, aes(x = fct_infreq(top_emotion), fill = artist)) +
  geom_bar(position = "dodge") +
  labs(title = "Top Emotions by Artist", x = "Emotion", y = "Number of Songs") +
  theme_minimal()

This bar chart shows the most common emotional tone for songs by each of the top 10 Rolling Stone artists based on NRC sentiment analysis.

Joy and anticipation are the most dominant emotions across artists, particularly for Marvin Gaye, The Beatles, and Fleetwood Mac.
Fear and sadness appear most often in Lauryn Hill’s songs, indicating a more emotionally intense or somber tone.
Bob Dylan stands out with a more diverse emotional profile that includes anger and surprise.

The chart highlights how different artists tend to gravitate toward different emotional themes in their songwriting.

Emotional Variety by Artist:

#Artists with the most emotionally varied songs
valid_sentiments %>%
  group_by(artist) %>%
  summarise(num_unique_emotions = n_distinct(top_emotion)) %>%
  arrange(desc(num_unique_emotions))

## # A tibble: 10 × 2
##    artist         num_unique_emotions
##    <chr>                        <int>
##  1 Bob Dylan                        4
##  2 Lauryn Hill                      4
##  3 The Beach Boys                   4
##  4 Joni Mitchell                    3
##  5 Fleetwood Mac                    2
##  6 Marvin Gaye                      2
##  7 Nirvana                          2
##  8 Prince                           2
##  9 Stevie Wonder                    2
## 10 The Beatles                      2

This chart shows us that Bob Dylan and The Beach Boys show the greatest diversity of emotions. Most artists span 2–3 emotions in their top 5 songs.

Conclusions

Our analysis showed us that:

Joy is the most frequent emotion overall, especially in the 1970s
Anticipation peaks in the 1980s/90s, then sadness rises in the 2000s

Some challenges we faced were:

Web scraping
Each Wikipedia page set up differently
Wikipedia page updated
Code sometimes slow to run
Genius API doesn’t provide song lyrics directly which added a step

Potential Future Analysis:

Analyze full discographies per artist
Compare results using multiple sentiment dictionaries
Explore how emotions differ by genre or rank position

Extra Analysis

Billbord and geniusr libraries to get more songs

Focusing on just the Rolling Stone’s top artists did not provide us artists and songs from recent decades, so to expand our data, we used the billboard library which provides us the Billboard Top 100 songs for each year from 1960 - 2016.

Get the Data

Let’s load billboard’s wiki_hot_100s dataframe which contains the Billboard Top 100 songs:

library(billboard)
library(geniusr)

#Genius API
Sys.setenv(GENIUS_API_TOKEN = genius_token)

#Creating Billboard dataframe (Includes top100 songs for each year in range of 1960 - 2016)
if (require("dplyr")) {
  data(wiki_hot_100s)
  wiki_hot_100s |>
    glimpse()
}

## Rows: 5,701
## Columns: 4
## $ no     <chr> "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", …
## $ title  <chr> "Theme from A Summer Place", "He'll Have to Go", "Cathy's Clown…
## $ artist <chr> "Percy Faith", "Jim Reeves", "The Everly Brothers", "Johnny Pre…
## $ year   <chr> "1960", "1960", "1960", "1960", "1960", "1960", "1960", "1960",…

# This is a random sample for random 20 songs from each year
# Only look at 20 due to slow process time to obtain song lyrics
set.seed(146)
sample_billboard_1960_2016 <- wiki_hot_100s |>
  mutate(year = as.numeric(year))|>
  group_by(year) |>
  sample_n(size = 20, replace = FALSE) |> #size can be decreased to speed up the process
  ungroup()

# Grab the top 20 songs from each year
# Only look at 20 due to slow process time to obtain song lyrics
top_20_billboard_1960_2016 <- wiki_hot_100s |>
  mutate(year = as.numeric(year))|>
  group_by(year) |>
  slice_head(n = 20) |> #size can be decreased to speed up the process
  ungroup()

#this removes duplicate rows
song_jobs <- top_20_billboard_1960_2016 |>
  distinct(artist, title, year, .keep_all = TRUE)

sample_song_jobs <- sample_billboard_1960_2016 |>
  distinct(artist, title, year, .keep_all = TRUE)

head(song_jobs)

## # A tibble: 6 × 4
##   no    title                     artist               year
##   <chr> <chr>                     <chr>               <dbl>
## 1 1     Theme from A Summer Place Percy Faith          1960
## 2 2     He'll Have to Go          Jim Reeves           1960
## 3 3     Cathy's Clown             The Everly Brothers  1960
## 4 4     Running Bear              Johnny Preston       1960
## 5 5     Teen Angel                Mark Dinning         1960
## 6 6     I'm Sorry                 Brenda Lee           1960

head(sample_song_jobs)

## # A tibble: 6 × 4
##   no    title                         artist             year
##   <chr> <chr>                         <chr>             <dbl>
## 1 24    Why                           Frankie Avalon     1960
## 2 94    My Little Corner of the World Anita Bryant       1960
## 3 100   Sandy                         Larry Hall         1960
## 4 95    Doggin' Around                Jackie Wilson      1960
## 5 84    Happy-Go-Lucky Me             Paul Evans         1960
## 6 88    O Dio Mio                     Annette Funicello  1960

These dataframes, one for the top 20, and the other for a group of random 20 songs, are tidy since each observation represents a single song on the Billboard Top 100. Each cell contains a single value as well.

Genius API Integration

We tapped into the Genius API using the geniusr package, so that OAuth tokens, request construction, and JSON parsing all happen behind the scenes.
geniusr returns perfectly tidy data frames from nested JSON and even handles token refreshes and rate‑limit retries automatically.
That let us pull song metadata and lyric URLs from Genius with a single function call—no low‑level HTTP or manual JSON wrangling required.

Now let’s add on the top emotion for each song:

#doParallel
num_cores <- parallel::detectCores() - 1  # leave 1 core for safety
cl <- makeCluster(num_cores)
registerDoParallel(cl)


#helper function to get sentiments
get_song_emotions <- function(artist, title, n = 1) {
  words <- get_words_df(artist, title)
  sentiments <- get_song_sentiments(words, artist, title, n)
  return(sentiments)
}

#here we are getting sentiments for each song, and storing it in new df. 
#<<<______this part can take from 8 to 15 mins to complete.______>>>>>>>
songs_with_emotions <- foreach(i = 1:nrow(song_jobs), 
                               .combine = bind_rows, 
                               .packages = c("dplyr", "tibble", "purrr", "stringr", 
                                             "tidytext", "rvest", "httr", "jsonlite")) %dopar% {
  tryCatch({
    artist <- song_jobs$artist[i]
    title  <- song_jobs$title[i]
    year   <- song_jobs$year[i]
    
    sent_df <- get_song_emotions(artist, title, n = 1)
    sent_df$year <- year
    sent_df
  }, error = function(e) {
    message("Failed for ", song_jobs$title[i], " by ", song_jobs$artist[i])
    tibble(artist = song_jobs$artist[i], song = song_jobs$title[i], top_emotion = NA, year = song_jobs$year[i])
  })
                                             }

sample_songs_with_emotions <- foreach(i = 1:nrow(sample_song_jobs), 
                               .combine = bind_rows, 
                               .packages = c("dplyr", "tibble", "purrr", "stringr", 
                                             "tidytext", "rvest", "httr", "jsonlite")) %dopar% {
  tryCatch({
    artist <- sample_song_jobs$artist[i]
    title  <- sample_song_jobs$title[i]
    year   <- sample_song_jobs$year[i]
    
    sent_df <- get_song_emotions(artist, title, n = 1)
    sent_df$year <- year
    sent_df
  }, error = function(e) {
    message("Failed for ", sample_song_jobs$title[i], " by ", sample_song_jobs$artist[i])
    tibble(artist = sample_song_jobs$artist[i], song = sample_song_jobs$title[i], top_emotion = NA, year = sample_song_jobs$year[i])
  })
}


stopCluster(cl)
registerDoSEQ() 

# Final tidy dataframe

songs_with_emotions |>
  filter(is.na(top_emotion))

## # A tibble: 248 × 5
##    top_emotion     n artist                         song                   year
##    <chr>       <int> <chr>                          <chr>                 <dbl>
##  1 <NA>           NA The String-A-Longs             Wheels                 1961
##  2 <NA>           NA Lawrence Welk                  Calcutta               1961
##  3 <NA>           NA The Mar-Keys                   Last Night             1961
##  4 <NA>           NA Ferrante & Teicher             Exodus                 1961
##  5 <NA>           NA Acker Bilk                     Stranger on the Shore  1962
##  6 <NA>           NA David Rose                     The Stripper           1962
##  7 <NA>           NA Chubby Checker & Dee Dee Sharp Slow Twistin'          1962
##  8 <NA>           NA Little Stevie Wonder           Fingertips             1963
##  9 <NA>           NA The Village Stompers           Washington Square      1963
## 10 <NA>           NA Kyu Sakamoto                   Sukiyaki               1963
## # ℹ 238 more rows

sample_songs_with_emotions |>
  filter(is.na(top_emotion))

## # A tibble: 230 × 5
##    top_emotion     n artist                   song                   year
##    <chr>       <int> <chr>                    <chr>                 <dbl>
##  1 <NA>           NA The Ventures             Walk, Don't Run        1960
##  2 <NA>           NA Skip and Flip            Cherry Pie             1960
##  3 <NA>           NA The Dave Brubeck Quartet Take Five              1961
##  4 <NA>           NA The Mar-Keys             Last Night             1961
##  5 <NA>           NA The String-A-Longs       Wheels                 1961
##  6 <NA>           NA Jimmy Smith              Walk on the Wild Side  1962
##  7 <NA>           NA Bent Fabric              Alley Cat              1962
##  8 <NA>           NA Nat King Cole            Ramblin' Rose          1962
##  9 <NA>           NA Booker T. & the M.G.'s   Green Onions           1962
## 10 <NA>           NA Acker Bilk               Stranger on the Shore  1962
## # ℹ 220 more rows

head(songs_with_emotions)

## # A tibble: 6 × 5
##   top_emotion     n artist              song                       year
##   <chr>       <int> <chr>               <chr>                     <dbl>
## 1 joy             9 Percy Faith         Theme from A Summer Place  1960
## 2 joy             8 Jim Reeves          He'll Have to Go           1960
## 3 joy            10 The Everly Brothers Cathy's Clown              1960
## 4 joy            21 Johnny Preston      Running Bear               1960
## 5 joy            20 Mark Dinning        Teen Angel                 1960
## 6 joy             3 Brenda Lee          I'm Sorry                  1960

head(sample_songs_with_emotions)

## # A tibble: 6 × 5
##   top_emotion     n artist            song                           year
##   <chr>       <int> <chr>             <chr>                         <dbl>
## 1 joy            31 Frankie Avalon    Why                            1960
## 2 joy             4 Anita Bryant      My Little Corner of the World  1960
## 3 joy             7 Larry Hall        Sandy                          1960
## 4 joy             3 Jackie Wilson     Doggin' Around                 1960
## 5 joy            33 Paul Evans        Happy-Go-Lucky Me              1960
## 6 joy            10 Annette Funicello O Dio Mio                      1960

We now have tidy data frames where each observation represent a song, and it’s top emotion.

Analysis

Now we can start our analysis. Let’s see how the emotions of popular songs have changed through the years by decade:

songs_with_emotions_grouped_by_decade <- songs_with_emotions |>
  filter(!is.na(top_emotion)) |>
  mutate(year = as.numeric(year)) |>
  group_by(decade = floor(year / 10) * 10, top_emotion)|>
  summarise(n = n(), .groups = "drop")

#charts for each decade
songs_with_emotions_grouped_by_decade |>
  ggplot(aes(x = factor(decade), y = n, fill = top_emotion)) +
  geom_col(position = "dodge") +
  labs(title = "Emotion Trends in Top 20 Billboard Song Lyrics by Decade", x = "Decade", y = "Count")

# Sampled Data
sample_songs_with_emotions_grouped_by_decade <- sample_songs_with_emotions |>
  filter(!is.na(top_emotion)) |>
  mutate(year = as.numeric(year)) |>
  group_by(decade = floor(year / 10) * 10, top_emotion)|>
  summarise(n = n(), .groups = "drop")

#charts for each decade
sample_songs_with_emotions_grouped_by_decade |>
  ggplot(aes(x = factor(decade), y = n, fill = top_emotion)) +
  geom_col(position = "dodge") +
  labs(title = "Emotion Trends in Sampled Billboard Song Lyrics by Decade", x = "Decade", y = "Count")

Top 20 Songs:

Looking at the count bar graph, Joy is still by far the most common top emotion, appearing in nearly 25 songs. Yet, the count gets much small after the 90s. This makes sense because the data we have access to only goes up to 2016, so the 2010 would have less data. For the 2000s decade, it’s possible for some reason we could not find the lyrics to the songs, or a top emotion could not be specified.

Sampled:

Joy is still the top emotion for random sampled data. It looks like sampled data has the same issue.

Let’s check where the missing emotions are:

songs_with_emotions |>
  filter(is.na(top_emotion)) |>
  group_by(decade = floor(year / 10) * 10) |>
  summarise(total = n())

## # A tibble: 6 × 2
##   decade total
##    <dbl> <int>
## 1   1960    31
## 2   1970    19
## 3   1980    28
## 4   1990    34
## 5   2000    80
## 6   2010    56

sample_songs_with_emotions |>
  filter(is.na(top_emotion)) |>
  group_by(decade = floor(year / 10) * 10) |>
  summarise(total = n())

## # A tibble: 6 × 2
##   decade total
##    <dbl> <int>
## 1   1960    33
## 2   1970    31
## 3   1980    17
## 4   1990    31
## 5   2000    66
## 6   2010    52

As you can see, the decade with the most missing emotions is the 2000s decade. If we were doing further analysis with more time, we could pinpoint exactly why this data is missing. The sampled data has a bit less missing values in the 2000s.

Let’s make another percentage graph to look at the proportions:

# Emotion Trends in Billboard Lyrics by Decade
# Focus on percentage
songs_with_emotions_grouped_by_decade_pct <- songs_with_emotions_grouped_by_decade |>
  group_by(decade) |>
  mutate(percent = n / sum(n)) |>
  ungroup()

songs_with_emotions_grouped_by_decade_pct |>
  ggplot(aes(x = factor(decade), y = percent, fill = top_emotion)) +
  geom_col(position = "dodge") +
  labs(title = "Emotion Trends in Top 20 Billboard Song Lyrics by Decade", x = "Decade", y = "Percentage")

# Sampled Data
sample_songs_with_emotions_grouped_by_decade_pct <- sample_songs_with_emotions_grouped_by_decade |>
  group_by(decade) |>
  mutate(percent = n / sum(n)) |>
  ungroup()

sample_songs_with_emotions_grouped_by_decade_pct |>
  ggplot(aes(x = factor(decade), y = percent, fill = top_emotion)) +
  geom_col(position = "dodge") +
  labs(title = "Proportions Of Emotion Trends in Sampled Billboard Song Lyrics by Decade", x = "Decade", y = "Percentage")

Top 20 Songs:

This graph looks a lot more stabilized. This graph still shows Joy as the top emotion for each decade, yet this percentage decreases in the 2000s. We can also see that Anger percentage increases in the 2000s, then drops in the 2010s. This makes one wonder if world events occurring the 2000s changed overall population sentiment, which is shown in the art/music that was released during that time. Joy showed the highest percentage in 90s. Additionally, Trust decreases in the 90s and 2000s, then increases again in the 2010 decade.

Sampled:

The sampled data shows pretty similar results. Anticpation decreases over the decades, and fear has a lower percentage in the 2010s. Trust remains a bit more constant over the decades compared to the top 20 songs.

Now let’s look at overall Positive vs Negative emotions, and how they’ve changed over time:

# Positive vs. Negative Emotional Trends Over Time
songs_with_emotions_polarity <- songs_with_emotions |>
  filter(!is.na(top_emotion)) |>
  mutate(polarity = case_when(
    top_emotion %in% c("joy", "trust", "anticipation") ~ "Positive",
    top_emotion %in% c("sadness", "anger", "fear", "disgust") ~ "Negative",
    TRUE ~ "Neutral"
  ))

songs_with_emotions_polarity |>
  count(year, polarity) |>
  group_by(year) |>
  mutate(prop = n / sum(n)) |>
  ggplot(aes(x = as.numeric(year), y = prop, color = polarity)) +
  geom_line(size = 1.2) +
  labs(title = "Positive vs. Negative Emotional Trends Over Time (Top 20 Songs per Year)",
       x = "Year", y = "Proportion of Songs") +
  theme_minimal(base_size = 14)

## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

# Sampled Data
sample_songs_with_emotions_polarity <- sample_songs_with_emotions |>
  filter(!is.na(top_emotion)) |>
  mutate(polarity = case_when(
    top_emotion %in% c("joy", "trust", "anticipation") ~ "Positive",
    top_emotion %in% c("sadness", "anger", "fear", "disgust") ~ "Negative",
    TRUE ~ "Neutral"
  ))

sample_songs_with_emotions_polarity |>
  count(year, polarity) |>
  group_by(year) |>
  mutate(prop = n / sum(n)) |>
  ggplot(aes(x = as.numeric(year), y = prop, color = polarity)) +
  geom_line(size = 1.2) +
  labs(title = "Positive vs. Negative Emotional Trends Over Time (Sampled Data)",
       x = "Year", y = "Proportion of Songs") +
  theme_minimal(base_size = 14)

Top 20 songs:

Overall there’s no major trend shown in this graph, but we can see in the 2000s positive emotions decreased, while negative emotions increased surpassing positive. This corresponds with our bar graph data. Also, neutral songs do not appear in the 2010s.

Sampled:

Overall the sampled data looks very similar to the top 20 songs.

Extra Analysis Conclusions

This analysis showed us that:

Joy is the most frequent emotion overall
Anticipation consistently takes up the 2nd highest percentage of emotions
Joy decreased while Anger increased in the 2000s
The major change in positive vs negative trend was in the 2000s where negative emotions were higher than positive

DATA607 Final Project

Gillian McGovern, Aaliyah John-Harry, Farhod Ibragimov

2025-05-04

Overview

Motivation

Libraries

Find the Top Artists

Read the Data

Tidy the Data

Filter the Data

Find the Songs for Each Artist

Read the Data

Tidy and Prepare the Data

Marvin Gaye

The Beach Boys

Joni Mitchell

Stevie Wonder

The Beatles

Nirvana

Fleetwood Mac

Prince

Bob Dylan

Lauryn Hill

Creating a Single Dataframe for Artist/Song/Album/Year

Clean the Data

Creating messy SQL database from all_artists_songs and returning it as tidy df

Get the Song Lyrics Using Genius.com API

Analysis

Conclusions

Extra Analysis

Billbord and geniusr libraries to get more songs

Get the Data

Genius API Integration

Analysis

Extra Analysis Conclusions