DATA607 Final Project

Overview

For this project, we want to analyze the emotions of the top Rolling Stone’s artists via sentiment analysis of song lyrics.

Data sources:

Top artists data: * Rolling Stone Album Rankings CSV: https://github.com/rfordatascience/tidytuesday/tree/main/data/2024/2024-05-07

Songs data: * Wikipedia Singles Discography (List of singles, with selected chart positions and certifications chart). Example: https://en.wikipedia.org/wiki/The_Beatles_singles_discography

Song Lyrics data: * Genius API and genius.com: https://docs.genius.com/

Motivation

We chose these data sets so we can:

Identify the emotions of songs by top artists via the lyrics
Explore how emotions of popular songs have changed over time
Explore how emotions of popular songs vary by genre

Additionally, these data sets would give us data preparation experience such as:

Tidying and transformation
API
Web scraping
MySQL

Libraries

library(httr)
library(jsonlite)
library(dplyr)
library(stringr)
library(rvest)
library(janitor)
library(stringr)
library(tidytext)
library(tibble)
library(textdata)
library(tidyr)
library(readr)
library(purrr)
library(forcats)
library(ggplot2)
library(doParallel)
library(foreach)
library(DBI)  
library(RMariaDB)


# Load sentiment lexicons
nrc  <- get_sentiments("nrc")
bing  <- get_sentiments("bing")

# Detect number of cores and create cluster
num_cores <- parallel::detectCores() - 1  # leave one core free
cl <- makeCluster(num_cores)
registerDoParallel(cl)

Find the Top Artists

Read the Data

rs <- read_csv("https://raw.githubusercontent.com/aaliyahmjh/DATA607Project/refs/heads/main/rolling_stone.csv", show_col_types = FALSE)

Tidy the Data

This data set is untidy because it has a wide structure with duplicate columns such as name and 3 separate rank columns that can be combined and referenced with a “year” column:

# Tidy the data
tidy_rs <- rs %>%
  pivot_longer(
    cols = starts_with("rank_"),    
    names_to = "year",               
    values_to = "rank",             
    names_pattern = "rank_(\\d+)" 
  )

tidy_rs

## # A tibble: 2,073 × 20
##    sort_name      clean_name    album      differential release_year genre type 
##    <chr>          <chr>         <chr>             <dbl>        <dbl> <chr> <chr>
##  1 Sinatra, Frank Frank Sinatra In the We…         -182         1955 Big … Stud…
##  2 Sinatra, Frank Frank Sinatra In the We…         -182         1955 Big … Stud…
##  3 Sinatra, Frank Frank Sinatra In the We…         -182         1955 Big … Stud…
##  4 Diddley, Bo    Bo Diddley    Bo Diddle…         -241         1955 Rock… Stud…
##  5 Diddley, Bo    Bo Diddley    Bo Diddle…         -241         1955 Rock… Stud…
##  6 Diddley, Bo    Bo Diddley    Bo Diddle…         -241         1955 Rock… Stud…
##  7 Presley, Elvis Elvis Presley Elvis Pre…         -277         1956 Rock… Stud…
##  8 Presley, Elvis Elvis Presley Elvis Pre…         -277         1956 Rock… Stud…
##  9 Presley, Elvis Elvis Presley Elvis Pre…         -277         1956 Rock… Stud…
## 10 Sinatra, Frank Frank Sinatra Songs for…         -195         1956 Big … Stud…
## # ℹ 2,063 more rows
## # ℹ 13 more variables: weeks_on_billboard <dbl>, peak_billboard_position <dbl>,
## #   spotify_popularity <dbl>, spotify_url <chr>, artist_member_count <dbl>,
## #   artist_gender <chr>, artist_birth_year_sum <dbl>,
## #   debut_album_release_year <dbl>, ave_age_at_top_500 <dbl>,
## #   years_between <dbl>, album_id <chr>, year <chr>, rank <dbl>

Now each observation represents a year making the data set tidy.

Filter the Data

Next we want to filter the data set to only focus on albums ranked in 2020:

filter_rs <- tidy_rs %>%
  select(clean_name, album, genre, year, rank)

# Filter dataset to only focus on albums ranked in 2020
filter_rs <- filter_rs %>%
  filter(year == 2020) %>%
  filter(!is.na(rank))

head(filter_rs)

## # A tibble: 6 × 5
##   clean_name     album                      genre                    year   rank
##   <chr>          <chr>                      <chr>                    <chr> <dbl>
## 1 Frank Sinatra  In the Wee Small Hours     Big Band/Jazz            2020    282
## 2 Bo Diddley     Bo Diddley / Go Bo Diddley Rock n' Roll/Rhythm & B… 2020    455
## 3 Elvis Presley  Elvis Presley              Rock n' Roll/Rhythm & B… 2020    332
## 4 Little Richard Here's Little Richard      <NA>                     2020    227
## 5 Beyonce        Lemonade                   <NA>                     2020     32
## 6 Amy Winehouse  Back to Black              Soul/Gospel/R&B          2020     33

Now we want to grab only the top 10 artists:

# Select the top 10 artists & albums by rank 
top10_by_rank <- filter_rs %>%
  arrange(rank) %>%      # sort ascending
  slice_head(n = 10)      # take the first 10 rows

top10_by_rank

## # A tibble: 10 × 5
##    clean_name     album                           genre              year   rank
##    <chr>          <chr>                           <chr>              <chr> <dbl>
##  1 Marvin Gaye    What's Going On                 Soul/Gospel/R&B    2020      1
##  2 The Beach Boys Pet Sounds                      <NA>               2020      2
##  3 Joni Mitchell  Blue                            Country/Folk/Coun… 2020      3
##  4 Stevie Wonder  Songs in the Key of Life        Soul/Gospel/R&B    2020      4
##  5 The Beatles    Abbey Road                      <NA>               2020      5
##  6 Nirvana        Nevermind                       Indie/Alternative… 2020      6
##  7 Fleetwood Mac  Rumours                         Singer-Songwriter… 2020      7
##  8 Prince         Purple Rain                     <NA>               2020      8
##  9 Bob Dylan      Blood on the Tracks             Country/Folk/Coun… 2020      9
## 10 Lauryn Hill    The Miseducation of Lauryn Hill Hip-Hop/Rap        2020     10

The top artists are:

Marvin Gaye
The Beach Boys
Joni Mitchell
Stevie Wonder
The Beatles
Nirvana
Fleetwood Mac
Prince
Bob Dylan
Lauryn Hill

Find the Songs for Each Artist

Now that we have our artists to focus on, next is grabbing the songs for each artist. We’ll be grabbing this data from Wikipedia for us to get web scraping experience.

Read the Data

Each Wikipedia page is set up a bit differently, so we’re going to get the songs for each artist individually:

# Get the html
html_marvin_gaye <- read_html("https://en.wikipedia.org/wiki/Marvin_Gaye_discography#Singles")
html_beach_boys <- read_html("https://en.wikipedia.org/wiki/The_Beach_Boys_discography#Singles")
html_joni_mitchell <- read_html("https://en.wikipedia.org/wiki/Joni_Mitchell_discography#Singles")
html_stevie_wonder <- read_html("https://en.wikipedia.org/wiki/Stevie_Wonder_discography#Singles")
html_the_beatles <- read_html("https://en.wikipedia.org/wiki/The_Beatles_singles_discography#Singles")
html_nirvana <- read_html("https://en.wikipedia.org/wiki/Nirvana_discography#Singles")
html_fleetwood_mac <- read_html("https://en.wikipedia.org/wiki/Fleetwood_Mac_discography#Singles")
html_prince <- read_html("https://en.wikipedia.org/wiki/Prince_singles_discography#Singles")
html_bob_dylan <- read_html("https://en.wikipedia.org/wiki/Bob_Dylan_discography#Singles")
html_lauryn_hill <- read_html("https://en.wikipedia.org/wiki/Lauryn_Hill_discography#Singles")

Next we select the tables we want from the HTML. Note, some artist’s pages have multiple tables for songs where each table represents a decade. For these artists, we’ll need to gather multiple tables:

# Scrape the songs table from wikipedia
get_songs_df <- function(html, css_selector) {
  songs <- html |> html_elements(css_selector) |>
    html_table()
  return(as.data.frame(songs[[1]]))
}

# Note: these Wikipedia pages can be updated, so it's possible this code won't run in the future (css selector would need to be updated)
marvin_gaye_df <- get_songs_df(html_marvin_gaye, "#mw-content-text > div.mw-content-ltr.mw-parser-output > table:nth-child(30)") # 1960s
marvin_gaye_df2 <- get_songs_df(html_marvin_gaye, "#mw-content-text > div.mw-content-ltr.mw-parser-output > table:nth-child(32)") # 1970 - 1984
marvin_gaye_df3 <- get_songs_df(html_marvin_gaye, "#mw-content-text > div.mw-content-ltr.mw-parser-output > table:nth-child(34)") # Posthumous
beach_boys_df <- get_songs_df(html_beach_boys, "#mw-content-text > div.mw-content-ltr.mw-parser-output > table:nth-child(21)") # 1960s
beach_boys_df2 <- get_songs_df(html_beach_boys, "#mw-content-text > div.mw-content-ltr.mw-parser-output > table:nth-child(23)") # 1970s
beach_boys_df3 <- get_songs_df(html_beach_boys, "#mw-content-text > div.mw-content-ltr.mw-parser-output > table:nth-child(25)") # 1980s
beach_boys_df4 <- get_songs_df(html_beach_boys, "#mw-content-text > div.mw-content-ltr.mw-parser-output > table:nth-child(27)") # 1990s - present
joni_mitchell_df <- get_songs_df(html_joni_mitchell, "#mw-content-text > div.mw-content-ltr.mw-parser-output > table:nth-child(19)")
stevie_wonder_df <- get_songs_df(html_stevie_wonder, "#mw-content-text > div.mw-content-ltr.mw-parser-output > table:nth-child(18)") # 1960s
stevie_wonder_df2 <- get_songs_df(html_stevie_wonder, "#mw-content-text > div.mw-content-ltr.mw-parser-output > table:nth-child(20)") # 1970s
stevie_wonder_df3 <- get_songs_df(html_stevie_wonder, "#mw-content-text > div.mw-content-ltr.mw-parser-output > table:nth-child(22)") # 1980s
stevie_wonder_df4 <- get_songs_df(html_stevie_wonder, "#mw-content-text > div.mw-content-ltr.mw-parser-output > table:nth-child(24)") # 1990s - present
the_beatles_df <- get_songs_df(html_the_beatles, "#mw-content-text > div.mw-content-ltr.mw-parser-output > table:nth-child(10)")
nirvana_df <- get_songs_df(html_nirvana, "#mw-content-text > div.mw-content-ltr.mw-parser-output > table:nth-child(20)")
fleetwood_mac_df <- get_songs_df(html_fleetwood_mac, "#mw-content-text > div.mw-content-ltr.mw-parser-output > table:nth-child(23)")
prince_df <- get_songs_df(html_prince, "#mw-content-text > div.mw-content-ltr.mw-parser-output > table:nth-child(18)") # 1970s & 80s
prince_df2 <- get_songs_df(html_prince, "#mw-content-text > div.mw-content-ltr.mw-parser-output > table:nth-child(20)") # 1990s
prince_df3 <- get_songs_df(html_prince, "#mw-content-text > div.mw-content-ltr.mw-parser-output > table:nth-child(22)") # 2000s
prince_df4 <- get_songs_df(html_prince, "#mw-content-text > div.mw-content-ltr.mw-parser-output > table:nth-child(24)") # 2010s - 20s
bob_dylan_df <- get_songs_df(html_bob_dylan, "#mw-content-text > div.mw-content-ltr.mw-parser-output > table:nth-child(18)")
lauryn_hill_df <- get_songs_df(html_lauryn_hill, "#mw-content-text > div.mw-content-ltr.mw-parser-output > table:nth-child(18)")

Tidy and Prepare the Data

Next we’ll go through each unique dataframe (note, each dataframe is slightly different from each other), and tidy the data. These dataframes are untidy because each chart category is a unique column. Instead there should be a general albums_chart category variable and a rank variable to represent this information.

Marvin Gaye

# Marvin Gaye
## 1960s ##
# Keep the important variables
marvin_gaye_df <- marvin_gaye_df[, 1:11]

# First row is actually the column names
marvin_gaye_df <- marvin_gaye_df |>
  row_to_names(row_number = 1)

# Clean up the names
marvin_gaye_df_clean <- marvin_gaye_df %>%
  clean_names()
colnames(marvin_gaye_df_clean)[4:9] <- c("us", "us_r_and_b", "can", "ire", "ned", "uk")

# Make tidy - make the albums chart it's own variable
marvin_gaye_df_clean <- marvin_gaye_df_clean |>
  pivot_longer(cols = us:uk, names_to = "albums_chart", values_to = "peak_chart_position")

# Clean up
marvin_gaye_df_clean <- marvin_gaye_df_clean |>
  filter(title != '"—" denotes items that did not chart or were not released in that territory.')
marvin_gaye_df_clean$title <- str_extract(marvin_gaye_df_clean$title , "^(.).*\\1")
head(marvin_gaye_df_clean)

## # A tibble: 6 × 7
##   year  title       b_side certifications album albums_chart peak_chart_position
##   <chr> <chr>       <chr>  <chr>          <chr> <chr>        <chr>              
## 1 1961  "\"Let You… "\"Ne… ""             The … us           —                  
## 2 1961  "\"Let You… "\"Ne… ""             The … us_r_and_b   —                  
## 3 1961  "\"Let You… "\"Ne… ""             The … can          —                  
## 4 1961  "\"Let You… "\"Ne… ""             The … ire          —                  
## 5 1961  "\"Let You… "\"Ne… ""             The … ned          —                  
## 6 1961  "\"Let You… "\"Ne… ""             The … uk           —

## 1970 - 1984 ##
# First row is actually the column names
marvin_gaye_df2 <- marvin_gaye_df2 |>
  row_to_names(row_number = 1)

# Clean up the names
marvin_gaye_df2_clean <- marvin_gaye_df2 %>%
  clean_names()
colnames(marvin_gaye_df2_clean)[4:12] <- c("us", "us_r_and_b", "aus", "can", "ger", "ire", "ned", "swe", "uk")

# Make tidy - make the albums chart it's own variable
marvin_gaye_df2_clean <- marvin_gaye_df2_clean |>
  pivot_longer(cols = us:uk, names_to = "albums_chart", values_to = "peak_chart_position")
head(marvin_gaye_df2_clean)

## # A tibble: 6 × 8
##   year  title b_side certifications album na    albums_chart peak_chart_position
##   <chr> <chr> <chr>  <chr>          <chr> <chr> <chr>        <chr>              
## 1 1970  "\"H… "\"Go… ""             That… <NA>  us           41                 
## 2 1970  "\"H… "\"Go… ""             That… <NA>  us_r_and_b   18                 
## 3 1970  "\"H… "\"Go… ""             That… <NA>  aus          37                 
## 4 1970  "\"H… "\"Go… ""             That… <NA>  can          37                 
## 5 1970  "\"H… "\"Go… ""             That… <NA>  ger          —                  
## 6 1970  "\"H… "\"Go… ""             That… <NA>  ire          —

# Clean up
marvin_gaye_df2_clean <- marvin_gaye_df2_clean |>
  filter(title != '"—" denotes items that did not chart or were not released in that territory.')
marvin_gaye_df2_clean$title <- str_extract(marvin_gaye_df2_clean$title , "^(.).*\\1")

## Posthumous ##
# First row is actually the column names
marvin_gaye_df3 <- marvin_gaye_df3 |>
  row_to_names(row_number = 1)

# Clean up the names
marvin_gaye_df3_clean <- marvin_gaye_df3 %>%
  clean_names()
colnames(marvin_gaye_df3_clean)[4:12] <- c("us", "us_r_and_b", "aus", "ger", "ire", "ned", "sco", "swe", "uk")

# Make tidy - make the albums chart it's own variable
marvin_gaye_df3_clean <- marvin_gaye_df3_clean |>
  pivot_longer(cols = us:uk, names_to = "albums_chart", values_to = "peak_chart_position")

# Clean up
marvin_gaye_df3_clean <- marvin_gaye_df3_clean |>
  filter(title != '"—" denotes items that did not chart or were not released in that territory. "N/A" indicates chart not yet published.')
marvin_gaye_df3_clean$title <- str_extract(marvin_gaye_df3_clean$title , "^(.).*\\1")

head(marvin_gaye_df3_clean)

## # A tibble: 6 × 8
##   year  title b_side certifications album na    albums_chart peak_chart_position
##   <chr> <chr> <chr>  <chr>          <chr> <chr> <chr>        <chr>              
## 1 1985  "\"S… "\"Sa… ""             Drea… <NA>  us           [K]                
## 2 1985  "\"S… "\"Sa… ""             Drea… <NA>  us_r_and_b   2                  
## 3 1985  "\"S… "\"Sa… ""             Drea… <NA>  aus          —                  
## 4 1985  "\"S… "\"Sa… ""             Drea… <NA>  ger          —                  
## 5 1985  "\"S… "\"Sa… ""             Drea… <NA>  ire          —                  
## 6 1985  "\"S… "\"Sa… ""             Drea… <NA>  ned          —

The Beach Boys

# The Beach Boys

## 1960s ##
beach_boys_df <- beach_boys_df |>
  row_to_names(row_number = 1)

# Clean up the names
beach_boys_df_clean <- beach_boys_df %>%
  clean_names()
colnames(beach_boys_df_clean)[2:12] <- c("title", "us", "us_cash_box", "us_record_world", "aus", "can", "nl", "nor", "swe", "uk", "phi")

# Make tidy - make the albums chart it's own variable
beach_boys_df_clean <- beach_boys_df_clean |>
  pivot_longer(cols = us:phi, names_to = "albums_chart", values_to = "peak_chart_position")

# A side and B side single appear in the same cell in single variable, so we need to break them out into individual observations
beach_boys_df_clean <- separate_rows(beach_boys_df_clean, title, sep = '""')
beach_boys_df_clean

## # A tibble: 710 × 6
##    year  title     certifications_sales…¹ album albums_chart peak_chart_position
##    <chr> <chr>     <chr>                  <chr> <chr>        <chr>              
##  1 1961  "\"Surfi… —                      Surf… us           75—                
##  2 1961  "Luau\" … —                      Surf… us           75—                
##  3 1961  "\"Surfi… —                      Surf… us_cash_box  85—                
##  4 1961  "Luau\" … —                      Surf… us_cash_box  85—                
##  5 1961  "\"Surfi… —                      Surf… us_record_w… ——                 
##  6 1961  "Luau\" … —                      Surf… us_record_w… ——                 
##  7 1961  "\"Surfi… —                      Surf… aus          ——                 
##  8 1961  "Luau\" … —                      Surf… aus          ——                 
##  9 1961  "\"Surfi… —                      Surf… can          ——                 
## 10 1961  "Luau\" … —                      Surf… can          ——                 
## # ℹ 700 more rows
## # ℹ abbreviated name: ¹certifications_sales_thresholds

# Clean up
beach_boys_df_clean <- beach_boys_df_clean |>
  filter(title != '"—" denotes a release that did not chart, was not released in the country or the information is unknown (* - US Record World chart data incomplete for early 1964)')
beach_boys_df_clean$title <- str_extract(beach_boys_df_clean$title , "^(.).*\\1")

## 1970s ##
beach_boys_df2 <- beach_boys_df2 |>
  row_to_names(row_number = 1)

# Clean up the names
beach_boys_df2_clean <- beach_boys_df2 %>%
  clean_names()
colnames(beach_boys_df2_clean)[2:12] <- c("title", "us", "us_cash_box", "us_record_world", "us_ac", "aus", "can", "nl", "nor", "swe", "uk")

# Make tidy - make the albums chart it's own variable
beach_boys_df2_clean <- beach_boys_df2_clean |>
  pivot_longer(cols = us:uk, names_to = "albums_chart", values_to = "peak_chart_position")

# A side and B side single appear in the same cell in single variable, so we need to break them out into individual observations
beach_boys_df2_clean <- separate_rows(beach_boys_df2_clean, title, sep = '""')

## 1980s ##
beach_boys_df3 <- beach_boys_df3 |>
  row_to_names(row_number = 1)

# Clean up the names
beach_boys_df3_clean <- beach_boys_df3 %>%
  clean_names()
colnames(beach_boys_df3_clean)[2:9] <- c("title", "us", "us_cash_box", "us_ac", "aus", "can", "ger", "uk")

# Make tidy - make the albums chart it's own variable
beach_boys_df3_clean <- beach_boys_df3_clean |>
  pivot_longer(cols = us:uk, names_to = "albums_chart", values_to = "peak_chart_position")

# A side and B side single appear in the same cell in single variable, so we need to break them out into individual observations
beach_boys_df3_clean <- separate_rows(beach_boys_df3_clean, title, sep = '""')
beach_boys_df3_clean <- separate_rows(beach_boys_df3_clean, title, sep = '" "')

## 1990s - Present
beach_boys_df4 <- beach_boys_df4 |>
  row_to_names(row_number = 1)

# Clean up the names
beach_boys_df4_clean <- beach_boys_df4 %>%
  clean_names()
colnames(beach_boys_df4_clean)[2:10] <- c("title", "us", "us_cash_box", "us_ac", "aus", "can", "ger", "uk", "us_cou")

# Make tidy - make the albums chart it's own variable
beach_boys_df4_clean <- beach_boys_df4_clean |>
  pivot_longer(cols = us:us_cou, names_to = "albums_chart", values_to = "peak_chart_position")

# A side and B side single appear in the same cell in single variable, so we need to break them out into individual observations
beach_boys_df4_clean <- separate_rows(beach_boys_df4_clean, title, sep = '""')
beach_boys_df4_clean <- separate_rows(beach_boys_df4_clean, title, sep = '" "')

the_beach_boys_df_clean <- beach_boys_df_clean
the_beach_boys_df2_clean <- beach_boys_df2_clean
the_beach_boys_df3_clean <- beach_boys_df3_clean
the_beach_boys_df4_clean <- beach_boys_df4_clean
head(the_beach_boys_df_clean)

## # A tibble: 6 × 6
##   year  title certifications_sales_thre…¹ album albums_chart peak_chart_position
##   <chr> <chr> <chr>                       <chr> <chr>        <chr>              
## 1 1961  <NA>  —                           Surf… us           75—                
## 2 1961  <NA>  —                           Surf… us           75—                
## 3 1961  <NA>  —                           Surf… us_cash_box  85—                
## 4 1961  <NA>  —                           Surf… us_cash_box  85—                
## 5 1961  <NA>  —                           Surf… us_record_w… ——                 
## 6 1961  <NA>  —                           Surf… us_record_w… ——                 
## # ℹ abbreviated name: ¹certifications_sales_thresholds

head(the_beach_boys_df2_clean)

## # A tibble: 6 × 5
##   year  title                             album albums_chart peak_chart_position
##   <chr> <chr>                             <chr> <chr>        <chr>              
## 1 1970  "\"Add Some Music to Your Day"    Sunf… us           64—                
## 2 1970  "Susie Cincinnati\" (non-album B… Sunf… us           64—                
## 3 1970  "\"Add Some Music to Your Day"    Sunf… us_cash_box  49—                
## 4 1970  "Susie Cincinnati\" (non-album B… Sunf… us_cash_box  49—                
## 5 1970  "\"Add Some Music to Your Day"    Sunf… us_record_w… 39—                
## 6 1970  "Susie Cincinnati\" (non-album B… Sunf… us_record_w… 39—

head(the_beach_boys_df3_clean)

## # A tibble: 6 × 6
##   year  title              certifications album albums_chart peak_chart_position
##   <chr> <chr>              <chr>          <chr> <chr>        <chr>              
## 1 1980  "\"Goin' On"       —              Keep… us           83—                
## 2 1980  "Endless Harmony\… —              Keep… us           83—                
## 3 1980  "\"Goin' On"       —              Keep… us_cash_box  103—               
## 4 1980  "Endless Harmony\… —              Keep… us_cash_box  103—               
## 5 1980  "\"Goin' On"       —              Keep… us_ac        ——                 
## 6 1980  "Endless Harmony\… —              Keep… us_ac        ——

head(the_beach_boys_df4_clean)

## # A tibble: 6 × 6
##   year  title              certifications album albums_chart peak_chart_position
##   <chr> <chr>              <chr>          <chr> <chr>        <chr>              
## 1 1990  "\"Somewhere Near… —              Stil… us           ——                 
## 2 1990  "Kokomo\""         —              Stil… us           ——                 
## 3 1990  "\"Somewhere Near… —              Stil… us_cash_box  ——                 
## 4 1990  "Kokomo\""         —              Stil… us_cash_box  ——                 
## 5 1990  "\"Somewhere Near… —              Stil… us_ac        ——                 
## 6 1990  "Kokomo\""         —              Stil… us_ac        ——

Joni Mitchell

# Joni Mitchell
joni_mitchell_df <- joni_mitchell_df |>
  row_to_names(row_number = 1)

# Clean up the variable names
joni_mitchell_df_clean <- joni_mitchell_df %>%
  clean_names()
colnames(joni_mitchell_df_clean)[3:9] <- c("can", "can_uc", "aus", "uk", "us", "us_ac", "us_main")

# Make tidy - make the albums chart it's own variable
joni_mitchell_df_clean <- joni_mitchell_df_clean |>
  pivot_longer(cols = can:us_main, names_to = "albums_chart", values_to = "peak_chart_position")

joni_mitchell_df_clean <- joni_mitchell_df_clean |>
  filter(title != '"—" denotes releases that did not chart')

# Clean up song names
joni_mitchell_df_clean$title <- str_extract(joni_mitchell_df_clean$title , "^(.).*\\1")
head(joni_mitchell_df_clean)

## # A tibble: 6 × 6
##   title              year  certifications album albums_chart peak_chart_position
##   <chr>              <chr> <chr>          <chr> <chr>        <chr>              
## 1 "\"Night in the C… 1968  ""             Song… can          —                  
## 2 "\"Night in the C… 1968  ""             Song… can_uc       —                  
## 3 "\"Night in the C… 1968  ""             Song… aus          —                  
## 4 "\"Night in the C… 1968  ""             Song… uk           —                  
## 5 "\"Night in the C… 1968  ""             Song… us           —                  
## 6 "\"Night in the C… 1968  ""             Song… us_ac        —

Stevie Wonder

# Stevie Wonder
stevie_wonder_df <- stevie_wonder_df |>
  row_to_names(row_number = 1)

## Warning: Row 1 does not provide unique names. Consider running clean_names()
## after row_to_names().

# Clean up the names
stevie_wonder_df_clean <- stevie_wonder_df %>%
  clean_names()
colnames(stevie_wonder_df_clean)[1:11] <- c("title", "year", "us", "us_r_and_b", "us_ac", "bel", "can", "ger", "ire", "swi", "uk")

# Make tidy - make the albums chart it's own variable
stevie_wonder_df_clean <- stevie_wonder_df_clean |>
  pivot_longer(cols = us:uk, names_to = "albums_chart", values_to = "peak_chart_position")

# A side and B side single appear in the same cell in single variable, so we need to break them out into individual observations
stevie_wonder_df_clean <- separate_rows(stevie_wonder_df_clean, title, sep = 'b/w')
head(stevie_wonder_df_clean)

## # A tibble: 6 × 10
##   title         year  swi_39 uk_40 certifications album na    na_2  albums_chart
##   <chr>         <chr> <chr>  <chr> <chr>          <chr> <chr> <chr> <chr>       
## 1 "\"I Call It… 1962  —      —     ""             Non-… <NA>  <NA>  us          
## 2 " Part 2"     1962  —      —     ""             Non-… <NA>  <NA>  us          
## 3 "\"I Call It… 1962  —      —     ""             Non-… <NA>  <NA>  us_r_and_b  
## 4 " Part 2"     1962  —      —     ""             Non-… <NA>  <NA>  us_r_and_b  
## 5 "\"I Call It… 1962  —      —     ""             Non-… <NA>  <NA>  us_ac       
## 6 " Part 2"     1962  —      —     ""             Non-… <NA>  <NA>  us_ac       
## # ℹ 1 more variable: peak_chart_position <chr>

## 1970s ##
stevie_wonder_df2 <- stevie_wonder_df2 |>
  row_to_names(row_number = 1)

# Clean up the names
stevie_wonder_df2_clean <- stevie_wonder_df2 %>%
  clean_names()
colnames(stevie_wonder_df2_clean)[1:12] <- c("title", "year", "us", "us_r_and_b", "us_ac", "aus", "bel", "can", "ger", "ire", "swi", "uk")

# Make tidy - make the albums chart it's own variable
stevie_wonder_df2_clean <- stevie_wonder_df2_clean |>
  pivot_longer(cols = us:uk, names_to = "albums_chart", values_to = "peak_chart_position")

# A side and B side single appear in the same cell in single variable, so we need to break them out into individual observations
stevie_wonder_df2_clean <- separate_rows(stevie_wonder_df2_clean, title, sep = 'b/w')
head(stevie_wonder_df2_clean)

## # A tibble: 6 × 8
##   title  year  uk_40 certifications album na    albums_chart peak_chart_position
##   <chr>  <chr> <chr> <chr>          <chr> <chr> <chr>        <chr>              
## 1 "\"Ne… 1970  6     ""             Sign… <NA>  us           26                 
## 2 " \"S… 1970  6     ""             Sign… <NA>  us           26                 
## 3 "\"Ne… 1970  6     ""             Sign… <NA>  us_r_and_b   11                 
## 4 " \"S… 1970  6     ""             Sign… <NA>  us_r_and_b   11                 
## 5 "\"Ne… 1970  6     ""             Sign… <NA>  us_ac        31                 
## 6 " \"S… 1970  6     ""             Sign… <NA>  us_ac        31

## 1980s ##
stevie_wonder_df3 <- stevie_wonder_df3 |>
  row_to_names(row_number = 1)

# Clean up the names
stevie_wonder_df3_clean <- stevie_wonder_df3 %>%
  clean_names()
colnames(stevie_wonder_df3_clean)[1:12] <- c("title", "year", "us", "us_r_and_b", "us_ac", "aus", "bel", "can", "ger", "ire", "swi", "uk")

# Make tidy - make the albums chart it's own variable
stevie_wonder_df3_clean <- stevie_wonder_df3_clean |>
  pivot_longer(cols = us:uk, names_to = "albums_chart", values_to = "peak_chart_position")

# A side and B side single appear in the same cell in single variable, so we need to break them out into individual observations
stevie_wonder_df3_clean <- separate_rows(stevie_wonder_df3_clean, title, sep = 'b/w')
head(stevie_wonder_df3_clean)

## # A tibble: 6 × 9
##   title               year  swi_39 uk_40 certifications album na    albums_chart
##   <chr>               <chr> <chr>  <chr> <chr>          <chr> <chr> <chr>       
## 1 "\"Black Orchid\""  1980  —      63    ""             "Ste… <NA>  us          
## 2 " \"Blame It on th… 1980  —      63    ""             "Ste… <NA>  us          
## 3 "\"Black Orchid\""  1980  —      63    ""             "Ste… <NA>  us_r_and_b  
## 4 " \"Blame It on th… 1980  —      63    ""             "Ste… <NA>  us_r_and_b  
## 5 "\"Black Orchid\""  1980  —      63    ""             "Ste… <NA>  us_ac       
## 6 " \"Blame It on th… 1980  —      63    ""             "Ste… <NA>  us_ac       
## # ℹ 1 more variable: peak_chart_position <chr>

## 1990s - present ##
stevie_wonder_df4 <- stevie_wonder_df4 |>
  row_to_names(row_number = 1)

## Warning: Row 1 does not provide unique names. Consider running clean_names()
## after row_to_names().

# Clean up the names
stevie_wonder_df4_clean <- stevie_wonder_df4 %>%
  clean_names()
colnames(stevie_wonder_df4_clean)[1:11] <- c("title", "year", "us", "us_r_and_b", "us_ac", "bel", "can", "ger", "ire", "swi", "uk")

# Make tidy - make the albums chart it's own variable
stevie_wonder_df4_clean <- stevie_wonder_df4_clean |>
  pivot_longer(cols = us:uk, names_to = "albums_chart", values_to = "peak_chart_position")

# A side and B side single appear in the same cell in single variable, so we need to break them out into individual observations
stevie_wonder_df4_clean <- separate_rows(stevie_wonder_df4_clean, title, sep = 'b/w')
head(stevie_wonder_df4_clean)

## # A tibble: 6 × 9
##   title     year  uk_40 album na    na_2  na_3  albums_chart peak_chart_position
##   <chr>     <chr> <chr> <chr> <chr> <chr> <chr> <chr>        <chr>              
## 1 "\"Keep … 1990  77    Non-… <NA>  <NA>  <NA>  us           —                  
## 2 " Instru… 1990  77    Non-… <NA>  <NA>  <NA>  us           —                  
## 3 "\"Keep … 1990  77    Non-… <NA>  <NA>  <NA>  us_r_and_b   24                 
## 4 " Instru… 1990  77    Non-… <NA>  <NA>  <NA>  us_r_and_b   24                 
## 5 "\"Keep … 1990  77    Non-… <NA>  <NA>  <NA>  us_ac        —                  
## 6 " Instru… 1990  77    Non-… <NA>  <NA>  <NA>  us_ac        —

The Beatles

# The Beatles
the_beatles_df <- the_beatles_df |>
  row_to_names(row_number = 1)

## Warning: Row 1 does not provide unique names. Consider running clean_names()
## after row_to_names().

# Clean up the names
the_beatles_df_clean <- the_beatles_df %>%
  clean_names()
colnames(the_beatles_df_clean)[1:15] <- c("title", "year", "uk", "aus", "aut", "bel", "can", "ger", "nld", "nor", "nz", "swi", "us", "us_cash_box", "us_record_world")

# Make tidy - make the albums chart it's own variable
the_beatles_df_clean <- the_beatles_df_clean |>
  pivot_longer(cols = uk:us_record_world, names_to = "albums_chart", values_to = "peak_chart_position")

# A side and B side single appear in the same cell in single variable, so we need to break them out into individual observations
the_beatles_df_clean <- separate_rows(the_beatles_df_clean, title, sep = '""')
head(the_beatles_df_clean)

## # A tibble: 6 × 9
##   title          year  certifications uk_album us_album na    na_2  albums_chart
##   <chr>          <chr> <chr>          <chr>    <chr>    <chr> <chr> <chr>       
## 1 "\"My Bonnie"  1962  ""             My Bonn… ""       <NA>  <NA>  uk          
## 2 "The Saints\"… 1962  ""             My Bonn… ""       <NA>  <NA>  uk          
## 3 "\"My Bonnie"  1962  ""             My Bonn… ""       <NA>  <NA>  aus         
## 4 "The Saints\"… 1962  ""             My Bonn… ""       <NA>  <NA>  aus         
## 5 "\"My Bonnie"  1962  ""             My Bonn… ""       <NA>  <NA>  aut         
## 6 "The Saints\"… 1962  ""             My Bonn… ""       <NA>  <NA>  aut         
## # ℹ 1 more variable: peak_chart_position <chr>

Nirvana

# Nirvana
nirvana_df <- nirvana_df |>
  row_to_names(row_number = 1)

# Clean up the names
nirvana_df_clean <- nirvana_df %>%
  clean_names()
colnames(nirvana_df_clean)[3:12] <- c("us", "aus", "bel", "fin", "fra", "ire", "nz", "prt", "swe", "uk")

# Make tidy - make the albums chart it's own variable
nirvana_df_clean <- nirvana_df_clean |>
  pivot_longer(cols = us:uk, names_to = "albums_chart", values_to = "peak_chart_position")
head(nirvana_df_clean)

## # A tibble: 6 × 6
##   title              year  certifications album albums_chart peak_chart_position
##   <chr>              <chr> <chr>          <chr> <chr>        <chr>              
## 1 "\"Love Buzz\"[A]" 1988  ""             Blea… us           —                  
## 2 "\"Love Buzz\"[A]" 1988  ""             Blea… aus          —                  
## 3 "\"Love Buzz\"[A]" 1988  ""             Blea… bel          —                  
## 4 "\"Love Buzz\"[A]" 1988  ""             Blea… fin          —                  
## 5 "\"Love Buzz\"[A]" 1988  ""             Blea… fra          —                  
## 6 "\"Love Buzz\"[A]" 1988  ""             Blea… ire          —

Fleetwood Mac

# Fleetwood Mac
fleetwood_mac_df <- fleetwood_mac_df |>
  row_to_names(row_number = 1)

# Clean up the names
fleetwood_mac_df_clean <- fleetwood_mac_df %>%
  clean_names()
colnames(fleetwood_mac_df_clean)[3:13] <- c("uk", "us", "us_rock", "us_ac", "aus", "can", "ger", "irl", "nl", "nz")

## Warning in colnames(fleetwood_mac_df_clean)[3:13] <- c("uk", "us", "us_rock", :
## number of items to replace is not a multiple of replacement length

# Make tidy - make the albums chart it's own variable
fleetwood_mac_df_clean <- fleetwood_mac_df_clean |>
  pivot_longer(cols = uk:nz, names_to = "albums_chart", values_to = "peak_chart_position")
head(fleetwood_mac_df_clean)

## # A tibble: 6 × 5
##   title                             year  album albums_chart peak_chart_position
##   <chr>                             <chr> <chr> <chr>        <chr>              
## 1 "\"I Believe My Time Ain't Long\… 1967  Non-… uk           "—"                
## 2 "\"I Believe My Time Ain't Long\… 1967  Non-… us           ""                 
## 3 "\"I Believe My Time Ain't Long\… 1967  Non-… us_rock      ""                 
## 4 "\"I Believe My Time Ain't Long\… 1967  Non-… us_ac        "—"                
## 5 "\"I Believe My Time Ain't Long\… 1967  Non-… aus          ""                 
## 6 "\"I Believe My Time Ain't Long\… 1967  Non-… can          ""

Prince

# Prince

## 70s & 80s ##
prince_df <- prince_df |>
  row_to_names(row_number = 1)

# Clean up the names
prince_df_clean <- prince_df %>%
  clean_names()
colnames(prince_df_clean)[3:7] <- c("us", "us_r_and_b", "us_dance", "nld_tip", "uk")

# Make tidy - make the albums chart it's own variable
prince_df_clean <- prince_df_clean |>
  pivot_longer(cols = us:uk, names_to = "albums_chart", values_to = "peak_chart_position")
head(prince_df_clean)

## # A tibble: 6 × 11
##   title year  ger_17 nz_18 swe_19 swi_20 uk_21 certifications album albums_chart
##   <chr> <chr> <chr>  <chr> <chr>  <chr>  <chr> <chr>          <chr> <chr>       
## 1 "\"S… 1978  —      —     —      —      —     ""             For … us          
## 2 "\"S… 1978  —      —     —      —      —     ""             For … us_r_and_b  
## 3 "\"S… 1978  —      —     —      —      —     ""             For … us_dance    
## 4 "\"S… 1978  —      —     —      —      —     ""             For … nld_tip     
## 5 "\"S… 1978  —      —     —      —      —     ""             For … uk          
## 6 "\"J… 1978  —      —     —      —      —     ""             For … us          
## # ℹ 1 more variable: peak_chart_position <chr>

## 1990s ##
prince_df2 <- prince_df2 |>
  row_to_names(row_number = 1)

# Clean up the names
prince_df2_clean <- prince_df2 %>%
  clean_names()
colnames(prince_df2_clean)[3:12] <- c("us", "us_r_and_b", "aus", "can", "fra", "ger", "nz", "swe", "swi", "uk")

# Make tidy - make the albums chart it's own variable
prince_df2_clean <- prince_df2_clean |>
  pivot_longer(cols = us:uk, names_to = "albums_chart", values_to = "peak_chart_position")
head(prince_df2_clean)

## # A tibble: 6 × 6
##   title              year  certifications album albums_chart peak_chart_position
##   <chr>              <chr> <chr>          <chr> <chr>        <chr>              
## 1 "\"The Future\" (… 1990  ""             Batm… us           —                  
## 2 "\"The Future\" (… 1990  ""             Batm… us_r_and_b   —                  
## 3 "\"The Future\" (… 1990  ""             Batm… aus          —                  
## 4 "\"The Future\" (… 1990  ""             Batm… can          —                  
## 5 "\"The Future\" (… 1990  ""             Batm… fra          —                  
## 6 "\"The Future\" (… 1990  ""             Batm… ger          39

## 2000s ##
prince_df3 <- prince_df3 |>
  row_to_names(row_number = 1)

## Warning: Row 1 does not provide unique names. Consider running clean_names()
## after row_to_names().

# Clean up the names
prince_df3_clean <- prince_df3 %>%
  clean_names()
colnames(prince_df3_clean)[3:11] <- c("us", "us_r_and_b", "aus", "ger", "ire", "nld", "nor", "swi", "uk")

# Make tidy - make the albums chart it's own variable
prince_df3_clean <- prince_df3_clean |>
  pivot_longer(cols = us:uk, names_to = "albums_chart", values_to = "peak_chart_position")
head(prince_df3_clean)

## # A tibble: 6 × 7
##   title                 year  album na    na_2  albums_chart peak_chart_position
##   <chr>                 <chr> <chr> <chr> <chr> <chr>        <chr>              
## 1 "\"U Make My Sun Shi… 2000  The … <NA>  <NA>  us           —[G]               
## 2 "\"U Make My Sun Shi… 2000  The … <NA>  <NA>  us_r_and_b   108                
## 3 "\"U Make My Sun Shi… 2000  The … <NA>  <NA>  aus          —                  
## 4 "\"U Make My Sun Shi… 2000  The … <NA>  <NA>  ger          —                  
## 5 "\"U Make My Sun Shi… 2000  The … <NA>  <NA>  ire          —                  
## 6 "\"U Make My Sun Shi… 2000  The … <NA>  <NA>  nld          —

## 2010s - present ##
prince_df4 <- prince_df4 |>
  row_to_names(row_number = 1)

# Clean up the names
prince_df4_clean <- prince_df4 %>%
  clean_names()
colnames(prince_df4_clean)[3:7] <- c("us_adult_r_and_b", "us_r_and_b", "us_r_and_b_digital", "us_r_and_b_sales", "uk")

# Make tidy - make the albums chart it's own variable
prince_df4_clean <- prince_df4_clean |>
  pivot_longer(cols = us_adult_r_and_b:uk, names_to = "albums_chart", values_to = "peak_chart_position")

# Wikipedia added a "Posthumous" section within the same table, so need to remove that row to make the data frame tidy
prince_df4_clean <- prince_df4_clean |>
  filter(title != "Posthumous", title != '"—" denotes a single that was not released or did not chart in the region.')
head(prince_df4_clean)

## # A tibble: 6 × 5
##   title                             year  album albums_chart peak_chart_position
##   <chr>                             <chr> <chr> <chr>        <chr>              
## 1 "\"Extraloveable\" (featuring An… 2011  Hit … us_adult_r_… —                  
## 2 "\"Extraloveable\" (featuring An… 2011  Hit … us_r_and_b   —                  
## 3 "\"Extraloveable\" (featuring An… 2011  Hit … us_r_and_b_… —                  
## 4 "\"Extraloveable\" (featuring An… 2011  Hit … us_r_and_b_… —                  
## 5 "\"Extraloveable\" (featuring An… 2011  Hit … uk           —                  
## 6 "\"Rock and Roll Love Affair\" [… 2012  Hit … us_adult_r_… —[J]

Bob Dylan

# Bob Dylan
bob_dylan_df <- bob_dylan_df |>
  row_to_names(row_number = 1)

# Clean up the names
bob_dylan_df_clean <- bob_dylan_df %>%
  clean_names()
colnames(bob_dylan_df_clean)[2:9] <- c("title", "us", "us_main", "us_cash", "aus", "ire", "nl", "uk")

# Make tidy - make the albums chart it's own variable
bob_dylan_df_clean <- bob_dylan_df_clean |>
  pivot_longer(cols = us:uk, names_to = "albums_chart", values_to = "peak_chart_position")

# A side and B side single appear in the same cell in single variable, so we need to break them out into individual observations
bob_dylan_df_clean <- separate_rows(bob_dylan_df_clean, title, sep = 'b/w')
head(bob_dylan_df_clean)

## # A tibble: 6 × 6
##   year  title              certifications album albums_chart peak_chart_position
##   <chr> <chr>              <chr>          <chr> <chr>        <chr>              
## 1 1962  "\"Mixed-Up Confu… ""             Non-… us           —                  
## 2 1962  " \"Corrina Corri… ""             Non-… us           —                  
## 3 1962  "\"Mixed-Up Confu… ""             Non-… us_main      —                  
## 4 1962  " \"Corrina Corri… ""             Non-… us_main      —                  
## 5 1962  "\"Mixed-Up Confu… ""             Non-… us_cash      —                  
## 6 1962  " \"Corrina Corri… ""             Non-… us_cash      —

Lauryn Hill

# Lauryn Hill
lauryn_hill_df <- lauryn_hill_df |>
  row_to_names(row_number = 1)

# Clean up the names
lauryn_hill_df_clean <- lauryn_hill_df %>%
  clean_names()
colnames(lauryn_hill_df_clean)[3:12] <- c("us", "us_r_and_b", "us_rhy", "aus", "fra", "nl", "nz", "swe", "swi", "uk")

# Make tidy - make the albums chart it's own variable
lauryn_hill_df_clean <- lauryn_hill_df_clean |>
  pivot_longer(cols = us:uk, names_to = "albums_chart", values_to = "peak_chart_position")
head(lauryn_hill_df_clean)

## # A tibble: 6 × 7
##   title        year  certifications album na    albums_chart peak_chart_position
##   <chr>        <chr> <chr>          <chr> <chr> <chr>        <chr>              
## 1 "\"Doo Wop … 1998  "RIAA: Gold[3… The … <NA>  us           1                  
## 2 "\"Doo Wop … 1998  "RIAA: Gold[3… The … <NA>  us_r_and_b   2                  
## 3 "\"Doo Wop … 1998  "RIAA: Gold[3… The … <NA>  us_rhy       1                  
## 4 "\"Doo Wop … 1998  "RIAA: Gold[3… The … <NA>  aus          35                 
## 5 "\"Doo Wop … 1998  "RIAA: Gold[3… The … <NA>  fra          23                 
## 6 "\"Doo Wop … 1998  "RIAA: Gold[3… The … <NA>  nl           4

We now have tidy song dataframe for each artist.

Creating a Single Dataframe for Artist/Song/Album/Year

Now let’s combine all the dataframes into one called artist_dfs:

artist_dfs <- c(
  # Marvin Gaye
  "marvin_gaye_df_clean",
  "marvin_gaye_df2_clean",
  "marvin_gaye_df3_clean",
  
  # The Beach Boys
  "the_beach_boys_df_clean",
  "the_beach_boys_df2_clean",
  "the_beach_boys_df3_clean",
  "the_beach_boys_df4_clean",  
  
  
  # Joni Mitchell
  "joni_mitchell_df_clean",
  
  # Stevie Wonder
  "stevie_wonder_df_clean",
  "stevie_wonder_df2_clean",
  "stevie_wonder_df3_clean",
  "stevie_wonder_df4_clean",
  
  # The Beatles
  "the_beatles_df_clean",
  
  # Nirvana
  "nirvana_df_clean",
  
  # Fleetwood Mac
  "fleetwood_mac_df_clean",
  
  # Prince
  "prince_df_clean",
  "prince_df2_clean",
  "prince_df3_clean",
  "prince_df4_clean",
  
  # Bob Dylan
  "bob_dylan_df_clean",
  
  # Lauryn Hill
  "lauryn_hill_df_clean"
)

Clean the Data

Since some of the values for song title variable are messy, let’s clean the data frame and write the data frame to a CSV:

# This function:
# - extracts artist name from object name (everything before "_df")
# - cleans song titles
# - adds artist column
# - saves cleaned df to global env with name like "artist_cleaned_songs"

clean_and_save_songs_df <- function(df_name_string) {
  # get df by name
  df <- get(df_name_string, envir = .GlobalEnv)

  # get artist name from df name (before "_df")
  artist_id <- str_extract(df_name_string, "^(.*?)(?=_df)")
  artist_title <- str_to_title(str_replace_all(artist_id, "_", " "))

  cleaned_df <- df |>
    mutate(
      # clean title — remove quotes, () [] {}, cut after cd, us, uk etc.
      title = title |>
        str_trim() |>
        str_replace_all('"', "") |>
        str_remove_all("\\(.*?\\)") |>
        str_remove_all("\\[.*?\\]") |>
        str_remove_all("\\{.*?\\}") |>
        str_remove_all("(?i)[ ]?(cd|us|uk|canadian)[\\s\\-:]+.*$") |>
        str_squish(),
      # add artist column
      artist = artist_title
    ) |>
    # drop junk rows like "denotes", "not released", etc.
    filter(
      !str_detect(title, "(?i)denotes|indicates|refers to|represents|not released")
    ) |>
    # keep only one row per title
    distinct(title, .keep_all = TRUE)

  # save cleaned df to global env
  cleaned_df_name <- paste0(artist_id, "_cleaned_songs")
  assign(cleaned_df_name, cleaned_df, envir = .GlobalEnv)

  # if album/year column missing, add NA
  if (!"album" %in% names(cleaned_df)) cleaned_df$album <- NA
  if (!"year" %in% names(cleaned_df)) cleaned_df$year <- NA

  # keep only needed columns
  minimal_df <- cleaned_df |>
    select(artist, title, album, year)

  # if all_artists_songs doesn’t exist, create it
  if (!exists("all_artists_songs", envir = .GlobalEnv)) {
    assign("all_artists_songs", minimal_df, envir = .GlobalEnv)
  } else {
    # if exists, add to it
    existing <- get("all_artists_songs", envir = .GlobalEnv)
    updated <- bind_rows(existing, minimal_df)
    assign("all_artists_songs", updated, envir = .GlobalEnv)
  }

  return(cleaned_df)
}


artists_songs_album <- invisible(lapply(artist_dfs, clean_and_save_songs_df))

write.csv(all_artists_songs, "all_artists_songs.csv")
all_artists_songs |>
  group_by(artist, title) |>
  filter(n() > 1) |>
  arrange(artist, title)

## # A tibble: 32 × 4
## # Groups:   artist, title [15]
##    artist      title                         album             year 
##    <chr>       <chr>                         <chr>             <chr>
##  1 Marvin Gaye Ain't No Mountain High Enough United            1967 
##  2 Marvin Gaye Ain't No Mountain High Enough Non-album singles 2013 
##  3 Marvin Gaye Let's Get It On               Let's Get It On   1973 
##  4 Marvin Gaye Let's Get It On               Non-album singles 2005 
##  5 Marvin Gaye Sexual Healing                Midnight Love     1982 
##  6 Marvin Gaye Sexual Healing                Non-album singles 2015 
##  7 Prince      1999                          1999              1982 
##  8 Prince      1999                          1999              1998 
##  9 Prince      7                             Love Symbol       1992 
## 10 Prince      7                             Non-album singles 2023 
## # ℹ 22 more rows

all_artists_songs <- all_artists_songs |>
  distinct(artist, title, .keep_all = TRUE)
all_artists_songs |>
  group_by(artist, title) |>
  filter(n() > 1) |>
  arrange(artist, title)

## # A tibble: 0 × 4
## # Groups:   artist, title [0]
## # ℹ 4 variables: artist <chr>, title <chr>, album <chr>, year <chr>

all_artists_songs <- all_artists_songs |>
  distinct(artist, title, .keep_all = TRUE)

Creating messy SQL database from all_artists_songs and returning it as tidy df

Get the Song Lyrics Using Genius.com API

Next to gather song lyrics, we’ll use the Genuis API. Unfortunately, the Genius API can’t provide the lyrics directly, but it can provide the URL that contains the lyrics to a song. We’ll then have to web scrape again to create a lyrics dataframe.

# Genius API token (token needs to be keyed before submitting the project). I'll leave it as is for now.
genius_token <- "9em6wi6o9dAzPw5U_6qjYCOqRa2nQKZ26DjIe4anzXgCsJJT0hP5ZZsZK3UdsrzH"

#For example we can use this song
song_name <- "Yesterday"
artist_band <- "The Beatles"

# This function searches Genius for a song URL
search_song_final <- function(song_title, artist_name) {
  # Heads up: Best practice is to load the token securely, not hardcode it.
  # e.g., using Sys.getenv("GENIUS_API_TOKEN")
  genius_token <- "9em6wi6o9dAzPw5U_6qjYCOqRa2nQKZ26DjIe4anzXgCsJJT0hP5ZZsZK3UdsrzH" # Using hardcoded token for this example

  # Setting up to call the Genius API's search endpoint
 base_url <- "https://api.genius.com/search"
  query <- list(q = paste(song_title, artist_name))

  # Making the request to the API
  res <- GET(
    url = base_url,
    add_headers(Authorization = paste("Bearer", genius_token)), # Authorization header includes the token
    query = query
  )

  # Check if the request worked (Status 200 means OK)
  if (status_code(res) != 200) {
    warning("API request failed. Status code: ", status_code(res))
    return(NA) # Return NA on API call failure
  }

  # Now parse the JSON results. Using flatten=TRUE based on previous findings.
  content_data <- tryCatch(
      fromJSON(content(res, "text", encoding = "UTF-8"), flatten = TRUE),
      error = function(e) {
          warning("Failed to parse JSON response: ", e$message)
          return(NULL) # Return NULL on JSON parsing error
      }
  )
  # Ensure a valid response structure was received
  if (is.null(content_data) || is.null(content_data$response)) return(NA)

  # Search results are typically inside response$hits
  hits <- content_data$response$hits

  # Process the 'hits' data frame (assuming it's a data frame)
  # Handle cases with no results or if 'hits' isn't a data frame
  if (is.null(hits) || !inherits(hits, "data.frame") || nrow(hits) == 0) {
    return(NA)
  }

  # Define expected column names based on flatten=TRUE producing specific columns
  artist_col <- "result.primary_artist.name"
  url_col <- "result.url"

  # Check if the required columns exist in the data frame
  if (!all(c(artist_col, url_col) %in% names(hits))) {
     warning("Required columns ('", artist_col, "', '", url_col, "') not found in flattened API response.")
     return(NA) # Cannot proceed without these columns
  }

  # Loop through each row (each potential song hit)
  for (i in 1:nrow(hits)) {
      # Get the artist name for the current row from the flattened column
      artist_raw <- hits[[artist_col]][[i]]
      artist <- tolower(as.character(artist_raw))

      # Check if the artist name is valid and matches the search query
      if (!is.na(artist) && nzchar(artist) && str_detect(artist, tolower(artist_name))) {
          # If the artist matched, get the URL from its column
          song_url_found <- hits[[url_col]][[i]]
          if (!is.na(song_url_found) && nzchar(song_url_found)) {
              # Match found! Return the URL and exit the function.
              return(song_url_found)
          }
      }
  }

  # If the loop finishes without returning, no suitable match was found
  return(NA)
}

# --- Example: Find the URL ---
song_url <- search_song_final(song_name, artist_band)
# This should print the found URL...
print(song_url)

## [1] "https://genius.com/The-beatles-yesterday-lyrics"

Now we need to create functions that will get all the words for a song, get_words_df, and get the song emotions, get_song_sentiments:

# Returns a dataframe where each observation is a word and the corresponding in a song
get_words_df <- function(artist_name, song_title) {
  song_url <- search_song_final(song_title, artist_name)
  if (is.na(song_url)) {
    return(NULL)
  }

  # Using the song_url found previously to scrape the lyrics

  # Only run this if a valid song_url was found
  if (!is.na(song_url) && nzchar(song_url)) {

    # Read the HTML content from the URL
    page_html <- tryCatch(
      read_html(song_url),
      error = function(e) {
        warning("Failed to read URL: ", song_url, "\nError: ", e$message)
        return(NULL) # Return NULL on page read failure
      }
    )

    # Only proceed if the webpage HTML was read successfully
    if (!is.null(page_html)) {

      # Find the part of the HTML containing the lyrics
      lyrics_selector <- "div[data-lyrics-container='true']" # Verify this selector!

      # Use rvest to find the HTML element(s) matching the selector
      lyrics_nodes <- page_html |>
        html_elements(css = lyrics_selector)

      # Check if any nodes were found using the selector
      if (length(lyrics_nodes) == 0) {
        # If not found, the selector was likely wrong or the page structure changed.
        warning("Could not find lyrics container using selector: '", lyrics_selector,
              "'. The website structure may have changed or the selector is wrong. ",
              "Please inspect the page HTML.")
        return(NULL) # Cannot proceed.
      } else {

        # Extract the text content from the found node(s)
        # html_text2 often handles line breaks better
        lyrics_text_raw <- lyrics_nodes |>
          html_text2()

        # Clean up the extracted raw text
        temp_cleaned_vector <- lyrics_text_raw %>%
          str_split("\n") %>% # Split into lines based on the newline character (\n)
          unlist() %>% # The result of str_split is a list, so unlist it
          str_remove_all("\\[[^\\]]+\\]") %>% # Remove structural markers like [Verse 1], [Chorus] etc.
          str_trim() # Remove leading/trailing whitespace from lines

        # Now, apply the filtering using standard base R subsetting (outside the pipe)
        lyrics_text_cleaned <- temp_cleaned_vector[nzchar(temp_cleaned_vector)]

        # Structure the cleaned lines into a data frame
        scraped_lyrics_df <- tibble(
          line = 1:length(lyrics_text_cleaned),
          text = lyrics_text_cleaned
        )

      }
    } else {
      # Scraping skipped because reading the URL failed earlier
      warning("Skipping scraping due to error reading URL.")
      return(NULL)
    }

  } else {
    warning("Cannot scrape lyrics because song_url was not found successfully.")
    return(NULL) # Ensure variable exists even if scraping is skipped
  }

  lyrics_df <- scraped_lyrics_df

  words <- lyrics_df %>%
    unnest_tokens(word, text)
  return(words)
}

# Returns top n song emotions for a particular song
get_song_sentiments <- function(words, artist_name, song_title, n, year = NULL) {
  null_top_emotion_df <- tibble(artist = artist_name, song = song_title, top_emotion = NA)
  
  if (is.null(words)) {
    return(null_top_emotion_df)
  }

  top_emotions_count <- words %>%
    inner_join(nrc, by = "word", relationship = "many-to-many") %>%
    filter(!sentiment %in% c("positive", "negative")) %>%
    count(sentiment, sort = TRUE) %>%
    slice_head(n = n)
  top_emotions <- top_emotions_count |>
    pull(sentiment)
  
  top_emotions_count$artist <- artist_name
  top_emotions_count$song <- song_title
  colnames(top_emotions_count)[1] <- "top_emotion"
  if (length(top_emotions) == 0) return(null_top_emotion_df)
  return(top_emotions_count)
}

# Testing
words_test <- get_words_df(artist_band, song_name)
head(words_test, 15)

## # A tibble: 15 × 2
##     line word        
##    <int> <chr>       
##  1     1 113         
##  2     1 contributors
##  3     2 translations
##  4     3 türkçe      
##  5     4 español     
##  6     5 português   
##  7     6 deutsch     
##  8     7 yesterday   
##  9     7 lyrics      
## 10     8 yesterday   
## 11     8 is          
## 12     8 the         
## 13     8 most        
## 14     8 covered     
## 15     8 song

scraped_lyrics_test <- get_song_sentiments(words_test, artist_band, song_name, 1)
head(scraped_lyrics_test, 15)

## # A tibble: 1 × 4
##   top_emotion     n artist      song     
##   <chr>       <int> <chr>       <chr>    
## 1 fear            3 The Beatles Yesterday

Analysis

Now we’re ready to perform our analysis.

Let’s start by finding the top emotions overall, over the years, and for each artist:

# Grab 5 songs from each of the top 10 artists
get_top5_valid_sentiments <- function(artist_name, song_titles, max_songs = 5) {
  results <- list()

  for (song in song_titles) {
    lyrics_words <- get_words_df(artist_name, song)
    res <- get_song_sentiments(lyrics_words, artist_name, song, 1)
    # keep only song + top_emotion
    if (!is.na(res$top_emotion)) {
      results <- append(results, list(
        tibble(song       = res$song,
               top_emotion = res$top_emotion)
      ))
    }

    if (length(results) >= max_songs) break
  }

  if (length(results) > 0) {
    return(bind_rows(results))
  } else {
    return(tibble(song = NA, top_emotion = NA))
  }
}

get_top5_valid_sentiments_year <- function(artist_name, title) {
  words <- get_words_df(artist_name, title)
  sent <- get_song_sentiments(words, artist_name, title, 1)
  return(sent)
}

artist_song_list <- all_artists_songs %>%
  filter(!is.na(artist), artist != "artist") %>%
  group_by(artist) %>%
  summarise(songs = list(title), .groups = "drop")

song_list_by_year <- all_artists_songs %>%
  filter(!is.na(artist), artist != "artist", year != "Source:[42]") %>%
  group_by(artist, year)

# Now apply get_top5_valid_sentiments to each row
# Use foreach to speed up the process
valid_sentiments <- foreach(i = 1:nrow(artist_song_list),
                            .combine = bind_rows,
                            .packages = c(
                              "dplyr", "purrr", "tibble", "httr", "jsonlite", 
                              "stringr", "rvest", "tidytext", "textdata", "readr", "xml2"
                            )) %dopar% {
  artist <- artist_song_list$artist[i]
  songs  <- artist_song_list$songs[[i]]
  result <- get_top5_valid_sentiments(artist, songs)
  result$artist <- artist
  result
}
print(valid_sentiments)

## # A tibble: 50 × 3
##    song                                 top_emotion  artist       
##    <chr>                                <chr>        <chr>        
##  1 Mixed-Up Confusion                   anger        Bob Dylan    
##  2 Corrina Corrina                      anticipation Bob Dylan    
##  3 Blowin' in the Wind                  joy          Bob Dylan    
##  4 The Times They Are a-Changin'        anger        Bob Dylan    
##  5 Honey, Just Allow Me One More Chance surprise     Bob Dylan    
##  6 I Believe My Time Ain't Long         anticipation Fleetwood Mac
##  7 Black Magic Woman                    joy          Fleetwood Mac
##  8 Need Your Love So Bad                joy          Fleetwood Mac
##  9 Man of the World                     joy          Fleetwood Mac
## 10 Oh Well                              anticipation Fleetwood Mac
## # ℹ 40 more rows

# Let's take a subset of the songs and just look at the years 1970, 1980, 1990, etc.
song_list_by_year_filtered <- song_list_by_year %>%
  filter(year %in% c("1970", "1980", "1990", "2000", "2010"))

valid_sentiments_by_decade <- foreach(i = 1:nrow(song_list_by_year_filtered), .combine = bind_rows, .packages = c("dplyr", "tibble")) %dopar% {
  artist <- song_list_by_year_filtered$artist[i]
  title  <- song_list_by_year_filtered$title[i]
  year   <- song_list_by_year_filtered$year[i]
  result <- get_top5_valid_sentiments_year(artist, title)
  if (!is.null(result) && nrow(result) > 0) {
    result$year <- year
    result
  } else {
    tibble(artist = artist, song = title, top_emotion = NA, year = year)
  }
}

valid_sentiments_by_decade <- valid_sentiments_by_decade |>
  count(year, top_emotion, sort = TRUE) |>
  group_by(year) |>
  slice_head(n = 1) |>
  ungroup()

print(valid_sentiments_by_decade)

## # A tibble: 5 × 3
##   year  top_emotion      n
##   <chr> <chr>        <int>
## 1 1970  joy             12
## 2 1980  joy             15
## 3 1990  joy              6
## 4 2000  anticipation     2
## 5 2010  joy              2

data("stop_words")
word_freq_decade_years <- foreach(i = 1:nrow(song_list_by_year_filtered), .combine = bind_rows, .packages = c("dplyr", "tibble")) %dopar% {
  artist <- song_list_by_year_filtered$artist[i]
  title  <- song_list_by_year_filtered$title[i]
  year   <- song_list_by_year_filtered$year[i]
  result <- get_words_df(artist, title)
  if (!is.null(result) && nrow(result) > 0) {
    result$year <- year
    result
  } else {
    tibble(artist = artist, song = title, word = NA, year = year)
  }
}


word_freq_decade_years <- word_freq_decade_years |>
  anti_join(stop_words, by=c("word"="word")) %>% # don't include stop words
  filter(!(word %in% c("it’s", "da", "dah", "duh", "ba", "shoo", "yeah", "ooh", "oh", "nuh", "i’m", "gonna", "bop", "you’re", "contributors", "lyrics", "la", "wanna"))) %>%
  count(word, sort = TRUE) %>%
  slice_head(n = 10)
word_freq_decade_years

## # A tibble: 10 × 2
##    word        n
##    <chr>   <int>
##  1 love      215
##  2 baby      113
##  3 time       69
##  4 home       65
##  5 summer     54
##  6 music      48
##  7 mind       46
##  8 world      46
##  9 grandma    43
## 10 girl       38

stopCluster(cl)

Let’s graph these charts and analyze:

Word frequency:

word_freq_decade_years <- word_freq_decade_years |>
  left_join(bing, by = "word", relationship = "many-to-many")

word_freq_decade_years |>
  ggplot(aes(x = reorder(word, n), y = n, fill = sentiment)) +
  geom_col() + theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1)) +
  labs(title = "Top 10 Words 1960 - 2010",
         x = "Word",
         y = "Count")

We can see here that love is the most frequent word, and is the only word that has a positve or negative emotion attached to it. Possibly the most interesting takeaway from this graph is that grandma is in top words.

Sentiments for each decade:

valid_sentiments_by_decade |>
  ggplot(aes(x = year, y = n, fill = top_emotion)) +
  geom_col() + theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1)) +
  labs(title = "Top Emotion in Songs Per Year (One Year Per Decade)",
       x = "Year", y = "Count")

Over time, the emotion represented for each decade is joy excluding 2010. That particular year had more sad songs than joyful.

Top Emotions in Songs by Top 10 Artists:

ggplot(valid_sentiments, aes(x = fct_infreq(top_emotion), fill = top_emotion)) +
  geom_bar() +
  coord_flip() +
  labs(title = "Top Emotions in Songs by Top 10 Artists",
       x = "Emotion", y = "Count") +
  theme_minimal()

This bar chart summarizes the most frequent dominant emotion detected in songs by the top 10 Rolling Stone artists.

Joy is by far the most common top emotion, appearing in nearly 25 songs, suggesting that uplifting and positive themes are prominent across these artists’ lyrics.
Anticipation is the second most frequent, often associated with themes of hope, excitement, or longing.
Emotions like trust, anger, and fear are present but much less frequent.

Sadness and surprise appear rarely as the primary emotion, indicating they are less likely to be the dominant tone in top songs.

Overall, this distribution highlights the emotional leaning of popular music from legendary artists toward positive or forward-looking sentiments.

Top Emotions by Artist:

valid_sentiments %>%
  group_by(artist, top_emotion) %>%
  summarise(n = n(), .groups = "drop") %>%
  arrange(artist, desc(n))

## # A tibble: 26 × 3
##    artist        top_emotion      n
##    <chr>         <chr>        <int>
##  1 Bob Dylan     anger            2
##  2 Bob Dylan     anticipation     1
##  3 Bob Dylan     joy              1
##  4 Bob Dylan     surprise         1
##  5 Fleetwood Mac joy              3
##  6 Fleetwood Mac anticipation     2
##  7 Joni Mitchell anticipation     2
##  8 Joni Mitchell joy              2
##  9 Joni Mitchell anger            1
## 10 Lauryn Hill   fear             2
## # ℹ 16 more rows

ggplot(valid_sentiments, aes(x = fct_infreq(top_emotion), fill = artist)) +
  geom_bar(position = "dodge") +
  labs(title = "Top Emotions by Artist", x = "Emotion", y = "Number of Songs") +
  theme_minimal()

This bar chart shows the most common emotional tone for songs by each of the top 10 Rolling Stone artists based on NRC sentiment analysis.

Joy and anticipation are the most dominant emotions across artists, particularly for Marvin Gaye, The Beatles, and Fleetwood Mac.
Fear and sadness appear most often in Lauryn Hill’s songs, indicating a more emotionally intense or somber tone.
Bob Dylan stands out with a more diverse emotional profile that includes anger and surprise.

The chart highlights how different artists tend to gravitate toward different emotional themes in their songwriting.

Emotional Variety by Artist:

#Artists with the most emotionally varied songs
valid_sentiments %>%
  group_by(artist) %>%
  summarise(num_unique_emotions = n_distinct(top_emotion)) %>%
  arrange(desc(num_unique_emotions))

## # A tibble: 10 × 2
##    artist         num_unique_emotions
##    <chr>                        <int>
##  1 Bob Dylan                        4
##  2 The Beach Boys                   4
##  3 Joni Mitchell                    3
##  4 Lauryn Hill                      3
##  5 Fleetwood Mac                    2
##  6 Marvin Gaye                      2
##  7 Nirvana                          2
##  8 Prince                           2
##  9 Stevie Wonder                    2
## 10 The Beatles                      2

Conclusions

Our analysis showed us that:

Joy is the most frequent emotion overall, especially in the 1970s
Anticipation peaks in the 1980s/90s, then sadness rises in the 2000s

Some challenges we faced were:

Web scraping
Each Wikipedia page set up differently
Wikipedia page updated
Code sometimes slow to run
Genius API doesn’t provide song lyrics directly which added a step

Potential Future Analysis:

Analyze full discographies per artist
Compare results using multiple sentiment dictionaries
Explore how emotions differ by genre or rank position

DATA607 Final Project

Gillian McGovern, Aaliyah John-Harry, Farhod Ibragimov

2025-05-04

Overview

Motivation

Libraries

Find the Top Artists

Read the Data

Tidy the Data

Filter the Data

Find the Songs for Each Artist

Read the Data

Tidy and Prepare the Data

Marvin Gaye

The Beach Boys

Joni Mitchell

Stevie Wonder

The Beatles

Nirvana

Fleetwood Mac

Prince

Bob Dylan

Lauryn Hill

Creating a Single Dataframe for Artist/Song/Album/Year

Clean the Data

Creating messy SQL database from all_artists_songs and returning it as tidy df

Get the Song Lyrics Using Genius.com API

Analysis

Conclusions