For this project, we want to analyze the emotions of the top Rolling Stone’s artists via sentiment analysis of song lyrics.
Data sources:
Top artists data: * Rolling Stone Album Rankings CSV: https://github.com/rfordatascience/tidytuesday/tree/main/data/2024/2024-05-07
Songs data: * Wikipedia Singles Discography (List of singles, with selected chart positions and certifications chart). Example: https://en.wikipedia.org/wiki/The_Beatles_singles_discography
Song Lyrics data: * Genius API and genius.com: https://docs.genius.com/
We chose these data sets so we can:
Additionally, these data sets would give us data preparation experience such as:
library(httr)
library(jsonlite)
library(dplyr)
library(stringr)
library(rvest)
library(janitor)
library(stringr)
library(tidytext)
library(tibble)
library(textdata)
library(tidyr)
library(readr)
library(purrr)
library(forcats)
library(ggplot2)
library(doParallel)
library(foreach)
library(DBI)
library(RMariaDB)
# Load sentiment lexicons
nrc <- get_sentiments("nrc")
bing <- get_sentiments("bing")
# Detect number of cores and create cluster
num_cores <- parallel::detectCores() - 1 # leave one core free
cl <- makeCluster(num_cores)
registerDoParallel(cl)
rs <- read_csv("https://raw.githubusercontent.com/aaliyahmjh/DATA607Project/refs/heads/main/rolling_stone.csv", show_col_types = FALSE)
This data set is untidy because it has a wide structure with duplicate columns such as name and 3 separate rank columns that can be combined and referenced with a “year” column:
# Tidy the data
tidy_rs <- rs %>%
pivot_longer(
cols = starts_with("rank_"),
names_to = "year",
values_to = "rank",
names_pattern = "rank_(\\d+)"
)
tidy_rs
## # A tibble: 2,073 × 20
## sort_name clean_name album differential release_year genre type
## <chr> <chr> <chr> <dbl> <dbl> <chr> <chr>
## 1 Sinatra, Frank Frank Sinatra In the We… -182 1955 Big … Stud…
## 2 Sinatra, Frank Frank Sinatra In the We… -182 1955 Big … Stud…
## 3 Sinatra, Frank Frank Sinatra In the We… -182 1955 Big … Stud…
## 4 Diddley, Bo Bo Diddley Bo Diddle… -241 1955 Rock… Stud…
## 5 Diddley, Bo Bo Diddley Bo Diddle… -241 1955 Rock… Stud…
## 6 Diddley, Bo Bo Diddley Bo Diddle… -241 1955 Rock… Stud…
## 7 Presley, Elvis Elvis Presley Elvis Pre… -277 1956 Rock… Stud…
## 8 Presley, Elvis Elvis Presley Elvis Pre… -277 1956 Rock… Stud…
## 9 Presley, Elvis Elvis Presley Elvis Pre… -277 1956 Rock… Stud…
## 10 Sinatra, Frank Frank Sinatra Songs for… -195 1956 Big … Stud…
## # ℹ 2,063 more rows
## # ℹ 13 more variables: weeks_on_billboard <dbl>, peak_billboard_position <dbl>,
## # spotify_popularity <dbl>, spotify_url <chr>, artist_member_count <dbl>,
## # artist_gender <chr>, artist_birth_year_sum <dbl>,
## # debut_album_release_year <dbl>, ave_age_at_top_500 <dbl>,
## # years_between <dbl>, album_id <chr>, year <chr>, rank <dbl>
Now each observation represents a year making the data set tidy.
Next we want to filter the data set to only focus on albums ranked in 2020:
filter_rs <- tidy_rs %>%
select(clean_name, album, genre, year, rank)
# Filter dataset to only focus on albums ranked in 2020
filter_rs <- filter_rs %>%
filter(year == 2020) %>%
filter(!is.na(rank))
head(filter_rs)
## # A tibble: 6 × 5
## clean_name album genre year rank
## <chr> <chr> <chr> <chr> <dbl>
## 1 Frank Sinatra In the Wee Small Hours Big Band/Jazz 2020 282
## 2 Bo Diddley Bo Diddley / Go Bo Diddley Rock n' Roll/Rhythm & B… 2020 455
## 3 Elvis Presley Elvis Presley Rock n' Roll/Rhythm & B… 2020 332
## 4 Little Richard Here's Little Richard <NA> 2020 227
## 5 Beyonce Lemonade <NA> 2020 32
## 6 Amy Winehouse Back to Black Soul/Gospel/R&B 2020 33
Now we want to grab only the top 10 artists:
# Select the top 10 artists & albums by rank
top10_by_rank <- filter_rs %>%
arrange(rank) %>% # sort ascending
slice_head(n = 10) # take the first 10 rows
top10_by_rank
## # A tibble: 10 × 5
## clean_name album genre year rank
## <chr> <chr> <chr> <chr> <dbl>
## 1 Marvin Gaye What's Going On Soul/Gospel/R&B 2020 1
## 2 The Beach Boys Pet Sounds <NA> 2020 2
## 3 Joni Mitchell Blue Country/Folk/Coun… 2020 3
## 4 Stevie Wonder Songs in the Key of Life Soul/Gospel/R&B 2020 4
## 5 The Beatles Abbey Road <NA> 2020 5
## 6 Nirvana Nevermind Indie/Alternative… 2020 6
## 7 Fleetwood Mac Rumours Singer-Songwriter… 2020 7
## 8 Prince Purple Rain <NA> 2020 8
## 9 Bob Dylan Blood on the Tracks Country/Folk/Coun… 2020 9
## 10 Lauryn Hill The Miseducation of Lauryn Hill Hip-Hop/Rap 2020 10
The top artists are:
Now that we have our artists to focus on, next is grabbing the songs for each artist. We’ll be grabbing this data from Wikipedia for us to get web scraping experience.
Each Wikipedia page is set up a bit differently, so we’re going to get the songs for each artist individually:
# Get the html
html_marvin_gaye <- read_html("https://en.wikipedia.org/wiki/Marvin_Gaye_discography#Singles")
html_beach_boys <- read_html("https://en.wikipedia.org/wiki/The_Beach_Boys_discography#Singles")
html_joni_mitchell <- read_html("https://en.wikipedia.org/wiki/Joni_Mitchell_discography#Singles")
html_stevie_wonder <- read_html("https://en.wikipedia.org/wiki/Stevie_Wonder_discography#Singles")
html_the_beatles <- read_html("https://en.wikipedia.org/wiki/The_Beatles_singles_discography#Singles")
html_nirvana <- read_html("https://en.wikipedia.org/wiki/Nirvana_discography#Singles")
html_fleetwood_mac <- read_html("https://en.wikipedia.org/wiki/Fleetwood_Mac_discography#Singles")
html_prince <- read_html("https://en.wikipedia.org/wiki/Prince_singles_discography#Singles")
html_bob_dylan <- read_html("https://en.wikipedia.org/wiki/Bob_Dylan_discography#Singles")
html_lauryn_hill <- read_html("https://en.wikipedia.org/wiki/Lauryn_Hill_discography#Singles")
Next we select the tables we want from the HTML. Note, some artist’s pages have multiple tables for songs where each table represents a decade. For these artists, we’ll need to gather multiple tables:
# Scrape the songs table from wikipedia
get_songs_df <- function(html, css_selector) {
songs <- html |> html_elements(css_selector) |>
html_table()
return(as.data.frame(songs[[1]]))
}
# Note: these Wikipedia pages can be updated, so it's possible this code won't run in the future (css selector would need to be updated)
marvin_gaye_df <- get_songs_df(html_marvin_gaye, "#mw-content-text > div.mw-content-ltr.mw-parser-output > table:nth-child(30)") # 1960s
marvin_gaye_df2 <- get_songs_df(html_marvin_gaye, "#mw-content-text > div.mw-content-ltr.mw-parser-output > table:nth-child(32)") # 1970 - 1984
marvin_gaye_df3 <- get_songs_df(html_marvin_gaye, "#mw-content-text > div.mw-content-ltr.mw-parser-output > table:nth-child(34)") # Posthumous
beach_boys_df <- get_songs_df(html_beach_boys, "#mw-content-text > div.mw-content-ltr.mw-parser-output > table:nth-child(21)") # 1960s
beach_boys_df2 <- get_songs_df(html_beach_boys, "#mw-content-text > div.mw-content-ltr.mw-parser-output > table:nth-child(23)") # 1970s
beach_boys_df3 <- get_songs_df(html_beach_boys, "#mw-content-text > div.mw-content-ltr.mw-parser-output > table:nth-child(25)") # 1980s
beach_boys_df4 <- get_songs_df(html_beach_boys, "#mw-content-text > div.mw-content-ltr.mw-parser-output > table:nth-child(27)") # 1990s - present
joni_mitchell_df <- get_songs_df(html_joni_mitchell, "#mw-content-text > div.mw-content-ltr.mw-parser-output > table:nth-child(19)")
stevie_wonder_df <- get_songs_df(html_stevie_wonder, "#mw-content-text > div.mw-content-ltr.mw-parser-output > table:nth-child(18)") # 1960s
stevie_wonder_df2 <- get_songs_df(html_stevie_wonder, "#mw-content-text > div.mw-content-ltr.mw-parser-output > table:nth-child(20)") # 1970s
stevie_wonder_df3 <- get_songs_df(html_stevie_wonder, "#mw-content-text > div.mw-content-ltr.mw-parser-output > table:nth-child(22)") # 1980s
stevie_wonder_df4 <- get_songs_df(html_stevie_wonder, "#mw-content-text > div.mw-content-ltr.mw-parser-output > table:nth-child(24)") # 1990s - present
the_beatles_df <- get_songs_df(html_the_beatles, "#mw-content-text > div.mw-content-ltr.mw-parser-output > table:nth-child(10)")
nirvana_df <- get_songs_df(html_nirvana, "#mw-content-text > div.mw-content-ltr.mw-parser-output > table:nth-child(20)")
fleetwood_mac_df <- get_songs_df(html_fleetwood_mac, "#mw-content-text > div.mw-content-ltr.mw-parser-output > table:nth-child(23)")
prince_df <- get_songs_df(html_prince, "#mw-content-text > div.mw-content-ltr.mw-parser-output > table:nth-child(18)") # 1970s & 80s
prince_df2 <- get_songs_df(html_prince, "#mw-content-text > div.mw-content-ltr.mw-parser-output > table:nth-child(20)") # 1990s
prince_df3 <- get_songs_df(html_prince, "#mw-content-text > div.mw-content-ltr.mw-parser-output > table:nth-child(22)") # 2000s
prince_df4 <- get_songs_df(html_prince, "#mw-content-text > div.mw-content-ltr.mw-parser-output > table:nth-child(24)") # 2010s - 20s
bob_dylan_df <- get_songs_df(html_bob_dylan, "#mw-content-text > div.mw-content-ltr.mw-parser-output > table:nth-child(18)")
lauryn_hill_df <- get_songs_df(html_lauryn_hill, "#mw-content-text > div.mw-content-ltr.mw-parser-output > table:nth-child(18)")
Next we’ll go through each unique dataframe (note, each dataframe is
slightly different from each other), and tidy the data. These dataframes
are untidy because each chart category is a unique column. Instead there
should be a general albums_chart category variable and a
rank variable to represent this information.
# Marvin Gaye
## 1960s ##
# Keep the important variables
marvin_gaye_df <- marvin_gaye_df[, 1:11]
# First row is actually the column names
marvin_gaye_df <- marvin_gaye_df |>
row_to_names(row_number = 1)
# Clean up the names
marvin_gaye_df_clean <- marvin_gaye_df %>%
clean_names()
colnames(marvin_gaye_df_clean)[4:9] <- c("us", "us_r_and_b", "can", "ire", "ned", "uk")
# Make tidy - make the albums chart it's own variable
marvin_gaye_df_clean <- marvin_gaye_df_clean |>
pivot_longer(cols = us:uk, names_to = "albums_chart", values_to = "peak_chart_position")
# Clean up
marvin_gaye_df_clean <- marvin_gaye_df_clean |>
filter(title != '"—" denotes items that did not chart or were not released in that territory.')
marvin_gaye_df_clean$title <- str_extract(marvin_gaye_df_clean$title , "^(.).*\\1")
head(marvin_gaye_df_clean)
## # A tibble: 6 × 7
## year title b_side certifications album albums_chart peak_chart_position
## <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 1961 "\"Let You… "\"Ne… "" The … us —
## 2 1961 "\"Let You… "\"Ne… "" The … us_r_and_b —
## 3 1961 "\"Let You… "\"Ne… "" The … can —
## 4 1961 "\"Let You… "\"Ne… "" The … ire —
## 5 1961 "\"Let You… "\"Ne… "" The … ned —
## 6 1961 "\"Let You… "\"Ne… "" The … uk —
## 1970 - 1984 ##
# First row is actually the column names
marvin_gaye_df2 <- marvin_gaye_df2 |>
row_to_names(row_number = 1)
# Clean up the names
marvin_gaye_df2_clean <- marvin_gaye_df2 %>%
clean_names()
colnames(marvin_gaye_df2_clean)[4:12] <- c("us", "us_r_and_b", "aus", "can", "ger", "ire", "ned", "swe", "uk")
# Make tidy - make the albums chart it's own variable
marvin_gaye_df2_clean <- marvin_gaye_df2_clean |>
pivot_longer(cols = us:uk, names_to = "albums_chart", values_to = "peak_chart_position")
head(marvin_gaye_df2_clean)
## # A tibble: 6 × 8
## year title b_side certifications album na albums_chart peak_chart_position
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 1970 "\"H… "\"Go… "" That… <NA> us 41
## 2 1970 "\"H… "\"Go… "" That… <NA> us_r_and_b 18
## 3 1970 "\"H… "\"Go… "" That… <NA> aus 37
## 4 1970 "\"H… "\"Go… "" That… <NA> can 37
## 5 1970 "\"H… "\"Go… "" That… <NA> ger —
## 6 1970 "\"H… "\"Go… "" That… <NA> ire —
# Clean up
marvin_gaye_df2_clean <- marvin_gaye_df2_clean |>
filter(title != '"—" denotes items that did not chart or were not released in that territory.')
marvin_gaye_df2_clean$title <- str_extract(marvin_gaye_df2_clean$title , "^(.).*\\1")
## Posthumous ##
# First row is actually the column names
marvin_gaye_df3 <- marvin_gaye_df3 |>
row_to_names(row_number = 1)
# Clean up the names
marvin_gaye_df3_clean <- marvin_gaye_df3 %>%
clean_names()
colnames(marvin_gaye_df3_clean)[4:12] <- c("us", "us_r_and_b", "aus", "ger", "ire", "ned", "sco", "swe", "uk")
# Make tidy - make the albums chart it's own variable
marvin_gaye_df3_clean <- marvin_gaye_df3_clean |>
pivot_longer(cols = us:uk, names_to = "albums_chart", values_to = "peak_chart_position")
# Clean up
marvin_gaye_df3_clean <- marvin_gaye_df3_clean |>
filter(title != '"—" denotes items that did not chart or were not released in that territory. "N/A" indicates chart not yet published.')
marvin_gaye_df3_clean$title <- str_extract(marvin_gaye_df3_clean$title , "^(.).*\\1")
head(marvin_gaye_df3_clean)
## # A tibble: 6 × 8
## year title b_side certifications album na albums_chart peak_chart_position
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 1985 "\"S… "\"Sa… "" Drea… <NA> us [K]
## 2 1985 "\"S… "\"Sa… "" Drea… <NA> us_r_and_b 2
## 3 1985 "\"S… "\"Sa… "" Drea… <NA> aus —
## 4 1985 "\"S… "\"Sa… "" Drea… <NA> ger —
## 5 1985 "\"S… "\"Sa… "" Drea… <NA> ire —
## 6 1985 "\"S… "\"Sa… "" Drea… <NA> ned —
# The Beach Boys
## 1960s ##
beach_boys_df <- beach_boys_df |>
row_to_names(row_number = 1)
# Clean up the names
beach_boys_df_clean <- beach_boys_df %>%
clean_names()
colnames(beach_boys_df_clean)[2:12] <- c("title", "us", "us_cash_box", "us_record_world", "aus", "can", "nl", "nor", "swe", "uk", "phi")
# Make tidy - make the albums chart it's own variable
beach_boys_df_clean <- beach_boys_df_clean |>
pivot_longer(cols = us:phi, names_to = "albums_chart", values_to = "peak_chart_position")
# A side and B side single appear in the same cell in single variable, so we need to break them out into individual observations
beach_boys_df_clean <- separate_rows(beach_boys_df_clean, title, sep = '""')
beach_boys_df_clean
## # A tibble: 710 × 6
## year title certifications_sales…¹ album albums_chart peak_chart_position
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 1961 "\"Surfi… — Surf… us 75—
## 2 1961 "Luau\" … — Surf… us 75—
## 3 1961 "\"Surfi… — Surf… us_cash_box 85—
## 4 1961 "Luau\" … — Surf… us_cash_box 85—
## 5 1961 "\"Surfi… — Surf… us_record_w… ——
## 6 1961 "Luau\" … — Surf… us_record_w… ——
## 7 1961 "\"Surfi… — Surf… aus ——
## 8 1961 "Luau\" … — Surf… aus ——
## 9 1961 "\"Surfi… — Surf… can ——
## 10 1961 "Luau\" … — Surf… can ——
## # ℹ 700 more rows
## # ℹ abbreviated name: ¹certifications_sales_thresholds
# Clean up
beach_boys_df_clean <- beach_boys_df_clean |>
filter(title != '"—" denotes a release that did not chart, was not released in the country or the information is unknown (* - US Record World chart data incomplete for early 1964)')
beach_boys_df_clean$title <- str_extract(beach_boys_df_clean$title , "^(.).*\\1")
## 1970s ##
beach_boys_df2 <- beach_boys_df2 |>
row_to_names(row_number = 1)
# Clean up the names
beach_boys_df2_clean <- beach_boys_df2 %>%
clean_names()
colnames(beach_boys_df2_clean)[2:12] <- c("title", "us", "us_cash_box", "us_record_world", "us_ac", "aus", "can", "nl", "nor", "swe", "uk")
# Make tidy - make the albums chart it's own variable
beach_boys_df2_clean <- beach_boys_df2_clean |>
pivot_longer(cols = us:uk, names_to = "albums_chart", values_to = "peak_chart_position")
# A side and B side single appear in the same cell in single variable, so we need to break them out into individual observations
beach_boys_df2_clean <- separate_rows(beach_boys_df2_clean, title, sep = '""')
## 1980s ##
beach_boys_df3 <- beach_boys_df3 |>
row_to_names(row_number = 1)
# Clean up the names
beach_boys_df3_clean <- beach_boys_df3 %>%
clean_names()
colnames(beach_boys_df3_clean)[2:9] <- c("title", "us", "us_cash_box", "us_ac", "aus", "can", "ger", "uk")
# Make tidy - make the albums chart it's own variable
beach_boys_df3_clean <- beach_boys_df3_clean |>
pivot_longer(cols = us:uk, names_to = "albums_chart", values_to = "peak_chart_position")
# A side and B side single appear in the same cell in single variable, so we need to break them out into individual observations
beach_boys_df3_clean <- separate_rows(beach_boys_df3_clean, title, sep = '""')
beach_boys_df3_clean <- separate_rows(beach_boys_df3_clean, title, sep = '" "')
## 1990s - Present
beach_boys_df4 <- beach_boys_df4 |>
row_to_names(row_number = 1)
# Clean up the names
beach_boys_df4_clean <- beach_boys_df4 %>%
clean_names()
colnames(beach_boys_df4_clean)[2:10] <- c("title", "us", "us_cash_box", "us_ac", "aus", "can", "ger", "uk", "us_cou")
# Make tidy - make the albums chart it's own variable
beach_boys_df4_clean <- beach_boys_df4_clean |>
pivot_longer(cols = us:us_cou, names_to = "albums_chart", values_to = "peak_chart_position")
# A side and B side single appear in the same cell in single variable, so we need to break them out into individual observations
beach_boys_df4_clean <- separate_rows(beach_boys_df4_clean, title, sep = '""')
beach_boys_df4_clean <- separate_rows(beach_boys_df4_clean, title, sep = '" "')
the_beach_boys_df_clean <- beach_boys_df_clean
the_beach_boys_df2_clean <- beach_boys_df2_clean
the_beach_boys_df3_clean <- beach_boys_df3_clean
the_beach_boys_df4_clean <- beach_boys_df4_clean
head(the_beach_boys_df_clean)
## # A tibble: 6 × 6
## year title certifications_sales_thre…¹ album albums_chart peak_chart_position
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 1961 <NA> — Surf… us 75—
## 2 1961 <NA> — Surf… us 75—
## 3 1961 <NA> — Surf… us_cash_box 85—
## 4 1961 <NA> — Surf… us_cash_box 85—
## 5 1961 <NA> — Surf… us_record_w… ——
## 6 1961 <NA> — Surf… us_record_w… ——
## # ℹ abbreviated name: ¹certifications_sales_thresholds
head(the_beach_boys_df2_clean)
## # A tibble: 6 × 5
## year title album albums_chart peak_chart_position
## <chr> <chr> <chr> <chr> <chr>
## 1 1970 "\"Add Some Music to Your Day" Sunf… us 64—
## 2 1970 "Susie Cincinnati\" (non-album B… Sunf… us 64—
## 3 1970 "\"Add Some Music to Your Day" Sunf… us_cash_box 49—
## 4 1970 "Susie Cincinnati\" (non-album B… Sunf… us_cash_box 49—
## 5 1970 "\"Add Some Music to Your Day" Sunf… us_record_w… 39—
## 6 1970 "Susie Cincinnati\" (non-album B… Sunf… us_record_w… 39—
head(the_beach_boys_df3_clean)
## # A tibble: 6 × 6
## year title certifications album albums_chart peak_chart_position
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 1980 "\"Goin' On" — Keep… us 83—
## 2 1980 "Endless Harmony\… — Keep… us 83—
## 3 1980 "\"Goin' On" — Keep… us_cash_box 103—
## 4 1980 "Endless Harmony\… — Keep… us_cash_box 103—
## 5 1980 "\"Goin' On" — Keep… us_ac ——
## 6 1980 "Endless Harmony\… — Keep… us_ac ——
head(the_beach_boys_df4_clean)
## # A tibble: 6 × 6
## year title certifications album albums_chart peak_chart_position
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 1990 "\"Somewhere Near… — Stil… us ——
## 2 1990 "Kokomo\"" — Stil… us ——
## 3 1990 "\"Somewhere Near… — Stil… us_cash_box ——
## 4 1990 "Kokomo\"" — Stil… us_cash_box ——
## 5 1990 "\"Somewhere Near… — Stil… us_ac ——
## 6 1990 "Kokomo\"" — Stil… us_ac ——
# Joni Mitchell
joni_mitchell_df <- joni_mitchell_df |>
row_to_names(row_number = 1)
# Clean up the variable names
joni_mitchell_df_clean <- joni_mitchell_df %>%
clean_names()
colnames(joni_mitchell_df_clean)[3:9] <- c("can", "can_uc", "aus", "uk", "us", "us_ac", "us_main")
# Make tidy - make the albums chart it's own variable
joni_mitchell_df_clean <- joni_mitchell_df_clean |>
pivot_longer(cols = can:us_main, names_to = "albums_chart", values_to = "peak_chart_position")
joni_mitchell_df_clean <- joni_mitchell_df_clean |>
filter(title != '"—" denotes releases that did not chart')
# Clean up song names
joni_mitchell_df_clean$title <- str_extract(joni_mitchell_df_clean$title , "^(.).*\\1")
head(joni_mitchell_df_clean)
## # A tibble: 6 × 6
## title year certifications album albums_chart peak_chart_position
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 "\"Night in the C… 1968 "" Song… can —
## 2 "\"Night in the C… 1968 "" Song… can_uc —
## 3 "\"Night in the C… 1968 "" Song… aus —
## 4 "\"Night in the C… 1968 "" Song… uk —
## 5 "\"Night in the C… 1968 "" Song… us —
## 6 "\"Night in the C… 1968 "" Song… us_ac —
# Stevie Wonder
stevie_wonder_df <- stevie_wonder_df |>
row_to_names(row_number = 1)
## Warning: Row 1 does not provide unique names. Consider running clean_names()
## after row_to_names().
# Clean up the names
stevie_wonder_df_clean <- stevie_wonder_df %>%
clean_names()
colnames(stevie_wonder_df_clean)[1:11] <- c("title", "year", "us", "us_r_and_b", "us_ac", "bel", "can", "ger", "ire", "swi", "uk")
# Make tidy - make the albums chart it's own variable
stevie_wonder_df_clean <- stevie_wonder_df_clean |>
pivot_longer(cols = us:uk, names_to = "albums_chart", values_to = "peak_chart_position")
# A side and B side single appear in the same cell in single variable, so we need to break them out into individual observations
stevie_wonder_df_clean <- separate_rows(stevie_wonder_df_clean, title, sep = 'b/w')
head(stevie_wonder_df_clean)
## # A tibble: 6 × 10
## title year swi_39 uk_40 certifications album na na_2 albums_chart
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 "\"I Call It… 1962 — — "" Non-… <NA> <NA> us
## 2 " Part 2" 1962 — — "" Non-… <NA> <NA> us
## 3 "\"I Call It… 1962 — — "" Non-… <NA> <NA> us_r_and_b
## 4 " Part 2" 1962 — — "" Non-… <NA> <NA> us_r_and_b
## 5 "\"I Call It… 1962 — — "" Non-… <NA> <NA> us_ac
## 6 " Part 2" 1962 — — "" Non-… <NA> <NA> us_ac
## # ℹ 1 more variable: peak_chart_position <chr>
## 1970s ##
stevie_wonder_df2 <- stevie_wonder_df2 |>
row_to_names(row_number = 1)
# Clean up the names
stevie_wonder_df2_clean <- stevie_wonder_df2 %>%
clean_names()
colnames(stevie_wonder_df2_clean)[1:12] <- c("title", "year", "us", "us_r_and_b", "us_ac", "aus", "bel", "can", "ger", "ire", "swi", "uk")
# Make tidy - make the albums chart it's own variable
stevie_wonder_df2_clean <- stevie_wonder_df2_clean |>
pivot_longer(cols = us:uk, names_to = "albums_chart", values_to = "peak_chart_position")
# A side and B side single appear in the same cell in single variable, so we need to break them out into individual observations
stevie_wonder_df2_clean <- separate_rows(stevie_wonder_df2_clean, title, sep = 'b/w')
head(stevie_wonder_df2_clean)
## # A tibble: 6 × 8
## title year uk_40 certifications album na albums_chart peak_chart_position
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 "\"Ne… 1970 6 "" Sign… <NA> us 26
## 2 " \"S… 1970 6 "" Sign… <NA> us 26
## 3 "\"Ne… 1970 6 "" Sign… <NA> us_r_and_b 11
## 4 " \"S… 1970 6 "" Sign… <NA> us_r_and_b 11
## 5 "\"Ne… 1970 6 "" Sign… <NA> us_ac 31
## 6 " \"S… 1970 6 "" Sign… <NA> us_ac 31
## 1980s ##
stevie_wonder_df3 <- stevie_wonder_df3 |>
row_to_names(row_number = 1)
# Clean up the names
stevie_wonder_df3_clean <- stevie_wonder_df3 %>%
clean_names()
colnames(stevie_wonder_df3_clean)[1:12] <- c("title", "year", "us", "us_r_and_b", "us_ac", "aus", "bel", "can", "ger", "ire", "swi", "uk")
# Make tidy - make the albums chart it's own variable
stevie_wonder_df3_clean <- stevie_wonder_df3_clean |>
pivot_longer(cols = us:uk, names_to = "albums_chart", values_to = "peak_chart_position")
# A side and B side single appear in the same cell in single variable, so we need to break them out into individual observations
stevie_wonder_df3_clean <- separate_rows(stevie_wonder_df3_clean, title, sep = 'b/w')
head(stevie_wonder_df3_clean)
## # A tibble: 6 × 9
## title year swi_39 uk_40 certifications album na albums_chart
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 "\"Black Orchid\"" 1980 — 63 "" "Ste… <NA> us
## 2 " \"Blame It on th… 1980 — 63 "" "Ste… <NA> us
## 3 "\"Black Orchid\"" 1980 — 63 "" "Ste… <NA> us_r_and_b
## 4 " \"Blame It on th… 1980 — 63 "" "Ste… <NA> us_r_and_b
## 5 "\"Black Orchid\"" 1980 — 63 "" "Ste… <NA> us_ac
## 6 " \"Blame It on th… 1980 — 63 "" "Ste… <NA> us_ac
## # ℹ 1 more variable: peak_chart_position <chr>
## 1990s - present ##
stevie_wonder_df4 <- stevie_wonder_df4 |>
row_to_names(row_number = 1)
## Warning: Row 1 does not provide unique names. Consider running clean_names()
## after row_to_names().
# Clean up the names
stevie_wonder_df4_clean <- stevie_wonder_df4 %>%
clean_names()
colnames(stevie_wonder_df4_clean)[1:11] <- c("title", "year", "us", "us_r_and_b", "us_ac", "bel", "can", "ger", "ire", "swi", "uk")
# Make tidy - make the albums chart it's own variable
stevie_wonder_df4_clean <- stevie_wonder_df4_clean |>
pivot_longer(cols = us:uk, names_to = "albums_chart", values_to = "peak_chart_position")
# A side and B side single appear in the same cell in single variable, so we need to break them out into individual observations
stevie_wonder_df4_clean <- separate_rows(stevie_wonder_df4_clean, title, sep = 'b/w')
head(stevie_wonder_df4_clean)
## # A tibble: 6 × 9
## title year uk_40 album na na_2 na_3 albums_chart peak_chart_position
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 "\"Keep … 1990 77 Non-… <NA> <NA> <NA> us —
## 2 " Instru… 1990 77 Non-… <NA> <NA> <NA> us —
## 3 "\"Keep … 1990 77 Non-… <NA> <NA> <NA> us_r_and_b 24
## 4 " Instru… 1990 77 Non-… <NA> <NA> <NA> us_r_and_b 24
## 5 "\"Keep … 1990 77 Non-… <NA> <NA> <NA> us_ac —
## 6 " Instru… 1990 77 Non-… <NA> <NA> <NA> us_ac —
# The Beatles
the_beatles_df <- the_beatles_df |>
row_to_names(row_number = 1)
## Warning: Row 1 does not provide unique names. Consider running clean_names()
## after row_to_names().
# Clean up the names
the_beatles_df_clean <- the_beatles_df %>%
clean_names()
colnames(the_beatles_df_clean)[1:15] <- c("title", "year", "uk", "aus", "aut", "bel", "can", "ger", "nld", "nor", "nz", "swi", "us", "us_cash_box", "us_record_world")
# Make tidy - make the albums chart it's own variable
the_beatles_df_clean <- the_beatles_df_clean |>
pivot_longer(cols = uk:us_record_world, names_to = "albums_chart", values_to = "peak_chart_position")
# A side and B side single appear in the same cell in single variable, so we need to break them out into individual observations
the_beatles_df_clean <- separate_rows(the_beatles_df_clean, title, sep = '""')
head(the_beatles_df_clean)
## # A tibble: 6 × 9
## title year certifications uk_album us_album na na_2 albums_chart
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 "\"My Bonnie" 1962 "" My Bonn… "" <NA> <NA> uk
## 2 "The Saints\"… 1962 "" My Bonn… "" <NA> <NA> uk
## 3 "\"My Bonnie" 1962 "" My Bonn… "" <NA> <NA> aus
## 4 "The Saints\"… 1962 "" My Bonn… "" <NA> <NA> aus
## 5 "\"My Bonnie" 1962 "" My Bonn… "" <NA> <NA> aut
## 6 "The Saints\"… 1962 "" My Bonn… "" <NA> <NA> aut
## # ℹ 1 more variable: peak_chart_position <chr>
# Nirvana
nirvana_df <- nirvana_df |>
row_to_names(row_number = 1)
# Clean up the names
nirvana_df_clean <- nirvana_df %>%
clean_names()
colnames(nirvana_df_clean)[3:12] <- c("us", "aus", "bel", "fin", "fra", "ire", "nz", "prt", "swe", "uk")
# Make tidy - make the albums chart it's own variable
nirvana_df_clean <- nirvana_df_clean |>
pivot_longer(cols = us:uk, names_to = "albums_chart", values_to = "peak_chart_position")
head(nirvana_df_clean)
## # A tibble: 6 × 6
## title year certifications album albums_chart peak_chart_position
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 "\"Love Buzz\"[A]" 1988 "" Blea… us —
## 2 "\"Love Buzz\"[A]" 1988 "" Blea… aus —
## 3 "\"Love Buzz\"[A]" 1988 "" Blea… bel —
## 4 "\"Love Buzz\"[A]" 1988 "" Blea… fin —
## 5 "\"Love Buzz\"[A]" 1988 "" Blea… fra —
## 6 "\"Love Buzz\"[A]" 1988 "" Blea… ire —
# Fleetwood Mac
fleetwood_mac_df <- fleetwood_mac_df |>
row_to_names(row_number = 1)
# Clean up the names
fleetwood_mac_df_clean <- fleetwood_mac_df %>%
clean_names()
colnames(fleetwood_mac_df_clean)[3:13] <- c("uk", "us", "us_rock", "us_ac", "aus", "can", "ger", "irl", "nl", "nz")
## Warning in colnames(fleetwood_mac_df_clean)[3:13] <- c("uk", "us", "us_rock", :
## number of items to replace is not a multiple of replacement length
# Make tidy - make the albums chart it's own variable
fleetwood_mac_df_clean <- fleetwood_mac_df_clean |>
pivot_longer(cols = uk:nz, names_to = "albums_chart", values_to = "peak_chart_position")
head(fleetwood_mac_df_clean)
## # A tibble: 6 × 5
## title year album albums_chart peak_chart_position
## <chr> <chr> <chr> <chr> <chr>
## 1 "\"I Believe My Time Ain't Long\… 1967 Non-… uk "—"
## 2 "\"I Believe My Time Ain't Long\… 1967 Non-… us ""
## 3 "\"I Believe My Time Ain't Long\… 1967 Non-… us_rock ""
## 4 "\"I Believe My Time Ain't Long\… 1967 Non-… us_ac "—"
## 5 "\"I Believe My Time Ain't Long\… 1967 Non-… aus ""
## 6 "\"I Believe My Time Ain't Long\… 1967 Non-… can ""
# Prince
## 70s & 80s ##
prince_df <- prince_df |>
row_to_names(row_number = 1)
# Clean up the names
prince_df_clean <- prince_df %>%
clean_names()
colnames(prince_df_clean)[3:7] <- c("us", "us_r_and_b", "us_dance", "nld_tip", "uk")
# Make tidy - make the albums chart it's own variable
prince_df_clean <- prince_df_clean |>
pivot_longer(cols = us:uk, names_to = "albums_chart", values_to = "peak_chart_position")
head(prince_df_clean)
## # A tibble: 6 × 11
## title year ger_17 nz_18 swe_19 swi_20 uk_21 certifications album albums_chart
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 "\"S… 1978 — — — — — "" For … us
## 2 "\"S… 1978 — — — — — "" For … us_r_and_b
## 3 "\"S… 1978 — — — — — "" For … us_dance
## 4 "\"S… 1978 — — — — — "" For … nld_tip
## 5 "\"S… 1978 — — — — — "" For … uk
## 6 "\"J… 1978 — — — — — "" For … us
## # ℹ 1 more variable: peak_chart_position <chr>
## 1990s ##
prince_df2 <- prince_df2 |>
row_to_names(row_number = 1)
# Clean up the names
prince_df2_clean <- prince_df2 %>%
clean_names()
colnames(prince_df2_clean)[3:12] <- c("us", "us_r_and_b", "aus", "can", "fra", "ger", "nz", "swe", "swi", "uk")
# Make tidy - make the albums chart it's own variable
prince_df2_clean <- prince_df2_clean |>
pivot_longer(cols = us:uk, names_to = "albums_chart", values_to = "peak_chart_position")
head(prince_df2_clean)
## # A tibble: 6 × 6
## title year certifications album albums_chart peak_chart_position
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 "\"The Future\" (… 1990 "" Batm… us —
## 2 "\"The Future\" (… 1990 "" Batm… us_r_and_b —
## 3 "\"The Future\" (… 1990 "" Batm… aus —
## 4 "\"The Future\" (… 1990 "" Batm… can —
## 5 "\"The Future\" (… 1990 "" Batm… fra —
## 6 "\"The Future\" (… 1990 "" Batm… ger 39
## 2000s ##
prince_df3 <- prince_df3 |>
row_to_names(row_number = 1)
## Warning: Row 1 does not provide unique names. Consider running clean_names()
## after row_to_names().
# Clean up the names
prince_df3_clean <- prince_df3 %>%
clean_names()
colnames(prince_df3_clean)[3:11] <- c("us", "us_r_and_b", "aus", "ger", "ire", "nld", "nor", "swi", "uk")
# Make tidy - make the albums chart it's own variable
prince_df3_clean <- prince_df3_clean |>
pivot_longer(cols = us:uk, names_to = "albums_chart", values_to = "peak_chart_position")
head(prince_df3_clean)
## # A tibble: 6 × 7
## title year album na na_2 albums_chart peak_chart_position
## <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 "\"U Make My Sun Shi… 2000 The … <NA> <NA> us —[G]
## 2 "\"U Make My Sun Shi… 2000 The … <NA> <NA> us_r_and_b 108
## 3 "\"U Make My Sun Shi… 2000 The … <NA> <NA> aus —
## 4 "\"U Make My Sun Shi… 2000 The … <NA> <NA> ger —
## 5 "\"U Make My Sun Shi… 2000 The … <NA> <NA> ire —
## 6 "\"U Make My Sun Shi… 2000 The … <NA> <NA> nld —
## 2010s - present ##
prince_df4 <- prince_df4 |>
row_to_names(row_number = 1)
# Clean up the names
prince_df4_clean <- prince_df4 %>%
clean_names()
colnames(prince_df4_clean)[3:7] <- c("us_adult_r_and_b", "us_r_and_b", "us_r_and_b_digital", "us_r_and_b_sales", "uk")
# Make tidy - make the albums chart it's own variable
prince_df4_clean <- prince_df4_clean |>
pivot_longer(cols = us_adult_r_and_b:uk, names_to = "albums_chart", values_to = "peak_chart_position")
# Wikipedia added a "Posthumous" section within the same table, so need to remove that row to make the data frame tidy
prince_df4_clean <- prince_df4_clean |>
filter(title != "Posthumous", title != '"—" denotes a single that was not released or did not chart in the region.')
head(prince_df4_clean)
## # A tibble: 6 × 5
## title year album albums_chart peak_chart_position
## <chr> <chr> <chr> <chr> <chr>
## 1 "\"Extraloveable\" (featuring An… 2011 Hit … us_adult_r_… —
## 2 "\"Extraloveable\" (featuring An… 2011 Hit … us_r_and_b —
## 3 "\"Extraloveable\" (featuring An… 2011 Hit … us_r_and_b_… —
## 4 "\"Extraloveable\" (featuring An… 2011 Hit … us_r_and_b_… —
## 5 "\"Extraloveable\" (featuring An… 2011 Hit … uk —
## 6 "\"Rock and Roll Love Affair\" [… 2012 Hit … us_adult_r_… —[J]
# Bob Dylan
bob_dylan_df <- bob_dylan_df |>
row_to_names(row_number = 1)
# Clean up the names
bob_dylan_df_clean <- bob_dylan_df %>%
clean_names()
colnames(bob_dylan_df_clean)[2:9] <- c("title", "us", "us_main", "us_cash", "aus", "ire", "nl", "uk")
# Make tidy - make the albums chart it's own variable
bob_dylan_df_clean <- bob_dylan_df_clean |>
pivot_longer(cols = us:uk, names_to = "albums_chart", values_to = "peak_chart_position")
# A side and B side single appear in the same cell in single variable, so we need to break them out into individual observations
bob_dylan_df_clean <- separate_rows(bob_dylan_df_clean, title, sep = 'b/w')
head(bob_dylan_df_clean)
## # A tibble: 6 × 6
## year title certifications album albums_chart peak_chart_position
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 1962 "\"Mixed-Up Confu… "" Non-… us —
## 2 1962 " \"Corrina Corri… "" Non-… us —
## 3 1962 "\"Mixed-Up Confu… "" Non-… us_main —
## 4 1962 " \"Corrina Corri… "" Non-… us_main —
## 5 1962 "\"Mixed-Up Confu… "" Non-… us_cash —
## 6 1962 " \"Corrina Corri… "" Non-… us_cash —
# Lauryn Hill
lauryn_hill_df <- lauryn_hill_df |>
row_to_names(row_number = 1)
# Clean up the names
lauryn_hill_df_clean <- lauryn_hill_df %>%
clean_names()
colnames(lauryn_hill_df_clean)[3:12] <- c("us", "us_r_and_b", "us_rhy", "aus", "fra", "nl", "nz", "swe", "swi", "uk")
# Make tidy - make the albums chart it's own variable
lauryn_hill_df_clean <- lauryn_hill_df_clean |>
pivot_longer(cols = us:uk, names_to = "albums_chart", values_to = "peak_chart_position")
head(lauryn_hill_df_clean)
## # A tibble: 6 × 7
## title year certifications album na albums_chart peak_chart_position
## <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 "\"Doo Wop … 1998 "RIAA: Gold[3… The … <NA> us 1
## 2 "\"Doo Wop … 1998 "RIAA: Gold[3… The … <NA> us_r_and_b 2
## 3 "\"Doo Wop … 1998 "RIAA: Gold[3… The … <NA> us_rhy 1
## 4 "\"Doo Wop … 1998 "RIAA: Gold[3… The … <NA> aus 35
## 5 "\"Doo Wop … 1998 "RIAA: Gold[3… The … <NA> fra 23
## 6 "\"Doo Wop … 1998 "RIAA: Gold[3… The … <NA> nl 4
We now have tidy song dataframe for each artist.
Now let’s combine all the dataframes into one called
artist_dfs:
artist_dfs <- c(
# Marvin Gaye
"marvin_gaye_df_clean",
"marvin_gaye_df2_clean",
"marvin_gaye_df3_clean",
# The Beach Boys
"the_beach_boys_df_clean",
"the_beach_boys_df2_clean",
"the_beach_boys_df3_clean",
"the_beach_boys_df4_clean",
# Joni Mitchell
"joni_mitchell_df_clean",
# Stevie Wonder
"stevie_wonder_df_clean",
"stevie_wonder_df2_clean",
"stevie_wonder_df3_clean",
"stevie_wonder_df4_clean",
# The Beatles
"the_beatles_df_clean",
# Nirvana
"nirvana_df_clean",
# Fleetwood Mac
"fleetwood_mac_df_clean",
# Prince
"prince_df_clean",
"prince_df2_clean",
"prince_df3_clean",
"prince_df4_clean",
# Bob Dylan
"bob_dylan_df_clean",
# Lauryn Hill
"lauryn_hill_df_clean"
)
Since some of the values for song title variable are messy, let’s clean the data frame and write the data frame to a CSV:
# This function:
# - extracts artist name from object name (everything before "_df")
# - cleans song titles
# - adds artist column
# - saves cleaned df to global env with name like "artist_cleaned_songs"
clean_and_save_songs_df <- function(df_name_string) {
# get df by name
df <- get(df_name_string, envir = .GlobalEnv)
# get artist name from df name (before "_df")
artist_id <- str_extract(df_name_string, "^(.*?)(?=_df)")
artist_title <- str_to_title(str_replace_all(artist_id, "_", " "))
cleaned_df <- df |>
mutate(
# clean title — remove quotes, () [] {}, cut after cd, us, uk etc.
title = title |>
str_trim() |>
str_replace_all('"', "") |>
str_remove_all("\\(.*?\\)") |>
str_remove_all("\\[.*?\\]") |>
str_remove_all("\\{.*?\\}") |>
str_remove_all("(?i)[ ]?(cd|us|uk|canadian)[\\s\\-:]+.*$") |>
str_squish(),
# add artist column
artist = artist_title
) |>
# drop junk rows like "denotes", "not released", etc.
filter(
!str_detect(title, "(?i)denotes|indicates|refers to|represents|not released")
) |>
# keep only one row per title
distinct(title, .keep_all = TRUE)
# save cleaned df to global env
cleaned_df_name <- paste0(artist_id, "_cleaned_songs")
assign(cleaned_df_name, cleaned_df, envir = .GlobalEnv)
# if album/year column missing, add NA
if (!"album" %in% names(cleaned_df)) cleaned_df$album <- NA
if (!"year" %in% names(cleaned_df)) cleaned_df$year <- NA
# keep only needed columns
minimal_df <- cleaned_df |>
select(artist, title, album, year)
# if all_artists_songs doesn’t exist, create it
if (!exists("all_artists_songs", envir = .GlobalEnv)) {
assign("all_artists_songs", minimal_df, envir = .GlobalEnv)
} else {
# if exists, add to it
existing <- get("all_artists_songs", envir = .GlobalEnv)
updated <- bind_rows(existing, minimal_df)
assign("all_artists_songs", updated, envir = .GlobalEnv)
}
return(cleaned_df)
}
artists_songs_album <- invisible(lapply(artist_dfs, clean_and_save_songs_df))
write.csv(all_artists_songs, "all_artists_songs.csv")
all_artists_songs |>
group_by(artist, title) |>
filter(n() > 1) |>
arrange(artist, title)
## # A tibble: 32 × 4
## # Groups: artist, title [15]
## artist title album year
## <chr> <chr> <chr> <chr>
## 1 Marvin Gaye Ain't No Mountain High Enough United 1967
## 2 Marvin Gaye Ain't No Mountain High Enough Non-album singles 2013
## 3 Marvin Gaye Let's Get It On Let's Get It On 1973
## 4 Marvin Gaye Let's Get It On Non-album singles 2005
## 5 Marvin Gaye Sexual Healing Midnight Love 1982
## 6 Marvin Gaye Sexual Healing Non-album singles 2015
## 7 Prince 1999 1999 1982
## 8 Prince 1999 1999 1998
## 9 Prince 7 Love Symbol 1992
## 10 Prince 7 Non-album singles 2023
## # ℹ 22 more rows
all_artists_songs <- all_artists_songs |>
distinct(artist, title, .keep_all = TRUE)
all_artists_songs |>
group_by(artist, title) |>
filter(n() > 1) |>
arrange(artist, title)
## # A tibble: 0 × 4
## # Groups: artist, title [0]
## # ℹ 4 variables: artist <chr>, title <chr>, album <chr>, year <chr>
all_artists_songs <- all_artists_songs |>
distinct(artist, title, .keep_all = TRUE)
Next to gather song lyrics, we’ll use the Genuis API. Unfortunately, the Genius API can’t provide the lyrics directly, but it can provide the URL that contains the lyrics to a song. We’ll then have to web scrape again to create a lyrics dataframe.
# Genius API token (token needs to be keyed before submitting the project). I'll leave it as is for now.
genius_token <- "9em6wi6o9dAzPw5U_6qjYCOqRa2nQKZ26DjIe4anzXgCsJJT0hP5ZZsZK3UdsrzH"
#For example we can use this song
song_name <- "Yesterday"
artist_band <- "The Beatles"
# This function searches Genius for a song URL
search_song_final <- function(song_title, artist_name) {
# Heads up: Best practice is to load the token securely, not hardcode it.
# e.g., using Sys.getenv("GENIUS_API_TOKEN")
genius_token <- "9em6wi6o9dAzPw5U_6qjYCOqRa2nQKZ26DjIe4anzXgCsJJT0hP5ZZsZK3UdsrzH" # Using hardcoded token for this example
# Setting up to call the Genius API's search endpoint
base_url <- "https://api.genius.com/search"
query <- list(q = paste(song_title, artist_name))
# Making the request to the API
res <- GET(
url = base_url,
add_headers(Authorization = paste("Bearer", genius_token)), # Authorization header includes the token
query = query
)
# Check if the request worked (Status 200 means OK)
if (status_code(res) != 200) {
warning("API request failed. Status code: ", status_code(res))
return(NA) # Return NA on API call failure
}
# Now parse the JSON results. Using flatten=TRUE based on previous findings.
content_data <- tryCatch(
fromJSON(content(res, "text", encoding = "UTF-8"), flatten = TRUE),
error = function(e) {
warning("Failed to parse JSON response: ", e$message)
return(NULL) # Return NULL on JSON parsing error
}
)
# Ensure a valid response structure was received
if (is.null(content_data) || is.null(content_data$response)) return(NA)
# Search results are typically inside response$hits
hits <- content_data$response$hits
# Process the 'hits' data frame (assuming it's a data frame)
# Handle cases with no results or if 'hits' isn't a data frame
if (is.null(hits) || !inherits(hits, "data.frame") || nrow(hits) == 0) {
return(NA)
}
# Define expected column names based on flatten=TRUE producing specific columns
artist_col <- "result.primary_artist.name"
url_col <- "result.url"
# Check if the required columns exist in the data frame
if (!all(c(artist_col, url_col) %in% names(hits))) {
warning("Required columns ('", artist_col, "', '", url_col, "') not found in flattened API response.")
return(NA) # Cannot proceed without these columns
}
# Loop through each row (each potential song hit)
for (i in 1:nrow(hits)) {
# Get the artist name for the current row from the flattened column
artist_raw <- hits[[artist_col]][[i]]
artist <- tolower(as.character(artist_raw))
# Check if the artist name is valid and matches the search query
if (!is.na(artist) && nzchar(artist) && str_detect(artist, tolower(artist_name))) {
# If the artist matched, get the URL from its column
song_url_found <- hits[[url_col]][[i]]
if (!is.na(song_url_found) && nzchar(song_url_found)) {
# Match found! Return the URL and exit the function.
return(song_url_found)
}
}
}
# If the loop finishes without returning, no suitable match was found
return(NA)
}
# --- Example: Find the URL ---
song_url <- search_song_final(song_name, artist_band)
# This should print the found URL...
print(song_url)
## [1] "https://genius.com/The-beatles-yesterday-lyrics"
Now we need to create functions that will get all the words for a
song, get_words_df, and get the song emotions,
get_song_sentiments:
# Returns a dataframe where each observation is a word and the corresponding in a song
get_words_df <- function(artist_name, song_title) {
song_url <- search_song_final(song_title, artist_name)
if (is.na(song_url)) {
return(NULL)
}
# Using the song_url found previously to scrape the lyrics
# Only run this if a valid song_url was found
if (!is.na(song_url) && nzchar(song_url)) {
# Read the HTML content from the URL
page_html <- tryCatch(
read_html(song_url),
error = function(e) {
warning("Failed to read URL: ", song_url, "\nError: ", e$message)
return(NULL) # Return NULL on page read failure
}
)
# Only proceed if the webpage HTML was read successfully
if (!is.null(page_html)) {
# Find the part of the HTML containing the lyrics
lyrics_selector <- "div[data-lyrics-container='true']" # Verify this selector!
# Use rvest to find the HTML element(s) matching the selector
lyrics_nodes <- page_html |>
html_elements(css = lyrics_selector)
# Check if any nodes were found using the selector
if (length(lyrics_nodes) == 0) {
# If not found, the selector was likely wrong or the page structure changed.
warning("Could not find lyrics container using selector: '", lyrics_selector,
"'. The website structure may have changed or the selector is wrong. ",
"Please inspect the page HTML.")
return(NULL) # Cannot proceed.
} else {
# Extract the text content from the found node(s)
# html_text2 often handles line breaks better
lyrics_text_raw <- lyrics_nodes |>
html_text2()
# Clean up the extracted raw text
temp_cleaned_vector <- lyrics_text_raw %>%
str_split("\n") %>% # Split into lines based on the newline character (\n)
unlist() %>% # The result of str_split is a list, so unlist it
str_remove_all("\\[[^\\]]+\\]") %>% # Remove structural markers like [Verse 1], [Chorus] etc.
str_trim() # Remove leading/trailing whitespace from lines
# Now, apply the filtering using standard base R subsetting (outside the pipe)
lyrics_text_cleaned <- temp_cleaned_vector[nzchar(temp_cleaned_vector)]
# Structure the cleaned lines into a data frame
scraped_lyrics_df <- tibble(
line = 1:length(lyrics_text_cleaned),
text = lyrics_text_cleaned
)
}
} else {
# Scraping skipped because reading the URL failed earlier
warning("Skipping scraping due to error reading URL.")
return(NULL)
}
} else {
warning("Cannot scrape lyrics because song_url was not found successfully.")
return(NULL) # Ensure variable exists even if scraping is skipped
}
lyrics_df <- scraped_lyrics_df
words <- lyrics_df %>%
unnest_tokens(word, text)
return(words)
}
# Returns top n song emotions for a particular song
get_song_sentiments <- function(words, artist_name, song_title, n, year = NULL) {
null_top_emotion_df <- tibble(artist = artist_name, song = song_title, top_emotion = NA)
if (is.null(words)) {
return(null_top_emotion_df)
}
top_emotions_count <- words %>%
inner_join(nrc, by = "word", relationship = "many-to-many") %>%
filter(!sentiment %in% c("positive", "negative")) %>%
count(sentiment, sort = TRUE) %>%
slice_head(n = n)
top_emotions <- top_emotions_count |>
pull(sentiment)
top_emotions_count$artist <- artist_name
top_emotions_count$song <- song_title
colnames(top_emotions_count)[1] <- "top_emotion"
if (length(top_emotions) == 0) return(null_top_emotion_df)
return(top_emotions_count)
}
# Testing
words_test <- get_words_df(artist_band, song_name)
head(words_test, 15)
## # A tibble: 15 × 2
## line word
## <int> <chr>
## 1 1 113
## 2 1 contributors
## 3 2 translations
## 4 3 türkçe
## 5 4 español
## 6 5 português
## 7 6 deutsch
## 8 7 yesterday
## 9 7 lyrics
## 10 8 yesterday
## 11 8 is
## 12 8 the
## 13 8 most
## 14 8 covered
## 15 8 song
scraped_lyrics_test <- get_song_sentiments(words_test, artist_band, song_name, 1)
head(scraped_lyrics_test, 15)
## # A tibble: 1 × 4
## top_emotion n artist song
## <chr> <int> <chr> <chr>
## 1 fear 3 The Beatles Yesterday
Now we’re ready to perform our analysis.
Let’s start by finding the top emotions overall, over the years, and for each artist:
# Grab 5 songs from each of the top 10 artists
get_top5_valid_sentiments <- function(artist_name, song_titles, max_songs = 5) {
results <- list()
for (song in song_titles) {
lyrics_words <- get_words_df(artist_name, song)
res <- get_song_sentiments(lyrics_words, artist_name, song, 1)
# keep only song + top_emotion
if (!is.na(res$top_emotion)) {
results <- append(results, list(
tibble(song = res$song,
top_emotion = res$top_emotion)
))
}
if (length(results) >= max_songs) break
}
if (length(results) > 0) {
return(bind_rows(results))
} else {
return(tibble(song = NA, top_emotion = NA))
}
}
get_top5_valid_sentiments_year <- function(artist_name, title) {
words <- get_words_df(artist_name, title)
sent <- get_song_sentiments(words, artist_name, title, 1)
return(sent)
}
artist_song_list <- all_artists_songs %>%
filter(!is.na(artist), artist != "artist") %>%
group_by(artist) %>%
summarise(songs = list(title), .groups = "drop")
song_list_by_year <- all_artists_songs %>%
filter(!is.na(artist), artist != "artist", year != "Source:[42]") %>%
group_by(artist, year)
# Now apply get_top5_valid_sentiments to each row
# Use foreach to speed up the process
valid_sentiments <- foreach(i = 1:nrow(artist_song_list),
.combine = bind_rows,
.packages = c(
"dplyr", "purrr", "tibble", "httr", "jsonlite",
"stringr", "rvest", "tidytext", "textdata", "readr", "xml2"
)) %dopar% {
artist <- artist_song_list$artist[i]
songs <- artist_song_list$songs[[i]]
result <- get_top5_valid_sentiments(artist, songs)
result$artist <- artist
result
}
print(valid_sentiments)
## # A tibble: 50 × 3
## song top_emotion artist
## <chr> <chr> <chr>
## 1 Mixed-Up Confusion anger Bob Dylan
## 2 Corrina Corrina anticipation Bob Dylan
## 3 Blowin' in the Wind joy Bob Dylan
## 4 The Times They Are a-Changin' anger Bob Dylan
## 5 Honey, Just Allow Me One More Chance surprise Bob Dylan
## 6 I Believe My Time Ain't Long anticipation Fleetwood Mac
## 7 Black Magic Woman joy Fleetwood Mac
## 8 Need Your Love So Bad joy Fleetwood Mac
## 9 Man of the World joy Fleetwood Mac
## 10 Oh Well anticipation Fleetwood Mac
## # ℹ 40 more rows
# Let's take a subset of the songs and just look at the years 1970, 1980, 1990, etc.
song_list_by_year_filtered <- song_list_by_year %>%
filter(year %in% c("1970", "1980", "1990", "2000", "2010"))
valid_sentiments_by_decade <- foreach(i = 1:nrow(song_list_by_year_filtered), .combine = bind_rows, .packages = c("dplyr", "tibble")) %dopar% {
artist <- song_list_by_year_filtered$artist[i]
title <- song_list_by_year_filtered$title[i]
year <- song_list_by_year_filtered$year[i]
result <- get_top5_valid_sentiments_year(artist, title)
if (!is.null(result) && nrow(result) > 0) {
result$year <- year
result
} else {
tibble(artist = artist, song = title, top_emotion = NA, year = year)
}
}
valid_sentiments_by_decade <- valid_sentiments_by_decade |>
count(year, top_emotion, sort = TRUE) |>
group_by(year) |>
slice_head(n = 1) |>
ungroup()
print(valid_sentiments_by_decade)
## # A tibble: 5 × 3
## year top_emotion n
## <chr> <chr> <int>
## 1 1970 joy 12
## 2 1980 joy 15
## 3 1990 joy 6
## 4 2000 anticipation 2
## 5 2010 joy 2
data("stop_words")
word_freq_decade_years <- foreach(i = 1:nrow(song_list_by_year_filtered), .combine = bind_rows, .packages = c("dplyr", "tibble")) %dopar% {
artist <- song_list_by_year_filtered$artist[i]
title <- song_list_by_year_filtered$title[i]
year <- song_list_by_year_filtered$year[i]
result <- get_words_df(artist, title)
if (!is.null(result) && nrow(result) > 0) {
result$year <- year
result
} else {
tibble(artist = artist, song = title, word = NA, year = year)
}
}
word_freq_decade_years <- word_freq_decade_years |>
anti_join(stop_words, by=c("word"="word")) %>% # don't include stop words
filter(!(word %in% c("it’s", "da", "dah", "duh", "ba", "shoo", "yeah", "ooh", "oh", "nuh", "i’m", "gonna", "bop", "you’re", "contributors", "lyrics", "la", "wanna"))) %>%
count(word, sort = TRUE) %>%
slice_head(n = 10)
word_freq_decade_years
## # A tibble: 10 × 2
## word n
## <chr> <int>
## 1 love 215
## 2 baby 113
## 3 time 69
## 4 home 65
## 5 summer 54
## 6 music 48
## 7 mind 46
## 8 world 46
## 9 grandma 43
## 10 girl 38
stopCluster(cl)
Let’s graph these charts and analyze:
Word frequency:
word_freq_decade_years <- word_freq_decade_years |>
left_join(bing, by = "word", relationship = "many-to-many")
word_freq_decade_years |>
ggplot(aes(x = reorder(word, n), y = n, fill = sentiment)) +
geom_col() + theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1)) +
labs(title = "Top 10 Words 1960 - 2010",
x = "Word",
y = "Count")
We can see here that love is the most frequent word, and is the only word that has a positve or negative emotion attached to it. Possibly the most interesting takeaway from this graph is that grandma is in top words.
Sentiments for each decade:
valid_sentiments_by_decade |>
ggplot(aes(x = year, y = n, fill = top_emotion)) +
geom_col() + theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1)) +
labs(title = "Top Emotion in Songs Per Year (One Year Per Decade)",
x = "Year", y = "Count")
Over time, the emotion represented for each decade is joy excluding 2010. That particular year had more sad songs than joyful.
Top Emotions in Songs by Top 10 Artists:
ggplot(valid_sentiments, aes(x = fct_infreq(top_emotion), fill = top_emotion)) +
geom_bar() +
coord_flip() +
labs(title = "Top Emotions in Songs by Top 10 Artists",
x = "Emotion", y = "Count") +
theme_minimal()
This bar chart summarizes the most frequent dominant emotion detected in songs by the top 10 Rolling Stone artists.
Joy is by far the most common top emotion, appearing in nearly 25 songs, suggesting that uplifting and positive themes are prominent across these artists’ lyrics.
Anticipation is the second most frequent, often associated with themes of hope, excitement, or longing.
Emotions like trust, anger, and fear are present but much less frequent.
Overall, this distribution highlights the emotional leaning of popular music from legendary artists toward positive or forward-looking sentiments.
Top Emotions by Artist:
valid_sentiments %>%
group_by(artist, top_emotion) %>%
summarise(n = n(), .groups = "drop") %>%
arrange(artist, desc(n))
## # A tibble: 26 × 3
## artist top_emotion n
## <chr> <chr> <int>
## 1 Bob Dylan anger 2
## 2 Bob Dylan anticipation 1
## 3 Bob Dylan joy 1
## 4 Bob Dylan surprise 1
## 5 Fleetwood Mac joy 3
## 6 Fleetwood Mac anticipation 2
## 7 Joni Mitchell anticipation 2
## 8 Joni Mitchell joy 2
## 9 Joni Mitchell anger 1
## 10 Lauryn Hill fear 2
## # ℹ 16 more rows
ggplot(valid_sentiments, aes(x = fct_infreq(top_emotion), fill = artist)) +
geom_bar(position = "dodge") +
labs(title = "Top Emotions by Artist", x = "Emotion", y = "Number of Songs") +
theme_minimal()
This bar chart shows the most common emotional tone for songs by each of the top 10 Rolling Stone artists based on NRC sentiment analysis.
Joy and anticipation are the most dominant emotions across artists, particularly for Marvin Gaye, The Beatles, and Fleetwood Mac.
Fear and sadness appear most often in Lauryn Hill’s songs, indicating a more emotionally intense or somber tone.
Bob Dylan stands out with a more diverse emotional profile that includes anger and surprise.
The chart highlights how different artists tend to gravitate toward different emotional themes in their songwriting.
Emotional Variety by Artist:
#Artists with the most emotionally varied songs
valid_sentiments %>%
group_by(artist) %>%
summarise(num_unique_emotions = n_distinct(top_emotion)) %>%
arrange(desc(num_unique_emotions))
## # A tibble: 10 × 2
## artist num_unique_emotions
## <chr> <int>
## 1 Bob Dylan 4
## 2 The Beach Boys 4
## 3 Joni Mitchell 3
## 4 Lauryn Hill 3
## 5 Fleetwood Mac 2
## 6 Marvin Gaye 2
## 7 Nirvana 2
## 8 Prince 2
## 9 Stevie Wonder 2
## 10 The Beatles 2
Our analysis showed us that:
Some challenges we faced were:
Potential Future Analysis: