Background
Nickelback’s music is arguably one of the worst exports that Canada has ever produced. It’s rumored that Nickelback was created by God as payback for the sins of mankind. In a ranking by Gitmo detainees, Nickelback was ranked as the third-worst torture device (Acid drips on bare skin was #1 and electrocution was #2). But in all seriousness, my colleagues and I were curious about how Nickelback is actually a commercially-viable band. With all the recycled content and rhyming for the sake of rhyming, you’d think people would have just stopped listening.
To better understand Nickelback, I’ve decided to take one for the team and scrape their lyrics and perform some text analysis.
Setting-Up for Scraping and Text Analysis
Special functions
As always, special functions are specified first. wipe_html is handy for eliminating annoying HTML tags. It essentially takes a HTML string, cleans out the HTML tags, and returns the non-HTML contents of that string. It’ll come in handy in any web-scraping you do, so I highly recommend you hang onto it.
# function to scrub html out of text ----
wipe_html <- function(str_html) {
gsub("<.*?>", "", str_html)
}
Libraries
I’ll be using Hadley Wickham’s dplyr and rvest. dplyr is the gold standard for R data wrangling, and rvest is just convenient for scraping web data.
# libraries ----
pkgs <- c("dplyr", "rvest", "tm", "SnowballC", "topicmodels")
vapply(pkgs, require, character.only = TRUE, FUN.VALUE = 1)
Loading required package: rvest
Loading required package: xml2
Loading required package: SnowballC
Loading required package: topicmodels
dplyr rvest tm SnowballC topicmodels
1 1 1 1 1
Web-Scraping
Get the song list
This code chunk scrapes the table with the song titles, selects the name of the song and the year it was released, and creates a column called url_name that I’ll need in the next code chunk.
# get the HTML code from metrolyrics ----
songs_url <- "https://en.wikipedia.org/wiki/List_of_songs_recorded_by_Nickelback"
html_code <- read_html(songs_url)
# get songs from song title HTML table ----
html_songs <- html_node(html_code, xpath = '//*[@id="mw-content-text"]/div/table[3]')
songs_df <- html_table(html_songs, header = TRUE)
# drop name column to lowercase and remove columns ----
names(songs_df) <- tolower(gsub("\\s|[[:punct:]]", "_", names(songs_df)))
# clean up the songs_df "song" variable ----
songs_df$song <- gsub('\\"', "", songs_df$song)
# album `hesher` has unreliable data; nix it ----
songs_df <- songs_df %>% filter(release != 'Hesher')
Getting the lyrics
Looping over the url_name variable from the previous code chunk made the most sense for this.
In this code chunk, I’ll create two empty vectors for the lyrics and album names. My for-loop will then loop over each element in url_name, paste it into a hyperlink, scrape the lyrics and album name, scrub out the HTML, and place each captured piece into their respective vectors. The data frame album_lyrics will hold the contents of both vectors.
# navigate to each song's URL and scrape the album name and lyrics ----
## empty vectors
lyrics <- c()
## specify row number to add to data frame
number <- 1
## for-loop and create data frame from the two vectors
for(i in seq_along(songs_df$song)) {
for_url_name <- songs_df$song[i]
## clean up song name for URL
for_url_name <- tolower(gsub("[[:punct:]]|\\s", "", for_url_name))
## create url
paste_url <- paste0("http://www.azlyrics.com/lyrics/nickelback/",
for_url_name, ".html")
## open connection to url
for_html_code <- read_html(paste_url)
## scrape lyrics via selector path
for_lyrics <- html_node(for_html_code,
xpath = "/html/body/div[3]/div/div[2]/div[5]")
## scrub html and control characters out of the lyrics
for_lyrics <- wipe_html(for_lyrics)
for_lyrics <- gsub("[[:cntrl:]]", " ", for_lyrics)
## add for_lyrics to respective vectors
lyrics[number] <- for_lyrics
number <- number + 1
## status check
show(paste0(for_url_name, " scrape complete!"))
## optional: add in 10 second delay to avoid IP block
Sys.sleep(10)
}
# bind data frames together and strip "Lyrics" out of name variable ----
if(nrow(songs_df_clean) != nrow(album_lyrics)) {
stop("data frames have different number of rows")
} else {
nb_data <- bind_cols(songs_df_clean, album_lyrics) %>%
mutate(name = gsub("Lyrics", "", name))
}
# filter out missing album names and show the first 10 rows of nb_data ----
nb_data <- nb_data %>% filter(!is.na(album))
nb_data[1:10, ]
LS0tCnRpdGxlOiAiQW5hbHl6aW5nIE5pY2tlbGJhY2sncyBMeXJpY3MiCmF1dGhvcjogIk1hdHQgTSIKc3VidGl0bGU6ICJBIFdlYi1TY3JhcGluZyBhbmQgVGV4dCBBbmFseXNpcyBUdXRvcmlhbCIKb3V0cHV0OiBodG1sX25vdGVib29rCi0tLQoKIyMgQmFja2dyb3VuZApOaWNrZWxiYWNrJ3MgbXVzaWMgaXMgYXJndWFibHkgb25lIG9mIHRoZSB3b3JzdCBleHBvcnRzIHRoYXQgQ2FuYWRhIGhhcyBldmVyIHByb2R1Y2VkLiBJdCdzIHJ1bW9yZWQgdGhhdCBOaWNrZWxiYWNrIHdhcyBjcmVhdGVkIGJ5IEdvZCBhcyBwYXliYWNrIGZvciB0aGUgc2lucyBvZiBtYW5raW5kLiBJbiBhIHJhbmtpbmcgYnkgR2l0bW8gZGV0YWluZWVzLCBOaWNrZWxiYWNrIHdhcyByYW5rZWQgYXMgdGhlIHRoaXJkLXdvcnN0IHRvcnR1cmUgZGV2aWNlIChBY2lkIGRyaXBzIG9uIGJhcmUgc2tpbiB3YXMgIzEgYW5kIGVsZWN0cm9jdXRpb24gd2FzICMyKS4gQnV0IGluIGFsbCBzZXJpb3VzbmVzcywgbXkgY29sbGVhZ3VlcyBhbmQgSSB3ZXJlIGN1cmlvdXMgYWJvdXQgaG93IE5pY2tlbGJhY2sgaXMgYWN0dWFsbHkgYSBjb21tZXJjaWFsbHktdmlhYmxlIGJhbmQuIFdpdGggYWxsIHRoZSByZWN5Y2xlZCBjb250ZW50IGFuZCByaHltaW5nIGZvciB0aGUgc2FrZSBvZiByaHltaW5nLCB5b3UnZCB0aGluayBwZW9wbGUgd291bGQgaGF2ZSBqdXN0IHN0b3BwZWQgbGlzdGVuaW5nLiAgCgpUbyBiZXR0ZXIgdW5kZXJzdGFuZCBOaWNrZWxiYWNrLCBJJ3ZlIGRlY2lkZWQgdG8gdGFrZSBvbmUgZm9yIHRoZSB0ZWFtIGFuZCBzY3JhcGUgdGhlaXIgbHlyaWNzIGFuZCBwZXJmb3JtIHNvbWUgdGV4dCBhbmFseXNpcy4gIAoKIyMgU2V0dGluZy1VcCBmb3IgU2NyYXBpbmcgYW5kIFRleHQgQW5hbHlzaXMKIyMjIFNwZWNpYWwgZnVuY3Rpb25zCkFzIGFsd2F5cywgc3BlY2lhbCBmdW5jdGlvbnMgYXJlIHNwZWNpZmllZCBmaXJzdC4gYHdpcGVfaHRtbGAgaXMgaGFuZHkgZm9yIGVsaW1pbmF0aW5nIGFubm95aW5nIEhUTUwgdGFncy4gSXQgZXNzZW50aWFsbHkgdGFrZXMgYSBIVE1MIHN0cmluZywgY2xlYW5zIG91dCB0aGUgSFRNTCB0YWdzLCBhbmQgcmV0dXJucyB0aGUgbm9uLUhUTUwgY29udGVudHMgb2YgdGhhdCBzdHJpbmcuIEl0J2xsIGNvbWUgaW4gaGFuZHkgaW4gYW55IHdlYi1zY3JhcGluZyB5b3UgZG8sIHNvIEkgaGlnaGx5IHJlY29tbWVuZCB5b3UgaGFuZyBvbnRvIGl0LgpgYGB7cn0KIyBmdW5jdGlvbiB0byBzY3J1YiBodG1sIG91dCBvZiB0ZXh0IC0tLS0Kd2lwZV9odG1sIDwtIGZ1bmN0aW9uKHN0cl9odG1sKSB7CiAgICAgICAgZ3N1YigiPC4qPz4iLCAiIiwgc3RyX2h0bWwpCn0KYGBgCgojIyMgTGlicmFyaWVzCkknbGwgYmUgdXNpbmcgSGFkbGV5IFdpY2toYW0ncyBgZHBseXJgIGFuZCBgcnZlc3RgLiBgZHBseXJgIGlzIHRoZSBnb2xkIHN0YW5kYXJkIGZvciBSIGRhdGEgd3JhbmdsaW5nLCBhbmQgYHJ2ZXN0YCBpcyBqdXN0IGNvbnZlbmllbnQgZm9yIHNjcmFwaW5nIHdlYiBkYXRhLgpgYGB7cn0KIyBsaWJyYXJpZXMgLS0tLQpwa2dzIDwtIGMoImRwbHlyIiwgInJ2ZXN0IiwgInRtIiwgIlNub3diYWxsQyIsICJ0b3BpY21vZGVscyIpCnZhcHBseShwa2dzLCByZXF1aXJlLCBjaGFyYWN0ZXIub25seSA9IFRSVUUsIEZVTi5WQUxVRSA9IDEpCmBgYAoKIyMjIFdlYi1TY3JhcGluZwojIyMjIEdldCB0aGUgc29uZyBsaXN0ClRoaXMgY29kZSBjaHVuayBzY3JhcGVzIHRoZSB0YWJsZSB3aXRoIHRoZSBzb25nIHRpdGxlcywgc2VsZWN0cyB0aGUgbmFtZSBvZiB0aGUgc29uZyBhbmQgdGhlIHllYXIgaXQgd2FzIHJlbGVhc2VkLCBhbmQgY3JlYXRlcyBhIGNvbHVtbiBjYWxsZWQgYHVybF9uYW1lYCB0aGF0IEknbGwgbmVlZCBpbiB0aGUgbmV4dCBjb2RlIGNodW5rLiAgCmBgYHtyfQojIGdldCB0aGUgSFRNTCBjb2RlIGZyb20gbWV0cm9seXJpY3MgLS0tLQpzb25nc191cmwgPC0gImh0dHBzOi8vZW4ud2lraXBlZGlhLm9yZy93aWtpL0xpc3Rfb2Zfc29uZ3NfcmVjb3JkZWRfYnlfTmlja2VsYmFjayIKaHRtbF9jb2RlIDwtIHJlYWRfaHRtbChzb25nc191cmwpCgojIGdldCBzb25ncyBmcm9tIHNvbmcgdGl0bGUgSFRNTCB0YWJsZSAtLS0tCmh0bWxfc29uZ3MgPC0gaHRtbF9ub2RlKGh0bWxfY29kZSwgeHBhdGggPSAnLy8qW0BpZD0ibXctY29udGVudC10ZXh0Il0vZGl2L3RhYmxlWzNdJykKc29uZ3NfZGYgPC0gaHRtbF90YWJsZShodG1sX3NvbmdzLCBoZWFkZXIgPSBUUlVFKQoKIyBkcm9wIG5hbWUgY29sdW1uIHRvIGxvd2VyY2FzZSBhbmQgcmVtb3ZlIGNvbHVtbnMgLS0tLQpuYW1lcyhzb25nc19kZikgPC0gdG9sb3dlcihnc3ViKCJcXHN8W1s6cHVuY3Q6XV0iLCAiXyIsIG5hbWVzKHNvbmdzX2RmKSkpCgojIGNsZWFuIHVwIHRoZSBzb25nc19kZiAic29uZyIgdmFyaWFibGUgLS0tLQpzb25nc19kZiRzb25nIDwtIGdzdWIoJ1xcIicsICIiLCBzb25nc19kZiRzb25nKQoKIyBhbGJ1bSBgaGVzaGVyYCBoYXMgdW5yZWxpYWJsZSBkYXRhOyBuaXggaXQgLS0tLQpzb25nc19kZiA8LSBzb25nc19kZiAlPiUgZmlsdGVyKHJlbGVhc2UgIT0gJ0hlc2hlcicpCmBgYAoKIyMjIEdldHRpbmcgdGhlIGx5cmljcwpMb29waW5nIG92ZXIgdGhlIGB1cmxfbmFtZWAgdmFyaWFibGUgZnJvbSB0aGUgcHJldmlvdXMgY29kZSBjaHVuayBtYWRlIHRoZSBtb3N0IHNlbnNlIGZvciB0aGlzLiAgCgpJbiB0aGlzIGNvZGUgY2h1bmssIEknbGwgY3JlYXRlIHR3byBlbXB0eSB2ZWN0b3JzIGZvciB0aGUgbHlyaWNzIGFuZCBhbGJ1bSBuYW1lcy4gTXkgYGZvcmAtbG9vcCB3aWxsIHRoZW4gbG9vcCBvdmVyIGVhY2ggZWxlbWVudCBpbiBgdXJsX25hbWVgLCBwYXN0ZSBpdCBpbnRvIGEgaHlwZXJsaW5rLCBzY3JhcGUgdGhlIGx5cmljcyBhbmQgYWxidW0gbmFtZSwgc2NydWIgb3V0IHRoZSBIVE1MLCBhbmQgcGxhY2UgZWFjaCBjYXB0dXJlZCBwaWVjZSBpbnRvIHRoZWlyIHJlc3BlY3RpdmUgdmVjdG9ycy4gVGhlIGRhdGEgZnJhbWUgYGFsYnVtX2x5cmljc2Agd2lsbCBob2xkIHRoZSBjb250ZW50cyBvZiBib3RoIHZlY3RvcnMuCmBgYHtyfQojIG5hdmlnYXRlIHRvIGVhY2ggc29uZydzIFVSTCBhbmQgc2NyYXBlIHRoZSBhbGJ1bSBuYW1lIGFuZCBseXJpY3MgLS0tLQojIyBlbXB0eSB2ZWN0b3JzCmx5cmljcyA8LSBjKCkKIyMgc3BlY2lmeSByb3cgbnVtYmVyIHRvIGFkZCB0byBkYXRhIGZyYW1lCm51bWJlciA8LSAxCiMjIGZvci1sb29wIGFuZCBjcmVhdGUgZGF0YSBmcmFtZSBmcm9tIHRoZSB0d28gdmVjdG9ycwpmb3IoaSBpbiBzZXFfYWxvbmcoc29uZ3NfZGYkc29uZykpIHsKICAgICAgICBmb3JfdXJsX25hbWUgPC0gc29uZ3NfZGYkc29uZ1tpXQogICAgICAgICMjIGNsZWFuIHVwIHNvbmcgbmFtZSBmb3IgVVJMCiAgICAgICAgZm9yX3VybF9uYW1lIDwtIHRvbG93ZXIoZ3N1YigiW1s6cHVuY3Q6XV18XFxzIiwgIiIsIGZvcl91cmxfbmFtZSkpCiAgICAgICAgIyMgY3JlYXRlIHVybAogICAgICAgIHBhc3RlX3VybCA8LSBwYXN0ZTAoImh0dHA6Ly93d3cuYXpseXJpY3MuY29tL2x5cmljcy9uaWNrZWxiYWNrLyIsCiAgICAgICAgICAgICAgICAgICAgICAgICAgICBmb3JfdXJsX25hbWUsICIuaHRtbCIpCiAgICAgICAgIyMgb3BlbiBjb25uZWN0aW9uIHRvIHVybAogICAgICAgIGZvcl9odG1sX2NvZGUgPC0gcmVhZF9odG1sKHBhc3RlX3VybCkKICAgICAgICAjIyBzY3JhcGUgbHlyaWNzIHZpYSBzZWxlY3RvciBwYXRoCiAgICAgICAgZm9yX2x5cmljcyA8LSBodG1sX25vZGUoZm9yX2h0bWxfY29kZSwgCiAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgeHBhdGggPSAiL2h0bWwvYm9keS9kaXZbM10vZGl2L2RpdlsyXS9kaXZbNV0iKQogICAgICAgICMjIHNjcnViIGh0bWwgYW5kIGNvbnRyb2wgY2hhcmFjdGVycyBvdXQgb2YgdGhlIGx5cmljcwogICAgICAgIGZvcl9seXJpY3MgPC0gd2lwZV9odG1sKGZvcl9seXJpY3MpCiAgICAgICAgZm9yX2x5cmljcyA8LSBnc3ViKCJbWzpjbnRybDpdXSIsICIgIiwgZm9yX2x5cmljcykKICAgICAgICAjIyBhZGQgZm9yX2x5cmljcyB0byByZXNwZWN0aXZlIHZlY3RvcnMgCiAgICAgICAgbHlyaWNzW251bWJlcl0gPC0gZm9yX2x5cmljcwogICAgICAgIG51bWJlciA8LSBudW1iZXIgKyAxCiAgICAgICAgIyMgc3RhdHVzIGNoZWNrCiAgICAgICAgc2hvdyhwYXN0ZTAoZm9yX3VybF9uYW1lLCAiIHNjcmFwZSBjb21wbGV0ZSEiKSkKICAgICAgICAjIyBvcHRpb25hbDogYWRkIGluIDEwIHNlY29uZCBkZWxheSB0byBhdm9pZCBJUCBibG9jawogICAgICAgIFN5cy5zbGVlcCgxMCkKfQoKIyBiaW5kIGRhdGEgZnJhbWVzIHRvZ2V0aGVyIGFuZCBzdHJpcCAiTHlyaWNzIiBvdXQgb2YgbmFtZSB2YXJpYWJsZSAtLS0tCmlmKG5yb3coc29uZ3NfZGZfY2xlYW4pICE9IG5yb3coYWxidW1fbHlyaWNzKSkgewogICAgICAgIHN0b3AoImRhdGEgZnJhbWVzIGhhdmUgZGlmZmVyZW50IG51bWJlciBvZiByb3dzIikKfSBlbHNlIHsKICAgICAgICBuYl9kYXRhIDwtIGJpbmRfY29scyhzb25nc19kZl9jbGVhbiwgYWxidW1fbHlyaWNzKSAlPiUgCiAgICAgICAgICAgICAgICBtdXRhdGUobmFtZSA9IGdzdWIoIkx5cmljcyIsICIiLCBuYW1lKSkKfQoKIyBmaWx0ZXIgb3V0IG1pc3NpbmcgYWxidW0gbmFtZXMgYW5kIHNob3cgdGhlIGZpcnN0IDEwIHJvd3Mgb2YgbmJfZGF0YSAtLS0tCm5iX2RhdGEgPC0gbmJfZGF0YSAlPiUgZmlsdGVyKCFpcy5uYShhbGJ1bSkpCm5iX2RhdGFbMToxMCwgXQoKYGBgCgojIFVwZGF0ZTogSSBnb3QgYnVzdGVkIGZvciBzY3JhcGluZyB0b28gcXVpY2tseSBieSBhemx5cmljcy5jb20gYW5kIHJlY2VpdmVkIGEgcGVybWFuZW50IGJhbiBvbiBteSBJUCBhZGRyZXNzLiBIZXJlJ3MgdG8gbXkgSVNQIHN3aXRjaGluZyB1cCBteSBhZGRyZXNzIHNvb24uIFRoaXMgY29kZSBpcyBzdGlsbCBlbnRpcmVseSByZXByb2R1Y2libGUsICoqc28gbWFrZSBzdXJlIHlvdSB1c2UgdGhlIFN5cy5zbGVlcCgpIGZ1bmN0aW9uKiogSSBpbnNlcnRlZCBpbnRvIHRoZSAiR2V0dGluZyBseXJpY3MiIGNodW5rIHNvIHlvdSBjYW4gZGVsYXkgdGhlIGV4ZWN1dGlvbiBvZiB0aGUgbG9vcCBieSAxMCBzZWNvbmRzLgoKCgoKCgoKCgo=