Objective

Today’s objective is to learn how to scrape data from a target website using R. Although Python’s BeautifulSoup is a very popular choice for web scraping, it does not hurt to try the same thing in R, taking advantage of the abundance of packages R offers.

Load Packages

library(rvest)       # Scraping web pages (also provides the %>% pipe)
# library(magrittr)  # Pipe operators
# library(scales)    # Scale functions for visualization
library(lubridate)   # Dates and times made easy
library(dplyr)       # Data manipulation
library(ggplot2)     # Grammar of graphics for visualization
library(tidyverse)
library(knitr)       # Table display
# library(xml2)      # Parse XML; installed as a dependency of rvest

Simplest Scraping Example

Let’s start by scraping the rating of the movie Joker (2019) from the IMDb website.

# IMDB URL for the Movie - JOKER (2019)
url <- "https://www.imdb.com/title/tt7286456/?ref_=hm_fanfav_tt_2_pd_fp1"

rating <- url %>% 
            read_html() %>% 
            html_nodes("strong span") %>% 
            html_text()

rating  
[1] "8.5"
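
To see what the selector "strong span" does, here is a minimal sketch run against a made-up HTML fragment instead of the live page. The markup below is an assumption for illustration only, not IMDb's actual source. The selector keeps only span elements nested inside a strong element.

```r
library(rvest)

# Hypothetical markup, invented for this illustration
snippet <- read_html('<html><body>
  <div><strong><span>8.5</span></strong><span>/10</span></div>
</body></html>')

# "strong span" matches span elements that are descendants of a strong element,
# so the "/10" span outside the strong element is skipped
rating_demo <- snippet %>%
  html_nodes("strong span") %>%
  html_text()

rating_demo
```

If the site changes its markup, a selector like this will silently return an empty vector, so it is worth checking the length of the result before using it.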

Thus, we have successfully scraped the rating of the movie Joker.
Now, let’s move one step further.

Basic harvesting: The Billboard Hot 100 page

We are going to start by mining the Billboard Hot 100 page at https://www.billboard.com/charts/hot-100, as of June 24, 2020.
We will harvest basic information from this page: position number, artist, and song title for each Hot 100 entry.

# Target Web site URL
hot100page <- "https://www.billboard.com/charts/hot-100"

# Download the page and parse its HTML
hot100 <- xml2::read_html(hot100page)

  • Looking at the page source of the Billboard website, we can see that each field of a chart entry sits in a span element with a distinctive class name.
  • Class name "chart-element__rank__number" gives the chart ranking
  • Class name "chart-element__information__artist" gives the name of the artist(s)
  • Class name "chart-element__information__song" gives the name of the song
  • We will use the function xml_find_all() to find all span nodes whose class attribute contains the class name we want. xml_find_all() accepts XPath syntax.

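
As a small offline sketch of this XPath matching, the snippet below runs xml_find_all() on a hand-written HTML fragment. The fragment, including the extra "text--truncate" class and the song names, is invented for illustration and only mimics the class names listed above.

```r
library(xml2)

# Hypothetical markup that mimics the class names listed above
page <- read_html('<html><body><ol>
  <li><span class="chart-element__rank__number">1</span>
      <span class="chart-element__information__song text--truncate">Song A</span></li>
  <li><span class="chart-element__rank__number">2</span>
      <span class="chart-element__information__song text--truncate">Song B</span></li>
</ol></body></html>')

# contains(@class, ...) is a substring match, so extra classes on the
# element do not prevent a match
songs <- xml_text(xml_find_all(page, "//span[contains(@class, 'chart-element__information__song')]"))
ranks <- xml_text(xml_find_all(page, "//span[contains(@class, 'chart-element__rank__number')]"))

songs
ranks
```

Because the expression starts with "//", it searches the whole document regardless of the node it is called on.
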
# Extract rank of the song
rank <- hot100 %>% 
  rvest::html_nodes('body') %>% 
  xml2::xml_find_all("//span[contains(@class, 'chart-element__rank__number')]") %>% 
  rvest::html_text()

# Extract artist of the song
artist <- hot100 %>% 
  rvest::html_nodes('body') %>% 
  xml2::xml_find_all("//span[contains(@class, 'chart-element__information__artist')]") %>% 
  rvest::html_text()

# Extract title of the song
title <- hot100 %>% 
  rvest::html_nodes('body') %>% 
  xml2::xml_find_all("//span[contains(@class, 'chart-element__information__song')]") %>% 
  rvest::html_text()


#charts > div > div.chart-list.container > ol > li:nth-child(1) > button > div > div.chart-element__meta.text--center.color--secondary.text--peak

chart_top_100 <- data.frame(rank, artist, title)
head(chart_top_100, 20)
   rank                                artist              title
1     1                 6ix9ine & Nicki Minaj             Trollz
2     2          DaBaby Featuring Roddy Ricch           Rockstar
3     3                              Lil Baby The Bigger Picture
4     4 Megan Thee Stallion Featuring Beyonce             Savage
5     5                            The Weeknd    Blinding Lights
6     6        Doja Cat Featuring Nicki Minaj             Say So
7     7         Justin Bieber Featuring Quavo         Intentions
8     8                             SAINt JHN              Roses
9     9             Lady Gaga & Ariana Grande         Rain On Me
10   10                           Roddy Ricch            The Box
11   11                             Lil Mosey    Blueberry Faygo
12   12                                 Drake       Toosie Slide
13   13                           Post Malone            Circles
14   14                              Dua Lipa    Don't Start Now
15   15                          Maren Morris          The Bones
16   16                          Harry Styles          Adore You
17   17                Future Featuring Drake       Life Is Good
18   18                           Jack Harlow       Whats Poppin
19   19                          Harry Styles   Watermelon Sugar
20   20                         Trevor Daniel            Falling

We were able to extract the Billboard Hot 100 songs for the week of June 27, 2020.

Automation to extract Top 100 as per user’s requirement

We have already done the hard work of locating the relevant information on the Billboard website.
If we browse the Hot 100 for different weeks on the Billboard website, we will see that the URL follows a standard format: the chart date is appended to the chart address.

For example, to view the Top 100 list for the week of June 13, 2020, the URL is https://www.billboard.com/charts/hot-100/2020-06-13
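
The URL pattern can be sketched as a small helper. The name chart_url() is hypothetical and not part of the original script; it only assembles the address and does not contact the website.

```r
# Hypothetical helper (not part of the original script): build a chart URL
chart_url <- function(date, type = "hot-100") {
  # format() yields the YYYY-MM-DD form whether date is a string or a Date
  paste0("https://www.billboard.com/charts/", type, "/",
         format(as.Date(date), "%Y-%m-%d"))
}

chart_url("2020-06-13")
```

Formatting the date explicitly keeps the URL stable even if the caller passes a Date object instead of a string.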

Therefore, we can create a function to return the Top 100 songs based on any given date.

get_chart <- function(date = Sys.Date(), positions = c(1:10), type = "hot-100") {

  week_date <- as.Date(date)
  
  # get url from input and read html
  input <- paste0("https://www.billboard.com/charts/", type, "/", date) 
  chart_page <- xml2::read_html(input)
  
  # scrape data
  rank <- chart_page %>% 
    rvest::html_nodes('body') %>% 
    xml2::xml_find_all("//span[contains(@class, 'chart-element__rank__number')]") %>% 
    rvest::html_text()
  
  artist <- chart_page %>% 
    rvest::html_nodes('body') %>% 
    xml2::xml_find_all("//span[contains(@class, 'chart-element__information__artist')]") %>% 
    rvest::html_text()
  
  title <- chart_page %>% 
    rvest::html_nodes('body') %>% 
    xml2::xml_find_all("//span[contains(@class, 'chart-element__information__song')]") %>% 
    rvest::html_text()

  # create dataframe, remove nas and return result
  chart_df <- data.frame(week_date, rank, artist, title)
  
  chart_df <- chart_df %>% 
    dplyr::filter(!is.na(rank), rank %in% positions)

  chart_df

}

# Testing the function

top_100 <- get_chart(date = "2020-06-13", positions = c(1:20), type = "hot-100")

head(top_100, 20)
    week_date rank                                artist           title
1  2020-06-13    1          DaBaby Featuring Roddy Ricch        Rockstar
2  2020-06-13    2 Megan Thee Stallion Featuring Beyonce          Savage
3  2020-06-13    3                            The Weeknd Blinding Lights
4  2020-06-13    4        Doja Cat Featuring Nicki Minaj          Say So
5  2020-06-13    5             Lady Gaga & Ariana Grande      Rain On Me
6  2020-06-13    6                                 Drake    Toosie Slide
7  2020-06-13    7                              Dua Lipa Don't Start Now
8  2020-06-13    8         Justin Bieber Featuring Quavo      Intentions
9  2020-06-13    9                           Roddy Ricch         The Box
10 2020-06-13   10                             SAINt JHN           Roses
11 2020-06-13   11                           Post Malone         Circles
12 2020-06-13   12                          Harry Styles       Adore You
13 2020-06-13   13                Future Featuring Drake    Life Is Good
14 2020-06-13   14                          Maren Morris       The Bones
15 2020-06-13   15                             Lil Mosey Blueberry Faygo
16 2020-06-13   16                         Morgan Wallen     Chasin' You
17 2020-06-13   17         Ariana Grande & Justin Bieber    Stuck With U
18 2020-06-13   18                         Trevor Daniel         Falling
19 2020-06-13   19                         Gabby Barrett          I Hope
20 2020-06-13   20                              Surfaces     Sunday Best
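
A request can fail (for example, when the site is unreachable), and an unhandled error would stop a longer scraping run. The sketch below shows one possible guard using tryCatch(). The names safe_fetch() and failing_fetch() are hypothetical, and the failing function is a stand-in so the example runs without network access.

```r
# Hypothetical wrapper: run any fetch function, return NULL on error
safe_fetch <- function(fetch, ...) {
  tryCatch(
    fetch(...),
    error = function(e) {
      message("Request failed: ", conditionMessage(e))
      NULL
    }
  )
}

# Stand-in for a scraper that fails, so the example runs offline
failing_fetch <- function(date) stop("could not reach the site")

result <- safe_fetch(failing_fetch, date = "2020-06-13")
is.null(result)
```

With the function defined earlier, the call would look like safe_fetch(get_chart, date = "2020-06-13"), and a NULL result signals that the week should be skipped or retried.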

Creating a list of the Top 100 for every week since January 2020

# Saturday of week 1 of 2020
start_date <- as.Date("2020-01-04")

# end_date as today's date
end_date <- Sys.Date()

loop_date <- start_date

top_100_complete <- data.frame()

while (loop_date <= end_date) {
  
  top_100 <- get_chart(date = loop_date, positions = c(1:100), type = "hot-100")
  top_100_complete <- rbind(top_100_complete, top_100)
  loop_date <- loop_date + 7
  
}
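
An alternative to the while loop, sketched under the assumption that the weekly dates are known up front: generate them with seq() and bind the results once at the end. fake_chart() is a hypothetical stand-in for get_chart() so the sketch runs offline, and the end date is fixed here for reproducibility.

```r
# Weekly chart dates from the first Saturday of 2020 to a fixed end date
week_dates <- seq(as.Date("2020-01-04"), as.Date("2020-06-20"), by = "week")

# Stand-in for get_chart() so the sketch runs without network access
fake_chart <- function(date) {
  data.frame(week_date = as.Date(date), rank = 1:3)
}

# Collect one data frame per week, then bind them in a single step
weekly <- lapply(seq_along(week_dates), function(i) fake_chart(week_dates[i]))
combined <- do.call(rbind, weekly)

length(week_dates)
nrow(combined)
```

Binding once at the end avoids copying the growing data frame on every iteration. When scraping the real site, adding a short Sys.sleep() between requests is a courteous default.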


str(top_100_complete)
'data.frame':   2500 obs. of  4 variables:
 $ week_date: Date, format: "2020-01-04" "2020-01-04" ...
 $ rank     : Factor w/ 100 levels "1","10","100",..: 1 13 24 35 46 57 68 79 90 2 ...
 $ artist   : Factor w/ 296 levels "Andy Williams",..: 56 9 8 10 65 3 1 47 57 53 ...
 $ title    : Factor w/ 466 levels "(There's No Place Like) Home For The Holidays",..: 6 72 47 4 19 73 46 80 58 30 ...
# Top 20 entries of the table
head(top_100_complete, 20)
    week_date rank                     artist
1  2020-01-04    1               Mariah Carey
2  2020-01-04    2                 Brenda Lee
3  2020-01-04    3                Bobby Helms
4  2020-01-04    4                  Burl Ives
5  2020-01-04    5                Post Malone
6  2020-01-04    6             Arizona Zervas
7  2020-01-04    7              Andy Williams
8  2020-01-04    8              Lewis Capaldi
9  2020-01-04    9                   Maroon 5
10 2020-01-04   10                      Lizzo
11 2020-01-04   11                      Wham!
12 2020-01-04   12             Jose Feliciano
13 2020-01-04   13                Roddy Ricch
14 2020-01-04   14                Tones And I
15 2020-01-04   15                Dean Martin
16 2020-01-04   16              Nat King Cole
17 2020-01-04   17 Dan + Shay & Justin Bieber
18 2020-01-04   18               Selena Gomez
19 2020-01-04   19      Mustard & Roddy Ricch
20 2020-01-04   20                     DaBaby
                                         title
1              All I Want For Christmas Is You
2            Rockin' Around The Christmas Tree
3                             Jingle Bell Rock
4                      A Holly Jolly Christmas
5                                      Circles
6                                      Roxanne
7     It's The Most Wonderful Time Of The Year
8                            Someone You Loved
9                                     Memories
10                                Good As Hell
11                              Last Christmas
12                               Feliz Navidad
13                                     The Box
14                                Dance Monkey
15       Let It Snow, Let It Snow, Let It Snow
16 The Christmas Song (Merry Christmas To You)
17                                10,000 Hours
18                         Lose You To Love Me
19                                     Ballin'
20                                         BOP
# Bottom 20 entries of the table
tail(top_100_complete, 20)
      week_date rank                                       artist
2481 2020-06-20   81                           Lil Baby & 42 Dugg
2482 2020-06-20   82                        Lady Gaga & BLACKPINK
2483 2020-06-20   83                                 Justin Moore
2484 2020-06-20   84                                     Rod Wave
2485 2020-06-20   85                                   Kane Brown
2486 2020-06-20   86                    NAV, Gunna & Travis Scott
2487 2020-06-20   87 HARDY Featuring Lauren Alaina & Devin Dawson
2488 2020-06-20   88                    Noah Cyrus & Leon Bridges
2489 2020-06-20   89                  Gunna Featuring Roddy Ricch
2490 2020-06-20   90          Black Eyed Peas, Ozuna + J.Rey Soul
2491 2020-06-20   91                                     Lil Baby
2492 2020-06-20   92                                 Lil Uzi Vert
2493 2020-06-20   93                               Ashley McBryde
2494 2020-06-20   94                                    Lee Brice
2495 2020-06-20   95                             Brantley Gilbert
2496 2020-06-20   96                                  Brett Young
2497 2020-06-20   97                   surf mesa Featuring Emilee
2498 2020-06-20   98                                       Polo G
2499 2020-06-20   99  Future Featuring YoungBoy Never Broke Again
2500 2020-06-20  100                                   Kane Brown
                   title
2481               Grace
2482          Sour Candy
2483        Why We Drink
2484   Girl Of My Dreams
2485          Cool Again
2486               Turks
2487            One Beer
2488                July
2489 Cooler Than A Bitch
2490            Mamacita
2491              All In
2492            That Way
2493 One Night Standards
2494   One Of Them Girls
2495           Hard Days
2496               Catch
2497                 ily
2498                  21
2499        Trillionaire
2500 Worldwide Beautiful
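
The str() output above shows rank stored as a factor whose levels sort as text ("1", "10", "100", ...). If numeric ordering is needed, one option is to convert the column to integer. The data frame below is a made-up stand-in for the scraped table.

```r
# Made-up stand-in for the scraped table, with rank stored as a factor
demo <- data.frame(rank = factor(c("1", "10", "100", "2")),
                   title = c("A", "B", "C", "D"))

# as.integer() on a factor returns level codes, so convert via character first
demo$rank <- as.integer(as.character(demo$rank))

# Sort by numeric rank
demo_sorted <- demo[order(demo$rank), ]
demo_sorted
```

Alternatively, passing stringsAsFactors = FALSE to data.frame() when building the table keeps the scraped columns as character vectors in the first place.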

Conclusion

This script can therefore be used to scrape data from the Billboard website for any of the charts you might be looking for, by changing the type and date arguments.

You can find my other publications at https://rpubs.com/Mayank7j_2020