Today’s learning objective is to scrape data from a target website using R. Although Python’s BeautifulSoup is a very popular choice for web scraping, it does not hurt to try the same thing in R, leveraging the abundance of packages available in R.
library(rvest) # For scraping web pages (also provides the %>% pipe)
# library(magrittr) # Pipe operators
# library(scales) # Scale functions for visualization
library(lubridate) # Date and time made easy
library(dplyr) # Data manipulation package
library(ggplot2) # Grammar of Graphics for visualization
library(tidyverse)
library(knitr) # For beautiful table display
# library(xml2) # Parse XML; installed as a dependency of rvest
Let’s start by scraping the rating of the movie Joker (2019) from the IMDb website.
# IMDB URL for the Movie - JOKER (2019)
url <- "https://www.imdb.com/title/tt7286456/?ref_=hm_fanfav_tt_2_pd_fp1"
rating <- url %>%
  read_html() %>%
  html_nodes("strong span") %>%
  html_text()
rating
[1] "8.5"
Thus, we have successfully scraped the rating of the movie Joker.
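Note that the scraped rating is a character string, so it needs converting before we can do any arithmetic with it. Here is a minimal sketch of that step. It uses a small hand-written HTML fragment as a stand-in for the live page (an assumption, since the real IMDb markup may differ or change), so it runs without a network connection.

```r
library(rvest)

# Hypothetical fragment that mimics the "strong span" structure targeted above
page <- xml2::read_html(
  "<html><body><strong><span>8.5</span></strong></body></html>"
)

rating_text <- page %>%
  html_nodes("strong span") %>%
  html_text()

# html_text() returns character; convert to numeric for calculations
rating_num <- as.numeric(rating_text)
rating_num
```

If the selector matches nothing, `rating_text` is an empty character vector, so checking `length(rating_text)` before converting is a cheap safeguard.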
Now, let’s try to move on one step further.
We are going to mine the Billboard Hot 100 page at https://www.billboard.com/charts/hot-100, as of June 24, 2020.
We will harvest basic info from this page: Position Number, Artist, Song Title for the Hot 100.
# Target Web site URL
hot100page <- "https://www.billboard.com/charts/hot-100"
# Capture the HTML and XML level website information
hot100 <- xml2::read_html(hot100page)
# Extract rank of the song
rank <- hot100 %>%
  rvest::html_nodes('body') %>%
  xml2::xml_find_all("//span[contains(@class, 'chart-element__rank__number')]") %>%
  rvest::html_text()
# Extract artist of the song
artist <- hot100 %>%
  rvest::html_nodes('body') %>%
  xml2::xml_find_all("//span[contains(@class, 'chart-element__information__artist')]") %>%
  rvest::html_text()
# Extract title of the song
title <- hot100 %>%
  rvest::html_nodes('body') %>%
  xml2::xml_find_all("//span[contains(@class, 'chart-element__information__song')]") %>%
  rvest::html_text()
#charts > div > div.chart-list.container > ol > li:nth-child(1) > button > div > div.chart-element__meta.text--center.color--secondary.text--peak
chart_top_100 <- data.frame(rank, artist, title)
head(chart_top_100, 20)
rank artist title
1 1 6ix9ine & Nicki Minaj Trollz
2 2 DaBaby Featuring Roddy Ricch Rockstar
3 3 Lil Baby The Bigger Picture
4 4 Megan Thee Stallion Featuring Beyonce Savage
5 5 The Weeknd Blinding Lights
6 6 Doja Cat Featuring Nicki Minaj Say So
7 7 Justin Bieber Featuring Quavo Intentions
8 8 SAINt JHN Roses
9 9 Lady Gaga & Ariana Grande Rain On Me
10 10 Roddy Ricch The Box
11 11 Lil Mosey Blueberry Faygo
12 12 Drake Toosie Slide
13 13 Post Malone Circles
14 14 Dua Lipa Don't Start Now
15 15 Maren Morris The Bones
16 16 Harry Styles Adore You
17 17 Future Featuring Drake Life Is Good
18 18 Jack Harlow Whats Poppin
19 19 Harry Styles Watermelon Sugar
20 20 Trevor Daniel Falling
We were able to extract the Billboard Hot 100 songs for the week of June 27, 2020.
Now, we have already done the hard work of finding the relevant information on the Billboard website.
If we browse the Billboard website for the Hot 100 of different weeks, we will observe that the website URL follows a standard format.
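To see how the class-based XPath matching behaves, it can help to try it on a tiny document first. The sketch below is an illustration only: the HTML snippet is hand-written (hypothetical) and merely reuses the class names from the selectors above, and `span_text()` is a helper name introduced here. It also checks that the three vectors have equal length, because `data.frame()` fails when they do not.

```r
library(rvest)

# Hypothetical two-entry snippet reusing the class names from the selectors above
snippet <- xml2::read_html('
<html><body><ol>
  <li>
    <span class="chart-element__rank__number">1</span>
    <span class="chart-element__information__song text--truncate">Song A</span>
    <span class="chart-element__information__artist text--truncate">Artist A</span>
  </li>
  <li>
    <span class="chart-element__rank__number">2</span>
    <span class="chart-element__information__song text--truncate">Song B</span>
    <span class="chart-element__information__artist text--truncate">Artist B</span>
  </li>
</ol></body></html>')

# Hypothetical helper: text of every span whose class attribute contains `cls`
span_text <- function(doc, cls) {
  xpath <- sprintf("//span[contains(@class, '%s')]", cls)
  doc %>%
    html_nodes(xpath = xpath) %>%
    html_text()
}

rank   <- span_text(snippet, "chart-element__rank__number")
artist <- span_text(snippet, "chart-element__information__artist")
title  <- span_text(snippet, "chart-element__information__song")

# data.frame() errors on vectors of unequal length, so verify first
stopifnot(length(rank) == length(artist), length(artist) == length(title))

chart <- data.frame(rank = as.integer(rank), artist, title,
                    stringsAsFactors = FALSE)
chart
```

Because the XPath expression starts with `//`, it searches the whole document, which is why this sketch can skip the intermediate `html_nodes('body')` step.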
For example, to view the Top 100 list for the week of June 13, 2020, the URL is https://www.billboard.com/charts/hot-100/2020-06-13
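That URL pattern can be reproduced with a few lines of base R. The helper below is a sketch; the name `chart_url()` is introduced here for illustration and is not part of any package.

```r
# Hypothetical helper: build the chart URL for a given date and chart type
chart_url <- function(date = Sys.Date(), type = "hot-100") {
  # as.Date() rejects malformed input; format() guarantees YYYY-MM-DD
  date_str <- format(as.Date(date), "%Y-%m-%d")
  paste0("https://www.billboard.com/charts/", type, "/", date_str)
}

chart_url("2020-06-13")
```

Passing the date through `as.Date()` first means a typo such as `"2020-13-06"` fails early instead of producing a request for a page that does not exist.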
Therefore, we can create a function to return the Top 100 songs based on any given date.
get_chart <- function(date = Sys.Date(), positions = c(1:10), type = "hot-100") {
  week_date <- as.Date(date)
  # get url from input and read html
  input <- paste0("https://www.billboard.com/charts/", type, "/", date)
  chart_page <- xml2::read_html(input)
  # scrape data
  rank <- chart_page %>%
    rvest::html_nodes('body') %>%
    xml2::xml_find_all("//span[contains(@class, 'chart-element__rank__number')]") %>%
    rvest::html_text()
  artist <- chart_page %>%
    rvest::html_nodes('body') %>%
    xml2::xml_find_all("//span[contains(@class, 'chart-element__information__artist')]") %>%
    rvest::html_text()
  title <- chart_page %>%
    rvest::html_nodes('body') %>%
    xml2::xml_find_all("//span[contains(@class, 'chart-element__information__song')]") %>%
    rvest::html_text()
  # create dataframe, remove nas and return result
  chart_df <- data.frame(week_date, rank, artist, title)
  chart_df <- chart_df %>%
    dplyr::filter(!is.na(rank), rank %in% positions)
  chart_df
}
# Testing the function
top_100 <- get_chart(date = "2020-06-13", positions = c(1:20), type = "hot-100")
head(top_100, 20)
week_date rank artist title
1 2020-06-13 1 DaBaby Featuring Roddy Ricch Rockstar
2 2020-06-13 2 Megan Thee Stallion Featuring Beyonce Savage
3 2020-06-13 3 The Weeknd Blinding Lights
4 2020-06-13 4 Doja Cat Featuring Nicki Minaj Say So
5 2020-06-13 5 Lady Gaga & Ariana Grande Rain On Me
6 2020-06-13 6 Drake Toosie Slide
7 2020-06-13 7 Dua Lipa Don't Start Now
8 2020-06-13 8 Justin Bieber Featuring Quavo Intentions
9 2020-06-13 9 Roddy Ricch The Box
10 2020-06-13 10 SAINt JHN Roses
11 2020-06-13 11 Post Malone Circles
12 2020-06-13 12 Harry Styles Adore You
13 2020-06-13 13 Future Featuring Drake Life Is Good
14 2020-06-13 14 Maren Morris The Bones
15 2020-06-13 15 Lil Mosey Blueberry Faygo
16 2020-06-13 16 Morgan Wallen Chasin' You
17 2020-06-13 17 Ariana Grande & Justin Bieber Stuck With U
18 2020-06-13 18 Trevor Daniel Falling
19 2020-06-13 19 Gabby Barrett I Hope
20 2020-06-13 20 Surfaces Sunday Best
# Saturday of week 1 of 2020
start_date <- as.Date("2020-01-04")
# end_date as today's date
end_date <- Sys.Date()
loop_date <- start_date
top_100_complete <- data.frame()
while (loop_date <= end_date) {
  top_100 <- get_chart(date = loop_date, positions = c(1:100), type = "hot-100")
  top_100_complete <- rbind(top_100_complete, top_100)
  loop_date <- loop_date + 7
}
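An alternative to growing the data frame inside a `while` loop is to generate all weekly dates up front with `seq()`, collect one data frame per week in a list, and bind them once at the end. The sketch below illustrates the pattern offline: `fake_chart()` is a hypothetical stand-in for `get_chart()`, and the end date is fixed at June 20, 2020 purely for the example.

```r
# Weekly chart dates from the first Saturday of 2020 up to a fixed end date
week_dates <- seq(as.Date("2020-01-04"), as.Date("2020-06-20"), by = "week")

# Hypothetical stub standing in for get_chart(), so the sketch runs offline
fake_chart <- function(date) {
  data.frame(week_date = date, rank = 1:3, stringsAsFactors = FALSE)
}

# One data frame per week; indexing keeps each element a Date
weekly <- lapply(seq_along(week_dates), function(i) fake_chart(week_dates[i]))

# Bind all weeks in a single step instead of calling rbind() on every iteration
all_weeks <- do.call(rbind, weekly)

length(week_dates)
nrow(all_weeks)
```

With the real `get_chart()`, adding a short `Sys.sleep()` pause between requests is a considerate choice when fetching many pages in a row.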
str(top_100_complete)
'data.frame': 2500 obs. of 4 variables:
$ week_date: Date, format: "2020-01-04" "2020-01-04" ...
$ rank : Factor w/ 100 levels "1","10","100",..: 1 13 24 35 46 57 68 79 90 2 ...
$ artist : Factor w/ 296 levels "Andy Williams",..: 56 9 8 10 65 3 1 47 57 53 ...
$ title : Factor w/ 466 levels "(There's No Place Like) Home For The Holidays",..: 6 72 47 4 19 73 46 80 58 30 ...
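The `str()` output shows `rank` stored as a factor whose levels sort as text ("1", "10", "100", ...). Converting such a factor to numbers needs a detour through character, because `as.integer()` on a factor returns the internal level codes. A short sketch with a made-up four-element vector:

```r
# Rank stored as a factor, as in the str() output above (made-up values)
rank <- factor(c("1", "2", "10", "100"))

# Level codes, not the ranks themselves
codes <- as.integer(rank)

# The actual ranks: convert to character first, then to integer
ranks <- as.integer(as.character(rank))

codes
ranks
```

On R 4.0.0 and later, `data.frame()` no longer converts strings to factors by default; on older versions, passing `stringsAsFactors = FALSE` keeps the columns as character in the first place.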
# Top 20 entries of the table
head(top_100_complete, 20)
week_date rank artist
1 2020-01-04 1 Mariah Carey
2 2020-01-04 2 Brenda Lee
3 2020-01-04 3 Bobby Helms
4 2020-01-04 4 Burl Ives
5 2020-01-04 5 Post Malone
6 2020-01-04 6 Arizona Zervas
7 2020-01-04 7 Andy Williams
8 2020-01-04 8 Lewis Capaldi
9 2020-01-04 9 Maroon 5
10 2020-01-04 10 Lizzo
11 2020-01-04 11 Wham!
12 2020-01-04 12 Jose Feliciano
13 2020-01-04 13 Roddy Ricch
14 2020-01-04 14 Tones And I
15 2020-01-04 15 Dean Martin
16 2020-01-04 16 Nat King Cole
17 2020-01-04 17 Dan + Shay & Justin Bieber
18 2020-01-04 18 Selena Gomez
19 2020-01-04 19 Mustard & Roddy Ricch
20 2020-01-04 20 DaBaby
title
1 All I Want For Christmas Is You
2 Rockin' Around The Christmas Tree
3 Jingle Bell Rock
4 A Holly Jolly Christmas
5 Circles
6 Roxanne
7 It's The Most Wonderful Time Of The Year
8 Someone You Loved
9 Memories
10 Good As Hell
11 Last Christmas
12 Feliz Navidad
13 The Box
14 Dance Monkey
15 Let It Snow, Let It Snow, Let It Snow
16 The Christmas Song (Merry Christmas To You)
17 10,000 Hours
18 Lose You To Love Me
19 Ballin'
20 BOP
# Bottom 20 entries of the table
tail(top_100_complete, 20)
week_date rank artist
2481 2020-06-20 81 Lil Baby & 42 Dugg
2482 2020-06-20 82 Lady Gaga & BLACKPINK
2483 2020-06-20 83 Justin Moore
2484 2020-06-20 84 Rod Wave
2485 2020-06-20 85 Kane Brown
2486 2020-06-20 86 NAV, Gunna & Travis Scott
2487 2020-06-20 87 HARDY Featuring Lauren Alaina & Devin Dawson
2488 2020-06-20 88 Noah Cyrus & Leon Bridges
2489 2020-06-20 89 Gunna Featuring Roddy Ricch
2490 2020-06-20 90 Black Eyed Peas, Ozuna + J.Rey Soul
2491 2020-06-20 91 Lil Baby
2492 2020-06-20 92 Lil Uzi Vert
2493 2020-06-20 93 Ashley McBryde
2494 2020-06-20 94 Lee Brice
2495 2020-06-20 95 Brantley Gilbert
2496 2020-06-20 96 Brett Young
2497 2020-06-20 97 surf mesa Featuring Emilee
2498 2020-06-20 98 Polo G
2499 2020-06-20 99 Future Featuring YoungBoy Never Broke Again
2500 2020-06-20 100 Kane Brown
title
2481 Grace
2482 Sour Candy
2483 Why We Drink
2484 Girl Of My Dreams
2485 Cool Again
2486 Turks
2487 One Beer
2488 July
2489 Cooler Than A Bitch
2490 Mamacita
2491 All In
2492 That Way
2493 One Night Standards
2494 One Of Them Girls
2495 Hard Days
2496 Catch
2497 ily
2498 21
2499 Trillionaire
2500 Worldwide Beautiful
Thus, this script could be used to scrape the data from the Billboard website for any of the charts that you might be looking for.
You can find my other publications at: https://rpubs.com/Mayank7j_2020