Web Scraping with rvest

rvest

rvest provides a small set of simple functions for web scraping.

For a webpage to be scrapable with rvest, it has to be static, i.e., the content must already be in the HTML when the page is first loaded. If content is loaded dynamically by JavaScript, rvest cannot see it. There are more sophisticated solutions (such as Selenium) that mimic real user behavior and actually execute JavaScript, but we will not cover them here.
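
As a minimal sketch of this limitation, the hypothetical page below has one paragraph that is present in the markup and one that a browser would fill in with JavaScript. The markup and the ids `static` and `dynamic` are made up for illustration. `read_html()` parses the literal string, so no network access is needed.

```r
library(rvest)

# Hypothetical page: the first paragraph is in the markup, the second
# would only be filled in by a browser that runs the <script>.
page <- read_html('
<html><body>
  <p id="static">Already in the HTML</p>
  <p id="dynamic"></p>
  <script>
    document.getElementById("dynamic").textContent = "Added by JavaScript";
  </script>
</body></html>')

# The script is parsed as text but never executed, so rvest only sees
# the content that was in the markup to begin with.
static_text  <- page %>% html_node("#static")  %>% html_text()
dynamic_text <- page %>% html_node("#dynamic") %>% html_text()

static_text
dynamic_text
```

Here `static_text` holds the paragraph text, while `dynamic_text` is an empty string because the script never ran.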

Example

library(dplyr)
library(stringr)
library(rvest)

movie <- read_html("http://www.imdb.com/title/tt1490017/")

ChildText <- function(parent, selector) {
  # get trimmed text from the descendants of an HTML node that match the selector
  parent %>%
    html_nodes(selector) %>%
    html_text() %>%
    str_trim()
}

# isolate the list (in our case, a <table>)
# pay attention to the difference between `html_node` (first match only)
# and `html_nodes` (all matches)
cast_list <- movie %>% html_nodes(".cast_list")

# go through each data column needed
actor.names <- cast_list %>%
  ChildText('*[itemprop="actor"]')
actor.urls <- cast_list %>%
  html_nodes('*[itemprop="url"]') %>%
  html_attr("href") %>%
  str_replace("\\?.*$", "")
character.names <- cast_list %>%
  ChildText(".character") %>%
  str_replace_all("[ \\n]+", " ")
  
df <- tibble(
  actor = actor.names,
  actor.url = actor.urls,
  character = character.names
)
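
The difference between `html_node` and `html_nodes` matters when columns are collected separately, as in the example: if one row lacks a matching element, the vectors end up with different lengths and `tibble()` fails. A minimal sketch of a per-row alternative follows, using made-up markup (the class `actor` and all names are hypothetical, not taken from the real page).

```r
library(rvest)
library(tibble)

# Hypothetical markup standing in for a cast table; the second row
# has no character cell.
doc <- read_html('
<table class="cast_list">
  <tr><td class="actor">Actor A</td><td class="character">Character A</td></tr>
  <tr><td class="actor">Actor B</td></tr>
  <tr><td class="actor">Actor C</td><td class="character">Character C</td></tr>
</table>')

# html_nodes() returns every match, so the missing cell simply vanishes
flat <- doc %>% html_nodes(".character") %>% html_text()

# html_node() applied to a set of rows returns exactly one result per row,
# and html_text() turns a missing match into NA
rows <- doc %>% html_nodes(".cast_list tr")
per_row <- tibble(
  actor     = rows %>% html_node(".actor") %>% html_text(),
  character = rows %>% html_node(".character") %>% html_text()
)

length(flat)
per_row
```

With this approach the columns stay aligned with the rows, and gaps show up as `NA` instead of shifting later values upward.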

httr and xml2

rvest is built on top of httr and xml2.

httr and xml2 provide an easy and intuitive API for manipulating HTTP requests/responses and XML documents.

library(httr)
library(xml2)

r <- GET("http://www.imdb.com/title/tt1490017/")
status_code(r)

## [1] 200

headers(r)

## $date
## [1] "Thu, 02 Mar 2017 19:49:32 GMT"
## 
## $server
## [1] "Server"
## 
## $`x-frame-options`
## [1] "SAMEORIGIN"
## 
## $`content-security-policy`
## [1] "frame-ancestors 'self' imdb.com *.imdb.com *.media-imdb.com withoutabox.com *.withoutabox.com amazon.com *.amazon.com amazon.co.uk *.amazon.co.uk amazon.de *.amazon.de translate.google.com images.google.com www.google.com www.google.co.uk search.aol.com bing.com www.bing.com"
## 
## $`ad-unit`
## [1] "imdb.title_md.title.maindetails"
## 
## $`entity-id`
## [1] "tt1490017"
## 
## $`content-type`
## [1] "text/html;charset=UTF-8"
## 
## $`content-language`
## [1] "en-US"
## 
## $`content-encoding`
## [1] "gzip"
## 
## $vary
## [1] "Accept-Encoding,User-Agent"
## 
## $`set-cookie`
## [1] "uu=BCYhzGl7zjJvZgMommC0yOhrjGid1GdcsXU40PUsH6o-F-D-vNqxQ6WCFH6bdRd-11Y3_3O_hHvl%0D%0AZBn926Z0_prZZVFoskhdiQt-WJO_Qi5Kgo1fXzSdgbptO_aiiLhTYMJCmxV7vgkZ8597ak1Tybib%0D%0AnQYyKjDovU6LVytBGLWOu6AqsjbIGyjydK5S7i-VY7ahEj79FOlk77GDtWFVklafW2f7AYTJrCYf%0D%0AH8RVN52MmNczhJbjxvggueWJEM26YaI9s--_bDq4n9aZkm0KnHV2Xw%0D%0A; Domain=.imdb.com; Expires=Tue, 20-Mar-2085 23:03:39 GMT; Path=/"
## 
## $`set-cookie`
## [1] "session-id=744-9672599-8600451; Domain=.imdb.com; Expires=Tue, 20-Mar-2085 23:03:39 GMT; Path=/"
## 
## $`set-cookie`
## [1] "session-id-time=1646164171; Domain=.imdb.com; Expires=Tue, 20-Mar-2085 23:03:39 GMT; Path=/"
## 
## $p3p
## [1] "policyref=\"http://i.imdb.com/images/p3p.xml\",CP=\"CAO DSP LAW CUR ADM IVAo IVDo CONo OTPo OUR DELi PUBi OTRi BUS PHY ONL UNI PUR FIN COM NAV INT DEM CNT STA HEA PRE LOC GOV OTC \""
## 
## $`transfer-encoding`
## [1] "chunked"
## 
## attr(,"class")
## [1] "insensitive" "list"

content(r) %>%
  xml_find_all('//*[@id="titleCast"]//td[@class="itemprop"]/a') %>%
  xml_text()

##  [1] " Will Arnett\n"     " Elizabeth Banks\n" " Craig Berry\n"    
##  [4] " Alison Brie\n"     " David Burrows\n"   " Anthony Daniels\n"
##  [7] " Charlie Day\n"     " Amanda Farinos\n"  " Keith Ferguson\n" 
## [10] " Will Ferrell\n"    " Will Forte\n"      " Dave Franco\n"    
## [13] " Morgan Freeman\n"  " Todd Hansen\n"     " Jonah Hill\n"
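
The names above still carry leading spaces and trailing newlines. As a small sketch, assuming markup shaped like what the XPath expression targets (the fragment, names, and URLs below are hypothetical), `xml_text()` can strip that whitespace with `trim = TRUE`, and `xml_attr()` reads the link targets from the same nodes.

```r
library(xml2)

# Hypothetical fragment imitating the structure matched by the XPath above
doc <- read_html('
<div id="titleCast">
  <table>
    <tr><td class="itemprop"><a href="/name/a/"> Actor A
</a></td></tr>
    <tr><td class="itemprop"><a href="/name/b/"> Actor B
</a></td></tr>
  </table>
</div>')

links <- xml_find_all(doc, '//*[@id="titleCast"]//td[@class="itemprop"]/a')

raw_names  <- xml_text(links)               # keeps surrounding whitespace
cast_names <- xml_text(links, trim = TRUE)  # whitespace stripped
cast_urls  <- xml_attr(links, "href")       # value of the href attribute

cast_names
cast_urls
```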

Jianchao Yang

3/2/2017