rvest_tutorial

Practice based on the tutorial at http://www.r-bloggers.com/rvest-easy-web-scraping-with-r/

To install rvest, use the following command:

install.packages("rvest")

Let us load the rvest package:

library(rvest)

Our goal is to scrape the movie name from an IMDb page, as suggested in the tutorial.

Open the page http://www.imdb.com/title/tt0472043/?ref_=nv_sr_3 in Chrome. Enable SelectorGadget and click the movie name to select it. Clicking an element once selects it; clicking the same element again deselects it. Once the movie name is selected, copy the CSS selector that SelectorGadget reports, which here is ".itemprop". This selector is passed to html_node(), as follows:

# Download and parse the page (read_html() replaces the older html() function)
apocalypto_movie <- read_html("http://www.imdb.com/title/tt0472043/?ref_=nv_sr_3")

apocalypto_movie %>% 
  html_node(".itemprop") %>%
  html_text()

## [1] "Apocalypto"
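
The difference between html_node() and html_nodes() is easy to miss, so here is a minimal offline sketch of it. The HTML fragment below is a hypothetical stand-in for the real IMDb markup (the second ".itemprop" element is an assumption made purely for illustration), and the names page, first_match and all_matches are invented for this example.

```r
library(rvest)

# Hypothetical page fragment; NOT the actual IMDb markup
page <- read_html('<html><body>
  <h1><span class="itemprop">Apocalypto</span></h1>
  <p><span class="itemprop">Mel Gibson</span></p>
</body></html>')

# html_node() keeps only the first element matching the selector
first_match <- page %>%
  html_node(".itemprop") %>%
  html_text()

# html_nodes() keeps every element matching the selector
all_matches <- page %>%
  html_nodes(".itemprop") %>%
  html_text()

print(first_match)
print(all_matches)
```

If a selector copied from SelectorGadget matches more than one element, html_node() silently returns just the first, so it can be worth checking with html_nodes() how many matches there are.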

Let us scrape the table of contents from a Wikipedia article:

# Download and parse the page (read_html() replaces the older html() function)
maha <- read_html("http://en.wikipedia.org/wiki/Mahabharata")

x <- maha %>%
  html_nodes("#toc") %>%
  html_text()

cat(x)

## 
## 
## Contents
## 
## 1 Textual history and structure
## 1.1 Accretion and redaction
## 1.2 Historical references
## 1.3 The 18 parvas or books
## 
## 2 Historical context
## 3 Synopsis
## 3.1 The older generations
## 3.2 The Pandava and Kaurava princes
## 3.3 Lakshagraha (the house of lac)
## 3.4 Marriage to Draupadi
## 3.5 Indraprastha
## 3.6 The dice game
## 3.7 Exile and return
## 3.8 The battle at Kurukshetra
## 3.9 The end of the Pandavas
## 3.10 The reunion
## 
## 4 Themes
## 4.1 Just war
## 
## 5 Versions, translations, and derivative works
## 5.1 Critical Edition
## 5.2 Regional versions
## 5.3 Translations
## 5.4 Derivative literature
## 5.5 In film and television
## 
## 6 Jain version
## 7 Kuru family tree
## 8 Cultural influence
## 9 Notes
## 10 References
## 11 Sources
## 12 External links
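
The call above returns the whole table of contents as one string, which is why cat() is needed to print it readably. If a vector with one entry per heading is more convenient, a narrower selector may help. The sketch below runs on a small hypothetical fragment that imitates a Wikipedia table of contents; the ".toctext" class and the names toc_page and toc_titles are assumptions for illustration and may not match the live page.

```r
library(rvest)

# Hypothetical fragment imitating part of a Wikipedia table of contents
toc_page <- read_html('<div id="toc"><ul>
  <li><span class="tocnumber">1</span> <span class="toctext">Textual history and structure</span></li>
  <li><span class="tocnumber">2</span> <span class="toctext">Historical context</span></li>
  <li><span class="tocnumber">3</span> <span class="toctext">Synopsis</span></li>
</ul></div>')

# Select each heading title inside #toc, giving one string per entry
toc_titles <- toc_page %>%
  html_nodes("#toc .toctext") %>%
  html_text()

print(toc_titles)
```

Each element of toc_titles can then be filtered or counted like any other character vector.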

Sekhar Mekala

Wednesday, April 01, 2015