IS607 - Project 4 - Web Scraping

I first tried a “classic” technique using the httr and XML packages: GET retrieves the HTML content, htmlParse parses it, and a simple XPath expression traverses the DOM tree to reach the DIV container that holds the targeted content.

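library(httr)
library(XML)
# Fetch the search results page (url holds the HTTP response, not a URL string)
url <- GET("http://www.r-bloggers.com/search/web%20scraping")
# Parse the response body into an HTML document tree
url_html <- htmlParse(content(url, as="text"), asText=TRUE)
# Text of every link inside the targeted post container
parsed_html <- xpathSApply(url_html, "//div[@id='post-85373']//a", xmlValue)
title <- parsed_html[1]
author <- parsed_html[2]
# The second nested DIV inside the post is taken as the date
date <- xpathSApply(url_html, "//div[@id='post-85373']//div", xmlValue)[2]
frame1 <- data.frame(title, author, date)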

This technique was more challenging to work with because it requires naming the specific DIV container (here, a single post ID) and indexing into the results to pick out the title, author, and date. An alternative would be to match every DIV whose id begins with a prefix such as ‘post’ and then, because of the hierarchy of the tree, use additional XPath expressions to reach the fields inside each match.
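The prefix-matching alternative can be sketched with XPath's starts-with() function. This is only an illustration: the markup below is a made-up stand-in for a results page (the post IDs, the 'date' class, and the names sample_html, posts, and frame_all are assumptions), so the relative paths would need adjusting to the real page structure.

```r
library(XML)

# Hypothetical markup standing in for a search results page
sample_html <- '
<html><body>
  <div id="post-101">
    <h2><a href="/first">First post</a></h2>
    <a href="/author/x">Author X</a>
    <div class="date">April 1, 2015</div>
  </div>
  <div id="post-102">
    <h2><a href="/second">Second post</a></h2>
    <a href="/author/y">Author Y</a>
    <div class="date">April 2, 2015</div>
  </div>
</body></html>'

doc <- htmlParse(sample_html, asText = TRUE)

# Match every DIV whose id starts with "post-" instead of naming one post
posts <- getNodeSet(doc, "//div[starts-with(@id, 'post-')]")

# Within each matched DIV, use relative XPath expressions for the fields
rows <- lapply(posts, function(node) {
  links <- xpathSApply(node, ".//a", xmlValue)
  data.frame(
    title  = links[1],
    author = links[2],
    date   = xpathSApply(node, ".//div[@class='date']", xmlValue),
    stringsAsFactors = FALSE
  )
})
frame_all <- do.call(rbind, rows)
```

Each matched container becomes one row, so the result no longer depends on knowing a particular post ID in advance.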

The next technique used Hadley Wickham's rvest package, specifically its CSS selector approach. It required less code, was more flexible, and offers additional options such as html_table, as well as XPath through html_nodes. The CSS selector approach depends on page layout and structure: if the content you are after sits within a single DIV identified by an id or class, which is how most modern sites are built, accessing it is relatively simple. For complex or non-standard layouts, you are better off using the XPath approach with rvest.

Sys.sleep(1)
library(rvest)

## 
## Attaching package: 'rvest'
## 
## The following object is masked from 'package:XML':
## 
##     xml

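# read_html() supersedes the html() function that rvest provided when this was written
vurl <- read_html("http://www.r-bloggers.com/search/web%20scraping")
# Select every date element inside the #leftcontent container and extract its text
v1 <- vurl %>%
    html_nodes("#leftcontent .date") %>%
    html_text()
frame2 <- data.frame(date = v1)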
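The other rvest options mentioned earlier, XPath through html_nodes and table extraction with html_table, can be sketched on a small inline page. The markup and the names sample_page, css_dates, xpath_dates, and pkg_table are made up for illustration; only the "#leftcontent .date" selector comes from the code above.

```r
library(rvest)

# Hypothetical page: a date list plus a small table
sample_page <- read_html('
<html><body>
  <div id="leftcontent">
    <div class="date">April 1, 2015</div>
    <div class="date">April 2, 2015</div>
  </div>
  <table>
    <tr><th>Package</th><th>Selector</th></tr>
    <tr><td>XML</td><td>XPath</td></tr>
    <tr><td>rvest</td><td>CSS or XPath</td></tr>
  </table>
</body></html>')

# CSS selector: elements with class "date" inside the element with id "leftcontent"
css_dates <- sample_page %>%
    html_nodes("#leftcontent .date") %>%
    html_text()

# The same selection written as an XPath expression
xpath_dates <- sample_page %>%
    html_nodes(xpath = "//div[@id='leftcontent']//div[@class='date']") %>%
    html_text()

# html_table() converts a <table> node into a data frame
pkg_table <- sample_page %>%
    html_node("table") %>%
    html_table()
```

On this page both selections return the same two dates, which makes it easy to compare the two syntaxes side by side before choosing one for a real layout.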

Extra Analysis
R-Bloggers does not explicitly provide an API for article search; it does, however, support search through a REST-style query string at http://www.r-bloggers.com/?s=searchterm, with results returned as HTML. An improvement would look something like the NY Times API, where queries are made directly against the content management system and results are returned in JSON and XML formats.
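As a rough sketch of how the search query could be driven from R, the snippet below builds the query URL with httr and wraps the request in a helper. The helper name fetch_search_dates is hypothetical, and the XPath inside it assumes the results page still uses the 'leftcontent' container and 'date' class seen earlier, which may not hold.

```r
library(httr)
library(XML)

# Build the search URL; httr handles encoding of the query term
search_url <- modify_url("http://www.r-bloggers.com/",
                         query = list(s = "web scraping"))

# Hypothetical helper: run a search and return the dates of the results.
# Assumes the page layout matches the selectors used earlier in this report.
fetch_search_dates <- function(term) {
  resp <- GET("http://www.r-bloggers.com/", query = list(s = term))
  stop_for_status(resp)  # fail early on HTTP errors
  doc <- htmlParse(content(resp, as = "text"), asText = TRUE)
  xpathSApply(
    doc,
    "//*[@id='leftcontent']//*[contains(@class, 'date')]",
    xmlValue
  )
}
```

Calling fetch_search_dates("web scraping") would issue the live request; because the response is HTML rather than JSON or XML, the parsing step is still tied to the page layout.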

Param Singh

April 28, 2015