Parsing

Loading and representing the contents of HTML/XML files in an R session

  1. Inspecting content on the Web: a browser displays HTML content nicely

  2. Importing HTML files into R and extracting information from them: a parser in R constructs useful representations of HTML documents

What is parsing?

Reading vs. Parsing

Reading makes no attempt to understand the formal grammar that underlies HTML; it merely recognizes the sequence of symbols in the HTML file. Reading therefore amounts to loading the raw content of an HTML file into an R session.

library(httr)
url <- "http://www.r-datacollection.com/materials/ch-4-xpath/fortunes/fortunes.html"
fortune <- httr::GET(url)
fortune
## Response [http://www.r-datacollection.com/materials/ch-4-xpath/fortunes/fortunes.html]
##   Date: 2020-10-05 15:50
##   Status: 200
##   Content-Type: text/html; charset=UTF-8
##   Size: 776 B
## <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">
## <html> <head>
## <title>Collected R wisdoms</title>
## </head>
## 
## <body>
## <div id="R Inventor" lang="english" date="June/2003">
##   <h1>Robert Gentleman</h1>
##   <p><i>'What we have is nice, but we need something very different'</i></p>
##   <p><b>Source: </b>Statistical Computing 2003, Reisensburg</p>
## ...
class(fortune)
## [1] "response"

GET() is agnostic about the different tag elements (name, attribute, values, etc.) and produces results that do not reflect the document’s internal hierarchy as implied by the nested tags in any sensible way.

To achieve a useful representation of HTML files, we need to employ a program that understands the special meaning of the markup structures and reconstructs the implied hierarchy of an HTML file within some R-specific data structure.
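The difference between reading and parsing can be sketched on a small inline string. This is a minimal illustration, assuming the XML package is installed; the HTML snippet and the object names html_text, parsed, and root are made up for this example and only loosely mimic the fortunes page.

```r
library(XML)

# Hypothetical stand-in for the fortunes page (not the real file)
html_text <- "<html><head><title>Collected R wisdoms</title></head><body><div><h1>Robert Gentleman</h1><p><i>What we have is nice</i></p></div></body></html>"

# Reading: the content is an ordinary character string; tags are just symbols
class(html_text)
nchar(html_text)

# Parsing: the same content becomes a tree that knows about nodes and nesting
parsed <- htmlParse(html_text, asText = TRUE)
root <- xmlRoot(parsed)

xmlName(root)             # name of the top-level node
names(xmlChildren(root))  # names of the nodes nested directly below it
```

Only the parsed object can answer structural questions such as "which nodes are nested inside the root node"; the character string cannot.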

Transformation from any HTML file to a queryable Document Object Model (DOM): parsing with the XML package takes two steps

  1. htmlParse() first parses the entire target document and creates the DOM as a tree-like data structure in C.
  2. The C-level node structure is then converted into an R object through handler functions.
library(XML)
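# Parse the response into a DOM, letting the parser detect the encoding
parsed_fortune <- htmlParse(fortune)
# Same call with the encoding stated explicitly; this overwrites the object above
parsed_fortune <- htmlParse(fortune, encoding = "UTF-8")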
class(parsed_fortune)
## [1] "HTMLInternalDocument" "HTMLInternalDocument" "XMLInternalDocument" 
## [4] "XMLAbstractDocument"
parsed_fortune
## <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">
## <html>
## <head><title>Collected R wisdoms</title></head>
## <body>
## <div id="R Inventor" lang="english" date="June/2003">
##   <h1>Robert Gentleman</h1>
##   <p><i>'What we have is nice, but we need something very different'</i></p>
##   <p><b>Source: </b>Statistical Computing 2003, Reisensburg</p>
## </div>
## 
## <div lang="english" date="October/2011">
##   <h1>Rolf Turner</h1>
##   <p><i>'R is wonderful, but it cannot work magic'</i> <br><emph>answering a request for automatic generation of 'data from a known mean and 95% CI'</emph></p>
##   <p><b>Source: </b><a href="https://stat.ethz.ch/mailman/listinfo/r-help">R-help</a></p>
## </div>
## 
## <address>
## <a href="http://www.r-datacollectionbook.com"><i>The book homepage</i></a><a></a>
## </address>
## 
## </body>
## </html>
## 

XPath

Web scraping process

  1. Asking what information we are interested in and identifying where the information is located in a specific document

  2. Tailoring a query to the document and obtaining the desired information

  3. (Re)casting the extracted values into a format that facilitates further analysis
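The three steps above might look roughly as follows in R. This is only a sketch under stated assumptions: the inline page is a hypothetical, shortened stand-in for the fortunes document, and the object names (page, doc, fortunes_df) are chosen for illustration. The functions come from the XML package.

```r
library(XML)

# Step 1: decide what we want (author, quote, date) and where it lives.
# Here: <h1> and <i> inside each <div>, plus the div's date attribute.
page <- '<html><body>
<div id="R Inventor" lang="english" date="June/2003">
  <h1>Robert Gentleman</h1>
  <p><i>What we have is nice, but we need something very different</i></p>
</div>
<div lang="english" date="October/2011">
  <h1>Rolf Turner</h1>
  <p><i>R is wonderful, but it cannot work magic</i></p>
</div>
</body></html>'
doc <- htmlParse(page, asText = TRUE)

# Step 2: tailor queries to the document and extract the values
authors <- xpathSApply(doc, "//div/h1", xmlValue)
quotes  <- xpathSApply(doc, "//div/p/i", xmlValue)
dates   <- xpathSApply(doc, "//div", xmlGetAttr, "date")

# Step 3: recast the extracted values into a format suited for analysis
fortunes_df <- data.frame(author = authors, quote = quotes, date = dates,
                          stringsAsFactors = FALSE)
fortunes_df
```

A data frame is just one possible target format; a list or a character vector may be enough for simpler tasks.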

XPath - a query language for web documents

XPath is a query language that is useful for addressing and extracting parts from HTML/XML documents.

library(XML)
parsed_fortune <- htmlParse(file=fortune)
print(parsed_fortune)
## <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">
## <html>
## <head><title>Collected R wisdoms</title></head>
## <body>
## <div id="R Inventor" lang="english" date="June/2003">
##   <h1>Robert Gentleman</h1>
##   <p><i>'What we have is nice, but we need something very different'</i></p>
##   <p><b>Source: </b>Statistical Computing 2003, Reisensburg</p>
## </div>
## 
## <div lang="english" date="October/2011">
##   <h1>Rolf Turner</h1>
##   <p><i>'R is wonderful, but it cannot work magic'</i> <br><emph>answering a request for automatic generation of 'data from a known mean and 95% CI'</emph></p>
##   <p><b>Source: </b><a href="https://stat.ethz.ch/mailman/listinfo/r-help">R-help</a></p>
## </div>
## 
## <address>
## <a href="http://www.r-datacollectionbook.com"><i>The book homepage</i></a><a></a>
## </address>
## 
## </body>
## </html>
## 

[Figure: A tree perspective on parsed_fortune]

Identifying node sets with XPath

Basic structure of an XPath query

  1. Hierarchical addressing mechanism
  2. Absolute paths
xpathSApply(doc = parsed_fortune, path = "/html/body/div/p/i")
## [[1]]
## <i>'What we have is nice, but we need something very different'</i> 
## 
## [[2]]
## <i>'R is wonderful, but it cannot work magic'</i>
  3. Relative paths
xpathSApply(parsed_fortune, "//body//p/i")
## [[1]]
## <i>'What we have is nice, but we need something very different'</i> 
## 
## [[2]]
## <i>'R is wonderful, but it cannot work magic'</i>
xpathSApply(parsed_fortune, "//p/i")
## [[1]]
## <i>'What we have is nice, but we need something very different'</i> 
## 
## [[2]]
## <i>'R is wonderful, but it cannot work magic'</i>
  4. Deciding between relative and absolute paths

Relative path statements force a traversal of the complete document tree, which is computationally expensive and slows the query down. If execution speed matters, it is therefore advisable to express node locations with absolute paths.
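A rough way to check this on your own machine is sketched below. The synthetic page, its size n, and the number of repetitions are arbitrary assumptions made for this illustration; actual timings depend on the document and the hardware, and the difference may be small.

```r
library(XML)

# Build a synthetic page with many <div> blocks (hypothetical test data)
n <- 2000
blocks <- sprintf("<div><h1>Author %d</h1><p><i>Quote %d</i></p></div>",
                  seq_len(n), seq_len(n))
page <- paste0("<html><body>", paste(blocks, collapse = ""), "</body></html>")
doc <- htmlParse(page, asText = TRUE)

# Both queries address the same <i> nodes in this document
abs_res <- xpathSApply(doc, "/html/body/div/p/i", xmlValue)
rel_res <- xpathSApply(doc, "//p/i", xmlValue)
identical(abs_res, rel_res)

# Time repeated runs of each query
system.time(for (k in 1:20) xpathSApply(doc, "/html/body/div/p/i", xmlValue))
system.time(for (k in 1:20) xpathSApply(doc, "//p/i", xmlValue))
```

The absolute path returns the same nodes only as long as the document layout stays exactly as assumed, so the speed gain comes at the cost of a more brittle query.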

  5. Wildcard operator
xpathSApply(parsed_fortune, "/html/body/div/*/i")
## [[1]]
## <i>'What we have is nice, but we need something very different'</i> 
## 
## [[2]]
## <i>'R is wonderful, but it cannot work magic'</i>
  6. Selection expressions (. selects the current node, .. selects its parent)
xpathSApply(parsed_fortune, "//title/..")
## [[1]]
## <head>
##   <title>Collected R wisdoms</title>
## </head>
  7. Multiple paths
xpathSApply(parsed_fortune, "//address | //title")
## [[1]]
## <title>Collected R wisdoms</title> 
## 
## [[2]]
## <address>
##   <a href="http://www.r-datacollectionbook.com">
##     <i>The book homepage</i>
##   </a>
##   <a/>
## </address>

Node relations

  • The family tree analogy

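General syntax: node1/relation::node2, which selects every node2 that stands in the stated relation to node1 (for example, //a/ancestor::div selects the div nodes that are ancestors of an a node).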

xpathSApply(parsed_fortune, "//a/ancestor::div")
## [[1]]
## <div lang="english" date="October/2011">
##   <h1>Rolf Turner</h1>
##   <p><i>'R is wonderful, but it cannot work magic'</i> <br/><emph>answering a request for automatic generation of 'data from a known mean and 95% CI'</emph></p>
##   <p><b>Source: </b><a href="https://stat.ethz.ch/mailman/listinfo/r-help">R-help</a></p>
## </div>
xpathSApply(parsed_fortune, "//a/ancestor::div//i")
## [[1]]
## <i>'R is wonderful, but it cannot work magic'</i>
xpathSApply(parsed_fortune, "//p/preceding-sibling::h1")
## [[1]]
## <h1>Robert Gentleman</h1> 
## 
## [[2]]
## <h1>Rolf Turner</h1>
xpathSApply(parsed_fortune, "//title/parent::*")
## [[1]]
## <head>
##   <title>Collected R wisdoms</title>
## </head>
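
The same node1/relation::node2 pattern works with the other standard XPath relations (axes), such as child, descendant, and following-sibling. The sketch below uses a hypothetical, shortened copy of the fortunes page so that it runs without a network connection; the object names are illustrative.

```r
library(XML)

# Hypothetical stand-in for the fortunes page
page <- '<html><head><title>Collected R wisdoms</title></head><body>
<div id="R Inventor" lang="english" date="June/2003">
  <h1>Robert Gentleman</h1>
  <p><i>What we have is nice, but we need something very different</i></p>
  <p><b>Source: </b>Statistical Computing 2003, Reisensburg</p>
</div>
<div lang="english" date="October/2011">
  <h1>Rolf Turner</h1>
  <p><i>R is wonderful, but it cannot work magic</i></p>
  <p><b>Source: </b><a href="https://stat.ethz.ch/mailman/listinfo/r-help">R-help</a></p>
</div>
</body></html>'
doc <- htmlParse(page, asText = TRUE)

# following-sibling: <p> nodes that come after an <h1> under the same parent
sibling_p <- xpathSApply(doc, "//h1/following-sibling::p")
length(sibling_p)

# descendant: <i> nodes anywhere below a <div>
quotes <- xpathSApply(doc, "//div/descendant::i", xmlValue)
quotes

# Relations can be chained: from <a> up to its <div>, then down to the <h1> child
linked_author <- xpathSApply(doc, "//a/ancestor::div/child::h1", xmlValue)
linked_author
```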

[Figure: Visualizing node relations]

XPath predicates

Predicates are simple functions that are applied to a node’s name, value, or attributes and that evaluate to true or false.

A predicate is written in square brackets after a node (or node set): node1[predicate]. The query then selects only those nodes for which the condition in the predicate holds.

  1. Numerical predicates
xpathSApply(parsed_fortune, "//div/p[position()=1]")
## [[1]]
## <p>
##   <i>'What we have is nice, but we need something very different'</i>
## </p> 
## 
## [[2]]
## <p><i>'R is wonderful, but it cannot work magic'</i> <br/><emph>answering a request for automatic generation of 'data from a known mean and 95% CI'</emph></p>
xpathSApply(parsed_fortune, "//div/p[1]")
## [[1]]
## <p>
##   <i>'What we have is nice, but we need something very different'</i>
## </p> 
## 
## [[2]]
## <p><i>'R is wonderful, but it cannot work magic'</i> <br/><emph>answering a request for automatic generation of 'data from a known mean and 95% CI'</emph></p>
xpathSApply(parsed_fortune, "//div/p[last()]")
## [[1]]
## <p><b>Source: </b>Statistical Computing 2003, Reisensburg</p> 
## 
## [[2]]
## <p>
##   <b>Source: </b>
##   <a href="https://stat.ethz.ch/mailman/listinfo/r-help">R-help</a>
## </p>
xpathSApply(parsed_fortune, "//div/p[last()-1]")
## [[1]]
## <p>
##   <i>'What we have is nice, but we need something very different'</i>
## </p> 
## 
## [[2]]
## <p><i>'R is wonderful, but it cannot work magic'</i> <br/><emph>answering a request for automatic generation of 'data from a known mean and 95% CI'</emph></p>
xpathSApply(parsed_fortune, "//div[count(.//a)>0]")
## [[1]]
## <div lang="english" date="October/2011">
##   <h1>Rolf Turner</h1>
##   <p><i>'R is wonderful, but it cannot work magic'</i> <br/><emph>answering a request for automatic generation of 'data from a known mean and 95% CI'</emph></p>
##   <p><b>Source: </b><a href="https://stat.ethz.ch/mailman/listinfo/r-help">R-help</a></p>
## </div>

The @ operator selects attributes of a node. The expression ./@* returns all attributes of the currently selected nodes, regardless of their names.

xpathSApply(parsed_fortune, "//div[count(./@*)>2]")
## [[1]]
## <div id="R Inventor" lang="english" date="June/2003">
##   <h1>Robert Gentleman</h1>
##   <p><i>'What we have is nice, but we need something very different'</i></p>
##   <p><b>Source: </b>Statistical Computing 2003, Reisensburg</p>
## </div>
xpathSApply(parsed_fortune, "//div[not(count(./@*)>2)]")
## [[1]]
## <div lang="english" date="October/2011">
##   <h1>Rolf Turner</h1>
##   <p><i>'R is wonderful, but it cannot work magic'</i> <br/><emph>answering a request for automatic generation of 'data from a known mean and 95% CI'</emph></p>
##   <p><b>Source: </b><a href="https://stat.ethz.ch/mailman/listinfo/r-help">R-help</a></p>
## </div>
  2. Textual predicates
xpathSApply(parsed_fortune, "//div[@date='October/2011']")
## [[1]]
## <div lang="english" date="October/2011">
##   <h1>Rolf Turner</h1>
##   <p><i>'R is wonderful, but it cannot work magic'</i> <br/><emph>answering a request for automatic generation of 'data from a known mean and 95% CI'</emph></p>
##   <p><b>Source: </b><a href="https://stat.ethz.ch/mailman/listinfo/r-help">R-help</a></p>
## </div>
xpathSApply(parsed_fortune, "//*[contains(text(), 'magic')]")
## [[1]]
## <i>'R is wonderful, but it cannot work magic'</i>
xpathSApply(parsed_fortune, "//div[starts-with(./@id, 'R')]")
## [[1]]
## <div id="R Inventor" lang="english" date="June/2003">
##   <h1>Robert Gentleman</h1>
##   <p><i>'What we have is nice, but we need something very different'</i></p>
##   <p><b>Source: </b>Statistical Computing 2003, Reisensburg</p>
## </div>
xpathSApply(parsed_fortune, "//div[substring-after(./@date, '/')='2003']//i")
## [[1]]
## <i>'What we have is nice, but we need something very different'</i>

Extracting node elements

xpathSApply(parsed_fortune, "//title", fun = xmlValue)
## [1] "Collected R wisdoms"
xpathSApply(parsed_fortune, "//div", xmlAttrs)
## [[1]]
##           id         lang         date 
## "R Inventor"    "english"  "June/2003" 
## 
## [[2]]
##           lang           date 
##      "english" "October/2011"
xpathSApply(parsed_fortune, "//div", xmlGetAttr, "lang")
## [1] "english" "english"

Extending the fun argument

lowerCaseFun <- function(x) {
  x <- tolower(xmlValue(x))
  x
}

xpathSApply(parsed_fortune, "//div//i", fun = lowerCaseFun)
## [1] "'what we have is nice, but we need something very different'"
## [2] "'r is wonderful, but it cannot work magic'"
dateFun <- function(x) {
  # Read the date attribute and pull out the four-digit year
  date <- xmlGetAttr(node = x, name = "date")
  stringr::str_extract(date, "[0-9]{4}")
}

xpathSApply(parsed_fortune, "//div", dateFun)
## [1] "2003" "2011"
idFun <- function(x) {
  # xmlGetAttr() returns NULL when the attribute is missing
  id <- xmlGetAttr(x, "id")
  if (is.null(id)) "not specified" else id
}

xpathSApply(parsed_fortune, "//div", idFun)
## [1] "R Inventor"    "not specified"
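
For the specific case of a missing attribute, a custom function may not be needed: xmlGetAttr() has a default argument, and xpathSApply() passes extra arguments on to the function it applies. The sketch below assumes a hypothetical two-div page that mirrors the id attributes of the fortunes document.

```r
library(XML)

# Hypothetical stand-in: the first <div> has an id attribute, the second does not
page <- '<html><body>
<div id="R Inventor" date="June/2003"><h1>Robert Gentleman</h1></div>
<div date="October/2011"><h1>Rolf Turner</h1></div>
</body></html>'
doc <- htmlParse(page, asText = TRUE)

# default is returned whenever the requested attribute is absent
ids <- xpathSApply(doc, "//div", xmlGetAttr, "id", default = "not specified")
ids
```

A custom function such as idFun remains the more flexible choice when the fallback needs logic beyond a fixed value.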

XPath helper tool