XPath

Web scraping process

  1. Asking what information we are interested in and identifying where the information is located in a specific document

  2. Tailoring a query to the document and obtaining the desired information

  3. (Re)casting the extracted values into a format that facilitates further analysis
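
The three steps can be sketched on a small inline document. This is only a minimal illustration, assuming the XML package is installed; the HTML snippet and the names `snippet`, `doc`, `authors`, `dates`, and `quotes` are made up for the example and are not part of the course material.

```r
library(XML)

# Hypothetical two-entry document standing in for a real web page
snippet <- '<html><body>
<div date="June/2003"><h1>Robert Gentleman</h1><p><i>Quote A</i></p></div>
<div date="October/2011"><h1>Rolf Turner</h1><p><i>Quote B</i></p></div>
</body></html>'
doc <- htmlParse(snippet, asText = TRUE)

# Step 1: the information of interest is the author (in <h1>) and the
# date (in the date attribute of each <div>)

# Step 2: tailor queries to those locations
authors <- xpathSApply(doc, "//div/h1", xmlValue)
dates   <- xpathSApply(doc, "//div", xmlGetAttr, "date")

# Step 3: recast the extracted values into a data frame
quotes <- data.frame(author = authors, date = dates, stringsAsFactors = FALSE)
print(quotes)
```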

XPath - a query language for web documents

XPath is a query language that is useful for addressing and extracting parts from HTML/XML documents.

url <- "http://www.r-datacollection.com/materials/ch-4-xpath/fortunes/fortunes.html"
fortune <- readLines(url)
library(XML)
parsed_fortune <- htmlParse(file=fortune)
print(parsed_fortune)
## <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">
## <html>
## <head><title>Collected R wisdoms</title></head>
## <body>
## <div id="R Inventor" lang="english" date="June/2003">
##   <h1>Robert Gentleman</h1>
##   <p><i>'What we have is nice, but we need something very different'</i></p>
##   <p><b>Source: </b>Statistical Computing 2003, Reisensburg</p>
## </div>
## 
## <div lang="english" date="October/2011">
##   <h1>Rolf Turner</h1>
##   <p><i>'R is wonderful, but it cannot work magic'</i> <br><emph>answering a request for automatic generation of 'data from a known mean and 95% CI'</emph></p>
##   <p><b>Source: </b><a href="https://stat.ethz.ch/mailman/listinfo/r-help">R-help</a></p>
## </div>
## 
## <address>
## <a href="http://www.r-datacollectionbook.com"><i>The book homepage</i></a><a></a>
## </address>
## 
## </body>
## </html>
## 
A tree perspective on parsed_fortune

Identifying node sets with XPath

?xpathSApply

Basic structure of an XPath query

  1. Hierarchical addressing mechanism
  2. Absolute paths
xpathSApply(doc = parsed_fortune, path = "/html/body/div/p/i")
## [[1]]
## <i>'What we have is nice, but we need something very different'</i> 
## 
## [[2]]
## <i>'R is wonderful, but it cannot work magic'</i>
  3. Relative paths
xpathSApply(parsed_fortune, "//body//p/i")
## [[1]]
## <i>'What we have is nice, but we need something very different'</i> 
## 
## [[2]]
## <i>'R is wonderful, but it cannot work magic'</i>
xpathSApply(parsed_fortune, "//p/i")
## [[1]]
## <i>'What we have is nice, but we need something very different'</i> 
## 
## [[2]]
## <i>'R is wonderful, but it cannot work magic'</i>
  4. Deciding between relative and absolute paths

Relative path statements force a traversal of the complete document tree, which is computationally expensive and slows the query down. If execution speed matters, it is therefore advisable to express node locations by absolute paths.
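
As a small check of the two addressing styles, the sketch below runs an absolute and a relative query against the same document and compares the results. It assumes the XML package is installed; the inline `snippet` is a made-up stand-in for a real page.

```r
library(XML)

# Hypothetical document with the same nesting as the fortunes page
snippet <- '<html><head><title>T</title></head><body>
<div><p><i>Quote A</i></p></div>
<div><p><i>Quote B</i></p></div>
</body></html>'
doc <- htmlParse(snippet, asText = TRUE)

# Absolute path: every step from the root node is spelled out
absolute <- xpathSApply(doc, "/html/body/div/p/i", xmlValue)

# Relative path: any <i> below any <p>, wherever it sits in the tree
relative <- xpathSApply(doc, "//p/i", xmlValue)

# Both queries address the same two nodes here
same_result <- identical(absolute, relative)
print(same_result)
```

The trade-off runs both ways: an absolute path stops matching as soon as the page gains or loses a level of nesting, while a relative path keeps working, so the faster query is also the more fragile one.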

  5. Wildcard operator
xpathSApply(parsed_fortune, "/html/body/div/*/i")
## [[1]]
## <i>'What we have is nice, but we need something very different'</i> 
## 
## [[2]]
## <i>'R is wonderful, but it cannot work magic'</i>
  6. Selection expressions
xpathSApply(parsed_fortune, "//title/..")
## [[1]]
## <head>
##   <title>Collected R wisdoms</title>
## </head>
  7. Multiple paths
xpathSApply(parsed_fortune, "//address | //title")
## [[1]]
## <title>Collected R wisdoms</title> 
## 
## [[2]]
## <address>
##   <a href="http://www.r-datacollectionbook.com">
##     <i>The book homepage</i>
##   </a>
##   <a/>
## </address>

Node relations

  • The family tree analogy

node1/relation::node2

xpathSApply(parsed_fortune, "//a/ancestor::div")
## [[1]]
## <div lang="english" date="October/2011">
##   <h1>Rolf Turner</h1>
##   <p><i>'R is wonderful, but it cannot work magic'</i> <br/><emph>answering a request for automatic generation of 'data from a known mean and 95% CI'</emph></p>
##   <p><b>Source: </b><a href="https://stat.ethz.ch/mailman/listinfo/r-help">R-help</a></p>
## </div>
xpathSApply(parsed_fortune, "//a/ancestor::div//i")
## [[1]]
## <i>'R is wonderful, but it cannot work magic'</i>
xpathSApply(parsed_fortune, "//p/preceding-sibling::h1")
## [[1]]
## <h1>Robert Gentleman</h1> 
## 
## [[2]]
## <h1>Rolf Turner</h1>
xpathSApply(parsed_fortune, "//title/parent::*")
## [[1]]
## <head>
##   <title>Collected R wisdoms</title>
## </head>
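
The general form node1/relation::node2 works with further axes as well. The following sketch tries child, following-sibling, ancestor, and descendant on a made-up inline document; it assumes the XML package is installed, and all object names are illustrative.

```r
library(XML)

# Hypothetical document; only the second <div> contains a link
snippet <- '<html><body>
<div><h1>Robert Gentleman</h1><p><i>Quote A</i></p><p><b>Source: </b>Conference</p></div>
<div><h1>Rolf Turner</h1><p><i>Quote B</i></p><p><b>Source: </b><a href="#">List</a></p></div>
</body></html>'
doc <- htmlParse(snippet, asText = TRUE)

# child: <h1> nodes that are direct children of a <div>
children <- xpathSApply(doc, "//div/child::h1", xmlValue)

# following-sibling: the first <p> that follows each <h1> on the same level
next_p <- xpathSApply(doc, "//h1/following-sibling::p[1]", xmlValue)

# ancestor: the heading of the <div> that contains a link
linked <- xpathSApply(doc, "//a/ancestor::div/h1", xmlValue)

# descendant: every <i> at any depth below a <div>
italics <- xpathSApply(doc, "//div/descendant::i", xmlValue)
```
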
Visualizing node relations

XPath predicates

Predicates are conditions applied to a node's name, value, or attributes that evaluate to either true or false.

A predicate is written in square brackets after a node (or node set), as in node1[predicate]. The query then keeps only those nodes for which the predicate is true.

  1. Numerical predicates
xpathSApply(parsed_fortune, "//div/p[position()=1]")
## [[1]]
## <p>
##   <i>'What we have is nice, but we need something very different'</i>
## </p> 
## 
## [[2]]
## <p><i>'R is wonderful, but it cannot work magic'</i> <br/><emph>answering a request for automatic generation of 'data from a known mean and 95% CI'</emph></p>
xpathSApply(parsed_fortune, "//div/p[1]")
## [[1]]
## <p>
##   <i>'What we have is nice, but we need something very different'</i>
## </p> 
## 
## [[2]]
## <p><i>'R is wonderful, but it cannot work magic'</i> <br/><emph>answering a request for automatic generation of 'data from a known mean and 95% CI'</emph></p>
xpathSApply(parsed_fortune, "//div/p[last()]")
## [[1]]
## <p><b>Source: </b>Statistical Computing 2003, Reisensburg</p> 
## 
## [[2]]
## <p>
##   <b>Source: </b>
##   <a href="https://stat.ethz.ch/mailman/listinfo/r-help">R-help</a>
## </p>
xpathSApply(parsed_fortune, "//div/p[last()-1]")
## [[1]]
## <p>
##   <i>'What we have is nice, but we need something very different'</i>
## </p> 
## 
## [[2]]
## <p><i>'R is wonderful, but it cannot work magic'</i> <br/><emph>answering a request for automatic generation of 'data from a known mean and 95% CI'</emph></p>
xpathSApply(parsed_fortune, "//div[count(.//a)>0]")
## [[1]]
## <div lang="english" date="October/2011">
##   <h1>Rolf Turner</h1>
##   <p><i>'R is wonderful, but it cannot work magic'</i> <br/><emph>answering a request for automatic generation of 'data from a known mean and 95% CI'</emph></p>
##   <p><b>Source: </b><a href="https://stat.ethz.ch/mailman/listinfo/r-help">R-help</a></p>
## </div>
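
A numerical predicate is evaluated relative to the step it is attached to, which is why //div/p[1] returns one node per div. The sketch below contrasts this with a parenthesized path, which numbers the nodes across the whole result. It assumes the XML package is installed and uses a made-up inline document.

```r
library(XML)

# Hypothetical document: two <div> nodes with two <p> children each
snippet <- '<html><body>
<div><p>Quote A</p><p>Source A</p></div>
<div><p>Quote B</p><p>Source B</p></div>
</body></html>'
doc <- htmlParse(snippet, asText = TRUE)

# First <p> within each <div>: one match per parent
per_parent <- xpathSApply(doc, "//div/p[1]", xmlValue)

# First <p> of the combined node set: a single match
overall <- xpathSApply(doc, "(//div/p)[1]", xmlValue)
```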

The @ symbol addresses the attributes of a selected node. The expression ./@* returns all attributes of the currently selected nodes, regardless of their names.

xpathSApply(parsed_fortune, "//div[count(./@*)>2]")
## [[1]]
## <div id="R Inventor" lang="english" date="June/2003">
##   <h1>Robert Gentleman</h1>
##   <p><i>'What we have is nice, but we need something very different'</i></p>
##   <p><b>Source: </b>Statistical Computing 2003, Reisensburg</p>
## </div>
xpathSApply(parsed_fortune, "//div[not(count(./@*)>2)]")
## [[1]]
## <div lang="english" date="October/2011">
##   <h1>Rolf Turner</h1>
##   <p><i>'R is wonderful, but it cannot work magic'</i> <br/><emph>answering a request for automatic generation of 'data from a known mean and 95% CI'</emph></p>
##   <p><b>Source: </b><a href="https://stat.ethz.ch/mailman/listinfo/r-help">R-help</a></p>
## </div>
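
Attributes can also be the target of a query or serve as an existence test. The sketch below is a minimal illustration on a made-up inline document, assuming the XML package is installed; the object names are hypothetical.

```r
library(XML)

# Hypothetical document: only the first <div> carries an id attribute
snippet <- '<html><body>
<div id="R Inventor" lang="english" date="June/2003"><h1>Robert Gentleman</h1></div>
<div lang="english" date="October/2011"><h1>Rolf Turner</h1></div>
</body></html>'
doc <- htmlParse(snippet, asText = TRUE)

# Select the date attribute itself; as.character() drops the names
dates <- as.character(xpathSApply(doc, "//div/@date"))

# Count the attributes per <div> on the R side
n_attrs <- xpathSApply(doc, "//div", function(node) length(xmlAttrs(node)))

# [@id] is true whenever the attribute exists, whatever its value
with_id <- xpathSApply(doc, "//div[@id]/h1", xmlValue)
```
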
  2. Textual predicates

Textual properties of the document are useful predicates for node selection.

xpathSApply(parsed_fortune, "//div[@date='October/2011']")
## [[1]]
## <div lang="english" date="October/2011">
##   <h1>Rolf Turner</h1>
##   <p><i>'R is wonderful, but it cannot work magic'</i> <br/><emph>answering a request for automatic generation of 'data from a known mean and 95% CI'</emph></p>
##   <p><b>Source: </b><a href="https://stat.ethz.ch/mailman/listinfo/r-help">R-help</a></p>
## </div>

string_method(text1, 'text2')

xpathSApply(parsed_fortune, "//*[contains(text(), 'magic')]")
## [[1]]
## <i>'R is wonderful, but it cannot work magic'</i>
xpathSApply(parsed_fortune, "//div[starts-with(./@id, 'R')]")
## [[1]]
## <div id="R Inventor" lang="english" date="June/2003">
##   <h1>Robert Gentleman</h1>
##   <p><i>'What we have is nice, but we need something very different'</i></p>
##   <p><b>Source: </b>Statistical Computing 2003, Reisensburg</p>
## </div>
xpathSApply(parsed_fortune, "//div[substring-after(./@date, '/')='2003']//i")
## [[1]]
## <i>'What we have is nice, but we need something very different'</i>
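
One detail of contains(text(), ...) is worth checking by hand: in XPath 1.0 a string function that receives a node set only looks at its first node, so only the first text node of each element is tested. Using . instead tests the full string value of the element, including its descendants. The sketch below shows the difference on a made-up inline document, assuming the XML package is installed.

```r
library(XML)

# Hypothetical document: the second <p> starts with plain text
snippet <- '<html><body>
<div id="R Inventor" date="June/2003"><p><i>Quote A</i></p></div>
<div date="October/2011"><p>Intro <i>no magic here</i></p></div>
</body></html>'
doc <- htmlParse(snippet, asText = TRUE)

# Tests the first text node child of each element: only <i> qualifies
by_first_text <- xpathSApply(doc, "//*[contains(text(), 'magic')]", xmlName)

# Tests the whole string value: every ancestor of <i> qualifies as well
by_string_value <- xpathSApply(doc, "//*[contains(., 'magic')]", xmlName)

# The same function applied to an attribute value
by_attribute <- xpathSApply(doc, "//div[contains(@date, '2011')]",
                            xmlGetAttr, "date")
```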

Extracting node elements

XML extractor functions

xpathSApply(parsed_fortune, "//title", fun = xmlValue)
## [1] "Collected R wisdoms"
xpathSApply(parsed_fortune, "//div", xmlAttrs)
## [[1]]
##           id         lang         date 
## "R Inventor"    "english"  "June/2003" 
## 
## [[2]]
##           lang           date 
##      "english" "October/2011"
xpathSApply(parsed_fortune, "//div", xmlGetAttr, "lang")
## [1] "english" "english"
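
Besides xmlValue, xmlAttrs, and xmlGetAttr, the XML package also provides xmlName, which returns the name of a node. A minimal sketch on a made-up inline document, assuming the XML package is installed:

```r
library(XML)

# Hypothetical document with a single entry
snippet <- '<html><body>
<div lang="english"><h1>Rolf Turner</h1><p>Quote B</p></div>
</body></html>'
doc <- htmlParse(snippet, asText = TRUE)

# Name and value of every child element of the <div>
tags  <- xpathSApply(doc, "//div/*", xmlName)
texts <- xpathSApply(doc, "//div/*", xmlValue)

# Pair them up for inspection
children <- data.frame(tag = tags, text = texts, stringsAsFactors = FALSE)
print(children)
```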

Extending the fun argument

lowerCaseFun <- function(x) {
  x <- tolower(xmlValue(x))
  x
}

xpathSApply(parsed_fortune, "//div//i", fun = lowerCaseFun)
## [1] "'what we have is nice, but we need something very different'"
## [2] "'r is wonderful, but it cannot work magic'"
dateFun <- function(x) {
  require(stringr)
  date <- xmlGetAttr(node = x, name = "date")
  year <- str_extract(date, "[0-9]{4}")
  year
}

xpathSApply(parsed_fortune, "//div", dateFun)
## Loading required package: stringr
## [1] "2003" "2011"
idFun <- function(x) {
  # xmlGetAttr() returns NULL when the attribute is absent
  id <- xmlGetAttr(x, "id")
  if (is.null(id)) "not specified" else id
}

xpathSApply(parsed_fortune, "//div", idFun)
## [1] "R Inventor"    "not specified"
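
Two variations on the custom functions above may be convenient. xmlGetAttr accepts a default argument that is returned when the attribute is missing, and the fun argument also takes an anonymous function. The sketch assumes the XML package is installed and uses a made-up inline document; the object names are illustrative.

```r
library(XML)

# Hypothetical document: only the first <div> carries an id attribute
snippet <- '<html><body>
<div id="R Inventor" date="June/2003"><h1>Robert Gentleman</h1></div>
<div date="October/2011"><h1>Rolf Turner</h1></div>
</body></html>'
doc <- htmlParse(snippet, asText = TRUE)

# Extra arguments after fun are passed on to xmlGetAttr()
ids <- xpathSApply(doc, "//div", xmlGetAttr, "id", default = "not specified")

# Anonymous function: strip everything up to the slash, convert to integer
years <- xpathSApply(doc, "//div", function(node) {
  as.integer(sub(".*/", "", xmlGetAttr(node, "date")))
})
```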

XPath helper tool

Practice with Naver News