W7-1: RWS Ch.2 XML Path Language

Parsing

library(httr)
url <- "http://www.r-datacollection.com/materials/ch-4-xpath/fortunes/fortunes.html"
fortune <- httr::GET(url)
fortune

## Response [http://www.r-datacollection.com/materials/ch-4-xpath/fortunes/fortunes.html]
##   Date: 2021-10-13 06:56
##   Status: 200
##   Content-Type: text/html; charset=UTF-8
##   Size: 776 B
## <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">
## <html> <head>
## <title>Collected R wisdoms</title>
## </head>
## 
## <body>
## <div id="R Inventor" lang="english" date="June/2003">
##   <h1>Robert Gentleman</h1>
##   <p><i>'What we have is nice, but we need something very different'</i></p>
##   <p><b>Source: </b>Statistical Computing 2003, Reisensburg</p>
## ...

class(fortune)

## [1] "response"

library(XML)
parsed_fortune <- htmlParse(fortune)
class(parsed_fortune)

## [1] "HTMLInternalDocument" "HTMLInternalDocument" "XMLInternalDocument" 
## [4] "XMLAbstractDocument"

parsed_fortune

## <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">
## <html>
## <head><title>Collected R wisdoms</title></head>
## <body>
## <div id="R Inventor" lang="english" date="June/2003">
##   <h1>Robert Gentleman</h1>
##   <p><i>'What we have is nice, but we need something very different'</i></p>
##   <p><b>Source: </b>Statistical Computing 2003, Reisensburg</p>
## </div>
## 
## <div lang="english" date="October/2011">
##   <h1>Rolf Turner</h1>
##   <p><i>'R is wonderful, but it cannot work magic'</i> <br><emph>answering a request for automatic generation of 'data from a known mean and 95% CI'</emph></p>
##   <p><b>Source: </b><a href="https://stat.ethz.ch/mailman/listinfo/r-help">R-help</a></p>
## </div>
## 
## <address>
## <a href="http://www.r-datacollectionbook.com"><i>The book homepage</i></a><a></a>
## </address>
## 
## </body>
## </html>
##

XPath - a query language for web documents

XPath is a query language for addressing and extracting parts of HTML/XML documents.

XPath predicates

Predicates are simple functions applied to a node's name, value, or attributes; each one evaluates to true or false.

A predicate is written in square brackets after a node (or node set): node1[predicate]. The expression then selects every node in that set that satisfies the predicate's condition.
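The node1[predicate] pattern can be tried out on a small document. The sketch below is illustrative only: it parses an abbreviated inline copy of the fortunes page (the `html` string and the `doc` object are assumptions made so the code runs without network access) and applies three kinds of predicate.

```r
library(XML)

# Abbreviated inline copy of the fortunes page, an assumption made so the
# sketch runs without network access.
html <- paste0(
  '<html><head><title>Collected R wisdoms</title></head><body>',
  '<div id="R Inventor" lang="english" date="June/2003">',
  '<h1>Robert Gentleman</h1>',
  '<p><i>What we have is nice, but we need something very different</i></p>',
  '</div>',
  '<div lang="english" date="October/2011">',
  '<h1>Rolf Turner</h1>',
  '<p><i>R is wonderful, but it cannot work magic</i></p>',
  '</div>',
  '</body></html>'
)
doc <- htmlParse(html, asText = TRUE)

# Attribute-existence predicate: div nodes that carry an id attribute
with_id <- xpathSApply(doc, "//div[@id]", xmlGetAttr, "date")

# Positional predicate: the second div child of its parent
second <- xpathSApply(doc, "//div[2]/h1", xmlValue)

# Counting predicate: div nodes with exactly three attributes
three_attrs <- xpathSApply(doc, "//div[count(./@*) = 3]/h1", xmlValue)

print(with_id)      # "June/2003"
print(second)       # "Rolf Turner"
print(three_attrs)  # "Robert Gentleman"
```

Each call keeps the same node set (`//div`) and changes only the condition in brackets, which is the part that decides which nodes survive.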

Textual predicates

Textual properties of the document are useful predicates for node selection.

parsed_fortune

## <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">
## <html>
## <head><title>Collected R wisdoms</title></head>
## <body>
## <div id="R Inventor" lang="english" date="June/2003">
##   <h1>Robert Gentleman</h1>
##   <p><i>'What we have is nice, but we need something very different'</i></p>
##   <p><b>Source: </b>Statistical Computing 2003, Reisensburg</p>
## </div>
## 
## <div lang="english" date="October/2011">
##   <h1>Rolf Turner</h1>
##   <p><i>'R is wonderful, but it cannot work magic'</i> <br><emph>answering a request for automatic generation of 'data from a known mean and 95% CI'</emph></p>
##   <p><b>Source: </b><a href="https://stat.ethz.ch/mailman/listinfo/r-help">R-help</a></p>
## </div>
## 
## <address>
## <a href="http://www.r-datacollectionbook.com"><i>The book homepage</i></a><a></a>
## </address>
## 
## </body>
## </html>
##

xpathSApply(parsed_fortune, "//div[@date='October/2011']")

## [[1]]
## <div lang="english" date="October/2011">
##   <h1>Rolf Turner</h1>
##   <p><i>'R is wonderful, but it cannot work magic'</i> <br/><emph>answering a request for automatic generation of 'data from a known mean and 95% CI'</emph></p>
##   <p><b>Source: </b><a href="https://stat.ethz.ch/mailman/listinfo/r-help">R-help</a></p>
## </div>

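XPath string functions take the general form string_method(text1, 'text2'), where text1 refers to text in the document (for example text() or an attribute value) and 'text2' is a string we supply.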

xpathSApply(parsed_fortune, "//*[contains(text(), 'magic')]")

## [[1]]
## <i>'R is wonderful, but it cannot work magic'</i>

xpathSApply(parsed_fortune, "//div[starts-with(./@id, 'R')]")

## [[1]]
## <div id="R Inventor" lang="english" date="June/2003">
##   <h1>Robert Gentleman</h1>
##   <p><i>'What we have is nice, but we need something very different'</i></p>
##   <p><b>Source: </b>Statistical Computing 2003, Reisensburg</p>
## </div>
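One detail of the string functions is worth checking by hand. In XPath 1.0, a node set passed to a string function is converted using its first node only, so `contains(text(), ...)` looks at an element's first text child, whereas `contains(., ...)` looks at all text beneath the element. The sketch below, again on an abbreviated inline copy of the page (the `html` and `doc` names are assumptions), shows how much broader the second form is.

```r
library(XML)

# Abbreviated inline copy of the fortunes page, an assumption made so the
# sketch runs without network access.
html <- paste0(
  '<html><head><title>Collected R wisdoms</title></head><body>',
  '<div id="R Inventor" lang="english" date="June/2003">',
  '<h1>Robert Gentleman</h1>',
  '<p><i>What we have is nice, but we need something very different</i></p>',
  '</div>',
  '<div lang="english" date="October/2011">',
  '<h1>Rolf Turner</h1>',
  '<p><i>R is wonderful, but it cannot work magic</i></p>',
  '</div>',
  '</body></html>'
)
doc <- htmlParse(html, asText = TRUE)

# text(): only elements whose own first text child contains 'magic'
own_text <- xpathSApply(doc, "//*[contains(text(), 'magic')]", xmlName)

# .: every element whose full text content contains 'magic',
# which includes all ancestors of the <i> node
any_text <- xpathSApply(doc, "//*[contains(., 'magic')]", xmlName)

print(own_text)  # "i"
print(any_text)  # "html" "body" "div" "p" "i"
```

If the broader form is needed, naming the element (for example `//i[contains(., 'magic')]`) keeps the ancestors out of the result.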

Extracting node elements

xpathSApply(parsed_fortune, "//title")

## [[1]]
## <title>Collected R wisdoms</title>

xpathSApply(parsed_fortune, "//title", fun = xmlValue)

## [1] "Collected R wisdoms"

xpathSApply(parsed_fortune, "//h1", fun = xmlValue)

## [1] "Robert Gentleman" "Rolf Turner"

xpathSApply(parsed_fortune, "//div", xmlAttrs)

## [[1]]
##           id         lang         date 
## "R Inventor"    "english"  "June/2003" 
## 
## [[2]]
##           lang           date 
##      "english" "October/2011"

xpathSApply(parsed_fortune, "//div", xmlGetAttr, "lang")

## [1] "english" "english"

Extending the `fun` argument

xpathSApply(parsed_fortune, "//div//i", xmlValue)

## [1] "'What we have is nice, but we need something very different'"
## [2] "'R is wonderful, but it cannot work magic'"

lowerCaseFun <- function(x) {
  x <- tolower(xmlValue(x))
  return(x)
}

xpathSApply(parsed_fortune, "//div//i", fun = lowerCaseFun)

## [1] "'what we have is nice, but we need something very different'"
## [2] "'r is wonderful, but it cannot work magic'"

xpathSApply(parsed_fortune, "//div", xmlGetAttr, "date")

## [1] "June/2003"    "October/2011"

dateFun <- function(x) {
  require(stringr)
  date <- xmlGetAttr(node = x, name = "date")
  year <- str_extract(date, "[0-9]{4}")
  year
}

xpathSApply(parsed_fortune, "//div", dateFun)

## Loading required package: stringr

## [1] "2003" "2011"

xpathSApply(parsed_fortune, "//div", xmlGetAttr, "id")

## [[1]]
## [1] "R Inventor"
## 
## [[2]]
## NULL

idFun <- function(x) {
  id <- xmlGetAttr(x, "id")
  # xmlGetAttr() returns NULL when the node has no id attribute
  id <- ifelse(is.null(id), "not specified", id)
  return(id)
}

xpathSApply(parsed_fortune, "//div", idFun)

## [1] "R Inventor"    "not specified"
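The same fallback can also be expressed through the `default` argument of `xmlGetAttr()`, which is returned when the attribute is missing. The following is a minimal sketch of that alternative on an abbreviated inline copy of the page; the `html` and `doc` names and the anonymous wrapper function are illustrative assumptions, not part of the original material.

```r
library(XML)

# Abbreviated inline copy of the fortunes page, an assumption made so the
# sketch runs without network access.
html <- paste0(
  '<html><head><title>Collected R wisdoms</title></head><body>',
  '<div id="R Inventor" lang="english" date="June/2003">',
  '<h1>Robert Gentleman</h1>',
  '</div>',
  '<div lang="english" date="October/2011">',
  '<h1>Rolf Turner</h1>',
  '</div>',
  '</body></html>'
)
doc <- htmlParse(html, asText = TRUE)

# Let xmlGetAttr() supply the fallback value instead of testing for NULL
ids <- xpathSApply(
  doc, "//div",
  function(x) xmlGetAttr(x, "id", default = "not specified")
)

print(ids)  # "R Inventor" "not specified"
```

Because every node now yields a string, the result simplifies to a character vector rather than a list containing NULL.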

XPath helper tool

Chrome's developer tools help inspect the elements of a web page and create valid XPath statements that can be passed to the node retrieval functions of the XML package, such as xpathSApply().

http://www.r-datacollection.com/materials/ch-4-xpath/fortunes/fortunes.html

'/html/body/div[2]/p[1]/i'

## [1] "/html/body/div[2]/p[1]/i"

xpathSApply(parsed_fortune, '/html/body/div[2]/p[1]/i', fun=xmlValue)

## [1] "'R is wonderful, but it cannot work magic'"

'/html/body/div[2]/p[2]/a'

## [1] "/html/body/div[2]/p[2]/a"

xpathSApply(parsed_fortune, '/html/body/div[2]/p[2]/a', fun=xmlGetAttr, "href")

## [1] "https://stat.ethz.ch/mailman/listinfo/r-help"
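An absolute path like the ones above depends on the exact position of each element, so it may stop matching if the page layout changes. As a hedged comparison, the sketch below checks that an attribute-based expression reaches the same node; it uses an abbreviated inline copy of the page, and the names `html`, `doc`, `abs_path`, and `rel_path` are assumptions introduced for illustration.

```r
library(XML)

# Abbreviated inline copy of the fortunes page, an assumption made so the
# sketch runs without network access.
html <- paste0(
  '<html><head><title>Collected R wisdoms</title></head><body>',
  '<div id="R Inventor" lang="english" date="June/2003">',
  '<h1>Robert Gentleman</h1>',
  '<p><i>What we have is nice, but we need something very different</i></p>',
  '</div>',
  '<div lang="english" date="October/2011">',
  '<h1>Rolf Turner</h1>',
  '<p><i>R is wonderful, but it cannot work magic</i></p>',
  '</div>',
  '</body></html>'
)
doc <- htmlParse(html, asText = TRUE)

# Absolute path, in the style produced by the browser's developer tools
abs_path <- "/html/body/div[2]/p[1]/i"

# Relative path that identifies the div by its date attribute instead
rel_path <- "//div[@date='October/2011']//i"

from_abs <- xpathSApply(doc, abs_path, xmlValue)
from_rel <- xpathSApply(doc, rel_path, xmlValue)

print(from_abs)  # "R is wonderful, but it cannot work magic"
print(from_rel)  # "R is wonderful, but it cannot work magic"
```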

Summary

This section introduced XPath as a language for querying HTML documents and showed how to build XPath expressions from scratch.
Helper tools can generate an expression, but knowing how to build one from scratch remains a necessary skill.
Constructing an applicable XPath statement is an iterative learning process:
1. Construction stage: We assemble an XPath statement that is believed to return the correct information.
2. Testing stage: We apply the XPath, observe the returned node set or error message, and find that perhaps the returned node set is too broad or too narrow.
3. Learning stage: When the XPath query has failed, we infer a more suitable XPath expression by making it more strict or more lax to obtain only the desired information.
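
The three stages above can be sketched on the fortunes page. The example is illustrative only: the goal (extracting just the two quotes), the inline `html` copy of the page, and the variable names are assumptions chosen to show one expression that is too broad and one that is too narrow before settling on a suitable one.

```r
library(XML)

# Abbreviated inline copy of the fortunes page, an assumption made so the
# sketch runs without network access.
html <- paste0(
  '<html><head><title>Collected R wisdoms</title></head><body>',
  '<div id="R Inventor" lang="english" date="June/2003">',
  '<h1>Robert Gentleman</h1>',
  '<p><i>What we have is nice, but we need something very different</i></p>',
  '</div>',
  '<div lang="english" date="October/2011">',
  '<h1>Rolf Turner</h1>',
  '<p><i>R is wonderful, but it cannot work magic</i></p>',
  '</div>',
  '<address><a href="http://www.r-datacollectionbook.com">',
  '<i>The book homepage</i></a></address>',
  '</body></html>'
)
doc <- htmlParse(html, asText = TRUE)

# 1. Construction: a first guess that the quotes are all <i> nodes
too_broad <- xpathSApply(doc, "//i", xmlValue)

# 2. Testing: three values come back, including "The book homepage"
print(length(too_broad))   # 3

# 3. Learning: a stricter attempt overshoots and drops one quote ...
too_narrow <- xpathSApply(doc, "//div[@id]//i", xmlValue)
print(length(too_narrow))  # 1

# ... so the condition is relaxed to any <i> inside any div
quotes <- xpathSApply(doc, "//div//i", xmlValue)
print(length(quotes))      # 2
```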

Shin Lee

2021/10/13
